CN111770388B - Content processing method, device, equipment and storage medium - Google Patents

Content processing method, device, equipment and storage medium

Info

Publication number
CN111770388B
CN111770388B CN202010612062.3A
Authority
CN
China
Prior art keywords
data
video
target object
audio data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010612062.3A
Other languages
Chinese (zh)
Other versions
CN111770388A (en)
Inventor
张倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Original Assignee
Baidu Online Network Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu Online Network Technology Beijing Co Ltd
Priority to CN202010612062.3A
Publication of CN111770388A
Application granted
Publication of CN111770388B
Legal status: Active (current)
Anticipated expiration

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 — Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 — Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 — End-user applications
    • H04N 21/478 — Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N 21/4788 — Supplemental services communicating with other users, e.g. chatting
    • H04N 21/43 — Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations; Client middleware
    • H04N 21/439 — Processing of audio elementary streams
    • H04N 21/44 — Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/4402 — Reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N 21/440236 — Reformatting by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
    • H04N 21/488 — Data services, e.g. news ticker
    • H04N 21/4884 — Data services for displaying subtitles

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a content processing method, a content processing device, content processing equipment and a storage medium, and relates to the fields of artificial intelligence, multimedia technology and speech processing. The specific implementation scheme is as follows: determining a target object, wherein the target object is an object displayed in a video frame; receiving comment data for the target object; and converting the comment data into target audio data, and outputting the target audio data commenting on the target object in the video data corresponding to the video frame. In this way, the ways of interacting with the user during video playback are expanded, and the presentation forms of the video are enriched.

Description

Content processing method, device, equipment and storage medium
Technical Field
The application relates to the field of data processing, in particular to the technical field of artificial intelligence, multimedia technology and voice processing.
Background
In recent years, with the development of video, interactive communication has received increasing attention during viewing: various comment functions and bullet-screen (danmaku) comments enrich interaction between people, and interaction forms that make videos more engaging have increasingly become a core demand of users. However, the common existing video interaction methods, such as text comments or emoticon comments, remain limited to the text-and-image form.
Disclosure of Invention
The application provides a method, a device, equipment and a storage medium for content processing.
According to an aspect of the present application, there is provided a content processing method including:
determining a target object, wherein the target object is an object displayed in a video frame;
receiving comment data for the target object;
and converting the comment data into target audio data, and outputting the target audio data for commenting the target object in the video data corresponding to the video frame.
According to another aspect of the present application, there is provided a content processing apparatus including:
a determining unit, configured to determine a target object, wherein the target object is an object displayed in a video frame;
a comment data receiving unit configured to receive comment data for the target object;
and the content processing unit is used for converting the comment data into target audio data and outputting the target audio data for commenting the target object in the video data corresponding to the video frame.
According to still another aspect of the present application, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to yet another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method described above.
According to yet another aspect of the application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as described above.
In this way, the solution of the present application addresses the problem that existing video interaction is limited to a single mode, such as the text-and-image form, and it enriches both the interaction modes and the presentation forms of the video, improving the user experience while enriching it.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present application;
FIG. 2 is a schematic diagram according to a second embodiment of the present application;
FIG. 3 is a schematic illustration according to a third embodiment of the present application;
FIG. 4 is a first schematic diagram of a content processing apparatus according to an embodiment of the present application;
FIG. 5 is a second schematic structural diagram of a content processing apparatus according to an embodiment of the present application;
FIG. 6 is a third schematic structural diagram of a content processing apparatus according to an embodiment of the present application;
FIG. 7 is a fourth schematic structural diagram of a content processing apparatus according to an embodiment of the present application;
FIG. 8 is a block diagram of an electronic device for implementing a content processing method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
According to an embodiment of the present application, there is also provided a content processing method, as shown in fig. 1, including:
step S101: determining a target object, wherein the target object is an object shown in a video frame. For example, the target object is selected from objects displayed in a video frame.
Step S102: comment data for the target object is received.
Step S103: and converting the comment data into target audio data.
Step S104: and outputting the target audio data commenting the target object in the video data corresponding to the video frame.
In this way, the solution of the present application addresses the problem that existing video interaction is limited to a single mode, such as the text-and-image form, and it enriches both the interaction modes and the presentation forms of the video, improving the user experience while enriching it.
Here, in one example, the objects presented in the video frame may be film or television characters or the like; correspondingly, the target object is the character that the user selects from the objects displayed in the video frame, so that the user can comment specifically on the selected character.
In a specific example, step S103 includes:
step S103-1: and acquiring voiceprint characteristic data of the target object in the video data corresponding to the video frame.
Step S103-2: and converting the comment data into target audio data matched with the voiceprint characteristic data.
In this way, the comment data can be converted into audio data matched with the voiceprint feature data of the target object in the video data, namely the target audio data, so that a new and interesting interaction mode is added to the viewing process. This solves the problem that existing video interaction, limited as it is to the text-and-image form, offers only a single mode, and it enriches both the interaction modes and the video presentation forms.
Moreover, the user can freely select a target object of interest, and the way of commenting on the target object is unchanged, that is, the user can directly input comment data in any form; this enriches and improves the user experience, and it can also stimulate the user's creative thinking.
In a specific example of the present application, the voiceprint feature data of the target object can be obtained as follows, that is, step S103-1 includes:
acquiring audio data corresponding to the target object in the video data;
and inputting the audio data corresponding to the target object in the video data into a preset audio model to obtain the voiceprint feature data of the target object in the video data.
In this way, the voiceprint feature data are obtained using a preset audio model, which improves the precision of the obtained voiceprint feature data and the degree of subsequent matching, laying a foundation for highly imitative target audio data and, at the same time, for an improved user experience.
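For illustration only, step S103-1 can be sketched as follows in Python; the open-source Resemblyzer speaker encoder is used here merely as a stand-in for the "preset audio model" (the patent does not name a specific model), and the function name extract_voiceprint is a hypothetical helper.

```python
# A minimal sketch of step S103-1 under stated assumptions: the Resemblyzer
# speaker encoder stands in for the "preset audio model"; the patent does not
# specify which model is actually used.
from resemblyzer import VoiceEncoder, preprocess_wav

def extract_voiceprint(target_audio_path: str):
    """Return voiceprint feature data (a fixed-length speaker embedding)
    for the target object, computed from its audio in the video data."""
    wav = preprocess_wav(target_audio_path)   # load and resample the audio
    encoder = VoiceEncoder()                  # pretrained speaker encoder
    return encoder.embed_utterance(wav)       # e.g. a 256-dim voiceprint vector
```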
Of course, in practical applications, the comment data may include not only text information but also information such as pictures and emoticons. In a specific example of the present application, a speech conversion process may be performed on the text information input by the user in the comment data. When the comment data contains non-text information such as pictures or emoticons, the non-text information may first be converted into text information, and then the speech conversion process is performed. Specifically, step S103-2 includes:
converting text information in the comment data into target audio data matched with the voiceprint feature data; or, converting information other than text information in the comment data (that is, the non-text information) into text information, and converting at least the converted text information into target audio data matched with the voiceprint feature data.
This avoids restricting the data form of the comment data input by the user and, while remaining compatible with existing comment data formats, further improves the user experience.
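The branching just described can be sketched as follows; image_to_text and voice_clone_tts are hypothetical placeholders standing in for an image-captioning/OCR step and a voiceprint-conditioned text-to-speech engine, neither of which is specified by the patent.

```python
# A sketch of step S103-2 under stated assumptions: image_to_text and
# voice_clone_tts are placeholder helpers, not components named by the patent.
def image_to_text(item) -> str:
    # Placeholder: a real system would caption the picture or map the
    # emoticon to a textual description here.
    return "[non-text comment content]"

def voice_clone_tts(text: str, voiceprint) -> str:
    # Placeholder: a real system would synthesize speech conditioned on the
    # voiceprint and return the path of the generated audio file.
    return "comment_tts.wav"

def comment_to_target_audio(comment: dict, voiceprint) -> str:
    """comment, e.g. {'text': 'Well done!', 'images': [...], 'emojis': [...]}"""
    parts = [comment["text"]] if comment.get("text") else []
    # Non-text information (pictures, expressions) is first converted to text.
    for item in comment.get("images", []) + comment.get("emojis", []):
        parts.append(image_to_text(item))
    # At least the converted text is then synthesized with the target voiceprint.
    return voice_clone_tts(" ".join(parts), voiceprint)
```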
In a specific example of the present application, in order to effectively identify the target object selected by the user, before step S101, as shown in fig. 2, the following steps are further performed:
step 001: during the playing of the video data, a second user operation, such as a click operation, is detected.
Step 002: and in response to the second user operation, pausing the playing of the video data. Here, after the playback is paused, the video frame corresponding to the second user operation is presented.
Step 003: and identifying objects displayed in the video frame corresponding to the second user operation, and visually highlighting the identified objects, wherein the target object is selected from the visually highlighted objects.
In this way, the objects in the current video frame can be effectively identified and visually highlighted, making it easy for the user to pick out the target object of interest from the current frame, which simplifies user operation and further improves the user experience.
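For illustration only, the sketch below detects candidate objects in the paused frame and draws highlight boxes around them; it assumes the objects of interest are faces and uses OpenCV's bundled Haar cascade purely as a stand-in for whatever recognition model is actually deployed.

```python
# A sketch of steps 001-003, assuming (for illustration only) that the objects
# shown in the paused frame are faces and that OpenCV's bundled Haar cascade
# stands in for the recognition model actually used.
import cv2

def highlight_objects(frame):
    """Identify objects in the paused video frame and visually highlight them
    so the user can select the target object from the highlighted candidates."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in boxes:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)  # highlight
    return frame, boxes  # boxes are the selectable candidate objects
```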
In a specific example of the solution of the present application, to further enhance the user experience, a visually highlighted area may additionally be provided. Specifically, a visually highlighted area is added to at least some of the frames of the video data from the selected video frame as the start time onward, where the visually highlighted area is displayed in the video data following the target object, that is, it moves as the target object moves in the video data, so as to present the visual effect of the target audio data being output from the visually highlighted area. This adds interest to the video display effect, enriching and improving the user experience.
In a specific example of the present application, after the visually highlighted area is provided, a new interactive function may be assigned to it, namely that the visually highlighted area can respond to a first user operation and present the visual and auditory effects of the target audio data being output from that area. That is to say, in practical applications, the highlighted area can also respond to a user operation and then play the target audio data, presenting the visual and auditory effects of the target audio data being output from the highlighted area; this improves user interactivity and user control, enriching and improving the user experience.
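One way to make a highlighted bubble follow the target object is to run a visual tracker from the selected frame onward. The sketch below is a minimal illustration assuming an opencv-contrib build that exposes cv2.TrackerCSRT_create; the bubble is reduced to a filled rectangle containing the comment text.

```python
# A sketch of a visually highlighted area that follows the target object,
# assuming an opencv-contrib build providing cv2.TrackerCSRT_create.
import cv2

def follow_with_bubble(capture, target_box, comment_text, duration_frames):
    """Track target_box = (x, y, w, h) from the current frame and draw a
    bubble caption above it for roughly the duration of the comment audio."""
    tracker = cv2.TrackerCSRT_create()
    ok, frame = capture.read()
    tracker.init(frame, target_box)
    frames = [frame]
    for _ in range(duration_frames):
        ok, frame = capture.read()
        if not ok:
            break
        tracked, box = tracker.update(frame)
        if tracked:
            x, y, w, h = (int(v) for v in box)
            # Draw the bubble just above the tracked object and write the comment.
            cv2.rectangle(frame, (x, y - 40), (x + w, y - 10), (255, 255, 255), -1)
            cv2.putText(frame, comment_text, (x + 5, y - 18),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1)
        frames.append(frame)
    return frames
```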
In a specific example of the present application, after the target audio data is determined, data synthesis may be performed in one manner, that is, step S104 includes:
synthesizing the target audio data with at least part of the video of the video data from the selected video frame as the start time onward, and outputting the synthesized video data containing the target audio data.
Naturally, to further enhance the display effect, the video may be synthesized at the same time as the audio, that is, the target audio data and the visually highlighted area are synthesized with at least part of the video data after the start time, and the synthesized video data including the visually highlighted area and the target audio data is output.
In this way, the target audio data can be played within the video data. For timeliness, the target audio data is synthesized into the portion of the video following the current video frame as the start time, for example a number of subsequent consecutive frames or all of the subsequent video, which improves the viewing experience and maximizes the match between the target audio data and the content currently being viewed.
In a specific example of the present application, after the target audio data is determined, data synthesis may be performed in another manner, that is, step S104 includes:
superimposing the target audio data onto at least part of the audio of the video data after the start time, synthesizing the target audio data with at least part of the video of the video data after the start time, and outputting video data that contains the target audio data with an audio superimposition effect.
Of course, to further enhance the display effect, the video may be synthesized at the same time as the audio, that is, the target audio data is superimposed onto at least part of the audio of the video data after the start time, the target audio data is synthesized with at least part of the video data after the start time, the visually highlighted area is synthesized with at least part of the video data after the start time, and video data that contains the visually highlighted area and the target audio data with an audio superimposition effect is output.
In this way, the target audio data can be played overlaid on the original audio in the video data, which enriches the forms of user experience. In addition, for timeliness, the target audio data is synthesized into the portion of the video following the current video frame as the start time, for example a number of subsequent consecutive frames or all of the subsequent video, which improves the viewing experience and maximizes the match between the target audio data and the content currently being viewed.
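A minimal sketch of the audio-superimposition variant of step S104 is given below, assuming the MoviePy 1.x API is acceptable for muxing; start_time marks the selected video frame (in seconds), and the file paths are illustrative only.

```python
# A sketch of the audio-superimposition variant of step S104, assuming the
# MoviePy 1.x API; the target audio is mixed over the original soundtrack
# starting at the selected video frame (start_time, in seconds).
from moviepy.editor import VideoFileClip, AudioFileClip, CompositeAudioClip

def superimpose_comment_audio(video_path, comment_audio_path, start_time, out_path):
    video = VideoFileClip(video_path)
    comment = AudioFileClip(comment_audio_path).set_start(start_time)
    # Overlay the target audio data onto the original audio from start_time on.
    mixed = CompositeAudioClip([video.audio, comment])
    video.set_audio(mixed).write_videofile(out_path, audio_codec="aac")
    return out_path
```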
In this way, the comment data can be converted into audio data matched with the voiceprint feature data of the target object in the video data, namely the target audio data, so that a new and interesting interaction mode is added to the viewing process. This solves the problem that existing video interaction, limited as it is to the text-and-image form, offers only a single mode, and it enriches both the interaction modes and the video presentation forms.
Moreover, the user can freely select a target object of interest, and the way of commenting on the target object is unchanged, that is, the user can directly input comment data in any form; this enriches and improves the user experience, and it can also stimulate the user's creative thinking.
The present application is described in detail below with reference to a specific example. Specifically, in order to embed a bullet-screen comment voicing a particular character's thoughts, such as "xx: xxx", into the original video, and to make the embedded voice-over consistent with the tone and mood of that character in the original video, the scheme of the present application provides a new content processing method. As shown in fig. 3, the specific flow includes:
step 1: the method comprises the steps that a user selects picture characters in a current video frame and adds corresponding comment contents at a client, wherein the picture characters are selected mainly aiming at the characters identified in the picture of a certain specific scene, and after the characters are selected, text comment contents, namely voice-over contents, are added by taking the designated characters as first people.
Step 2: the comment is converted into corresponding audio, mainly by converting the text comment content that the user added for the designated character into an audio resource in that character's original dubbed voice, i.e. converting the text comment content into audio matched with the voiceprint characteristics of the designated character, such as tone of voice and timbre. Here, the audio corresponding to the text comment content may be generated by intelligently learning from the designated character's original voice audio, so that the generated audio matches that character's voiceprint features.
Step 3: audio superimposition, i.e. taking the selected frame (the current video frame) as the start time and merging the audio corresponding to the text comment content into the audio of the original video, for example in an audio-superimposed manner, to obtain the video audio with the comment audio resource synthesized in.
Step 4: the caption follows the picture, i.e. the text comment content is converted into a bubble-caption form, the bubble caption is placed next to the designated character and displayed following that character, and its display time is determined by the duration of the audio corresponding to the text comment content.
Step 5: video synthesis, i.e. the previously superimposed audio and the bubble-caption picture are combined and finally synthesized and blended into the original video.
In this way, the interaction forms are enriched, the interest and interactivity of the video are improved, the user's creative thinking is stimulated, and the content forms of the video are enriched. An end-to-end sketch of this flow is given below.
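Tying the example's five steps together, the sketch below assumes the helper functions from the earlier sketches (extract_voiceprint, comment_to_target_audio, superimpose_comment_audio) are in scope; the bubble-caption step would reuse the tracking sketch above and is omitted here for brevity.

```python
# An end-to-end sketch of steps 1-5 above, under the assumption that the
# helper functions sketched earlier in this description are available; paths
# and timings are illustrative only.
def embed_voice_over(video_path, frame_time, character_audio_path, comment, out_path):
    # Step 1: the user has paused at frame_time and chosen a character; the
    # character's own audio (character_audio_path) comes from the original video.
    voiceprint = extract_voiceprint(character_audio_path)           # Step 2: voiceprint
    comment_audio = comment_to_target_audio(comment, voiceprint)    # Step 2: comment -> audio
    # Steps 3 and 5: superimpose the comment audio onto the original soundtrack
    # from the selected frame onward and write out the synthesized video.
    return superimpose_comment_audio(video_path, comment_audio, frame_time, out_path)
```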
According to an embodiment of the present application, there is also provided a content processing apparatus, as shown in fig. 4, including:
a determining unit 401, configured to determine a target object, where the target object is an object shown in a video frame;
a comment data receiving unit 402 configured to receive comment data for the target object;
a content processing unit 403, configured to convert the comment data into target audio data, and output the target audio data commenting on the target object in video data corresponding to the video frame.
In a specific example of the scheme of the present application, as shown in fig. 5, the apparatus further includes:
a display area adding unit 404, configured to add a visual highlight area in at least part of the frames of the video data after the video frame is taken as a start time, wherein the visual highlight area is displayed following the target object in the video data so as to present a visual effect of the target audio data output from the visual highlight area.
In a specific example of the present solution, the visually highlighted area is capable of responding to a first user action and presenting a visual effect that the target audio data is output from the visually highlighted area.
In a specific example of the present application, the content processing unit 403 is further configured to combine the target audio data with at least a portion of video of the video data after the video frame is taken as a start time, and output the combined video data including the target audio data.
In a specific example of the present application, the content processing unit 403 is further configured to superimpose the target audio data onto at least part of the audio of the video data after the start time, synthesize the target audio data with at least part of the video data after the start time, and output the video data containing the target audio data with an audio superimposing effect.
In a specific example of the scheme of the present application, as shown in fig. 6, the apparatus further includes a feature data acquisition unit 405; wherein,
the feature data acquiring unit 405 is configured to acquire voiceprint feature data of the target object in video data corresponding to the video frame;
the content processing unit 403 is configured to convert the comment data into target audio data matched with the voiceprint feature data.
In a specific example of the present application, the feature data obtaining unit 405 includes:
the audio data acquisition subunit is used for acquiring audio data corresponding to the target object in the video data;
and the characteristic data extraction subunit is used for inputting the audio data corresponding to the target object in the video data into a preset audio model to obtain the voiceprint characteristic data of the target object in the video data.
In a specific example of the present application, the content processing unit 403 is further configured to:
converting text information in the comment data into target audio data matched with the voiceprint feature data; or,
and converting other information except the text information in the comment data into text information, and converting at least the text information obtained after conversion into target audio data matched with the voiceprint feature data.
In a specific example of the scheme of the present application, as shown in fig. 7, the apparatus further includes:
a detecting unit 406, configured to detect a second user operation during the playing process of the video data;
a response unit 407, configured to pause playing of the video data in response to the second user operation;
and an identification processing unit 408, configured to identify an object shown in the video frame corresponding to the second user operation, and visually highlight the identified object, where the target object is selected from the visually highlighted objects.
In this way, the comment data can be converted into audio data matched with the voiceprint feature data of the target object in the video data, namely the target audio data, so that a new and interesting interaction mode is added to the viewing process. This solves the problem that existing video interaction, limited as it is to the text-and-image form, offers only a single mode, and it enriches both the interaction modes and the video presentation forms.
Moreover, the user can freely select a target object of interest, and the way of commenting on the target object is unchanged, that is, the user can directly input comment data in any form; this enriches and improves the user experience, and it can also stimulate the user's creative thinking.
There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.
Fig. 8 is a block diagram of an electronic device for the content processing method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 8, the electronic apparatus includes: one or more processors 801, memory 802, and interfaces for connecting the various components, including a high speed interface and a low speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 8 illustrates an example of a processor 801.
The memory 802 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the content processing method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the content processing method provided by the present application.
The memory 802 is a non-transitory computer-readable storage medium that can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the content processing method in the embodiment of the present application (for example, the determination unit 401, the comment data receiving unit 402, the feature data acquisition unit 405, the content processing unit 403, the detection unit 406, the response unit 407, the recognition processing unit 408, and the display area addition unit 404 shown in fig. 6). The processor 801 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 802, that is, implements the content processing method in the above-described method embodiment.
The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device of the content processing method, and the like. Further, the memory 802 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, which may be connected to the electronics of the content processing method via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the content processing method may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, and are exemplified by a bus in fig. 8.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the content processing method, such as an input device of a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 804 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in the traditional physical host and Virtual Private Server (VPS) service.
In this way, the comment data can be converted into audio data matched with the voiceprint feature data of the target object in the video data, namely the target audio data, so that a new and interesting interaction mode is added to the viewing process. This solves the problem that existing video interaction, limited as it is to the text-and-image form, offers only a single mode, and it enriches both the interaction modes and the video presentation forms.
Moreover, the user can freely select a target object of interest, and the way of commenting on the target object is unchanged, that is, the user can directly input comment data in any form; this enriches and improves the user experience, and it can also stimulate the user's creative thinking.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present application is not limited in this regard, as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (18)

1. A content processing method, comprising:
determining a target object, wherein the target object is an object displayed in a video frame;
receiving comment data for the target object;
acquiring voiceprint characteristic data of the target object in video data corresponding to the video frame;
and converting the comment data into target audio data matched with the voiceprint feature data, and outputting the target audio data for commenting the target object in the video data corresponding to the video frame.
2. The method of claim 1, further comprising:
adding a visual highlight region in at least a portion of the video data after the video frame is taken as a start time, wherein the visual highlight region is displayed in the video data following the target object so as to present a visual effect of the target audio data output from the visual highlight region.
3. The method of claim 2, wherein the visual highlight region is responsive to a first user operation and presents a visual effect of the target audio data being output from the visual highlight region.
4. The method of claim 1, 2 or 3, wherein said outputting the target audio data commenting on the target object in video data corresponding to a video frame comprises:
and synthesizing the target audio data with at least part of video of the video data after the video frame is taken as the starting time, and outputting the synthesized video data containing the target audio data.
5. The method of claim 1, 2 or 3, wherein said outputting the target audio data commenting on the target object in video data corresponding to a video frame comprises:
superimposing the target audio data onto at least part of audio of the video data after a start time, synthesizing the target audio data with at least part of video of the video data after the start time, and outputting video data with an audio superposition effect containing the target audio data.
6. The method according to claim 1, wherein the obtaining of the voiceprint feature data of the target object in the video data corresponding to the video frame comprises:
acquiring audio data corresponding to the target object in the video data;
and inputting the audio data corresponding to the target object in the video data into a preset audio model to obtain the voiceprint feature data of the target object in the video data.
7. The method of claim 1 or 6, wherein said converting the comment data into target audio data that matches the voiceprint feature data comprises:
converting text information in the comment data into target audio data matched with the voiceprint feature data; or,
and converting other information except the text information in the comment data into text information, and converting at least the text information obtained after conversion into target audio data matched with the voiceprint feature data.
8. The method of claim 1, further comprising:
detecting a second user operation in the playing process of the video data;
pausing the playing of the video data in response to the second user operation;
and identifying objects displayed in the video frame corresponding to the second user operation, and visually highlighting the identified objects, wherein the target object is selected from the visually highlighted objects.
9. A content processing apparatus comprising:
a determining unit, configured to determine a target object, wherein the target object is an object displayed in a video frame;
a comment data receiving unit configured to receive comment data for the target object;
the characteristic data acquisition unit is used for acquiring voiceprint characteristic data of the target object in the video data corresponding to the video frame;
and the content processing unit is used for converting the comment data into target audio data matched with the voiceprint feature data and outputting the target audio data for commenting the target object in the video data corresponding to the video frame.
10. The apparatus of claim 9, further comprising:
a display area adding unit configured to add a visual highlight area in at least a part of frames of the video data after the video frame is taken as a start time, wherein the visual highlight area is displayed following the target object in the video data so as to present a visual effect of the target audio data output from the visual highlight area.
11. The apparatus of claim 10, wherein the visual highlight area is responsive to a first user operation and presents a visual effect of the target audio data being output from the visual highlight area.
12. The apparatus according to claim 9, 10 or 11, wherein the content processing unit is further configured to synthesize the target audio data with at least a portion of video of the video data after the video frame is a start time, and output the synthesized video data including the target audio data.
13. The apparatus according to claim 9, 10 or 11, wherein the content processing unit is further configured to superimpose the target audio data onto at least part of audio of the video data after a start time, synthesize the target audio data with at least part of video of the video data after the start time, and output video data containing the target audio data with an audio superimposing effect.
14. The apparatus of claim 9, wherein the feature data acquisition unit comprises:
the audio data acquisition subunit is used for acquiring audio data corresponding to the target object in the video data;
and the characteristic data extraction subunit is used for inputting the audio data corresponding to the target object in the video data into a preset audio model to obtain the voiceprint characteristic data of the target object in the video data.
15. The apparatus of claim 9 or 14, wherein the content processing unit is further configured to:
converting text information in the comment data into target audio data matched with the voiceprint feature data; or,
and converting other information except the text information in the comment data into text information, and converting at least the text information obtained after conversion into target audio data matched with the voiceprint feature data.
16. The apparatus of claim 9, further comprising:
the detection unit is used for detecting a second user operation in the playing process of the video data;
a response unit for pausing the playing of the video data in response to the second user operation;
and the identification processing unit is used for identifying objects displayed in the video frame corresponding to the second user operation and visually highlighting the identified objects, wherein the target objects are selected from the visually highlighted objects.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
CN202010612062.3A 2020-06-30 2020-06-30 Content processing method, device, equipment and storage medium Active CN111770388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010612062.3A CN111770388B (en) 2020-06-30 2020-06-30 Content processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010612062.3A CN111770388B (en) 2020-06-30 2020-06-30 Content processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111770388A CN111770388A (en) 2020-10-13
CN111770388B (en) 2022-04-19

Family

ID=72724184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010612062.3A Active CN111770388B (en) 2020-06-30 2020-06-30 Content processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111770388B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114584824A (en) * 2020-12-01 2022-06-03 阿里巴巴集团控股有限公司 Data processing method and system, electronic equipment, server and client equipment
CN112637409B (en) * 2020-12-21 2022-05-06 维沃移动通信有限公司 Content output method and device and electronic equipment


Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101359473A (en) * 2007-07-30 2009-02-04 国际商业机器公司 Auto speech conversion method and apparatus
US9495713B2 (en) * 2010-12-10 2016-11-15 Quib, Inc. Comment delivery and filtering architecture
CN104811816B (en) * 2015-04-29 2018-04-13 北京奇艺世纪科技有限公司 A kind of is the method, apparatus and system that the object in video pictures plays barrage label
CN104980790B (en) * 2015-06-30 2018-10-09 北京奇艺世纪科技有限公司 The generation method and device of voice subtitle, playing method and device
CN105847939A (en) * 2016-05-12 2016-08-10 乐视控股(北京)有限公司 Bullet screen play method, bullet screen play device and bullet screen play system
US10394831B2 (en) * 2016-06-03 2019-08-27 Facebook, Inc. Profile with third-party content
EP3542360A4 (en) * 2016-11-21 2020-04-29 Microsoft Technology Licensing, LLC Automatic dubbing method and apparatus
CN107493442A (en) * 2017-07-21 2017-12-19 北京奇虎科技有限公司 A kind of method and apparatus for editing video
CN110600000B (en) * 2019-09-29 2022-04-15 阿波罗智联(北京)科技有限公司 Voice broadcasting method and device, electronic equipment and storage medium
CN110891198B (en) * 2019-11-29 2021-06-15 腾讯科技(深圳)有限公司 Video playing prompt method, multimedia playing prompt method, bullet screen processing method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1845126A (en) * 2005-04-08 2006-10-11 佳能株式会社 Information processing apparatus and information processing method
CN109246473A (en) * 2018-09-13 2019-01-18 苏州思必驰信息科技有限公司 The voice interactive method and terminal system of individualized video barrage based on Application on Voiceprint Recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Voiceprint Identification and Its Application; Hou Zunze (侯遵泽); Journal of the Armed Police Academy (武警学院学报); 2002-12-25 (Issue 06); full text *

Also Published As

Publication number Publication date
CN111770388A (en) 2020-10-13

Similar Documents

Publication Publication Date Title
RU2698158C1 (en) Digital multimedia platform for converting video objects into multimedia objects presented in a game form
US20180288450A1 (en) Method for inserting information push into live video streaming, server, and terminal
CN104881237A (en) Internet interaction method and client
CN112068750A (en) House resource processing method and device
CN111770388B (en) Content processing method, device, equipment and storage medium
CN109495427B (en) Multimedia data display method and device, storage medium and computer equipment
CN111935551A (en) Video processing method and device, electronic equipment and storage medium
CN111866550A (en) Method and device for shielding video clip
CN112182297A (en) Training information fusion model, and method and device for generating collection video
KR20210152396A (en) Video processing method, device, electronic equipment and storage medium
CN113596553A (en) Video playing method and device, computer equipment and storage medium
CN110913259A (en) Video playing method and device, electronic equipment and medium
CN111770376A (en) Information display method, device, system, electronic equipment and storage medium
CN111770384A (en) Video switching method and device, electronic equipment and storage medium
CN110798736B (en) Video playing method, device, equipment and medium
US20230336818A1 (en) Method and apparatus for shared viewing of media content
CN112528052A (en) Multimedia content output method, device, electronic equipment and storage medium
US20170139933A1 (en) Electronic Device, And Computer-Readable Storage Medium For Quickly Searching Video Segments
US20210392394A1 (en) Method and apparatus for processing video, electronic device and storage medium
CN116095388A (en) Video generation method, video playing method and related equipment
WO2018149170A1 (en) Cross-application control method and device
CN111723343B (en) Interactive control method and device of electronic equipment and electronic equipment
US11249823B2 (en) Methods and systems for facilitating application programming interface communications
CN113840177B (en) Live interaction method and device, storage medium and electronic equipment
CN114143572B (en) Live interaction method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant