CN111770388B - Content processing method, device, equipment and storage medium - Google Patents

Content processing method, device, equipment and storage medium

Info

Publication number
CN111770388B
CN111770388B CN202010612062.3A
Authority
CN
China
Prior art keywords
data
video
target object
audio data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010612062.3A
Other languages
Chinese (zh)
Other versions
CN111770388A (en)
Inventor
张倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Original Assignee
Baidu Online Network Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu Online Network Technology Beijing Co Ltd
Priority to CN202010612062.3A
Publication of CN111770388A
Application granted
Publication of CN111770388B
Legal status: Active (current)
Anticipated expiration

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 — Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 — Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 — End-user applications
    • H04N 21/478 — Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N 21/4788 — Supplemental services communicating with other users, e.g. chatting
    • H04N 21/43 — Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations; Client middleware
    • H04N 21/439 — Processing of audio elementary streams
    • H04N 21/44 — Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/4402 — Reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N 21/440236 — Reformatting by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
    • H04N 21/488 — Data services, e.g. news ticker
    • H04N 21/4884 — Data services for displaying subtitles

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a content processing method, a content processing device, content processing equipment and a storage medium, and relates to the fields of artificial intelligence, multimedia technology and speech processing. The specific implementation scheme is as follows: determining a target object, wherein the target object is an object displayed in a video frame; receiving comment data for the target object; and converting the comment data into target audio data, and outputting the target audio data commenting on the target object in the video data corresponding to the video frame. In this way, the ways of interacting with the user during video playback are expanded, and the presentation forms of the video are enriched.

Description

Content processing method, device, equipment and storage medium
Technical Field
The application relates to the field of data processing, in particular to the technical field of artificial intelligence, multimedia technology and voice processing.
Background
In recent years, with the development of video, interactive communication has received increasing attention during viewing: various comment functions and bullet-screen (danmaku) comments enrich interaction between people, and interaction forms that make videos more engaging have increasingly become a core demand of users. However, the common existing video interaction methods, such as text comments or emoticon comments, remain limited to the text-and-image form.
Disclosure of Invention
The application provides a method, a device, equipment and a storage medium for content processing.
According to an aspect of the present application, there is provided a content processing method including:
determining a target object, wherein the target object is an object displayed in a video frame;
receiving comment data for the target object;
and converting the comment data into target audio data, and outputting the target audio data for commenting the target object in the video data corresponding to the video frame.
According to another aspect of the present application, there is provided a content processing apparatus including:
a determining unit, configured to determine a target object, wherein the target object is an object displayed in a video frame;
a comment data receiving unit configured to receive comment data for the target object;
and the content processing unit is used for converting the comment data into target audio data and outputting the target audio data for commenting the target object in the video data corresponding to the video frame.
According to still another aspect of the present application, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to yet another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method described above.
According to yet another aspect of the application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as described above.
In this way, the solution of the present application addresses the problem that existing video interaction is limited to a single mode, such as the text-and-image form, and it enriches both the interaction modes and the presentation forms of the video, improving the user experience while enriching it.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present application;
FIG. 2 is a schematic diagram according to a second embodiment of the present application;
FIG. 3 is a schematic illustration according to a third embodiment of the present application;
FIG. 4 is a first schematic diagram of a content processing apparatus according to an embodiment of the present application;
FIG. 5 is a second schematic structural diagram of a content processing apparatus according to an embodiment of the present application;
FIG. 6 is a third schematic structural diagram of a content processing apparatus according to an embodiment of the present application;
FIG. 7 is a fourth schematic structural diagram of a content processing apparatus according to an embodiment of the present application;
FIG. 8 is a block diagram of an electronic device for implementing a content processing method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
According to an embodiment of the present application, there is also provided a content processing method, as shown in fig. 1, including:
step S101: determining a target object, wherein the target object is an object shown in a video frame. For example, the target object is selected from objects displayed in a video frame.
Step S102: comment data for the target object is received.
Step S103: and converting the comment data into target audio data.
Step S104: and outputting the target audio data commenting the target object in the video data corresponding to the video frame.
In this way, the solution of the present application addresses the problem that existing video interaction is limited to a single mode, such as the text-and-image form, and it enriches both the interaction modes and the presentation forms of the video, improving the user experience while enriching it.
Here, in one example, the objects presented in the video frame may be film or television characters or the like; correspondingly, the target object is the character that the user selects from the objects displayed in the video frame, so that the user can comment specifically on the selected character.
In a specific example, step S103 includes:
step S103-1: and acquiring voiceprint characteristic data of the target object in the video data corresponding to the video frame.
Step S103-2: and converting the comment data into target audio data matched with the voiceprint characteristic data.
In this way, the comment data can be converted into audio data matched with the voiceprint feature data of the target object in the video data, namely the target audio data, so that a new and interesting interaction mode is added to the viewing process. This solves the problem that existing video interaction, limited as it is to the text-and-image form, offers only a single mode, and it enriches both the interaction modes and the video presentation forms.
Moreover, the user can freely select a target object of interest, and the way of commenting on the target object is unchanged, that is, the user can directly input comment data in any form; this enriches and improves the user experience, and it can also stimulate the user's creative thinking.
In a specific example of the present application, the voiceprint feature data of the target object can be obtained as follows, that is, step S103-1 includes:
acquiring audio data corresponding to the target object in the video data;
and inputting the audio data corresponding to the target object in the video data into a preset audio model to obtain the voiceprint feature data of the target object in the video data.
In this way, the voiceprint feature data are obtained using a preset audio model, which improves the precision of the obtained voiceprint feature data and the degree of subsequent matching, laying a foundation for highly imitative target audio data and, at the same time, for an improved user experience.
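For illustration only, step S103-1 can be sketched as follows in Python; the open-source Resemblyzer speaker encoder is used here merely as a stand-in for the "preset audio model" (the patent does not name a specific model), and the function name extract_voiceprint is a hypothetical helper.

```python
# A minimal sketch of step S103-1 under stated assumptions: the Resemblyzer
# speaker encoder stands in for the "preset audio model"; the patent does not
# specify which model is actually used.
from resemblyzer import VoiceEncoder, preprocess_wav

def extract_voiceprint(target_audio_path: str):
    """Return voiceprint feature data (a fixed-length speaker embedding)
    for the target object, computed from its audio in the video data."""
    wav = preprocess_wav(target_audio_path)   # load and resample the audio
    encoder = VoiceEncoder()                  # pretrained speaker encoder
    return encoder.embed_utterance(wav)       # e.g. a 256-dim voiceprint vector
```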
Of course, in practical applications, the comment data may include not only text information but also information such as pictures and emoticons. In a specific example of the present application, a speech conversion process may be performed on the text information input by the user in the comment data. When the comment data contains non-text information such as pictures or emoticons, the non-text information may first be converted into text information, and then the speech conversion process is performed. Specifically, step S103-2 includes:
converting text information in the comment data into target audio data matched with the voiceprint feature data; or, converting information other than text information in the comment data (that is, the non-text information) into text information, and converting at least the converted text information into target audio data matched with the voiceprint feature data.
This avoids restricting the data form of the comment data input by the user and, while remaining compatible with existing comment data formats, further improves the user experience.
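The branching just described can be sketched as follows; image_to_text and voice_clone_tts are hypothetical placeholders standing in for an image-captioning/OCR step and a voiceprint-conditioned text-to-speech engine, neither of which is specified by the patent.

```python
# A sketch of step S103-2 under stated assumptions: image_to_text and
# voice_clone_tts are placeholder helpers, not components named by the patent.
def image_to_text(item) -> str:
    # Placeholder: a real system would caption the picture or map the
    # emoticon to a textual description here.
    return "[non-text comment content]"

def voice_clone_tts(text: str, voiceprint) -> str:
    # Placeholder: a real system would synthesize speech conditioned on the
    # voiceprint and return the path of the generated audio file.
    return "comment_tts.wav"

def comment_to_target_audio(comment: dict, voiceprint) -> str:
    """comment, e.g. {'text': 'Well done!', 'images': [...], 'emojis': [...]}"""
    parts = [comment["text"]] if comment.get("text") else []
    # Non-text information (pictures, expressions) is first converted to text.
    for item in comment.get("images", []) + comment.get("emojis", []):
        parts.append(image_to_text(item))
    # At least the converted text is then synthesized with the target voiceprint.
    return voice_clone_tts(" ".join(parts), voiceprint)
```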
In a specific example of the present application, in order to effectively identify the target object selected by the user, before step S101, as shown in fig. 2, the following steps are further performed:
step 001: during the playing of the video data, a second user operation, such as a click operation, is detected.
Step 002: and in response to the second user operation, pausing the playing of the video data. Here, after the playback is paused, the video frame corresponding to the second user operation is presented.
Step 003: and identifying objects displayed in the video frame corresponding to the second user operation, and visually highlighting the identified objects, wherein the target object is selected from the visually highlighted objects.
In this way, the objects in the current video frame can be effectively identified and visually highlighted, making it easy for the user to pick out the target object of interest from the current frame, which simplifies user operation and further improves the user experience.
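For illustration only, the sketch below detects candidate objects in the paused frame and draws highlight boxes around them; it assumes the objects of interest are faces and uses OpenCV's bundled Haar cascade purely as a stand-in for whatever recognition model is actually deployed.

```python
# A sketch of steps 001-003, assuming (for illustration only) that the objects
# shown in the paused frame are faces and that OpenCV's bundled Haar cascade
# stands in for the recognition model actually used.
import cv2

def highlight_objects(frame):
    """Identify objects in the paused video frame and visually highlight them
    so the user can select the target object from the highlighted candidates."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in boxes:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)  # highlight
    return frame, boxes  # boxes are the selectable candidate objects
```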
In a specific example of the solution of the present application, to further enhance the user experience, a visually highlighted area may additionally be provided. Specifically, a visually highlighted area is added to at least some of the frames of the video data from the selected video frame as the start time onward, where the visually highlighted area is displayed in the video data following the target object, that is, it moves as the target object moves in the video data, so as to present the visual effect of the target audio data being output from the visually highlighted area. This adds interest to the video display effect, enriching and improving the user experience.
In a specific example of the present application, after the visually highlighted area is provided, a new interactive function may be assigned to it, namely that the visually highlighted area can respond to a first user operation and present the visual and auditory effects of the target audio data being output from that area. That is to say, in practical applications, the highlighted area can also respond to a user operation and then play the target audio data, presenting the visual and auditory effects of the target audio data being output from the highlighted area; this improves user interactivity and user control, enriching and improving the user experience.
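One way to make a highlighted bubble follow the target object is to run a visual tracker from the selected frame onward. The sketch below is a minimal illustration assuming an opencv-contrib build that exposes cv2.TrackerCSRT_create; the bubble is reduced to a filled rectangle containing the comment text.

```python
# A sketch of a visually highlighted area that follows the target object,
# assuming an opencv-contrib build providing cv2.TrackerCSRT_create.
import cv2

def follow_with_bubble(capture, target_box, comment_text, duration_frames):
    """Track target_box = (x, y, w, h) from the current frame and draw a
    bubble caption above it for roughly the duration of the comment audio."""
    tracker = cv2.TrackerCSRT_create()
    ok, frame = capture.read()
    tracker.init(frame, target_box)
    frames = [frame]
    for _ in range(duration_frames):
        ok, frame = capture.read()
        if not ok:
            break
        tracked, box = tracker.update(frame)
        if tracked:
            x, y, w, h = (int(v) for v in box)
            # Draw the bubble just above the tracked object and write the comment.
            cv2.rectangle(frame, (x, y - 40), (x + w, y - 10), (255, 255, 255), -1)
            cv2.putText(frame, comment_text, (x + 5, y - 18),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1)
        frames.append(frame)
    return frames
```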
In a specific example of the present application, after the target audio data is determined, data synthesis may be performed in one manner, that is, step S104 includes:
synthesizing the target audio data with at least part of the video of the video data from the selected video frame as the start time onward, and outputting the synthesized video data containing the target audio data.
Naturally, to further enhance the display effect, the video may be synthesized at the same time as the audio, that is, the target audio data and the visually highlighted area are synthesized with at least part of the video data after the start time, and the synthesized video data including the visually highlighted area and the target audio data is output.
In this way, the target audio data can be played within the video data. For timeliness, the target audio data is synthesized into the portion of the video following the current video frame as the start time, for example a number of subsequent consecutive frames or all of the subsequent video, which improves the viewing experience and maximizes the match between the target audio data and the content currently being viewed.
In a specific example of the present application, after the target audio data is determined, data synthesis may be performed in another manner, that is, step S104 includes:
superimposing the target audio data onto at least part of the audio of the video data after the start time, synthesizing the target audio data with at least part of the video of the video data after the start time, and outputting video data that contains the target audio data with an audio superimposition effect.
Of course, to further enhance the display effect, the video may be synthesized at the same time as the audio, that is, the target audio data is superimposed onto at least part of the audio of the video data after the start time, the target audio data is synthesized with at least part of the video data after the start time, the visually highlighted area is synthesized with at least part of the video data after the start time, and video data that contains the visually highlighted area and the target audio data with an audio superimposition effect is output.
In this way, the target audio data can be played overlaid on the original audio in the video data, which enriches the forms of user experience. In addition, for timeliness, the target audio data is synthesized into the portion of the video following the current video frame as the start time, for example a number of subsequent consecutive frames or all of the subsequent video, which improves the viewing experience and maximizes the match between the target audio data and the content currently being viewed.
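A minimal sketch of the audio-superimposition variant of step S104 is given below, assuming the MoviePy 1.x API is acceptable for muxing; start_time marks the selected video frame (in seconds), and the file paths are illustrative only.

```python
# A sketch of the audio-superimposition variant of step S104, assuming the
# MoviePy 1.x API; the target audio is mixed over the original soundtrack
# starting at the selected video frame (start_time, in seconds).
from moviepy.editor import VideoFileClip, AudioFileClip, CompositeAudioClip

def superimpose_comment_audio(video_path, comment_audio_path, start_time, out_path):
    video = VideoFileClip(video_path)
    comment = AudioFileClip(comment_audio_path).set_start(start_time)
    # Overlay the target audio data onto the original audio from start_time on.
    mixed = CompositeAudioClip([video.audio, comment])
    video.set_audio(mixed).write_videofile(out_path, audio_codec="aac")
    return out_path
```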
In this way, the comment data can be converted into audio data matched with the voiceprint feature data of the target object in the video data, namely the target audio data, so that a new and interesting interaction mode is added to the viewing process. This solves the problem that existing video interaction, limited as it is to the text-and-image form, offers only a single mode, and it enriches both the interaction modes and the video presentation forms.
Moreover, the user can freely select a target object of interest, and the way of commenting on the target object is unchanged, that is, the user can directly input comment data in any form; this enriches and improves the user experience, and it can also stimulate the user's creative thinking.
The present application is described in detail below with reference to a specific example. Specifically, in order to embed a bullet-screen comment voicing a particular character's thoughts, such as "xx: xxx", into the original video, and to make the embedded voice-over consistent with the tone and mood of that character in the original video, the scheme of the present application provides a new content processing method. As shown in fig. 3, the specific flow includes:
step 1: the method comprises the steps that a user selects picture characters in a current video frame and adds corresponding comment contents at a client, wherein the picture characters are selected mainly aiming at the characters identified in the picture of a certain specific scene, and after the characters are selected, text comment contents, namely voice-over contents, are added by taking the designated characters as first people.
Step 2: the comment is converted into corresponding audio, mainly by converting the text comment content that the user added for the designated character into an audio resource in that character's original dubbed voice, i.e. converting the text comment content into audio matched with the voiceprint characteristics of the designated character, such as tone of voice and timbre. Here, the audio corresponding to the text comment content may be generated by intelligently learning from the designated character's original voice audio, so that the generated audio matches that character's voiceprint features.
Step 3: audio superimposition, i.e. taking the selected frame (the current video frame) as the start time and merging the audio corresponding to the text comment content into the audio of the original video, for example in an audio-superimposed manner, to obtain the video audio with the comment audio resource synthesized in.
Step 4: the caption follows the picture, i.e. the text comment content is converted into a bubble-caption form, the bubble caption is placed next to the designated character and displayed following that character, and its display time is determined by the duration of the audio corresponding to the text comment content.
Step 5: video synthesis, i.e. the previously superimposed audio and the bubble-caption picture are combined and finally synthesized and blended into the original video.
In this way, the interaction forms are enriched, the interest and interactivity of the video are improved, the user's creative thinking is stimulated, and the content forms of the video are enriched. An end-to-end sketch of this flow is given below.
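Tying the example's five steps together, the sketch below assumes the helper functions from the earlier sketches (extract_voiceprint, comment_to_target_audio, superimpose_comment_audio) are in scope; the bubble-caption step would reuse the tracking sketch above and is omitted here for brevity.

```python
# An end-to-end sketch of steps 1-5 above, under the assumption that the
# helper functions sketched earlier in this description are available; paths
# and timings are illustrative only.
def embed_voice_over(video_path, frame_time, character_audio_path, comment, out_path):
    # Step 1: the user has paused at frame_time and chosen a character; the
    # character's own audio (character_audio_path) comes from the original video.
    voiceprint = extract_voiceprint(character_audio_path)           # Step 2: voiceprint
    comment_audio = comment_to_target_audio(comment, voiceprint)    # Step 2: comment -> audio
    # Steps 3 and 5: superimpose the comment audio onto the original soundtrack
    # from the selected frame onward and write out the synthesized video.
    return superimpose_comment_audio(video_path, comment_audio, frame_time, out_path)
```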
According to an embodiment of the present application, there is also provided a content processing apparatus, as shown in fig. 4, including:
a determining unit 401, configured to determine a target object, where the target object is an object shown in a video frame;
a comment data receiving unit 402 configured to receive comment data for the target object;
a content processing unit 403, configured to convert the comment data into target audio data, and output the target audio data commenting on the target object in video data corresponding to the video frame.
In a specific example of the scheme of the present application, as shown in fig. 5, the apparatus further includes:
a display area adding unit 404, configured to add a visual highlight area in at least part of the frames of the video data after the video frame is taken as a start time, wherein the visual highlight area is displayed following the target object in the video data so as to present a visual effect of the target audio data output from the visual highlight area.
In a specific example of the present solution, the visually highlighted area is capable of responding to a first user action and presenting a visual effect that the target audio data is output from the visually highlighted area.
In a specific example of the present application, the content processing unit 403 is further configured to combine the target audio data with at least a portion of video of the video data after the video frame is taken as a start time, and output the combined video data including the target audio data.
In a specific example of the present application, the content processing unit 403 is further configured to superimpose the target audio data onto at least part of the audio of the video data after the start time, synthesize the target audio data with at least part of the video data after the start time, and output the video data containing the target audio data with an audio superimposing effect.
In a specific example of the scheme of the present application, as shown in fig. 6, the apparatus further includes a feature data acquisition unit 405; wherein,
the feature data acquiring unit 405 is configured to acquire voiceprint feature data of the target object in video data corresponding to the video frame;
the content processing unit 403 is configured to convert the comment data into target audio data matched with the voiceprint feature data.
In a specific example of the present application, the feature data obtaining unit 405 includes:
the audio data acquisition subunit is used for acquiring audio data corresponding to the target object in the video data;
and the characteristic data extraction subunit is used for inputting the audio data corresponding to the target object in the video data into a preset audio model to obtain the voiceprint characteristic data of the target object in the video data.
In a specific example of the present application, the content processing unit 403 is further configured to:
converting text information in the comment data into target audio data matched with the voiceprint feature data; or,
and converting other information except the text information in the comment data into text information, and converting at least the text information obtained after conversion into target audio data matched with the voiceprint feature data.
In a specific example of the scheme of the present application, as shown in fig. 7, the apparatus further includes:
a detecting unit 406, configured to detect a second user operation during the playing process of the video data;
a response unit 407, configured to pause playing of the video data in response to the second user operation;
and an identification processing unit 408, configured to identify an object shown in the video frame corresponding to the second user operation, and visually highlight the identified object, where the target object is selected from the visually highlighted objects.
In this way, the comment data can be converted into audio data matched with the voiceprint feature data of the target object in the video data, namely the target audio data, so that a new and interesting interaction mode is added to the viewing process. This solves the problem that existing video interaction, limited as it is to the text-and-image form, offers only a single mode, and it enriches both the interaction modes and the video presentation forms.
Moreover, the user can freely select a target object of interest, and the way of commenting on the target object is unchanged, that is, the user can directly input comment data in any form; this enriches and improves the user experience, and it can also stimulate the user's creative thinking.
There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.
Fig. 8 is a block diagram of an electronic device for the content processing method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 8, the electronic apparatus includes: one or more processors 801, memory 802, and interfaces for connecting the various components, including a high speed interface and a low speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 8 illustrates an example of a processor 801.
The memory 802 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the content processing method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the content processing method provided by the present application.
The memory 802 is a non-transitory computer-readable storage medium that can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the content processing method in the embodiment of the present application (for example, the determination unit 401, the comment data receiving unit 402, the feature data acquisition unit 405, the content processing unit 403, the detection unit 406, the response unit 407, the recognition processing unit 408, and the display area addition unit 404 shown in fig. 6). The processor 801 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 802, that is, implements the content processing method in the above-described method embodiment.
The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device of the content processing method, and the like. Further, the memory 802 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, which may be connected to the electronics of the content processing method via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the content processing method may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, and are exemplified by a bus in fig. 8.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the content processing method, such as an input device of a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 804 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in the traditional physical host and Virtual Private Server (VPS) service.
In this way, the comment data can be converted into audio data matched with the voiceprint feature data of the target object in the video data, namely the target audio data, so that a new and interesting interaction mode is added to the viewing process. This solves the problem that existing video interaction, limited as it is to the text-and-image form, offers only a single mode, and it enriches both the interaction modes and the video presentation forms.
Moreover, the user can freely select a target object of interest, and the way of commenting on the target object is unchanged, that is, the user can directly input comment data in any form; this enriches and improves the user experience, and it can also stimulate the user's creative thinking.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present application is not limited in this regard, as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (18)

1. A content processing method, comprising:
determining a target object, wherein the target object is an object displayed in a video frame;
receiving comment data for the target object;
acquiring voiceprint characteristic data of the target object in video data corresponding to the video frame;
and converting the comment data into target audio data matched with the voiceprint feature data, and outputting the target audio data for commenting the target object in the video data corresponding to the video frame.
2. The method of claim 1, further comprising:
adding a visual highlight region in at least a portion of the video data after the video frame is taken as a start time, wherein the visual highlight region is displayed in the video data following the target object so as to present a visual effect of the target audio data output from the visual highlight region.
3. The method of claim 2, wherein the visual highlight region is responsive to a first user operation and presents a visual effect of the target audio data being output from the visual highlight region.
4. The method of claim 1, 2 or 3, wherein said outputting the target audio data commenting on the target object in video data corresponding to a video frame comprises:
and synthesizing the target audio data with at least part of video of the video data after the video frame is taken as the starting time, and outputting the synthesized video data containing the target audio data.
5. The method of claim 1, 2 or 3, wherein said outputting the target audio data commenting on the target object in video data corresponding to a video frame comprises:
superimposing the target audio data onto at least part of audio of the video data after a start time, synthesizing the target audio data with at least part of video of the video data after the start time, and outputting video data with an audio superposition effect containing the target audio data.
6. The method according to claim 1, wherein the obtaining of the voiceprint feature data of the target object in the video data corresponding to the video frame comprises:
acquiring audio data corresponding to the target object in the video data;
and inputting the audio data corresponding to the target object in the video data into a preset audio model to obtain the voiceprint feature data of the target object in the video data.
7. The method of claim 1 or 6, wherein said converting the comment data into target audio data that matches the voiceprint feature data comprises:
converting text information in the comment data into target audio data matched with the voiceprint feature data; or,
and converting other information except the text information in the comment data into text information, and converting at least the text information obtained after conversion into target audio data matched with the voiceprint feature data.
8. The method of claim 1, further comprising:
detecting a second user operation in the playing process of the video data;
pausing the playing of the video data in response to the second user operation;
and identifying objects displayed in the video frame corresponding to the second user operation, and visually highlighting the identified objects, wherein the target object is selected from the visually highlighted objects.
9. A content processing apparatus comprising:
a determining unit, configured to determine a target object, wherein the target object is an object displayed in a video frame;
a comment data receiving unit configured to receive comment data for the target object;
the characteristic data acquisition unit is used for acquiring voiceprint characteristic data of the target object in the video data corresponding to the video frame;
and the content processing unit is used for converting the comment data into target audio data matched with the voiceprint feature data and outputting the target audio data for commenting the target object in the video data corresponding to the video frame.
10. The apparatus of claim 9, further comprising:
a display area adding unit configured to add a visual highlight area in at least a part of frames of the video data after the video frame is taken as a start time, wherein the visual highlight area is displayed following the target object in the video data so as to present a visual effect of the target audio data output from the visual highlight area.
11. The apparatus of claim 10, wherein the visual highlight area is responsive to a first user operation and presents a visual effect of the target audio data being output from the visual highlight area.
12. The apparatus according to claim 9, 10 or 11, wherein the content processing unit is further configured to synthesize the target audio data with at least a portion of video of the video data after the video frame is a start time, and output the synthesized video data including the target audio data.
13. The apparatus according to claim 9, 10 or 11, wherein the content processing unit is further configured to superimpose the target audio data onto at least part of audio of the video data after a start time, synthesize the target audio data with at least part of video of the video data after the start time, and output video data containing the target audio data with an audio superimposing effect.
14. The apparatus of claim 9, wherein the feature data acquisition unit comprises:
the audio data acquisition subunit is used for acquiring audio data corresponding to the target object in the video data;
and the characteristic data extraction subunit is used for inputting the audio data corresponding to the target object in the video data into a preset audio model to obtain the voiceprint characteristic data of the target object in the video data.
15. The apparatus of claim 9 or 14, wherein the content processing unit is further configured to:
converting text information in the comment data into target audio data matched with the voiceprint feature data; or,
and converting other information except the text information in the comment data into text information, and converting at least the text information obtained after conversion into target audio data matched with the voiceprint feature data.
16. The apparatus of claim 9, further comprising:
the detection unit is used for detecting a second user operation in the playing process of the video data;
a response unit for pausing the playing of the video data in response to the second user operation;
and the identification processing unit is used for identifying objects displayed in the video frame corresponding to the second user operation and visually highlighting the identified objects, wherein the target objects are selected from the visually highlighted objects.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
CN202010612062.3A 2020-06-30 2020-06-30 Content processing method, device, equipment and storage medium Active CN111770388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010612062.3A CN111770388B (en) 2020-06-30 2020-06-30 Content processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010612062.3A CN111770388B (en) 2020-06-30 2020-06-30 Content processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111770388A CN111770388A (en) 2020-10-13
CN111770388B (en) 2022-04-19

Family

ID=72724184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010612062.3A Active CN111770388B (en) 2020-06-30 2020-06-30 Content processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111770388B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114584824A (en) * 2020-12-01 2022-06-03 阿里巴巴集团控股有限公司 Data processing method and system, electronic equipment, server and client equipment
CN112637409B (en) * 2020-12-21 2022-05-06 维沃移动通信有限公司 Content output method and device and electronic equipment


Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101359473A (en) * 2007-07-30 2009-02-04 国际商业机器公司 Auto speech conversion method and apparatus
US9495713B2 (en) * 2010-12-10 2016-11-15 Quib, Inc. Comment delivery and filtering architecture
CN104811816B (en) * 2015-04-29 2018-04-13 北京奇艺世纪科技有限公司 A kind of is the method, apparatus and system that the object in video pictures plays barrage label
CN104980790B (en) * 2015-06-30 2018-10-09 北京奇艺世纪科技有限公司 The generation method and device of voice subtitle, playing method and device
CN105847939A (en) * 2016-05-12 2016-08-10 乐视控股(北京)有限公司 Bullet screen play method, bullet screen play device and bullet screen play system
US10394831B2 (en) * 2016-06-03 2019-08-27 Facebook, Inc. Profile with third-party content
EP3542360A4 (en) * 2016-11-21 2020-04-29 Microsoft Technology Licensing, LLC Automatic dubbing method and apparatus
CN107493442A (en) * 2017-07-21 2017-12-19 北京奇虎科技有限公司 A kind of method and apparatus for editing video
CN110600000B (en) * 2019-09-29 2022-04-15 阿波罗智联(北京)科技有限公司 Voice broadcasting method and device, electronic equipment and storage medium
CN110891198B (en) * 2019-11-29 2021-06-15 腾讯科技(深圳)有限公司 Video playing prompt method, multimedia playing prompt method, bullet screen processing method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1845126A (en) * 2005-04-08 2006-10-11 佳能株式会社 Information processing apparatus and information processing method
CN109246473A (en) * 2018-09-13 2019-01-18 苏州思必驰信息科技有限公司 The voice interactive method and terminal system of individualized video barrage based on Application on Voiceprint Recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Voiceprint Identification and Its Application; Hou Zunze (侯遵泽); Journal of the Armed Police Academy (武警学院学报); 2002-12-25 (Issue 06); full text *

Also Published As

Publication number Publication date
CN111770388A (en) 2020-10-13

Similar Documents

Publication Publication Date Title
RU2698158C1 (en) Digital multimedia platform for converting video objects into multimedia objects presented in a game form
US20180288450A1 (en) Method for inserting information push into live video streaming, server, and terminal
CN104881237A (en) Internet interaction method and client
CN112068750A (en) House resource processing method and device
CN111770388B (en) Content processing method, device, equipment and storage medium
CN109495427B (en) Multimedia data display method and device, storage medium and computer equipment
CN111935551A (en) Video processing method and device, electronic equipment and storage medium
CN111866550A (en) Method and device for shielding video clip
CN112182297A (en) Training information fusion model, and method and device for generating collection video
KR20210152396A (en) Video processing method, device, electronic equipment and storage medium
CN113596553A (en) Video playing method and device, computer equipment and storage medium
CN110913259A (en) Video playing method and device, electronic equipment and medium
CN111770376A (en) Information display method, device, system, electronic equipment and storage medium
CN111770384A (en) Video switching method and device, electronic equipment and storage medium
CN110798736B (en) Video playing method, device, equipment and medium
US20230336818A1 (en) Method and apparatus for shared viewing of media content
CN112528052A (en) Multimedia content output method, device, electronic equipment and storage medium
US20170139933A1 (en) Electronic Device, And Computer-Readable Storage Medium For Quickly Searching Video Segments
US20210392394A1 (en) Method and apparatus for processing video, electronic device and storage medium
CN116095388A (en) Video generation method, video playing method and related equipment
WO2018149170A1 (en) Cross-application control method and device
CN111723343B (en) Interactive control method and device of electronic equipment and electronic equipment
US11249823B2 (en) Methods and systems for facilitating application programming interface communications
CN113840177B (en) Live interaction method and device, storage medium and electronic equipment
CN114143572B (en) Live interaction method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant