WO2024082948A1 - Multimedia data processing method, apparatus, device and medium - Google Patents

Multimedia data processing method, apparatus, device and medium

Info

Publication number
WO2024082948A1
WO2024082948A1 (PCT/CN2023/122068)
Authority
WO
WIPO (PCT)
Prior art keywords
track
segment
editing
segments
text
Prior art date
Application number
PCT/CN2023/122068
Other languages
English (en)
French (fr)
Inventor
李欣玮
Original Assignee
北京字跳网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字跳网络技术有限公司
Priority to EP23825187.0A (published as EP4383698A1)
Publication of WO2024082948A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present disclosure relates to a multimedia data processing method, apparatus, device, and medium.
  • In the related art, a video containing the text content is generated from the text content that the user wishes to share.
  • However, the user's creative intent may change at any time; the current creation method is rigid in style, cannot satisfy the user's flexible and fine-grained needs, and yields multimedia data of limited quality.
  • the present disclosure provides a multimedia data processing method, device, equipment and medium.
  • An embodiment of the present disclosure provides a multimedia data processing method, the method comprising: receiving text information input by a user; and, in response to a processing instruction for the text information, generating multimedia data based on the text information and displaying a multimedia editing interface for editing the multimedia data, wherein the multimedia data comprises a plurality of multimedia segments, the plurality of multimedia segments respectively corresponding to a plurality of text segments into which the text information is divided, the plurality of multimedia segments comprising a plurality of voice segments generated by reading aloud the plurality of text segments respectively, and a plurality of video image segments respectively matching the plurality of text segments; the multimedia editing interface comprises a first editing track, a second editing track, and a third editing track, wherein the first editing track comprises a plurality of first track segments used to respectively identify the plurality of text segments, the second editing track comprises a plurality of second track segments used to respectively identify the plurality of video image segments, and the third editing track comprises a plurality of third track segments used to respectively identify the plurality of voice segments.
  • An embodiment of the present disclosure also provides a multimedia data processing device, the device comprising: a receiving module for receiving text information input by a user; a generating module for generating multimedia data based on the text information in response to a processing instruction for the text information; and a display module for displaying a multimedia editing interface for performing editing operations on the multimedia data, wherein the multimedia data comprises a plurality of multimedia segments, the plurality of multimedia segments respectively corresponding to the plurality of text segments into which the text information is divided, the plurality of multimedia segments comprising a plurality of voice segments generated by reading aloud the plurality of text segments respectively, and a plurality of video image segments respectively matched to the plurality of text segments; the multimedia editing interface comprises a first editing track, a second editing track, and a third editing track, wherein the first editing track comprises a plurality of first track segments used to respectively identify the plurality of text segments, the second editing track comprises a plurality of second track segments used to respectively identify the plurality of video image segments, and the third editing track comprises a plurality of third track segments used to respectively identify the plurality of voice segments.
  • An embodiment of the present disclosure also provides an electronic device, which includes: a processor; and a memory for storing instructions executable by the processor; the processor is configured to read the executable instructions from the memory and execute them to implement the multimedia data processing method provided by the embodiments of the present disclosure.
  • the embodiment of the present disclosure further provides a computer-readable storage medium, wherein the storage medium stores a computer program, and the computer program is used to execute the multimedia data processing method provided by the embodiment of the present disclosure.
  • the embodiment of the present disclosure also provides a computer program product.
  • When the instructions in the computer program product are executed by a processor, the multimedia data processing method provided by the embodiments of the present disclosure is implemented.
  • the technical solution provided by the disclosed embodiment has the following advantages:
  • The multimedia data processing solution receives text information input by a user, generates multimedia data based on the text information in response to a processing instruction for the text information, and displays a multimedia editing interface for editing the multimedia data.
  • The multimedia data includes multiple multimedia segments, which respectively correspond to the multiple text segments into which the text information is divided; the multiple multimedia segments include multiple voice segments generated by reading aloud the multiple text segments respectively, and multiple video image segments respectively matched to the multiple text segments.
  • The multimedia editing interface includes a first editing track, a second editing track, and a third editing track; the first, second, and third track segments that are aligned with the timeline on these editing tracks respectively identify the corresponding text segment, video image segment, and voice segment.
  • In this way, the editing tracks corresponding to the multimedia data are enriched, which can meet diversified editing needs and improve the quality of the multimedia data.
  • FIG1 is a schematic flow chart of a multimedia data processing method provided by an embodiment of the present disclosure.
  • FIG2 is a schematic diagram of a text input interface provided by an embodiment of the present disclosure.
  • FIG3 is a schematic diagram of a multimedia segment composition of multimedia data provided by an embodiment of the present disclosure.
  • FIG4 is a schematic diagram of a multimedia data processing scenario provided by an embodiment of the present disclosure.
  • FIG5 is a schematic diagram of a multimedia editing interface provided by an embodiment of the present disclosure.
  • FIG6 is a flow chart of another multimedia data processing method provided by an embodiment of the present disclosure.
  • FIG7 is a schematic diagram of another multimedia data processing scenario provided by an embodiment of the present disclosure.
  • FIG8 is a schematic diagram of another multimedia data processing scenario provided by an embodiment of the present disclosure.
  • FIG9 is a schematic diagram of another multimedia data processing scenario provided by an embodiment of the present disclosure.
  • FIG10 is a schematic flow chart of another multimedia data processing method provided by an embodiment of the present disclosure.
  • FIG11 is a schematic diagram of another multimedia data processing scenario provided by an embodiment of the present disclosure.
  • FIG12 is a schematic diagram of the structure of a multimedia data processing device provided by an embodiment of the present disclosure.
  • FIG13 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present disclosure.
  • the embodiments of the present disclosure provide a multimedia data processing method, in which the multimedia data is split into multiple editing tracks, such as a text editing track, a video image editing track, and a voice editing track, and the corresponding information is edited through the editing operations of the editing tracks, thereby meeting the diversified editing needs of the multimedia data and improving the quality of the multimedia data.
  • the multimedia data processing method is introduced below in conjunction with specific embodiments.
  • FIG1 is a flow chart of a multimedia data processing method provided by an embodiment of the present disclosure.
  • the method can be executed by a multimedia data processing device, wherein the device can be implemented by software and/or hardware, and can generally be integrated into an electronic device such as a computer.
  • the method includes:
  • Step 101: Receive text information input by a user.
  • A text input interface for editing text can be provided, and the input interface includes a text area and a link area.
  • The text information can be custom-edited according to the needs of video production: the text information input by the user in the text area of the text input interface can be received; alternatively, an authorized link can be pasted and the text information extracted from it, that is, the link information input by the user in the link area is received, the link information is identified to obtain the corresponding text information, and that text information is displayed in the text area for the user to edit. The text information displayed in the text area can also be edited multiple times.
  • The number of words in the text information may also be limited, for example, to no more than 2,000 words. Therefore, it is possible to check whether the number of words in the text area exceeds the limit; if it does, a pop-up window indicating the word limit may be displayed to remind the user.
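The word-limit check described above can be sketched as follows; the 2,000-word limit comes from the example in the text, while the function name and the character-based counting are assumptions made for illustration.

```python
WORD_LIMIT = 2000  # example limit named in the text

def check_word_limit(text: str, limit: int = WORD_LIMIT):
    """Return (ok, message); when the limit is exceeded, the caller
    would display the reminder pop-up instead of generating the video."""
    count = len(text)  # counting characters; for Chinese text this approximates "words"
    if count > limit:
        return False, f"Text exceeds the {limit}-word limit ({count})."
    return True, ""
```

A caller would run this check when the user triggers the generate-video button and only proceed when the first element of the result is true.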
  • The text input interface may also include a generate-video button and a timbre selection entry control; in response to the user triggering the timbre selection entry control, a candidate timbre menu may be displayed, wherein the candidate timbre menu includes one or more candidate timbres (for example, timbre types such as uncle, boy, girl, and loli) and an audition control corresponding to each candidate timbre.
  • The user can select a timbre from the candidate timbres when inputting text information, and the first target timbre is determined based on the user's selection operation in the candidate timbre menu. Further, multiple voice segments are obtained by reading aloud, with the first target timbre, the multiple text segments into which the text information is divided. At this time, the timbre of each voice segment is the first target timbre, which improves selection efficiency while satisfying the personalized choice of the first target timbre.
  • When the text information is divided into segments, it can be segmented according to the reading habits of the first target timbre to determine the text segment contained in each part.
  • Alternatively, the text information can be divided according to its semantic information, and then converted into multiple voice segments corresponding to the multiple text segments, with the text segment as the conversion granularity.
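Splitting the text information into sentence-level text segments might look like the following sketch; the punctuation-based rule is an assumption, since the disclosure only says that the split follows semantics and the timbre's reading habits.

```python
import re

def split_text_segments(text: str) -> list[str]:
    """Split text into segments after common Chinese/English
    sentence-ending punctuation; each segment later becomes one
    text/voice/video-image triple."""
    parts = re.split(r"(?<=[。！？!?；;])", text)
    return [p.strip() for p in parts if p.strip()]
```

For example, a three-clause input yields three text segments, each of which is then read aloud as one voice segment.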
  • Step 102: In response to a processing instruction for the text information, generate multimedia data based on the text information and display a multimedia editing interface for editing the multimedia data, wherein:
  • the multimedia data includes: a plurality of multimedia segments, the plurality of multimedia segments respectively corresponding to the plurality of text segments into which the text information is divided, the plurality of multimedia segments including a plurality of voice segments generated by reading aloud corresponding to the plurality of text segments respectively, and a plurality of video image segments respectively matching the plurality of text segments;
  • the multimedia editing interface includes: a first editing track, a second editing track, and a third editing track, wherein the first editing track contains multiple first track segments, and the multiple first track segments are respectively used to identify multiple text segments; the second editing track contains multiple second track segments, and the multiple second track segments are respectively used to identify multiple video image segments; the third editing track contains multiple third track segments, and the multiple third track segments are respectively used to identify multiple voice segments, wherein the first track segments, the second track segments, and the third track segments aligned with the timeline on the editing track respectively identify corresponding text segments, video image segments, and voice segments.
  • an entry for editing text information is provided, through which processing instructions for the text information are obtained.
  • the entry for editing text information may be a "generate video" control displayed in a text editing interface.
  • When this control is triggered, the processing instruction for the text information is obtained.
  • the entry for editing text information may also be a gesture action input entry, a voice information input entry, etc.
  • In response to the processing instruction for the text information, multimedia data is generated based on the text information, wherein the multimedia data includes a plurality of multimedia segments, the plurality of multimedia segments respectively corresponding to a plurality of text segments into which the text information is divided, the plurality of multimedia segments including a plurality of voice segments generated by reading aloud the plurality of text segments respectively, and a plurality of video image segments respectively matching the plurality of text segments.
  • the generated multimedia data is composed of a plurality of multimedia segments with segment granularity, and each multimedia segment contains at least three types of information, namely, a text segment, a voice segment (the initial timbre of the voice segment can be the first target timbre selected in the text editing interface in the above embodiment, etc.), a video image segment, etc.
  • The video image segment can include a video stream composed of continuous pictures, wherein the continuous pictures can correspond to pictures in a video stream matched from a preset video library, or to one or more pictures obtained by matching from a preset image material library.
  • Multimedia segment A contains text segment A1 “There are many types of women's clothing in today's society”, voice segment A2 corresponding to text segment A1, and video image segment A3 matching text segment A1;
  • multimedia segment B contains text segment B1 “such as common Hanfu”, voice segment B2 corresponding to text segment B1, and video image segment B3 matching text segment B1;
  • multimedia segment C contains text segment C1 “trendy clothing”, voice segment C2 corresponding to text segment C1, and video image segment C3 matching text segment C1;
  • multimedia segment D contains text segment D1 “sportswear, etc.”, voice segment D2 corresponding to text segment D1, and video image segment D3 matching text segment D1.
  • the multimedia data includes at least three types of information.
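A minimal data model for these multimedia segments could be sketched as below; the class and field names are illustrative, not taken from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class MultimediaSegment:
    """One segment bundles the three information types: a text
    segment, its read-aloud voice segment, and a matched video
    image segment (here represented by material identifiers)."""
    text_segment: str         # e.g. A1
    voice_segment: str        # e.g. A2, an audio asset id
    video_image_segment: str  # e.g. A3, a video/image asset id

multimedia_data = [
    MultimediaSegment("There are many types of women's clothing in today's society", "A2", "A3"),
    MultimediaSegment("such as common Hanfu", "B2", "B3"),
    MultimediaSegment("trendy clothing", "C2", "C3"),
    MultimediaSegment("sportswear, etc.", "D2", "D3"),
]
```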
  • A multimedia editing interface for editing the multimedia data is displayed, wherein the multimedia editing interface includes a first editing track, a second editing track, and a third editing track; the first editing track includes multiple first track segments used to respectively identify the multiple text segments, the second editing track includes multiple second track segments used to respectively identify the multiple video image segments, and the third editing track includes multiple third track segments used to respectively identify the multiple voice segments. In order to intuitively reflect that each multimedia segment corresponds to multiple types of information segments, the first, second, and third track segments are displayed aligned with the timeline on the editing tracks, and the track segments so aligned respectively identify the corresponding text segment, video image segment, and voice segment.
  • each multimedia segment is split into multiple editing tracks corresponding to information types.
  • In this way, the user can not only edit a single multimedia segment as a whole, but can also edit an individual information segment on a particular editing track within a single multimedia segment, thereby satisfying the user's diverse editing requirements and ensuring the quality of the generated multimedia data.
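The three timeline-aligned editing tracks can be modeled as parallel lists of (start, end) spans that share the same boundaries; the layout function below is an illustrative assumption, not the disclosed implementation.

```python
def build_tracks(segments, durations):
    """Place each segment's text/video/voice identifiers on three
    parallel tracks so that same-index track segments share one
    (start, end) span, i.e. are aligned with the timeline."""
    tracks = {"text": [], "video": [], "voice": []}
    t = 0.0
    for seg, d in zip(segments, durations):
        span = (t, t + d)
        tracks["text"].append((span, seg["text"]))
        tracks["video"].append((span, seg["video"]))
        tracks["voice"].append((span, seg["voice"]))
        t += d
    return tracks

tracks = build_tracks(
    [{"text": "A1", "video": "A3", "voice": "A2"},
     {"text": "B1", "video": "B3", "voice": "B2"}],
    [2.0, 3.0],
)
```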
  • In different application scenarios, the multimedia editing interface may be displayed in different ways.
  • The multimedia editing interface may include a video playback area, an editing area, and an editing track display area, wherein, in the editing track display area, the first track segments of the first editing track, the second track segments of the second editing track, and the third track segments of the third editing track are displayed aligned with the timeline.
  • The editing area is used to display the editing function controls corresponding to the currently selected information segment (the specific editing function controls can be set according to the needs of the actual scenario). The video playback area is used to display the image and text information at the current playback time point of the multimedia data. A reference line corresponding to the current playback time point can be displayed in the editing track display area, perpendicular to the time axis, to indicate the current playback position; the reference line can also be dragged, and the video playback area synchronously displays the image and text information of the multimedia segment at the reference line's real-time position, so that the user can view the multimedia data frame by frame by moving the reference line forward and backward. The video playback area may also display video playback controls.
  • When the video playback controls are triggered, the current multimedia data is played, so that the user can intuitively see the playback effect of the current multimedia data.
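Mapping the draggable reference line to the segment shown in the video playback area amounts to a span lookup; the tuple-based track representation used here is an assumption for illustration.

```python
def segment_at(track, playhead):
    """Return the identifier of the track segment under the
    reference line, where track is a list of ((start, end), id)."""
    for (start, end), seg_id in track:
        if start <= playhead < end:
            return seg_id
    return None  # playhead is past the end of the track

video_track = [((0.0, 2.0), "A3"), ((2.0, 5.0), "B3")]
```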
  • In the editing track display area, the four multimedia segments A, B, C, and D are displayed in a timeline-aligned manner, together with the track segments A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, and D3 corresponding to them.
  • The editing interface corresponding to text segment A1 is displayed in the editing area, and includes the editable text of A1 as well as editing controls such as font and font size.
  • The video playback area displays the multimedia data at the position corresponding to the reference line of the current multimedia data.
  • the multimedia editing interface in this embodiment may also include other editing tracks.
  • The number of other editing tracks is not limited here.
  • Each of the other editing tracks can be used to display information segments of other dimensions corresponding to the multimedia data.
  • the multimedia editing interface may also include a fourth editing track for identifying the background audio data.
  • In response to the user selecting the fourth editing track, the background sound currently used on the fourth editing track is displayed in a preset background sound editing area (such as the editing area mentioned in the above embodiment), together with replaceable candidate background sounds.
  • The replaceable candidate background sounds can be displayed in any style, such as labels, in the background sound editing area.
  • After the user selects a target background sound, the target background sound is updated and identified on the fourth editing track.
  • The multimedia data processing method of the embodiments of the present disclosure divides the multimedia data into multiple multimedia segments and provides, for each type of information they contain, a corresponding editing track.
  • The editing tracks can be used to edit and modify the information segments of the corresponding information type contained in each multimedia segment. This enriches the editing tracks corresponding to the multimedia data, meets diversified editing needs, and improves the quality of the multimedia data.
  • the following describes how to edit information segments of different information types of multimedia information segments corresponding to multimedia data.
  • the text segments corresponding to the multimedia information segments may be edited and modified separately.
  • the step of individually editing and modifying the text segment corresponding to the multimedia information segment includes:
  • Step 601: In response to the user selecting a first target track segment on the first editing track, display the text segment currently identified on the first target track segment in a text editing area.
  • In response to the user selecting a first target track segment on the first editing track (there may be one or more first target track segments), the text segment currently identified on the first target track segment is displayed in a text editing area, wherein the text editing area may be located in the editing area mentioned in the above embodiment.
  • the text editing area may also include other functional editing controls for the text segment, such as a font editing control, a font size editing control, etc.
  • Step 602: Based on a target text segment generated by the user modifying the currently displayed text segment in the text editing area, update and identify the target text segment on the first target track segment.
  • That is, after the user confirms the modification, the target text segment is updated and identified on the first target track segment.
  • For example, suppose the text segment currently identified on the first target track segment selected by the user is “There are a wide variety of types of women's clothing in today's society”,
  • and the user modifies the currently displayed text segment in the text editing area to “There are a wide variety of types of women's clothing in today's society and there are many of them”;
  • then the text segment on the first target track segment can be updated to “There are a wide variety of types of women's clothing in today's society and there are many of them”, thereby achieving targeted modification of a single text segment and meeting the demand for modifying a single text segment of the multimedia data.
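Updating the text identified on the first target track segment is a simple in-place replacement; the ((start, end), text) track representation below is an illustrative assumption.

```python
def update_text_segment(first_track, index, target_text):
    """Replace the text identified on the selected first target
    track segment (at `index`) with the user's modified text."""
    span, _old = first_track[index]
    updated = list(first_track)
    updated[index] = (span, target_text)
    return updated

track1 = [((0.0, 2.0), "There are a wide variety of types of women's clothing in today's society")]
track1 = update_text_segment(
    track1, 0,
    "There are a wide variety of types of women's clothing in today's society and there are many of them",
)
```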
  • the images in the video image segment can also be updated synchronously according to the modification of the text segment.
  • a second target track segment corresponding to the first target track segment is determined on a second editing track, a target video image segment matching the target text segment is obtained, and the target video image segment is updated and identified on the second target track segment.
  • The target text segment can be semantically matched with the pictures in the preset picture material library to determine a corresponding target video image, and then the target video image segment can be generated from the target video image.
  • the video segment matching the target text segment can also be directly determined in the preset video segment material library as the target video image segment, etc., which is not limited here.
  • In order to ensure synchronization between text and sound, the voice segment may also be modified synchronously according to the modification of the text segment.
  • a third target track segment corresponding to the first target track segment is determined on a third editing track, wherein the third track segment contains a voice segment corresponding to the text segment in the first target track segment, and a target voice segment corresponding to the target text segment is obtained, such as reading aloud the target text segment to obtain the target voice segment, and the target voice segment is updated and identified on the third target track segment, thereby achieving synchronous modification of voice and text.
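The synchronous regeneration of the matching video image segment and the re-read voice segment can be sketched with injected stand-ins for the material-matching and text-to-speech services, which the disclosure does not specify.

```python
def propagate_text_edit(target_text, match_video, synthesize_voice):
    """After a text segment is modified, obtain a target video image
    segment matching the new text and a target voice segment read
    aloud from it; both services are caller-supplied stand-ins."""
    return {
        "video": match_video(target_text),
        "voice": synthesize_voice(target_text),
    }

result = propagate_text_edit(
    "such as common Hanfu",
    match_video=lambda text: f"clip-matching:{text}",
    synthesize_voice=lambda text: f"tts:{text}",
)
```

The returned segments would then be updated and identified on the second and third target track segments respectively.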
  • The time length that the modified text segment occupies on the timeline may differ from that of the text segment before modification. Therefore, in different application scenarios, different display processing can be performed on the editing tracks in the editing track display area according to the change in time length.
  • In some scenarios, the video image segment corresponding to the multimedia segment is defined as a main information segment whose duration cannot be changed; in order to guarantee the duration of the video image segment, its length cannot be altered.
  • In this case, the second editing track is kept unchanged, that is, the length of the corresponding video image segment is guaranteed to remain unchanged, and a first updated track segment corresponding to the first updated time length is displayed in a preset first candidate area.
  • The target text segment is identified on the first updated track segment.
  • The first candidate area can be located, for example, in the area above the text segment before modification. Thus, even when the first updated time length of the target text segment on the first editing track is inconsistent with the time length of the text segment before modification, the target text segment is simply displayed in an “up track” form; the duration of the video image segment on the second editing track is not correspondingly modified, and visually the time lengths of the text segments of other multimedia segments are not affected.
  • Similarly, the second editing track is kept unchanged, and a third updated track segment corresponding to the third updated time length is displayed in a preset second candidate area, wherein the target voice segment is identified on the third updated track segment.
  • The second candidate area may be located, for example, in the area below the voice segment before modification.
  • The target voice segment is thus displayed in a “down track” form; the duration of the video image segment on the second editing track is not correspondingly modified, and visually the time lengths of the voice segments of other multimedia segments are not affected.
  • For example, the modified text segment “There are a wide variety of types of women's clothing in today's society and there are many of them” is obviously longer in time length than the text segment “There are a wide variety of types of women's clothing in today's society” before modification.
  • The modified text segment can therefore be displayed above the text segment before modification, keeping the duration of the corresponding video image segment unchanged.
  • Likewise, the modified voice segment reading “There are a wide variety of types of women's clothing in today's society and there are many of them” is obviously longer than the voice segment reading “There are a wide variety of types of women's clothing in today's society” before modification.
  • The modified voice segment can be displayed below the voice segment before modification, with the duration of the corresponding video image segment kept unchanged, which meets the requirement that the duration of the video image segment cannot be changed in the corresponding scenario.
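The “up track”/“down track” behavior, in which an updated segment whose duration changed is shown in a candidate area instead of resizing the fixed-duration video track, could be sketched as follows; the names and the in-place alternative are assumptions.

```python
def place_updated_segment(old_span, new_duration, keep_video_fixed=True):
    """Decide where an updated text/voice segment is displayed when
    its duration differs from the original. With a fixed-duration
    video image segment, the update goes to a candidate area (a row
    above the text track or below the voice track); otherwise it
    replaces the original segment in place."""
    start, end = old_span
    changed = abs(new_duration - (end - start)) > 1e-9
    area = "candidate" if (keep_video_fixed and changed) else "in_place"
    return {"area": area, "span": (start, start + new_duration)}
```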
  • In other scenarios, the length of the first target track segment is adjusted according to the first updated time length, that is, the length of the first target track segment is scaled at its original display position.
  • Similarly, the length of the third target track segment is adjusted according to the third updated time length.
  • the length of the second target track segment corresponding to the first target track segment and the third target track segment on the second editing track is adjusted accordingly, so that the time axes of the adjusted first target track segment, the adjusted second target track segment and the adjusted third target track segment are aligned, thereby achieving the alignment of all information segments contained in the multimedia segment on the timeline.
  • For example, when the modified voice segment reading “There are a wide variety of types of women's clothing in today's society and there are many of them” is obviously longer in time length than the voice segment before modification, the third target track segment is lengthened to display the modified voice segment, and the length of the corresponding second target track segment is lengthened accordingly.
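When durations are instead allowed to change, scaling the segment on all three tracks and shifting the later segments keeps the timeline aligned; the following is a sketch under the assumed ((start, end), id) span representation.

```python
def realign_segment(tracks, index, new_duration):
    """Scale the target segment on the text/video/voice tracks to
    `new_duration` and shift all following segments so that
    same-index track segments stay aligned on the timeline."""
    for name in ("text", "video", "voice"):
        spans = tracks[name]
        (start, _end), payload = spans[index]
        spans[index] = ((start, start + new_duration), payload)
        t = start + new_duration
        for i in range(index + 1, len(spans)):
            (s, e), p = spans[i]
            spans[i] = ((t, t + (e - s)), p)  # keep duration, shift start
            t += e - s
    return tracks

tracks = realign_segment(
    {"text": [((0.0, 2.0), "A1"), ((2.0, 5.0), "B1")],
     "video": [((0.0, 2.0), "A3"), ((2.0, 5.0), "B3")],
     "voice": [((0.0, 2.0), "A2"), ((2.0, 5.0), "B2")]},
    index=0, new_duration=3.0,
)
```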
  • the voice segment corresponding to the multimedia information segment can be edited and modified separately.
  • the step of separately editing and modifying the voice segment corresponding to the multimedia information segment includes:
  • Step 1001: In response to a third target track segment selected by a user on the third editing track, wherein the third target track segment identifies the voice segment corresponding to the text segment displayed by the first target track segment.
  • In response to a user selecting a third target track segment on the third editing track, where there may be one or more third target track segments, the third target track segment identifies the voice segment corresponding to the text segment displayed by the first target track segment; that is, in this embodiment, the voice segment can be edited separately.
  • Step 1002: Display the current timbre used by the voice segment on the third target track segment in a preset audio editing area, and display replaceable candidate timbres.
  • the preset audio editing area in this embodiment can be located in the editing area mentioned in the above embodiment, and the current timbre used by the voice segment on the third target track segment is displayed in the preset audio editing area, as well as the replaceable candidate timbres.
  • The candidate replacement timbres can be displayed in any style, such as in label form. For example, labels such as "uncle", "girl", "old man" and so on for the candidate replacement timbres can be displayed, and users can select a candidate timbre by triggering the corresponding label.
  • Step 1003: Based on the second target timbre generated by the user modifying the current timbre according to the candidate timbres in the audio editing area, update and identify the target voice segment on the third target track segment, wherein the target voice segment is a voice segment generated by reading aloud, using the second target timbre, the text segment identified by the first target track segment.
  • the user can trigger the candidate timbre to modify the current timbre, and modify the current timbre to the triggered candidate timbre, i.e., the second target timbre.
  • the timbre of the voice segment on the third track segment is modified, satisfying the user's need to modify the timbre of a certain voice segment.
  • the user can modify multiple voice segments corresponding to the third track segment to different timbres, thereby achieving an interesting voice playback effect.
  • the multimedia data processing method of the disclosed embodiment can flexibly edit and modify the text segments, voice segments, etc. corresponding to the multimedia segments separately, further meeting the diversified editing needs of multimedia data and improving the quality of multimedia data.
  • the present disclosure also proposes a multimedia data processing device.
  • FIG. 12 is a schematic diagram of the structure of a multimedia data processing device provided by an embodiment of the present disclosure.
  • the device can be implemented by software and/or hardware and can generally be integrated into an electronic device for multimedia data processing.
  • the device includes: a receiving module 1210, a generating module 1220 and a display module 1230, wherein:
  • the receiving module 1210 is used to receive text information input by the user;
  • a generating module 1220, used to generate multimedia data based on the text information in response to a processing instruction for the text information;
  • the display module 1230 is used to display a multimedia editing interface for performing editing operations on multimedia data, wherein:
  • the multimedia data includes: a plurality of multimedia segments, the plurality of multimedia segments respectively corresponding to the plurality of text segments into which the text information is divided, the plurality of multimedia segments including a plurality of voice segments generated by reading aloud corresponding to the plurality of text segments respectively, and a plurality of video image segments respectively matching the plurality of text segments;
  • the multimedia editing interface includes: a first editing track, a second editing track, and a third editing track, wherein the first editing track contains multiple first track segments, and the multiple first track segments are respectively used to identify multiple text segments; the second editing track contains multiple second track segments, and the multiple second track segments are respectively used to identify multiple video image segments; the third editing track contains multiple third track segments, and the multiple third track segments are respectively used to identify multiple voice segments, wherein the first track segments, the second track segments, and the third track segments aligned with the timeline on the editing track respectively identify corresponding text segments, video image segments, and voice segments.
  • the receiving module is specifically used for:
  • Receive the text information input by the user in a text area; and/or receive the link information input by the user in a link area, identify the link information to obtain the text information on the corresponding page, and display it in the text area for the user to edit.
  • a first display module, used to display a timbre selection entry control;
  • a second display module is used to display a candidate timbre menu in response to a user triggering operation on a timbre selection entry control, wherein the candidate timbre menu includes candidate timbres and audition controls corresponding to the candidate timbres;
  • the timbre determination module is used to determine the first target timbre according to the user's selection operation in the candidate timbre menu;
  • the voice segment acquisition module is used to acquire multiple voice segments generated by reading aloud the multiple text segments divided into text information based on the first target timbre.
  • a third display module configured to respond to a first target track segment selected by a user on a first editing track, and display a text segment currently identified on the first target track segment in a text editing area;
  • the text segment editing module is used to update the target text segment on the first target track segment based on the target text segment generated by the user modifying the currently displayed text segment in the text editing area.
  • a track segment determination module configured to determine, in response to a text update operation on a target text segment on the first target track segment, a third target track segment corresponding to the first target track segment on the third editing track;
  • the speech segment acquisition module is used to acquire a target speech segment corresponding to the target text segment, and update the target speech segment on the third target track segment.
  • the method further includes: a first duration display processing module, configured to:
  • when it is detected that the first update duration corresponding to the target text segment on the first editing track is inconsistent with the duration corresponding to the text segment before modification, keep the second editing track unchanged, and display a first update track segment corresponding to the first update duration in a preset first candidate area, wherein the target text segment is identified on the first update track segment;
  • when it is detected that the third update duration corresponding to the target voice segment on the third editing track is inconsistent with the duration corresponding to the voice segment before modification, keep the second editing track unchanged, and display a third update track segment corresponding to the third update duration in a preset second candidate area, wherein the target voice segment is identified on the third update track segment.
  • the method further includes: a second duration display processing module, configured to:
  • when the update durations of the target text segment and the target voice segment are inconsistent with the durations before modification, adjust accordingly the length of the second target track segment corresponding to the first target track segment and the third target track segment on the second editing track, so that the time axes of the adjusted first target track segment, the adjusted second target track segment, and the adjusted third target track segment are aligned.
  • a video image update module used to:
  • a target video image segment matching the target text segment is obtained, and the target video image segment is updated and identified on the second target track segment.
  • a tone update module used for:
  • the target voice segment is updated and identified on the third target track segment, wherein the target voice segment is a voice segment generated by reading the text segment identified by the first target track segment using the second target timbre.
  • the multimedia editing interface also includes:
  • a background sound display module configured to display the current background sound used by the fourth editing track in a preset background sound editing area in response to a trigger operation on the fourth editing track, and to display replaceable candidate background sounds;
  • the background sound update processing module is used to update the target background sound on the fourth editing track based on the target background sound generated by the user modifying the current background sound according to the candidate background sound in the background sound editing area.
  • the multimedia data processing device provided in the embodiments of the present disclosure can execute the multimedia data processing method provided in any embodiment of the present disclosure, and has functional modules corresponding to the executed method and corresponding beneficial effects, which will not be repeated here.
  • the present disclosure further proposes a computer program product, including a computer program/instruction, which implements the multimedia data processing method in the above embodiments when executed by a processor.
  • FIG. 13 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present disclosure.
  • the electronic device 1300 in the embodiment of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), vehicle-mounted terminals (such as vehicle-mounted navigation terminals), etc., and fixed terminals such as digital TVs, desktop computers, etc.
  • the electronic device shown in FIG. 13 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
  • the electronic device 1300 may include a processor (e.g., a central processing unit, a graphics processing unit, etc.) 1301, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1302 or a program loaded from a memory 1308 to a random access memory (RAM) 1303.
  • Various programs and data required for the operation of the electronic device 1300 are also stored in the RAM 1303.
  • the processor 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304.
  • An input/output (I/O) interface 1305 is also connected to the bus 1304.
  • the following devices may be connected to the I/O interface 1305: input devices 1306 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 1307 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 1308 including, for example, a magnetic tape, a hard disk, etc.; and communication devices 1309.
  • the communication devices 1309 may allow the electronic device 1300 to communicate wirelessly or wired with other devices to exchange data.
  • Although FIG. 13 shows an electronic device 1300 with various devices, it should be understood that it is not required to implement or have all of the devices shown; more or fewer devices may alternatively be implemented or provided.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from a network via the communication device 1309, or installed from the storage device 1308, or installed from the ROM 1302.
  • When the computer program is executed by the processor 1301, the above functions defined in the multimedia data processing method of the embodiment of the present disclosure are performed.
  • the computer-readable medium disclosed above may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium containing or storing a program that may be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which a computer-readable program code is carried.
  • This propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above.
  • the computer readable signal medium may also be any computer readable medium other than a computer readable storage medium, which may send, propagate or transmit a program for use by or in conjunction with an instruction execution system, apparatus or device.
  • the program code contained on the computer readable medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and server may communicate using any currently known or future developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network).
  • Examples of communication networks include a local area network ("LAN”), a wide area network ("WAN”), an internet (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future developed network.
  • the computer-readable medium may be included in the electronic device, or may exist independently without being incorporated into the electronic device.
  • the computer-readable medium carries one or more programs; when the one or more programs are executed by the electronic device, the electronic device is caused to: receive text information input by the user; and in response to a processing instruction for the text information, generate multimedia data based on the text information and display a multimedia editing interface for performing editing operations on the multimedia data.
  • the multimedia data includes multiple multimedia segments, the multiple multimedia segments respectively corresponding to the multiple text segments into which the text information is divided; the multiple multimedia segments include multiple voice segments generated by reading aloud the multiple text segments respectively, and multiple video image segments respectively matched to the multiple text segments;
  • the multimedia editing interface includes: a first editing track, a second editing track, and a third editing track, wherein a first track segment of the first editing track, a second track segment of the second editing track, and a third track segment of the third editing track that are aligned with the timeline on the editing tracks respectively identify the corresponding text segment, video image segment, and voice segment.
  • the editing tracks corresponding to the multimedia data are enriched, which can meet the diversified editing needs of the multimedia data and improve the quality of the multimedia data.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may be executed entirely on the user's computer, partially on the user's computer, as a separate software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server.
  • the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., via the Internet using an Internet service provider).
  • each box in the flowchart or block diagram may represent a module, a program segment, or a portion of a code, which contains one or more executable instructions for implementing a specified logical function.
  • the functions marked in the boxes may also occur in an order different from that marked in the accompanying drawings. For example, two boxes represented in succession may actually be executed substantially in parallel, and they may sometimes be executed in the opposite order, depending on the functions involved.
  • each box in the block diagram and/or flowchart, and combinations of boxes in the block diagram and/or flowchart, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure may be implemented by software or hardware, wherein the name of a unit does not, in some cases, constitute a limitation on the unit itself.
  • exemplary types of hardware logic components include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and the like.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, device, or equipment.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or equipment, or any suitable combination of the foregoing.
  • a more specific example of a machine-readable storage medium may include an electrical connection based on one or more lines, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.


Abstract

A multimedia data processing method, apparatus, device, and medium, wherein the method includes: receiving text information input by a user; in response to a processing instruction for the text information, generating multimedia data based on the text information, and displaying a multimedia editing interface for performing editing operations on the multimedia data, wherein the multimedia data includes multiple multimedia segments, and the multimedia editing interface includes a first editing track, a second editing track, and a third editing track; a first track segment, a second track segment, and a third track segment that are timeline-aligned on the editing tracks respectively identify the corresponding text segment, video image segment, and voice segment. In the embodiments of the present disclosure, the editing tracks corresponding to the multimedia data are enriched, which can meet the diversified editing needs of the multimedia data and improves the quality of the multimedia data.

Description

Multimedia data processing method, apparatus, device, and medium
This application claims priority to Chinese Patent Application No. 202211295639.8, filed on October 21, 2022, the entire disclosure of which is incorporated herein by reference as a part of this application.
Technical Field
The present disclosure relates to a multimedia data processing method, apparatus, device, and medium.
Background
With the development of computer technology, the ways of sharing knowledge and information have become increasingly diverse. In addition to text-based and audio-based information carriers, video-based information carriers can now be seen everywhere.
In the related art, a related image video containing text content is generated according to the text content that a user intends to share. However, users' creative ideas change constantly; the current creation method is rigid in style, cannot meet users' fine-grained needs for flexible processing, and the quality of the resulting multimedia data is not high.
Summary
In order to solve the above technical problems, or at least partially solve them, the present disclosure provides a multimedia data processing method, apparatus, device, and medium.
An embodiment of the present disclosure provides a multimedia data processing method, the method including: receiving text information input by a user; in response to a processing instruction for the text information, generating multimedia data based on the text information, and displaying a multimedia editing interface for performing editing operations on the multimedia data, wherein the multimedia data includes: multiple multimedia segments, the multiple multimedia segments respectively corresponding to multiple text segments into which the text information is divided, the multiple multimedia segments including multiple voice segments generated by reading aloud the multiple text segments respectively, and multiple video image segments respectively matching the multiple text segments; and the multimedia editing interface includes: a first editing track, a second editing track, and a third editing track, wherein the first editing track contains multiple first track segments respectively used to identify the multiple text segments; the second editing track contains multiple second track segments respectively used to identify the multiple video image segments; the third editing track contains multiple third track segments respectively used to identify the multiple voice segments; and a first track segment, a second track segment, and a third track segment that are timeline-aligned on the editing tracks respectively identify the corresponding text segment, video image segment, and voice segment.
An embodiment of the present disclosure further provides a multimedia data processing apparatus, the apparatus including: a receiving module, configured to receive text information input by a user; a generating module, configured to generate multimedia data based on the text information in response to a processing instruction for the text information; and a display module, configured to display a multimedia editing interface for performing editing operations on the multimedia data, wherein the multimedia data includes: multiple multimedia segments, the multiple multimedia segments respectively corresponding to multiple text segments into which the text information is divided, the multiple multimedia segments including multiple voice segments generated by reading aloud the multiple text segments respectively, and multiple video image segments respectively matching the multiple text segments; and the multimedia editing interface includes: a first editing track, a second editing track, and a third editing track, wherein the first editing track contains multiple first track segments respectively used to identify the multiple text segments; the second editing track contains multiple second track segments respectively used to identify the multiple video image segments; the third editing track contains multiple third track segments respectively used to identify the multiple voice segments; and a first track segment, a second track segment, and a third track segment that are timeline-aligned on the editing tracks respectively identify the corresponding text segment, video image segment, and voice segment.
An embodiment of the present disclosure further provides an electronic device, the electronic device including: a processor; and a memory for storing instructions executable by the processor, wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the multimedia data processing method provided by the embodiments of the present disclosure.
An embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program, the computer program being used to perform the multimedia data processing method provided by the embodiments of the present disclosure.
An embodiment of the present disclosure further provides a computer program product; when instructions in the computer program product are executed by a processor, the multimedia data processing method provided by the embodiments of the present disclosure is implemented. The technical solutions provided by the embodiments of the present disclosure have the following advantages:
According to the multimedia data processing solution provided by the embodiments of the present disclosure, text information input by a user is received; in response to a processing instruction for the text information, multimedia data is generated based on the text information, and a multimedia editing interface for performing editing operations on the multimedia data is displayed, wherein the multimedia data contains multiple multimedia segments respectively corresponding to the multiple text segments into which the text information is divided, the multiple multimedia segments including multiple voice segments generated by reading aloud the multiple text segments respectively and multiple video image segments respectively matching the multiple text segments; the multimedia editing interface includes a first editing track, a second editing track, and a third editing track, wherein a first track segment of the first editing track, a second track segment of the second editing track, and a third track segment of the third editing track that are timeline-aligned on the editing tracks respectively identify the corresponding text segment, video image segment, and voice segment. In the embodiments of the present disclosure, the editing tracks corresponding to the multimedia data are enriched, which can meet the diversified editing needs of the multimedia data and improves the quality of the multimedia data.
Brief Description of the Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
FIG. 1 is a schematic flowchart of a multimedia data processing method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a text input interface provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of the multimedia segment composition of multimedia data provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a multimedia data processing scenario provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a multimedia editing interface provided by an embodiment of the present disclosure;
FIG. 6 is a schematic flowchart of another multimedia data processing method provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of another multimedia data processing scenario provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of another multimedia data processing scenario provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of another multimedia data processing scenario provided by an embodiment of the present disclosure;
FIG. 10 is a schematic flowchart of another multimedia data processing method provided by an embodiment of the present disclosure;
FIG. 11 is a schematic diagram of another multimedia data processing scenario provided by an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of the structure of a multimedia data processing apparatus provided by an embodiment of the present disclosure;
FIG. 13 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the protection scope of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. In addition, the method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules, or units, and are not used to limit the order of, or the interdependence between, the functions performed by these apparatuses, modules, or units.
It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information exchanged between multiple apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
In order to solve the above problems, an embodiment of the present disclosure provides a multimedia data processing method, in which the multimedia data is split into multiple editing tracks such as a text editing track, a video image editing track, and a voice editing track, and the corresponding information is edited through editing operations on the editing tracks; thus, the diversified editing needs of multimedia data can be met, and the quality of the multimedia data is improved.
The multimedia data processing method is introduced below in conjunction with specific embodiments.
FIG. 1 is a schematic flowchart of a multimedia data processing method provided by an embodiment of the present disclosure. The method may be performed by a multimedia data processing apparatus, which may be implemented by software and/or hardware and may generally be integrated into an electronic device such as a computer. As shown in FIG. 1, the method includes:
Step 101: receive text information input by a user.
In an embodiment of the present disclosure, as shown in FIG. 2, a text input interface for editing text may be provided; the input interface includes a text area and a link area. In this embodiment, the text information may be custom-edited according to the needs of video production, i.e., the text information input by the user in the text area of the text editing interface is received; alternatively, an authorized link may be pasted and the text information extracted from it, i.e., the link information input by the user in the link area is received, the link information is identified to obtain the text information on the corresponding page, and the text information is displayed in the text area for the user to edit — that is, the text information displayed in the text area can also be edited multiple times.
Since there is usually a duration limit when producing a video, correspondingly, in some possible embodiments, the word count of the text information is also limited, for example, to no more than 2,000 characters. Therefore, it may be checked whether the word count of the text in the text area exceeds the limit, and when it does, a word-count-exceeded pop-up window may be displayed to remind the user.
In an embodiment of the present disclosure, with continued reference to FIG. 2, in addition to the text area and the link area, the text input interface may further include a video generation button and a timbre selection entry control. In response to a user triggering operation on the timbre selection entry control, a candidate timbre menu may be displayed, wherein the candidate timbre menu includes one or more candidate timbres (the multiple candidate timbres may include timbre types such as "uncle", "boy", "girl", and "loli") and audition controls corresponding to the candidate timbres; when the user triggers an audition control, part or all of the text information input by the user is played in the corresponding candidate timbre.
In this embodiment, the user can select a timbre from the diversified candidate timbres while inputting the text information. Thus, the first target timbre is determined according to the user's selection operation in the candidate timbre menu, and further, multiple voice segments generated by reading aloud, based on the first target timbre, the multiple text segments into which the text information is divided are obtained; at this point, the timbre of each voice segment is the first target timbre, which improves the efficiency of selecting the first target timbre while satisfying the personalized selection of the first target timbre.
When the text information is divided, it may be segmented into sentences according to the reading habits of the first target timbre, and the text segment contained in each sentence is determined; alternatively, the text information may be divided into multiple text segments according to its semantic information, and the text information is converted, with the text segment as the conversion granularity, into multiple voice segments corresponding to the multiple text segments.
Step 102: in response to a processing instruction for the text information, generate multimedia data based on the text information, and display a multimedia editing interface for performing editing operations on the multimedia data, wherein:
the multimedia data includes: multiple multimedia segments, the multiple multimedia segments respectively corresponding to the multiple text segments into which the text information is divided, the multiple multimedia segments including multiple voice segments generated by reading aloud the multiple text segments respectively, and multiple video image segments respectively matching the multiple text segments;
the multimedia editing interface includes: a first editing track, a second editing track, and a third editing track, wherein the first editing track contains multiple first track segments respectively used to identify the multiple text segments; the second editing track contains multiple second track segments respectively used to identify the multiple video image segments; the third editing track contains multiple third track segments respectively used to identify the multiple voice segments; and a first track segment, a second track segment, and a third track segment that are timeline-aligned on the editing tracks respectively identify the corresponding text segment, video image segment, and voice segment.
In an embodiment of the present disclosure, an entry for editing the text information is provided, and the processing instruction for the text information is obtained through this entry. For example, the entry may be a "Generate video" control displayed in the text editing interface; when it is detected that the user triggers the "Generate video" control in the text editing interface, the processing instruction for the text information is obtained. Of course, in other possible embodiments, the entry for editing the text information may also be a gesture input entry, a voice information input entry, or the like.
In an embodiment of the present disclosure, in response to the processing instruction for the text information, multimedia data is generated based on the text information, wherein the multimedia data includes multiple multimedia segments respectively corresponding to the multiple text segments into which the text information is divided, the multiple multimedia segments including multiple voice segments generated by reading aloud the multiple text segments respectively, and multiple video image segments respectively matching the multiple text segments.
That is to say, as shown in FIG. 3, in this embodiment, the generated multimedia data is composed of multiple segment-granular multimedia segments, and each multimedia segment contains at least three information types: a text segment, a voice segment (the initial timbre of the voice segment may be the first target timbre selected in the text editing interface in the above embodiment), and a video image segment (the video image segment may contain a video stream composed of consecutive pictures, wherein the consecutive pictures may correspond to pictures in a video stream matched in a preset video library, or may be one or more pictures matched in a preset image material library).
For example, as shown in FIG. 4, when the text information input by the user is "There are many types of women's clothing in today's society, such as the common Hanfu, trendy clothes, sportswear, etc.", four multimedia segments A, B, C, and D are generated after processing the text information, wherein multimedia segment A contains text segment A1 "There are many types of women's clothing in today's society", voice segment A2 corresponding to text segment A1, and video image segment A3 matching text segment A1; multimedia segment B contains text segment B1 "such as the common Hanfu", voice segment B2 corresponding to text segment B1, and video image segment B3 matching text segment B1; multimedia segment C contains text segment C1 "trendy clothes", voice segment C2 corresponding to text segment C1, and video image segment C3 matching text segment C1; multimedia segment D contains text segment D1 "sportswear, etc.", voice segment D2 corresponding to text segment D1, and video image segment D3 matching text segment D1.
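The segment composition described above — each multimedia segment bundling a timeline-aligned text segment, voice segment, and video image segment — could be modeled as in the following sketch (hypothetical only; the field names, the segment identifiers, and the shared-duration assumption are illustrative and not taken from the disclosure):

```python
from dataclasses import dataclass

@dataclass
class MultimediaSegment:
    """One multimedia segment: timeline-aligned text, voice, and video pieces."""
    text: str        # text segment identified on the first editing track
    voice: str       # voice segment identified on the third editing track
    video: str       # video image segment identified on the second editing track
    start: float     # start position on the shared timeline, in seconds
    duration: float  # shared duration keeping the three track segments aligned

def build_segments(pieces):
    """Lay segments A, B, C, ... end to end on the shared timeline."""
    t, out = 0.0, []
    for text, voice, video, duration in pieces:
        out.append(MultimediaSegment(text, voice, video, t, duration))
        t += duration
    return out

clips = build_segments([
    ("There are many types of women's clothing in today's society", "A2", "A3", 3.0),
    ("such as the common Hanfu", "B2", "B3", 2.0),
])
```

Because all three pieces of a segment share one `start` and one `duration`, the first, second, and third track segments of that multimedia segment render timeline-aligned.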
As mentioned above, in this embodiment the multimedia data contains at least three information types. In order to meet the diversified editing needs for the multimedia data, in an embodiment of the present disclosure, a multimedia editing interface for performing editing operations on the multimedia data is displayed, wherein the multimedia editing interface includes: a first editing track, a second editing track, and a third editing track. The first editing track contains multiple first track segments respectively used to identify the multiple text segments; the second editing track contains multiple second track segments respectively used to identify the multiple video image segments; the third editing track contains multiple third track segments respectively used to identify the multiple voice segments. In order to intuitively show that each multimedia segment corresponds to multiple types of information segments, the first track segment, the second track segment, and the third track segment are displayed timeline-aligned on the editing tracks and respectively identify the corresponding text segment, video image segment, and voice segment.
In the embodiments of the present disclosure, after the multimedia data is split into multiple multimedia segments, each multimedia segment is further split into editing tracks corresponding to its multiple information types. Thus, the user can edit not only a single multimedia segment but also a single information segment under a certain editing track of a single multimedia segment, which satisfies the user's diversified editing requests and ensures the quality of the generated multimedia data.
It should be noted that the display manner of the multimedia editing interface differs in different application scenarios. As a possible implementation, as shown in FIG. 5, the multimedia editing interface may include a video playback area, an editing area, and an editing track display area, wherein the editing track display area displays, in a timeline-aligned manner, the first track segments of the first editing track, the second track segments of the second editing track, and the third track segments of the third editing track.
The editing area is used to display the editing function controls corresponding to the currently selected information segment (the specific editing function controls may be set according to the needs of the actual scenario). The video playback area is used to display the image and text information of the multimedia data at the current playback time point. A reference line corresponding to the current playback time point may be displayed in the editing track display area in a direction perpendicular to the time axis; this reference line indicates the current playback position of the multimedia data and can also be dragged, with the video playback area synchronously displaying the image and text information of the multimedia segment at the real-time position of the reference line, which allows the user to view the multimedia data frame by frame by moving the reference line back and forth. The video playback area may also provide a video playback control; when the video playback control is triggered, the current multimedia data is played, so the user can intuitively see the playback effect of the current multimedia data.
Continuing with the scenario shown in FIG. 4, in FIG. 5, the editing track display area displays, in a timeline-aligned manner, the four multimedia segments A, B, C, and D and their corresponding segments A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3. If text segment A1 is selected, an editing interface corresponding to text segment A1 is displayed in the editing area, which contains the editable A1 as well as editing controls such as font and font size. At this time, the video playback area displays the multimedia data at the position corresponding to the reference line.
It should be emphasized that, in addition to the above three editing tracks, the multimedia editing interface in this embodiment may also contain other editing tracks, the number of which is not limited. Each other editing track may be used to display information segments of another dimension corresponding to the multimedia data. For example, the other editing tracks may include an editing track for editing the background sound of the multimedia data, i.e., the multimedia editing interface may further include a fourth editing track for identifying background audio data. In that case, in response to a trigger operation on the fourth editing track, the current background sound used by the fourth editing track is displayed in a preset background sound editing area (such as the editing area mentioned in the above embodiment), and replaceable candidate background sounds are displayed; the replaceable candidate background sounds may be displayed in any style, such as labels, in the background sound editing area. Based on the target background sound generated by the user modifying the current background sound according to the candidate background sounds in the background sound editing area, the target background sound is updated and identified on the fourth editing track.
In summary, the multimedia data processing method of the embodiments of the present disclosure splits the multimedia data into multiple multimedia segments, each multimedia segment having an editing track corresponding to each of the information types it contains; on the editing tracks, the information segment of a certain information type contained in a multimedia segment can be edited and modified. This enriches the editing tracks corresponding to the multimedia data, can meet the diversified editing needs of the multimedia data, and improves the quality of the multimedia data.
The following describes, with reference to specific embodiments, how to edit information segments of different information types corresponding to the multimedia information segments of the multimedia data.
In an embodiment of the present disclosure, the text segment corresponding to a multimedia information segment can be edited and modified separately.
In this embodiment, as shown in FIG. 6, the step of separately editing and modifying the text segment corresponding to the multimedia information segment includes:
Step 601: in response to a first target track segment selected by the user on the first editing track, display the text segment currently identified on the first target track segment in a text editing area.
In an embodiment of the present disclosure, in response to the user selecting a first target track segment on the first editing track — there may be one or more first target track segments — the text segment currently identified on the first target track segment is displayed in the text editing area, which may be located in the editing area mentioned in the above embodiment. In addition to the editable text segment currently identified on the first target track segment, the text editing area may also contain other function editing controls for the text segment, such as a font editing control and a font size editing control.
Step 602: based on the target text segment generated by the user modifying the currently displayed text segment in the text editing area, update and identify the target text segment on the first target track segment.
In this embodiment, based on the target text segment generated by the user modifying the currently displayed text segment in the text editing area, the target text segment is updated and identified on the first target track segment.
For example, as shown in FIG. 7, when the text segment currently identified on the first target track segment selected by the user is "There are many types of women's clothing in today's society", if the user modifies the currently displayed text segment in the text editing area to "There are many types and a great number of women's clothing in today's society", the text segment on the first target track can be updated to "There are many types and a great number of women's clothing in today's society". This achieves targeted modification of a single text segment and meets the need to modify a single text segment of the multimedia data.
In an embodiment of the present disclosure, in order to further improve the quality of the multimedia data and ensure that the images match the text, the image in the video image segment may also be updated synchronously according to the modification of the text segment. In this embodiment, in response to a text update operation on the target text segment on the first target track segment, a second target track segment corresponding to the first target track segment is determined on the second editing track, a target video image segment matching the target text segment is obtained, and the target video image segment is updated and identified on the second target track segment.
The target text may be semantically matched against pictures in a preset picture material library to determine the corresponding target video images, from which the target video segment is then generated; alternatively, a video segment matching the target text segment may be determined directly in a preset video segment material library as the target video image segment, which is not limited here.
In an embodiment of the present disclosure, in order to keep the speech synchronized with the text, the voice segment may also be modified synchronously according to the modification of the text segment.
That is, in this embodiment, in response to a text update operation on the target text segment on the first target track segment, a third target track segment corresponding to the first target track segment is determined on the third editing track; the third track segment contains the voice segment corresponding to the text segment in the first target track segment. A target voice segment corresponding to the target text segment is obtained, for example by reading the target text segment aloud, and the target voice segment is updated and identified on the third target track segment, thereby achieving synchronized modification of the speech and the text.
That is, with continued reference to FIG. 7, when the text segment currently identified on the first target track segment selected by the user is "There are many types of women's clothing in today's society", if the user modifies it in the text editing area to "There are many types and a great number of women's clothing in today's society", the text segment on the first target track can be updated accordingly, the third target track segment corresponding to the first target track segment is determined on the third editing track, and the voice segment on the third target track segment is updated from "There are many types of women's clothing in today's society" to "There are many types and a great number of women's clothing in today's society".
As mentioned above, in the process of editing and modifying a text segment, the duration of the modified text segment on the time axis differs from that of the text segment before modification. Therefore, in different application scenarios, this change in duration can be handled with different display treatments for the editing tracks in the editing track display area.
In an embodiment of the present disclosure, if, according to scenario needs, the video image segment corresponding to a multimedia segment is defined as a main information segment whose duration cannot be changed, then, to keep the duration of the video image segment unchanged, when it is detected that the first update duration corresponding to the target text segment on the first editing track is inconsistent with the duration corresponding to the text segment before modification, the second editing track is kept unchanged — i.e., the duration of the corresponding video image segment is guaranteed to remain unchanged — and a first update track segment corresponding to the first update duration is displayed in a preset first candidate area.
The target text segment is identified on the first update track segment. The first candidate area may be located in the area above the text segment before modification, or in another area. Thus, even if the first update duration of the target text segment on the first editing track is inconsistent with the duration of the text segment before modification, the target text segment is not merely displayed in a "raised track" form; the duration of the video image segment on the second editing track is not correspondingly modified, and visually the durations of the text information segments of the other multimedia segments are not affected either.
When it is detected that the third update duration corresponding to the target voice segment on the third editing track is inconsistent with the duration corresponding to the voice segment before modification, the second editing track is kept unchanged, and a third update track segment corresponding to the third update duration is displayed in a preset second candidate area, wherein the target voice segment is identified on the third update track segment. The second candidate area may be located in the area below the voice segment before modification, or in another area. Thus, even if the third update duration of the target voice segment on the third editing track is inconsistent with the duration of the voice segment before modification, the target voice segment is not merely displayed in a "lowered track" form; the duration of the video image segment on the second editing track is not correspondingly modified, and visually the durations of the voice segments of the other multimedia information segments are not affected either.
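The display treatment described above — keeping the second editing track fixed and moving a duration-changed update into a candidate area — can be sketched as a small placement decision (illustrative only; the function name `place_updated_segment` and the lane names are invented for the example and do not appear in the disclosure):

```python
def place_updated_segment(old_duration: float, new_duration: float,
                          track: str) -> str:
    """Decide where a duration-changed update is displayed when the video
    image track (second editing track) must keep its duration unchanged.

    Updated text goes to a candidate area above the original track segment,
    updated voice to a candidate area below; if the duration is unchanged,
    the segment simply updates in place on its own track.
    """
    if abs(new_duration - old_duration) < 1e-9:
        return "in_place"
    return "candidate_above" if track == "text" else "candidate_below"

spot = place_updated_segment(3.0, 4.2, "voice")
```

With this policy, neither the second editing track nor the other multimedia segments' track segments are resized by the edit.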
For example, as shown in FIG. 8, taking the scenario shown in FIG. 7 as an example, since the modified text segment "There are many types and a great number of women's clothing in today's society" is obviously longer in duration than the text segment "There are many types of women's clothing in today's society" before modification, the modified text segment can be displayed above the text segment before modification, while the duration of the corresponding video image segment is kept unchanged.
In this embodiment, since the modified voice segment "There are many types and a great number of women's clothing in today's society" is obviously longer in duration than the voice segment "There are many types of women's clothing in today's society" before modification, the modified voice segment can be displayed below the voice segment before modification, while the duration of the corresponding video image segment is kept unchanged, which meets the requirement of the corresponding scenario that the duration of the video image segment cannot be changed.
In an embodiment of the present disclosure, if, according to scenario needs, the video image segment corresponding to a multimedia segment is to be synchronized with the other information segments on the timeline, then, to ensure the synchronization of the video image segment on the timeline, when it is detected that the first update duration corresponding to the target text segment on the first editing track is inconsistent with the duration corresponding to the text segment before modification, the length of the first target track segment is adjusted according to the first update duration, i.e., the length of the first target track segment is scaled at its original display position; similarly, when it is detected that the third update duration corresponding to the target voice segment on the third editing track is inconsistent with the duration corresponding to the voice segment before modification, the length of the third target track segment is adjusted according to the third update duration.
Further, the length of the second target track segment corresponding to the first target track segment and the third target track segment on the second editing track is adjusted accordingly, so that the time axes of the adjusted first target track segment, the adjusted second target track segment, and the adjusted third target track segment are aligned, thereby achieving the alignment on the timeline of all the information segments contained in the multimedia segment.
For example, as shown in FIG. 9, taking the scenario shown in FIG. 7 as an example, since the modified text segment "There are many types and a great number of women's clothing in today's society" is obviously longer in duration than the text segment before modification, the length of the first target track segment displaying the text segment before modification can be lengthened, and the modified text segment is displayed in the adjusted first target track segment.
In this embodiment, since the modified voice segment is likewise obviously longer in duration than the voice segment before modification, the length of the third target track segment displaying the voice segment before modification can be lengthened, and the modified voice segment is displayed in the adjusted third target track segment.
In order to synchronize the video image segment with the other information segments, in this embodiment the length of the second target track segment corresponding to the first target track segment and the third target track segment on the second editing track is adjusted accordingly, so that the time axes of the adjusted first target track segment, the adjusted second target track segment, and the adjusted third target track segment are aligned.
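The alignment behavior described above — propagating a changed segment duration so that all track segments stay timeline-aligned — can be sketched as follows (illustrative only; the list-based timeline model and the function name are assumptions for the example, not the disclosed implementation):

```python
def apply_duration_change(durations, index, new_duration):
    """Propagate one changed multimedia-segment duration along the timeline.

    Returns the updated duration list and the recomputed start offsets, so
    that the first, second, and third track segments of every multimedia
    segment remain timeline-aligned after the edit.
    """
    updated = list(durations)
    updated[index] = new_duration  # e.g. the lengthened text/voice duration
    starts, t = [], 0.0
    for d in updated:              # recompute where each segment begins
        starts.append(t)
        t += d
    return updated, starts

# Segment B's duration grows from 3.0 s to 5.0 s after a text edit.
new_durations, new_starts = apply_duration_change([2.0, 3.0, 4.0], 1, 5.0)
```

Because every track of a multimedia segment shares the segment's duration and start offset, resizing segment B here shifts segment C's start on all three editing tracks at once.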
在本公开的一个实施例中,可对多媒体信息片段对应的语音片段进行单独编辑修改。
在本实施例中,如图10所示,对多媒体信息片段对应的语音片段进行单独编辑修改的步骤包括:
步骤1001,响应于用户在第三编辑轨道上选择的第三目标轨道片段,其中,第三目标轨道片段对应标识与第一目标轨道片段显示的文本片段对应的语音片段。
在本公开的一个实施例中,响应于用户在第三编辑轨道上选择的第三目标轨道片段,其中,第三目标轨道片段可以为一个也可以为多个,第三目标轨道片段对应标识与第一目标轨道片段显示的文本片段对应的语音片段,即在本实施例中,可针对语音片段进行单独的编辑。
步骤1002,在预设的音频编辑区域显示第三目标轨道片段上语音片段使用的当前音色,以及显示可替换的候选音色。
其中，本实施例中的预设音频编辑区域可以位于上述实施例提到的编辑区域。在预设的音频编辑区域显示第三目标轨道片段上语音片段使用的当前音色，以及显示可替换的候选音色。其中，如图11所示，候选音色可以以标签形式等任意样式显示，比如，可显示候选音色的标签“大叔”、“女孩”、“老人”等，用户可通过触发对应的标签实现对候选音色的选择。
步骤1003,基于用户在音频编辑区域根据候选音色对当前音色进行修改生成的第二目标音色,在第三目标轨道片段上更新标识目标语音片段,其中,目标语音片段为使用第二目标音色朗读第一目标轨道片段标识的文本片段生成的语音片段。
在本实施例中,用户可触发候选音色以修改当前音色,将当前音色修改为触发的候选音色即第二目标音色,从而,将第三轨道片段上的语音片段的音色进行了修改,满足了用户对某段语音片段的音色进行修改的需求,比如,用户可以将第三轨道片段对应的多个语音片段修改为不同的音色,从而,实现了一种趣味性的语音播放效果。
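上述对语音片段音色的单独编辑流程可以用如下Python草图示意。其中 tts 为假设的文本转语音接口，函数名与字段名均非本公开的实际实现：

```python
def apply_timbre(voice_seg: dict, candidates: list, chosen: str, tts) -> dict:
    """音色单独编辑：chosen 为用户在音频编辑区域触发的候选音色标签
    （如“大叔”“女孩”“老人”），将当前音色替换为第二目标音色，
    并使用该音色重新朗读文本生成目标语音片段。"""
    if chosen not in candidates:
        raise ValueError(f"未知候选音色: {chosen}")
    voice_seg["timbre"] = chosen                          # 第二目标音色
    voice_seg["audio"] = tts(voice_seg["text"], chosen)   # 重新朗读生成语音
    return voice_seg
```

若用户对第三编辑轨道上的多个轨道片段分别选择不同的候选音色，即可得到正文所述的趣味性语音播放效果。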
综上,本公开实施例的多媒体数据处理方法,可灵活对多媒体片段对应的文本片段、语音片段等进行单独编辑修改,进一步满足了多媒体数据的多样化编辑需求,提升了多媒体数据的质量。
为了实现上述实施例,本公开还提出了一种多媒体数据处理装置。
图12为本公开实施例提供的一种多媒体数据处理装置的结构示意图，该装置可由软件和/或硬件实现，一般可集成在电子设备中以进行多媒体数据处理。如图12所示，该装置包括：接收模块1210、生成模块1220和展示模块1230，其中，
接收模块1210,用于接收用户输入的文本信息;
生成模块1220,用于响应于对文本信息的处理指令,基于文本信息生成多媒体数据;
展示模块1230,用于展示用于对多媒体数据进行编辑操作的多媒体编辑界面,其中,
多媒体数据包括:多个多媒体片段,多个多媒体片段分别对应于文本信息划分成的多个文本片段,多个多媒体片段包括与多个文本片段分别对应的朗读生成的多个语音片段,以及与多个文本片段分别匹配的多个视频图像片段;
多媒体编辑界面包括:第一编辑轨道,第二编辑轨道,以及第三编辑轨道,其中,第一编辑轨道包含多个第一轨道片段,多个第一轨道片段分别用于标识多个文本片段,第二编辑轨道包含多个第二轨道片段,多个第二轨道片段分别用于标识多个视频图像片段;第三编辑轨道包含多个第三轨道片段,多个第三轨道片段分别用于标识多个语音片段,其中,编辑轨道上时间线对齐的第一轨道片段、第二轨道片段和第三轨道片段分别标识对应的文本片段、视频图像片段以及语音片段。
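上述三条编辑轨道及其轨道片段在时间线上的对齐关系，可以用如下Python草图示意。其中的字段名与构造方式均为假设，仅说明同一文本片段在三条轨道上各对应一个起止时间一致的轨道片段：

```python
def build_editing_tracks(text_pieces, durations):
    """根据划分出的文本片段构建三条编辑轨道：
    第一编辑轨道标识文本片段，第二编辑轨道标识匹配的视频图像片段，
    第三编辑轨道标识朗读生成的语音片段，三者在时间线上对齐。"""
    first, second, third, start = [], [], [], 0.0
    for text, dur in zip(text_pieces, durations):
        seg = {"start": start, "duration": dur}
        first.append({**seg, "text": text})                   # 文本片段
        second.append({**seg, "video": f"clip:{text[:4]}"})   # 视频图像片段（假设标识）
        third.append({**seg, "voice": f"tts:{text[:4]}"})     # 语音片段（假设标识）
        start += dur
    return {"first": first, "second": second, "third": third}
```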
可选的,接收模块具体用于:
接收用户在文本区域输入的文本信息;和/或,
接收用户在链接区域输入的链接信息,识别链接信息获取对应页面上的文本信息,并显示在文本区域供用户进行编辑。
可选的,还包括:
第一显示模块,用于显示音色选择入口控件;
第二显示模块,用于响应于用户对音色选择入口控件的触发操作,显示候选音色菜单,其中,候选音色菜单中包括候选音色,以及与候选音色对应的试听控件;
音色确定模块，用于根据用户对候选音色菜单中的选择操作确定第一目标音色；
语音片段获取模块,用于获取基于第一目标音色朗读对文本信息划分的多个文本片段生成的多个语音片段。
可选的,还包括:
第三显示模块,用于响应于用户在第一编辑轨道上选择的第一目标轨道片段,并将第一目标轨道片段上当前标识的文本片段显示在文本编辑区域;
文本片段编辑模块,用于基于用户在文本编辑区域对当前显示的文本片段的修改生成的目标文本片段,在第一目标轨道片段上更新标识目标文本片段。
可选的,还包括:
轨道片段确定模块,用于响应于第一目标轨道片段上对目标文本片段的文本更新操作,在第三编辑轨道上确定与第一目标轨道片段对应的第三目标轨道片段;
语音片段获取模块,用于获取与目标文本片段对应的目标语音片段,并在第三目标轨道片段上更新标识目标语音片段。
可选的,还包括:第一时长显示处理模块,用于:
在检测获知目标文本片段在第一编辑轨道上对应的第一更新时间长度与修改前文本片段对应的时间长度不一致的情况下,保持第二编辑轨道不变,并在预设的第一候选区域显示与第一更新时间长度对应的第一更新轨道片段,其中,第一更新轨道片段上标识目标文本片段;
在检测获知目标语音片段在第三编辑轨道上对应的第三更新时间长度与修改前的语音片段对应的时间长度不一致的情况下,保持第二编辑轨道不变,并在预设的第二候选区域显示与第三更新时间长度对应的第三更新轨道片段,其中,第三更新轨道片段上标识目标语音片段。
可选的,还包括:第二时长显示处理模块,用于:
在检测获知目标文本片段在第一编辑轨道上对应的第一更新时间长度与修改前的文本片段对应的时间长度不一致的情况下,根据第一更新时间调整第一目标轨道片段的长度;
在检测获知目标语音片段在第三编辑轨道上对应的第三更新时间长度与修改前的语音片段对应的时间长度不一致的情况下，根据第三更新时间调整第三目标轨道片段的长度；
对应调整第二编辑轨道上与第一目标轨道片段和第三目标轨道片段对应的第二目标轨道片段的长度,使得调整后的第一目标轨道片段、调整后的第二目标轨道片段和调整后的第三目标轨道片段的时间轴对齐。
可选的,还包括:视频图像更新模块,用于:
响应于第一目标轨道片段上对目标文本片段的文本更新操作,在第二编辑轨道上确定与第一目标轨道片段对应的第二目标轨道片段;
获取与目标文本片段匹配的目标视频图像片段,并在第二目标轨道片段上更新标识目标视频图像片段。
可选的,还包括:音色更新模块,用于:
响应于用户在第三编辑轨道上选择的第三目标轨道片段,其中,第三目标轨道片段对应标识与第一目标轨道片段显示的文本片段对应的语音片段;
在预设的音频编辑区域显示第三目标轨道片段上语音片段使用的当前音色,以及显示可替换的候选音色;
基于用户在音频编辑区域根据候选音色对当前音色进行修改生成的第二目标音色,在第三目标轨道片段上更新标识目标语音片段,其中,目标语音片段为使用第二目标音色朗读第一目标轨道片段标识的文本片段生成的语音片段。
可选的,多媒体编辑界面还包括:
用于标识背景音频数据的第四编辑轨道;
背景音显示模块,用于响应于对第四编辑轨道的触发操作,在预设的背景音编辑区域显示第四编辑轨道使用的当前背景音,以及显示可替换的候选背景音;
背景音更新处理模块,用于基于用户在背景音编辑区域根据候选背景音对当前背景音进行修改生成的目标背景音,在第四编辑轨道上更新标识目标背景音。
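第四编辑轨道上背景音替换的处理逻辑可以用如下Python草图示意。其中的字段名与候选背景音列表均为假设：

```python
def edit_background(fourth_track: dict, candidates: list, chosen: str) -> dict:
    """背景音编辑：在背景音编辑区域显示当前背景音与可替换的候选背景音，
    并将当前背景音替换为用户选择的目标背景音，更新标识在第四编辑轨道上。"""
    if chosen not in candidates:
        raise ValueError(f"未知候选背景音: {chosen}")
    fourth_track["background"] = chosen   # 目标背景音
    return fourth_track
```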
本公开实施例所提供的多媒体数据处理装置可执行本公开任意实施例所提供的多媒体数据处理方法，具备执行方法相应的功能模块和有益效果，在此不再赘述。
为了实现上述实施例,本公开还提出一种计算机程序产品,包括计算机程序/指令,该计算机程序/指令被处理器执行时实现上述实施例中的多媒体数据处理方法。
图13为本公开实施例提供的一种电子设备的结构示意图。
下面具体参考图13,其示出了适于用来实现本公开实施例中的电子设备1300的结构示意图。本公开实施例中的电子设备1300可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字TV、台式计算机等等的固定终端。图13示出的电子设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图13所示,电子设备1300可以包括处理器(例如中央处理器、图形处理器等)1301,其可以根据存储在只读存储器(ROM)1302中的程序或者从存储器1308加载到随机访问存储器(RAM)1303中的程序而执行各种适当的动作和处理。在RAM 1303中,还存储有电子设备1300操作所需的各种程序和数据。处理器1301、ROM 1302以及RAM 1303通过总线1304彼此相连。输入/输出(I/O)接口1305也连接至总线1304。
通常,以下装置可以连接至I/O接口1305:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置1306;包括例如液晶显示器(LCD)、扬声器、振动器等的输出装置1307;包括例如磁带、硬盘等的存储器1308;以及通信装置1309。通信装置1309可以允许电子设备1300与其他设备进行无线或有线通信以交换数据。虽然图13示出了具有各种装置的电子设备1300,但是应理解的是,并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。
特别地，根据本公开的实施例，上文参考流程图描述的过程可以被实现为计算机软件程序。例如，本公开的实施例包括一种计算机程序产品，其包括承载在非暂态计算机可读介质上的计算机程序，该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中，该计算机程序可以通过通信装置1309从网络上被下载和安装，或者从存储器1308被安装，或者从ROM 1302被安装。在该计算机程序被处理器1301执行时，执行本公开实施例的多媒体数据处理方法中限定的上述功能。
需要说明的是,本公开上述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的***、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行***、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行***、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、RF(射频)等等,或者上述的任意合适的组合。
在一些实施方式中,客户端、服务器可以利用诸如HTTP(HyperText Transfer Protocol,超文本传输协议)之类的任何当前已知或未来研发的网络协议进行通信,并且可以与任意形式或介质的数字数据通信(例如,通信网络)互连。通信网络的示例包括局域网(“LAN”),广域网(“WAN”),网际网(例如,互联网)以及端对端网络(例如,ad hoc端对端网络),以及任何当前已知或未来研发的网络。
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。
上述计算机可读介质承载有一个或者多个程序，当上述一个或者多个程序被该电子设备执行时，使得该电子设备：
接收用户输入的文本信息，响应于对文本信息的处理指令，基于文本信息生成多媒体数据，并展示用于对多媒体数据进行编辑操作的多媒体编辑界面。其中，多媒体数据包括多个多媒体片段，多个多媒体片段分别对应于文本信息划分成的多个文本片段，多个多媒体片段包括与多个文本片段分别对应的朗读生成的多个语音片段，以及与多个文本片段分别匹配的多个视频图像片段。多媒体编辑界面包括：第一编辑轨道、第二编辑轨道以及第三编辑轨道，其中，编辑轨道上时间线对齐的第一编辑轨道对应的第一轨道片段、第二编辑轨道对应的第二轨道片段和第三编辑轨道对应的第三轨道片段分别标识对应的文本片段、视频图像片段以及语音片段。在本公开的实施例中，丰富了多媒体数据对应的编辑轨道，可以满足多媒体数据的多样化编辑需求，提升了多媒体数据的质量。
电子设备可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的计算机程序代码,上述程序设计语言包括但不限于面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括局域网(LAN)或广域网(WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
附图中的流程图和框图,图示了按照本公开各种实施例的***、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合, 可以用执行规定的功能或操作的专用的基于硬件的***来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本公开实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,单元的名称在某种情况下并不构成对该单元本身的限定。
本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、片上***(SOC)、复杂可编程逻辑设备(CPLD)等等。
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行***、装置或设备使用或与指令执行***、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体***、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。
以上描述仅为本公开的较佳实施例以及对所运用技术原理的说明。本领域技术人员应当理解,本公开中所涉及的公开范围,并不限于上述技术特征的特定组合而成的技术方案,同时也应涵盖在不脱离上述公开构思的情况下,由上述技术特征或其等同特征进行任意组合而形成的其它技术方案。例如上述特征与本公开中公开的(但不限于)具有类似功能的技术特征进行互相替换而形成的技术方案。
此外，虽然采用特定次序描绘了各操作，但是这不应当理解为要求这些操作以所示出的特定次序或以顺序次序执行来执行。在一定环境下，多任务和并行处理可能是有利的。同样地，虽然在上面论述中包含了若干具体实现细节，但是这些不应当被解释为对本公开的范围的限制。在单独的实施例的上下文中描述的某些特征还可以组合地实现在单个实施例中。相反地，在单个实施例的上下文中描述的各种特征也可以单独地或以任何合适的子组合的方式实现在多个实施例中。
尽管已经采用特定于结构特征和/或方法逻辑动作的语言描述了本主题,但是应当理解所附权利要求书中所限定的主题未必局限于上面描述的特定特征或动作。相反,上面所描述的特定特征和动作仅仅是实现权利要求书的示例形式。

Claims (13)

  1. 一种多媒体数据处理方法,包括:
    接收用户输入的文本信息;
    响应于对所述文本信息的处理指令,基于所述文本信息生成多媒体数据,并展示用于对所述多媒体数据进行编辑操作的多媒体编辑界面,其中,
    所述多媒体数据包括:多个多媒体片段,所述多个多媒体片段分别对应于所述文本信息划分成的多个文本片段,所述多个多媒体片段包括与所述多个文本片段分别对应的朗读生成的多个语音片段,以及与所述多个文本片段分别匹配的多个视频图像片段;
    所述多媒体编辑界面包括:第一编辑轨道,第二编辑轨道,以及第三编辑轨道,其中,所述第一编辑轨道包含多个第一轨道片段,所述多个第一轨道片段分别用于标识所述多个文本片段,所述第二编辑轨道包含多个第二轨道片段,所述多个第二轨道片段分别用于标识所述多个视频图像片段;所述第三编辑轨道包含多个第三轨道片段,所述多个第三轨道片段分别用于标识所述多个语音片段,其中,所述编辑轨道上时间线对齐的第一轨道片段、第二轨道片段和第三轨道片段分别标识对应的文本片段、视频图像片段以及语音片段。
  2. 根据权利要求1所述的方法,其中,所述接收用户输入的文本信息,包括:
    接收用户在文本区域输入的文本信息;和/或,
    接收用户在链接区域输入的链接信息,识别所述链接信息获取对应页面上的文本信息,并显示在所述文本区域供所述用户进行编辑。
  3. 根据权利要求1或2所述的方法,还包括:
    显示音色选择入口控件;
    响应于用户对所述音色选择入口控件的触发操作,显示候选音色菜单,其中,所述候选音色菜单中包括候选音色,以及与所述候选音色对应的试听控件;
    根据所述用户对所述候选音色菜单中的选择操作确定第一目标音色;
    获取基于所述第一目标音色朗读对所述文本信息划分的多个文本片段生成的多个语音片段。
  4. 根据权利要求1-3任一项所述的方法,还包括:
    响应于所述用户在所述第一编辑轨道上选择的第一目标轨道片段,并将所述第一目标轨道片段上当前标识的文本片段显示在文本编辑区域;
    基于所述用户在所述文本编辑区域对所述当前显示的文本片段的修改生成的目标文本片段,在所述第一目标轨道片段上更新标识所述目标文本片段。
  5. 根据权利要求4所述的方法,还包括:
    响应于所述第一目标轨道片段上对所述目标文本片段的文本更新操作,在所述第三编辑轨道上确定与所述第一目标轨道片段对应的第三目标轨道片段;
    获取与所述目标文本片段对应的目标语音片段,并在所述第三目标轨道片段上更新标识所述目标语音片段。
  6. 根据权利要求5所述的方法,还包括:
    在检测获知所述目标文本片段在所述第一编辑轨道上对应的第一更新时间长度与修改前文本片段对应的时间长度不一致的情况下,保持所述第二编辑轨道不变,并在预设的第一候选区域显示与所述第一更新时间长度对应的第一更新轨道片段,其中,所述第一更新轨道片段上标识所述目标文本片段;
    在检测获知所述目标语音片段在所述第三编辑轨道上对应的第三更新时间长度与修改前的语音片段对应的时间长度不一致的情况下,保持所述第二编辑轨道不变,并在预设的第二候选区域显示与所述第三更新时间长度对应的第三更新轨道片段,其中,所述第三更新轨道片段上标识所述目标语音片段。
  7. 根据权利要求5所述的方法,还包括:
    在检测获知所述目标文本片段在所述第一编辑轨道上对应的第一更新时间长度与修改前的文本片段对应的时间长度不一致的情况下,根据所述第一更新时间调整所述第一目标轨道片段的长度;
    在检测获知所述目标语音片段在所述第三编辑轨道上对应的第三更新时间长度与修改前的语音片段对应的时间长度不一致的情况下,根据所述第三更新时间调整所述第三目标轨道片段的长度;
    对应调整所述第二编辑轨道上与所述第一目标轨道片段和所述第三目标轨道片段对应的第二目标轨道片段的长度,使得调整后的所述第一目标轨道片段、调整后的第二目标轨道片段和调整后的第三目标轨道片段的时间轴对齐。
  8. 根据权利要求4所述的方法,还包括:
    响应于所述第一目标轨道片段上对所述目标文本片段的文本更新操作,在所述第二编辑轨道上确定与所述第一目标轨道片段对应的第二目标轨道片段;
    获取与所述目标文本片段匹配的目标视频图像片段,并在所述第二目标轨道片段上更新标识所述目标视频图像片段。
  9. 根据权利要求1-8任一项所述的方法,还包括:
    响应于所述用户在所述第三编辑轨道上选择的第三目标轨道片段,其中,所述第三目标轨道片段对应标识与所述第一目标轨道片段显示的文本片段对应的语音片段;
    在预设的音频编辑区域显示所述第三目标轨道片段上语音片段使用的当前音色,以及显示可替换的候选音色;
    基于所述用户在所述音频编辑区域根据所述候选音色对所述当前音色进行修改生成的第二目标音色,在所述第三目标轨道片段上更新标识目标语音片段,其中,所述目标语音片段为使用所述第二目标音色朗读所述第一目标轨道片段标识的文本片段生成的语音片段。
  10. 根据权利要求1-9任一所述的方法,所述多媒体编辑界面还包括:
    用于标识背景音频数据的第四编辑轨道;
    响应于对所述第四编辑轨道的触发操作,在预设的背景音编辑区域显示所述第四编辑轨道使用的当前背景音,以及显示可替换的候选背景音;
    基于所述用户在所述背景音编辑区域根据所述候选背景音对所述当前背景音进行修改生成的目标背景音,在所述第四编辑轨道上更新标识所述目标背景音。
  11. 一种多媒体数据处理装置,包括:
    接收模块,用于接收用户输入的文本信息;
    生成模块,用于响应于对所述文本信息的处理指令,基于所述文本信息生成多媒体数据;
    展示模块,用于展示用于对所述多媒体数据进行编辑操作的多媒体编辑界面,其中,
    所述多媒体数据包括:多个多媒体片段,所述多个多媒体片段分别对应于所述文本信息划分成的多个文本片段,所述多个多媒体片段包括与所述多个文本片段分别对应的朗读生成的多个语音片段,以及与所述多个文本片段分别匹配的多个视频图像片段;
    所述多媒体编辑界面包括:第一编辑轨道,第二编辑轨道,以及第三编辑轨道,其中,所述第一编辑轨道包含多个第一轨道片段,所述多个第一轨道片段分别用于标识所述多个文本片段,所述第二编辑轨道包含多个第二轨道片段,所述多个第二轨道片段分别用于标识所述多个视频图像片段;所述第三编辑轨道包含多个第三轨道片段,所述多个第三轨道片段分别用于标识所述多个语音片段,其中,所述编辑轨道上时间线对齐的第一轨道片段、第二轨道片段和第三轨道片段分别标识对应的文本片段、视频图像片段以及语音片段。
  12. 一种电子设备,包括:
    处理器;
    存储器,被配置为存储可执行指令,其中,
    所述处理器,被配置为从所述存储器中读取所述可执行指令,并执行所述可执行指令以实现上述权利要求1-10中任一所述的多媒体数据处理方法。
  13. 一种计算机可读存储介质,存储有计算机程序,其中,所述计算机程序用于执行上述权利要求1-10中任一所述的多媒体数据处理方法。
PCT/CN2023/122068 2022-10-21 2023-09-27 多媒体数据处理方法、装置、设备及介质 WO2024082948A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP23825187.0A EP4383698A1 (en) 2022-10-21 2023-09-27 Multimedia data processing method, apparatus, device and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211295639.8 2022-10-21
CN202211295639.8A CN117956100A (zh) 2022-10-21 2022-10-21 多媒体数据处理方法、装置、设备及介质

Publications (1)

Publication Number Publication Date
WO2024082948A1

Family

ID=90736894

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/122068 WO2024082948A1 (zh) 2022-10-21 2023-09-27 多媒体数据处理方法、装置、设备及介质

Country Status (3)

Country Link
EP (1) EP4383698A1 (zh)
CN (1) CN117956100A (zh)
WO (1) WO2024082948A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010047266A1 (en) * 1998-01-16 2001-11-29 Peter Fasciano Apparatus and method using speech recognition and scripts to capture author and playback synchronized audio and video
CN107517323A (zh) * 2017-09-08 2017-12-26 咪咕数字传媒有限公司 一种信息分享方法、装置及存储介质
CN111930289A (zh) * 2020-09-09 2020-11-13 智者四海(北京)技术有限公司 一种处理图片和文本的方法和***
CN112015927A (zh) * 2020-09-29 2020-12-01 北京百度网讯科技有限公司 多媒体文件编辑方法、装置、电子设备和存储介质
CN112287128A (zh) * 2020-10-23 2021-01-29 北京百度网讯科技有限公司 多媒体文件编辑方法、装置、电子设备和存储介质

Also Published As

Publication number Publication date
CN117956100A (zh) 2024-04-30
EP4383698A1 (en) 2024-06-12


Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2023825187

Country of ref document: EP

Effective date: 20231227