CN110895575B - Audio processing method and device - Google Patents


Info

Publication number: CN110895575B
Authority: CN (China)
Prior art keywords: information, audio, text, fragment, search
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN201810974926.9A
Other languages: Chinese (zh)
Other versions: CN110895575A (en)
Inventor: 高欣羽
Current assignee: Alibaba Group Holding Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd; priority claimed from CN201810974926.9A; published as CN110895575A, granted and published as CN110895575B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an audio processing method and device, comprising the following steps: converting first audio information to be processed into text information; searching the converted text information with search information to obtain text fragments containing the search information; and processing the audio fragments corresponding to the obtained text fragments containing the search information to obtain second audio information. By presenting audio content intuitively as text, searching and locating content according to the search information, and cutting and splicing the audio, the method and device make editing audio information as convenient, automatic, and efficient as editing text, greatly reducing the overall workload of audio processing.

Description

Audio processing method and device
Technical Field
The present disclosure relates to, but is not limited to, speech recognition technology, and more particularly to an audio processing method and apparatus.
Background
When a user edits and splices several pieces of audio, the user usually has to listen to each piece manually, mark the content segments to be spliced, and finally splice the marked segments. The workload of this approach is obviously large. Moreover, manually recorded start and stop time points of a content segment are often inaccurate; for example, if the required stop time point is 00:01:53 but the user records 00:02:01, extraneous sound is left in the audio segment to be spliced. Finally, the user ends up with a collection of audio files plus notes on the time points of the content segments, and the content of the audio files before and after splicing cannot be visually 'seen', which greatly increases the difficulty of auditing, review, modification, and re-editing, and forces the user to attach a large amount of textual description when preserving and transferring the audio.
In summary, the audio processing methods in the related art are time-consuming, inefficient, labor-intensive, error-prone in marking time points, and low in the degree of intelligent automation.
Disclosure of Invention
The application provides an audio processing method and device, which can accurately and efficiently process audio information.
The application provides an audio processing method, which comprises the following steps:
converting first audio information to be processed into text information;
searching the converted text information by utilizing the search information to obtain text fragments containing the search information;
and processing the audio clips corresponding to the text clips containing the search information to obtain second audio information.
Optionally, searching the converted text information with the search information to obtain text fragments containing the search information includes:
searching the text information according to the search information to obtain at least one text fragment containing the search information;
and respectively determining the start and stop time point information of the audio fragment corresponding to each text fragment according to the start and stop positions of the searched at least one text fragment, so that the start and stop time point information of each text fragment containing the search information is marked.
Optionally, processing the audio fragments corresponding to the text fragments containing the search information includes:
splicing the obtained text fragments containing the search information into one piece of text information;
cutting each audio fragment out of the first audio information according to the start and stop time points of the audio fragment corresponding to each text fragment in the spliced text information;
and splicing the cut audio fragments to obtain the second audio information.
Optionally, the method further comprises:
identifying an audio source of an audio fragment corresponding to the text fragment;
audio source information is added to the audio clip.
Optionally, identifying an audio source of an audio segment corresponding to the text segment, and adding audio source information to the audio segment includes:
judging a speaker corresponding to the audio fragment through voiceprint recognition;
and adding information of a speaker corresponding to the voiceprint in the text segment.
Optionally, processing the audio fragments corresponding to the text fragments containing the search information includes:
converting the text information added with the speaker information into corresponding audio fragments through a voice synthesis technology; and splicing the converted audio fragments into the second audio information.
Optionally, the method further comprises:
generating text information containing the additional information, and converting the text information containing the additional information into a system audio fragment by using a voice synthesis technology;
the processing of the audio fragments corresponding to the obtained text fragments containing the search information includes the following step: splicing the obtained system audio fragment and the audio fragments corresponding to the text fragments containing the search information to form the second audio information.
Optionally, after the splicing into one text message, the method further includes:
editing the spliced text information according to the operation information from the user.
Optionally, the editing includes: adding or deleting text, and adding annotation information.
The application also provides an audio processing device, comprising a memory and a processor, wherein the memory stores instructions executable by the processor for performing the steps of any of the audio processing methods above.
The present application further provides a computer-readable storage medium storing computer-executable instructions for performing the audio processing method of any one of the above.
The present application further provides an audio processing apparatus, including: a conversion unit, a search unit, and a processing unit; wherein
the conversion unit is used for converting first audio information to be processed into text information;
the searching unit is used for searching the converted text information by utilizing the searching information to obtain text fragments containing the searching information;
and the processing unit is used for processing the audio clips corresponding to the text clips containing the search information to obtain second audio information.
Optionally, the search unit is specifically configured to:
searching in the text information according to the search information to obtain at least one text segment containing the search information;
and respectively determining the start and stop time point information of the audio fragment corresponding to each text fragment according to the start and stop positions of the searched at least one text fragment.
Optionally, the processing unit is specifically configured to: splice the text fragments containing the search information into one piece of text information;
cut each audio fragment out of the first audio information according to the start and stop time points of the audio fragment corresponding to each text fragment in the spliced text information;
and splice the cut audio fragments to obtain the second audio information.
Optionally, the processing unit is further configured to: editing the spliced text information according to the operation information from the user.
Optionally, the apparatus further comprises: an adding unit for generating text information containing additional information and converting the text information containing the additional information into a system audio clip;
the processing unit is specifically configured to: and splicing the obtained system audio fragment and the obtained audio fragment corresponding to the text fragment containing the search information to form the second audio information.
The technical scheme of the application at least includes: converting audio information to be processed into text information; searching the converted text information with preset search information to obtain text fragments containing the search information; and processing the audio fragments corresponding to the obtained text fragments containing the search information to form the processed audio information. By presenting audio content intuitively, and searching, locating, editing, and splicing it according to search information such as keywords, the method and device make editing audio information as convenient, automatic, and efficient as editing text, greatly reducing the overall workload of audio processing.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the technical solutions of the present application and constitute a part of this specification; together with the embodiments of the present application they serve to explain the technical solutions and do not constitute a limitation thereof.
FIG. 1 is a flow chart of an audio processing method according to an embodiment of the present application;
fig. 2 is a schematic diagram of the composition structure of an audio processing device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail hereinafter with reference to the accompanying drawings. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be arbitrarily combined with each other.
In one typical configuration of the present application, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
The steps illustrated in the flowchart of the figures may be performed in a computer system as a set of computer-executable instructions. Also, although a logical order is depicted in the flowchart, in some cases the steps shown or described may be performed in a different order than presented herein.
Fig. 1 is a flowchart of an audio processing method in an embodiment of the present application, as shown in fig. 1, including:
step 100: the first audio information to be processed is converted into text information.
Wherein the first audio information to be processed may comprise one or more audio files.
Alternatively, in this step, a Speech conversion technique such as a Speech-to-Text (STT) technique may be used to convert each audio file into a Text file.
Step 101: and searching the converted text information by utilizing the search information to obtain text fragments containing the search information.
Optionally, the search information may be a keyword set in advance.
Optionally, the step may include: searching in the text information according to the search information to obtain at least one text segment containing the search information; and respectively determining the start and stop time point information of the audio fragment corresponding to each text fragment according to the start and stop position of the searched at least one text fragment.
In an exemplary embodiment, taking the search information as a keyword, the converted text information is searched with the keyword to obtain the text fragments containing the search information, and the start and stop time point information of the audio fragment corresponding to each such text fragment is marked. Specifically: after the text information is searched with the keyword, the text fragments containing the search information are marked, i.e. the marked fragments are the ones that contain the search information. At this point, the start and stop time points of text fragment A, for example 00:04:32-00:25:01, are recorded with it, and text fragment A is marked as: audio A 00:04:32-00:25:01.
It should be noted that preliminary sentence segmentation may be performed by the intelligent speech conversion technology provided in the related art. A text fragment containing search information in this application can then be defined as: a full sentence containing the search information. By searching out the text fragments containing the keyword, this step quickly obtains the context around the keyword and thereby locates the relevant text information.
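The search-and-locate step above can be sketched as follows. This is a minimal illustration, assuming the transcript has already been segmented into full sentences, each carrying the start and stop times of its corresponding audio fragment; the sentence texts and times are hand-made examples, not output of a real speech engine.

```python
def find_keyword_sentences(sentences, keyword):
    """sentences: list of (text, start_sec, end_sec) full sentences from the
    converted text information. Return the sentences containing `keyword`,
    each keeping the start/stop times of its corresponding audio fragment."""
    return [(text, start, end)
            for text, start, end in sentences
            if keyword in text]

# Illustrative sentence-segmented transcript with start/stop times.
sentences = [
    ("Gene detection has moved into ordinary homes.", 0.0, 4.2),
    ("Aliyun is currently a relatively mature cloud product provider.", 272.0, 280.5),
    ("Aliyun provides a wide variety of products.", 1290.0, 1298.0),
]
hits = find_keyword_sentences(sentences, "Aliyun")
print(len(hits))  # -> 2
```

Each hit carries the time span to be cut from the first audio information in the later splicing step.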
Optionally, if the user considers a located text fragment inaccurate or incomplete, the user can select the exact text to be used as the fragment to be spliced around the marked positions. This may be implemented by providing the user with a human-machine interaction interface; the specific implementation form is not used to limit the protection scope of the application.
It should be noted that several kinds of information may be kept in correspondence, such as the waveform, the time axis, and the text. The waveform and the time axis, like the audio track, carry time information. The principle of speech conversion is to convert a sound clip into text; when the unit sound clip is sufficiently small, a corresponding point on the time axis can be obtained by tracing back. For example, if transcription yields the word 'cloud', the sound clip transcribed into 'cloud' has a corresponding position on the time axis of the voice file, say 00:05:30, so the word 'cloud' in the text obtains the corresponding time point 00:05:30. In other words, the converted text information carries corresponding time information.
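This back-tracing from a transcribed word to its point on the time axis can be sketched as follows; the word-aligned transcript of (word, start_sec, end_sec) entries is what an STT engine with word-level alignment would supply, and the entries here are hand-made to mirror the 'cloud' example above.

```python
def format_hms(seconds):
    """Format a time offset in seconds as HH:MM:SS."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

def time_of_word(transcript, word):
    """Return the HH:MM:SS start time of the first occurrence of `word`
    in a word-aligned transcript of (word, start_sec, end_sec) entries."""
    for w, start, _end in transcript:
        if w == word:
            return format_hms(start)
    return None

# Hand-made alignment: 'cloud' transcribed from the sound clip at 00:05:30.
transcript = [("cloud", 330.0, 330.6), ("computing", 330.6, 331.2)]
print(time_of_word(transcript, "cloud"))  # -> 00:05:30
```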
Step 102: and processing the audio clips corresponding to the obtained text clips containing the search information to obtain second audio information.
Optionally, the step includes: splicing the obtained text fragments containing the search information into one complete piece of text information; cutting each audio fragment out of the first audio information to be processed according to the start and stop time points of the audio fragment corresponding to each text fragment in the spliced text information; and splicing the cut audio fragments to obtain the second audio information.
It should be noted that, after the text fragments containing the search information are spliced in this step, the application further includes: editing the spliced text information according to operation information from the user. The editing includes, but is not limited to: adding or deleting text, adding commentary, and so on. For example, the user may customize the spliced text by adding or deleting passages, or by adding comments and explanations, such as noting that a passage is Mr. Wang's view and prepending 'Mr. Wang says'. Such editing is difficult to perform on a bare audio file; by editing the spliced text first, the audio is edited with the convenience of text editing, and blind editing of the audio information is avoided.
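The cut-and-splice operation of step 102 can be sketched with only the standard-library `wave` module. The synthesized silent WAV files below stand in for the first audio information, and the cut spans are illustrative; a real implementation would use the start/stop time points determined in step 101.

```python
import io
import wave

RATE = 8000  # sample rate (Hz) for this toy example

def make_silent_wav(duration_sec):
    """Create an in-memory mono 16-bit WAV of silence (stand-in for speech)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(RATE)
        w.writeframes(b"\x00\x00" * int(RATE * duration_sec))
    buf.seek(0)
    return buf

def cut_and_splice(sources_with_spans):
    """sources_with_spans: list of (wav_file_like, start_sec, end_sec).
    Cut each span out of its source and splice the pieces into one WAV,
    as in step 102."""
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(RATE)
        for src, start, end in sources_with_spans:
            with wave.open(src, "rb") as r:
                r.setpos(int(start * RATE))  # seek to the start time point
                dst.writeframes(r.readframes(int((end - start) * RATE)))
    out.seek(0)
    return out

a, b = make_silent_wav(10), make_silent_wav(10)
spliced = cut_and_splice([(a, 2.0, 5.0), (b, 1.0, 4.0)])
with wave.open(spliced, "rb") as r:
    print(r.getnframes() / RATE)  # -> 6.0  (two 3-second cuts spliced)
```

Production code would preserve each source file's own channel count, sample width, and rate rather than fixing them as constants.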
With the audio processing method of the application, audio content is presented visually; searching, locating, editing, and splicing of the audio content are performed according to keywords; and the audio information is edited as conveniently, automatically, and efficiently as text (copy, paste, cut), greatly reducing the overall workload of audio processing.
Optionally, after step 101 and before step 102, the audio processing method of the application further includes:
identifying the audio source of the audio fragment corresponding to each text fragment containing the search information, and adding the audio source information to the audio fragment. In this way, the respective audio source information is added to audio texts with different audio sources.
Here the audio source refers to the speaker; identifying the audio source of an audio fragment means determining whether the speakers of different audio fragments are the same person.
Optionally, identifying an audio source of an audio segment corresponding to the text segment, and adding audio source information to the audio segment includes:
judging a speaker corresponding to the audio fragment through voiceprint recognition; and adding information of a speaker corresponding to the voiceprint in the text segment.
In one illustrative example, voiceprint recognition can be used to identify whether the speakers of the different audio fragments corresponding to the text fragments containing the search information are the same person, i.e. whether the audio sources are the same. If they are not all the same speaker, i.e. different audio sources exist, the user can be prompted as to whether speaker information needs to be added;
if an instruction to add the speaker information is received from the user, the speaker information is added directly to the text information containing the search information.
Correspondingly,
step 102 includes: converting the text information with the added speaker information into audio fragments carrying the speaker information by means of speech synthesis; and splicing the converted audio fragments carrying the speaker information into the second audio information.
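The speaker-labeling step can be sketched as follows. This is a minimal sketch: the speaker names are assumed to come from a prior voiceprint-recognition step (not implemented here), and the function only builds the annotated text that speech synthesis would then convert.

```python
def add_speaker_info(segments):
    """segments: list of (speaker, text) pairs, where `speaker` is assumed
    to be the label a voiceprint-recognition step supplied for the fragment.
    Prepend 'X says:' to each text fragment, as in the example embodiment."""
    return [f"{speaker} says: {text}" for speaker, text in segments]

labeled = add_speaker_info([
    ("User A", "text of audio A 00:04:32-00:25:01"),
    ("User B", "text of audio B 00:05:45-00:35:06"),
])
print(labeled[0])  # -> User A says: text of audio A 00:04:32-00:25:01
```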
Optionally, the method of the present application further comprises:
generating text information containing the additional information, and converting it into a system audio clip using a speech synthesis technique such as Text-to-Speech (TTS); accordingly, step 102 specifically includes: splicing the obtained system audio clip and the audio fragments corresponding to the text fragments containing the search information to form the second audio information.
The audio processing method of the present application is described in detail below with reference to specific embodiments.
Assume a user is a product manager who, after visiting clients, has collected four audio files — audio A, audio B, audio C, and audio D — from four different clients. In this embodiment, the collected raw audio material comprises: audio file A of speaker A, audio file B of speaker B, audio file C of speaker C, and audio file D of speaker D. The clients' views on 'Aliyun' now need to be sorted out of the collected files; that is, all passages mentioning 'Aliyun' must be searched out of the four audio files and spliced into one audio file of opinions about 'Aliyun'. The audio processing method of the application specifically proceeds as follows:
First, audio file A, audio file B, audio file C, and audio file D are all converted into text information using a speech conversion technique. Taking the conversion of audio file A into text file A as an example:
"we are a company focusing on human genome data analysis and gene information application development. With the maturation of gene sequencing technology and the rapid decrease of cost, gene detection has slowly emerged from scientific research into the home of the general public. The united states has introduced the 'accurate medical' program, the uk has initiated the one hundred thousand genome project, and china is believed to initiate the corresponding project immediately. When each person performs a whole genome sequencing, a data volume of 90 gbps is generated. If the whole genome sequencing is carried out by tens of thousands, millions or even tens of millions, the generated mass data can be solved by never constructing a plurality of servers by themselves, and the large-scale calculation and mass storage of cloud calculation are required to be relied on. The Arian cloud is a mature cloud product provider in China at present, covers various aspects of calculation, storage, safety and the like, saves the cost of manpower and material resources of a self-built machine room, and has good elasticity. Our massive genome data analysis relies on a number of products such as ECS, OSS, OTS, batchCompute. Fastq data generated by the Hiseq X ten sequencer is directly transmitted to the OSS through a high-speed network, so that the problems of data storage and backup are solved. In data analysis, ECS, batchCompute directly reads the genome data of OSS from the intranet, and concurrently analyzes a plurality of genome data, and rapidly returns a gene reading result.
The Ali cloud provides various products, so that people do not need to spend too much manpower and material resources on the deployment and maintenance of the server, the greatest concentration force can be achieved on the products, the best genome data interpretation and analysis service is provided for users, people and the Ali cloud are cooperated with the hand-in of the hand, and the early arrival of personalized medical treatment is expected. "
Then, a keyword search for 'Aliyun' is performed on the converted text information to quickly locate the relevant content. Taking text file A as an example: first, preliminary sentence segmentation may be performed by the intelligent speech conversion technology provided in the related art. The quickly located text fragments containing the search information in this embodiment can then be defined as the full sentences containing it, namely 'Aliyun is currently a relatively mature cloud product provider in China' and 'Aliyun provides a wide variety of products'. The user then decides, around these two quickly located positions, which context to select as the text fragments to be spliced. For example, the user selects the underlined text in text file A as the searched related content: "… must rely on the large-scale computing and mass storage of cloud computing. Aliyun is currently a relatively mature cloud product provider in China, covering computing, storage, security, and other aspects; it saves the labor and material cost of a self-built machine room and offers good elasticity. Our massive genome data analysis relies on products such as ECS, OSS, OTS, and BatchCompute. FASTQ data generated by the HiSeq X Ten sequencer is transmitted directly to OSS over a high-speed network, solving the problems of data storage and backup. During data analysis, ECS and BatchCompute read the genome data directly from OSS over the intranet, analyze multiple genomes concurrently, and rapidly return the gene interpretation results.
Aliyun provides a wide variety of products, so we do not need to spend too much manpower and material on server deployment and maintenance and can concentrate on our products, providing users with the best genome data interpretation and analysis service. Working hand in hand with Aliyun, we look forward to the early arrival of personalized medicine."
Thus, the user quickly locates the desired fragments. Assuming the start and stop time points of the located text fragment are 00:04:32-00:25:01, this piece of text information A is automatically marked as: audio file A 00:04:32-00:25:01.
In this embodiment, assume that the converted text information B, text information C, and text information D are searched, located, and marked in the same way, yielding: audio file B 00:05:45-00:35:06, audio file C 00:01:22-00:15:03, and audio file D 00:34:01-00:46:45.
Then, the marked audio fragments are cut out of the four audio files collected by the product manager and spliced into the processed audio file: "audio A 00:04:32-00:25:01 + audio B 00:05:45-00:35:06 + audio C 00:01:22-00:15:03 + audio D 00:34:01-00:46:45".
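The bookkeeping behind these marks can be sketched as follows: parsing the HH:MM:SS marks, building the marker string for the spliced result, and computing how much audio the splice will contain. The `splice_plan` helper and its tuple format are illustrative, not part of the patented method.

```python
def hms_to_sec(hms):
    """Parse an HH:MM:SS mark into a number of seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def splice_plan(marks):
    """marks: list of (file_name, start_hms, stop_hms) tuples as produced by
    the search-and-locate step. Return the marker string describing the
    spliced result and the total duration of the splice in seconds."""
    label = " + ".join(f"{name} {a}-{b}" for name, a, b in marks)
    total = sum(hms_to_sec(b) - hms_to_sec(a) for _name, a, b in marks)
    return label, total

marks = [
    ("audio A", "00:04:32", "00:25:01"),
    ("audio B", "00:05:45", "00:35:06"),
    ("audio C", "00:01:22", "00:15:03"),
    ("audio D", "00:34:01", "00:46:45"),
]
label, total = splice_plan(marks)
print(total)  # -> 4575  (seconds of spliced audio, about 76 minutes)
```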
Further, voiceprint recognition can identify that the four audio segments to be spliced come from four different speakers, at which point the user can be prompted as to whether speaker information should be added. If the user chooses to add it, the speaker information can be added directly to the text information, for example: "User A says: text of audio A 00:04:32-00:25:01. User B says: text of audio file B 00:05:45-00:35:06. User C says: text of audio file C 00:01:22-00:15:03. User D says: text of audio file D 00:34:01-00:46:45." Then the text information with the added information is converted into audio fragments carrying that information by speech synthesis, and finally the converted fragments are spliced into the processed audio file.
Optionally, the additional information may be a description of the spliced audio file. For example, a title or other description may be added to the four identified and cut audio segments of the above embodiment, such as: "This audio contains four customers' evaluations of Aliyun. In December 2016 I spent three days in Beijing visiting these four users." This additional information can stand alone as a piece of text information representing the additional information. During the subsequent splicing, only this text information is converted into an additional-information audio clip by speech synthesis; finally, the converted additional-information clip is spliced with the marked audio fragments cut from the four audio files collected by the product manager, giving the processed audio file: the additional-information audio clip + audio A 00:04:32-00:25:01 + audio B 00:05:45-00:35:06 + audio C 00:01:22-00:15:03 + audio D 00:34:01-00:46:45.
With the audio processing method of the application, audio content is presented intuitively so that people can 'read' sound, improving cognitive processing speed and information transparency and greatly increasing friendliness to listeners and subsequent audio processors.
The embodiment of the application also provides an audio processing device, which comprises a memory and a processor, wherein the memory stores the following instructions executable by the processor: the executable instructions are for performing the steps of the audio processing method described in one or more embodiments above.
Embodiments of the present application also provide a computer-readable storage medium storing computer-executable instructions for performing the audio processing method described in one or more embodiments above.
Fig. 2 is a schematic structural diagram of an audio processing apparatus according to an embodiment of the present application, which at least includes: a conversion unit, a search unit, and a processing unit; wherein
the conversion unit is used for converting the first audio information to be processed into text information;
the searching unit is used for searching the converted text information by utilizing the searching information to obtain text fragments containing the searching information;
and the processing unit is used for processing the audio fragments corresponding to the obtained text fragments containing the search information to obtain the second audio information.
Optionally, the search unit is specifically configured to:
searching in the text information according to the search information to obtain at least one text segment containing the search information;
and respectively determining the start and stop time point information of the audio fragment corresponding to each text fragment according to the start and stop position of the searched at least one text fragment.
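A minimal sketch of this step, under the assumption (not stated in the patent) that the speech recognizer returns word-level timestamps: each matching text fragment's start and stop time points are taken from the first and last word of the sentence containing the search keyword.

```python
# Sketch under assumed inputs: the recognizer returns (word, start_sec, end_sec)
# triples, and a "text fragment" here is the sentence containing the keyword.

def find_fragments(words, keyword):
    """Return (text, start_sec, end_sec) for each sentence containing the keyword."""
    fragments, sentence = [], []
    for triple in words:
        sentence.append(triple)
        if triple[0].endswith((".", "?", "!")):       # crude sentence boundary
            text = " ".join(w for w, _, _ in sentence)
            if keyword in text:
                fragments.append((text, sentence[0][1], sentence[-1][2]))
            sentence = []
    return fragments

words = [
    ("the", 0.0, 0.2), ("price", 0.2, 0.6), ("is", 0.6, 0.7), ("fair.", 0.7, 1.1),
    ("support", 1.5, 2.0), ("was", 2.0, 2.1), ("slow.", 2.1, 2.6),
]
hits = find_fragments(words, "price")
```

The start and stop seconds of each hit are exactly the "start and stop time point information" used to locate the corresponding audio fragment.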
Optionally, the processing unit is specifically configured to:
splicing the obtained text fragments containing the search information into text information;
cutting out each audio fragment from the first audio information according to the start and stop time points of the audio fragment corresponding to each text fragment in the spliced text information;
and splicing the cut audio fragments to obtain the second audio information.
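The cut-and-splice step can be illustrated as follows, treating audio as a plain list of PCM samples at an assumed sample rate; a real implementation would operate on encoded audio files rather than raw lists.

```python
SAMPLE_RATE = 16000  # assumed sample rate (samples per second)

def cut(samples, start_sec, end_sec):
    """Cut one fragment out of the source audio by its start/stop time points."""
    return samples[int(start_sec * SAMPLE_RATE):int(end_sec * SAMPLE_RATE)]

def splice(fragments):
    """Concatenate the cut fragments, in order, into the second audio information."""
    out = []
    for frag in fragments:
        out.extend(frag)
    return out

source = list(range(10 * SAMPLE_RATE))          # 10 s of dummy samples
second_audio = splice([cut(source, 1, 2), cut(source, 5, 6)])
```

Here two one-second fragments (1–2 s and 5–6 s) are cut from the source and joined back to back, mirroring how the processing unit assembles the second audio information.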
Optionally, the processing unit is further configured to edit the spliced text information according to operation information from the user, where editing includes, but is not limited to: adding or deleting text, adding annotation information, and the like.
Optionally, the processing unit is further configured to:
and identifying the audio sources of the audio clips corresponding to the text clips containing the search information, and adding audio source information for the audio clips with different audio sources.
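One plausible way to add audio source information (a sketch; the patent does not fix the mechanism) is to walk the ordered clips and insert a source label wherever the source changes. In practice the label would itself be synthesized into a short audio announcement rather than a string.

```python
def add_source_labels(clips):
    """Prepend an audio-source label wherever the source changes between clips.

    `clips` is a list of (source_id, clip) pairs; the returned list interleaves
    label markers (here plain strings standing in for synthesized announcements).
    """
    labeled, last_source = [], None
    for source, clip in clips:
        if source != last_source:
            labeled.append(f"[source: {source}]")
            last_source = source
        labeled.append(clip)
    return labeled

out = add_source_labels([("A", "clip1"), ("A", "clip2"), ("B", "clip3")])
```

Consecutive clips from the same source share one label, so listeners hear the source announced only when it actually changes.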
Optionally, the audio processing device of the present application further includes: an adding unit for generating text information containing additional information and converting the text information containing the additional information into a system audio clip;
correspondingly, the processing unit is specifically configured to: and splicing the obtained system audio fragment and the obtained audio fragment corresponding to the text fragment containing the search information to form the second audio information.
Optionally, the search information may include keywords.
Although embodiments of the present application are disclosed above, they are described only to facilitate understanding of the application and are not intended to limit it. Any person skilled in the art to which this application pertains may make modifications and variations in the form and details of implementation without departing from the spirit and scope of the disclosure; however, the scope of protection of the application remains subject to the appended claims.

Claims (13)

1. An audio processing method, comprising:
converting the first audio information to be processed into text information;
searching the converted text information by utilizing the search information to obtain text fragments containing the search information;
determining, through voiceprint recognition, the speaker corresponding to the audio fragment corresponding to the text fragment, and adding, to the text fragment, information on the speaker corresponding to the voiceprint;
processing the audio fragment corresponding to the text fragment containing the search information to obtain second audio information, which comprises: converting the text information to which the speaker information has been added into corresponding audio fragments through speech synthesis; and splicing the converted audio fragments to obtain the second audio information.
2. The audio processing method of claim 1, wherein searching the converted text information using the search information to obtain the text segment containing the search information comprises:
searching in the text information according to the search information to obtain at least one text segment containing the search information;
according to the start and stop positions of at least one searched text segment, respectively determining start and stop time point information of an audio segment corresponding to each text segment;
and identifying start and stop time point information of the text fragment containing the search information.
3. The audio processing method according to claim 1, wherein before the speaker corresponding to the audio fragment corresponding to the text fragment is determined through voiceprint recognition, the method further comprises:
splicing the obtained text fragments containing the search information into text information;
and cutting out each audio fragment from the first audio information according to the start and stop time points of the audio fragment corresponding to each text fragment in the spliced text information.
4. A method of audio processing according to claim 1 or 3, characterized in that the method further comprises:
generating text information containing the additional information, and converting the text information containing the additional information into a system audio fragment through voice synthesis;
the processing of the audio fragment corresponding to the text fragment containing the search information comprises: splicing the system audio fragment and the audio fragment corresponding to the text fragment containing the search information to form the second audio information.
5. The audio processing method according to claim 3, wherein after the splicing into text information, the method further comprises:
editing the spliced text information according to the operation information from the user.
6. The audio processing method according to claim 5, wherein the editing comprises: adding or deleting text, and adding annotation information.
7. An audio processing apparatus comprising a memory and a processor, wherein the memory stores instructions executable by the processor, the executable instructions being used to perform the steps of the audio processing method of any one of claims 1 to 6.
8. A computer-readable storage medium storing computer-executable instructions for performing the audio processing method of any one of claims 1 to 6.
9. An audio processing apparatus, comprising: a conversion unit, a search unit, and a processing unit; wherein:
the conversion unit is used for converting the first audio information to be processed into text information;
the searching unit is used for searching the converted text information by using the search information to obtain text fragments containing the search information; determining, through voiceprint recognition, the speaker corresponding to the audio fragment corresponding to the text fragment; and adding, to the text fragment, information on the speaker corresponding to the voiceprint;
the processing unit is configured to process the audio fragment corresponding to the text fragment containing the search information to obtain second audio information, which includes: converting the text information to which the speaker information has been added into corresponding audio fragments through speech synthesis; and splicing the converted audio fragments to obtain the second audio information.
10. The audio processing apparatus according to claim 9, wherein the search unit is specifically configured to:
searching in the text information according to the search information to obtain at least one text segment containing the search information;
and respectively determining the start and stop time point information of the audio fragment corresponding to each text fragment according to the start and stop position of the searched at least one text fragment.
11. The audio processing device according to claim 9, wherein the processing unit is specifically configured to: splicing the text fragments containing the search information into text information;
and cutting out each audio fragment from the first audio information according to the start and stop time points of the audio fragment corresponding to each text fragment in the spliced text information.
12. The audio processing device of claim 11, wherein the processing unit is further configured to: editing the spliced text information according to the operation information from the user.
13. The audio processing apparatus according to claim 9 or 11, characterized in that the apparatus further comprises: an adding unit for generating text information containing additional information and converting the text information containing the additional information into a system audio clip;
the processing unit is specifically configured to: and splicing the obtained system audio fragment and the obtained audio fragment corresponding to the text fragment containing the search information to form the second audio information.
CN201810974926.9A 2018-08-24 2018-08-24 Audio processing method and device Active CN110895575B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810974926.9A CN110895575B (en) 2018-08-24 2018-08-24 Audio processing method and device


Publications (2)

Publication Number Publication Date
CN110895575A CN110895575A (en) 2020-03-20
CN110895575B true CN110895575B (en) 2023-06-23

Family

ID=69784964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810974926.9A Active CN110895575B (en) 2018-08-24 2018-08-24 Audio processing method and device

Country Status (1)

Country Link
CN (1) CN110895575B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204668A (en) * 2021-05-21 2021-08-03 广州博冠信息科技有限公司 Audio clipping method and device, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102014203818A1 (en) * 2014-03-03 2015-09-03 Sennheiser Electronic Gmbh & Co. Kg Method and device for converting speech signals into text
CN106095799A (en) * 2016-05-30 2016-11-09 广州多益网络股份有限公司 The storage of a kind of voice, search method and device
CN106782545A (en) * 2016-12-16 2017-05-31 广州视源电子科技股份有限公司 A kind of system and method that audio, video data is changed into writing record
CN107644646A (en) * 2017-09-27 2018-01-30 北京搜狗科技发展有限公司 Method of speech processing, device and the device for speech processes
CN107798143A (en) * 2017-11-24 2018-03-13 珠海市魅族科技有限公司 A kind of information search method, device, terminal and readable storage medium storing program for executing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9020811B2 (en) * 2006-10-13 2015-04-28 Syscom, Inc. Method and system for converting text files searchable text and for processing the searchable text
CN103165131A (en) * 2011-12-17 2013-06-19 富泰华工业(深圳)有限公司 Voice processing system and voice processing method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
R. Ani et al., "Smart Specs: Voice assisted text reading system for visually impaired persons using TTS method," 2017 International Conference on Innovations in Green Energy and Healthcare Technologies (IGEHT), 2017. *
Niu Songfeng et al., "Design of an intelligent Chinese speech-text editing system based on artificial intelligence," Radio & Television Technology, 2018, Vol. 45, No. 4. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40025596

Country of ref document: HK

GR01 Patent grant