CN111415651A - Audio information extraction method, terminal and computer readable storage medium - Google Patents

Audio information extraction method, terminal and computer readable storage medium

Info

Publication number
CN111415651A
CN111415651A (application CN202010094370.1A)
Authority
CN
China
Prior art keywords
audio, track, text, extraction method, information extraction
Prior art date
Legal status (an assumption, not a legal conclusion; no legal analysis has been performed)
Pending
Application number
CN202010094370.1A
Other languages
Chinese (zh)
Inventor
Zhang Wenhai (张文海)
Current Assignee (the listed assignees may be inaccurate; no legal analysis has been performed)
Shenzhen Microphone Holdings Co Ltd
Shenzhen Transsion Holdings Co Ltd
Original Assignee
Shenzhen Microphone Holdings Co Ltd
Priority date (an assumption, not a legal conclusion; no legal analysis has been performed)
Filing date
Publication date
Application filed by Shenzhen Microphone Holdings Co Ltd
Priority to CN202010094370.1A
Publication of CN111415651A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an audio information extraction method, a terminal, and a computer-readable storage medium. The method comprises: determining the audio to be extracted; determining the audio tracks of the audio, the audio having at least two tracks; parsing the audio to obtain the audio text generated by each track; and storing each audio text in the audio on the basis of its track, as the extracted audio information. Because the audio text generated by each track can be obtained directly through computer analysis, and the extracted audio information is obtained by storing each audio text on the basis of its track, the technical scheme provided by the invention obtains the information in the audio without requiring the user to play it back and listen manually, making the audio information convenient to consult and improving the user experience.

Description

Audio information extraction method, terminal and computer readable storage medium
Technical Field
The present invention relates to the field of information technology, and more particularly, to an audio information extraction method, a terminal, and a computer-readable storage medium.
Background
Existing intelligent terminals can generate audio through a variety of recording software, and the audio is stored in sound formats such as WAV, MIDI, CDA, and MP3. When a user wants to obtain the information in a recording, the user must listen to it one or more times, which is time-consuming and reduces the user experience.
Disclosure of Invention
The invention provides an audio information extraction method, a terminal, and a computer-readable storage medium, solving the technical problem in the prior art that audio information must be acquired by playing back and listening to the audio, which reduces the user experience.
The invention provides an audio information extraction method, which comprises the following steps:
determining audio to be extracted, and determining tracks of the audio based on the audio, wherein the audio has at least two tracks;
analyzing the audio to obtain audio texts generated by each audio track;
and storing each audio text in the audio on the basis of its audio track, as the extracted audio information.
Optionally, determining the track of the audio based on the audio comprises:
parsing the audio to obtain the voiceprint characteristics of the audio, wherein the audio comprises at least two sound waves and has at least two voiceprint characteristics;
and determining that sound waves in the audio with the same voiceprint characteristics belong to the same audio track, thereby obtaining each audio track in the audio.
Optionally, determining the track of the audio based on the audio comprises:
determining the sound sources of the audio from the audio, wherein the audio comprises at least two sound waves and at least two sound sources;
and determining that sound waves in the audio with the same sound source belong to the same audio track, thereby obtaining each audio track in the audio.
Optionally, determining the sound source of the audio according to the audio includes:
determining the social application that generated the audio;
and searching the social application for each contact participating in generating the audio, and determining each such contact as a sound source of the audio.
Optionally, after storing each audio text in the audio based on each audio track, the audio information extraction method further includes:
acquiring the characters (text messages) to be extracted;
determining the chronological order of the characters and each audio text;
and inserting the characters between the audio texts in chronological order.
Optionally, after the audio is analyzed to obtain the audio text generated by each audio track, the audio information extraction method further includes:
displaying a track selection box, the track selection box comprising the audio tracks of the audio and the audio text generated by each track;
and detecting a selection operation on an audio track in the track selection box, and modifying the audio text generated by the selected track.
Optionally, storing each audio text in the audio based on each audio track includes:
determining the storage language of the audio and the audio texts in the audio that differ from the storage language;
translating the audio texts that differ from the storage language into the storage language, to obtain audio texts unified in one language;
and determining the chronological order of each audio text in the audio, and storing the audio texts recognized from each track in that order.
Optionally, after storing each audio text in the audio based on each audio track to serve as the extracted audio information, the audio information extracting method further includes:
sending audio information to a preset mailbox account or/and a social application account;
or sending audio texts of preselected audio tracks in the audio to a preset mailbox account or/and a social application account;
or selecting at least one conference element from the audio information according to a preset conference recording template, and filling the conference element into the preset conference recording template according to a preset format to generate a conference record.
Furthermore, the invention also provides a terminal, which comprises a processor, a memory and a communication bus;
the communication bus is used for realizing connection communication between the processor and the memory;
the processor is configured to execute one or more programs stored in the memory to implement the steps of the audio information extraction method as described in any one of the above.
Further, the present invention also provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the audio information extraction method as described in any one of the above.
Addressing the defect in the prior art that audio information can be obtained only by playing back and listening to the audio, the technical scheme provided by the invention parses the audio to be extracted to obtain the audio text generated by each audio track, and stores each audio text on the basis of its track, so that the audio information of the audio can be obtained directly.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a sound wave diagram of a segment of audio provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of sound waves of the audio tracks of FIG. 1;
FIG. 3 is a flowchart illustrating an audio information extraction method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a first interaction interface of a social application according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an interaction interface of another social application provided in an embodiment of the present invention;
FIG. 6 is a diagram illustrating parsing of the audio of FIG. 1 to obtain audio texts;
FIG. 7 is a diagram illustrating an interface with a track selection box according to an embodiment of the present invention;
FIG. 8 is a second interactive interface of the social application shown in FIG. 4;
FIG. 9 is a diagram illustrating an interface with a track selection box according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a preset conference recording template according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make the objects, features, and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Before the audio information extraction method provided by the invention is introduced, some basic concepts are explained:
Audio refers to recorded sound that can be heard by human beings, including speech, singing, sounds produced by musical instruments, and meaningless noise; it may be stored in files of WAV, MIDI, CDA, MP3, and similar formats.
Sound propagates as sound waves, and the recorded sound in audio can be represented by sound waves as shown in fig. 1. Referring to fig. 1, fig. 1 shows an audio Q, i.e., a recorded sound, comprising a plurality of sound waves. The duration of the audio shown in fig. 1 is 23 seconds, and Q5 denotes the fifth sound wave in the audio Q.
It should be understood that a piece of audio may include sound waves from one sound source or from multiple sound sources. In the embodiment of the present invention, sound waves generated by the same sound source belong to the same track, and each track differs from the other tracks in specific properties such as timbre and voiceprint characteristics.
In other embodiments, the specific attributes of a track may further include the timbre library, number of channels, input/output ports, volume, etc., used in audio editing devices such as sound sequencers and digital music software. A track can be displayed as one parallel "strip" in such an audio editing device, such as the first track A and the second track B in fig. 2.
It should be noted that the sound waves in a track are arranged in time sequence; that is, a track includes not only the information carried by all the sound waves in the track but also the time at which each sound wave appears on the track.
Referring to fig. 2, the audio of fig. 1 has a first audio track A and a second audio track B. The first track A includes a first sound wave (A1 in fig. 2, occurring from the 0th to the 10th second) and a second sound wave (A2 in fig. 2, occurring from the 18th to the 23rd second). The second track B includes a third sound wave (B3 in fig. 2, occurring from the 18th to the 23rd second).
The following will describe the audio information extraction method provided by the present invention based on the above description, please refer to fig. 3, which is a schematic flow chart of the audio information extraction method provided by the embodiment of the present invention, and the method includes:
s301, determining the audio to be extracted, and determining the audio track of the audio based on the audio.
The audio has at least two tracks. It should be clear that the method provided by the embodiment of the present invention can also be applied to audio with only one track.
The audio to be extracted may be a single complete, uninterrupted recording, or audio composed of multiple pieces of audio. There are many situations in which multiple pieces of audio are combined into one:
For example, referring to fig. 4, the user M and the user N use a social application to hold an online voice chat. The chat generates five voice messages with durations of 25, 32, 12, 8, and 2 seconds, respectively; the user can select the five messages and store them, and in this example the five messages can be combined into one audio.
For another example, referring to fig. 5, nine users (user C through user K) use conference software to hold an online voice or video conference. Whether the conference is a voice conference or a video conference, the software can store the voice uttered by each user, and then store the voices in chronological order to obtain the conference audio of the conference.
In the embodiment of the present invention, the number of audios to be extracted may be one. In other examples, the number may be multiple: for instance, when a recording is interrupted, or when multiple audios of the same batch are received, the audio information of the multiple pieces must be extracted simultaneously and integrated into one complete set of audio information. This case of multiple audios to be extracted is described in more detail later.
Determining the tracks of the audio may be accomplished on the basis of voiceprint features. Voiceprint characteristics such as the pitch and frequency of a sound are preserved in the audio: storing the sound does not change it, the voiceprint characteristics of the sound source (the object generating the sound) are retained, and sound waves with the same voiceprint characteristics can be determined to belong to the same track. This voiceprint-based extraction method can be used with recording software that does not distinguish sound sources, as well as with social applications in which multiple users communicate.
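The voiceprint-based track determination described above can be sketched as follows. This is an illustrative assumption, not the patent's actual implementation: each sound wave carries a small feature vector standing in for its voiceprint, and waves are grouped into tracks by cosine similarity against each track's reference voiceprint.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def group_into_tracks(waves, threshold=0.95):
    """Assign each sound wave to a track whose reference voiceprint it matches;
    a wave matching no existing track starts a new track."""
    tracks = []  # list of (reference_voiceprint, [member waves])
    for wave in waves:
        for ref, members in tracks:
            if cosine(wave["voiceprint"], ref) >= threshold:
                members.append(wave)
                break
        else:
            tracks.append((wave["voiceprint"], [wave]))
    return [members for _, members in tracks]

# Hypothetical waves of an audio Q: Q1 and Q3 share a (similar) voiceprint.
waves = [
    {"id": "Q1", "start": 0,  "end": 10, "voiceprint": [0.90, 0.10, 0.30]},
    {"id": "Q2", "start": 10, "end": 18, "voiceprint": [0.10, 0.80, 0.40]},
    {"id": "Q3", "start": 18, "end": 23, "voiceprint": [0.88, 0.12, 0.31]},
]
tracks = group_into_tracks(waves)
```

A production system would derive the feature vectors from a spectrogram and tune the similarity threshold; both are stand-ins here.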
In other embodiments, determining the tracks of the audio may be based on the sound sources of the audio. It should be understood that a piece of audio is obtained by combining sound waves emitted by multiple sound sources; strictly speaking, the audio itself has no single sound source, and the term here refers to the generation source of each sound wave in the audio. Sound waves with the same sound source can be determined to belong to the same track. This source-based extraction method can be used with social applications in which multiple users communicate.
It should be noted that in this embodiment the sound sources are not determined by parsing the audio for voiceprint features, but by tracing the source of each sound wave through the social application. Specifically, when a social application is used for a voice conference, video conference, voice chat, or in-game team voice chat, the application records the sound waves generated by each user, and the relationship between a user and the sound waves that user generates is uniquely determined; once the source of each sound wave is known, waves with the same source can be assigned to the same track.
Referring to fig. 4, among the five voice messages generated in the voice chat, those with durations of 25, 12, and 8 seconds are generated by the user M, and those with durations of 32 and 2 seconds are generated by the user N. After the user merges and stores the five messages as one audio, the sound sources (users) of the messages in the conversation can be traced back through the social application, and the sound sources of the merged audio can thus be determined.
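The source-tracing variant can be sketched under the assumption (hypothetical field names) that the social application's chat log ties each voice message to the user who sent it, as in the chat of fig. 4:

```python
# Chat log as the social application might record it (field names are
# illustrative assumptions): each voice message has exactly one sender.
chat_log = [
    {"msg_id": 1, "user": "M", "duration": 25},
    {"msg_id": 2, "user": "N", "duration": 32},
    {"msg_id": 3, "user": "M", "duration": 12},
    {"msg_id": 4, "user": "M", "duration": 8},
    {"msg_id": 5, "user": "N", "duration": 2},
]

def tracks_by_source(chat_log):
    """Group voice messages into tracks by their sound source (the sender)."""
    tracks = {}
    for msg in chat_log:
        tracks.setdefault(msg["user"], []).append(msg["msg_id"])
    return tracks

tracks = tracks_by_source(chat_log)
```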
S302, analyzing the audio to obtain audio texts generated by each audio track.
It is to be understood that a piece of audio includes one or more sound waves, and one or more consecutive sound waves may constitute a semantically meaningful sound segment, which may be parsed into audio text. Referring to fig. 6, fig. 6 shows the audio texts obtained by parsing the audio Q of fig. 2. The first sound wave A1 of the first track A in fig. 2 can be parsed into the first audio text, "I recently found a shop whose food is really good. Shall we go taste it together?"; the second sound wave A2 can be parsed into the second audio text, "Great, then let's go this evening!"; and the third sound wave B3 of the second track B can be parsed into the third audio text, "Sounds good, I have been wanting to eat something nice lately!".
It should be understood that each audio text obtained by parsing is the audio text generated by one of the audio tracks in the audio. These audio texts may be unordered fragments, semantically incoherent, and unusable as information extracted from the audio. Continuing with fig. 6, parsing the audio Q yields the first, second, and third audio texts; the three texts have no order and are not semantically coherent. Therefore:
s303, storing each audio text in the audio based on each audio track as the extracted audio information.
As can be seen from the foregoing description, the information contained in an audio track includes not only the information carried by all its sound waves but also the time at which each sound wave occurs on the track. This step may therefore be performed as follows:
(1) Storing each audio text in the audio according to the time information to obtain a semantically coherent text; this text is the audio information extracted from the audio.
(2) Or storing the audio text generated by one particular track in the audio according to the time information to obtain a semantically coherent text. Note that the information extracted in this case is the audio information of that one track, not of the whole audio (which has multiple tracks).
In other embodiments, the obtained semantic text may be translated to obtain a text in a unified language; the unified-language text is then the audio information extracted from the audio.
In still other embodiments, the translation step may be performed immediately after the audio texts are obtained; the unified-language texts are then stored in order according to the time information, and the resulting semantic text serves as the audio information extracted from the audio.
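Cases (1) and (2) above, i.e. storing the audio texts in time order, optionally restricted to one track, can be sketched as follows. The start times and texts are illustrative stand-ins for the parsed output of fig. 6:

```python
# Parsed audio texts, each tagged with its track and a hypothetical start
# time; the real times come from the time information carried by each track.
segments = [
    {"track": "A", "start": 0,  "text": "I recently found a good shop."},
    {"track": "B", "start": 11, "text": "Sounds good, let's try it."},
    {"track": "A", "start": 18, "text": "Great, this evening then!"},
]

def merge_transcript(segments, track=None):
    """Store the audio texts in chronological order; if `track` is given,
    keep only the texts of that one track (case (2) above)."""
    chosen = [s for s in segments if track is None or s["track"] == track]
    chosen.sort(key=lambda s: s["start"])
    return [f"[{s['track']}] {s['text']}" for s in chosen]

full = merge_transcript(segments)          # case (1): the whole audio
only_a = merge_transcript(segments, "A")   # case (2): one track only
```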
After the audio information is extracted, it can be put to use, for example by sending it to a preset mailbox account and/or a social application account. In some examples, only the audio text of a certain track is sent to a preset mailbox account and/or social application account. In other examples, the extracted audio information can be used to generate a conference record.
The embodiment of the invention provides an audio information extraction method. Addressing the defect in the prior art that audio information can be obtained only by manually playing back and listening to the audio, the technical scheme obtains the audio text generated by each audio track directly through computer analysis, and obtains the audio information extracted from the audio by storing each audio text on the basis of its track.
Further embodiments based on the above audio information extraction method are described below.
The present invention also provides an embodiment, in which the audio information extraction method includes steps S1001 to S1013:
s1001, determining an audio to be extracted;
in the present embodiment, the audio to be extracted here is a complete uninterrupted piece of audio. In other examples, the audio to be extracted here is audio composed of a plurality of pieces of audio.
And S1002, analyzing the audio to obtain the voiceprint characteristics of the audio.
In the embodiment of the invention, the audio comprises at least two sound waves and has at least two voiceprint characteristics. It should be noted that the embodiment can also be applied when the audio has only one sound wave and one voiceprint feature.
It should be understood that human vocal organs (vocal cords, soft palate, tongue, teeth, lips, etc.) differ in size, shape, and function, and different people's resonators (pharyngeal cavity, oral cavity, nasal cavity) also differ; these small differences change the airflow of speech and produce differences in sound quality and timbre. In addition, people's speaking habits differ in speed and force, causing differences in sound intensity and duration.
Using a sound spectrograph, sound (audio) can be converted into an electrical signal, and the changes in the signal can be drawn as a spectrogram, forming a voiceprint graph with voiceprint features. The audio can therefore be parsed to obtain its voiceprint characteristics.
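As an illustration of how a spectrogram is obtained from a signal, the following is a naive short-time DFT sketch. A real voiceprint system would use an FFT and far richer features, so this is only a minimal stand-in:

```python
import cmath
import math

def stft_magnitudes(signal, frame_len=64, hop=32):
    """Magnitude spectrum of each analysis frame (naive short-time DFT):
    the list of per-frame spectra is a crude spectrogram."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        mags = []
        for k in range(frame_len // 2):  # keep the non-redundant bins
            s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))
            mags.append(abs(s))
        frames.append(mags)
    return frames

# A pure tone whose frequency falls exactly into DFT bin 5 of a 64-sample frame.
tone = [math.sin(2 * math.pi * 5 * n / 64) for n in range(128)]
spectrogram = stft_magnitudes(tone)
```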
As can be seen from the above description, different people have different voiceprint characteristics. In some examples, a voiceprint feature may be characterized by three dimensions of a syllable, a speech sentence, and a speech paragraph in a piece of audio:
the syllable dimension comprises physiological characteristics such as vocal cord shape, vocal tract length, size and the like;
the sentence dimension comprises the voice characteristics of tone, strength, rhythm, pause and the like;
the paragraph dimensions include behavioral characteristics such as word usage, accent, pronunciation, and the like.
In some examples of embodiments of the present invention, the voiceprint features include primarily a sentence dimension and a paragraph dimension.
S1003, determining that sound waves in the audio with the same voiceprint characteristics belong to the same audio track, thereby obtaining each audio track in the audio.
The audio comprises a plurality of sound waves; each sound wave can be assigned according to the rule that consecutive sound waves with the same voiceprint characteristics form one sound segment, thereby obtaining each audio track in the audio.
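The segmentation rule of S1003, merging consecutive sound waves that share a voiceprint into one sound segment, can be sketched as follows (voiceprint labels and times are illustrative assumptions):

```python
# Sound waves in playback order, each labelled with the voiceprint it matched.
waves = [
    {"voiceprint": "vp1", "start": 0,  "end": 4},
    {"voiceprint": "vp1", "start": 4,  "end": 10},
    {"voiceprint": "vp2", "start": 11, "end": 17},
    {"voiceprint": "vp1", "start": 18, "end": 23},
]

def split_into_segments(waves):
    """Merge consecutive waves with the same voiceprint into one segment;
    a voiceprint change starts a new segment."""
    segments = []
    for wave in waves:
        if segments and segments[-1]["voiceprint"] == wave["voiceprint"]:
            segments[-1]["end"] = wave["end"]  # extend the current segment
        else:
            segments.append(dict(wave))        # a new voiceprint begins
    return segments

segments = split_into_segments(waves)
```

Note that a later wave with an earlier voiceprint (vp1 after vp2) starts a new segment rather than rejoining the first one; the segments of one track are then collected by voiceprint.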
And S1004, analyzing the audio to obtain audio texts generated by each audio track.
It is to be understood that a piece of audio includes one or more sound waves, and one or more consecutive sound waves may constitute a semantically meaningful sound segment, which may be parsed into audio text.
It should be understood that each audio text obtained by parsing is generated by one of the audio tracks in the audio. These audio texts may be unordered fragments, semantically incoherent, and unusable as information extracted from the audio.
When the user considers the parsed audio text to be wrong, the audio text can be modified. In the embodiment of the invention, there are at least two modification situations: (1) modifying the audio texts generated by all tracks in the audio; (2) a user whose voice has a certain voiceprint characteristic modifying only the audio text obtained from the track determined by that voiceprint characteristic.
The embodiment of the present invention mainly introduces a modification condition (1), and after step S1004, the audio information extraction method provided by the embodiment of the present invention further includes:
s1005, displaying the track selection frame.
The track selection box includes the audio tracks of the audio and the audio text generated by each track. The user can review the audio text generated by each track on the interface with the track selection box.
Referring to fig. 7, an interface with a track selection box is shown; the box includes the first through fourth tracks of the audio and the audio texts generated by the four tracks.
After reviewing, if the user finds a problem in the audio text generated by a certain track, the user can modify that text. For example, when the user finds that the audio text of the second track in fig. 7 is incorrect, the user can press the selection button z corresponding to the second track.
S1006, detecting the selection operation of the audio track in the audio track selection frame, and modifying the audio text generated by the selected audio track.
The terminal detects the user's selection operation on a track in the track selection box and modifies the audio text generated by the selected track.
The modification here includes discarding all or part of the audio text generated by the selected track, and/or adding or deleting characters in the audio text.
The audio information extraction method provided by the embodiment of the invention can also unify the language of the audio texts, comprising the following steps:
s1007, determining the storage language of the audio and the audio text different from the storage language in the audio.
The storage language is the language in which the audio texts are stored; it may be any national language or a local dialect. For example, when a section of the audio is Japanese, the parsed audio text is also Japanese; if the user wants to unify the language of the audio, the audio texts that differ from the storage language must first be determined.
The storage language of the audio can be determined in two ways:
(1) According to the system language of the terminal set by the user, i.e., the language used by the terminal is taken as the storage language of the audio.
(2) According to the user's selection. Referring to fig. 8, fig. 8 shows an interface with a language selection box, on which the user can select a language as the storage language of the audio.
S1008, translating the audio texts in the audio different from the storage language into the storage language to obtain the audio texts with the unified audio language.
Continuing with fig. 8, after the user selects Mandarin as the storage language, the audio texts that are not in Mandarin must be translated into Mandarin, i.e., the languages of other countries and the dialects of various regions are translated into Mandarin, so that the final audio texts are all in Mandarin.
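The unification of S1007-S1008 can be sketched with a stand-in translator; a real terminal would call a translation service here, and the one-entry glossary below is purely illustrative:

```python
def unify_language(audio_texts, storage_lang, translate):
    """Translate every audio text that is not already in the storage language."""
    unified = []
    for item in audio_texts:
        if item["lang"] == storage_lang:
            unified.append(dict(item))
        else:
            unified.append({"lang": storage_lang,
                            "text": translate(item["text"], storage_lang)})
    return unified

# Stand-in translator: a real system would call a translation service.
GLOSSARY = {("konnichiwa", "mandarin"): "ni hao"}

def toy_translate(text, target_lang):
    return GLOSSARY.get((text, target_lang), text)

texts = [{"lang": "mandarin", "text": "zao shang hao"},
         {"lang": "japanese", "text": "konnichiwa"}]
unified = unify_language(texts, "mandarin", toy_translate)
```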
S1009 determines a time sequence of each audio text in the audio, and stores each audio text identified by each audio track in the time sequence.
A track in the audio comprises one or more sound waves; one or more consecutive sound waves can form a semantically meaningful sound segment, and one sound segment can be parsed into one audio text. Parsing the audio therefore yields multiple audio texts, and these texts are unordered and cannot, by themselves, serve as the information extracted from the audio.
As described above for audio tracks, the information contained in a track includes not only the information (audio text) carried by all its sound waves but also the time at which each sound wave occurs on the track; each audio text in the audio can therefore be stored in order according to this time information, yielding a semantically coherent text, which is the audio information extracted from the audio.
In the embodiment of the present invention, the file obtained by directly storing the audio texts in chronological order is the extracted audio information. In other examples, after the audio information extracted from the audio is obtained, the following steps are also required:
s1010, obtaining characters to be extracted;
when social communication is performed by using social application, the situation that voice call/video call is inconvenient to perform due to no audio access exists, at the moment, a user can communicate by adopting a character sending mode, and the characters are characters to be extracted in the communication.
Referring to fig. 9, user M and user N conduct an online voice communication through the social application; the communication produces three voice segments (with durations of 25 seconds, 32 seconds and 2 seconds, respectively) and two character messages. To extract both the audio and the characters generated in the communication, it is necessary to:
S1011, determining the time sequence of the characters and each audio text.
S1012, adding the characters among the audio texts according to the time sequence, to serve as the extracted audio information.
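Steps S1011 and S1012 amount to a timestamp merge of the voice-derived texts with the typed characters. A hedged sketch follows; the (timestamp, text) layout and the sample timeline are assumptions, not data from the embodiment:

```python
def add_characters(audio_texts, characters):
    """Interleave typed characters among the audio texts by time order
    (S1011-S1012); both inputs are lists of (timestamp, text) pairs."""
    merged = sorted(audio_texts + characters, key=lambda item: item[0])
    return [text for _, text in merged]

# Hypothetical timeline in the spirit of fig. 9: voice segments and typed messages.
voice = [(0, "first voice text"), (40, "second voice text"), (80, "third voice text")]
typed = [(30, "typed message one"), (75, "typed message two")]
print(add_characters(voice, typed))
```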
In other examples, the extracted audio information may also be handled as follows:
S1013, sending the audio text of a preselected audio track in the audio to a preset mailbox account or/and a social application account.
In this example, each user participating in a voice call can modify only the words (audio text) that they personally spoke; accordingly, only the audio text of each audio track is sent to the user to whom that track corresponds.
The audio text may be sent by mail to the preset mailbox account of the corresponding user, or sent to the corresponding user's social application account.
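As a sketch of the mail branch of S1013, the message for one track can be composed with the Python standard library. The address book and field values below are hypothetical, and the actual SMTP delivery (e.g. via smtplib) is omitted:

```python
from email.message import EmailMessage

def build_track_mail(owner, audio_text, address_book):
    """Compose a mail carrying one track's audio text, addressed to the
    preset mailbox of the user who produced that track (S1013)."""
    msg = EmailMessage()
    msg["To"] = address_book[owner]                # preset mailbox account
    msg["Subject"] = "Audio text of your track"    # hypothetical subject line
    msg.set_content(audio_text)
    return msg

mail = build_track_mail("user M", "Hello, everyone.", {"user M": "m@example.com"})
```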
In other examples, after the audio information is extracted, it may further be handled in the following two ways.
(1) Sending the audio information to a preset mailbox account or/and a social application account.
The difference from the previous example is that here all the audio text in the audio (any discarded audio text may be removed) is sent to some or all of the users.
(2) Selecting at least one conference element from the audio information according to a preset conference record template, and filling the conference element into the preset conference record template according to a preset format to generate a conference record.
In this example, the meeting elements may include one or more of meeting subject, time, location, participants, moderator, meeting summary.
The preset conference record template is prepared in advance and may be in a format such as Word, Excel or PDF. Fig. 10 shows a preset conference record template provided by the embodiment of the present invention, containing the conference elements mentioned above; the extracted audio information is recorded in the meeting summary part of the template.
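Way (2) can be sketched as filling a template string. The plain-text template and the element names below merely mirror the list above and are assumptions, not the actual preset template of fig. 10:

```python
from string import Template

# Hypothetical plain-text stand-in for the preset conference record template.
MINUTES_TEMPLATE = Template(
    "Meeting subject: $subject\n"
    "Time: $time\n"
    "Location: $location\n"
    "Participants: $participants\n"
    "Moderator: $moderator\n"
    "Meeting summary:\n$summary\n"
)

def make_conference_record(elements, audio_information):
    """Fill the selected conference elements and the extracted audio
    information (used as the meeting summary) into the preset template."""
    return MINUTES_TEMPLATE.substitute(summary="\n".join(audio_information),
                                       **elements)
```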
The audio information extraction method provided by the embodiment of the invention can determine the audio tracks based on voiceprint features, obtain by analysis the audio text generated by each audio track, and store each audio text in the audio to obtain the audio information extracted from the audio. In other words, the method obtains the audio information in the audio without the user manually playing and listening to it, so the user can consult the audio information conveniently and enjoys a better user experience.
The audio information extraction method provided by the invention has further examples, one of which comprises the following steps S1201 to S1214:
S1201, determining the audio to be extracted.
As in the previous embodiment, the audio to be extracted here is a complete uninterrupted piece of audio. In other examples, the audio to be extracted here is audio composed of a plurality of pieces of audio. A piece of audio may include sound waves from one sound source or may include sound waves from multiple sound sources.
In the present embodiment, the audio to be extracted has at least two sound sources, but the audio information extraction method provided by the present embodiment can also be used in the case where the audio has only one sound source.
It should be noted that many steps of this embodiment are the same as those described above; to avoid repetition, the same steps are only briefly described, and the earlier description may be consulted where anything is unclear.
S1202, determining the social application generating the audio.
This embodiment traces the sound source of each sound wave in the audio through the social application. Specifically, when a social application is used for a voice conference, video conference, voice chat or in-game team voice chat, the social application used by each user records the sound waves generated by that user, and the relationship between a user and the sound waves the user generates is thereby uniquely determined.
The social application may store the audio produced during social communications conducted through it, and from such audio the social application that generated it can be determined.
S1203, finding in the social application each contact who participated in generating the audio, and determining each contact as a sound source of the audio.
When the social application is used for a voice conference, video conference, voice chat or in-game team voice chat, the participants are known and recorded in the social application. After the social application is determined, the contacts who participated in generating the audio are found from the history of the social application, and these contacts can be determined as the sound sources of the audio to be extracted.
S1204, judging sound waves with the same sound source in the audio to belong to the same audio track, thereby obtaining each audio track in the audio.
It should be understood that a piece of audio is obtained by combining sound waves emitted from a plurality of sound sources, and strictly speaking, the audio itself has no sound source.
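Under the assumption that each recorded sound wave already carries its source contact (as S1202–S1203 establish), S1204 reduces to a grouping pass. The dict layout below is purely illustrative:

```python
from collections import defaultdict

def build_tracks(sound_waves):
    """Group sound waves by their source contact into per-contact audio
    tracks (S1204). Each wave is a hypothetical dict such as
    {"source": contact, "start": time_in_seconds, ...}."""
    tracks = defaultdict(list)
    for wave in sound_waves:
        tracks[wave["source"]].append(wave)
    for track in tracks.values():
        track.sort(key=lambda w: w["start"])   # keep time order within a track
    return dict(tracks)
```

A usage example: waves recorded from contacts "M" and "N" come back as two tracks, each internally ordered by occurrence time, ready for the parsing step S1205.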
S1205, analyzing the audio to obtain the audio text generated by each audio track.
A piece of audio comprises one or more sound waves; one or more continuous sound waves can form a sound segment with semantics, and such a segment can be analyzed into audio text.
S1206, displaying the audio track selection box.
The audio track selection box includes audio tracks of the audio and audio texts generated by the audio tracks. The user may review the audio text generated for each track at the interface with the track selection box.
S1207, detecting the selection operation of the track in the track selection box, and modifying the audio text generated by the selected track.
The modifications herein include: discarding all or part of the audio text generated by the selected audio track, or/and adding or/and deleting characters in the respective audio text.
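The effect of a modification in S1207 can be sketched as follows, assuming the audio texts are held in a per-track mapping; modeling "discard" as passing no edited text is a design choice of this sketch, not of the embodiment:

```python
def modify_selected_track(track_texts, selected_track, edited_text=None):
    """Apply the user's edit from the track selection box (S1207):
    a non-None edited_text (with characters added or deleted) replaces
    the selected track's audio text; None discards that text entirely."""
    if edited_text is None:
        track_texts.pop(selected_track, None)      # discard the track's text
    else:
        track_texts[selected_track] = edited_text  # replace with edited text
    return track_texts
```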
S1208, determining the storage language of the audio and the audio text different from the storage language in the audio.
The storage language here is the language in which each audio text is stored, and may be the language of any country or a local dialect. There are two ways to determine the storage language of the audio:
(1) Determined according to the terminal language set by the user, that is, the language used by the terminal is taken as the storage language of the audio.
(2) Determined according to the user's selection, that is, the language selected by the user is taken as the storage language of the audio.
S1209, translating the audio text different from the storage language in the audio into the storage language to obtain each audio text after the audio is unified into the language.
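Steps S1208–S1209 can be sketched as a filter over (language, text) pairs. The `translate` callable stands in for whatever translation service the terminal would use and is purely an assumption:

```python
def unify_language(segments, storage_lang, translate):
    """Translate every audio text whose language differs from the storage
    language (S1208-S1209); segments are (language, audio_text) pairs and
    translate(text, target_lang) is a stand-in translation function."""
    return [text if lang == storage_lang else translate(text, storage_lang)
            for lang, text in segments]

# Stub translator for illustration only.
demo = unify_language([("mandarin", "ni hao"), ("english", "hello")],
                      "mandarin",
                      lambda text, lang: f"<{lang}> {text}")
# → ["ni hao", "<mandarin> hello"]
```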
S1210, determining the time sequence of each audio text in the audio, and storing each audio text identified by each audio track according to the time sequence.
As described above for the audio track, the information contained in one audio track includes not only the information (audio text) carried by all the sound waves in the track, but also the time at which each sound wave occurs on the track. Each audio text in the audio can therefore be stored in order according to this time information, yielding a text with coherent semantics; this text is the audio information extracted from the audio.
In the embodiment of the present invention, storing the audio text in time sequence here results in the extracted audio information. In some other examples of this embodiment, the audio information extracting method provided by the present invention further includes the following steps:
S1211, obtaining the characters to be extracted.
When communicating through a social application, a user may at times be unable to make a voice or video call (for example, when no audio access is available). The user can then communicate by sending text characters instead, and those characters are the characters to be extracted from the communication.
S1212, determining the time sequence of the characters and each audio text.
S1213, adding characters between the audio texts in time sequence to obtain the extracted audio information.
S1214, sending the audio information to a preset mailbox account or/and a social application account.
In other examples, only audio text of some tracks may be sent to a preset mailbox account or/and a social application account. In some other examples, the extracted audio information may also be used to generate a conference recording.
The embodiment of the invention provides an audio information extraction method that addresses the defect in the prior art that audio information can be obtained only by playing and listening to the audio. The social application is determined from the audio, the sound sources of the audio are then determined within the social application, the audio tracks in the audio are determined from those sound sources, the audio text generated by each track is obtained by analysis, and the audio information extracted from the audio is obtained by storing each audio text in the audio.
It is emphasized that, unlike the previous example in which the tracks are determined based on voiceprint features, this example determines the sound sources of the audio based on the social application and thereby determines the tracks in the audio. This consumes fewer computing resources and takes less time to determine the tracks, providing a better experience for the user.
The embodiment of the present invention further provides a terminal, please refer to fig. 11, which includes a processor 1101, a memory 1102 and a communication bus 1103;
the communication bus 1103 is used for implementing connection communication between the processor 1101 and the memory 1102;
the processor 1101 is configured to execute one or more programs stored in the memory 1102 to implement the steps of the audio information extraction method mentioned in any one of the above-described embodiments.
Embodiments of the present invention also provide a computer-readable storage medium, which stores one or more programs that can be executed by one or more processors to implement the steps of the audio information extraction method mentioned in any one of the above-described embodiments.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no acts or modules are necessarily required of the invention.
In the above embodiments, the description of each embodiment has its own emphasis; for parts of an embodiment not described in detail, reference may be made to the related descriptions of other embodiments. The serial numbers of the embodiments above are merely for description and do not represent the relative merits of the embodiments. Those skilled in the art can devise many other forms without departing from the spirit and scope of the invention as claimed, and these forms all fall within the protection of the invention.

Claims (10)

1. An audio information extraction method, characterized in that the audio information extraction method comprises:
determining audio to be extracted, determining tracks of the audio based on the audio, wherein the audio has at least two tracks;
analyzing the audio to obtain audio texts generated by each audio track;
and storing each audio text in the audio on the basis of each audio track to serve as the extracted audio information.
2. The audio information extraction method of claim 1, wherein the determining the track of the audio based on the audio comprises:
analyzing the audio to obtain voiceprint characteristics of the audio, wherein the audio comprises at least two sound waves and at least has two voiceprint characteristics;
and judging sound waves with the same voiceprint characteristics in the audio to belong to the same audio track, to obtain each audio track in the audio.
3. The audio information extraction method of claim 1, wherein the determining the track of the audio based on the audio comprises:
determining a sound source of the audio according to the audio, wherein the audio comprises at least two sound waves and the audio has at least two sound sources;
and judging sound waves with the same sound source in the audio to belong to the same audio track, to obtain each audio track in the audio.
4. The audio information extraction method of claim 3, wherein the determining a sound source of the audio from the audio comprises:
determining a social application that generated the audio;
and searching each contact person participating in the audio generation in the social application, and determining each contact person as a sound source of the audio.
5. The audio information extraction method of any one of claims 1 to 4, wherein after storing each audio text in the audio based on each of the tracks, the audio information extraction method further comprises:
acquiring characters to be extracted;
determining a time sequence of the characters and each of the audio texts;
adding said characters between each of said audio texts in said temporal order.
6. The audio information extraction method of any one of claims 1-4, wherein after parsing the audio to obtain audio text generated by each of the audio tracks, the audio information extraction method further comprises:
displaying an audio track selection box, wherein the audio track selection box comprises an audio track of the audio and audio text generated by the audio track;
and detecting the selection operation of the audio track in the audio track selection frame, and modifying the audio text generated by the selected audio track.
7. The audio information extraction method of any one of claims 1 to 4, wherein the storing of each audio text in the audio based on each of the tracks comprises:
determining a storage language of the audio and audio text in the audio different from the storage language;
translating the audio text different from the storage language in the audio into the storage language to obtain each audio text after the audio is unified into the language;
determining a temporal order of audio texts in the audio, and storing each audio text identified by each audio track according to the temporal order.
8. The audio information extraction method according to any one of claims 1 to 4, wherein, after storing each audio text in the audio on a per-track basis as the extracted audio information, the audio information extraction method further comprises:
sending the audio information to a preset mailbox account or/and a social application account;
or sending an audio text of a preselected audio track in the audio to a preset mailbox account or/and a social application account;
or selecting at least one conference element from the audio information according to a preset conference recording template, and filling the conference element into the preset conference recording template according to a preset format to generate a conference record.
9. A terminal, characterized in that the terminal comprises a processor, a memory and a communication bus;
the communication bus is used for realizing connection communication between the processor and the memory;
the processor is configured to execute one or more programs stored in the memory to implement the steps of the audio information extraction method of any of claims 1-8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs which are executable by one or more processors to implement the steps of the audio information extraction method according to any one of claims 1 to 8.
CN202010094370.1A 2020-02-15 2020-02-15 Audio information extraction method, terminal and computer readable storage medium Pending CN111415651A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010094370.1A CN111415651A (en) 2020-02-15 2020-02-15 Audio information extraction method, terminal and computer readable storage medium


Publications (1)

Publication Number Publication Date
CN111415651A true CN111415651A (en) 2020-07-14

Family

ID=71492792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010094370.1A Pending CN111415651A (en) 2020-02-15 2020-02-15 Audio information extraction method, terminal and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111415651A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488006A (en) * 2021-07-05 2021-10-08 功夫(广东)音乐文化传播有限公司 Audio processing method and system
CN113823250A (en) * 2021-11-25 2021-12-21 广州酷狗计算机科技有限公司 Audio playing method, device, terminal and storage medium
CN113823250B (en) * 2021-11-25 2022-02-22 广州酷狗计算机科技有限公司 Audio playing method, device, terminal and storage medium
WO2023236794A1 (en) * 2022-06-06 2023-12-14 华为技术有限公司 Audio track marking method and electronic device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination