CN112632321A - Audio file processing method and device and audio file playing method and device - Google Patents

Audio file processing method and device and audio file playing method and device Download PDF

Info

Publication number
CN112632321A
CN112632321A CN201910900442.4A
Authority
CN
China
Prior art keywords
audio
audio file
content
time
piece
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910900442.4A
Other languages
Chinese (zh)
Inventor
王晓涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201910900442.4A priority Critical patent/CN112632321A/en
Publication of CN112632321A publication Critical patent/CN112632321A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/686 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides an audio file processing method and device and an audio file playing method and device, belonging to the field of audio signal processing. The method comprises the following steps: obtaining the start time and end time, within an audio file, of each piece of audio content in the file, where each piece of audio content comprises one or more sentences of audio content; obtaining the text content corresponding to each piece of audio content in the audio file; and associating the start time and end time of each piece of audio content with its corresponding text content to generate the associated text content of the audio file. This helps the user quickly and accurately find and play the audio segment of interest from the audio file.

Description

Audio file processing method and device and audio file playing method and device
Technical Field
The present invention relates to the field of audio signal processing, and in particular, to an audio file processing method and apparatus, and an audio file playing method and apparatus.
Background
In meeting-recording and similar scenarios, audio files are often used to help record meeting content. In the related art, a segment of interest in an audio file must be found by fast-forwarding, dragging a progress bar, and the like. However, the content of interest is rarely found on the first attempt this way, and the search usually has to be repeated because important information was missed, which is inefficient. Alternatively, the audio file can be converted into text and the content of interest found in the text, but the position in the audio file of the audio corresponding to that content cannot then be accurately located.
Disclosure of Invention
An embodiment of the present invention provides an audio file processing method and apparatus, and an audio file playing method and apparatus, which are used to at least solve the foregoing technical problems.
In order to achieve the above object, an embodiment of the present invention provides an audio file processing method, the method comprising: obtaining the start time and end time, within an audio file, of each piece of audio content in the file, where each piece of audio content comprises one or more sentences of audio content; obtaining the text content corresponding to each piece of audio content in the audio file; and associating the start time and end time of each piece of audio content with its corresponding text content to generate the associated text content of the audio file.
Optionally, obtaining the start time and end time of each piece of audio content in the audio file includes: dividing the audio file into a plurality of slices, where the time length of each slice is less than a preset time and each sentence of audio content comprises one or more slices; inputting the slices into a speech transcription engine in chronological order to obtain a text result returned by the speech transcription engine, the text result including: the text content corresponding to each slice, the sequence number of each slice, and a flag indicating whether the slice is the last slice of a sentence of audio content; determining, based on the text result, the specific slice included in each piece of audio content, where the specific slice is the first slice and/or the last slice of that piece; and determining the start time and end time of each piece of audio content based on the sequence numbers of its specific slices and the time length of each slice.
Optionally, obtaining the text content corresponding to each piece of audio content in the audio file includes: determining the text content corresponding to each piece of audio content based on the sequence numbers of the specific slices included in that piece and the text content corresponding to each slice.
Optionally, the preset time is not greater than 200 ms.
Optionally, associating the start time and end time of each piece of audio content with its corresponding text content includes: storing the start time and end time of each piece of audio content at a specific position within the corresponding text content. The method may further comprise one or both of the following: storing the audio file together with the associated text content; and hiding the display of the start time and end time of each piece of audio content.
Correspondingly, an embodiment of the present invention further provides an audio file playing method, where the audio file is processed according to the audio file processing method, and the audio file playing method includes: identifying a keyword input or selected by a user; retrieving the keywords from the associated text content of the audio file; determining the starting time and the ending time of the audio content corresponding to the text content comprising the keywords in the audio file; and playing the audio file based on the start time and the end time.
Correspondingly, an embodiment of the present invention further provides an audio file processing apparatus, where the apparatus includes: the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring the starting time and the ending time of each piece of audio content in an audio file in the audio file, and each piece of audio content comprises one or more sentences of audio content; the second acquisition module is used for acquiring text contents corresponding to each section of audio contents in the audio file; and the association module is used for associating the starting time and the ending time of each section of audio content with the text content corresponding to each section of audio content so as to generate the associated text content of the audio file.
Accordingly, an embodiment of the present invention further provides an audio file playing apparatus, where the audio file is processed according to the audio file processing method, and the audio file playing apparatus includes: the identification module is used for identifying keywords input or selected by a user; the retrieval module is used for retrieving the keywords from the associated text content of the audio file; the determining module is used for determining the starting time and the ending time of the audio content corresponding to the text content comprising the keywords in the audio file; and the playing module is used for playing the audio file based on the starting time and the ending time.
Accordingly, an embodiment of the present invention further provides a machine-readable storage medium storing instructions configured to cause a machine to perform the audio file processing method and/or the audio file playing method described above.
Correspondingly, an embodiment of the present invention further provides an electronic device comprising at least one processor, at least one memory connected to the processor, and a bus, where the processor and the memory communicate with each other through the bus, and the processor is configured to call program instructions in the memory to execute the audio file processing method and/or the audio file playing method described above.
Through the above technical solution, the start time and end time of each piece of audio content of the audio file are associated with the corresponding text content to generate the associated text content, so that the audio segment containing a keyword input or selected by the user can be quickly and accurately located and played from the audio file.
Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:
FIG. 1 shows a schematic flow diagram of an audio file processing method according to an embodiment of the invention;
FIG. 2 is a flow chart illustrating a process of obtaining a start time and an end time of each piece of audio content according to an embodiment of the invention;
FIG. 3 is a flow chart illustrating an audio file playing method according to an embodiment of the invention;
FIG. 4 is a block diagram showing the construction of an audio file processing apparatus according to an embodiment of the present invention;
fig. 5 is a block diagram showing the construction of an audio file playback apparatus according to an embodiment of the present invention; and
fig. 6 shows a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.
Fig. 1 shows a flow diagram of an audio file processing method according to an embodiment of the present invention. As shown in fig. 1, an embodiment of the present invention provides an audio file processing method, where the file to be processed may be a pure audio file or, for example, a video file containing an audio track. The audio file processing method may include steps S110 to S130.
Step S110, acquiring a start time and an end time of each audio content in the audio file.
Each piece of audio content may include one or more sentences of audio content. Take an audio file of the following speech as an example: "I ate breakfast this morning, went out with a fruit knife, ready to peel an apple for my child. I walked to the Changjiang Road intersection to wait for the light to cross the road, and started crossing when the light turned green; at that moment a white car rushed straight through without slowing down, at a speed of at least 60. I dodged past the oncoming car that stopped half a meter ahead of me, and complained out loud." In this speech, the audio separated by punctuation can be taken to be a sentence of audio content, i.e., a short pause in the audio can be recognized as the end of a sentence of audio content. It is to be understood that the embodiments of the present invention are not limited thereto, and one sentence of audio content may include multiple short pauses as needed.
Alternatively, the start time and end time of each piece of audio content can be read off manually from the audio file and then input into the audio file processing apparatus, or they may be obtained automatically, as described later.
Step S120, acquiring text content corresponding to each section of audio content in the audio file.
Any suitable method for converting voice to text can be adopted to obtain the text content corresponding to the audio content. Optionally, the text content of the entire audio file may be first obtained, and then the text content corresponding to each piece of audio content may be obtained from the text content of the entire audio file.
It is understood that the execution sequence of step S110 and step S120 may be arbitrary, and the present invention is not particularly limited.
Step S130, associating the start time and the end time of each audio content with the text content corresponding to each audio content, so as to generate an associated text content of the audio file.
The association may be done in any way. For example, for a piece of audio content, its start time and end time may be placed at a specific position of the corresponding text content, such as the start position, the end position, or a position in the middle of the text (for example, after or before the nth word, where n is any suitable integer).
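As an illustration of this association, a minimal sketch might look like the following, assuming a hypothetical plain-text notation that appends "{start:end}" at the end position of each piece's text; the function name and tuple layout are illustrative, not part of the patent:

```python
def associate(segments):
    """Build associated text content by appending '{start:end}' after each
    piece's text. `segments` is a list of (text, start_s, end_s) tuples."""
    return "".join(f"{text}{{{start}:{end}}}" for text, start, end in segments)

print(associate([("I ate breakfast this morning", 0, 8),
                 ("went out with a fruit knife", 7, 18)]))
# I ate breakfast this morning{0:8}went out with a fruit knife{7:18}
```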
By associating the start time and end time of each piece of audio content of the audio file with its corresponding text content to generate the associated text content, and storing the audio file together with the associated text content, the corresponding audio segment can be quickly and accurately located in the audio file through the time attributes in the text content, based on a keyword input or selected by the user.
In an optional embodiment, the audio file processing method provided in the embodiment of the present invention may further include storing the associated text content of the audio file together with the audio file. For example, the audio file and its associated text content may be stored together in an ES (Elasticsearch) server for indexing; specifically, a storage interface of the ES server may be called to store the associated text content in the ES server. Optionally, the ES server stores the URL (Uniform Resource Locator) of the audio file instead of the entire audio file. When the index is created, the words in the text content can be segmented, so that audio segments of interest to the user can be conveniently retrieved through the associated text content at any time.
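A hedged sketch of the shape such an index document might take (the field names and URL are illustrative assumptions, not the ES API or the patent's schema):

```python
import json

# Hypothetical index document: the audio file's URL is stored instead of the
# audio bytes, next to the associated text content.
doc = {
    "audio_url": "https://example.com/audio/meeting.mp3",
    "associated_text": "I ate breakfast this morning{0:8}went out with a fruit knife{7:18}",
}
body = json.dumps(doc, ensure_ascii=False)  # payload for the storage interface
print(json.loads(body)["audio_url"])
```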
In an alternative embodiment, when the associated text content of an audio file is obtained from, for example, the ES server and displayed on a web page, the display of the start time and end time of each piece of audio content may be hidden so as not to affect the reading of the text. A hidden attribute can be set on the start-time and end-time markers by means of a regular expression or a string search. Taking the associated text content "I went out with a fruit knife{7:18}, ready to peel an apple for my child{17:26}" as an example, the times may be hidden as follows: I went out with a fruit knife<span style="display:none">{7:18}</span>, ready to peel an apple for my child<span style="display:none">{17:26}</span>. When the web page is rendered, only the text content remains visible: "I went out with a fruit knife, ready to peel an apple for my child." The viewing of the text content is thus not affected at all.
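Under the assumption that the time markers follow a "{start:end}" pattern, the hiding step can be sketched with a regular expression (a hypothetical helper, not the patent's implementation):

```python
import re

def hide_times(associated_text: str) -> str:
    """Wrap every '{start:end}' marker in a <span> with display:none so that
    only the plain text is visible when the page is rendered."""
    return re.sub(
        r"\{\d+:\d+\}",
        lambda m: f'<span style="display:none">{m.group(0)}</span>',
        associated_text,
    )

print(hide_times("I went out with a fruit knife{7:18}"))
# I went out with a fruit knife<span style="display:none">{7:18}</span>
```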
In an alternative embodiment, the start time and the end time of each piece of audio content in the audio file may be automatically obtained. As shown in fig. 2, obtaining the start time and the end time of each piece of audio content in the audio file may include the following steps:
step S202, the audio file is divided into a plurality of fragments.
The audio file may be divided using any suitable media splitter, and the time length of each resulting slice may be less than a preset time, which may be any suitable value. Here, the purpose of splitting the audio file is to simulate the audio stream received by a voice input device (e.g., a microphone), so the slice length should be kept as short as possible. Optionally, the preset time may be set to be not greater than 200 ms, and each sentence of audio content may include one or more slices.
For example, if the length of the audio file is 2 minutes and the preset time is 200 ms, the audio file will be divided into 600 slices. Optionally, the slice length may be set according to the requirements of different types or vendors of speech transcription engines.
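For raw PCM audio, the splitting step can be sketched as follows (a simplification: real audio files have containers and codecs, so a media splitter library would be used in practice):

```python
def split_pcm(pcm: bytes, sample_rate: int, sample_width: int, slice_ms: int = 200):
    """Split raw mono PCM audio into fixed-length slices; the last slice may
    be shorter than slice_ms."""
    bytes_per_slice = sample_rate * sample_width * slice_ms // 1000
    return [pcm[i:i + bytes_per_slice] for i in range(0, len(pcm), bytes_per_slice)]

# A 2-minute recording at 16 kHz, 16-bit mono yields 600 slices of 200 ms.
audio = bytes(16000 * 2 * 120)  # 120 s of silence
print(len(split_pcm(audio, sample_rate=16000, sample_width=2)))  # 600
```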
Step S204, inputting the plurality of fragments into a voice transcription engine according to a time sequence to obtain a text result returned by the voice transcription engine.
The text results returned by the speech transcription engine may include: text content of the fragment, sequence number of the fragment, and a mark of whether the fragment is the last fragment of a sentence of audio content.
For example, the slices may be sequentially numbered chronologically, and if the audio file is divided into 600 slices, the sequential number of each slice may be 0 to 599. And inputting the fragments into the voice transcription engine according to the sequence numbers, namely transmitting the audio stream to the voice transcription engine and then receiving the text result returned by the voice transcription engine.
The sequence number of a slice can be carried, for example, by a parameter seq. The flag indicating whether a slice is the last slice of a sentence of audio content can be carried by a parameter end: when end is 1, the slice is the last slice of a sentence of audio content; when end is 0, it is not. After the last slice of a sentence is identified, punctuation may be added after its text content so that sentences are easy to recognize in the text.
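The grouping of slices into sentences via the seq and end parameters can be sketched as follows (the result dicts imitate the parameters above; a real engine's response schema will differ):

```python
def group_sentences(results):
    """Group per-slice transcription results into sentences. Each result has
    'seq' (slice sequence number), 'text' (the slice's text, possibly empty)
    and 'end' (1 if the slice is the last slice of a sentence, else 0)."""
    sentences = []
    current_text, first_seq = "", None
    for r in sorted(results, key=lambda r: r["seq"]):
        if first_seq is None:
            first_seq = r["seq"]
        current_text += r["text"]
        if r["end"] == 1:
            sentences.append({"text": current_text, "first": first_seq, "last": r["seq"]})
            current_text, first_seq = "", None
    return sentences

results = [
    {"seq": 0, "text": "I ate breakfast", "end": 0},
    {"seq": 38, "text": " this morning", "end": 1},
    {"seq": 39, "text": "went out", "end": 0},
    {"seq": 88, "text": " with a fruit knife", "end": 1},
]
print(group_sentences(results))
```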
Step S206, based on the text result, determining the specific segment included in each piece of audio content.
The specific slice is the chronologically first and/or last slice included in each piece of audio content. As mentioned above, each piece of audio content may include one or more sentences of audio content, and the last slice of each piece can be determined from the flag, described in step S204, indicating whether a slice is the last slice of a sentence. The first slice of a piece of audio content is the slice immediately following the last slice of the previous piece.
Step S208, determining a start time and an end time of each piece of audio content based on the sequence number of the specific slice included in each piece of audio content and the time length of each slice.
The start time of the audio file is 0, and its end time is the total time length of the file. The start time of the first piece of audio content can therefore default to 0, and the end time of the last piece can default to the total time length, so neither needs to be calculated.
In an alternative embodiment, the specific slice may be the first slice included in a piece of audio content. The start time of the i-th piece of audio content (i a positive integer) is then T_b(i) = seq_f(i) × T, where seq_f(i) is the sequence number of the first slice of the i-th piece and T is the time length of each slice. The end time of the i-th piece of audio content can be calculated from the start time of the next piece (the (i+1)-th piece): T_e(i) = seq_f(i+1) × T, where seq_f(i+1) is the sequence number of the first slice of the (i+1)-th piece. Optionally, the calculated start time may be rounded down and the calculated end time rounded up, so that the interval between the start time and end time completely covers the piece of audio content.
In an alternative embodiment, the specific slice may be the last slice included in a piece of audio content. The end time of the i-th piece of audio content (i a positive integer) is then T_e(i) = seq_l(i) × T, where seq_l(i) is the sequence number of the last slice of the i-th piece and T is the time length of each slice. The start time of the i-th piece of audio content can be calculated from the end time of the previous piece (the (i-1)-th piece): T_b(i) = seq_l(i-1) × T, where seq_l(i-1) is the sequence number of the last slice of the (i-1)-th piece. Optionally, the calculated start time may be rounded down and the calculated end time rounded up, so that the interval between the start time and end time completely covers the piece of audio content.
In an alternative embodiment, the specific slices may be the first slice and the last slice included in a piece of audio content. The start time of the i-th piece of audio content (i a positive integer) is then T_b(i) = seq_f(i) × T, where seq_f(i) is the sequence number of the first slice of the i-th piece and T is the time length of each slice. The end time of the i-th piece of audio content is T_e(i) = seq_l(i) × T, where seq_l(i) is the sequence number of the last slice of the i-th piece. Optionally, the calculated start time may be rounded down and the calculated end time rounded up, so that the interval between the start time and end time completely covers the piece of audio content.
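For the variant where the specific slice is the last slice, the time calculation with the rounding just described can be sketched as:

```python
import math

def sentence_times(prev_last_seq: int, last_seq: int, slice_ms: int = 200):
    """Start/end time in whole seconds of a piece of audio content: the start
    time comes from the last slice of the previous piece (rounded down) and
    the end time from the piece's own last slice (rounded up)."""
    start = math.floor(prev_last_seq * slice_ms / 1000)
    end = math.ceil(last_seq * slice_ms / 1000)
    return start, end

print(sentence_times(0, 38))    # (0, 8)
print(sentence_times(38, 88))   # (7, 18)
```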
Further, the text content corresponding to each audio content may be determined based on the sequence number of the specific segment included in each audio content in the audio file and the text content corresponding to each segment. The text content corresponding to each piece of audio content and the start time and the end time of each piece of audio content can be synchronously acquired. After the start time and the end time of each audio content in the audio file and the corresponding text content are obtained, the start time and the end time of each audio content and the corresponding text content can be associated to generate the associated text content of the audio file, so that the corresponding audio content segment can be conveniently searched based on the keywords related to the text.
The audio file processing method provided in the embodiment of the present invention is described below, taking as an example an audio file of the speech: "I ate breakfast this morning, went out with a fruit knife, ready to peel an apple for my child. I walked to the Changjiang Road intersection to wait for the light to cross the road, and started crossing when the light turned green; at that moment a white car rushed straight through without slowing down, at a speed of at least 60. I dodged past the oncoming car that stopped half a meter ahead of me, and complained out loud." The time length of the audio file is 2 min.
In this embodiment, each piece of audio content may include one sentence of audio content, the time length of each slice may be set to 200 ms, the specific slice is the chronologically last slice included in each piece of audio content, the start time and end time are in seconds, the end time is rounded up, and the start time is rounded down. The audio file is processed as follows:
the audio file is divided. It can be known that the audio file can be divided into 600 slices, and the 600 slices are numbered 0-599 according to the chronological order.
The 600 slices are input into the speech transcription engine in chronological order to obtain the text result returned by the engine, including the text content of each slice, the sequence number of each slice, and the flag indicating whether the slice is the last slice of a sentence of audio content. Since each piece of audio content comprises one sentence of audio content in this embodiment, the flag can also be regarded as indicating whether the slice is the last slice of a piece of audio content.
The last slice of each piece of audio content may be determined based on the text results returned by the speech transcription engine; specifically, the last slice can be identified from the flag.
The start time and end time of each piece of audio content and its corresponding text content are determined based on the sequence number of the specific slice included in each piece and the time length of each slice. For example, suppose the speech transcription engine returns the recognized text "I ate breakfast this morning" with seq_l = 38. The end time of this piece of audio content is then T_e = 38 × 200 ms / 1000 = 7.6 s, rounded up to 8 s, and its start time is 0 s. Suppose the recognized result for the next piece is "went out with a fruit knife" with seq_l = 88; its end time is T_e = 88 × 200 ms / 1000 = 17.6 s, rounded up to 18 s, and its start time, calculated from the position of the last slice of the previous piece, is T_b = 38 × 200 ms / 1000 = 7.6 s, rounded down to 7 s. Proceeding in the same way, the start time and end time of every piece of audio content and the corresponding text content are determined.
The start time and end time of each piece of audio content are then associated with its corresponding text content to generate the associated text content of the audio file. For example, the start time and end time of each piece may be placed at the end position of that piece's text, i.e., associated using the rule: text content{start time:end time}. The associated text content is then: "I ate breakfast this morning{0:8}, went out with a fruit knife{7:18}, ready to peel an apple for my child{17:26}, walked to the Changjiang Road intersection to wait for the light to cross the road{25:34}, started crossing when the light turned green{34:43}, at that moment a white car rushed straight through without slowing down{43:80}, at a speed of at least 60{79:89}, I dodged past the oncoming car that stopped half a meter ahead of me{89:113}, and complained out loud!{112:120}".
Thereafter, the audio file and the associated text content of the audio file may be correspondingly stored. For example, it may be stored in the ES server for retrieval.
It can be understood that the audio file processing method provided by the embodiment of the present invention is applicable to audio files in any language, and the language of the audio need not match the language of the text content; for example, the audio may be in English while the transcribed text content is in Chinese. The user can therefore search in a familiar language.
Fig. 3 is a flowchart illustrating an audio file playing method according to an embodiment of the present invention. As shown in fig. 3, an embodiment of the present invention further provides an audio file playing method, where the file may be a pure audio file or a video file containing an audio track, and the audio file has been processed according to the audio file processing method described in any embodiment of the present invention. The audio file playing method may include steps S310 to S340.
In step S310, keywords input or selected by the user are identified.
Optionally, a search interface may be provided for the user, in which the user can input keywords to search. Alternatively, the associated text content corresponding to the audio file may be displayed, and the user can select a keyword from the displayed text content to search the audio file.
The keyword may be a single word, a phrase, a short sentence, a single sentence, or multiple sentences of text content.
Step S320, retrieving the keywords from the associated text content of the audio file.
In the case where the associated text content is not currently displayed, after a keyword input by the user is identified, the keyword may be retrieved from the associated text content of the audio file to determine whether the keyword is included. In an alternative case, if a keyword is retrieved, the associated text content may be displayed and the keyword marked upon display, e.g., highlighted, etc.
In the case where the associated text content is currently displayed, after identifying keywords entered or selected in the text content by the user, the keywords may be retrieved from the associated text content of the audio file. If a keyword is retrieved, the keyword may be tagged, e.g., highlighted, etc.
For example, if the keyword input or selected by the user is "fruit knife", the keyword may be marked within the associated text content, for example in the form "I ate breakfast this morning {0:7}, went out carrying a <color:red>fruit knife</color> {7:18}", where <color:red>fruit knife</color> is the marked keyword; when displayed, neither the time attribute nor the mark attribute is shown.
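The marking step can be sketched as follows; the <color:red> tag syntax is taken from the example above and is an assumed notation rather than one mandated by the method:

```python
def mark(associated_text, keyword, color="red"):
    """Wrap every occurrence of the keyword in a color tag. On display,
    both the tag and the {start:end} time attributes would be hidden."""
    return associated_text.replace(keyword, f"<color:{color}>{keyword}</color>")

text = "went out carrying a fruit knife {7:18}"
print(mark(text, "fruit knife"))
# went out carrying a <color:red>fruit knife</color> {7:18}
```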
Step S330, determining a start time and an end time of the audio content corresponding to the text content including the keyword in the audio file.
Since the start time and the end time of a piece of audio content can be stored at a specific position of its text content, after the keyword is retrieved, that specific position can be searched for starting from the keyword. For example, if the start time and the end time of a piece of audio content are stored at the end position of the corresponding text content, then searching onward from the keyword toward the end of the text finds the first occurrence of a start time and an end time, which are the start time and the end time of the audio content corresponding to the text content that includes the keyword. For example, for the keyword "fruit knife" selected by the user, the start time and the end time of the corresponding audio content in the audio file are obtained as the 7th second and the 18th second, respectively.
Optionally, if the keyword input or selected by the user includes a plurality of sentences of text content, the start time of the audio content corresponding to the text content including the keyword in the audio file is the start time of the audio content corresponding to the start of the plurality of sentences of text content, and the end time is the end time of the audio content corresponding to the end of the plurality of sentences of text content.
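Steps S320 and S330 can be sketched as a forward search from the keyword for the stored time attributes, assuming (as in the example above) that each piece's {start:end} is stored at its end position:

```python
import re

TIME_ATTR = re.compile(r"\{(\d+):(\d+)\}")

def locate(associated_text, keyword):
    """Return (start, end) seconds of the audio corresponding to the keyword:
    the first time attribute after the keyword's start gives the start time,
    and the first one after the keyword's end gives the end time. For a
    keyword within a single piece both searches hit the same attribute;
    for a keyword spanning several pieces they differ."""
    i = associated_text.find(keyword)
    if i < 0:
        return None
    first = TIME_ATTR.search(associated_text, i)
    last = TIME_ATTR.search(associated_text, i + len(keyword))
    return int(first.group(1)), int(last.group(2))

text = "I ate breakfast this morning{0:8}went out carrying a fruit knife{7:18}"
print(locate(text, "fruit knife"))  # (7, 18)
```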
Step S340, playing the audio file based on the start time and the end time.
Playback starts from the start time in the audio file and is paused at the end time; the user may then choose whether to continue playing according to need.
Thus, the corresponding audio content can be played according to the keywords input or selected by the user.
The audio file processing method and the audio file playing method provided by the embodiments of the present invention are particularly suitable for public security service scenarios. Audio and video recordings are generally made and kept during interrogation; during case handling the written record is generally read, and the original audio and video files can be consulted if necessary. With the implementation of the new procedural regulations for handling administrative cases by public security organs, a fast-handling process has been added in which no written documents are produced and only audio and video records are kept, generating a large number of audio and video files. For the legal departments of public security organs, a case can then only be understood by viewing the audio and video files when reviewing case files, and content of interest cannot be quickly located within them, so efficiency is low. With the audio file processing method and the audio file playing method provided by the embodiments of the present invention, content of interest can be quickly and accurately retrieved and played from audio and video files, significantly improving office efficiency.
Fig. 4 is a block diagram showing the structure of an audio file processing apparatus according to an embodiment of the present invention. As shown in fig. 4, an embodiment of the present invention further provides an audio file processing apparatus, where the file to be processed may be a pure audio file or a video file containing audio, and the like. The audio file processing apparatus may include: a first obtaining module 410, configured to obtain the start time and the end time of each piece of audio content in an audio file, where each piece of audio content includes one or more sentences of audio content; a second obtaining module 420, configured to obtain the text content corresponding to each piece of audio content in the audio file; and an association module 430, configured to associate the start time and the end time of each piece of audio content with the text content corresponding to that piece, so as to generate the associated text content of the audio file. In this way, based on a keyword input or selected by the user, the corresponding audio content segment can be quickly and accurately located in the audio file through the time attributes in the text content.
Optionally, the association module 430 may store the start time and the end time of each piece of audio content at a specific position of the text content corresponding to each piece of audio content to implement the association, where the specific position may be, for example, the start position and the end position of the text content corresponding to each piece of audio content.
In some optional embodiments, the first obtaining module may obtain the start time and the end time of each piece of audio content according to the following steps: and dividing the audio file into a plurality of fragments, wherein the time length of each fragment is less than a preset time, and the preset time is not more than 200ms for example. Each sentence of audio content comprises one or more of the slices; inputting the plurality of fragments into a voice transcription engine according to a time sequence to obtain a text result returned by the voice transcription engine, wherein the text result comprises: text content corresponding to the fragments, sequence numbers of the fragments and a mark indicating whether the fragments are the last fragment of a sentence of audio content; determining a specific segment included in each piece of audio content based on the text result, wherein the specific segment is a first segment and/or a last segment included in each piece of audio content; and determining the starting time and the ending time of each piece of audio content based on the sequence number of the specific slice included in each piece of audio content and the time length of each slice. Further optionally, the second obtaining module may determine the text content corresponding to each audio content based on the sequence number of the specific segment included in each audio content in the audio file and the text content corresponding to each segment.
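The timing computation performed by the first obtaining module can be sketched as follows, assuming 0-based slice sequence numbers and a fixed slice length (both assumptions, since the method only bounds the slice length at 200 ms):

```python
SLICE_MS = 200  # preset slice length; the method requires it to be at most 200 ms

def sentence_times(first_seq, last_seq, slice_ms=SLICE_MS):
    """Derive a sentence's start and end time (in seconds) from the
    sequence numbers of its first and last slices."""
    start = first_seq * slice_ms / 1000.0
    end = (last_seq + 1) * slice_ms / 1000.0
    return start, end

# A sentence covering slices 35..89 at 200 ms per slice:
print(sentence_times(35, 89))  # (7.0, 18.0)
```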
In some optional embodiments, the audio file processing apparatus provided in the embodiments of the present invention may further include a storage module, configured to correspondingly store the audio file and the associated text content. Further optionally, when the associated text content is displayed, the display of the start time and the end time of each piece of audio content may be hidden so as not to affect the viewing effect of the associated text content.
The specific working principle and benefits of the audio file processing apparatus provided by the embodiment of the present invention are the same as those of the audio file processing method provided by the embodiment of the present invention, and will not be described herein again.
Fig. 5 is a block diagram showing the structure of an audio file playing apparatus according to an embodiment of the present invention. As shown in fig. 5, an embodiment of the present invention further provides an audio file playing apparatus, where the file to be played may be a pure audio file or a video file containing audio, and the like, and the audio file has been processed according to the audio file processing method of any embodiment of the present invention. The audio file playing apparatus may include: an identification module 510, configured to identify a keyword input or selected by a user; a retrieval module 520, configured to retrieve the keyword from the associated text content of the audio file; a determining module 530, configured to determine the start time and the end time of the audio content in the audio file corresponding to the text content that includes the keyword; and a playing module 540, configured to play the audio file based on the start time and the end time. The apparatus thus plays the corresponding audio content according to the keyword input or selected by the user.
The specific working principle and benefits of the audio file playing apparatus provided by the embodiment of the present invention are the same as those of the audio file playing method provided by the embodiment of the present invention, and will not be described herein again.
The audio file processing apparatus comprises a processor and a memory. The first obtaining module, the second obtaining module, the association module, the storage module, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions. The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the audio file processing method according to any embodiment of the present invention is executed by adjusting kernel parameters.
The audio file playing apparatus comprises a processor and a memory. The identification module, the retrieval module, the determining module, the playing module, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions. The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the audio file playing method according to any embodiment of the present invention is executed by adjusting kernel parameters.
Accordingly, an embodiment of the present invention further provides a machine-readable storage medium, on which instructions are stored, the instructions being configured to cause a machine to perform: the audio file processing method according to any embodiment of the present invention and/or the audio file playing method according to any embodiment of the present invention.
The embodiment of the invention provides a processor, which is used for running a program, wherein the program executes the following steps when running: the audio file processing method according to any embodiment of the present invention and/or the audio file playing method according to any embodiment of the present invention.
An embodiment of the present invention provides an electronic device, as shown in fig. 6, an electronic device 70 includes at least one processor 701, and at least one memory 702 and a bus 703 that are connected to the processor 701; the processor 701 and the memory 702 complete mutual communication through a bus 703; the processor 701 is configured to call program instructions in the memory 702 to execute the audio file processing method according to any embodiment of the present invention and/or the audio file playing method according to any embodiment of the present invention. The electronic equipment of the embodiment of the invention can be a server, a PC, a PAD, a mobile phone and the like.
The present application further provides a computer program product adapted to perform a program for initializing the following method steps when executed on a data processing device:
a method of audio file processing, the method comprising: acquiring the starting time and the ending time of each audio content in an audio file in the audio file, wherein each audio content comprises one or more sentences of audio content; acquiring text content corresponding to each section of audio content in an audio file; and associating the starting time and the ending time of each section of audio content with the text content corresponding to each section of audio content to generate the associated text content of the audio file.
The acquiring the start time and the end time of each piece of audio content in the audio file comprises: dividing the audio file into a plurality of fragments, wherein the time length of each fragment is less than the preset time, and each sentence of audio content comprises one or more fragments; inputting the plurality of fragments into a voice transcription engine according to a time sequence to obtain a text result returned by the voice transcription engine, wherein the text result comprises: text content corresponding to the fragments, sequence numbers of the fragments and a mark indicating whether the fragments are the last fragment of a sentence of audio content; determining a specific segment included in each piece of audio content based on the text result, wherein the specific segment is a first segment and/or a last segment included in each piece of audio content; and determining the starting time and the ending time of each piece of audio content based on the sequence number of the specific slice included in each piece of audio content and the time length of each slice.
The acquiring of the text content corresponding to each section of audio content in the audio file includes: determining the text content corresponding to each piece of audio content based on the sequence number of the specific segment included in each piece of audio content in the audio file and the text content corresponding to each segment.
The preset time is not more than 200 ms.
The associating the start time and the end time of each piece of audio content and the text content corresponding to each piece of audio content comprises: storing a start time and an end time of the each piece of audio content at a particular location of text content to which the each piece of audio content corresponds, and/or the method further comprises one or more of: correspondingly storing the audio file and the associated text content; or to hide the display of the start time and the end time of said each piece of audio content.
An audio file playing method, wherein the audio file is processed according to the audio file processing method, and the audio file playing method comprises the following steps: identifying a keyword input or selected by a user; retrieving the keywords from the associated text content of the audio file; determining the starting time and the ending time of the audio content corresponding to the text content comprising the keywords in the audio file; and playing the audio file based on the start time and the end time.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. An audio file processing method, characterized in that the method comprises:
acquiring the starting time and the ending time of each audio content in an audio file in the audio file, wherein each audio content comprises one or more sentences of audio content;
acquiring text content corresponding to each section of audio content in an audio file; and
and associating the starting time and the ending time of each section of audio content with the text content corresponding to each section of audio content to generate the associated text content of the audio file.
2. The audio file processing method of claim 1, wherein the obtaining the start time and the end time of each piece of audio content in the audio file comprises:
dividing the audio file into a plurality of fragments, wherein the time length of each fragment is less than the preset time, and each sentence of audio content comprises one or more fragments;
inputting the plurality of fragments into a voice transcription engine according to a time sequence to obtain a text result returned by the voice transcription engine, wherein the text result comprises: text content corresponding to the fragments, sequence numbers of the fragments and a mark indicating whether the fragments are the last fragment of a sentence of audio content;
determining a specific segment included in each piece of audio content based on the text result, wherein the specific segment is a first segment and/or a last segment included in each piece of audio content; and
determining a start time and an end time of the each piece of audio content based on the sequence number of the specific slice included in the each piece of audio content and the time length of each slice.
3. The audio file processing method according to claim 2, wherein the obtaining of the text content corresponding to each piece of audio content in the audio file comprises:
determining the text content corresponding to each piece of audio content based on the sequence number of the specific segment included in each piece of audio content in the audio file and the text content corresponding to each segment.
4. The audio file processing method according to claim 2 or 3, wherein the preset time is not more than 200 ms.
5. The audio file processing method according to claim 1,
the associating the start time and the end time of each piece of audio content and the text content corresponding to each piece of audio content comprises: storing the start time and the end time of each piece of audio content at a specific position of the text content corresponding to each piece of audio content, and/or
The method also includes one or more of: correspondingly storing the audio file and the associated text content; or to hide the display of the start time and the end time of said each piece of audio content.
6. An audio file playing method, wherein the audio file is processed according to the audio file processing method of any one of claims 1 to 5, the audio file playing method comprising:
identifying a keyword input or selected by a user;
retrieving the keywords from the associated text content of the audio file;
determining the starting time and the ending time of the audio content corresponding to the text content comprising the keywords in the audio file; and
playing the audio file based on the start time and the end time.
7. An audio file processing apparatus, characterized in that the apparatus comprises:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring the starting time and the ending time of each piece of audio content in an audio file in the audio file, and each piece of audio content comprises one or more sentences of audio content;
the second acquisition module is used for acquiring text contents corresponding to each section of audio contents in the audio file; and
and the association module is used for associating the starting time and the ending time of each section of audio content with the text content corresponding to each section of audio content so as to generate the associated text content of the audio file.
8. An audio file playback apparatus that performs processing of an audio file according to the audio file processing method of any one of claims 1 to 5, the audio file playback apparatus comprising:
the identification module is used for identifying keywords input or selected by a user;
the retrieval module is used for retrieving the keywords from the associated text content of the audio file;
the determining module is used for determining the starting time and the ending time of the audio content corresponding to the text content comprising the keywords in the audio file; and
and the playing module is used for playing the audio file based on the starting time and the ending time.
9. A machine-readable storage medium having instructions stored thereon for causing a machine to perform: the audio file processing method according to any one of claims 1 to 5 and/or the audio file playing method according to claim 6.
10. An electronic device comprising at least one processor, at least one memory connected to the processor, and a bus; the processor and the memory complete mutual communication through the bus; the processor is configured to invoke program instructions in the memory to perform an audio file processing method according to any one of claims 1 to 5 and/or an audio file playing method according to claim 6.
CN201910900442.4A 2019-09-23 2019-09-23 Audio file processing method and device and audio file playing method and device Pending CN112632321A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910900442.4A CN112632321A (en) 2019-09-23 2019-09-23 Audio file processing method and device and audio file playing method and device


Publications (1)

Publication Number Publication Date
CN112632321A true CN112632321A (en) 2021-04-09

Family

ID=75282553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910900442.4A Pending CN112632321A (en) 2019-09-23 2019-09-23 Audio file processing method and device and audio file playing method and device

Country Status (1)

Country Link
CN (1) CN112632321A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114822506A (en) * 2022-04-15 2022-07-29 广州易而达科技股份有限公司 Message broadcasting method and device, mobile terminal and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105338424A (en) * 2015-10-29 2016-02-17 努比亚技术有限公司 Video processing method and system
CN105975568A (en) * 2016-04-29 2016-09-28 腾讯科技(深圳)有限公司 Audio processing method and apparatus
CN108174280A (en) * 2018-01-18 2018-06-15 湖南快乐阳光互动娱乐传媒有限公司 Audio and video online playing method and system
US20180226105A1 (en) * 2017-02-09 2018-08-09 Juant Inc. Using sharding to generate virtual reality content
CN109246472A (en) * 2018-08-01 2019-01-18 平安科技(深圳)有限公司 Video broadcasting method, device, terminal device and storage medium


Similar Documents

Publication Publication Date Title
US9032429B1 (en) Determining importance of scenes based upon closed captioning data
US8751502B2 (en) Visually-represented results to search queries in rich media content
US8132103B1 (en) Audio and/or video scene detection and retrieval
CN108520046B (en) Method and device for searching chat records
US9972340B2 (en) Deep tagging background noises
KR101916874B1 (en) Apparatus, method for auto generating a title of video contents, and computer readable recording medium
US20140164371A1 (en) Extraction of media portions in association with correlated input
US8719025B2 (en) Contextual voice query dilation to improve spoken web searching
CN112632326B (en) Video production method and device based on video script semantic recognition
CN108170294B (en) Vocabulary display method, field conversion method, client, electronic equipment and computer storage medium
JP2003289387A (en) Voice message processing system and method
CN106033418A (en) A voice adding method and device, a voice play method and device, a picture classifying method and device, and a picture search method and device
CN104349173A (en) Video repeating method and device
US20140163956A1 (en) Message composition of media portions in association with correlated text
CN112632321A (en) Audio file processing method and device and audio file playing method and device
US20150221114A1 (en) Information processing apparatus, information processing method, and program
CN109710844A (en) The method and apparatus for quick and precisely positioning file based on search engine
CN114730355B (en) Using closed captioning as parallel training data for closed captioning customization systems
KR101902784B1 (en) Metohd and apparatus for managing audio data using tag data
CN106411975B (en) Data output method and device and computer readable storage medium
US20240193207A1 (en) Organizing media content items utilizing detected scene types
CN109599097B (en) Method and device for positioning homophone words
KR102472194B1 (en) System for Analyzing Personal Media Contents using AI and Driving method thereof
CN118175347A (en) Processing method, device, equipment and system for recorded video
US20200210476A1 (en) Personalized video and memories creation based on enriched images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination