CN115136233B - Multi-modal rapid transcription and labeling system based on self-built template

Publication number: CN115136233B (granted publication; application published as CN115136233A)
Application number: CN202280002307.8A
Authority: CN (China)
Inventor: Li Bin (李斌)
Applicant and assignee: Hunan Normal University
Original language: Chinese (zh)
Legal status: Active

Classifications

    • G10L 15/26: Speech recognition; speech-to-text systems (Physics; speech analysis techniques or speech synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding)
    • G06F 9/451: Execution arrangements for user interfaces (Physics; computing; electric digital data processing; arrangements for program control; arrangements for executing specific programs)


Abstract

The application discloses a multi-modal rapid transcription and labeling system based on a self-built template. In the system, a first acquisition unit acquires the project file corresponding to a media file; a second acquisition unit acquires the audio data of the media file according to the directory recorded in the project file; a segmentation unit segments the audio data according to its amplitude to obtain sentence segment data of the audio data; a display unit displays the sentence segment data on an operation interface, which provides a display interface and boundary axis controls; a processing unit, in response to editing operations on the boundary axis controls, performs boundary adjustment or sentence segment merging on the sentence segment data to obtain processed sentence segment data, on which speech recognition is then performed to obtain a transcribed text; a transcription unit updates the project file according to the transcribed text; and, when the updated project file is played on the display interface, a playing unit displays the media file together with the text segment of the transcribed text corresponding to the current playing progress of the media file.

Description

Multi-modal rapid transcription and labeling system based on self-built template
Technical Field
The present application relates to the technical field of speech processing, and in particular to a multi-modal rapid transcription and labeling method based on a self-built template, a multi-modal rapid transcription and labeling system based on a self-built template, and a storage medium.
Background
With the development of computer technology, speech recognition is used ever more widely. Speech recognition identifies the content of collected speech, i.e., converts a digital speech signal into the corresponding text.
Speech transcription converts speech into text. It covers both simple single-speaker transcription and complex multi-speaker transcription, such as conference transcription, court trial transcription, and classroom transcription.
However, existing speech transcription and labeling tools cannot build language templates and extend poorly. They also do not support rapid merging of sentence segments or fine adjustment of segment boundaries, and therefore cannot meet the requirements of many practical scenarios, for example: video subtitle (SRT) production, MP3 lyric (LRC) production, transcription of recordings of all kinds, language listening teaching, audio-visual speaking teaching, spoken corpus construction, multimedia resource library construction, situational language research, and multi-modal classroom teaching research.
Disclosure of Invention
The embodiments of the present application provide a multi-modal rapid transcription and labeling method based on a self-built template, a multi-modal rapid transcription and labeling system based on a self-built template, and a storage medium. They offer a simple and convenient way to transcribe and label speech: transcription and labeling can be carried out through self-built language templates, sentence segments can be merged quickly, and segment boundaries can be finely adjusted, which improves transcription and labeling efficiency and adapts to the requirements of a wide range of scenarios.
In one aspect, a multi-modal rapid transcription and labeling method based on a self-built template is provided. The method comprises: acquiring the project file corresponding to a media file to be processed; acquiring the audio data of the media file according to the directory recorded in the project file; segmenting the audio data according to its amplitude to obtain sentence segment data of the audio data; displaying the sentence segment data on an operation interface, the operation interface providing a display interface and boundary axis controls; in response to an editing operation on a boundary axis control, performing boundary adjustment or sentence segment merging on the sentence segment data to obtain processed sentence segment data; performing speech recognition on the processed sentence segment data to obtain a transcribed text; updating the project file according to the transcribed text to obtain an updated project file that carries the transcribed text; and, when the updated project file is played on the display interface, displaying the media file together with the text segment of the transcribed text corresponding to the current playing progress of the media file.
In some embodiments, performing boundary adjustment or sentence segment merging in response to an editing operation on the boundary axis control comprises: in response to a first editing operation on the active end of the first boundary axis control of an active sentence segment in the sentence segment data, controlling the active end of the first boundary axis control to move to a first position; determining whether a second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, the second boundary axis control corresponding to a second sentence segment that is adjacent to the active sentence segment; and, if such a second boundary axis control exists at the first position, merging the active sentence segment with the second sentence segment.
In some embodiments, after the determination, the method further comprises: if no second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, adjusting the boundary of the active sentence segment according to the first position.
In some embodiments, performing boundary adjustment or sentence segment merging in response to an editing operation on the boundary axis control comprises: in response to a second editing operation on the active end of the first boundary axis control of an active sentence segment in the sentence segment data, controlling the active end of the first boundary axis control to move to a second position; determining whether a third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, the third boundary axis control corresponding to a third sentence segment that is not adjacent to the active sentence segment; and, if such a third boundary axis control exists at the second position, merging the active sentence segment, the third sentence segment, and the intermediate sentence segments between them.
In some embodiments, after the determination, the method further comprises: if no third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, determining whether the target area between the position of the stationary end of the first boundary axis control and the second position overlaps any intermediate sentence segment; if the target area overlaps no intermediate sentence segment, adjusting the boundary of the active sentence segment according to the second position; or, if the target area overlaps at least one intermediate sentence segment, merging the active sentence segment with all intermediate sentence segments that overlap the target area.
In some embodiments, segmenting the audio data according to its amplitude to obtain sentence segment data comprises: segmenting the audio data according to the relation between a noise amplitude threshold and the amplitude of the audio data.
In some embodiments, this comprises: acquiring initial segment data of the audio data; determining whether the average amplitude within the current segment of the initial segment data is greater than the noise amplitude threshold; if it is, marking the current segment as a voiced segment; trimming the sentence segment start point and end point of the audio points within the current voiced segment to remove silence or noise from it; if the start position of the trimmed current segment coincides with the end position of the previous segment, merging the trimmed current segment into the previous segment; if it does not, marking the trimmed current segment as a new segment; and traversing the initial segment data in this manner to obtain the sentence segment data of the audio data.
In some embodiments, acquiring the initial segment data comprises: performing initial segmentation of the audio data according to a preset language template to obtain the initial segment data.
In some embodiments, acquiring the project file corresponding to the media file to be processed comprises: acquiring the media file to be processed; detecting whether a corresponding project file has already been created for the media file; if no corresponding project file has been created, creating the project file for the media file based on a template file; or, if a corresponding project file has been created, acquiring the project file already created for the media file.
In some embodiments, the method further comprises: in response to an export instruction carrying a target file type, exporting from the project file an export file of the target file type, the target file type being one of a set of preset file types; a sketch of such an export routine appears below.
In some embodiments, the method further comprises: acquiring an import file in response to an import instruction; and importing the import file into the project file when its file type is one of the preset file types.
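As an illustration of exporting to one of the preset file types, the following minimal Python sketch writes transcribed sentence segments as an SRT subtitle file. It is an illustration only, not the patent's implementation; the function names and the (start, end, text) tuple representation of the entries are assumptions.

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as the standard SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def export_srt(entries) -> str:
    """entries: iterable of (start_seconds, end_seconds, text), one per sentence segment."""
    blocks = []
    for i, (start, end, text) in enumerate(entries, 1):
        blocks.append(f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n")
    return "\n".join(blocks)
```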
In some embodiments, displaying the sentence segment data on the operation interface comprises: displaying, on the operation interface, the waveform information of each sentence segment and the time axis information corresponding to that waveform information.
In some embodiments, the method further comprises: hiding the sentence segment waveform information and the time axis information on the operation interface in response to a waveform hiding instruction.
In some embodiments, the method further comprises: in response to a breakpoint insertion operation on a target sentence segment in the sentence segment data, inserting a breakpoint into the boundary axis control of the target sentence segment, so that the target sentence segment can be split at the breakpoint.
In some embodiments, the transcribed text comprises a text segment corresponding to each sentence segment in the sentence segment data, and after the speech recognition the method further comprises: modifying a target text segment of the transcribed text in response to a modification instruction for it, to obtain a modified transcribed text, the target text segment being at least one text segment of the transcribed text.
In some embodiments, the method further comprises: labeling the target text segment in response to a labeling instruction for it, to obtain a labeled transcribed text.
In another aspect, a multi-modal rapid transcription and labeling system based on a self-built template is provided, the system comprising:
a first acquisition unit, configured to acquire the project file corresponding to a media file to be processed;
a second acquisition unit, configured to acquire the audio data of the media file according to the directory recorded in the project file;
a segmentation unit, configured to segment the audio data according to its amplitude to obtain sentence segment data of the audio data;
a display unit, configured to display the sentence segment data on an operation interface, the operation interface providing a display interface and boundary axis controls;
a processing unit, configured to perform boundary adjustment or sentence segment merging on the sentence segment data in response to an editing operation on a boundary axis control, to obtain processed sentence segment data;
a transcription unit, configured to perform speech recognition on the processed sentence segment data to obtain a transcribed text;
an updating unit, configured to update the project file according to the transcribed text to obtain an updated project file that carries the transcribed text;
and a playing unit, configured to display, when the updated project file is played on the display interface, the media file together with the text segment of the transcribed text corresponding to the current playing progress of the media file.
In another aspect, a computer-readable storage medium is provided, storing a computer program adapted to be loaded by a processor to perform the steps of the multi-modal rapid transcription and labeling method based on a self-built template described in the first aspect.
In another aspect, a terminal device is provided, comprising a processor and a memory storing a computer program, the processor performing the steps of the method described in the first aspect by calling the computer program stored in the memory.
The embodiments of the present application thus provide a multi-modal rapid transcription and labeling method, system, and storage medium in which: the project file corresponding to the media file to be processed is acquired; the audio data of the media file is acquired according to the directory recorded in the project file; the audio data is segmented according to its amplitude to obtain sentence segment data; the sentence segment data is displayed on an operation interface that provides a display interface and boundary axis controls; boundary adjustment or sentence segment merging is performed on the sentence segment data in response to editing operations on the boundary axis controls; speech recognition is performed on the processed sentence segment data to obtain a transcribed text; the project file is updated with the transcribed text; and, when the updated project file is played on the display interface, the media file is displayed together with the text segment of the transcribed text corresponding to the current playing progress.
The embodiments of the present application therefore provide a simple and convenient way to transcribe and label speech. Multiple kinds of transcription can be performed through self-built multi-language templates, and template import is supported for the many languages and dialects that speech recognition cannot yet handle, enabling rapid and efficient sentence breaking, transcription, and labeling. Sentence segments can be merged quickly by dragging the boundary axis controls displayed on the operation interface, and boundaries can be finely adjusted directly by dragging a boundary axis control horizontally along the corresponding sentence segment waveform, improving transcription and labeling efficiency and meeting the requirements of a wide range of scenarios.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person skilled in the art may derive other drawings from them without inventive effort.
Fig. 1 is a flowchart of a multi-modal rapid transcription and labeling method based on a self-built template according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a first application scenario provided in an embodiment of the present application.
Fig. 3 is a schematic diagram of a second application scenario provided in an embodiment of the present application.
Fig. 4 is a schematic diagram of a third application scenario provided in an embodiment of the present application.
Fig. 5 is a schematic diagram of a fourth application scenario provided in an embodiment of the present application.
Fig. 6 is a schematic diagram of a fifth application scenario provided in an embodiment of the present application.
Fig. 7 is a schematic diagram of a sixth application scenario provided in an embodiment of the present application.
Fig. 8 is a schematic structural diagram of a multi-modal rapid transcription and labeling system based on a self-built template according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from these embodiments without inventive effort fall within the scope of the present application.
The embodiments of the present application provide a multi-modal rapid transcription and labeling method based on a self-built template, a corresponding system, and a storage medium. Specifically, the method of the embodiments may be executed by a terminal device such as a terminal or a server. The terminal may be a smartphone, a tablet computer, a touch screen device, a personal computer (PC), or the like. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content distribution network services, big data, and artificial intelligence platforms, but is not limited thereto.
These are described in detail below. Note that the order in which the embodiments are described is not a limitation on their priority.
Referring to fig. 1 to fig. 7, fig. 1 is a flowchart of the multi-modal rapid transcription and labeling method based on a self-built template according to an embodiment of the present application, and fig. 2 to fig. 7 are application scenario diagrams. The method can be applied to the multi-modal rapid transcription and labeling system based on a self-built template, which can be configured on a terminal device. The method comprises the following steps:
Step 110, obtaining the project file corresponding to the media file to be processed.
In some embodiments, obtaining the project file corresponding to the media file to be processed comprises: acquiring the media file to be processed; detecting whether a corresponding project file has already been created for the media file; if no corresponding project file has been created, creating the project file for the media file based on a template file; or, if a corresponding project file has been created, acquiring the project file already created for the media file.
For example, a target client may be provided and started, and the media file to be processed may then be opened or imported through the target client. The media file may be an audio file or a video file.
For example, the target client may be tool software developed specifically for rapid transcription and labeling of audio and video language material, built on the multi-modal rapid transcription and labeling system based on a self-built template. The software may have built-in multi-language templates for Mandarin, Chinese dialects, minority languages, and so on, directly supporting transcription for the Chinese language resource protection project. A multi-language template may be a multi-layer annotation template. Multi-language templates may also be built according to project requirements; for example, transcription and labeling templates for different languages may be built in. The target client can be applied to scenarios such as video subtitle (SRT) production, MP3 lyric (LRC) production, transcription of recordings of all kinds, language listening teaching, audio-visual speaking teaching, spoken corpus construction, multimedia resource library construction, situational language research, and multi-modal classroom teaching research.
Whether a project file has been created for the media file is then detected by checking whether a project file with the same name as the media file exists in the storage path. For media files that have been opened before, the target client keeps a history record, so that the next time such a file is opened, the project file of the same name recorded in the history is retrieved directly; a project file only has to be created for a media file that is opened for the first time or is absent from the history, which streamlines the processing flow. The history record is the record, kept by the target client, of the media files opened during past sessions.
For example, if a project file with the same name as the media file exists, it is determined that a project file has been created for the media file; the project file of the same name is retrieved directly from the storage path, and step 120 is then performed.
For example, if no project file with the same name as the media file exists, a project file of the same name is created for the media file based on the template file and loaded, and step 120 is then performed.
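The get-or-create flow above can be summarized in a minimal Python sketch. It is an illustration, not the patent's implementation: the function name is hypothetical, and deriving the project file path from the media file's base name plus the .baf extension (the project file format mentioned later in the description) is an assumption; a real client would also consult its history record.

```python
import os
import shutil

PROJECT_EXT = ".baf"  # project file format named later in the description (assumption here)

def get_or_create_project(media_path: str, template_path: str) -> str:
    """Return the path of the project file for a media file, creating it
    from the template file if it does not exist yet."""
    base, _ = os.path.splitext(media_path)
    project_path = base + PROJECT_EXT              # same name as the media file
    if not os.path.exists(project_path):           # no project file created yet
        shutil.copyfile(template_path, project_path)  # create it from the template file
    return project_path
```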
Step 120, obtaining the audio data of the media file according to the directory recorded in the project file.
For example, an audio/video data parsing thread is started; the media file to be processed is located in its storage path according to the media file information recorded in the directory of the project file, and the audio data of the media file is extracted from it by the parsing thread.
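The patent does not name the extraction tool; one illustrative possibility is pulling the audio track out of the media file with ffmpeg. A minimal sketch, assuming ffmpeg is installed and the media path has already been resolved from the project file's directory:

```python
import subprocess

def extract_audio(media_path: str, wav_path: str, sample_rate: int = 16000) -> None:
    """Extract a mono PCM WAV track from an audio/video media file with ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", media_path,         # media file located via the project file's directory
         "-vn",                    # drop any video stream
         "-ac", "1",               # downmix to mono
         "-ar", str(sample_rate),  # resample (16 kHz is common for speech recognition)
         wav_path],
        check=True,
    )
```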
Step 130, segmenting the audio data according to its amplitude to obtain sentence segment data of the audio data.
For example, before segmentation it is first determined whether the audio data needs to be segmented. If it does, the audio data is segmented and, once segmentation is finished, a segmentation-finished notification is sent to the main thread. If it does not, the segmentation-finished notification is sent to the main thread directly.
Whether the audio data needs segmentation can be determined by detecting whether already-divided sentence segment data for the audio data exists in the project file. If divided sentence segment data exists, no segmentation is needed; if it does not exist, the audio data needs to be segmented.
In some embodiments, segmenting the audio data according to its amplitude comprises: segmenting the audio data according to the relation between a noise amplitude threshold and the amplitude of the audio data to obtain the sentence segment data.
For example, the audio data may first be segmented initially, either at preset segmentation intervals or at silent segments. A second segmentation pass is then performed on the audio data according to the relation between the noise amplitude threshold and the amplitude of the audio data, yielding the sentence segment data.
In some embodiments, segmenting the audio data according to the relation between the noise amplitude threshold and the amplitude comprises: acquiring initial segment data of the audio data; determining whether the average amplitude within the current segment of the initial segment data is greater than the noise amplitude threshold; if it is, marking the current segment as a voiced segment; trimming the sentence segment start point and end point of the audio points within the current voiced segment to remove silence or noise from it; if the start position of the trimmed current segment coincides with the end position of the previous segment, merging the trimmed current segment into the previous segment; if it does not, marking the trimmed current segment as a new segment; and traversing the initial segment data in this manner to obtain the sentence segment data of the audio data.
In some embodiments, acquiring the initial segment data of the audio data comprises: performing initial segmentation of the audio data according to a preset language template to obtain the initial segment data.
For example, the preset language template has sentence segment segmentation capability. It may be one of the multi-language templates built into, or self-built within, the target client, enabling rapid creation of the initial segment data. A multi-language template may be a multi-layer annotation template; for example, it may contain language templates for the languages of different countries, the dialects of different regions, and the voices of different speaker types, such as English, Mandarin, minority languages, Chinese dialects, female voices, male voices, and child voices. A built-in multi-language template may be a language template supplied through third-party software, through which multiple kinds of speech transcription can be performed; a self-built multi-language template is a language template built directly in the target client, through which multiple kinds of transcription labeling can be performed.
In some embodiments, the preset language templates comprise the multi-language templates built into or self-built within the target client, covering different national languages, regional dialects, speaker types, and so on. Because different speaker genders and their corresponding languages produce different noise characteristics, using a single fixed noise threshold for speech segmentation would be one-sided. In this embodiment the noise amplitude threshold is therefore generated automatically from the speech signal of the current segment: a noise amplitude threshold generation module can be built in, and the preset language template is fed into it to adaptively determine the noise amplitude threshold corresponding to the speech signal of the current segment.
Specifically, in this embodiment the speech signal of the current segment is obtained and an amplitude distribution function is fitted to it, of the form

f(x) = (1 / (√(2π) · σ)) · exp(-x² / (2σ²)),

where x denotes the signal amplitude of the speech of the current segment and σ² denotes the signal variance of the speech of the current segment.

The signal standard deviation σ corresponding to the speech of the current segment is determined from the fitted amplitude distribution function.

The noise amplitude threshold of the speech of the current segment is then determined as the product of the standard deviation, the average amplitude, and a preset amplitude factor:

T_am = α · σ · x̄,

where T_am denotes the noise amplitude threshold, σ denotes the standard deviation, x̄ denotes the average amplitude, and α denotes the preset amplitude factor. By determining the noise amplitude threshold in this way and segmenting the speech accordingly, noise and non-noise in the speech can be detected adaptively according to the actual speech conditions, improving the accuracy of noise detection and segmentation.
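A minimal sketch of this threshold computation in Python with NumPy, following the product formula above; the default value of the amplitude factor α is illustrative, as the patent does not fix it:

```python
import numpy as np

def noise_amplitude_threshold(segment: np.ndarray, alpha: float = 0.5) -> float:
    """T_am = alpha * sigma * x_bar: the product of the standard deviation,
    the average amplitude, and the preset amplitude factor."""
    sigma = float(np.std(segment))            # standard deviation of the segment's samples
    x_bar = float(np.mean(np.abs(segment)))   # average amplitude of the segment
    return alpha * sigma * x_bar
```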
For example, the audio data may be segmented initially at a preset segmentation interval to obtain the initial segment data. The preset segmentation interval may be set according to typical sentence-break durations.
For example, the audio data may instead be segmented initially at silent segments. Silent segments in the audio data are detected, and initial segmentation is performed at their positions: the head of a silent segment closes the previous initial segment, and the tail of the silent segment opens the next initial segment.
For example, to avoid producing too many initial segments when the short pauses of ordinary sentence-break speech would split one complete sentence into several segments, short silences can be ignored before initial segmentation: only silent segments whose audio length exceeds a preset length are used as target silent segments on which the initial segmentation is based. Silent segments are first detected in the audio data, those longer than the preset length are selected as target silent segments, and initial segmentation is then performed at the positions of the target silent segments.
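A minimal sketch of silence-based initial segmentation with the minimum-length filter described above. The fixed silence threshold and the sample-level scan are simplifying assumptions (the adaptive threshold of the second pass could be substituted); segments are returned as (start, end) sample indices:

```python
import numpy as np

def initial_segments(samples: np.ndarray, sr: int,
                     silence_thresh: float = 0.01,
                     min_silence_s: float = 0.3):
    """Split audio at silent stretches, ignoring silences shorter than
    min_silence_s so that ordinary sentence-break pauses do not split a
    complete sentence. Returns a list of (start, end) sample indices."""
    silent = np.abs(samples) < silence_thresh
    segments, start, i, n = [], 0, 0, len(samples)
    while i < n:
        if silent[i]:
            j = i
            while j < n and silent[j]:
                j += 1                              # scan to the end of the silent run
            if (j - i) / sr >= min_silence_s:       # only long silences are targets
                if i > start:
                    segments.append((start, i))     # silence head closes the segment
                start = j                           # silence tail opens the next one
            i = j
        else:
            i += 1
    if start < n:
        segments.append((start, n))
    return segments
```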
The initial segment data then undergoes the second segmentation pass according to the relation between the noise amplitude threshold and the amplitude of the audio data. Specifically, it is determined whether the average amplitude within the current segment is greater than the noise amplitude threshold. If it is, the current segment is marked as a voiced segment, and the audio points in the voiced segment are trimmed to remove silence or noise from it. If the start position of the trimmed current segment coincides with the end position of the previous segment, the two are merged, and the merged segment becomes one sentence segment of the sentence segment data; if the positions differ, the trimmed current segment is marked as a new segment and becomes a sentence segment of its own.
For example, if the average amplitude within the current segment of the initial segment data is not greater than the noise amplitude threshold, the current segment is marked as silence and may be discarded rather than becoming a sentence segment of the sentence segment data.
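A minimal sketch of this second pass, combining the voiced-segment check, the trimming of the sentence segment start and end points, and the merge with the previous segment. It reuses the noise_amplitude_threshold sketch above and treats segments as (start, end) sample indices; both choices are assumptions:

```python
import numpy as np

def refine_segments(samples: np.ndarray, segments, alpha: float = 0.5):
    """Second pass over the initial segments: discard silent segments, trim
    silence/noise from voiced segments, and merge a segment into the previous
    one when its trimmed start coincides with the previous segment's end."""
    result = []
    for start, end in segments:
        seg = samples[start:end]
        t = noise_amplitude_threshold(seg, alpha)   # adaptive threshold (see sketch above)
        if np.mean(np.abs(seg)) <= t:
            continue                                # marked as silence: discard
        voiced = np.nonzero(np.abs(seg) > t)[0]     # audio points above the threshold
        s = start + int(voiced[0])                  # trimmed sentence segment start point
        e = start + int(voiced[-1]) + 1             # trimmed sentence segment end point
        if result and s == result[-1][1]:           # start coincides with previous end
            result[-1] = (result[-1][0], e)         # merge with the previous segment
        else:
            result.append((s, e))                   # otherwise mark as a new segment
    return result
```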
Step 140, displaying the sentence segment data of the audio data on an operation interface, the operation interface providing a display interface and boundary axis controls.
For example, as shown in fig. 2, the operation interface 200 of the target client displays the sentence segment data 201 of the audio data and provides a display interface 202 and a boundary axis control 203.
For example, other editing or operation entries may also be displayed on the operation interface 200, such as menus for file, edit, settings, and help; mode entries for transcription mode, labeling mode, and full-text mode; and a playback interface within the display interface.
In some embodiments, displaying the sentence segment data on the operation interface comprises: displaying the waveform information of the sentence segments of the audio data and the time axis information corresponding to that waveform information on the operation interface.
In some embodiments, the method further comprises: hiding the sentence segment waveform information and the time axis information on the operation interface in response to a waveform hiding instruction.
For example, based on instructions input by the user, the sentence segment waveform information and time axis information can be shown or hidden, making the display mode flexible.
Step 150, in response to an editing operation on a boundary axis control, performing boundary adjustment or sentence segment merging on the sentence segment data to obtain processed sentence segment data.
For example, dragging a boundary axis control performs sentence segment boundary adjustment or sentence segment merging. Sentence segments are merged quickly by dragging the boundary axis controls corresponding to the sentence segments displayed on the operation interface, and a boundary is finely adjusted directly by dragging the boundary axis control horizontally left or right; for example, the sentence segment waveform may also be displayed on the operation interface, and the boundary fine-tuned by dragging the boundary axis control corresponding to that waveform horizontally left or right.
For example, the interaction can proceed as follows: right-click the boundary axis control and record the information of the current active sentence segment; cache the current list of all sentence segments; drag the boundary axis control in response to a drag triggered by holding the left mouse button; check whether an active sentence segment exists and, if so, update the temporary left and right boundary points of the active sentence segment; on release of the left button, check whether a drag took place and, if so, obtain the sentence segment under the current mouse position; and check whether the merging condition is satisfied, merging the sentence segments if it is and updating the boundary information of the active sentence segment if it is not.
For example, checking the merging condition mainly means checking whether the final boundary point of the active segment passes the near boundary of the segment to be merged. When merging to the right, the right boundary of the active segment must pass the left boundary of the merged segment, and the two segments must not be identical. When merging to the left, the left boundary of the active segment must pass the right boundary of the merged segment, and again the two segments must not be identical.
For example, the logic for finding the ending sentence segment is: traverse the whole sentence segment list in order, comparing the left and right boundaries of each sentence segment with the horizontal position at which the mouse ends. When merging to the left, the first sentence segment whose right boundary is greater than the mouse end position is the ending sentence segment; when merging to the right, once a sentence segment's left boundary is greater than the mouse end position, the previous sentence segment is the ending sentence segment.
For example, taking merging to the right: when checking the merging condition, first detect whether an ending sentence segment exists; if not, the segments cannot be merged and the condition fails. If one exists, check whether it is the same sentence segment as the active one; if it is, merging is impossible and the condition fails. If it is not, check whether the current right boundary of the active sentence segment is greater than the left boundary of the ending sentence segment: if it is, merge, as the condition is satisfied; if it is smaller, merging is impossible and the condition fails.
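A minimal sketch of the ending-segment search and the right-merge condition just described, with sentence segments represented as (left, right) coordinate pairs ordered from left to right; the representation and function names are assumptions:

```python
def find_end_segment(segments, mouse_end, direction):
    """Traverse the sentence segment list in order and locate the ending
    sentence segment for a merge drag that stops at mouse_end."""
    for i, (left, right) in enumerate(segments):
        if direction == "left" and right > mouse_end:
            return i                         # right boundary passes the mouse end position
        if direction == "right" and left > mouse_end:
            return i - 1 if i > 0 else None  # the previous segment is the ending segment
    return len(segments) - 1 if direction == "right" and segments else None

def right_merge_allowed(segments, active_idx, dragged_right):
    """Right-merge condition: an ending segment exists, it is not the active
    segment itself, and the dragged right boundary of the active segment
    passes the ending segment's left boundary."""
    end_idx = find_end_segment(segments, dragged_right, "right")
    if end_idx is None:
        return False                         # no ending segment: merging impossible
    if end_idx == active_idx:
        return False                         # same sentence segment: merging impossible
    return dragged_right > segments[end_idx][0]
```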
For example, the view changes of the operation interface 300 shown in fig. 3 illustrate adjusting a sentence segment boundary. The user hovers the mouse over the first boundary axis control 3031 of the sentence segment to be adjusted; the terminal determines the current active sentence segment 3011 from the hover position. The user then holds the left mouse button to drag one end boundary label of the first boundary axis control 3031 and releases the button at the desired position, completing the drag, after which the boundary of the active sentence segment 3011 is updated to the new position. The editing operation on the first boundary axis control 3031 may be a drag operation, a click operation, or the like. Taking a drag operation as an example, the end boundary label of the first boundary axis control 3031 that is not dragged is the stationary end, located at position A; the dragged end boundary label is the active end, located at position B before the drag. Diagram 3-1 in fig. 3 shows the screen before the drag, and diagram 3-2 shows the screen after the drag, with the boundary position of the first boundary axis control 3031 updated. In response to a first editing operation on the active end of the first boundary axis control 3031 of the active sentence segment 3011, the active end is moved from position B to position C to adjust the boundary. Since the dragged boundary does not fall within the range of any other sentence segment, the end boundary label of the active sentence segment is updated to position C; that is, the boundary of the active sentence segment 3011 is adjusted from segment AB to segment AC.
For example, the view changes of the operation interface 400 shown in fig. 4 illustrate a sentence segment merging operation. The user hovers the mouse over the first boundary axis control 4031 of the sentence segment to be adjusted; the terminal determines the current active sentence segment 4011 from the hover position. The user then holds the left mouse button to drag one end boundary label of the first boundary axis control 4031 and releases the button at the desired position, completing the drag, after which the boundary of the active sentence segment 4011 is updated to the new position. The editing operation on the first boundary axis control 4031 may be a drag operation, a click operation, or the like. Taking a drag operation as an example, the end boundary label that is not dragged is the stationary end, located at position D; the dragged end boundary label is the active end, located at position E before the drag. Diagram 4-1 in fig. 4 shows the screen before the drag, diagram 4-2 shows the boundary position of the first boundary axis control 4031 changing during the drag, and diagram 4-3 shows the merged sentence segments after the drag. In response to a first editing operation on the active end of the first boundary axis control 4031 of the active sentence segment 4011, the active end is moved from position E past position A to position F. When an end boundary label is dragged into another sentence segment, it may be displayed as an icon different from the other boundary labels; for example, when the active end of the first boundary axis control 4031 is moved from position E past position A to position F inside another sentence segment, the icon of the active end at position F may take a small light-blue candle shape while the other boundary labels are shown as red right-angle icons, and the user releases the mouse to merge the sentence segments. If the boundary of the dragged active sentence segment passes the near boundary of other sentence segments, all sentence segments within the overlapped range can be merged. Here the boundary of the active sentence segment 4011 passes the left boundary (position A) of the other sentence segment 4012, so the two are merged into the merged sentence segment 4013, whose boundary axis control 4033 spans segment DC.
In some embodiments, performing boundary adjustment or sentence segment merging in response to an editing operation on the boundary axis control comprises: in response to a first editing operation on the active end of the first boundary axis control of an active sentence segment in the sentence segment data, controlling the active end of the first boundary axis control to move to a first position; determining whether a second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, the second boundary axis control corresponding to a second sentence segment that is adjacent to the active sentence segment; and, if such a second boundary axis control exists at the first position, merging the active sentence segment with the second sentence segment.
In some embodiments, when processing the audio data, the back-end program also needs to determine, before merging the active sentence segment with the second sentence segment, whether the two are the same segment, to avoid merging a segment with itself. Specifically, it can check whether the left boundaries of the two segments coincide and whether their right boundaries coincide; if both the left boundaries and the right boundaries coincide, the active sentence segment and the second sentence segment are judged to be the same segment. If the left boundaries differ and/or the right boundaries differ, they are judged to be different segments, so the active sentence segment can be accurately distinguished from the second sentence segment, and the two can then be merged.
In some embodiments, after the determination, the method further comprises: if no second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, adjusting the boundary of the active sentence segment according to the first position.
For example, quick merging between adjacent sentence segments can be achieved by dragging the sentence segments displayed on the operation interface. Segment merging merges two adjacent segments at a time; on this basis, if several sentence segments need to be merged at once, any number of them can be merged by merging pairwise in segment order. For example, the boundary labels of two adjacent sentence segments can be dragged until they touch and merge into a new segment; merging several segments can likewise be achieved by dragging one segment's boundary label across the boundary labels of the other segments.
Referring to fig. 3 and fig. 4, fig. 3 illustrates boundary adjustment of the sentence segment data and fig. 4 illustrates sentence segment merging.
As shown in fig. 3, in response to a first editing operation on the active end of the first boundary axis control 3031 of the active sentence segment 3011 in the sentence segment data, the active end is moved from position B to a first position, which is position C in fig. 3. No second boundary axis control overlapping the active end of the first boundary axis control 3031 exists at the first position (position C), so the boundary of the active sentence segment 3011 is adjusted according to the first position; that is, the boundary of the active sentence segment 3011 is adjusted from segment AB to segment AC.
As shown in fig. 4, in response to a first editing operation on the active end of the first boundary axis control 4031 of the active sentence segment 4011 in the sentence segment data, the active end is moved to a first position, which is position F in fig. 4. A second boundary axis control 4032 overlapping the active end of the first boundary axis control 4031 exists at the first position (position F), so the active sentence segment 4011 is merged with the second sentence segment 4012 into the merged sentence segment 4013, whose boundary axis control 4033 spans segment DC.
In some embodiments, performing boundary adjustment or sentence segment merging in response to an editing operation on the boundary axis control comprises: in response to a second editing operation on the active end of the first boundary axis control of an active sentence segment in the sentence segment data, controlling the active end of the first boundary axis control to move to a second position; determining whether a third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, the third boundary axis control corresponding to a third sentence segment that is not adjacent to the active sentence segment; and, if such a third boundary axis control exists at the second position, merging the active sentence segment, the third sentence segment, and the intermediate sentence segments between them.
In some embodiments, after the determination, the method further comprises: if no third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, determining whether the target area between the position of the stationary end of the first boundary axis control and the second position overlaps any intermediate sentence segment; if the target area overlaps no intermediate sentence segment, adjusting the boundary of the active sentence segment according to the second position; or, if the target area overlaps at least one intermediate sentence segment, merging the active sentence segment with all intermediate sentence segments that overlap the target area.
For example, several sentence segments can be merged quickly by dragging the boundary axis control of a sentence segment displayed on the operation interface. Specifically, a multi-segment merge is performed with one drag: after the first boundary axis control of the active sentence segment is dragged, the active sentence segment is merged with all intermediate sentence segments that overlap the target area, merging several segments at once. The target area is the area between the stationary-end position of the first boundary axis control and the second position; in other words, the dragged boundary must come to rest within the range of other sentence segments, and all sentence segments within that range are merged.
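A minimal sketch of the multi-segment merge: every sentence segment that overlaps the target area between the stationary end and the drop position is folded into a single segment. The (start, end) pair representation and the overlap test are assumptions consistent with the description:

```python
def merge_over_target_area(segments, active_idx, drop_pos):
    """Merge the active segment with all segments that overlap the target area
    between the stationary end of its boundary axis control and drop_pos."""
    left, right = segments[active_idx]
    area = (min(left, drop_pos), max(right, drop_pos))   # stationary end .. drop position
    merged = [s for s in segments
              if s[1] > area[0] and s[0] < area[1]]      # overlap with the target area
    new_segment = (min(s[0] for s in merged), max(s[1] for s in merged))
    remaining = [s for s in segments if s not in merged]
    return sorted(remaining + [new_segment])
```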
In some embodiments, the method further includes: in response to a breakpoint insertion operation for a target sentence segment in the sentence segment data, inserting a breakpoint in the boundary axis control of the target sentence segment, so as to split the target sentence segment at the breakpoint.
For example, inserting a breakpoint splits the target sentence segment in two, which increases the flexibility of sentence segment adjustment; a sketch follows.
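A minimal sketch of the split, reusing the illustrative Segment struct from the drag-to-merge example above (again an assumption, not the patent's code):

```cpp
#include <vector>

// Splits segs[target] into two segments at time t.
void InsertBreakpoint(std::vector<Segment>& segs, std::size_t target, double t) {
    if (t <= segs[target].start || t >= segs[target].end)
        return;                               // breakpoint must fall strictly inside
    Segment right{t, segs[target].end};       // right half starts at the breakpoint
    segs[target].end = t;                     // left half keeps the original start
    segs.insert(segs.begin() + target + 1, right);
}
```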
Step 160: performing speech recognition processing on the processed sentence segment data to obtain a transcription text.
For example, automatic transcription can be realized by invoking a speech recognition module configured on the terminal, or a third-party speech recognition module, to perform speech recognition on the processed sentence segment data and obtain the transcription text; a sketch follows.
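Since the patent does not specify the recognition API, the following sketch wraps it behind a hypothetical ISpeechRecognizer interface; the name RecognizeRange and its signature are assumptions, and the Segment struct is the illustrative one from above.

```cpp
#include <string>
#include <vector>

// Hypothetical recognizer interface; the patent only states that a speech
// recognition module of the terminal or of a third party is invoked.
struct ISpeechRecognizer {
    virtual ~ISpeechRecognizer() = default;
    // Recognize the audio between `start` and `end` seconds of `audioPath`.
    virtual std::string RecognizeRange(const std::string& audioPath,
                                       double start, double end) = 0;
};

// Produces one text segment per sentence segment, in order.
std::vector<std::string> TranscribeSegments(ISpeechRecognizer& asr,
                                            const std::string& audioPath,
                                            const std::vector<Segment>& segs) {
    std::vector<std::string> text;
    text.reserve(segs.size());
    for (const Segment& s : segs)
        text.push_back(asr.RecognizeRange(audioPath, s.start, s.end));
    return text;
}
```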
In some embodiments, the transcription text includes a text segment corresponding to each sentence segment in the sentence segment data, and after performing speech recognition processing on the processed sentence segment data to obtain the transcription text, the method further includes: modifying a target text segment in the transcription text in response to a modification instruction for the target text segment, to obtain a modified transcription text, wherein the target text segment is at least one text segment in the transcription text.
For example, after the initial transcription text is generated by automatic transcription, the user may input a modification instruction for a target text segment through the operation interface to manually update the transcription text. The modification instruction may include instructions to modify, delete, or add words, or to change the font, font size, or font color.
In some embodiments, the method further includes: labeling the target text segment in response to a labeling instruction for the target text segment, to obtain a labeled transcription text.
For example, a labeling instruction for the target text segment can be input through the operation interface to label the target text segment and obtain the labeled transcription text. The target text segment may be annotated with any of the following: industry domain labels, content category labels, part-of-speech labels, dependency labels, entity labels, relationship labels, event labels, reading comprehension labels, and question-answer labels.
Step 170: updating the project engineering file according to the transcription text to obtain an updated project engineering file, wherein the updated project engineering file carries the transcription text.
For example, the transcription text is saved, together with the path of the media file, into the project engineering file in a fixed format (.baf) to update the project engineering file. The updated project engineering file carries the transcription text.
For example, when the project engineering file is updated, the waveform of the audio data can be initialized, a sentence segment waveform information array constructed, and the display interface of the sentence segment waveform information updated; the media file information and sentence segment data are saved to the project engineering file; a media-file-change message is issued; the player switches the media file; the software updates the title information; and the controller updates the interface and the related control information.
During initialization, the in-memory data used for display can be initialized from the audio result parsed by the audio/video data parsing thread and the segmentation information obtained by the segmentation processing, after which default values are set for the parameters that will be used. A sketch of the project-file save follows.
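The patent only states that the .baf project file carries the media-file path, the sentence segment data, and the transcription text; the plain-text layout in the following sketch is therefore purely an assumption.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Writes the media-file path plus one line per sentence segment
// (start, end, text). Assumes text.size() == segs.size().
bool SaveProjectFile(const std::string& bafPath, const std::string& mediaPath,
                     const std::vector<Segment>& segs,
                     const std::vector<std::string>& text) {
    std::ofstream out(bafPath);
    if (!out) return false;
    out << "media=" << mediaPath << '\n';
    for (std::size_t i = 0; i < segs.size(); ++i)
        out << segs[i].start << '\t' << segs[i].end << '\t' << text[i] << '\n';
    return static_cast<bool>(out);
}
```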
Step 180: when the updated project engineering file is played on the display interface, displaying the media file and the text segment of the transcription text corresponding to the playing progress of the media file.
For example, when the updated project engineering file is played on the display interface, the media file is displayed together with the text segment of the transcription text that corresponds to the current playing progress. The playing progress can also be controlled by a playing control on the display interface. A sketch of locating that text segment follows.
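Locating the text segment that matches the current playhead can be done with a binary search over the segments, which are sorted by start time. The following sketch uses the illustrative Segment struct from above and is an assumption, not the patent's implementation.

```cpp
#include <algorithm>
#include <vector>

// Returns the index of the segment whose range contains `playhead`,
// or -1 if the playhead falls in silence between segments.
int SegmentAtTime(const std::vector<Segment>& segs, double playhead) {
    // Binary-search the first segment starting strictly after the playhead,
    // then step back one.
    auto it = std::upper_bound(segs.begin(), segs.end(), playhead,
        [](double t, const Segment& s) { return t < s.start; });
    if (it == segs.begin()) return -1;   // playhead precedes the first segment
    const std::size_t i = static_cast<std::size_t>(it - segs.begin()) - 1;
    return playhead <= segs[i].end ? static_cast<int>(i) : -1;
}
```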
For example, the embodiment of the application also provides multi-format import and export functions, which support importing Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, and json files, among others, and support exporting the above file types as well as the EAF format. Transcription files can thus be conveniently exchanged, realizing multi-format file import and export. For these functions, corresponding interface functions for reading in and writing out files can be provided for the different file types and file read/write modes, so that files of different types can be read in and written out on import or export. For example, Excel, srt, and other files can be imported together with the corresponding media files, the data files can be converted into the .baf format, and multiple file formats can be exported selectively in one operation.
For example, the correspondence between the file type of the import format and the import interface may be as shown in table 1:
TABLE 1
File type Import interface
Xls, Xlsx DoImportFile_Excel
Lrc DoImportFile_Lrc
Srt DoImportFile_Srt
Docx DoImportFile_Docx
Json DoImportFile_Json
Aud DoImportFile_Aud
Txt DoImportFile_Txt
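Only the interface names in Table 1 come from the patent; their signatures and the extension-based dispatch in the following C++ sketch (the language is suggested by names such as IBAF::SaveTo) are assumptions.

```cpp
#include <algorithm>
#include <cctype>
#include <functional>
#include <map>
#include <string>

// Stubs standing in for the import interfaces of Table 1; their real
// signatures are not given in the patent.
bool DoImportFile_Excel(const std::string&) { return true; }  // xls / xlsx
bool DoImportFile_Lrc  (const std::string&) { return true; }
bool DoImportFile_Srt  (const std::string&) { return true; }
bool DoImportFile_Docx (const std::string&) { return true; }
bool DoImportFile_Json (const std::string&) { return true; }
bool DoImportFile_Aud  (const std::string&) { return true; }
bool DoImportFile_Txt  (const std::string&) { return true; }

// Dispatches an import file to the interface matching its extension and
// returns false for unsupported types. (A real implementation would also
// treat the "aud.txt" double extension specially.)
bool ImportFile(const std::string& path) {
    static const std::map<std::string, std::function<bool(const std::string&)>> kTable = {
        {"xls",  DoImportFile_Excel}, {"xlsx", DoImportFile_Excel},
        {"lrc",  DoImportFile_Lrc},   {"srt",  DoImportFile_Srt},
        {"docx", DoImportFile_Docx},  {"json", DoImportFile_Json},
        {"aud",  DoImportFile_Aud},   {"txt",  DoImportFile_Txt},
    };
    const auto dot = path.find_last_of('.');
    if (dot == std::string::npos) return false;
    std::string ext = path.substr(dot + 1);
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    const auto it = kTable.find(ext);
    return it != kTable.end() && it->second(path);
}
```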
For example, the correspondence between the file type of the export format and the export interface may be as shown in table 2:
TABLE 2
File type Export interface
Xls, Xlsx ExportFile_Excel
Lrc DoExportFile_LRC
Srt DoExportFile_SRT
Aud DoExportFile_Audacity
STL DoExportFile_STL
Docx、Txt DoExportFile_Txt
EAF IBAF::SaveTo
In some embodiments, the method further includes: in response to an export instruction carrying a target file type, exporting, from the project engineering file, an export file corresponding to the target file type, wherein the target file type is any one of the preset file types.
For example, in the file-export application scenario shown in fig. 5, the file export interface shown at 5-1 in fig. 5 allows the target file type and other settings to be specified; for example, the target file type is set to Excel and the export language is set to Mandarin. After the export instruction is executed, the file is exported according to these settings; the exported Excel file is shown at 5-2 in fig. 5.
In another file-export application scenario, shown in fig. 6, the file export interface shown at 6-1 in fig. 6 likewise allows the target file type and other settings to be specified; for example, the target file type can be set to Excel, Word, and EAF simultaneously, and the export language set to a dialect. After the export instruction is executed, the files are exported according to these settings; when multiple target file types are set at the same time, the selected formats are exported together, and the exported Excel file is shown at 6-2 in fig. 6.
For example, the preset file types may include Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, and json files, among others; export of these file types, as well as the EAF format, may be supported. Transcription files can thus be conveniently exchanged, realizing multi-format file export.
In some embodiments, the method further includes: acquiring an import file in response to an import instruction; and importing the import file into the project engineering file when the file type of the import file is any one of the preset file types.
For example, on the file import interface shown in fig. 7, an import file, or an import file together with its media file, may be selected; when the file type of the import file is any one of the preset file types, the import file is imported into the project engineering file.
For example, the preset file types may include Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, and json files, among others; import of these file types may be supported. Transcription files can thus be conveniently exchanged, realizing multi-format file import.
All the above technical solutions may be combined to form an optional embodiment of the present application, and will not be described in detail herein.
According to the embodiment of the application, the project engineering file corresponding to the media file to be processed is obtained; the audio data of the media file is acquired according to the catalogue of the project engineering file; the audio data is segmented according to its amplitude to obtain sentence segment data; the sentence segment data is displayed on an operation interface that provides a display interface and boundary axis controls; in response to an editing operation on a boundary axis control, boundary adjustment processing or sentence segment merging processing is performed on the sentence segment data to obtain processed sentence segment data; speech recognition processing is performed on the processed sentence segment data to obtain a transcription text; the project engineering file is updated according to the transcription text to obtain an updated project engineering file carrying the transcription text; and when the updated project engineering file is played on the display interface, the media file is displayed together with the text segment of the transcription text corresponding to the playing progress of the media file. The embodiment of the application thus provides a simple and convenient transcription workflow: multiple kinds of speech can be transcribed through self-built templates for multiple languages; sentence segments can be merged rapidly by dragging the boundary axis controls of the sentence segments displayed on the operation interface; boundaries can be fine-adjusted by dragging horizontally on the boundary axis control of a sentence segment waveform displayed on the operation interface; and the efficiency of transcription and labeling is improved to suit the needs of various scenarios.
In order to facilitate better implementation of the multi-mode rapid transcription and labeling method based on the self-built template, the embodiment of the application also provides a multi-mode rapid transcription and labeling system based on the self-built template. Referring to fig. 8, fig. 8 is a schematic structural diagram of a multi-mode rapid transcription and labeling system based on a self-built template according to an embodiment of the present application. The multi-mode rapid transcription and labeling system 800 based on the self-built template is applied to a terminal device providing a graphical user interface, and the multi-mode rapid transcription and labeling system 800 based on the self-built template may include:
a first obtaining unit 801, configured to obtain a project engineering file corresponding to a media file to be processed;
a second obtaining unit 802, configured to obtain audio data of the media file according to the catalog of the project engineering file;
a segmentation unit 803, configured to perform segmentation processing on the audio data according to the amplitude of the audio data, so as to obtain sentence segment data of the audio data;
the display unit 804 is configured to display sentence segment data of the audio data on an operation interface, where the operation interface is configured to provide a presentation interface and a boundary axis control;
the processing unit 805 is configured to perform boundary adjustment processing or sentence merging processing on the sentence data in response to an editing operation for the boundary axis control, to obtain processed sentence data;
a transcription unit 806, configured to perform speech recognition processing on the processed sentence segment data to obtain a transcription text;
an updating unit 807, configured to update the project file according to the transcription text, to obtain an updated project file, where the updated project file carries the transcription text;
and the playing unit 808 is configured to display a text segment corresponding to the playing progress of the media file in the media file and the transcription text when the updated project file is played on the presentation interface.
In some embodiments, the processing unit 805 may be configured to: in response to a first editing operation on the movable end of a first boundary axis control of an active sentence segment in the sentence segment data, control the movable end of the first boundary axis control to move to a first position; determine whether a second boundary axis control overlapping the movable end of the first boundary axis control exists at the first position, wherein the second boundary axis control corresponds to a second sentence segment, and the active sentence segment and the second sentence segment are adjacent sentence segments; and if such a second boundary axis control exists at the first position, merge the active sentence segment with the second sentence segment.
In some embodiments, after determining whether a second boundary axis control overlapping the movable end of the first boundary axis control exists at the first position, the processing unit 805 may be further configured to: if no such second boundary axis control exists at the first position, adjust the boundary of the active sentence segment according to the first position.
In some embodiments, the processing unit 805 may be configured to: in response to a second editing operation on the movable end of the first boundary axis control of an active sentence segment in the sentence segment data, control the movable end of the first boundary axis control to move to a second position; determine whether a third boundary axis control overlapping the movable end of the first boundary axis control exists at the second position, wherein the third boundary axis control corresponds to a third sentence segment, and the active sentence segment and the third sentence segment are non-adjacent sentence segments; and if such a third boundary axis control exists at the second position, merge the active sentence segment, the third sentence segment, and the intermediate sentence segments between them.
In some embodiments, after determining whether a third boundary axis control overlapping the movable end of the first boundary axis control exists at the second position, the processing unit 805 may be further configured to: if no such third boundary axis control exists at the second position, determine whether the target area between the position of the static end of the first boundary axis control and the second position overlaps any intermediate sentence segment; if the target area overlaps no intermediate sentence segment, adjust the boundary of the active sentence segment according to the second position; or, if the target area overlaps at least one intermediate sentence segment, merge the active sentence segment with all intermediate sentence segments that overlap the target area.
In some embodiments, the segmentation unit 803 may be configured to perform segmentation processing on the audio data according to the relation between a noise amplitude threshold and the amplitude of the audio data, to obtain the sentence segment data of the audio data.
In some embodiments, when performing segmentation processing on the audio data according to the relation between the noise amplitude threshold and the amplitude of the audio data, the segmentation unit 803 may be configured to: acquire initial segmentation data of the audio data; determine whether the average amplitude within the current segment of the initial segmentation data is greater than the noise amplitude threshold; if so, mark the current segment as a voiced segment; cut a sentence segment start point and a sentence segment end point from the audio points in the current segment marked as voiced, so as to remove silence or noise from the current segment; if the start position of the cut current segment coincides with the end position of the previous segment, merge the cut current segment with the previous segment; if it does not, mark the cut current segment as a new segment; and traverse the initial segmentation data of the audio data in this way to obtain the sentence segment data of the audio data.
In some embodiments, when acquiring the initial segmentation data of the audio data, the segmentation unit 803 may be configured to: perform initial segmentation processing on the audio data according to a preset language template to obtain the initial segmentation data of the audio data. A sketch of the amplitude-threshold loop follows.
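The following C++ sketch illustrates that amplitude-threshold loop. The Chunk layout, the sample-level trimming rule, and the threshold semantics are assumptions rather than the patent's implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// An initial chunk of audio, as sample indices [begin, end).
struct Chunk { std::size_t begin; std::size_t end; };

// Amplitude-threshold pass over the initial segmentation: keep chunks whose
// mean |amplitude| exceeds the noise threshold, trim their silent edges, and
// merge a trimmed chunk into the previous one when their boundaries coincide.
std::vector<Chunk> SegmentByAmplitude(const std::vector<float>& samples,
                                      const std::vector<Chunk>& initial,
                                      float noiseThreshold) {
    std::vector<Chunk> out;
    for (Chunk c : initial) {
        // 1. Mean absolute amplitude of the current chunk.
        double sum = 0.0;
        for (std::size_t i = c.begin; i < c.end; ++i) sum += std::fabs(samples[i]);
        const double mean = (c.end > c.begin) ? sum / (c.end - c.begin) : 0.0;
        if (mean <= noiseThreshold) continue;            // silence/noise: skip

        // 2. Trim sub-threshold samples at both edges (silence/noise removal).
        while (c.begin < c.end && std::fabs(samples[c.begin]) <= noiseThreshold) ++c.begin;
        while (c.end > c.begin && std::fabs(samples[c.end - 1]) <= noiseThreshold) --c.end;

        // 3. Merge with the previous sentence segment if the boundaries touch,
        //    otherwise start a new sentence segment.
        if (!out.empty() && out.back().end == c.begin)
            out.back().end = c.end;
        else
            out.push_back(c);
    }
    return out;
}
```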
In some embodiments, the first obtaining unit 801 may be configured to: acquire a media file to be processed; detect whether a corresponding project engineering file has been created for the media file; if not, create the project engineering file corresponding to the media file based on a template file; or, if a corresponding project engineering file has been created, acquire that project engineering file.
In some embodiments, the processing unit 805 may be further configured to export, from the project engineering file, an export file corresponding to a target file type in response to an export instruction carrying the target file type, wherein the target file type is any one of the preset file types.
In some embodiments, the processing unit 805 may also be configured to: acquire an import file in response to an import instruction; and import the import file into the project engineering file when the file type of the import file is any one of the preset file types.
In some embodiments, the display unit 804 may be configured to display, on the operation interface, sentence waveform information of sentence data of the audio data, and time axis information corresponding to the sentence waveform information.
In some embodiments, the display unit 804 may be further configured to hide the sentence waveform information and the time axis information on the operation interface in response to the hide waveform instruction.
In some embodiments, the processing unit 805 may be further configured to insert a breakpoint in a boundary axis control of a target segment in response to an insert breakpoint operation for the target segment in the segment data, so as to segment the target segment based on the breakpoint.
In some embodiments, the transcription text includes a text segment corresponding to each sentence segment in the sentence segment data, and after performing speech recognition processing on the processed sentence segment data to obtain the transcription text, the transcription unit 806 may be further configured to: modify a target text segment in the transcription text in response to a modification instruction for the target text segment, to obtain a modified transcription text, wherein the target text segment is at least one text segment in the transcription text.
In some embodiments, the transcription unit 806 may be further configured to label the target text segment in response to a labeling instruction for the target text segment, to obtain a labeled transcription text.
All the above technical solutions may be combined to form an optional embodiment of the present application, and will not be described in detail herein.
It should be understood that the system embodiments and the method embodiments correspond to each other, and similar descriptions can be found in the method embodiments; to avoid repetition, they are not repeated here. Specifically, the system shown in fig. 8 may execute the foregoing embodiments of the multi-mode rapid transcription and labeling method based on a self-built template, and the foregoing and other operations and/or functions of each unit in the system implement the corresponding flows of those method embodiments; for brevity, they are not described here again.
Correspondingly, the embodiment of the application also provides a terminal device, which may be a terminal or a server; the terminal may be a device such as a smartphone, tablet computer, notebook computer, smart TV, smart speaker, wearable smart device, or personal computer. As shown in fig. 9, fig. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application. The terminal device 900 includes a processor 901 with one or more processing cores, a memory 902 with one or more computer-readable storage media, and a computer program stored on the memory 902 and executable on the processor. The processor 901 is electrically connected to the memory 902. It will be appreciated by those skilled in the art that the terminal device structure shown in the figures does not limit the terminal device, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
The processor 901 is the control center of the terminal device 900; it connects the various parts of the terminal device 900 using various interfaces and lines, and performs the functions of the terminal device 900 and processes its data by running or loading the software programs and/or modules stored in the memory 902 and calling the data stored in the memory 902, thereby monitoring the terminal device 900 as a whole.
In the embodiment of the present application, the processor 901 in the terminal device 900 loads the instructions corresponding to the processes of one or more application programs into the memory 902 according to the following steps, and the processor 901 executes the application programs stored in the memory 902, so as to implement various functions:
acquiring a project engineering file corresponding to a media file to be processed; acquiring audio data of the media file according to the catalogue of the project engineering file; segmenting the audio data according to its amplitude to obtain sentence segment data of the audio data; displaying the sentence segment data of the audio data on an operation interface, wherein the operation interface provides a display interface and boundary axis controls; in response to an editing operation on a boundary axis control, performing boundary adjustment processing or sentence segment merging processing on the sentence segment data to obtain processed sentence segment data; performing speech recognition processing on the processed sentence segment data to obtain a transcription text; updating the project engineering file according to the transcription text to obtain an updated project engineering file that carries the transcription text; and displaying, when the updated project engineering file is played on the display interface, the media file and the text segment of the transcription text corresponding to the playing progress of the media file.
For the specific implementation of each operation above, reference may be made to the previous embodiments; details are not repeated here.
In some embodiments, as shown in fig. 9, the terminal device 900 further includes: a display unit 903, a radio frequency circuit 904, an audio circuit 905, an input unit 906, and a power supply 907. The processor 901 is electrically connected to the display unit 903, the radio frequency circuit 904, the audio circuit 905, the input unit 906, and the power supply 907, respectively. It will be appreciated by those skilled in the art that the terminal device structure shown in fig. 9 does not limit the terminal device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The display unit 903 may be used to display information input by a user or information provided to the user and various graphical user interfaces of the terminal device, which may be composed of graphics, text, icons, video, and any combination thereof. The display unit 903 may include a display panel and a touch panel.
The radio frequency circuit 904 may be configured to transmit and receive radio frequency signals to and from network devices or other terminal devices via wireless communication.
The audio circuitry 905 may be used to provide an audio interface between a user and the terminal device through a speaker and a microphone.
The input unit 906 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
A power supply 907 is used to power the various components of the terminal device 900. In some embodiments, the power supply 907 may be logically connected to the processor 901 through a power management system, so that charging, discharging, and power-consumption management are handled by the power management system. The power supply 907 may also include any one or more of a direct-current or alternating-current power supply, a recharging system, a power-failure detection circuit, a power converter or inverter, a power status indicator, and the like.
Although not shown in fig. 9, the terminal device 900 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which will not be described herein.
For the specific implementation of each operation above, reference may be made to the previous embodiments; details are not repeated here.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a computer-readable storage medium storing a computer program that can be loaded by a processor to execute the steps of any multi-mode rapid transcription and labeling method based on a self-built template provided by the embodiments of the present application. For the specific implementation of each operation above, reference may be made to the previous embodiments; details are not repeated here.
The storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Since the computer program stored in the storage medium can execute the steps of any multi-mode rapid transcription and labeling method based on a self-built template provided by the embodiments of the present application, it can achieve the beneficial effects of any such method; for details, see the preceding embodiments, which are not repeated here.
Embodiments of the present application also provide a computer program product comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the corresponding flow in any of the multi-mode rapid transcription and labeling methods based on the self-built template in the embodiments of the present application, which is not described herein for brevity.
The embodiments of the present application also provide a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the corresponding flow in any of the multi-mode rapid transcription and labeling methods based on the self-built template in the embodiments of the present application, which is not described herein for brevity.
The foregoing describes in detail a multi-mode rapid transcription and labeling method based on a self-built template, a multi-mode rapid transcription and labeling system based on a self-built template, and a storage medium provided by the embodiments of the present application. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and its core ideas. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope in light of the ideas of the present application; therefore, the content of this description should not be construed as limiting the present application.

Claims (15)

1. A multi-mode rapid transcription and labeling method based on a self-built template is characterized by comprising the following steps:
acquiring project engineering files corresponding to media files to be processed;
acquiring audio data of the media file according to the catalogue of the project engineering file;
performing initial segmentation processing on the audio data according to a preset language template to obtain initial segmentation data of the audio data, wherein the preset language template comprises language templates corresponding to different national languages, dialects of different areas and voices of different personas;
judging whether the average amplitude in the current segment in the initial segment data is larger than a noise amplitude threshold value or not;
if the average amplitude in the current segment in the initial segment data is larger than the noise amplitude threshold value, marking the current segment as a voiced segment;
cutting a sentence segment start point and a sentence segment end point from the audio points in the current segment marked as a voiced segment, so as to remove silence or noise in the current segment;
if the starting point position of the cut current segment is the same as the ending point position of the last segment, merging the cut current segment with the last segment;
if the starting point position of the cut current segment is different from the ending point position of the last segment, marking the cut current segment as a new segment;
traversing and processing the initial segment data of the audio data to obtain sentence segment data of the audio data;
displaying sentence segment data of the audio data on an operation interface, wherein the operation interface is used for providing a display interface and a boundary axis control;
responding to the dragging operation aiming at the boundary axis control, and carrying out boundary adjustment processing or sentence merging processing on the sentence data to obtain processed sentence data;
performing voice recognition processing on the processed sentence segment data to obtain a transcription text;
updating the project engineering file according to the transcription text to obtain an updated project engineering file, wherein the updated project engineering file carries the transcription text;
and displaying the media file and a text segment corresponding to the playing progress of the media file in the transcription text when the updated project engineering file is played on the display interface.
2. The method for multi-modal rapid transcription and labeling based on a self-built template according to claim 1, wherein performing boundary adjustment processing or sentence merging processing on the sentence data in response to the drag operation for the boundary axis control, to obtain processed sentence data, comprises:
responding to a first dragging operation of the movable end of a first boundary axis control of an active sentence segment in the sentence segment data, and controlling the movable end of the first boundary axis control to move to a first position;
judging whether a second boundary axis control overlapped with the movable end of the first boundary axis control exists at the first position, wherein the second boundary axis control is a boundary axis control corresponding to a second sentence segment, and the movable sentence segment and the second sentence segment are adjacent sentence segments;
and if a second boundary axis control overlapped with the movable end of the first boundary axis control exists at the first position, merging the active sentence segment with the second sentence segment.
3. The method for multi-modal rapid transcription and labeling based on self-built templates as claimed in claim 2, further comprising, after said determining whether there is a second boundary axis control at the first location that overlaps with the movable end of the first boundary axis control:
and if the second boundary axis control overlapped with the movable end of the first boundary axis control does not exist at the first position, adjusting the boundary of the active sentence segment according to the first position.
4. The method for multi-modal rapid transcription and labeling based on a self-built template according to claim 1, wherein performing boundary adjustment processing or sentence merging processing on the sentence data in response to the drag operation for the boundary axis control, to obtain processed sentence data, comprises:
responding to a second dragging operation of the movable end of a first boundary axis control for an active sentence segment in the sentence segment data, and controlling the movable end of the first boundary axis control to move to a second position;
judging whether a third boundary axis control overlapped with the movable end of the first boundary axis control exists at the second position, wherein the third boundary axis control is a boundary axis control corresponding to a third sentence segment, and the movable sentence segment and the third sentence segment are non-adjacent sentence segments;
and if a third boundary axis control overlapped with the movable end of the first boundary axis control exists at the second position, merging the active sentence segment, the third sentence segment and an intermediate sentence segment between the active sentence segment and the third sentence segment.
5. The method for multi-modal rapid transcription and labeling based on self-built templates as claimed in claim 4, further comprising, after said determining if there is a third boundary axis control at said second location that overlaps with the movable end of said first boundary axis control:
if a third boundary axis control which is overlapped with the movable end of the first boundary axis control does not exist at the second position, judging whether the target area from the position of the static end of the first boundary axis control to the second position is overlapped with any intermediate sentence segment;
if the target area between the static end position of the first boundary axis control and the second position is not overlapped with any intermediate sentence segment, the boundary of the active sentence segment is adjusted according to the second position; or alternatively
if the target area between the static end position of the first boundary axis control and the second position is overlapped with at least one intermediate sentence segment, merging the active sentence segment and all intermediate sentence segments which have an overlapping relation with the target area.
6. The method for multi-modal rapid transcription and labeling based on self-built templates as claimed in claim 1, wherein said obtaining project engineering files corresponding to media files to be processed comprises:
acquiring a media file to be processed;
detecting whether a corresponding project engineering file has been created for the media file;
if it is detected that no project engineering file corresponding to the media file has been created, creating the project engineering file corresponding to the media file based on a template file; or alternatively
if it is detected that the project engineering file corresponding to the media file has been created, acquiring the created project engineering file corresponding to the media file.
7. The multi-modal rapid transcription and labeling method based on self-built templates as claimed in claim 1, wherein said method further comprises:
and responding to an export instruction carrying a target file type, and exporting an export file corresponding to the target file type from the project engineering file, wherein the target file type belongs to any one of preset file types.
8. The multi-modal rapid transcription and labeling method based on self-built templates as claimed in claim 7, further comprising:
responding to an import instruction, and acquiring an import file;
and when the file type of the imported file belongs to any one of the preset file types, importing the imported file into the project engineering file.
9. The method for multi-modal rapid transcription and labeling based on self-built templates as claimed in claim 1, wherein said displaying sentence segment data of said audio data on an operation interface comprises:
and displaying sentence waveform information of the sentence data of the audio data and time axis information corresponding to the sentence waveform information on an operation interface.
10. The multi-modal rapid transcription and labeling method based on self-built templates as claimed in claim 9, wherein said method further comprises:
and hiding the sentence segment waveform information and the time axis information on an operation interface in response to the waveform hiding instruction.
11. The multi-modal rapid transcription and labeling method based on self-built templates as claimed in claim 1, wherein said method further comprises:
and in response to breakpoint insertion operation for a target sentence in the sentence data, inserting a breakpoint in a boundary axis control of the target sentence, so as to segment the target sentence based on the breakpoint.
12. The method for multi-modal rapid transcription and labeling based on self-built templates as claimed in claim 1, wherein the transcription text includes a text segment corresponding to each sentence segment in the sentence segment data, and after the speech recognition processing is performed on the processed sentence segment data to obtain the transcription text, the method further comprises:
and modifying the target text segment in the transcribed text in response to a modification instruction for the target text segment in the transcribed text to obtain a modified transcribed text, wherein the target text segment is at least one text segment in the transcribed text.
13. The multi-modal rapid transcription and labeling method based on self-built templates as claimed in claim 12, further comprising:
and marking the target text segment in response to a marking instruction aiming at the target text segment, and obtaining marked transfer text.
14. A multi-modal rapid transcription and annotation system based on a self-built template, the system comprising:
the first acquisition unit is used for acquiring project engineering files corresponding to the media files to be processed;
the second acquisition unit is used for acquiring the audio data of the media file according to the catalogue of the project engineering file;
a segmentation unit for:
performing initial segmentation processing on the audio data according to a preset language template to obtain initial segmentation data of the audio data, wherein the preset language template comprises language templates corresponding to different national languages, dialects of different areas and voices of different personas;
judging whether the average amplitude in the current segment in the initial segment data is larger than a noise amplitude threshold value or not;
if the average amplitude in the current segment in the initial segment data is larger than the noise amplitude threshold value, marking the current segment as a voiced segment;
cutting a sentence segment start point and a sentence segment end point from the audio points in the current segment marked as a voiced segment, so as to remove silence or noise in the current segment;
if the starting point position of the cut current segment is the same as the ending point position of the last segment, merging the cut current segment with the last segment;
if the starting point position of the current cut segment is different from the ending point position of the last segment, marking the current cut segment as a new segment;
traversing and processing the initial segment data of the audio data to obtain sentence segment data of the audio data;
the display unit is used for displaying sentence segment data of the audio data on an operation interface, and the operation interface is used for providing a display interface and a boundary axis control;
the processing unit is used for responding to the dragging operation of the boundary axis control and carrying out boundary adjustment processing or sentence merging processing on the sentence data to obtain processed sentence data;
the transcription unit is used for carrying out voice recognition processing on the processed sentence segment data to obtain a transcription text;
the updating unit is used for updating the project engineering file according to the transcription text to obtain an updated project engineering file, wherein the updated project engineering file carries the transcription text;
and the playing unit is used for displaying the media file and the text segment corresponding to the playing progress of the media file in the transcription text when the updated project engineering file is played on the display interface.
15. A computer-readable storage medium, characterized in that it stores a computer program adapted to be loaded by a processor for performing the steps of the multi-modal rapid transcription and labeling method based on self-built templates according to any of claims 1-13.
CN202280002307.8A 2022-05-06 2022-05-06 Multi-mode rapid transfer and labeling system based on self-built template Active CN115136233B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/091181 WO2023212920A1 (en) 2022-05-06 2022-05-06 Multi-modal rapid transliteration and annotation system based on self-built template

Publications (2)

Publication Number Publication Date
CN115136233A CN115136233A (en) 2022-09-30
CN115136233B true CN115136233B (en) 2023-09-22

Family

ID=83387058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280002307.8A Active CN115136233B (en) 2022-05-06 2022-05-06 Multi-mode rapid transfer and labeling system based on self-built template

Country Status (2)

Country Link
CN (1) CN115136233B (en)
WO (1) WO2023212920A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681530A (en) * 2018-05-04 2018-10-19 北京天元创新科技有限公司 A kind of official document generation method and system based on Web
CN110740275A (en) * 2019-10-30 2020-01-31 中央电视台 nonlinear editing systems
CN112487238A (en) * 2020-10-27 2021-03-12 百果园技术(新加坡)有限公司 Audio processing method, device, terminal and medium
CN114268829A (en) * 2021-12-22 2022-04-01 中电金信软件有限公司 Video processing method and device, electronic equipment and computer readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080270437A1 (en) * 2007-04-26 2008-10-30 Custom Speech Usa, Inc. Session File Divide, Scramble, or Both for Manual or Automated Processing by One or More Processing Nodes
EP2172936A3 (en) * 2008-09-22 2010-06-09 a-Peer Holding Group, LLC Online video and audio editing
US9666208B1 (en) * 2015-12-14 2017-05-30 Adobe Systems Incorporated Hybrid audio representations for editing audio content
CN107657947B (en) * 2017-09-20 2020-11-24 百度在线网络技术(北京)有限公司 Speech processing method and device based on artificial intelligence
CN111753558B (en) * 2020-06-23 2022-03-04 北京字节跳动网络技术有限公司 Video translation method and device, storage medium and electronic equipment
CN114420125A (en) * 2020-10-12 2022-04-29 腾讯科技(深圳)有限公司 Audio processing method, device, electronic equipment and medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681530A (en) * 2018-05-04 2018-10-19 北京天元创新科技有限公司 A kind of official document generation method and system based on Web
CN110740275A (en) * 2019-10-30 2020-01-31 中央电视台 nonlinear editing systems
CN112487238A (en) * 2020-10-27 2021-03-12 百果园技术(新加坡)有限公司 Audio processing method, device, terminal and medium
CN114268829A (en) * 2021-12-22 2022-04-01 中电金信软件有限公司 Video processing method and device, electronic equipment and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on the Construction of a Sign Language Corpus for Computer Majors; Sun Xiao; Fu Nanjun; Yang Lian; Li Kai; Han Mei; Intelligent Computer and Applications (06); entire document *
Construction of a Pivotal-Construction Corpus for the Chinese AMR Annotation Scheme and Recognition of Pivotal Structures; Hou Wenhui et al.; Journal of Tsinghua University (Science and Technology); entire document *

Also Published As

Publication number Publication date
WO2023212920A1 (en) 2023-11-09
CN115136233A (en) 2022-09-30

Similar Documents

Publication Publication Date Title
US20180286459A1 (en) Audio processing
CN102256049B (en) Automation story generates
US20100003006A1 (en) Video searching apparatus, editing apparatus, video searching method, and program
KR101674851B1 (en) Automatically creating a mapping between text data and audio data
US8819545B2 (en) Digital comic editor, method and non-transitory computer-readable medium
CN110740275B (en) Nonlinear editing system
US20160117311A1 (en) Method and Device for Performing Story Analysis
WO2022001579A1 (en) Audio processing method and apparatus, device, and storage medium
JP6814871B2 (en) Voice control method for electronic devices, voice control devices for electronic devices, computer equipment and storage media
CN114095782A (en) Video processing method and device, computer equipment and storage medium
EP4322029A1 (en) Method and apparatus for generating video corpus, and related device
CN114023301A (en) Audio editing method, electronic device and storage medium
US20230343325A1 (en) Audio processing method and apparatus, and electronic device
CN112040142B (en) Method for video authoring on mobile terminal
CN109800296A (en) A kind of meaning of one's words fuzzy recognition method based on user's true intention
CN114023287A (en) Audio mixing processing method and device for audio file, terminal and storage medium
CN115136233B (en) Multi-mode rapid transfer and labeling system based on self-built template
KR102416818B1 (en) Methods and apparatuses for controlling voice of electronic devices, computer device and storage media
CN113761865A (en) Sound and text realignment and information presentation method and device, electronic equipment and storage medium
KR102697195B1 (en) Method Of Providing AI Human Video Creation Interface That Provides Sub-content Editing Function
CN116721662B (en) Audio processing method and device, storage medium and electronic equipment
WO2022137351A1 (en) Layout method, layout device, and program
KR20220136801A (en) Method and apparatus for providing associative chinese learning contents using images
KR20240053400A (en) Natural language processing-based presentation assistance method and apparatus
CN116682465A (en) Method for recording content and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant