CN115136233A - Multi-mode rapid transcription and labeling system based on self-built template - Google Patents

Multi-mode rapid transcription and labeling system based on self-built template

Info

Publication number: CN115136233A
Application number: CN202280002307.8A
Authority: CN (China)
Prior art keywords: sentence, segment, file, boundary, data
Legal status: Granted; currently active
Other languages: Chinese (zh)
Other versions: CN115136233B
Inventor: Li Bin (李斌)
Current and original assignee: Hunan Normal University
Application filed by Hunan Normal University
Publication of CN115136233A; application granted and CN115136233B published

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44: Arrangements for executing specific programs
    • G06F 9/451: Execution arrangements for user interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a multi-modal rapid transcription and labeling system based on a self-built template, comprising: a first acquisition unit, which acquires the project engineering file corresponding to a media file; a second acquisition unit, which acquires the audio data of the media file according to the directory of the project engineering file; a segmentation unit, which segments the audio data according to its amplitude to obtain sentence segment data of the audio data; a display unit, which displays the sentence segment data on an operation interface, the operation interface providing a display interface and a boundary axis control; a processing unit, which, in response to an editing operation on the boundary axis control, performs boundary adjustment or sentence segment merging on the sentence segment data to obtain processed sentence segment data, and then performs speech recognition processing to obtain a transcription text; an updating unit, which updates the project engineering file according to the transcription text; and a playing unit, which, when the updated project engineering file is played on the display interface, displays the media file and the text segment of the transcription text corresponding to the playing progress of the media file.

Description

Multi-mode rapid transcription and labeling system based on self-built template
Technical Field
The application relates to the technical field of voice processing, in particular to a multi-modal rapid transcription and labeling method based on a self-built template, a multi-modal rapid transcription and labeling system based on the self-built template and a storage medium.
Background
With the development of computer technology, speech recognition technology has found increasingly wide application. Speech recognition identifies the corresponding speech content from collected speech information, i.e., it converts digital speech signals into the corresponding text.
Speech transcription techniques are used to convert speech into text. They serve both simple single-speaker transcription and complex multi-speaker transcription, such as conference transcription, court-trial transcription, and classroom transcription.
However, existing speech transcription and labeling tools cannot build their own language templates and extend poorly. They also cannot perform rapid merging and fine boundary adjustment of sentence segments, so they fail to meet the needs of many real-world scenarios, for example: producing external video subtitles (SRT), producing external MP3 lyrics (LRC), transcribing various recordings, listening instruction, audio-visual speaking instruction, spoken corpus construction, multimedia resource library construction, situated language research, and multi-modal research on classroom teaching.
Disclosure of Invention
The embodiments of the application provide a multi-modal rapid transcription and labeling method based on a self-built template, a multi-modal rapid transcription and labeling system based on the self-built template, and a storage medium. They offer a simple and convenient way to transcribe and label speech: transcription and labeling are driven by self-built language templates, sentence segments can be merged quickly and their boundaries finely adjusted, transcription and labeling efficiency is improved, and the requirements of a wide range of usage scenarios are met.
In one aspect, a multi-modal rapid transcription and labeling method based on a self-built template is provided, the method comprising: acquiring the project engineering file corresponding to a media file to be processed; acquiring audio data of the media file according to the directory of the project engineering file; segmenting the audio data according to its amplitude to obtain sentence segment data of the audio data; displaying the sentence segment data of the audio data on an operation interface, the operation interface providing a display interface and a boundary axis control; in response to an editing operation on the boundary axis control, performing boundary adjustment processing or sentence segment merging processing on the sentence segment data to obtain processed sentence segment data; performing speech recognition processing on the processed sentence segment data to obtain a transcription text; updating the project engineering file according to the transcription text to obtain an updated project engineering file that carries the transcription text; and, when the updated project engineering file is played on the display interface, displaying the media file and the text segment of the transcription text corresponding to the playing progress of the media file.
In some embodiments, performing boundary adjustment processing or sentence segment merging processing on the sentence segment data in response to the editing operation on the boundary axis control includes: in response to a first editing operation on the active end of a first boundary axis control of an active sentence segment in the sentence segment data, controlling the active end of the first boundary axis control to move to a first position; determining whether a second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, the second boundary axis control being the boundary axis control corresponding to a second sentence segment, the active sentence segment and the second sentence segment being adjacent; and, if such a second boundary axis control exists at the first position, merging the active sentence segment and the second sentence segment.
In some embodiments, after determining whether a second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, the method further includes: if no such second boundary axis control exists at the first position, adjusting the boundary of the active sentence segment according to the first position.
In some embodiments, performing boundary adjustment processing or sentence segment merging processing on the sentence segment data in response to the editing operation on the boundary axis control includes: in response to a second editing operation on the active end of the first boundary axis control of an active sentence segment in the sentence segment data, controlling the active end of the first boundary axis control to move to a second position; determining whether a third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, the third boundary axis control being the boundary axis control corresponding to a third sentence segment, the active sentence segment and the third sentence segment being non-adjacent; and, if such a third boundary axis control exists at the second position, merging the active sentence segment, the third sentence segment, and the intermediate sentence segments between them.
In some embodiments, after determining whether a third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, the method further includes: if no such third boundary axis control exists at the second position, determining whether the target region between the static-end position of the first boundary axis control and the second position overlaps any intermediate sentence segment; if the target region overlaps no intermediate sentence segment, adjusting the boundary of the active sentence segment according to the second position; or, if the target region overlaps at least one intermediate sentence segment, merging the active sentence segment with all intermediate sentence segments that overlap the target region.
In some embodiments, segmenting the audio data according to its amplitude to obtain sentence segment data of the audio data includes: segmenting the audio data according to the relationship between a noise amplitude threshold and the amplitude of the audio data to obtain the sentence segment data of the audio data.
In some embodiments, segmenting the audio data according to the relationship between the noise amplitude threshold and the amplitude of the audio data includes: acquiring initial segmentation data of the audio data; determining whether the average amplitude within the current segment of the initial segmentation data is greater than the noise amplitude threshold; if so, marking the current segment as a voiced segment; trimming the sentence start point and end point of the audio within the segment marked as voiced, to remove silence or noise at its edges; if the start position of the trimmed current segment coincides with the end position of the previous segment, merging the trimmed current segment with the previous segment; otherwise, marking the trimmed current segment as a new segment; and traversing the initial segmentation data in this way to obtain the sentence segment data of the audio data.
In some embodiments, obtaining the initial segmentation data of the audio data comprises: performing initial segmentation processing on the audio data according to a preset language template to obtain the initial segmentation data of the audio data.
In some embodiments, acquiring the project engineering file corresponding to the media file to be processed includes: acquiring the media file to be processed; detecting whether a corresponding project engineering file has been created for the media file; if not, creating the project engineering file corresponding to the media file based on a template file; or, if a corresponding project engineering file has already been created, acquiring that project engineering file.
In some embodiments, the method further comprises: in response to an export instruction carrying a target file type, exporting from the project engineering file an export file of the target file type, the target file type being any one of the preset file types.
In some embodiments, the method further comprises: in response to an import instruction, acquiring an import file; and, when the file type of the import file is any one of the preset file types, importing the import file into the project engineering file.
In some embodiments, displaying the sentence segment data of the audio data on the operation interface includes: displaying, on the operation interface, the sentence segment waveform information of the sentence segment data and the time axis information corresponding to that waveform information.
In some embodiments, the method further comprises: in response to a hide-waveform instruction, hiding the sentence segment waveform information and the time axis information on the operation interface.
In some embodiments, the method further comprises: in response to a breakpoint insertion operation on a target sentence segment in the sentence segment data, inserting a breakpoint into the boundary axis control of the target sentence segment, so that the target sentence segment is split at the breakpoint.
In some embodiments, the transcription text includes a text segment corresponding to each sentence segment in the sentence segment data, and after performing speech recognition processing on the processed sentence segment data to obtain the transcription text, the method further includes: in response to a modification instruction for a target text segment in the transcription text, modifying the target text segment to obtain a modified transcription text, the target text segment being at least one text segment in the transcription text.
In some embodiments, the method further comprises: in response to a labeling instruction for the target text segment, labeling the target text segment to obtain a labeled transcription text.
In another aspect, a multi-modal fast transcription and annotation system based on a self-built template is provided, the system comprising:
the first acquisition unit is used for acquiring project engineering files corresponding to the media files to be processed;
the second acquisition unit is used for acquiring the audio data of the media file according to the directory of the project engineering file;
the segmenting unit is used for segmenting the audio data according to the amplitude of the audio data to obtain sentence segment data of the audio data;
the display unit is used for displaying the sentence segment data of the audio data on an operation interface, and the operation interface is used for providing a display interface and a boundary axis control;
the processing unit is used for responding to the editing operation aiming at the boundary axis control, and carrying out boundary adjustment processing or sentence segment combination processing on the sentence segment data to obtain processed sentence segment data;
the transcription unit is used for carrying out voice recognition processing on the processed sentence segment data to obtain a transcription text;
the updating unit is used for updating the project engineering file according to the transcription text to obtain an updated project engineering file, and the updated project engineering file carries the transcription text;
and the playing unit is used for displaying the media file and the text segment corresponding to the playing progress of the media file in the transcription text when the updated project engineering file is played on the display interface.
In another aspect, a computer-readable storage medium is provided, which stores a computer program adapted to be loaded by a processor to perform the steps of the self-built template-based multi-modal fast transcription and annotation method according to the first aspect.
In another aspect, a terminal device is provided, where the terminal device includes a processor and a memory, where the memory stores a computer program, and the processor is configured to execute the steps in the self-built template-based multimodal rapid transcription and annotation method according to the first aspect by calling the computer program stored in the memory.
The embodiments of the application provide a multi-modal rapid transcription and labeling method based on a self-built template, a multi-modal rapid transcription and labeling system based on the self-built template, and a storage medium. The project engineering file corresponding to a media file to be processed is acquired; the audio data of the media file is acquired according to the directory of the project engineering file; the audio data is segmented according to its amplitude to obtain sentence segment data; the sentence segment data is displayed on an operation interface that provides a display interface and a boundary axis control; in response to an editing operation on the boundary axis control, boundary adjustment processing or sentence segment merging processing is performed on the sentence segment data to obtain processed sentence segment data; speech recognition processing is performed on the processed sentence segment data to obtain a transcription text; the project engineering file is updated according to the transcription text, the updated file carrying the transcription text; and, when the updated project engineering file is played on the display interface, the media file is displayed together with the text segment of the transcription text corresponding to the playing progress of the media file.
The embodiments of the application thus provide a simple and convenient mode of speech transcription and labeling. Transcription of multiple languages is achieved through self-built multi-language templates, and template import is supported for the many languages and dialects that speech recognition cannot yet handle, so that fast and efficient sentence breaking, transcription and labeling are ultimately achieved. Sentence segments are merged quickly by dragging the boundary axis controls of the sentence segments shown on the operation interface, and boundaries are finely adjusted by horizontally dragging the boundary axis control of a sentence segment waveform directly on the operation interface. This improves transcription and labeling efficiency and meets the requirements of a wide range of usage scenarios.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a multi-modal fast transcription and labeling method based on a self-built template according to an embodiment of the present application.
Fig. 2 is a schematic view of a first application scenario provided in the embodiment of the present application.
Fig. 3 is a schematic view of a second application scenario provided in the embodiment of the present application.
Fig. 4 is a schematic diagram of a third application scenario provided in the embodiment of the present application.
Fig. 5 is a schematic diagram of a fourth application scenario provided in the embodiment of the present application.
Fig. 6 is a schematic view of a fifth application scenario provided in the embodiment of the present application.
Fig. 7 is a schematic view of a sixth application scenario provided in the embodiment of the present application.
Fig. 8 is a schematic structural diagram of a multi-modal fast transcription and labeling system based on a self-built template according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiments of the application provide a multi-modal rapid transcription and labeling method based on a self-built template, a multi-modal rapid transcription and labeling system based on the self-built template, and a storage medium. Specifically, the method of the embodiments may be executed by a terminal device, which may be a terminal or a server. The terminal may be a smartphone, a tablet computer, a touch-screen device, a personal computer (PC), or similar terminal equipment. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content distribution network services, and big data and artificial intelligence platforms, but is not limited thereto.
The following are detailed below. It should be noted that the description sequence of the following embodiments is not intended to limit the priority sequence of the embodiments.
Referring to fig. 1 to 7, fig. 1 is a schematic flowchart of a multi-modal rapid transcription and labeling method based on a self-built template according to an embodiment of the present application, and fig. 2 to 7 are schematic diagrams of application scenarios of the embodiment. The method may be applied in the multi-modal rapid transcription and labeling system based on the self-built template, and that system may be configured on a terminal device. The method comprises the following steps:
Step 110: acquire the project engineering file corresponding to the media file to be processed.
In some embodiments, acquiring the project engineering file corresponding to the media file to be processed includes: acquiring the media file to be processed; detecting whether a corresponding project engineering file has been created for the media file; if not, creating the project engineering file corresponding to the media file based on a template file; or, if a corresponding project engineering file has already been created, acquiring that project engineering file.
For example, a target client may be provided and started, and the media file to be processed is then opened or imported through the target client. The media file may be an audio file or a video file.
For example, the target client may be tool software developed on the basis of the self-built-template multi-modal rapid transcription and labeling system for the rapid transcription and labeling of audio and video language material. The software may ship with multi-language templates for Mandarin, Chinese dialects, minority languages and the like, directly supporting the transcription work of Chinese language resource protection projects. The multi-language template may be a multi-layer annotation template. Multi-language templates may also be self-built according to project needs, for example by building in transcription and labeling templates for different languages. In addition, the target client may be applied in many scenarios, such as producing external video subtitles (.SRT), producing external MP3 lyrics (.LRC), transcribing various recordings, listening instruction, audio-visual speaking instruction, spoken corpus construction, multimedia resource library construction, situated language research, and multi-modal research on classroom teaching.
Then, whether a project engineering file has been created for the media file is detected by checking whether a project engineering file with the same name as the media file exists in the storage path. For media files that have been opened before, the target client keeps a history record, so that the same-named project engineering file recorded there is loaded directly the next time the media file is opened; a project engineering file only needs to be created for media files opened for the first time or absent from the history, which streamlines the processing flow. The history record here is the information the target client keeps about media files opened in the past.
For example, if there is a project engineering file with the same name as the media file, it is determined that the corresponding project engineering file has been created for the media file, and the project engineering file with the same name as the media file created in the storage path is directly obtained, and step 120 is performed.
For example, if there is no project file with the same name as the media file, the project file with the same name corresponding to the media file is created based on the template file, and the corresponding project file is loaded, and then step 120 is executed.
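As an illustration, this lookup-or-create flow can be sketched in Python as follows; the .baf extension is taken from the file-format description later in this document, while the function name and the template path are placeholders of this sketch rather than names from the patent.

```python
import shutil
from pathlib import Path

def get_or_create_project_file(media_path: str, template: str = "template.baf") -> Path:
    media = Path(media_path)
    project = media.with_suffix(".baf")   # same name as the media file, same storage path
    if not project.exists():              # first open: create the project file from the template
        shutil.copyfile(template, project)
    return project                        # later opens reuse the existing project file
```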
Step 120: acquire the audio data of the media file according to the directory of the project engineering file.
For example, an audio/video data parsing thread is started; according to the media file information recorded in the directory of the project engineering file, the media file to be processed is located in its storage path, and the audio data of the media file is extracted from it by the parsing thread.
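A minimal sketch of this extraction step, assuming the ffmpeg command-line tool is available as the decoding backend (the patent does not name a particular parser):

```python
import subprocess

def extract_audio(media_file: str, wav_out: str, rate: int = 16000) -> None:
    # Decode the media file (audio or video) and write a mono WAV track.
    subprocess.run(
        ["ffmpeg", "-y", "-i", media_file,
         "-vn",                 # drop any video stream
         "-ac", "1",            # mix down to one channel
         "-ar", str(rate),      # resample to a fixed rate
         wav_out],
        check=True,
    )
```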
Step 130: segment the audio data according to its amplitude to obtain sentence segment data of the audio data.
For example, before segmentation is performed, it must first be determined whether the audio data needs to be segmented at all. If segmentation is needed, the audio data is segmented and, once segmentation finishes, a segmentation-finished notification is sent to the main thread. If no segmentation is needed, the segmentation-finished notification is sent to the main thread directly.
Whether the audio data needs segmentation can be determined by detecting whether the audio data in the project engineering file already has divided sentence segment data. If divided sentence segment data exists, no segmentation is needed; if it does not exist, the audio data needs to be segmented.
In some embodiments, segmenting the audio data according to its amplitude to obtain sentence segment data includes: segmenting the audio data according to the relationship between a noise amplitude threshold and the amplitude of the audio data to obtain the sentence segment data of the audio data.
For example, the audio data may first be segmented initially, either at a preset segmentation interval or at silent sections. A second segmentation pass is then performed according to the relationship between the noise amplitude threshold and the amplitude of the audio data, yielding the sentence segment data of the audio data.
In some embodiments, segmenting the audio data according to the magnitude relationship between the noise amplitude threshold and the amplitude of the audio data to obtain sentence segment data of the audio data, includes: acquiring initial segmentation data of the audio data; judging whether the average amplitude in the current subsection in the initial subsection data is larger than a noise amplitude threshold value or not; if the average amplitude in the current subsection in the initial subsection data is larger than the noise amplitude threshold value, marking the current subsection as a sound section; cutting the beginning point and the end point of the sentence segment of the audio point in the current segment marked as the voiced segment to remove the silence or the noise in the current segment; if the starting position of the cut current subsection is the same as the end position of the last subsection, merging the cut current subsection and the last subsection; if the starting position of the current section after being cut is different from the end position of the last section, marking the current section after being cut as a new section; and traversing initial segmentation data of the audio data to obtain sentence fragment data of the audio data.
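A compact Python sketch of this second segmentation pass; the per-sample trimming rule and the (start, end) index representation are assumptions of the sketch, not details fixed by the patent:

```python
import numpy as np

def refine_segments(samples: np.ndarray, initial: list, noise_thr: float) -> list:
    """Second pass: mark voiced segments, trim their edges, merge touching ones."""
    sentences = []
    for start, end in initial:                   # initial is a list of (start, end) indices
        seg = np.abs(samples[start:end])
        if seg.mean() <= noise_thr:              # not voiced: discard the segment
            continue
        voiced = np.where(seg > noise_thr)[0]    # trim silence/noise at both ends
        if voiced.size == 0:
            continue
        s, e = start + voiced[0], start + voiced[-1] + 1
        if sentences and s <= sentences[-1][1]:  # start touches previous end: merge
            sentences[-1] = (sentences[-1][0], e)
        else:
            sentences.append((s, e))             # otherwise a new sentence segment
    return sentences
```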
In some embodiments, obtaining the initial segmentation data of the audio data comprises: performing initial segmentation processing on the audio data according to a preset language template to obtain the initial segmentation data of the audio data.
For example, the preset language template has the ability to segment sentence segments. It may be a multi-language template built into, or self-built within, the target client, enabling fast creation of the initial segmentation data. The multi-language template may be a multi-layer annotation template, for example a set of language templates covering the languages of different countries, dialects of different regions, and voices of different speakers, such as English, Mandarin, minority languages, Chinese dialects, female voices, male voices, and children's voices. A built-in multi-language template may be installed through third-party software and enables the transcription of multiple languages; a self-built multi-language template is built directly within the target client and enables transcription and labeling of multiple languages by building several language templates oneself.
In some embodiments, the preset language templates include multi-language templates built into or self-built within the target client, covering languages of different countries, dialects of different regions, voices of different speakers, and the like. Because different speakers' genders and languages produce different noise characteristics, judging with a single fixed noise threshold can make speech segmentation one-sided. In this embodiment, therefore, the noise amplitude threshold for the current segment is generated automatically from the speech signal of that segment. For example, a noise-amplitude-threshold generation module may be built in, and the preset language template fed into it, so that the noise amplitude threshold for the speech signal of the current segment is determined adaptively.
Specifically, in this embodiment, the speech signal corresponding to the current segment is obtained, and the amplitude distribution function corresponding to the speech signal of the current segment is obtained by fitting:
f(x) = (1 / (√(2π)·σ)) · exp(−x² / (2σ²))
where x denotes the signal amplitude of the speech in the current segment and σ denotes the signal variance of the speech in the current segment;
determining the signal standard deviation of the current segment's speech based on the amplitude distribution function; and
determining, based on the product of the standard deviation, the average amplitude and a preset amplitude factor, the noise amplitude threshold of the current segment's speech as:
T_am = α · x̄ · σ̂
where T_am denotes the noise amplitude threshold, σ̂ the standard deviation, x̄ the average amplitude, and α the preset amplitude factor. In this embodiment, determining the noise amplitude threshold in this way and segmenting the speech accordingly allows noise and non-noise in the speech to be detected adaptively according to the actual speech conditions, improving the accuracy of noise detection and segmentation.
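A small numerical sketch of this adaptive threshold, assuming a zero-mean fit for the amplitude distribution; the default value of the amplitude factor α is purely illustrative:

```python
import numpy as np

def noise_amplitude_threshold(samples: np.ndarray, alpha: float = 0.5) -> float:
    """Adaptive threshold T_am = alpha * average amplitude * standard deviation."""
    x = samples.astype(float)
    amp = np.abs(x)
    sigma = np.sqrt(np.mean(x ** 2))   # standard deviation under a zero-mean fit
    return alpha * amp.mean() * sigma
```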
For example, the audio data may be initially segmented at a preset segmentation interval, the interval being set according to typical sentence-break durations.
For example, the audio data may instead be initially segmented on the basis of silence segments: the silence segments in the audio data are detected and the initial segmentation is performed at their positions, the head end of each silence segment joining the end of the previous initial segment and its tail end joining the head of the next initial segment.
For example, to avoid splitting a complete sentence into too many initial segments at the short silences produced by ordinary punctuation pauses, short silence segments may be ignored before initial segmentation: only silence segments whose audio length exceeds a preset length are taken as target silence segments and used as the basis for segmentation. That is, the silence segments in the audio data are detected, those longer than the preset length are selected as target silence segments, and the initial segmentation is then performed at the positions of the target silence segments in the audio data.
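A sketch of this silence-based initial segmentation on a mono sample array; the silence amplitude threshold and the minimum silence length are illustrative parameters:

```python
import numpy as np

def initial_segments(samples: np.ndarray, sil_thr: float, min_sil: int) -> list:
    """Split at silent runs of at least min_sil samples; shorter ones are ignored."""
    quiet = np.abs(samples) < sil_thr
    segments, seg_start, run = [], 0, 0
    for i, q in enumerate(quiet):
        run = run + 1 if q else 0
        if run == min_sil and seg_start is not None:
            end = i - min_sil + 1            # segment ends where the long silence begins
            if end > seg_start:
                segments.append((seg_start, end))
            seg_start = None                 # now inside a target silence segment
        elif not q and seg_start is None:
            seg_start = i                    # sound resumes: next segment starts here
    if seg_start is not None:
        segments.append((seg_start, len(samples)))
    return segments
```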
Then, the initial segmentation data is segmented a second time according to the relationship between the noise amplitude threshold and the amplitude of the audio data. Specifically, whether the average amplitude within the current segment is greater than the noise amplitude threshold is determined. If it is, the current segment is marked as a voiced segment and the sentence start point and end point of the audio within it are trimmed to remove silence or noise; if the start position of the trimmed current segment coincides with the end position of the previous segment, the two are merged and the merged segment becomes one sentence segment in the sentence segment data; if it does not coincide, the trimmed current segment is marked as a new segment, which becomes a sentence segment in the sentence segment data.
For example, if the average amplitude in the current segment in the initial segment data is not greater than the noise amplitude threshold, the current segment is marked as an unvoiced segment, and the current segment marked as an unvoiced segment may be discarded and not used as a sentence segment in the sentence segment data.
Step 140: display the sentence segment data of the audio data on an operation interface, the operation interface providing a display interface and a boundary axis control.
For example, as shown in fig. 2, an operation interface 200 of the target client is provided; sentence segment data 201 of the audio data is displayed on the operation interface 200, and the operation interface 200 provides a display interface 202 and a boundary axis control 203.
For example, other editing or operation interfaces may also be displayed on the operation interface 200: menu interfaces such as File, Edit, Settings and Help; mode interfaces such as transcription mode, labeling mode and full-text mode; and playback controls of the display interface.
In some embodiments, displaying the sentence segment data of the audio data on the operation interface includes: displaying, on the operation interface, the sentence segment waveform information of the sentence segment data and the time axis information corresponding to that waveform information.
In some embodiments, the method further comprises: in response to a hide-waveform instruction, hiding the sentence segment waveform information and the time axis information on the operation interface.
For example, the sentence segment waveform information and the time axis information can be shown or hidden according to instructions input by the user, making the display flexible.
Step 150: in response to an editing operation on the boundary axis control, perform boundary adjustment processing or sentence segment merging processing on the sentence segment data to obtain processed sentence segment data.
For example, sentence segment boundary adjustment or sentence segment merging can be carried out by dragging a boundary axis control. The sentence segment waveform is displayed on the operation interface, and the boundary axis control of a displayed sentence segment can be dragged horizontally to the left or right to fine-tune its boundary directly.
For example, the interaction may proceed as follows: right-click the boundary axis control to record the current active sentence segment information and cache the list of all current sentence segments; long-press the left mouse button to trigger a drag of the boundary axis control; while an active sentence segment exists, keep updating its temporary left and right boundary points; on release of the left button, check whether a drag actually took place and, if so, obtain the sentence segment under the mouse; finally, check whether the merging condition is met, merging the sentence segments if it is, and updating the boundary information of the active sentence segment if it is not.
For example, when checking whether the merging condition is met, the main test is whether the final boundary point of the active sentence segment passes the near boundary of the segment to be merged. When merging to the right, the right boundary of the active segment must pass the left boundary of the segment being merged, and the two segments must be different. When merging to the left, the left boundary of the active segment must pass the right boundary of the segment being merged, and again the two segments must be different.
For example, the logic for capturing the ending sentence segment is: traverse the whole sentence segment list in order, comparing each segment's left and right boundaries with the horizontal position where the mouse ended. When merging left, the first segment whose right boundary is greater than the mouse end position is the ending segment; when merging right, the first segment whose left boundary is greater than the mouse end position indicates that the segment before it is the ending segment.
For example, taking a right merge: when checking the merging condition, first detect whether an ending sentence segment exists; if none exists, no merge is possible and the condition is not met. If one exists, check whether it is the same segment as the active one; if it is, no merge is possible and the condition is not met. If it is not, check whether the current right boundary of the active segment is greater than the left boundary of the ending segment; if so, the segments are merged and the condition is met; otherwise no merge is performed and the condition is not met.
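The right-merge condition just described reduces to a small predicate; representing sentence segments as (left, right) boundary pairs is an assumption of this sketch. The left-merge check is symmetric, comparing the active segment's left boundary against the ending segment's right boundary.

```python
from typing import Optional, Tuple

Seg = Tuple[float, float]   # (left boundary, right boundary) on the timeline

def can_merge_right(active: Seg, ending: Optional[Seg]) -> bool:
    if ending is None:            # no ending sentence segment captured
        return False
    if active == ending:          # same left and right boundaries: same segment
        return False
    return active[1] > ending[0]  # active right boundary must pass its left boundary
```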
For example, the view-change diagram of operation interface 300 in fig. 3 illustrates adjusting a sentence segment boundary. The user hovers the mouse over the first boundary axis control 3031 of the segment to be adjusted; the terminal determines the active sentence segment 3011 from the hover position; the user then long-presses the left mouse button to start dragging the boundary tab at one end of the first boundary axis control 3031 and releases the button at the chosen position, completing the drag, whereupon the boundary of the active sentence segment 3011 is updated to the new position. The editing operation on the first boundary axis control 3031 may be a drag operation, a click operation, or the like. Taking a drag as the example, the end tab of the first boundary axis control 3031 that is not dragged is called the static end and sits at position A; the dragged end tab is called the active end and sits at position B before the drag. Diagram 3-1 of fig. 3 shows the screen before the drag, and diagram 3-2 of fig. 3 shows the screen after the drag, with the boundary position of the first boundary axis control 3031 updated. In response to a first editing operation on the active end of the first boundary axis control 3031 of the active sentence segment 3011, the active end is moved from position B to position C to adjust the boundary. If the dragged boundary of the active sentence segment does not fall within the range of any other sentence segment, the boundary tab of the active sentence segment is updated to position C, i.e., the boundary of the active sentence segment 3011 changes from segment AB to segment AC.
For example, the view-change diagram of operation interface 400 in fig. 4 illustrates a sentence segment merging operation. The user hovers the mouse over the first boundary axis control 4031 of the segment to be adjusted; the terminal determines the current active sentence segment 4011 from the hover position; the user then long-presses the left mouse button to start dragging the boundary tab at one end of the first boundary axis control 4031 and releases the button at the chosen position, completing the drag, whereupon the boundary of the active sentence segment 4011 is updated to the new position. The editing operation on the first boundary axis control 4031 may be a drag operation, a click operation, or the like. Taking a drag as the example, the end tab of the first boundary axis control 4031 that is not dragged is the static end, at position D; the dragged end tab is the active end, at position E before the drag. Diagram 4-1 of fig. 4 shows the screen before the drag, diagram 4-2 shows the boundary position of the first boundary axis control 4031 changing during the drag, and diagram 4-3 shows the merged sentence segments after the drag. In response to a first editing operation on the active end of the first boundary axis control 4031 of the active sentence segment 4011, the active end is moved from position E past position A to position F. When a boundary tab is dragged into another sentence segment, it may be displayed with an icon different from the other boundary tabs; for example, once the active end has moved from position E past position A to position F inside another segment, the icon at the active end at position F may take a light-blue candle shape while the other boundary tabs show as red right-angle icons, and releasing the mouse performs the merge. If the dragged boundary of the active sentence segment passes the near boundary of other sentence segments, all sentence segments within the overlapped range can be merged. For example, when the boundary of the active sentence segment 4011 passes the left boundary (position A) of the other sentence segment 4012, the two are merged into merged sentence segment 4013, whose boundary axis control 4033 spans segment DC.
In some embodiments, performing boundary adjustment processing or sentence segment merging processing on the sentence segment data in response to the editing operation on the boundary axis control includes: in response to a first editing operation on the active end of the first boundary axis control of an active sentence segment in the sentence segment data, controlling the active end of the first boundary axis control to move to a first position; determining whether a second boundary axis control overlapping the active end exists at the first position, the second boundary axis control corresponding to a second sentence segment adjacent to the active sentence segment; and, if such a second boundary axis control exists at the first position, merging the active sentence segment and the second sentence segment.
In some embodiments, to avoid the back-end program merging a sentence segment with itself when processing the audio data, it is further determined whether the active sentence segment and the second sentence segment are the same before merging them. Specifically, it is checked whether the left boundaries of the two segments are the same and the right boundaries are also the same; if both coincide, the active sentence segment and the second sentence segment are judged to be the same segment. If the left boundaries differ and/or the right boundaries differ, they are judged not to be the same segment; the two are thus reliably distinguished, and the active sentence segment and the second sentence segment are then merged.
In some embodiments, after determining whether a second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, the method further includes: if no such second boundary axis control exists at the first position, adjusting the boundary of the active sentence segment according to the first position.
For example, dragging the sentence segments displayed on the operation interface enables rapid merging of any two adjacent segments. On this basis, if several sentence segments need to be merged at once, they are merged pairwise in segment order, so any number of segments can be combined. For example, dragging the boundary tabs of two adjacent segments until they touch merges them into one new segment; dragging the boundary tab of one segment across and onto the boundary tabs of further segments likewise merges multiple segments.
Referring to fig. 3 and 4, fig. 3 illustrates adjusting a boundary within the sentence segment data, and fig. 4 illustrates merging sentence segments within the sentence segment data.
As shown in fig. 3, in response to a first editing operation on the active end of the first boundary axis control 3031 of the active sentence segment 3011 in the sentence segment data, the active end of the first boundary axis control 3031 is moved from position B to a first position, which in fig. 3 is position C. Since no second boundary axis control overlapping the active end of the first boundary axis control 3031 exists at the first position (position C), the boundary of the active sentence segment 3011 is adjusted according to the first position, i.e., from segment AB to segment AC.
As shown in fig. 4, in response to a first editing operation on the active end of the first boundary axis control 4031 of the active sentence segment 4011 in the sentence segment data, the active end of the first boundary axis control 4031 is moved to a first position, which in fig. 4 is position F. Since the second boundary axis control 4032 overlapping the active end of the first boundary axis control 4031 exists at the first position (position F), the active sentence segment 4011 and the second sentence segment 4012 are merged into merged sentence segment 4013, whose boundary axis control 4033 spans segment DC.
In some embodiments, performing boundary adjustment processing or sentence segment merging processing on the sentence segment data in response to the editing operation on the boundary axis control includes: in response to a second editing operation on the active end of the first boundary axis control of an active sentence segment in the sentence segment data, controlling the active end to move to a second position; determining whether a third boundary axis control overlapping the active end exists at the second position, the third boundary axis control corresponding to a third sentence segment that is not adjacent to the active sentence segment; and, if such a third boundary axis control exists at the second position, merging the active sentence segment, the third sentence segment, and the intermediate sentence segments between them.
In some embodiments, after determining whether a third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, the method further includes: if no such third boundary axis control exists at the second position, determining whether the target region between the static-end position of the first boundary axis control and the second position overlaps any intermediate sentence segment; if the target region overlaps no intermediate sentence segment, adjusting the boundary of the active sentence segment according to the second position; or, if the target region overlaps at least one intermediate sentence segment, merging the active sentence segment with all intermediate sentence segments that overlap the target region.
For example, several sentence segments can be merged quickly by dragging the boundary axis control of a sentence segment displayed on the operation interface. Specifically, a single drag operation can merge multiple segments: after the first boundary axis control of the active sentence segment is dragged, the active segment and every intermediate segment overlapping the target region are merged simultaneously, the target region being the region between the static-end position of the first boundary axis control and the second position. In other words, the dragged boundary position must land within the range of other sentence segments, and all segments within that range are merged.
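A sketch of this multi-segment merge; as before, segments are assumed to be (left, right) pairs sorted along the timeline, and every segment overlapping the target region is folded into one merged segment:

```python
def merge_overlapping(segments: list, active_idx: int,
                      static_end: float, dragged_to: float) -> list:
    lo, hi = sorted((static_end, dragged_to))     # the target region of the drag
    merged_lo, merged_hi, keep = lo, hi, []
    for i, (l, r) in enumerate(segments):
        if i == active_idx:
            continue                              # the active segment is absorbed below
        if r > lo and l < hi:                     # segment overlaps the target region
            merged_lo, merged_hi = min(merged_lo, l), max(merged_hi, r)
        else:
            keep.append((l, r))
    keep.append((merged_lo, merged_hi))           # one merged sentence segment
    return sorted(keep)
```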
In some embodiments, the method further includes: in response to a breakpoint insertion operation for a target sentence segment in the sentence segment data, inserting a breakpoint into the boundary axis control of the target sentence segment, so that the target sentence segment is segmented at the breakpoint.
For example, a target sentence segment can be split by inserting breakpoints, which increases the flexibility of sentence segment adjustment, as sketched below.
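A minimal sketch of this operation under the same assumed Segment type: inserting a breakpoint at time t cuts one sentence segment into two adjacent segments meeting at t.

```cpp
#include <cstddef>
#include <vector>

struct Segment { double start, end; };

// Insert a breakpoint at time t inside segs[i]; the segment is split into
// two adjacent sentence segments that meet at t.
void insertBreakpoint(std::vector<Segment>& segs, std::size_t i, double t) {
    if (t <= segs[i].start || t >= segs[i].end) return;  // t must fall inside
    Segment right{t, segs[i].end};
    segs[i].end = t;
    segs.insert(segs.begin() + long(i) + 1, right);
}
```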
Step 160, performing speech recognition processing on the processed sentence segment data to obtain a transcribed text.
For example, automatic transcription may be implemented by calling a speech recognition module configured in the terminal or a speech recognition module of a third party, so as to perform speech recognition processing on the processed sentence fragment data to obtain a transcribed text.
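Since the embodiment leaves the recognition backend open (built into the terminal or provided by a third party), the sketch below hides it behind a hypothetical SpeechRecognizer class; the recognize method is a stub standing in for the real module.

```cpp
#include <string>
#include <vector>

struct Segment { double start, end; std::string text; };

// Hypothetical wrapper around the terminal's built-in or a third-party
// speech recognition module.
class SpeechRecognizer {
public:
    std::string recognize(const std::string& audioPath,
                          double start, double end) {
        // Stub: a real implementation would hand the [start, end) slice of
        // audioPath to the ASR engine and return its transcript.
        (void)audioPath;
        return "[transcript " + std::to_string(start) + "s-" +
               std::to_string(end) + "s]";
    }
};

// Step 160 in sketch form: transcribe each processed sentence segment; the
// per-segment results together form the transcribed text.
void transcribeAll(std::vector<Segment>& segs, SpeechRecognizer& asr,
                   const std::string& audioPath) {
    for (auto& s : segs)
        s.text = asr.recognize(audioPath, s.start, s.end);
}
```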
In some embodiments, the transcription text includes a text segment corresponding to each sentence segment in the sentence segment data, and after performing speech recognition processing on the processed sentence segment data to obtain the transcription text, the method further includes: in response to a modification instruction for a target text segment in the transcription text, modifying the target text segment to obtain a modified transcription text, where the target text segment is at least one text segment in the transcription text.
For example, after the initial transcription text is generated automatically, the user can input a modification instruction for a target text segment in the transcription text through the operation interface to update the transcription text manually. The modification instructions may include instructions to modify words, delete words, add words, and to change the font, font size, or font color.
In some embodiments, the method further includes: in response to a labeling instruction for the target text segment, labeling the target text segment to obtain a labeled transcription text.
For example, a labeling instruction for the target text segment can be input through the operation interface, and the target text segment is labeled to obtain a labeled transcription text. The target text segment may be given any of the following types of labels: industry field labeling, content category labeling, part-of-speech labeling, dependency labeling, entity labeling, relation labeling, event labeling, reading comprehension labeling, and question-answer labeling.
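For illustration only, the labeling categories listed above can be modeled as an enumeration attached to text segments; the type and function names below are assumptions, not part of the embodiment.

```cpp
#include <string>
#include <vector>

// Annotation categories named in the text; the enum itself is illustrative.
enum class LabelType {
    IndustryField, ContentCategory, PartOfSpeech, Dependency,
    Entity, Relation, Event, ReadingComprehension, QuestionAnswer
};

struct Annotation { LabelType type; std::string value; };

struct TextSegment {
    std::string text;
    std::vector<Annotation> labels;  // one segment may carry several labels
};

// Apply a labeling instruction to a target text segment.
void label(TextSegment& seg, LabelType type, const std::string& value) {
    seg.labels.push_back({type, value});
}
```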
Step 170, updating the project engineering file according to the transcription text to obtain an updated project engineering file, wherein the updated project engineering file carries the transcription text.
For example, the transcription text is saved, together with the path of the media file, in a project engineering file of fixed format (.baf), thereby updating the project engineering file; the updated project engineering file carries the transcription text.
For example, when the project engineering file is updated, the waveform of the audio data can be initialized, a sentence segment waveform information array constructed, and the display interface of the sentence segment waveform information refreshed; the media file information and sentence segment data are stored into the project engineering file; a media-file-change message is issued; the player replaces the media file; the software updates the header information; and the controller updates the interface and the related control information.
During initialization, the in-memory data used for display can be initialized from the audio result parsed by the audio/video data parsing thread and the segmentation information obtained by the segmentation processing, after which default values are set for the remaining parameters. A sketch of the file-update step follows.
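The internal layout of the .baf format is not disclosed, so the sketch below invents a simple line-based serialization purely to illustrate step 170: the media file path is written first, followed by one line per sentence segment with its time range and transcript.

```cpp
#include <fstream>
#include <string>
#include <vector>

struct Segment { double start, end; std::string text; };

// Write the media path and the transcribed sentence segments, so that the
// saved project engineering file carries the transcription text.
bool saveProject(const std::string& bafPath, const std::string& mediaPath,
                 const std::vector<Segment>& segs) {
    std::ofstream out(bafPath);
    if (!out) return false;
    out << "media=" << mediaPath << '\n';
    for (const auto& s : segs)
        out << s.start << '\t' << s.end << '\t' << s.text << '\n';
    return bool(out);
}
```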
Step 180, when the updated project engineering file is played on the display interface, displaying the media file and the text segment, in the transcription text, corresponding to the playing progress of the media file.
For example, when the updated project engineering file is played on the display interface, the media file is displayed together with the text segment of the transcription text that corresponds to the current playing progress, and the playing progress can be controlled through a playing control on the display interface.
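Keeping the displayed text in step with playback amounts to locating the sentence segment that covers the current play position. A binary-search sketch over the (assumed) time-sorted segment list:

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Segment { double start, end; std::string text; };

// Returns the index of the sentence segment covering play position t,
// or -1 when t falls in a gap between segments.
int segmentAt(const std::vector<Segment>& segs, double t) {
    // First segment whose start lies strictly after t ...
    auto it = std::upper_bound(segs.begin(), segs.end(), t,
        [](double v, const Segment& s) { return v < s.start; });
    if (it == segs.begin()) return -1;  // t precedes the first segment
    --it;                               // ... so this one starts at or before t
    return (t < it->end) ? int(it - segs.begin()) : -1;
}
```

On every playback tick, the interface can call segmentAt with the current position and highlight the corresponding text segment.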
For example, the embodiment of the present application further provides a multi-format import/export function that supports importing files in Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, and json formats, and supports exporting files of the above types as well as the eaf format, so that transcription files can be migrated conveniently and multi-format file import and export is realized. For this function, interface functions for reading and writing files can be provided for each file type and file read/write mode, so that files of different types can be read in on import and written out on export. For example, files in Excel, srt, and other formats can be imported together with the corresponding media files, data files can be converted into the .baf format, and several file formats can be exported in a single operation.
For example, the correspondence between the file type of the import format and the import interface can be as shown in table 1:
TABLE 1
File type Import interface
Xls, Xlsx DoImportFile_Excel
Lrc DoImportFile_Lrc
Srt DoImportFile_Srt
Docx DoImportFile_Docx
Json DoImportFile_Json
Aud DoImportFile_Aud
Txt DoImportFile_Txt
For example, the correspondence between the file type in export format and the export interface may be as shown in table 2:
TABLE 2
File type Export interface
Xls, Xlsx ExportFile_Excel
Lrc DoExportFile_LRC
Srt DoExportFile_SRT
Aud DoExportFile_Audacity
STL DoExportFile_STL
Docx, Txt DoExportFile_Txt
EAF IBAF::SaveTo
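For illustration, import dispatch by file extension could be wired to the interface names of Table 1 roughly as follows; the handler signatures and stub bodies are assumptions, since the text only names the interfaces. Export dispatch over the interfaces of Table 2 would follow the same pattern.

```cpp
#include <cctype>
#include <functional>
#include <iostream>
#include <map>
#include <string>

// Stub handlers standing in for the import interfaces named in Table 1.
bool DoImportFile_Excel(const std::string& p) { std::cout << "Excel: " << p << '\n'; return true; }
bool DoImportFile_Lrc  (const std::string& p) { std::cout << "Lrc: "   << p << '\n'; return true; }
bool DoImportFile_Srt  (const std::string& p) { std::cout << "Srt: "   << p << '\n'; return true; }
bool DoImportFile_Docx (const std::string& p) { std::cout << "Docx: "  << p << '\n'; return true; }
bool DoImportFile_Json (const std::string& p) { std::cout << "Json: "  << p << '\n'; return true; }
bool DoImportFile_Aud  (const std::string& p) { std::cout << "Aud: "   << p << '\n'; return true; }
bool DoImportFile_Txt  (const std::string& p) { std::cout << "Txt: "   << p << '\n'; return true; }

// Route an imported file to the handler for its extension (cf. Table 1).
bool importFile(const std::string& path) {
    static const std::map<std::string, std::function<bool(const std::string&)>>
        handlers = {
            {"xls",  DoImportFile_Excel}, {"xlsx", DoImportFile_Excel},
            {"lrc",  DoImportFile_Lrc},   {"srt",  DoImportFile_Srt},
            {"docx", DoImportFile_Docx},  {"json", DoImportFile_Json},
            {"aud",  DoImportFile_Aud},   {"txt",  DoImportFile_Txt},
        };
    auto dot = path.rfind('.');
    if (dot == std::string::npos) return false;
    std::string ext = path.substr(dot + 1);
    for (char& c : ext) c = char(std::tolower(static_cast<unsigned char>(c)));
    auto it = handlers.find(ext);
    return it != handlers.end() && it->second(path);
}

int main() { importFile("interview.xlsx"); }
```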
In some embodiments, the method further includes: in response to an export instruction carrying a target file type, exporting an export file corresponding to the target file type from the project engineering file, where the target file type belongs to any one of the preset file types.
For example, fig. 5 is a schematic diagram of an application scenario of file export; 5-1 in fig. 5 shows the file export interface, on which the target file type and other options can be set, for example, setting the target file type to Excel and the export language to Mandarin. After the export instruction is executed, the file is exported according to these settings; the content of the exported Excel file is shown as 5-2 in fig. 5.
For example, fig. 6 is a schematic diagram of another application scenario of file export; 6-1 in fig. 6 shows the file export interface, on which the target file types and other options can be set, for example, setting the target file types to Excel, Word, and EAF at the same time and the export language to dialect. After the export instruction is executed, the files are exported according to these settings; when the target file type is set to several formats at once, several file formats are exported in a single operation. The content of the exported Excel file is shown as 6-2 in fig. 6.
For example, the preset file types may include Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, and json files; export in the eaf format is also supported. This makes it easy to migrate transcription files and realizes multi-format file export.
In some embodiments, the method further comprises: responding to an import instruction, and acquiring an import file; and when the file type of the imported file belongs to any one of the preset file types, importing the imported file into the project engineering file.
For example, as shown in the schematic diagram of the file import interface in fig. 7, an import file, or an import file together with a media file, can be selected on the file import interface; when the file type of the import file belongs to any one of the preset file types, the import file is imported into the project engineering file.
For example, the preset file types may include Word (docx, txt, aud.txt), Excel (xls, xlsx), lrc, srt, and json files. Import of the above file types is supported, which makes it easy to migrate transcription files and realizes multi-format file import.
All the above technical solutions can be combined arbitrarily to form the optional embodiments of the present application, and are not described herein again.
According to the embodiment of the application, a project engineering file corresponding to the media file to be processed is obtained; audio data of the media file is obtained according to the directory of the project engineering file; the audio data is segmented according to its amplitude to obtain sentence segment data of the audio data; the sentence segment data is displayed on an operation interface that provides a presentation interface and boundary axis controls; in response to an editing operation on a boundary axis control, boundary adjustment processing or sentence segment merging processing is performed on the sentence segment data to obtain processed sentence segment data; speech recognition processing is performed on the processed sentence segment data to obtain a transcribed text; the project engineering file is updated according to the transcribed text so that the updated file carries the transcription; and when the updated project engineering file is played on the display interface, the media file and the text segment corresponding to its playing progress are displayed. The embodiment of the application thus provides a simple and convenient transcription workflow: transcription of multiple languages can be realized through self-built language templates, sentence segments can be merged quickly by dragging the boundary axis controls displayed on the operation interface, and boundaries can be fine-tuned by dragging the controls horizontally along the sentence segment waveforms, which improves transcription and labeling efficiency and suits a wide range of scenarios.
In order to better implement the multi-modal rapid transcription and labeling method based on the self-built template in the embodiment of the application, the embodiment of the application further provides a multi-modal rapid transcription and labeling system based on the self-built template. Referring to fig. 8, fig. 8 is a schematic structural diagram of the multi-modal rapid transcription and labeling system based on a self-built template according to an embodiment of the present application. The multi-modal rapid transcription and labeling system 800 based on a self-built template is applied to a terminal device providing a graphical user interface, and the system 800 may include:
a first obtaining unit 801, configured to obtain a project engineering file corresponding to a media file to be processed;
a second obtaining unit 802, configured to obtain audio data of the media file according to the directory of the project engineering file;
a segmenting unit 803, configured to perform segmentation processing on the audio data according to the amplitude of the audio data to obtain sentence segment data of the audio data;
a display unit 804, configured to display sentence segment data of the audio data on an operation interface, where the operation interface is used to provide a presentation interface and a boundary axis control;
a processing unit 805, configured to perform boundary adjustment processing or sentence segment merging processing on sentence segment data in response to an editing operation for a boundary axis control, to obtain processed sentence segment data;
a transcription unit 806, configured to perform speech recognition processing on the processed sentence segment data to obtain a transcription text;
an updating unit 807, configured to update the project file according to the transcription text to obtain an updated project file, where the updated project file carries the transcription text;
the playing unit 808 is configured to display a text segment corresponding to the playing progress of the media file in the media file and the transcription text when the updated project engineering file is played on the display interface.
In some embodiments, the processing unit 805 may be configured to: in response to a first editing operation on the active end of the first boundary axis control of an active sentence segment in the sentence segment data, control the active end of the first boundary axis control to move to a first position; judge whether a second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, where the second boundary axis control is the boundary axis control corresponding to a second sentence segment, and the active sentence segment and the second sentence segment are adjacent sentence segments; and if such a second boundary axis control exists at the first position, merge the active sentence segment and the second sentence segment.
In some embodiments, after judging whether a second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, the processing unit 805 may be further configured to: if no such second boundary axis control exists at the first position, adjust the boundary of the active sentence segment according to the first position.
In some embodiments, the processing unit 805 may be configured to: in response to a second editing operation on the active end of the first boundary axis control of an active sentence segment in the sentence segment data, control the active end of the first boundary axis control to move to a second position; judge whether a third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, where the third boundary axis control is the boundary axis control corresponding to a third sentence segment, and the active sentence segment and the third sentence segment are non-adjacent sentence segments; and if such a third boundary axis control exists at the second position, merge the active sentence segment, the third sentence segment, and the intermediate sentence segments between them.
In some embodiments, after judging whether a third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, the processing unit 805 may be further configured to: if no such third boundary axis control exists at the second position, judge whether the target area between the stationary-end position of the first boundary axis control and the second position overlaps any intermediate sentence segment; if the target area does not overlap any intermediate sentence segment, adjust the boundary of the active sentence segment according to the second position; or, if the target area overlaps at least one intermediate sentence segment, merge the active sentence segment and all intermediate sentence segments overlapping the target area.
In some embodiments, the segmentation unit 803 may be configured to segment the audio data according to the relationship between a noise amplitude threshold and the amplitude of the audio data, so as to obtain the sentence segment data of the audio data.
In some embodiments, when segmenting the audio data according to the relationship between the noise amplitude threshold and the amplitude of the audio data to obtain the sentence segment data of the audio data, the segmentation unit 803 may be configured to: obtain initial segmentation data of the audio data; judge whether the average amplitude within the current segment of the initial segmentation data is greater than the noise amplitude threshold; if so, mark the current segment as a voiced segment; trim the sentence-segment start and end points of the audio points in the current segment marked as a voiced segment, so as to remove silence or noise in the current segment; if the start position of the trimmed current segment is the same as the end position of the previous segment, merge the two; if it differs, mark the trimmed current segment as a new segment; and traverse the initial segmentation data of the audio data to obtain the sentence segment data of the audio data. A sketch follows.
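A sketch of this segmentation rule, assuming (for illustration) that the audio is available as a vector of absolute sample amplitudes and that the initial segmentation is a list of half-open sample ranges:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

struct Range { std::size_t begin, end; };  // half-open range of sample indices

std::vector<Range> segmentByAmplitude(const std::vector<double>& amp,
                                      const std::vector<Range>& initial,
                                      double noiseThreshold) {
    std::vector<Range> out;
    for (Range r : initial) {
        if (r.end <= r.begin) continue;
        const double mean =
            std::accumulate(amp.begin() + static_cast<std::ptrdiff_t>(r.begin),
                            amp.begin() + static_cast<std::ptrdiff_t>(r.end),
                            0.0) / double(r.end - r.begin);
        if (mean <= noiseThreshold) continue;  // not a voiced segment
        // Trim silence/noise from both ends of the voiced segment.
        while (r.begin < r.end && amp[r.begin] <= noiseThreshold) ++r.begin;
        while (r.end > r.begin && amp[r.end - 1] <= noiseThreshold) --r.end;
        if (r.begin == r.end) continue;
        // Merge with the previous segment when their boundaries coincide,
        // otherwise record a new sentence segment.
        if (!out.empty() && out.back().end == r.begin)
            out.back().end = r.end;
        else
            out.push_back(r);
    }
    return out;
}
```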
In some embodiments, when obtaining the initial segmentation data of the audio data, the segmentation unit 803 may be configured to: perform initial segmentation processing on the audio data according to a preset language template to obtain the initial segmentation data of the audio data.
In some embodiments, the first obtaining unit 801 may be configured to: obtain a media file to be processed; detect whether a corresponding project engineering file has been created for the media file; if not, create the project engineering file corresponding to the media file based on a template file; or, if the corresponding project engineering file has been created, obtain the created project engineering file corresponding to the media file.
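For illustration only, the detect-or-create logic might look as follows, under the assumption that the project engineering file sits next to the media file with a .baf extension (the text does not fix the naming rule):

```cpp
#include <filesystem>

namespace fs = std::filesystem;

// Return the project engineering file for mediaPath, creating it from a
// template file when it does not exist yet.
fs::path obtainProjectFile(const fs::path& mediaPath,
                           const fs::path& templateFile) {
    fs::path project = mediaPath;
    project.replace_extension(".baf");
    if (!fs::exists(project))                  // no project file yet:
        fs::copy_file(templateFile, project);  // create one from the template
    return project;                            // otherwise reuse the existing one
}
```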
In some embodiments, the processing unit 805 may be further configured to, in response to an export instruction carrying a target file type, export an export file corresponding to the target file type from the project engineering file, where the target file type belongs to any one of preset file types.
In some embodiments, the processing unit 805 may be further configured to: responding to an import instruction, and acquiring an import file;
when the file type of the imported file belongs to any one of the preset file types, importing the imported file into the project engineering file.
In some embodiments, the display unit 804 may be configured to display, on the operation interface, the sentence segment waveform information of the sentence segment data of the audio data and the time axis information corresponding to the sentence segment waveform information.
In some embodiments, the display unit 804 may be further configured to hide the sentence segment waveform information and the time axis information on the operation interface in response to a hide-waveform instruction.
In some embodiments, the processing unit 805 may be further configured to insert a break point in the boundary axis control of the target sentence segment in response to an insert break point operation for the target sentence segment in the sentence segment data, so as to perform segmentation processing on the target sentence segment based on the break point.
In some embodiments, after performing speech recognition processing on the processed sentence segment data to obtain the transcription text, the transcription unit 806 may be further configured to: in response to a modification instruction for a target text segment in the transcription text, modify the target text segment to obtain a modified transcription text, where the target text segment is at least one text segment in the transcription text.
In some embodiments, the transcription unit 806 may be further configured to label the target text segment in response to a labeling instruction for the target text segment, so as to obtain a labeled transcription text.
All the above technical solutions may be combined arbitrarily to form an optional embodiment of the present application, and are not described in detail herein.
It is to be understood that system embodiments and method embodiments may correspond to one another and similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the system shown in fig. 8 may execute the above embodiment of the multi-modal fast transcription and labeling method based on the self-built template, and the foregoing and other operations and/or functions of each unit in the system implement the corresponding processes of the above embodiment of the method respectively, which are not described herein again for brevity.
Correspondingly, the embodiment of the application further provides a terminal device, the terminal device can be a terminal or a server, and the terminal can be a smart phone, a tablet computer, a notebook computer, a smart television, a smart sound box, a wearable smart device, a personal computer and the like. As shown in fig. 9, fig. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application. The terminal device 900 includes a processor 901 having one or more processing cores, a memory 902 having one or more computer-readable storage media, and a computer program stored on the memory 902 and executable on the processor. The processor 901 is electrically connected to the memory 902. It will be appreciated by those skilled in the art that the terminal device configurations shown in the figures are not intended to be limiting of terminal devices and may include more or fewer components than shown, or some of the components may be combined, or a different arrangement of components.
The processor 901 is a control center of the terminal apparatus 900, connects various parts of the entire terminal apparatus 900 by various interfaces and lines, executes various functions of the terminal apparatus 900 and processes data by running or loading software programs and/or modules stored in the memory 902 and calling data stored in the memory 902, thereby monitoring the terminal apparatus 900 as a whole.
In the embodiment of the present application, the processor 901 in the terminal device 900 loads instructions corresponding to the processes of one or more application programs into the memory 902, and runs the application programs stored in the memory 902, thereby implementing the following functions:
acquiring a project engineering file corresponding to a media file to be processed; acquiring audio data of the media file according to the catalog of the project engineering file; carrying out segmentation processing on the audio data according to the amplitude of the audio data to obtain sentence fragment data of the audio data; displaying sentence fragment data of the audio data on an operation interface, wherein the operation interface is used for providing a display interface and a boundary axis control; responding to the editing operation aiming at the boundary axis control, and performing boundary adjustment processing or sentence segment combination processing on the sentence segment data to obtain processed sentence segment data; carrying out voice recognition processing on the processed sentence segment data to obtain a transcribed text; updating the project file according to the transcription text to obtain an updated project file, wherein the updated project file carries the transcription text; and when the updated project file is played on the display interface, displaying the media file and a text segment corresponding to the playing progress of the media file in the transcription text.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
In some embodiments, as shown in fig. 9, the terminal device 900 further comprises: a display unit 903, a radio frequency circuit 904, an audio circuit 905, an input unit 906, and a power supply 907. The processor 901 is electrically connected to the display unit 903, the radio frequency circuit 904, the audio circuit 905, the input unit 906, and the power supply 907. Those skilled in the art will appreciate that the terminal device configuration shown in fig. 9 does not constitute a limitation of the terminal device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The display unit 903 may be used to display information input by or provided to a user and various graphical user interfaces of the terminal device, which may be made up of graphics, text, icons, video, and any combination thereof. The display unit 903 may include a display panel and a touch panel.
The radio frequency circuit 904 may be configured to transmit and receive radio frequency signals to establish wireless communication with a network device or other terminal devices via wireless communication, and transmit and receive signals with the network device or other terminal devices.
The audio circuitry 905 may be used to provide an audio interface between the user and the terminal device through a speaker, microphone.
The input unit 906 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
Power supply 907 is used to power the various components of terminal device 900. In some embodiments, power supply 907 may be logically coupled to processor 901 through a power management system, such that functions of managing charging, discharging, and power consumption are performed through the power management system. Power supply 907 may also include any component such as one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
Although not shown in fig. 9, the terminal device 900 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described in detail herein.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the embodiment of the present application provides a computer-readable storage medium, in which a plurality of computer programs are stored, where the computer programs can be loaded by a processor to execute the steps in any self-built template-based multi-modal rapid transcription and annotation method provided by the embodiment of the present application. The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the computer program stored in the storage medium can execute the steps in any self-built template-based multi-modal fast transcription and labeling method provided by the embodiment of the present application, the beneficial effects that can be achieved by any self-built template-based multi-modal fast transcription and labeling method provided by the embodiment of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
Embodiments of the present application also provide a computer program product including computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and executes the computer instruction, so that the computer device executes a corresponding process in any one of the self-built template-based multi-modal fast transcription and labeling methods in the embodiments of the present application, which is not described herein again for brevity.
Embodiments of the present application further provide a computer program, where the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and executes the computer instruction, so that the computer device executes a corresponding process in any one of the self-built template-based multi-modal fast transcription and labeling methods in the embodiments of the present application, which is not described herein again for brevity.
The multi-modal rapid transcription and labeling method based on the self-built template, the multi-modal rapid transcription and labeling system based on the self-built template, and the storage medium provided by the embodiments of the application are described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the embodiments is only intended to help understand the method and core idea of the application. Meanwhile, those skilled in the art may make changes to the specific implementation and application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (18)

1. A multi-mode rapid transcription and labeling method based on a self-built template is characterized by comprising the following steps:
acquiring project engineering files corresponding to media files to be processed;
acquiring audio data of the media file according to the catalog of the project engineering file;
carrying out segmentation processing on the audio data according to the amplitude of the audio data to obtain sentence fragment data of the audio data;
displaying sentence fragment data of the audio data on an operation interface, wherein the operation interface is used for providing a display interface and a boundary axis control;
responding to the editing operation aiming at the boundary axis control, and performing boundary adjustment processing or sentence segment combination processing on the sentence segment data to obtain processed sentence segment data;
carrying out voice recognition processing on the processed sentence segment data to obtain a transcribed text;
updating the project file according to the transcription text to obtain an updated project file, wherein the updated project file carries the transcription text;
and when the updated project engineering file is played on the display interface, displaying the media file and a text segment corresponding to the playing progress of the media file in the transfer text.
2. The method for multi-modal fast transcription and labeling based on self-built templates as claimed in claim 1, wherein the performing a boundary adjustment process or a sentence segment merging process on the sentence segment data in response to the editing operation on the boundary axis control to obtain processed sentence segment data comprises:
in response to a first editing operation on the active end of a first boundary axis control of an active sentence segment in the sentence segment data, controlling the active end of the first boundary axis control to move to a first position;
judging whether a second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, wherein the second boundary axis control is the boundary axis control corresponding to a second sentence segment, and the active sentence segment and the second sentence segment are adjacent sentence segments;
and if a second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, merging the active sentence segment and the second sentence segment.
3. The method according to claim 2, wherein after determining whether there is a second boundary axis control at the first position overlapping with the active end of the first boundary axis control, the method further comprises:
if no second boundary axis control overlapping the active end of the first boundary axis control exists at the first position, adjusting the boundary of the active sentence segment according to the first position.
4. The method for multi-modal fast transcription and labeling based on self-built templates as claimed in claim 1, wherein the performing a boundary adjustment process or a sentence segment merging process on the sentence segment data in response to the editing operation on the boundary axis control to obtain processed sentence segment data comprises:
in response to a second editing operation on the active end of the first boundary axis control of an active sentence segment in the sentence segment data, controlling the active end of the first boundary axis control to move to a second position;
judging whether a third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, wherein the third boundary axis control is the boundary axis control corresponding to a third sentence segment, and the active sentence segment and the third sentence segment are non-adjacent sentence segments;
and if a third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, merging the active sentence segment, the third sentence segment, and the intermediate sentence segments between the active sentence segment and the third sentence segment.
5. The method as claimed in claim 4, wherein after said determining whether there is a third boundary axis control at the second position overlapping with the active end of the first boundary axis control, further comprising:
if no third boundary axis control overlapping the active end of the first boundary axis control exists at the second position, judging whether the target area between the stationary-end position of the first boundary axis control and the second position overlaps any intermediate sentence segment;
if the target area between the stationary-end position of the first boundary axis control and the second position does not overlap any intermediate sentence segment, adjusting the boundary of the active sentence segment according to the second position; or
if the target area between the stationary-end position of the first boundary axis control and the second position overlaps at least one intermediate sentence segment, merging the active sentence segment and all intermediate sentence segments overlapping the target area.
6. The method for multi-modal fast transcription and labeling based on self-built templates as claimed in claim 1, wherein the step of segmenting the audio data according to the amplitude of the audio data to obtain sentence fragment data of the audio data comprises:
and carrying out segmentation processing on the audio data according to the magnitude relation between the noise amplitude threshold and the amplitude of the audio data to obtain sentence fragment data of the audio data.
7. The method as claimed in claim 6, wherein the step of segmenting the audio data according to the magnitude relationship between the noise amplitude threshold and the amplitude of the audio data to obtain the sentence segment data of the audio data comprises:
acquiring initial segmentation data of the audio data;
determining whether the average amplitude within the current segment in the initial segment data is greater than the noise amplitude threshold;
if the average amplitude in the current segment in the initial segment data is larger than the noise amplitude threshold value, marking the current segment as a sound segment;
trimming the sentence-segment start point and end point of the audio points in the current segment marked as a voiced segment, so as to remove silence or noise in the current segment;
if the start position of the trimmed current segment is the same as the end position of the previous segment, merging the trimmed current segment and the previous segment;
if the start position of the trimmed current segment is different from the end position of the previous segment, marking the trimmed current segment as a new segment;
and traversing the initial segmentation data of the audio data to obtain sentence fragment data of the audio data.
8. The self-built template-based multi-modal fast transcription and labeling method according to claim 7, wherein the obtaining of initial segment data of the audio data comprises:
and carrying out initial segmentation processing on the audio data according to a preset language template to obtain initial segmentation data of the audio data.
9. The multi-modal rapid transcription and labeling method based on the self-built template as claimed in claim 1, wherein said obtaining project engineering files corresponding to the media files to be processed comprises:
acquiring a media file to be processed;
detecting whether a corresponding project engineering file has been created for the media file;
if it is detected that no corresponding project engineering file has been created for the media file, creating the project engineering file corresponding to the media file based on a template file; or
if it is detected that the corresponding project engineering file of the media file has been created, obtaining the created project engineering file corresponding to the media file.
10. The method for multi-modal fast transcription and labeling based on self-built templates as claimed in claim 1, wherein the method further comprises:
and responding to an export instruction carrying a target file type, and exporting an export file corresponding to the target file type from the project engineering file, wherein the target file type belongs to any one of preset file types.
11. The method for multi-modal fast transcription and labeling based on self-built templates as claimed in claim 10, wherein the method further comprises:
responding to an import instruction, and acquiring an import file;
and when the file type of the imported file belongs to any one of the preset file types, importing the imported file into the project engineering file.
12. The method for multi-modal rapid transcription and labeling based on self-built templates as claimed in claim 1, wherein said displaying sentence segment data of said audio data on an operation interface comprises:
and displaying the sentence segment waveform information of the sentence segment data of the audio data and the time axis information corresponding to the sentence segment waveform information on an operation interface.
13. The self-built template-based multi-modal rapid transcription and labeling method of claim 12, wherein the method further comprises:
in response to a hide-waveform instruction, hiding the sentence segment waveform information and the time axis information on the operation interface.
14. The self-built template-based multi-modal rapid transcription and labeling method of claim 1, wherein the method further comprises:
in response to a breakpoint insertion operation for a target sentence segment in the sentence segment data, inserting a breakpoint into the boundary axis control of the target sentence segment, so as to segment the target sentence segment based on the breakpoint.
15. The method as claimed in claim 1, wherein the transcribed text includes text segments corresponding to each sentence segment in the sentence segment data, and after the speech recognition processing is performed on the processed sentence segment data to obtain the transcribed text, the method further comprises:
and responding to a modification instruction aiming at a target text segment in the transcription text, and modifying the target text segment to obtain a modified transcription text, wherein the target text segment is at least one text segment in the transcription text.
16. The method for multi-modal rapid transcription and labeling based on self-built templates as claimed in claim 15, wherein the method further comprises:
and responding to a labeling instruction aiming at the target text segment, labeling the target text segment to obtain a labeled transcription text.
17. A multi-modal rapid transcription and labeling system based on self-built templates is characterized by comprising:
the first acquisition unit is used for acquiring project engineering files corresponding to the media files to be processed;
the second acquisition unit is used for acquiring the audio data of the media file according to the catalog of the project engineering file;
the segmenting unit is used for segmenting the audio data according to the amplitude of the audio data to obtain sentence segment data of the audio data;
the display unit is used for displaying the sentence segment data of the audio data on an operation interface, and the operation interface is used for providing a display interface and a boundary axis control;
the processing unit is used for responding to the editing operation aiming at the boundary axis control, and carrying out boundary adjustment processing or sentence section combination processing on the sentence section data to obtain processed sentence section data;
the transcription unit is used for carrying out voice recognition processing on the processed sentence segment data to obtain a transcription text;
the updating unit is used for updating the project file according to the transcription text to obtain an updated project file, and the updated project file carries the transcription text;
and the playing unit is used for displaying the media file and the text segment corresponding to the playing progress of the media file in the transcription text when the updated project engineering file is played on the display interface.
18. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program adapted to be loaded by a processor for performing the steps of the method for multimodal rapid transcription and annotation based on self-created templates according to any one of claims 1-16.
CN202280002307.8A 2022-05-06 2022-05-06 Multi-mode rapid transfer and labeling system based on self-built template Active CN115136233B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/091181 WO2023212920A1 (en) 2022-05-06 2022-05-06 Multi-modal rapid transliteration and annotation system based on self-built template

Publications (2)

Publication Number Publication Date
CN115136233A true CN115136233A (en) 2022-09-30
CN115136233B CN115136233B (en) 2023-09-22

Family

ID=83387058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280002307.8A Active CN115136233B (en) 2022-05-06 2022-05-06 Multi-mode rapid transfer and labeling system based on self-built template

Country Status (2)

Country Link
CN (1) CN115136233B (en)
WO (1) WO2023212920A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080270437A1 (en) * 2007-04-26 2008-10-30 Custom Speech Usa, Inc. Session File Divide, Scramble, or Both for Manual or Automated Processing by One or More Processing Nodes
US20100080528A1 (en) * 2008-09-22 2010-04-01 Ed Yen Online video and audio editing
US20170169840A1 (en) * 2015-12-14 2017-06-15 Adobe Systems Incorporated Hybrid audio representations for editing audio content
CN108681530A (en) * 2018-05-04 2018-10-19 北京天元创新科技有限公司 A kind of official document generation method and system based on Web
CN110740275A (en) * 2019-10-30 2020-01-31 中央电视台 nonlinear editing systems
CN112487238A (en) * 2020-10-27 2021-03-12 百果园技术(新加坡)有限公司 Audio processing method, device, terminal and medium
CN114268829A (en) * 2021-12-22 2022-04-01 中电金信软件有限公司 Video processing method and device, electronic equipment and computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657947B (en) * 2017-09-20 2020-11-24 百度在线网络技术(北京)有限公司 Speech processing method and device based on artificial intelligence
CN111753558B (en) * 2020-06-23 2022-03-04 北京字节跳动网络技术有限公司 Video translation method and device, storage medium and electronic equipment
CN114420125A (en) * 2020-10-12 2022-04-29 腾讯科技(深圳)有限公司 Audio processing method, device, electronic equipment and medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HOU Wenhui et al., "Construction of a pivotal-construction corpus for the Chinese AMR annotation scheme and recognition of pivotal structures", Journal of Tsinghua University (Science and Technology) *
SUN Xiao; FU Nanjun; YANG Lian; LI Kai; HAN Mei, "Research on the construction of a sign language corpus for computer majors", Intelligent Computer and Applications, no. 06

Also Published As

Publication number Publication date
CN115136233B (en) 2023-09-22
WO2023212920A1 (en) 2023-11-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant