CN114360545A - Voice recognition and audio/video processing method, device, system and storage medium - Google Patents


Info

Publication number
CN114360545A
Authority
CN
China
Prior art keywords
audio
video
voice
segments
video data
Prior art date
Legal status
Pending
Application number
CN202011034174.1A
Other languages
Chinese (zh)
Inventor
王凯
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202011034174.1A
Publication of CN114360545A

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Embodiments of the application provide a method, device, system, and storage medium for voice recognition and audio/video processing. In these embodiments, the audio and video data are segmented to provide a basis for performing voice detection and voice recognition in parallel. Because the segmentation of the audio and video data may fall at unreasonable points, the voice segments obtained by voice detection are normalized before voice recognition, and voice recognition is then performed on the normalized voice segments to ensure recognition quality. Throughout the process, voice detection and voice recognition are carried out in parallel, which reduces the time consumed by voice recognition, improves its efficiency, and improves the user experience.

Description

Voice recognition and audio/video processing method, device, system and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method, device, system, and storage medium for speech recognition and audio/video processing.
Background
With the development of internet technology and 4G/5G communication technology, audio and video applications are becoming more and more common, for example the currently popular Douyin (TikTok) and Bilibili applications. These applications generate a large amount of audio and video data every day, and processing this data typically relies on speech recognition technology. The existing speech recognition process sequentially performs voice detection, feature extraction, neural network scoring, decoding search, text post-processing, and so on, on the audio and video data. The processing time of this method increases markedly with the length of the audio/video; if the audio/video to be recognized is long, the processing time becomes excessive and the user experience is seriously degraded.
Disclosure of Invention
Various aspects of the present application provide a method, device, system and storage medium for speech recognition and audio/video processing, so as to reduce the time consumption of speech recognition, improve the efficiency of speech recognition, and ensure the user experience.
The embodiment of the application provides a voice recognition method, which comprises the following steps: dividing audio and video data to be identified into M audio and video segments; performing voice detection on the M audio and video clips in parallel to obtain N voice clips; regularizing the N voice segments to obtain K regularized voice segments; performing voice recognition on the K normalized voice segments in parallel to obtain K text segments; wherein M and N are integers greater than or equal to 2, K is an integer greater than or equal to 1, and K is less than or equal to N.
An embodiment of the present application further provides a voice processing chip, including: the segmentation module is used for segmenting the audio and video data to be identified into M audio and video segments; the multi-channel voice detection module is used for performing voice detection on the M audio and video clips in parallel to obtain N voice clips; the normalization module is used for normalizing the N voice segments to obtain K normalized voice segments; the multi-channel voice recognition module is used for performing voice recognition on the K normalized voice segments in parallel to obtain K text segments; wherein M and N are integers greater than or equal to 2, K is an integer greater than or equal to 1, and K is less than or equal to N.
An embodiment of the present application further provides a speech recognition device, including: a memory and a plurality of processors; a memory for storing a computer program; a plurality of processors are coupled with the memory for executing computer programs for: dividing audio and video data to be identified into M audio and video segments; performing voice detection on the M audio and video clips in parallel to obtain N voice clips; regularizing the N voice segments to obtain K regularized voice segments; performing voice recognition on the K normalized voice segments in parallel to obtain K text segments; wherein M and N are integers greater than or equal to 2, K is an integer greater than or equal to 1, and K is less than or equal to N.
An embodiment of the present application further provides an audio/video processing system, including: the system comprises an audio and video terminal, audio and video editing equipment, an audio and video server and a cloud server; the audio and video editing equipment is used for sending audio and video data to be added with subtitles to the cloud server, synthesizing at least one text segment corresponding to the audio and video data returned by the cloud server into subtitle information, and uploading the audio and video data and the subtitle information to the audio and video server; the audio and video server is used for receiving the audio and video data and the subtitle information sent by the audio and video editing equipment and providing the audio and video data and the subtitle information for the audio and video terminal to synchronously display the subtitle information in the process of playing the audio and video data; the cloud server is used for segmenting the audio and video data into a plurality of audio and video segments; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; and performing voice recognition on the normalized voice fragments in parallel to obtain at least one text fragment and returning the text fragment to the audio and video editing equipment.
An embodiment of the present application further provides an audio/video processing system, including: an audio and video terminal and an audio and video server; the audio and video terminal is used for displaying a playing interface, and a subtitle adding control is arranged on the playing interface; responding to the triggering operation of a user on the caption adding control, requesting caption information corresponding to currently played audio and video data from the audio and video server, receiving the caption information returned by the audio and video server and synchronously displaying the caption information on a playing interface; the audio and video server is used for segmenting the currently played audio and video data into a plurality of audio and video segments according to the request; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; and performing voice recognition on the normalized voice fragments in parallel to obtain at least one text fragment, synthesizing the at least one text fragment into caption information, and returning the caption information to the audio/video terminal.
An embodiment of the present application further provides an audio/video processing system, including: the system comprises an audio and video terminal, an audio and video server and a cloud server; the audio and video terminal is used for displaying a playing interface, and a subtitle adding control is arranged on the playing interface; responding to the triggering operation of a user on a caption adding control, requesting caption information corresponding to currently played audio and video data from an audio and video server, receiving the caption information sent by the audio and video server and synchronously displaying the caption information on a playing interface; the audio and video server is used for sending the audio and video data to the cloud server according to the request, synthesizing at least one text segment corresponding to the audio and video data returned by the cloud server into caption information and sending the caption information to the audio and video terminal; the cloud server is used for segmenting the audio and video data sent by the audio and video server into a plurality of audio and video segments; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; and performing voice recognition on the normalized voice fragments in parallel to obtain at least one text fragment and returning the text fragment to the audio and video server.
The embodiment of the present application further provides an audio and video playing method, including: displaying a playing interface, wherein a subtitle adding control is arranged on the playing interface; segmenting currently played audio and video data into a plurality of audio and video segments in response to the triggering operation of adding a control to the subtitle; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; and performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment, synthesizing the at least one text segment into subtitle information, and then synchronously displaying the subtitle information on a playing interface.
An embodiment of the present application further provides an audio/video processing system, including: the system comprises an audio and video terminal, an audio and video server and a cloud server; the audio and video terminal is used for uploading audio and video data to the audio and video server; the audio and video server is used for sending the audio and video data to the cloud server and receiving at least one text segment corresponding to the audio and video data returned by the cloud server; performing content discovery or security audit on at least one text fragment; the cloud server is used for segmenting the audio and video data into a plurality of audio and video segments; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; and performing voice recognition on the normalized voice fragments in parallel to obtain at least one text fragment and returning the text fragment to the audio and video server.
An embodiment of the present application further provides an audio/video processing system, including: an audio and video terminal and an audio and video server; the audio and video terminal is used for uploading audio and video data to the audio and video server; the audio and video server is used for segmenting the audio and video data into a plurality of audio and video segments; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment; and performing content discovery or security audit on the at least one text fragment.
The embodiment of the application further provides an audio and video uploading method, which comprises the following steps: responding to the uploading operation of a user, and acquiring audio and video data to be uploaded; dividing the audio and video data into a plurality of audio and video segments; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment; and uploading the audio and video data to an audio and video server based on the at least one text segment.
An embodiment of the present application further provides a terminal device, including: a memory and a processor; a memory for storing a computer program; a processor is coupled with the memory for executing a computer program for: displaying a playing interface, wherein a subtitle adding control is arranged on the playing interface; segmenting currently played audio and video data into a plurality of audio and video segments in response to the triggering operation of adding a control to the subtitle; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; and performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment, synthesizing the at least one text segment into subtitle information, and then synchronously displaying the subtitle information on a playing interface.
An embodiment of the present application further provides a terminal device, including: a memory and a processor; a memory for storing a computer program; a processor is coupled with the memory for executing a computer program for: responding to the uploading operation of a user, and acquiring audio and video data to be uploaded; dividing the audio and video data into a plurality of audio and video segments; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment; and uploading the audio and video data to an audio and video server based on the at least one text segment.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps in any one of the methods provided by the embodiments of the present application.
In the embodiment of the application, the audio and video data are segmented to provide a basis for performing voice detection and voice recognition in parallel, and in consideration of the situation that the segmentation of the audio and video data is possibly unreasonable, before the voice recognition, the voice fragments obtained by the voice detection are normalized, and then the voice recognition is performed on the normalized voice fragments to ensure the quality of the voice recognition; in the whole process, voice detection and voice recognition are carried out in parallel, time consumption of voice recognition is reduced, the efficiency of voice recognition is improved, and user experience is guaranteed.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flow chart of a speech recognition method according to an exemplary embodiment of the present application;
FIG. 2a is a schematic diagram of an exemplary speech recognition process provided by an exemplary embodiment of the present application;
fig. 2b shows the position relationship in time of the 3 audio/video clips and 6 voice clips in the embodiment shown in fig. 2a;
fig. 3 is a schematic structural diagram of a speech processing chip according to an exemplary embodiment of the present application;
fig. 4 is a schematic structural diagram of a speech recognition device according to an exemplary embodiment of the present application;
fig. 5a is a schematic structural diagram of an audio/video processing system according to an exemplary embodiment of the present application;
fig. 5b is a schematic structural diagram of another audio/video processing system provided in an exemplary embodiment of the present application;
fig. 5c is a schematic structural diagram of another audio/video processing system provided in an exemplary embodiment of the present application;
fig. 5d is a schematic flowchart of an audio/video playing method provided in an exemplary embodiment of the present application;
fig. 6a is a schematic structural diagram of another audio/video processing system provided in an exemplary embodiment of the present application;
fig. 6b is a schematic structural diagram of another audio/video processing system provided in an exemplary embodiment of the present application;
fig. 6c is a schematic flowchart of another audio/video playing method provided in an exemplary embodiment of the present application;
fig. 7 is a schematic structural diagram of a terminal device according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Aiming at the problem that the processing time consumption is long in the existing voice recognition, in the embodiment of the application, the audio and video data are segmented to provide a basis for performing voice detection and voice recognition in parallel, and in addition, the situation that the segmentation of the audio and video data is unreasonable is considered, before the voice recognition, the voice fragments obtained by the voice detection are normalized, and then the normalized voice fragments are subjected to the voice recognition, so that the quality of the voice recognition is ensured; in the whole process, voice detection and voice recognition are carried out in parallel, time consumption of voice recognition is reduced, the efficiency of voice recognition is improved, and user experience is guaranteed.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a speech recognition method according to an exemplary embodiment of the present application, where as shown in fig. 1, the method includes:
101. dividing audio and video data to be identified into M audio and video segments;
102. performing voice detection on the M audio and video clips in parallel to obtain N voice clips;
103. regularizing the N voice segments to obtain K regularized voice segments;
104. performing voice recognition on the K normalized voice segments in parallel to obtain K text segments; wherein M and N are integers greater than or equal to 2, K is an integer greater than or equal to 1, and K is less than or equal to N.
In the embodiment of the present application, the audio/video data may be, but is not limited to: video data such as a TV show, a movie, a variety program, a music video, music, an animation, or a short video played on various audio/video playing platforms or applications; data such as songs or videos recorded by various recording and playback devices; audio or audio/video chat content in various audio/video chat tools; or background sound effects or game footage with audio/video in game applications. In this embodiment, the audio/video data may include both audio data and video data, or may include only audio data. The audio/video data may contain voice as well as sounds other than voice, such as environmental noise, animal sounds, or the sound of objects colliding. Voice here means sound with social meaning produced by the human vocal organs. In many application scenarios, it is necessary to perform speech recognition on audio/video data so as to convert the voice data contained therein into corresponding text data. For example, in some dramas and movies, there is a need to add subtitles to the audio/video data, and the speech in the audio/video data can be converted into text data used as subtitles and displayed on screen synchronously. For another example, in some application scenarios, in order to ensure the safety compliance of the audio/video data, there is a need for a security audit of the content of the audio/video data; the speech in the audio/video data may be converted into text data, and the security audit performed based on that text data. For another example, in some application scenarios, in order to perform content mining, there is a need to perform content discovery on audio/video data, and the speech in the audio/video data may be converted into text data so that content discovery can be performed based on it.
In this embodiment, before performing speech recognition on audio and video data, speech detection is performed on the audio and video data to obtain speech segments included in the audio and video data, and then speech recognition is performed on the detected speech segments. In order to reduce the time consumption of voice recognition on audio and video data, voice detection is performed in a parallel mode in a voice detection stage so as to reduce the time consumption of voice detection and further improve the efficiency of voice recognition. When parallel voice detection is implemented, firstly, audio and video data to be recognized are cut into M audio and video segments; wherein M is an integer greater than or equal to 2, providing a data basis for parallel speech detection. In this embodiment, the embodiment of segmenting the audio/video data into M audio/video segments is not limited. In an optional embodiment, the audio/video data to be recognized may be divided into M audio/video segments according to a set slicing duration, where the set slicing duration may be 3 seconds(s), 5s, 10s, or 15s, and the like. In this optional embodiment, the number M of the audio/video clips may be specifically determined by the time length of the audio/video data and the slicing time length; in addition, by adopting the slicing mode, the audio and video clips with fixed length can be obtained. In a further alternative embodiment, the audio-video data to be recognized may be divided into M audio-video segments according to the set number of slices. The set number of slices is M, and may be, for example, 5, 10, 15, 20, or 30. In this optional embodiment, the time lengths of the M audio/video clips may be the same or different, and preferably, the audio/video data to be identified may be divided into M audio/video clips having the same length. As shown in fig. 2a, the example of dividing the audio/video data to be recognized into 3 audio/video segments with fixed length is shown. In addition, the audio/video data to be identified can also be randomly divided into M audio/video segments, and the lengths of the M audio/video segments can be the same or different.
In the above segmentation manners, adjacent audio and video segments may not overlap with each other, or there may be a certain overlap between adjacent audio and video segments. When adjacent audio and video segments overlap, the overlap duration can be set flexibly, for example to 1s or 0.5s, as long as it is less than the duration of an audio/video segment. For example, in one specific segmentation mode, every 15s is cut into one audio/video segment and adjacent segments overlap by 2s, that is, the first audio/video segment is from 0 to 15s, the second from 13s to 28s, the third from 26s to 41s, and so on. When the segments are cut with a certain overlap between adjacent segments, overlapping voice segments may be detected in the subsequent voice detection process, and the probability that normal characters, words, or sentences are unreasonably cut off can then be reduced through the further regularization processing. The shorter the overlap between adjacent audio and video segments, the smaller the amount of computation, but the overlap should not be too short, otherwise the problem of unreasonable segmentation is not addressed.
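As a concrete illustration of the slicing strategy described above, the following is a minimal sketch (not part of the patent text) that computes segment boundaries for a fixed slice duration with an optional overlap between adjacent segments; the function name, the 15s slice length, and the 2s overlap are illustrative assumptions that mirror the example given here.

```python
def split_into_segments(total_duration_s: float,
                        slice_duration_s: float = 15.0,
                        overlap_s: float = 2.0):
    """Return (start, end) times, in seconds, of the M audio/video segments.

    Adjacent segments overlap by `overlap_s` seconds; with overlap_s = 0 the
    segments are back-to-back. All names and defaults are illustrative.
    """
    if overlap_s >= slice_duration_s:
        raise ValueError("overlap must be shorter than the slice duration")
    step = slice_duration_s - overlap_s   # how far each new segment advances
    segments, start = [], 0.0
    while start < total_duration_s:
        end = min(start + slice_duration_s, total_duration_s)
        segments.append((start, end))
        if end >= total_duration_s:
            break
        start += step
    return segments

# A 41 s recording cut into 15 s slices with a 2 s overlap gives
# (0, 15), (13, 28), (26, 41), matching the example above.
print(split_into_segments(41.0))
```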
Further alternatively, in the various segmentation manners, some adjacent audio and video segments may not overlap, while others may overlap. For example, assume the audio/video data spans T0-T3 and includes several time periods, T0-T1, T1-T2, and T2-T3, which correspond to the head, middle, and tail audio/video data respectively. Adjacent audio/video segments cut from the data between T0-T1 and T2-T3 (i.e., the head and tail audio/video data) do not overlap with each other, while adjacent audio/video segments cut from the data between T1-T2 (i.e., the middle audio/video data) overlap with each other. The durations of the head, middle, and tail audio/video data can be set flexibly according to application requirements. This optional embodiment mainly considers that the probability of voice existing in the middle audio/video data is high and that this voice information is important, so in order to avoid problems caused by unreasonable segmentation, adjacent audio/video segments cut from the middle audio/video data are required to overlap to a certain extent. Of course, depending on the voice scene, it may instead be chosen that adjacent audio/video segments cut from the head or tail audio/video data overlap, or that any two adjacent audio/video segments cut from the head, middle, and tail audio/video data overlap.
In an alternative embodiment, the audiovisual data to be recognized may be compression-coded. For example, the audio/video data to be identified is usually provided by an author of the audio/video data, and the author may need to store the audio/video data to be identified, or need to transmit the audio/video data to the cloud via a network link. In any of the above processes, the audio and video data needs to be compressed and encoded, and the main purpose of the compression and encoding is to reduce the storage space and transmission bandwidth resources occupied by the audio and video data on the premise of ensuring the playing quality of the audio and video data. Of course, different compression coding modes may be adopted for different audio-video data to be recognized. The compression coding mode adopted by the audio and video data to be identified includes but is not limited to: moving Picture Experts Group (MPEG), Audio Video Interleaved (AVI), Advanced Streaming Format (ASF), or the like. In view of this, before the audio/video data to be recognized is segmented, the format normalization processing may be performed on the audio/video data to be recognized to obtain the audio/video data in the uncompressed format. Optionally, the uncompressed format may be a Pulse-code modulation (PCM) format, that is, the compressed and encoded audio/video data may be converted into the PCM format audio/video data, and the PCM format is an uncompressed original audio/video format. As shown in the flowchart of fig. 2a, a process of performing format normalization processing on audio/video data to obtain audio/video data in a PCM format is shown. After the audio/video data in the PCM format is obtained, the audio/video data can be segmented into M audio/video segments by adopting the segmentation mode.
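As an illustration of the format normalization step, the sketch below decodes compressed audio/video into uncompressed PCM audio by calling the ffmpeg command-line tool; ffmpeg, the WAV container, and the 16 kHz mono settings are assumptions made for the example and are not specified by the patent.

```python
import subprocess

def to_pcm_wav(src_path: str, dst_path: str,
               sample_rate: int = 16000, channels: int = 1) -> None:
    """Decode compressed audio/video (e.g. MPEG, AVI, ASF) into uncompressed
    16-bit PCM audio. ffmpeg and the chosen rate/channels are illustrative."""
    subprocess.run(
        ["ffmpeg", "-y",           # overwrite the destination if it exists
         "-i", src_path,           # compressed input file
         "-vn",                    # drop the video stream, keep audio only
         "-acodec", "pcm_s16le",   # 16-bit little-endian PCM samples
         "-ar", str(sample_rate),  # resample to a fixed rate
         "-ac", str(channels),     # down-mix to mono
         dst_path],
        check=True,
    )

# Example (hypothetical file names):
# to_pcm_wav("episode.mp4", "episode_pcm.wav")
```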
In this embodiment, after the M audio/video clips are acquired, voice detection may be performed on the M audio/video clips in parallel to obtain N voice clips, where N is an integer greater than or equal to 2. It should be noted that an audio/video segment may include one or more speech segments, or may not include any speech segment, so the sizes of M and N are not necessarily related. In this embodiment, a speech detection model may be used to perform speech detection on the M audio/video clips. The voice detection model can recognize the voice-containing portions of the audio and video segments to obtain the voice segments. The speech detection model may employ, but is not limited to: a Long Short-Term Memory network (LSTM) or Connectionist Temporal Classification (CTC).
In this embodiment, a plurality of voice detection nodes are provided. Each voice detection node is a computing resource that can perform voice detection on an audio/video clip by running a voice detection model; a voice detection node may be a computing device, a CPU or GPU within a computing device, or a computing instance such as a virtual machine or a container on a computing device, without limitation. On this basis, the M audio and video clips can be sent to the plurality of voice detection nodes, and each voice detection node performs voice detection on the audio and video clip it is responsible for by using a voice detection model and outputs the detected voice clips. Fig. 2a illustrates 3 audio/video clips being sent to 3 voice detection nodes, but the invention is not limited thereto. In addition, fig. 2a illustrates that the 3 voice detection nodes detect voice segment 1.1, voice segment 1.2, voice segment 2.1, voice segment 2.2, voice segment 2.3, and voice segment 3.1 from the 3 audio/video segments in total.
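The following is a minimal sketch of how the M audio/video segments might be dispatched to parallel voice detection workers; the frame-energy threshold used inside detect_speech is only a stand-in for the trained detection model (e.g. LSTM- or CTC-based) that the patent describes, and all names and parameters are assumptions.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def detect_speech(job):
    """Stand-in for one voice detection node: return (start_s, end_s) spans of
    speech in one segment. `job` is (segment_start_s, samples, sample_rate).
    A real system would run a trained detection model here."""
    segment_start_s, samples, sample_rate = job
    frame = int(0.02 * sample_rate)            # 20 ms analysis frames
    energies = [float(np.mean(samples[i:i + frame].astype(np.float64) ** 2))
                for i in range(0, len(samples) - frame + 1, frame)]
    threshold = 0.1 * max(energies) if energies else 0.0
    spans, current = [], None
    for idx, energy in enumerate(energies):
        t0 = segment_start_s + idx * frame / sample_rate
        t1 = t0 + frame / sample_rate
        if energy > threshold:
            current = (current[0], t1) if current else (t0, t1)
        elif current:
            spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans                               # [(start_s, end_s), ...]

def detect_all(jobs, max_workers=3):
    """Send the M audio/video segments to a pool of workers so that voice
    detection runs in parallel; the pool stands in for the detection nodes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(detect_speech, jobs)
    return [span for per_segment in results for span in per_segment]
```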
If the audio and video data is not segmented reasonably, for example if a segmentation point happens to cut through a character, a word, or a sentence in the speech contained in an audio/video segment, the result of voice recognition based on the unreasonably segmented audio/video segment may be corrupted, resulting in inaccurate voice recognition. To address the problem of unreasonable segmentation, in this embodiment, after N voice segments are obtained through voice detection, normalization processing is performed on the N voice segments to obtain K normalized voice segments, where K is an integer greater than or equal to 1 and K is less than or equal to N. The purpose of regularizing the N voice segments is to obtain more coherent voice segments with more complete semantic expression, to overcome the problem that a continuous character, word, or sentence in the speech is cut off due to unreasonable segmentation, and to provide a basis for lossless voice recognition, guaranteeing recognition accuracy.
In this embodiment, the embodiment in which the normalization processing is performed on the N speech segments is not limited. Optionally, a manner of merging partial speech segments of the N speech segments may be adopted to achieve the purpose of regularization processing. Further considering that unreasonable segmentation occurs at the segmentation point, namely the adjacent boundary of two adjacent audio and video clips, based on the unreasonable segmentation, the adjacent voice clips from the adjacent audio and video clips in the N voice clips can be combined, namely the two voice clips are combined into one voice clip; as for the N voice segments, other voice segments may be kept unchanged as independent voice segments, and finally K normalized voice segments are obtained. In an alternative embodiment, the voice segments adjacent to and respectively from the adjacent audio-video segments can be identified, and the voice segments adjacent to and respectively from the adjacent audio-video segments can be directly combined into one voice segment. In another optional embodiment, the voice segments adjacent to and respectively from the adjacent audio and video segments among the N voice segments may be identified, and then it is determined whether there is an overlap between the adjacent voice segments and respectively from the adjacent audio and video segments or whether a time interval between the adjacent voice segments and respectively from the adjacent audio and video segments is smaller than a set interval threshold; if the voice segments which are adjacent and come from the adjacent audio-video segments are overlapped or the time interval between the two voice segments is smaller than a set interval threshold value, combining the voice segments which are adjacent and come from the adjacent audio-video segments into one voice segment; otherwise, if the time interval between the adjacent voice segments from the adjacent audio and video segments is greater than or equal to the set interval threshold, the two voice segments are taken as independent voice segments, and finally K normalized voice segments are obtained. Wherein, the existence of the overlap between the voice segments means that the two voice segments have the overlap in duration, for example, one voice segment is from 3s to 10s, and the other voice segment is from 7s to 15s, then the two voice segments have the overlap; the time interval between two adjacent speech segments can be calculated according to the time difference between the end time of the previous speech segment and the start time of the next speech segment, and the set interval threshold may be 1 millisecond (ms), 5ms, or 1s, and the like, which is not limited herein.
Specifically, as shown in fig. 2a, audio/video data in the PCM format is divided into 3 audio/video segments, which are respectively an audio/video segment 1, an audio/video segment 2, and an audio/video segment 3, and voice detection is performed on the 3 audio/video segments to obtain 6 voice segments. Specifically, voice detection is performed on the audio/video segment 1 to obtain a voice segment 1.1 and a voice segment 1.2, voice detection is performed on the audio/video segment 2 to obtain a voice segment 2.1, a voice segment 2.2 and a voice segment 2.3, and voice detection is performed on the audio/video segment 3 to obtain a voice segment 3.1. The positional relationship of 3 audio/video clips and 6 voice clips in time shown in fig. 2a is shown in fig. 2 b. In fig. 2b, it is assumed that there is an overlap between audio-video segments 1, 2 and 3. Then, regularization processing is carried out on the obtained 6 voice segments, and if the voice segment 1.2 and the voice segment 2.1 are overlapped, the voice segment 1.2 and the voice segment 2.1 can be combined to obtain a voice segment b; the voice segment 1.1 and the adjacent audio segment 1.2 are from the same audio/video segment 1, so that combination is not needed, and the voice segment 1.1 can exist as an independent voice segment, namely a regular voice segment a; the voice segment 2.2 does not have a voice segment adjacent to the voice segment and coming from another audio/video segment, so that merging is not needed, and the voice segment 2.2 can exist as an independent voice segment, namely a regular voice segment c; assuming that the speech segment 2.3 and the speech segment 3.1 are not overlapped and the time interval between the two is greater than the set interval threshold, the speech segment 2.3 and the speech segment 3.1 do not need to be merged and can exist as independent speech segments, that is, the normalized speech segments d and e. Further, 5 speech segments a-e obtained by regularization processing may be subjected to speech recognition to obtain 5 text segments.
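A minimal sketch of the regularization step described above is shown below; it merges detected voice segments that come from different (adjacent) audio/video clips and either overlap or are separated by less than an interval threshold, using illustrative times consistent with the example of figs. 2a and 2b. The function name, the tuple layout, and the 5 ms default threshold are assumptions.

```python
def normalize_segments(speech_segments, gap_threshold_s=0.005):
    """Merge voice segments (start_s, end_s, clip_index) that are adjacent in
    time, come from different audio/video clips, and overlap or are separated
    by less than `gap_threshold_s`; all other segments are kept unchanged."""
    ordered = sorted(speech_segments, key=lambda seg: seg[0])
    normalized = []
    for start, end, clip_index in ordered:
        if normalized:
            prev_start, prev_end, prev_clip = normalized[-1]
            from_other_clip = clip_index != prev_clip
            overlaps_or_close = start <= prev_end + gap_threshold_s
            if from_other_clip and overlaps_or_close:
                # Merge the two voice segments into one normalized segment.
                normalized[-1] = (prev_start, max(prev_end, end), clip_index)
                continue
        normalized.append((start, end, clip_index))
    return normalized

# Illustrative times for the six detected segments of figs. 2a/2b: segment 1.2
# and segment 2.1 overlap and come from adjacent clips, so they are merged and
# K = 5 normalized segments remain.
detected = [
    (1.0, 4.0, 1),    # voice segment 1.1
    (6.0, 16.0, 1),   # voice segment 1.2
    (14.0, 18.0, 2),  # voice segment 2.1 (overlaps segment 1.2)
    (20.0, 23.0, 2),  # voice segment 2.2
    (25.0, 27.0, 2),  # voice segment 2.3
    (30.0, 35.0, 3),  # voice segment 3.1
]
print(normalize_segments(detected))   # 5 normalized segments a-e
```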
In this embodiment, an embodiment in which a text segment is obtained by performing speech recognition on a speech segment is not limited. For example, a speech recognition algorithm or a speech recognition network model may be used to perform speech recognition on the speech segment to obtain the text segment. Wherein the speech recognition algorithm may be, but is not limited to: an algorithm based on Dynamic Time Warping (Dynamic Time Warping), a method based on a Hidden Markov Model (HMM) of a parametric Model, or a method based on Vector Quantization (VQ) of a nonparametric Model, and the like. The speech recognition network model can adopt but is not limited to: LSTM or CTC implementation.
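A sketch of the parallel recognition step follows; the recognize function is an unimplemented placeholder where any actual recognizer (for example an LSTM/CTC-based engine, or a DTW/HMM/VQ method as mentioned above) would be plugged in, so the names and structure here are assumptions rather than a specific engine's API.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize(segment):
    """Placeholder for one recognition worker: run a speech recognizer on a
    single normalized voice segment and return its text. No specific engine
    is assumed; replace this body to run the sketch end to end."""
    raise NotImplementedError("plug in an actual speech recognizer here")

def recognize_all(normalized_segments, max_workers=4):
    """Recognize the K normalized voice segments in parallel. The input order
    is preserved so each text segment stays aligned with its time span, which
    later steps (subtitle synthesis, post-processing) rely on."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(recognize, normalized_segments))
    return list(zip(normalized_segments, texts))   # [(segment, text), ...]
```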
In this embodiment, after performing speech recognition on the K normalized speech segments, K text segments are obtained. However, the text segments obtained by speech recognition may have some shortcomings. For example: (1) the recognition result has no punctuation marks; (2) a relatively long utterance, for example a text segment obtained by recognizing a 40s to 50s speech segment, has no sentence breaks; (3) spoken text is interspersed with filler (modal) words, and so on. Based on this, in an optional embodiment, text post-processing may be performed on the K text segments to obtain the speech text corresponding to the audio/video data. The text post-processing includes: adding punctuation, handling modal/filler words, inserting sentence breaks, and the like.
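The rules below are a very small sketch of the text post-processing step (filler-word removal, simple punctuation, and sentence breaks); production systems typically use a trained punctuation-restoration model, and the filler-word list and rules here are illustrative assumptions only.

```python
import re

FILLER_WORDS = {"um", "uh", "er"}   # illustrative; the patent targets modal/filler words

def post_process(text_segments):
    """Join K text segments into a speech text: drop filler words, capitalize,
    and add naive sentence-final punctuation."""
    sentences = []
    for text in text_segments:
        words = [w for w in text.split() if w.lower() not in FILLER_WORDS]
        if not words:
            continue
        sentence = " ".join(words)
        sentence = sentence[0].upper() + sentence[1:]
        if not re.search(r"[.!?]$", sentence):
            sentence += "."
        sentences.append(sentence)
    return " ".join(sentences)

print(post_process(["um so the first clip says hello", "and uh the second continues"]))
# -> "So the first clip says hello. And the second continues."
```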
In the embodiment of the application, the audio and video data are segmented to provide a basis for performing voice detection and voice recognition in parallel, and, considering that the segmentation of the audio and video data may be unreasonable, the voice fragments obtained by voice detection are normalized before voice recognition, and voice recognition is then performed on the normalized voice fragments to ensure recognition quality; in the whole process, voice detection and voice recognition are carried out in parallel, which reduces the time consumed by voice recognition, improves its efficiency, and improves the user experience.
Further, in this embodiment, the parallel mode is adopted for voice detection and voice recognition, so that the time for voice recognition of the audio and video data is shortened.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of steps 101 to 103 may be device a; for another example, the execution subject of steps 101 and 102 may be device a, and the execution subject of step 103 may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 101, 102, etc., are merely used for distinguishing different operations, and the sequence numbers do not represent any execution order per se. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
The speech recognition method provided by the method embodiment of the present application can be implemented in a software manner, and can also be implemented in a hardware manner. When implemented in hardware, the implementation may be performed by a speech processing chip. In view of this, an embodiment of the present application further provides a voice processing chip, where the voice processing chip may be implemented by a Complex Programmable Logic Device (CPLD), a Field Programmable Gate Array (FPGA), or a single chip microcomputer. No matter what process or technique is adopted for implementation, as shown in fig. 3, the voice processing chip at least includes the following modules: a segmentation module 301, a multi-channel speech detection module 302, a warping module 303, and a multi-channel speech recognition module 304.
The segmentation module 301 is configured to segment audio/video data to be identified into M audio/video segments, and output the M audio/video segments to the multi-channel voice detection module 302. The multi-channel voice detection module 302 is configured to perform voice detection on the M audio/video segments output by the segmentation module 301 in parallel to obtain N voice segments, and output the N voice segments to the normalization module 303.
The regularization module 303 is configured to perform regularization on the N speech segments output by the multi-channel speech detection module 302 to obtain K regularized speech segments, and to output the K regularized speech segments to the multi-channel speech recognition module 304. The multi-channel speech recognition module 304 is configured to perform speech recognition on the K normalized speech segments output by the normalization module 303 in parallel to obtain K text segments; wherein M and N are integers greater than or equal to 2, K is an integer greater than or equal to 1, and K is less than or equal to N.
For details of the working process or principle of the segmentation module 301, the multi-channel speech detection module 302, the warping module 303 and the multi-channel speech recognition module 304, reference may be made to the foregoing embodiments, which are not described herein again.
Further optionally, the voice processing chip of this embodiment may further include: a sound pickup module 305, a speaker 306, a communication component 307 (such as a wifi module, a bluetooth module, or an infrared module), a power supply module 308, and the like.
The voice processing chip provided by the embodiment of the application segments the audio and video data and then performs voice detection and voice recognition on the resulting audio and video fragments in parallel, so as to obtain the text fragments corresponding to the audio and video data. To compensate for possibly unreasonable segmentation of the audio and video fragments, the voice fragments obtained by voice detection are regularized before voice recognition. In the whole process, voice detection and voice recognition are carried out in parallel, which reduces the time required for voice recognition of the audio and video data and improves the user experience.
When implemented in software, the above-described embodiments of the speech recognition method may be implemented by a speech recognition apparatus including a plurality of processors. In view of this, an embodiment of the present application further provides a speech recognition apparatus, as shown in fig. 4, including: a memory 401 and a plurality of processors 402.
The memory 401 is used to store computer programs and may be configured to store other various data to support operations on the speech recognition device. Examples of such data include instructions for any application or method operating on the speech recognition device, contact data, phonebook data, messages, pictures, videos, and the like.
The processors 402 may be a Central Processing Unit (CPU) and/or a Graphics Processing Unit (GPU), but are not limited thereto.
A plurality of processors 402, coupled to the memory 401, for executing computer programs in the memory 401 for: dividing audio and video data to be identified into M audio and video segments; performing voice detection on the M audio and video clips in parallel to obtain N voice clips; regularizing the N voice segments to obtain K regularized voice segments; performing voice recognition on the K normalized voice segments in parallel to obtain K text segments; wherein M and N are integers greater than or equal to 2, K is an integer greater than or equal to 1, and K is less than or equal to N.
In an optional embodiment, when the multiple processors 402 divide the audio/video data to be recognized into M audio/video segments, the multiple processors are specifically configured to: dividing the audio and video data to be identified into M audio and video segments according to the set slicing duration; or according to the set number of the slices, dividing the audio and video data to be identified into M audio and video segments.
In an alternative embodiment, the time lengths of the M audio-video clips are the same.
In an optional embodiment, there is no overlap between adjacent audio and video segments, or there is overlap between adjacent audio and video segments, and the overlap time is less than the duration of the audio and video segments.
In an alternative embodiment, when obtaining N speech segments, the processors 402 are specifically configured to: and sending the M audio and video clips to a plurality of voice detection nodes, wherein each voice detection node performs voice detection on the audio and video clip responsible for the voice detection node by using a voice detection model and outputs the detected voice clip.
In an alternative embodiment, the processors 402, when obtaining the K normalized speech segments, are specifically configured to: and combining the adjacent voice fragments from the adjacent audio and video fragments in the N voice fragments to obtain K normalized voice fragments.
In an optional embodiment, when the plurality of processors 402 performs the merging process on the voice segments adjacent to and from the adjacent audio and video segments in the N voice segments, the processors are specifically configured to: if the voice segments adjacent to and from the adjacent audio-video segments are overlapped or the time interval between the two voice segments is smaller than the set interval threshold, combining the voice segments adjacent to and from the adjacent audio-video segments into one voice segment to obtain K normalized voice segments.
In an alternative embodiment, the plurality of processors 402 are further configured to: and performing text post-processing on the K text segments to obtain a voice text corresponding to the audio and video data.
In an alternative embodiment, before the dividing the audiovisual data to be recognized into M audiovisual segments, the plurality of processors 402 are further configured to: and carrying out format normalization processing on the audio and video data to be identified so as to obtain the audio and video data in a non-compressed format.
Further, as shown in fig. 4, the speech recognition apparatus further includes: communication components 406, display 407, power components 408, audio components 409, and other components. Only some of the components are schematically shown in fig. 4, and it is not meant that the speech recognition apparatus comprises only the components shown in fig. 4. It should be noted that the components within the dashed line box in fig. 4 are optional components, not necessary components, and may be determined according to the product form of the speech recognition device. The voice recognition device of the embodiment can be implemented as a terminal device such as a desktop computer, a notebook computer or a smart phone, and can also be a server device such as a conventional server, a cloud server or a server array. If the speech recognition device of this embodiment is implemented as a terminal device such as a desktop computer, a notebook computer, a smart phone, etc., the speech recognition device may include components within a dashed line frame in fig. 4; if the speech recognition device of this embodiment is implemented as a server device such as a conventional server, a cloud server, or a server array, the components in the dashed box in fig. 4 may not be included.
Accordingly, the present application also provides a computer readable storage medium storing a computer program, and the computer program can implement the steps in the speech recognition method shown in fig. 1 when executed.
The voice recognition method provided by the embodiment of the application can be applied to various audio and video scenes, and accordingly, the implementation forms of the voice recognition equipment are different in different audio and video scenes. For example, the method can be applied to automatically adding subtitles in various audio and video scenes, such as automatically adding subtitles in a video editing scene, and automatically adding subtitles in a short video or traditional audio and video application. For another example, the method can also be applied to various audio and video applications for content discovery or security audit based on text. The following respectively exemplifies the situation that the speech recognition method is applied to automatically adding subtitles in various audio and video scenes:
fig. 5a is a schematic structural diagram of an audio/video processing system according to an exemplary embodiment of the present application, and as shown in fig. 5a, the system includes: the system comprises an audio and video terminal 501a, an audio and video editing device 502a, an audio and video server 503a and a cloud server 504 a. In this embodiment, the cloud server 504a may be used as a specific implementation form of the speech recognition device shown in fig. 4.
In this embodiment, the audio/video terminal 501a is a terminal that is to play audio/video data, and may be a terminal device such as a desktop computer, a notebook computer, and a smart phone; the audio/video server 503a may respond to the request of the audio/video terminal 501a, and provide audio/video data and other audio/video services to the audio/video terminal 501a, and may be a conventional server or a server array or other server side device.
In this embodiment, the audio/video producer can record or produce audio/video data, and provide the audio/video data to the audio/video editing device 502a for the audio/video editing device 502a to perform various editing processes. The audio and video editing device 502a performs editing processing on the audio and video data, including adding subtitles to the audio and video data, and is responsible for uploading the audio and video data with the subtitles added to the audio and video server 503a, so that the audio and video server 503a provides audio and video services for the audio and video terminal 501 a. The audio and video editing device 502a receives audio and video data provided by an audio and video producer, judges whether the audio and video data needs to be added with subtitles according to requirements, identifies the audio and video data to be added with subtitles from the audio and video data, and sends the audio and video data to the cloud server 504 a.
The cloud server 504a may receive audio and video data to be added with subtitles, which is sent by the audio and video editing device 502a, and segment the audio and video data into a plurality of audio and video segments; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; and performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment, and returning the text segment to the audio/video editing equipment 502 a. The audio/video editing device 502a receives at least one text segment, synthesizes the at least one text segment into subtitle information, and uploads the audio/video data and the subtitle information to the audio/video server 503 a. The audio/video server 503a receives the audio/video data and the subtitle information and sends the audio/video data and the subtitle information to the audio/video terminal 501a, and the audio/video terminal 501a receives the audio/video data and the subtitle information and synchronously displays the subtitle information in the process of playing the audio/video data. The audio/video server 503a may actively send the audio/video data and the subtitle information to the audio/video terminal 501a, so that the audio/video terminal 501a plays the audio/video data and synchronously displays the subtitle information; or, the audio/video server 503a may also send the audio/video data and the subtitle information to the audio/video terminal 501a according to the request of the audio/video terminal 501a, so that the audio/video terminal 501a plays the audio/video data and synchronously displays the subtitle information.
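The patent does not specify how the text segments are synthesized into subtitle information; as one possible illustration, the sketch below formats timed text segments as SRT subtitles, a common format that an audio/video terminal can display synchronously while playing the audio/video data. The SRT choice and all names are assumptions.

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def synthesize_subtitles(timed_texts) -> str:
    """Turn [(start_s, end_s, text), ...] returned by the recognizer into
    subtitle information in SRT form (the format choice is an assumption)."""
    blocks = []
    for index, (start, end, text) in enumerate(timed_texts, start=1):
        blocks.append(f"{index}\n{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)

print(synthesize_subtitles([(1.0, 4.0, "Hello there."),
                            (6.0, 18.0, "Welcome to the show.")]))
```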
In addition, it should be noted that the function of the cloud server 504a can also be implemented by the audio/video editing device 502a, so that an audio/video processing system including an audio/video terminal, an audio/video editing device, and an audio/video server can be implemented, in the system, after the audio/video editing device identifies audio/video data to be added with subtitles, the audio/video data is not uploaded to the cloud server 504a, but the audio/video data is segmented into a plurality of audio/video segments; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment; and synthesizing at least one text segment into subtitle information, and uploading the audio and video data and the subtitle information to the audio and video server 503 a. For a detailed description of performing speech detection and speech recognition in parallel, reference may be made to the foregoing embodiments, which are not described herein again.
Fig. 5b is a schematic structural diagram of another audio/video processing system provided in an exemplary embodiment of the present application, and as shown in fig. 5b, the system includes: an audio-video terminal 501b and an audio-video server 502 b. In this embodiment, the audio/video server 502b may be used as a specific implementation form of the speech recognition device shown in fig. 4.
In this embodiment, the audio/video terminal 501b is used for playing audio/video data, and may be a terminal device such as a desktop computer, a notebook computer, and a smart phone; the audio/video server 502b may respond to the request of the audio/video terminal 501b, and provide audio/video data and other audio/video services to the audio/video terminal 501b, and may be a conventional server or a server array or other server side device.
In this embodiment, when the audio/video terminal 501b plays the audio/video data, a playing interface is displayed, and the audio/video data is played on the playing interface; in addition, a subtitle adding control is arranged on the playing interface, and the control can be a button or a sliding switch. When a user watches audio and video data, the control can be triggered to initiate a request for adding subtitles; the trigger operation for adding the control to the subtitle can be, but is not limited to: click, long press, slide, hover, double click, touch, etc. The audio/video terminal 501b may respond to the user's trigger operation on the subtitle adding control, and request the audio/video server 502b for subtitle information corresponding to the currently played audio/video data.
The audio/video server 502b may respond to the request of the audio/video terminal 501b, and segment the audio/video data currently played by the audio/video terminal 501b into a plurality of audio/video segments according to the request; perform voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and perform regularization processing on the plurality of voice clips; and perform voice recognition on the normalized voice segments in parallel to obtain at least one text segment, synthesize the at least one text segment into subtitle information, and return the subtitle information to the audio/video terminal 501b. The audio/video terminal 501b receives the subtitle information returned by the audio/video server and synchronously displays the subtitle information on the playing interface. For a detailed description of performing voice detection and voice recognition in parallel, reference may be made to the foregoing embodiments, which are not repeated herein.
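As a purely illustrative sketch of the subtitle synthesis step, the recognized text segments could be assembled into subtitle information as follows; the (start, end, text) tuple layout and the SRT-style output are assumptions made for readability, not formats prescribed by the embodiments.

from typing import List, Tuple

def synthesize_subtitles(text_segments: List[Tuple[float, float, str]]) -> str:
    """Turn (start_seconds, end_seconds, text) tuples into SRT-style subtitle text."""
    def stamp(seconds: float) -> str:
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    blocks = []
    for idx, (start, end, text) in enumerate(text_segments, start=1):
        blocks.append(f"{idx}\n{stamp(start)} --> {stamp(end)}\n{text}\n")
    return "\n".join(blocks)

# Example: synthesize_subtitles([(0.0, 2.5, "hello"), (3.1, 5.0, "world")])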
Fig. 5c is a schematic structural diagram of another audio/video processing system provided in an exemplary embodiment of the present application, and as shown in fig. 5c, the system includes: an audio and video terminal 501c, an audio and video server 502c, and a cloud server 503c. The cloud server 503c may be used as a specific implementation form of the speech recognition device shown in fig. 4.
In this embodiment, the audio/video terminal 501c is used for playing audio/video data, and may be a terminal device such as a desktop computer, a notebook computer, or a smartphone; the audio/video server 502c may respond to requests of the audio/video terminal 501c and provide audio/video data and other audio/video services to the audio/video terminal 501c, and may be a conventional server, a server array, or other server-side device.
In this embodiment, when the audio/video terminal 501c plays the audio/video data, a playing interface is displayed, and the audio/video data is played on the playing interface; in addition, a subtitle adding control is arranged on the playing interface, and the control can be a button or a sliding switch. When a user watches the audio and video data, the control can be triggered to initiate a request for adding subtitles; the trigger operation on the subtitle adding control can be, but is not limited to: click, long press, slide, hover, double click, touch, etc. The audio/video terminal 501c may respond to the user's trigger operation on the subtitle adding control and request, from the audio/video server 502c, subtitle information corresponding to the currently played audio/video data.
The audio/video server 502c may send the audio/video data currently played by the audio/video terminal 501c to the cloud server 503c according to the request of the audio/video terminal 501c. The cloud server 503c receives the audio and video data and segments the audio and video data into a plurality of audio and video segments; performs voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performs regularization processing on the plurality of voice clips; and performs voice recognition on the normalized voice segments in parallel to obtain at least one text segment, and returns the at least one text segment to the audio/video server 502c. The audio/video server 502c receives the at least one text segment, synthesizes the at least one text segment into subtitle information, and sends the subtitle information to the audio/video terminal 501c; the audio/video terminal 501c receives the subtitle information sent by the audio/video server 502c and synchronously displays the subtitle information on the playing interface.
Whether the audio and video processing system is the one shown in fig. 5b or the one shown in fig. 5c, it may optionally be an online education system, in which case the audio and video terminal is a teacher terminal or a student terminal. For example, in the embodiment shown in fig. 5c, a teacher uploads a teaching video or assigns an audio or video homework task and uploads it to the audio/video server; the audio/video server calls the voice recognition function provided by the cloud server to perform voice recognition on the teaching video or the audio or video homework task to obtain a voice text, and the voice text is used as subtitle information of the teaching video or homework task and is synchronized to the student terminal. The students watch the teaching video online through their terminals and understand its content with the help of the subtitle information; they can also receive the audio or video homework tasks assigned by the teacher, play them, understand the homework requirements according to the subtitle information in the tasks, record audio or video homework content according to those requirements, and upload the content to the audio/video server. The audio/video server calls the voice recognition function provided by the cloud server to perform voice recognition to obtain a voice text, and the voice text is used as subtitle information of the audio or video homework content submitted by the students and is synchronized to the teacher terminal. The teacher plays the audio or video homework content submitted by the students through the teacher terminal and understands the students' intentions according to the subtitle information in the homework content, so that the homework can be corrected more accurately.
In another alternative embodiment, if the audio/video processing system shown in fig. 5b or fig. 5c is an online live broadcast system, the audio/video terminal is an anchor terminal. For example, in the embodiment shown in fig. 5c, the anchor terminal starts a live broadcast function, records live video in real time through recording devices such as the camera of the anchor terminal, and uploads the live video to the audio/video server; the audio/video server calls the voice recognition function provided by the cloud server to perform voice recognition on the live video to obtain a voice text, and the voice text is used as subtitle information of the live video of the anchor terminal and is synchronized to the live viewer terminal. The live viewer terminal plays the live video and synchronously displays the subtitle information corresponding to the live video, so that the live content can be understood more fully according to the subtitle information. Optionally, a subtitle control can be further arranged on the live interface of the live viewer terminal; a user can select whether to display the subtitle information through the control, and if the user closes the subtitle display function through the control, the subtitle information is not displayed on the live interface.
In yet another alternative embodiment, the audio/video processing system shown in fig. 5b or fig. 5c is a short video application system, and the audio/video terminal is a short video recording terminal. For example, in the embodiment shown in fig. 5c, the short video recording terminal uploads the recorded short video to the audio/video server; the audio/video server calls the voice recognition function provided by the cloud server to perform voice recognition on the short video to obtain a voice text, and the voice text is used as subtitle information of the short video content and is synchronized to the short video client. The short video client plays the short video and synchronously displays the subtitle information corresponding to the short video, so that the short video content can be understood more fully according to the subtitle information. Optionally, a subtitle control may be further set on the playing interface of the short video client; the user may select whether to display the subtitle information through the control, and if the user closes the subtitle display function through the control, the subtitle information will not be displayed on the playing interface.
In the audio/video processing system shown in fig. 5b and 5c, the speech recognition process is performed by either the audio/video server or the cloud server, but is not limited thereto. For example, the voice recognition process may also be performed locally by the audio-video terminal. Based on this, an embodiment of the present application further provides an audio/video playing method, which is applicable to an audio/video terminal, and as shown in fig. 5d, the method includes:
501d, displaying a playing interface, wherein a subtitle adding control is arranged on the playing interface;
502d, responding to the triggering operation of the subtitle adding control, and segmenting the currently played audio and video data into a plurality of audio and video segments;
503d, performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips;
504d, performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment, synthesizing the at least one text segment into subtitle information, and then synchronously displaying the subtitle information on a playing interface.
With respect to steps 502d-504d, the difference from the foregoing embodiments lies only in the execution subject; the detailed implementation process is the same, so reference can be made to the foregoing embodiments, and details are not described herein again.
In addition to the above automatic subtitle adding scene, the speech recognition method provided in the embodiment of the present application may also be applied to content discovery or security audit in various audio and video applications. Based on this, an audio and video processing system is further provided in the embodiment of the present application, and as shown in fig. 6a, the system includes: an audio and video terminal 601a, an audio and video server 602a, and a cloud server 603a. The cloud server 603a may be used as a specific implementation form of the speech recognition device shown in fig. 4.
In this embodiment, the audio/video terminal 601a may obtain audio/video data and upload the audio/video data to the audio/video server 602a. For example, recording and playing software may be installed on the audio/video terminal 601a, and a short video is recorded by a camera and uploaded to the audio/video server 602a. For another example, the audio/video terminal 601a may have instant messaging software installed, and during a video chat or a video conference with an opposite-end user through the instant messaging software, local-end video data may be collected through a camera, uploaded to the audio/video server 602a, and synchronized to the opposite-end user by the audio/video server 602a. Correspondingly, the audio/video server 602a may receive the audio/video data uploaded by the audio/video terminal 601a, store the audio/video data, or send the audio/video data to other audio/video terminals. In order to ensure the security compliance of the audio and video data, in this embodiment, the audio/video server 602a may further provide the audio and video data to the cloud server 603a; the cloud server 603a performs voice recognition on the audio/video data to convert the audio/video data into a voice text, and the audio/video server 602a then performs security audit or content discovery based on the voice text. Content discovery on the at least one text segment aims to effectively retrieve information from the ever-increasing audio and video data and to expand the means of acquiring information. Security audit aims to screen out unhealthy information such as violence and obscenity. A security audit policy or a content discovery policy may be preset in the audio/video server 602a, and security audit or content discovery is performed accordingly.
In this embodiment, the process of performing speech recognition on the audio/video data by the cloud server 603a includes: receiving the audio and video data, and dividing the audio and video data into a plurality of audio and video segments; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; and performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment, and returning the at least one text segment to the audio/video server 602a. The audio/video server 602a receives the at least one text segment and performs content discovery or security audit on the at least one text segment.
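Purely for illustration, a keyword-based screening of the recognized text segments might look like the sketch below; the embodiments do not specify the audit or discovery policy, and a real deployment would typically rely on trained classifiers rather than literal keyword matching.

from typing import Iterable, List

def security_audit(text_segments: Iterable[str], banned_words: List[str]) -> bool:
    """Return True when no recognized text segment contains a banned word."""
    return not any(word in segment for segment in text_segments for word in banned_words)

def content_discovery(text_segments: Iterable[str], topics: List[str]) -> List[str]:
    """Return the recognized text segments that mention any topic of interest."""
    return [segment for segment in text_segments if any(topic in segment for topic in topics)]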
Optionally, the voice recognition function implemented by the cloud server 603a may also be implemented by the audio/video server 602a, based on which, an embodiment of the present application further provides an audio/video processing system, as shown in fig. 6b, the system includes: an audio-video terminal 601b and an audio-video server 602b.
In this embodiment, the audio/video terminal 601b is configured to acquire audio/video data and upload the audio/video data to the audio/video server 602b. The audio/video server 602b receives the audio/video data and segments the audio/video data into a plurality of audio/video segments; performs voice detection on the plurality of audio/video clips in parallel to obtain a plurality of voice clips, and performs regularization processing on the plurality of voice clips; performs voice recognition on the normalized voice segments in parallel to obtain at least one text segment; and performs content discovery or security audit on the at least one text segment.
It should be noted that the audio/video processing system shown in fig. 6a or fig. 6b may optionally be an online education system, in which case the audio/video terminal is a teacher terminal or a student terminal; or the audio/video processing system is an online live broadcast system, in which case the audio/video terminal is an anchor terminal; or the audio/video processing system is a short video application system, in which case the audio/video terminal is a short video recording terminal.
In the audio/video processing system shown in fig. 6a and 6b, the speech recognition process is performed by either the cloud server or the audio/video server, but not limited thereto. For example, the voice recognition process may also be performed locally by the audio-video terminal. Based on this, an embodiment of the present application further provides an audio and video uploading method, which is applicable to an audio and video terminal, and as shown in fig. 6c, the method includes:
601c, responding to the uploading operation of the user, and acquiring audio and video data to be uploaded;
602c, dividing the audio and video data into a plurality of audio and video segments;
603c, performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips;
604c, performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment;
605c, uploading the audio and video data to the audio and video server based on the at least one text fragment.
In an optional embodiment, uploading the audio-video data to the audio-video server based on the at least one text segment includes: performing security audit on the at least one text segment; and uploading the audio and video data to the audio and video server in a case where the at least one text segment passes the security audit.
In another optional embodiment, uploading the audio-video data to the audio-video server based on the at least one text segment includes: performing content discovery on the at least one text segment; and uploading the audio and video data to the audio and video server in a case where the specified content is found in the at least one text segment.
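The two optional upload strategies above can be pictured with the following sketch; text_check and upload are hypothetical callables, since the embodiments do not prescribe a concrete interface for the security audit, the content discovery, or the upload itself.

from typing import Callable, Iterable

def upload_based_on_text(
    av_data: bytes,
    text_segments: Iterable[str],
    text_check: Callable[[Iterable[str]], bool],   # e.g. a security audit or content discovery check
    upload: Callable[[bytes], None],               # sends the audio/video data to the server
) -> bool:
    """Upload the audio/video data only when the recognized text passes the configured check."""
    if text_check(list(text_segments)):
        upload(av_data)
        return True
    return False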
In the embodiments of the present application, regardless of the execution subject, in the speech recognition process the audio and video data are segmented to provide a basis for performing voice detection and voice recognition in parallel; considering that the segmentation of the audio and video data may be unreasonable, the voice segments obtained by voice detection are normalized before voice recognition, and voice recognition is then performed on the normalized voice segments, so as to ensure the quality of the voice recognition. In the whole process, voice detection and voice recognition are performed in parallel, which reduces the time consumed by voice recognition, improves the efficiency of voice recognition, and guarantees the user experience.
Fig. 7 is a schematic structural diagram of a terminal device according to an exemplary embodiment of the present application. As shown in fig. 7, the apparatus includes: a memory 74 and a processor 75.
The memory 74 is used for storing computer programs and may be configured to store other various data to support operations on the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, contact data, phonebook data, messages, pictures, videos, etc.
A processor 75, coupled to the memory 74, for executing computer programs in the memory 74 for: displaying a playing interface, wherein a subtitle adding control is arranged on the playing interface; responding to the triggering operation of the subtitle adding control, and segmenting currently played audio and video data into a plurality of audio and video segments; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; and performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment, synthesizing the at least one text segment into subtitle information, and then synchronously displaying the subtitle information on the playing interface.
In an optional embodiment, when segmenting the currently played audio/video data into a plurality of audio/video segments, the processor 75 is specifically configured to: segment the audio and video data to be identified into a plurality of audio and video segments according to a set slicing duration; or segment the audio and video data to be identified into a plurality of audio and video segments according to a set number of slices. Optionally, the time lengths of the plurality of audio and video segments are the same. Further optionally, there is no overlap between adjacent audio and video segments; or there is an overlap between adjacent audio and video segments, and the overlap time is less than the duration of the audio and video segments.
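As a minimal sketch of the two segmentation options above (a set slicing duration or a set number of slices, with an optional overlap between adjacent segments), the audio track is treated below as an array of PCM samples; this data layout and the function names are assumptions and are not details given by the embodiments.

from typing import List, Sequence

def slice_by_duration(samples: Sequence[float], sample_rate: int,
                      slice_seconds: float, overlap_seconds: float = 0.0) -> List[Sequence[float]]:
    """Fixed-duration slices; adjacent slices overlap by overlap_seconds, which must be shorter than a slice."""
    size = int(slice_seconds * sample_rate)
    step = max(int((slice_seconds - overlap_seconds) * sample_rate), 1)
    return [samples[i:i + size] for i in range(0, len(samples), step)]

def slice_by_count(samples: Sequence[float], num_slices: int) -> List[Sequence[float]]:
    """A fixed number of slices of (nearly) equal length."""
    size = max(-(-len(samples) // num_slices), 1)  # ceiling division
    return [samples[i:i + size] for i in range(0, len(samples), size)]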
In an alternative embodiment, when obtaining the normalized voice segments, the processor 75 is specifically configured to: combine adjacent voice segments that come from adjacent audio and video segments to obtain a normalized voice segment. Further optionally, the processor 75 is specifically configured to: if voice segments that are adjacent and come from adjacent audio-video segments overlap, or the time interval between them is smaller than a set interval threshold, combine those voice segments into one voice segment.
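The merging rule just described can be pictured with the short sketch below. It is an illustration under assumptions only: detected speech is represented as (start, end) times in seconds, and the sketch merges by time proximity alone without tracking which audio and video segment each voice segment came from.

from typing import List, Tuple

def normalize_speech(speech: List[Tuple[float, float]], max_gap: float = 0.2) -> List[Tuple[float, float]]:
    """Merge speech pieces that overlap or whose gap is below max_gap seconds."""
    merged: List[Tuple[float, float]] = []
    for start, end in sorted(speech):
        if merged and start - merged[-1][1] <= max_gap:   # overlapping or nearly touching
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Example: normalize_speech([(0.0, 9.9), (10.0, 15.0), (20.0, 25.0)]) -> [(0.0, 15.0), (20.0, 25.0)]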
Further, as shown in fig. 7, the terminal device further includes: a communication component 76, a display 77, a power component 78, an audio component 79, and the like. Only some of the components are schematically shown in fig. 7, which does not mean that the terminal device includes only the components shown in fig. 7.
Accordingly, the present application further provides a computer readable storage medium storing a computer program, and the computer program can implement the steps in the method embodiment shown in fig. 5d when executed.
The embodiment of the present application further provides a terminal device, where the implementation structure of the terminal device is the same as or similar to that of the terminal device shown in fig. 7 and may be implemented with reference to the structure shown in fig. 7. The difference between the terminal device provided in this embodiment and the terminal device in the embodiment shown in fig. 7 mainly lies in the functions for which the processor executes the computer program stored in the memory. For the terminal device provided in this embodiment, the processor executes the computer program stored in the memory so as to: respond to an uploading operation of a user and acquire audio and video data to be uploaded; segment the audio and video data into a plurality of audio and video segments; perform voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and perform regularization processing on the plurality of voice clips; perform voice recognition on the normalized voice segments in parallel to obtain at least one text segment; and upload the audio and video data to an audio and video server based on the at least one text segment. In an optional embodiment, security audit or content discovery may be performed on the at least one text segment, and the audio and video data are uploaded to the audio and video server in a case where the at least one text segment passes the security audit or the specified content is found in the at least one text segment. For detailed implementation of parallel voice detection and voice recognition, reference may be made to the foregoing embodiments, which are not described herein again.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program, where the computer program can implement the steps in the method embodiment shown in fig. 6c when executed.
The memories of fig. 4 and 7 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The communication components of fig. 4 and 7 described above are configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device in which the communication component is located can access a wireless network based on a communication standard, such as WiFi, a mobile communication network such as 2G, 3G, 4G/LTE or 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The displays in fig. 4 and 7 described above include screens, which may include Liquid Crystal Displays (LCDs) and Touch Panels (TPs). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The power supply components of fig. 4 and 7 described above provide power to the various components of the device in which the power supply components are located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
The audio components of fig. 4 and 7 described above may be configured to output and/or input audio signals. For example, the audio component includes a Microphone (MIC) configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (22)

1. A speech recognition method, comprising:
dividing audio and video data to be identified into M audio and video segments;
performing voice detection on the M audio and video clips in parallel to obtain N voice clips;
regularizing the N voice segments to obtain K regularized voice segments;
performing voice recognition on the K normalized voice segments in parallel to obtain K text segments; wherein M and N are each an integer of 2 or more, K is an integer of 1 or more, and K ≤ N.
2. The method of claim 1, wherein the segmenting of the audio and video data to be identified into M audio and video segments comprises:
dividing the audio and video data to be identified into M audio and video segments according to the set slicing duration;
or
And according to the set number of the slices, dividing the audio and video data to be identified into M audio and video segments.
3. The method according to claim 2, wherein the time lengths of the M audio and video segments are the same.
4. The method of claim 1, wherein performing speech detection on the M audio/video segments in parallel to obtain N speech segments comprises:
and sending the M audio and video clips to a plurality of voice detection nodes, wherein each voice detection node performs voice detection on the audio and video clip responsible for the voice detection node by using a voice detection model and outputs the detected voice clip.
5. The method of claim 1, wherein regularizing the N speech segments to obtain K regularized speech segments comprises:
and combining the adjacent voice fragments from the adjacent audio and video fragments in the N voice fragments to obtain K normalized voice fragments.
6. The method according to claim 5, wherein the merging the adjacent voice segments from the adjacent audio-video segments of the N voice segments to obtain K normalized voice segments comprises:
if the voice segments which are adjacent in the N voice segments and come from the adjacent audio-video segments are overlapped or the time interval between the voice segments is smaller than the set interval threshold, the voice segments which are adjacent and come from the adjacent audio-video segments are combined into one voice segment to obtain K normalized voice segments.
7. The method of any one of claims 1-6, further comprising:
performing text post-processing on the K text segments to obtain a voice text corresponding to the audio and video data.
8. The method according to any one of claims 1 to 6, wherein before the dividing of the audiovisual data to be recognized into M audiovisual segments, further comprising:
and carrying out format normalization processing on the audio and video data to be identified so as to obtain the audio and video data in a non-compressed format.
9. A speech processing chip, comprising:
the segmentation module is used for segmenting the audio and video data to be identified into M audio and video segments;
the multi-channel voice detection module is used for performing voice detection on the M audio and video clips in parallel to obtain N voice clips;
the normalization module is used for normalizing the N voice segments to obtain K normalized voice segments;
the multi-channel voice recognition module is used for performing voice recognition on the K normalized voice segments in parallel to obtain K text segments; wherein M and N are each an integer of 2 or more, K is an integer of 1 or more, and K ≤ N.
10. A speech recognition device, comprising: a memory and a plurality of processors;
the memory for storing a computer program;
the plurality of processors coupled with the memory for executing the computer program for:
dividing audio and video data to be identified into M audio and video segments;
performing voice detection on the M audio and video clips in parallel to obtain N voice clips;
regularizing the N voice segments to obtain K regularized voice segments;
performing voice recognition on the K normalized voice segments in parallel to obtain K text segments; wherein M and N are each an integer of 2 or more, K is an integer of 1 or more, and K ≤ N.
11. An audio-video processing system, comprising: the system comprises an audio and video terminal, audio and video editing equipment, an audio and video server and a cloud server;
the audio and video editing equipment is used for sending audio and video data to be added with subtitles to the cloud server, synthesizing at least one text segment corresponding to the audio and video data returned by the cloud server into subtitle information, and uploading the audio and video data and the subtitle information to the audio and video server;
the audio and video server is used for receiving the audio and video data and the subtitle information sent by the audio and video editing equipment and providing the audio and video data and the subtitle information for the audio and video terminal to synchronously display the subtitle information in the process of playing the audio and video data;
the cloud server is used for segmenting the audio and video data into a plurality of audio and video segments; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; and performing voice recognition on the normalized voice fragments in parallel to obtain at least one text fragment and returning the text fragment to the audio and video editing equipment.
12. An audio-video processing system, comprising: an audio and video terminal and an audio and video server;
the audio and video terminal is used for displaying a playing interface, and a subtitle adding control is arranged on the playing interface; responding to the triggering operation of the user on the caption adding control, requesting caption information corresponding to the currently played audio and video data from the audio and video server, receiving the caption information returned by the audio and video server and synchronously displaying the caption information on a playing interface;
the audio and video server is used for segmenting the currently played audio and video data into a plurality of audio and video segments according to the request; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; and performing voice recognition on the normalized voice fragments in parallel to obtain at least one text fragment, synthesizing the at least one text fragment into caption information, and returning the caption information to the audio/video terminal.
13. An audio-video processing system, comprising: the system comprises an audio and video terminal, an audio and video server and a cloud server;
the audio and video terminal is used for displaying a playing interface, and a subtitle adding control is arranged on the playing interface; responding to the triggering operation of the user on the caption adding control, requesting caption information corresponding to currently played audio and video data from the audio and video server, receiving the caption information sent by the audio and video server and synchronously displaying the caption information on a playing interface;
the audio and video server is used for sending the audio and video data to the cloud server according to the request, synthesizing at least one text segment corresponding to the audio and video data returned by the cloud server into caption information and sending the caption information to the audio and video terminal;
the cloud server is used for segmenting the audio and video data sent by the audio and video server into a plurality of audio and video segments; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; and performing voice recognition on the normalized voice fragments in parallel to obtain at least one text fragment and returning the text fragment to the audio and video server.
14. An audio/video playing method is characterized by comprising the following steps:
displaying a playing interface, wherein a subtitle adding control is arranged on the playing interface;
responding to the triggering operation of the subtitle adding control, and segmenting currently played audio and video data into a plurality of audio and video segments;
performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips;
and performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment, synthesizing the at least one text segment into subtitle information, and then synchronously displaying the subtitle information on the playing interface.
15. An audio-video processing system, comprising: the system comprises an audio and video terminal, an audio and video server and a cloud server;
the audio and video terminal is used for uploading audio and video data to the audio and video server;
the audio and video server is used for sending the audio and video data to the cloud server and receiving at least one text segment corresponding to the audio and video data returned by the cloud server; performing content discovery or security audit on the at least one text fragment;
the cloud server is used for segmenting the audio and video data into a plurality of audio and video segments; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; and performing voice recognition on the normalized voice fragments in parallel to obtain at least one text fragment and returning the text fragment to the audio and video server.
16. The system of claim 15, wherein the audio/video processing system is an online education system, and the audio/video terminal is a teacher terminal or a student terminal;
or,
the audio and video processing system is an online live broadcast system, and the audio and video terminal is a main broadcast terminal;
or
The audio and video processing system is a short video application system, and the audio and video terminal is a short video recording terminal.
17. An audio-video processing system, comprising: an audio and video terminal and an audio and video server;
the audio and video terminal is used for uploading audio and video data to the audio and video server;
the audio and video server is used for segmenting the audio and video data into a plurality of audio and video segments; performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips; performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment; and performing content discovery or security audit on the at least one text fragment.
18. An audio and video uploading method is characterized by comprising the following steps:
responding to the uploading operation of a user, and acquiring audio and video data to be uploaded;
segmenting the audio and video data into a plurality of audio and video segments;
performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips;
performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment;
and uploading the audio and video data to an audio and video server based on the at least one text fragment.
19. The method of claim 18, wherein uploading the audio-video data to an audio-video server based on the at least one text segment comprises:
performing security audit or content discovery on the at least one text segment;
and uploading the audio and video data to the audio and video server under the condition that the at least one text segment passes the security audit or the specified content is found in the at least one text segment.
20. A terminal device, comprising: a memory and a processor;
the memory for storing a computer program;
the processor is coupled with the memory for executing the computer program for:
displaying a playing interface, wherein a subtitle adding control is arranged on the playing interface;
responding to the triggering operation of the subtitle adding control, and segmenting currently played audio and video data into a plurality of audio and video segments;
performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips;
and performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment, synthesizing the at least one text segment into subtitle information, and then synchronously displaying the subtitle information on the playing interface.
21. A terminal device, comprising: a memory and a processor;
the memory for storing a computer program;
the processor is coupled with the memory for executing the computer program for:
responding to the uploading operation of a user, and acquiring audio and video data to be uploaded;
segmenting the audio and video data into a plurality of audio and video segments;
performing voice detection on the plurality of audio and video clips in parallel to obtain a plurality of voice clips, and performing regularization processing on the plurality of voice clips;
performing voice recognition on the normalized voice segments in parallel to obtain at least one text segment;
and uploading the audio and video data to an audio and video server based on the at least one text fragment.
22. A computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to carry out the steps of the method of any one of claims 1-8, 14 and 18-19.
CN202011034174.1A 2020-09-27 2020-09-27 Voice recognition and audio/video processing method, device, system and storage medium Pending CN114360545A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011034174.1A CN114360545A (en) 2020-09-27 2020-09-27 Voice recognition and audio/video processing method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011034174.1A CN114360545A (en) 2020-09-27 2020-09-27 Voice recognition and audio/video processing method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN114360545A true CN114360545A (en) 2022-04-15

Family

ID=81089798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011034174.1A Pending CN114360545A (en) 2020-09-27 2020-09-27 Voice recognition and audio/video processing method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN114360545A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115499683A (en) * 2022-08-05 2022-12-20 北京达佳互联信息技术有限公司 Voice recognition method and device and electronic equipment

Similar Documents

Publication Publication Date Title
US11625920B2 (en) Method for labeling performance segment, video playing method, apparatus and system
KR102085908B1 (en) Content providing server, content providing terminal and content providing method
US11012486B2 (en) Personalized video playback
US20220270632A1 (en) Systems and methods for intelligent playback
CN112399258B (en) Live playback video generation playing method and device, storage medium and electronic equipment
US10645464B2 (en) Eyes free entertainment
CN111050201A (en) Data processing method and device, electronic equipment and storage medium
US11270737B2 (en) Systems and methods for editing a video
CN111405381A (en) Online video playing method, electronic device and computer readable storage medium
WO2018130173A1 (en) Dubbing method, terminal device, server and storage medium
CN113225618A (en) Video editing method and device
US20140277653A1 (en) Automatic Rate Control Based On User Identities
CN114286169A (en) Video generation method, device, terminal, server and storage medium
US11899716B2 (en) Content providing server, content providing terminal, and content providing method
CN114360545A (en) Voice recognition and audio/video processing method, device, system and storage medium
US11729476B2 (en) Reproduction control of scene description
CN113259754A (en) Video generation method and device, electronic equipment and storage medium
CN114341866A (en) Simultaneous interpretation method, device, server and storage medium
CN115460357A (en) Method, device and equipment for intelligently assisting blind person to browse video and storage medium
US20230362452A1 (en) Distributor-side generation of captions based on various visual and non-visual elements in content
US20230362451A1 (en) Generation of closed captions based on various visual and non-visual elements in content
CN113312516B (en) Video processing method and related device
KR102679446B1 (en) Apparatus and method for providing speech video
US11871068B1 (en) Techniques for detecting non-synchronization between audio and video
CN117880569A (en) Data push method, video playing method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination