CN111883135A - Voice transcription method and device and electronic equipment - Google Patents

Voice transcription method and device and electronic equipment Download PDF

Info

Publication number
CN111883135A
Authority
CN
China
Prior art keywords
voice
signal
channel
characters
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010735724.6A
Other languages
Chinese (zh)
Inventor
陈孝良
苏少炜
张国超
常乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202010735724.6A priority Critical patent/CN111883135A/en
Publication of CN111883135A publication Critical patent/CN111883135A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of the disclosure disclose a voice transcription method and apparatus, an electronic device, and a computer-readable storage medium. The voice transcription method comprises the following steps: collecting a sound signal; separating at least one character voice signal from the sound signal; transcribing the at least one character voice signal into text; and displaying the text in at least one display mode. By separating character voice signals from the collected sound and transcribing them into text, the method solves technical problems of prior-art schemes for recording spoken content, such as inconvenient retrieval, instability and inaccuracy.

Description

Voice transcription method and device and electronic equipment
Technical Field
The present disclosure relates to the field of speech recognition, and in particular, to a method and an apparatus for speech transcription, an electronic device, and a computer-readable storage medium.
Background
As a means of human-computer interaction, speech recognition and acquisition technology plays an important role in freeing people's hands. More and more intelligent devices incorporate speech recognition and become a bridge for communication between people and devices, so speech recognition technology is becoming increasingly important.
Voice is the most natural way of communicating, and many occasions rely on it, for example: telephone calls, speeches, medical consultations, court trials, meetings, and so on. The content of such communication often needs to be recorded, typically by means of recorded audio or manual typing. However, audio recordings are inconvenient to retrieve and query and cannot be quickly located to the content of interest; a typist recording spoken content quickly is limited by typing speed, may miss important content when the speaking rate is too high, and is also affected by the typist's condition. Therefore, a scheme for automatically, stably and accurately converting speech into text is needed.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, an embodiment of the present disclosure provides a voice transcription method, including:
collecting a sound signal;
separating at least one character voice signal from the sound signal;
transcribing the at least one character voice signal into text;
and displaying the characters in at least one display mode.
Further, the separating at least one character voice signal from the sound signal includes:
converting the multi-channel sound signal into a plurality of single-channel sound signals;
determining at least one single-channel voice signal from the plurality of single-channel sound signals;
and determining at least one character voice signal according to the at least one single-channel voice signal.
Further, the determining at least one single-channel voice signal from the plurality of single-channel sound signals includes:
and inputting each single-channel sound signal in the plurality of single-channel sound signals into a deep learning model to obtain at least one single-channel sound signal.
Further, the determining at least one character voice signal according to the at least one single-channel voice signal includes:
calculating the similarity between the at least one single-channel voice signal;
recognizing a plurality of single-channel voice signals with similarity higher than a similarity threshold value as the same role voice signal;
and recognizing a plurality of single-channel voice signals with the similarity lower than a similarity threshold value as different character voice signals.
Further, the converting the at least one character voice signal into text includes:
extracting voice characteristics of the at least one character voice signal;
and inputting the voice characteristics into a voice recognition model to obtain characters corresponding to the at least one character voice signal.
Further, the displaying the text in at least one displaying manner includes:
displaying the characters in a real-time display area; and/or
displaying the characters transcribed before the characters in a history display area; and/or
And adding subtitles formed by the characters to the displayed information in the information display area.
Further, the method further comprises:
acquiring a text corresponding to the characters;
inputting the text into a summary generation model to generate a summary of the text.
Further, the method further comprises:
receiving a selection signal of a character voice signal;
and highlighting characters corresponding to the character voice signals.
Further, the method further comprises:
receiving a selection signal of the characters;
and highlighting the character voice signal corresponding to the character.
In a second aspect, an embodiment of the present disclosure provides a voice transcription apparatus, including:
the acquisition module is used for acquiring sound signals;
the voice separation module is used for separating at least one character voice signal from the sound signal;
the transcription module is used for transcribing the at least one character voice signal into characters;
and the display module is used for displaying the characters in at least one display mode.
Further, the voice separation module is further configured to:
converting the multi-channel sound signal into a plurality of single-channel sound signals;
determining at least one single-channel voice signal from the plurality of single-channel sound signals;
and determining at least one character voice signal according to the at least one single-channel voice signal.
Further, the voice separation module is further configured to:
and inputting each single-channel sound signal in the plurality of single-channel sound signals into a deep learning model to obtain at least one single-channel sound signal.
Further, the voice separation module is further configured to:
calculating the similarity between the at least one single-channel voice signal;
recognizing a plurality of single-channel voice signals with similarity higher than a similarity threshold value as the same role voice signal;
and recognizing a plurality of single-channel voice signals with the similarity lower than a similarity threshold value as different character voice signals.
Further, the transcription module is further configured to:
extracting voice characteristics of the at least one character voice signal;
and inputting the voice characteristics into a voice recognition model to obtain characters corresponding to the at least one character voice signal.
Further, the display module is further configured to:
displaying the characters in a real-time display area; and/or
displaying the characters transcribed before the characters in a history display area; and/or
And adding subtitles formed by the characters to the displayed information in the information display area.
Further, the voice transcription apparatus further includes:
the abstract generating module is used for acquiring a text corresponding to the characters; inputting the text into a summary generation model to generate a summary of the text.
Further, the voice transcription apparatus further includes:
the first proofreading module is used for receiving a selection signal of a character voice signal; and highlighting the characters corresponding to the character voice signal.
Further, the voice transcription apparatus further includes:
the second proofreading module is used for receiving the selection signal of the characters; and highlighting the character voice signal corresponding to the character.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the preceding first aspects.
In a fourth aspect, the present disclosure provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions for causing a computer to execute the method of any one of the foregoing first aspects.
The embodiments of the disclosure disclose a voice transcription method and apparatus, an electronic device, and a computer-readable storage medium. The voice transcription method comprises the following steps: collecting a sound signal; separating at least one character voice signal from the sound signal; transcribing the at least one character voice signal into text; and displaying the text in at least one display mode. By separating character voice signals from the collected sound and transcribing them into text, the method solves technical problems of prior-art schemes for recording spoken content, such as inconvenient retrieval, instability and inaccuracy.
The foregoing is a summary of the present disclosure, provided to promote a clear understanding of its technical means; the present disclosure may be embodied in other specific forms without departing from its spirit or essential attributes.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic view of an application scenario of an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of a voice transcription method provided in the embodiment of the present disclosure;
fig. 3 is a schematic diagram of a specific implementation manner of separating a character voice signal in a voice transcription method according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a further implementation of separating a character voice signal in a voice transcription method according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an embodiment of a voice transcription apparatus provided in the embodiment of the present disclosure;
Fig. 6 is a schematic structural diagram of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a schematic view of an application scenario of the embodiment of the present disclosure. Fig. 1 shows a scenario to which the method according to the embodiment of the disclosure is applied. The scenario includes a user 101, a terminal device 102 and a server 103, where the terminal device 102 includes a display device 1021 and a sound collection device 1022. The user 101, as a sound source, generates a sound signal, for example by saying "hello art". The terminal device 102 collects the user's sound signal through the sound collection device 1022, pre-processes it, and sends the pre-processed sound signal to the server 103 through a communication link 104. The server 103 runs a speech recognition algorithm, recognizes the received sound signal as corresponding text, and returns the recognized text to the terminal device 102 through the communication link 104, and the terminal device 102 displays the text through the display device 1021. It can be understood that the functions of the terminal device 102 and the server 103 may be disposed in the same device, that is, the device may recognize voice offline, and details are not described herein.
Fig. 2 is a flowchart of an embodiment of a voice transcription method provided in this disclosure, where the voice transcription method provided in this embodiment may be executed by a voice transcription apparatus, the voice transcription apparatus may be implemented as software, or implemented as a combination of software and hardware, and the voice transcription apparatus may be integrated in a certain device in a voice transcription system, such as a voice transcription server or a voice transcription terminal device. As shown in fig. 2, the method comprises the steps of:
step S201, collecting sound signals;
in this step, sound in the environment is collected by a sound collection device, where the sound collection device may be a microphone built in the terminal device, or a microphone external to the terminal device, or a comprehensive collection device composed of a microphone and other processing devices or apparatuses.
Optionally, the sound collection device may be implemented as a single microphone, and the implementation manner is simple in structure and low in cost, and may utilize existing equipment, such as a microphone of the terminal device itself or an external microphone.
Alternatively, the sound collection device may be implemented as a microphone array, which is composed of a certain number of microphones and is used to sample the spatial characteristics of the sound field. The microphone array has strong anti-noise capability, and the far-field sound collection is more accurate.
Optionally, the sound collection device may be implemented as a sound card with multiple microphones, that is, multiple individual microphones connected to one sound card, so that each user can use a separate microphone. The sound sources are then naturally separated, which is more advantageous for the subsequent transcription.
Optionally, the sound collection device may be implemented as a combination of the above-mentioned several implementation manners, that is, the collection devices implemented in the above-mentioned manner are connected together through a dedicated device or a network. The organization mode is most flexible and can be applied to various application scenes.
It can be understood that the specific implementation manner of the sound collection device is merely an example, and does not limit the present disclosure, and other implementation manners may also be used in the technical solution of the present disclosure, and are not described herein again.
After the sound signal is collected, the sound signal may be further preprocessed, that is, sound enhanced, for more accurate subsequent processing. Optionally, the preprocessing includes one or more of noise reduction processing, dereverberation processing, echo cancellation processing, automatic gain control processing, and beamforming processing.
The noise reduction processing may be implemented using various noise reduction algorithms, for example, spectral subtraction, adaptive filtering, Wiener filtering, deep-neural-network-based methods, and the like. Taking spectral subtraction as an example, spectral subtraction subtracts the spectrum of the noise from the spectrum of the noisy signal. It is based on a simple assumption: if the noise in the sound signal is only additive noise, a clean sound can be obtained by subtracting the noise spectrum from the spectrum of the noisy sound. A typical sound signal contains a number of consecutive silence frames, that is, periods in which only noise is present, so the noise spectrum can be estimated from these silence frames, and the estimated noise spectrum can then be used to recover a clean sound signal.
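For illustration only, the following Python sketch shows one possible spectral-subtraction routine along the lines described above; the frame length, hop size, number of leading silence frames and flooring factor are assumptions chosen for the example, not values prescribed by the disclosure.

```python
# A minimal spectral-subtraction sketch (illustrative assumptions throughout).
import numpy as np


def spectral_subtraction(noisy, noise_frames=10, frame_len=512, hop=256, floor=0.002):
    """Estimate the noise spectrum from leading silence frames and subtract it
    from every frame of the noisy signal, then rebuild by overlap-add."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)
    magnitude, phase = np.abs(spectra), np.angle(spectra)

    # Assume the first few frames contain only noise (silence frames).
    noise_mag = magnitude[:noise_frames].mean(axis=0)

    # Subtract the noise magnitude; floor the result to avoid negative values.
    clean_mag = np.maximum(magnitude - noise_mag, floor * noise_mag)

    # Reconstruct with the original phase (overlap-add with window normalisation).
    clean_frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for i in range(n_frames):
        out[i * hop:i * hop + frame_len] += clean_frames[i] * window
        norm[i * hop:i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)
```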
When a sound source produces sound, reflection and delay cause the same sound to reach the human ear, and in particular a sound collection device, multiple times with different intensities. If the time delay between a reflected sound and the original sound is between about 10 ms and 30 ms, the sound is reinforced, a phenomenon known as reverberation. Dereverberation algorithms are generally divided into reverberation cancellation and reverberation suppression. Reverberation cancellation estimates the channel path and deconvolves the observed signal to recover the original signal, whereas reverberation suppression estimates the direct-to-reverberant ratio directly from the observed signal. For example, reverberation suppression can model the energy decay from the direct signal to the reverberation tail using the phonemes of the leading part of the sound signal and the reverberation time, so that the reverberant energy of that part is estimated and suppressed from the reverberant signal.
An echo is also a reflected sound, except that its delay is long relative to reverberation. Echo cancellation uses an adaptive filter to identify the parameters of the unknown echo channel: based on the correlation between the sound signal and the generated multi-channel echo, a far-end signal model is established to simulate the echo path, the adaptive algorithm adjusts the filter so that its impulse response approximates the real echo path, and the estimated echo is then subtracted from the sound signal, thereby cancelling the echo. Echo cancellation algorithms are numerous and are not listed here.
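As one possible adaptive-filter choice, the sketch below uses a normalized-LMS (NLMS) update; the filter length and step size are assumptions for the example, and the disclosure does not prescribe a particular adaptive algorithm.

```python
# A minimal NLMS echo-cancellation sketch (illustrative parameters).
import numpy as np


def nlms_echo_cancel(far_end, mic, taps=256, mu=0.5, eps=1e-8):
    """Adapt an FIR estimate of the echo path driven by the far-end signal and
    subtract the simulated echo from the microphone signal."""
    w = np.zeros(taps)         # estimated echo-path impulse response
    buf = np.zeros(taps)       # most recent far-end samples, newest first
    out = np.zeros(len(mic))   # echo-cancelled (near-end) signal
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        echo_hat = w @ buf                         # simulated echo sample
        e = mic[n] - echo_hat                      # residual: near-end speech + error
        w += mu * e * buf / (buf @ buf + eps)      # NLMS weight update
        out[n] = e
    return out
```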
Automatic gain control is used to enhance the loudness of the sound. Generally, automatic gain control first determines a loudness gain factor; the loudness gain factor is then mapped onto an equal-loudness curve to determine the final gain weight for each frequency; finally, each frequency is amplified by its final gain weight to obtain the gained sound signal.
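The following is a much-simplified sketch of the first stage only: it derives a single loudness gain factor per frame from a target level and applies it with smoothing. The per-frequency weighting via an equal-loudness curve described above is omitted, and the target level and smoothing constant are assumptions.

```python
# A simplified automatic-gain-control sketch (frame-level gain only).
import numpy as np


def simple_agc(signal, frame_len=1024, target_rms=0.1, smooth=0.9, eps=1e-8):
    out = np.array(signal, dtype=float)
    gain = 1.0
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = out[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2)) + eps
        desired = target_rms / rms                      # loudness gain factor
        gain = smooth * gain + (1 - smooth) * desired   # smooth to avoid abrupt jumps
        out[start:start + frame_len] = frame * gain
    return out
```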
Beamforming is used to preserve sound signals from a desired direction and suppress sound signals from undesired directions. Beamforming methods mainly fall into three categories: fixed beamforming, adaptive beamforming, and post-filtering algorithms. Fixed beamforming is suitable for a stable, unchanging noise environment; it can suppress a sound source in a certain direction with a constant suppression strength. Adaptive beamforming uses the output signal to adaptively adjust the filter weight coefficients, so its suppression performance adapts to changes in the environmental signal. Post-filtering algorithms are designed to further process the residual noise of fixed and adaptive beamforming structures; they can effectively compensate for the shortcomings of the preceding structures and remove residual coherent and incoherent noise.
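As an example of the fixed category, the sketch below implements a frequency-domain delay-and-sum beamformer for a uniform linear array; the microphone spacing, steering angle and speed of sound are assumptions chosen for illustration.

```python
# A minimal delay-and-sum (fixed) beamformer sketch for a uniform linear array.
import numpy as np


def delay_and_sum(channels, sample_rate, mic_spacing=0.05,
                  angle_deg=0.0, sound_speed=343.0):
    """channels: (n_mics, n_samples) array of simultaneously recorded signals.
    Aligns each microphone toward the desired direction and averages, which
    preserves that direction and attenuates others."""
    n_mics, n_samples = channels.shape
    angle = np.deg2rad(angle_deg)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / sample_rate)
    spectra = np.fft.rfft(channels, axis=1)
    out = np.zeros(len(freqs), dtype=complex)
    for m in range(n_mics):
        delay = m * mic_spacing * np.sin(angle) / sound_speed    # seconds
        out += spectra[m] * np.exp(2j * np.pi * freqs * delay)   # undo the propagation delay
    return np.fft.irfft(out / n_mics, n=n_samples)
```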
It should be understood that the foregoing pre-processing of the sound signal is only an example, and does not limit the present disclosure, and any other pre-processing method that can enhance the sound signal for subsequent processing may be applied to the present disclosure, and will not be described herein again.
Step S202, separating at least one character voice signal from the sound signal;
When a voice signal is transcribed, different users need to be distinguished so that the words spoken by each user can be marked. In this step, the voices of different users are separated from the sound signal to obtain the character voice signals of the different users.
In step S202, the manner of separation differs according to the sound collection device. Optionally, the character voice signals may be separated from the sound signal in a hardware manner. For example, when the collection device consists of multiple microphones and each user communicates from a different location over a network using a separate microphone, the collected sound signals are independent of each other, and different character voice signals can be separated simply according to the sound signals of the different channels. In a multi-user scenario in the same place, each user is likewise given a separate microphone, and the microphones are connected to the processing system through a sound card. However, since microphones in the same location pick up the voices of multiple users, the sound signal in each channel may include the voice signals of several users and therefore needs further separation.
Optionally, for the case that the above-mentioned one channel includes a plurality of voice signals, as shown in fig. 3, the step S202 further includes:
step S301, converting a multi-channel sound signal into a plurality of single-channel sound signals;
step S302, determining at least one single-channel voice signal from the plurality of single-channel sound signals;
step S303, determining at least one character voice signal according to the at least one single-channel voice signal.
In step S301, the multi-channel sound signal consists of the sound signals in each channel. Usually one microphone corresponds to one channel, and each microphone collects the sound signal of one channel, yielding sound signals of multiple channels. Illustratively, the number of channels is greater than or equal to the number of users, that is, the number of microphones is greater than or equal to the number of users; let the number of channels be n. In this step, each channel of the multi-channel sound signal is extracted as one single-channel sound signal to obtain a plurality of single-channel sound signals. For example, with n microphones the system includes n sound channels; each channel collects a sound signal, and each of these is extracted as a single-channel sound signal, so that n single-channel sound signals are obtained.
In step S302, at least one single-channel voice signal is determined from the plurality of single-channel sound signals; that is, this step determines which single-channel sound signals include a voice signal. Optionally, the step S302 further includes: inputting each single-channel sound signal of the plurality of single-channel sound signals into a deep learning model to obtain at least one single-channel voice signal.
Optionally, the deep learning model is a DNN model. Before being input into the DNN model, each single-channel sound signal undergoes feature extraction. The features are frame features of the sound signal, that is, the sound signal is first divided into frames and features are then extracted for each frame: the sound signal is divided in time order into a plurality of sound frames, each containing the sound-signal data of a preset, usually very short, time period, for example 0.05 second or 0.1 second. After the sound frames are obtained, features are extracted from them; illustratively, the features include MFCC (Mel-frequency cepstral coefficients) features or FBank (filter bank) features. It can be understood that different models may use different feature-extraction methods, which is not limited in this disclosure and is not described here again. After the features of the sound signal are obtained, they are input into the DNN model: feature vectors are obtained through the hidden layers of the DNN model, and the output layer finally yields the posterior probability that the sound signal is a voice signal. Since the sound signal has been divided into frames, the posterior probability that each frame is a voice frame can be judged, so the endpoints of the voice signal can be detected; the sound signal between two endpoints is a voice signal. Let the posterior probability of channel i be Si, with 0 ≤ i ≤ n, and let the threshold be t; when Si ≥ t, the i-th channel is determined to contain a voice signal, and when Si < t, it is determined not to contain one. In this way all channels whose posterior probability reaches the threshold can be counted; let their number be m.
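A hedged sketch of this per-channel decision follows. Here `extract_fbank` and `vad_model` are hypothetical placeholders for a real feature extractor and a trained DNN that outputs per-frame speech posteriors, and the threshold value is an assumption.

```python
# Sketch: select the channels whose speech posterior Si reaches the threshold t.
import numpy as np


def select_speech_channels(channel_signals, extract_fbank, vad_model, t=0.5):
    """Return (kept_indices, posteriors) for channels judged to contain speech."""
    posteriors = []
    for signal in channel_signals:                     # one single-channel signal each
        feats = extract_fbank(signal)                  # (n_frames, n_feat) frame features
        frame_post = vad_model(feats)                  # (n_frames,) per-frame speech posteriors
        posteriors.append(float(np.mean(frame_post)))  # channel-level score Si
    kept = [i for i, s in enumerate(posteriors) if s >= t]
    return kept, posteriors
```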
In step S303, it is determined how many character voice signals are contained in the at least one single-channel voice signal. When a single user speaks, the signals collected by all channels are the voice signals of that user and differ only in strength, so the voice signals in these channels can be regarded as one character voice signal; when two users speak simultaneously, the weight each user occupies in each channel is different, so a plurality of character voice signals can be distinguished.
Optionally, the step S303 further includes:
step S401, calculating the similarity between the at least one single-channel voice signal;
step S402, recognizing a plurality of single-channel voice signals with similarity higher than a similarity threshold value as voice signals of the same role;
in step S403, a plurality of single-channel speech signals with similarity lower than a similarity threshold are recognized as different character speech signals.
In step S401, the similarity between the single-channel voice signals is calculated. Taking the specific example in the above embodiment, m single-channel voice signals are obtained through the DNN model, and the m channels are sorted in descending order of posterior probability, that is, the channel with the highest posterior probability is placed first. The m channels are numbered by j; a channel that includes a voice signal is denoted mj, and m0 is the channel with the maximum posterior probability. The similarity between the voice signal of channel m0 and the voice signal of each other channel is calculated from j = 1 to j = m-1. The similarity may be calculated using the feature vectors output by the hidden layer of the DNN: the cosine distance between two feature vectors serves as their similarity. A similarity threshold r is set. When the similarity between two channels is greater than the threshold r, the two channels are confirmed to contain the voice signal of the same user, and the channel with the larger number is eliminated; after multiple traversals in this way, the voice signal of each user is kept in only one channel. When the similarity between a certain channel and every other channel is smaller than the similarity threshold, the voice signal of that channel is determined to be the voice signal of another user, that is, several users are speaking in the same time period, and the voice signal of that channel is retained. Through this method, the voice signals in the retained channels are the character voice signals; thereby, at least one character voice signal can be separated from the sound signal.
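The channel-grouping rule can be sketched as below: channels are ranked by posterior, and a channel is kept only if it is not too similar to any already-kept channel. The per-channel embedding function stands in for the DNN hidden-layer output, and the threshold value is an assumption.

```python
# Sketch: keep one channel per character based on cosine similarity of embeddings.
import numpy as np


def cosine(a, b, eps=1e-8):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def group_character_channels(embeddings, posteriors, r=0.8):
    """embeddings: per-channel feature vectors; posteriors: per-channel speech scores.
    Returns the indices of channels kept as distinct character voice signals."""
    order = sorted(range(len(embeddings)), key=lambda i: -posteriors[i])
    kept = []
    for idx in order:
        # Discard the channel if it duplicates a speaker already represented.
        if all(cosine(embeddings[idx], embeddings[k]) <= r for k in kept):
            kept.append(idx)
    return kept
```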
Optionally, step S202 may also separate the character voice signals from the sound signal based on voiceprints. Voiceprint-based separation takes two forms. In the first, each user registers his or her voiceprint in the system in advance; when the character voice signals are separated, the voiceprint features of the voice signal are extracted directly, and their similarity to the registered voiceprint features is calculated. If the similarity exceeds a threshold, the voice is considered to belong to the same person, so the character voice signals of different users can be separated from the sound signal. In the second form, no voiceprint needs to be registered in advance; instead, the voice features of the voice signals of different channels are extracted and classified by a clustering algorithm, and when several voice signals are classified as belonging to the same user, the voice signals of the corresponding channels are confirmed as the character voice signal of that user. In this way, the character voice signals can be separated from the sound signal through voiceprint features.
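A hedged sketch of the unregistered-voiceprint variant is given below: an embedding is extracted per channel and the embeddings are clustered, with channels in one cluster treated as the same speaker. The embedding extractor is a hypothetical placeholder, and scikit-learn's agglomerative clustering is one possible clustering choice rather than one named by the disclosure.

```python
# Sketch: group channels by speaker via clustering of voiceprint embeddings.
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def cluster_by_voiceprint(channel_signals, extract_embedding, distance_threshold=0.6):
    embs = np.stack([extract_embedding(sig) for sig in channel_signals])
    # Normalise so Euclidean distance behaves like a cosine-style distance.
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    ).fit_predict(embs)
    return labels   # channels sharing a label are treated as one character
```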
Optionally, the step S202 may also be based on blind source separation, which is a method for separating or recovering unknown signal sources from observed signals. Each microphone receives a mixed sound signal that is a mixture of a plurality of sound sources. Illustratively, the number of microphones is equal to the number of sound sources. The mixed sound signal may be represented in the time domain by the following formula (1):
x_j(t) = \sum_{i=1}^{n} a_{ji} \, s_i(t), \quad j = 1, 2, \dots, n    (1)
wherein x_j(t) represents the mixed sound signal received by the j-th microphone at time t; a_{ji} is a weighting coefficient determined by the impulse response function; s_i(t) is the i-th sound source signal; and n is the number of microphones and of sound sources. The plurality of mixed sound signals is then multiplied by a preset de-mixing matrix to obtain at least one character voice signal, where each character voice signal is the sum of the products of the plurality of mixed sound signals and the corresponding de-mixing coefficients in the de-mixing matrix.
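In matrix form the mixtures are X = A S, and the character signals are recovered as Y = W X for some de-mixing matrix W. The sketch below estimates W with FastICA as one possible choice; the disclosure itself only requires that some preset or estimated de-mixing matrix be applied.

```python
# Sketch: blind source separation of the instantaneous mixing model of formula (1).
import numpy as np
from sklearn.decomposition import FastICA


def separate_sources(mixtures):
    """mixtures: (n_mics, n_samples) array, one row per microphone signal x_j(t).
    Returns (sources, unmixing) where sources[i] approximates s_i(t)."""
    ica = FastICA(n_components=mixtures.shape[0], random_state=0)
    sources = ica.fit_transform(mixtures.T).T    # each row ~ one character voice signal
    return sources, ica.components_              # components_ acts as the de-mixing matrix
```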
It is understood that the above-mentioned manners of extracting the character voice signals from the sound signal are only examples, and any other separation manner may be used to separate the voice signals of different users from the sound signal when implementing step S202 of the present disclosure.
Step S203, transcribing the at least one character voice signal into characters;
in this step, the character voice signal is converted into corresponding text through voice recognition.
Optionally, the step S203 further includes: extracting voice features of the at least one character voice signal; and inputting the voice features into a voice recognition model to obtain the characters corresponding to the at least one character voice signal. That is, feature extraction is performed on each character voice signal separated in step S202 to form voice features that can be input into the voice recognition model; illustratively, the feature extraction here is the same as that in step S302 and is not described again. Illustratively, the voice recognition model is an acoustic model, or a hybrid of an acoustic model and a language model; it receives the features of the voice and outputs the characters corresponding to the voice. Typical voice recognition models include GMM-HMM, DNN-HMM, and models with an encoder-attention-decoder structure, which are not enumerated here. The voice recognition model may run on the server: after the sound signal is collected by the sound collection device, the sound signal, or the voice features obtained by processing it, is sent to the server, and the voice is recognized by the server's voice recognition model and transcribed into characters.
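An end-to-end sketch of this step follows. librosa is used here as one possible MFCC extractor, and `recognizer` is a hypothetical stand-in for whatever trained acoustic/language model (GMM-HMM, DNN-HMM, attention encoder-decoder, ...) is actually deployed.

```python
# Sketch: extract features per character signal and pass them to a recognition model.
import librosa


def transcribe_character_signals(character_signals, sample_rate, recognizer):
    """Returns a list of (character_id, text) pairs, one per separated signal."""
    results = []
    for character_id, signal in enumerate(character_signals):
        feats = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13).T  # (frames, 13)
        text = recognizer(feats)    # hypothetical model call: features -> characters
        results.append((character_id, text))
    return results
```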
Step S204, displaying the characters in at least one display mode.
And after the characters corresponding to the character voice signals are obtained, displaying the characters on a display device in a certain display mode so that a user can check the transcription result.
Optionally, the step S204 further includes:
displaying the characters in a real-time display area; and/or
displaying the characters transcribed before the characters in a history display area; and/or
And adding subtitles formed by the characters to the displayed information in the information display area.
The three display modes can respectively correspond to different application scenes or can be mixed in certain application scenes for use.
The display device may include a real-time display area, that is, an area on the display device for displaying the characters corresponding to the real-time character voice signal. Optionally, the real-time display area also indicates the user corresponding to each character voice signal, that is, which user said what. The display device may also include a history display area for displaying the characters corresponding to character voice signals earlier than the current one; optionally, when the characters corresponding to a new character voice signal are received, the characters currently shown in the real-time display area are moved into the history display area. In another presentation mode, the display device includes an information display area, which illustratively presents slides, pictures and the like when the user is a commentator or a speaker; subtitles of what the commentator or speaker says can be added to the information displayed in the information display area, and the subtitles may be placed at any position near the information display area.
Through steps S201 to S204, the received sound signals of multiple users can be distinguished by character, and each character voice signal is transcribed into text and displayed in a certain display mode.
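The overall flow can be summarised by the sketch below; every helper named here (collect_sound, separate_characters, transcribe, display) is a hypothetical placeholder for the corresponding step described above.

```python
# Sketch tying steps S201-S204 together.
def voice_transcription_pipeline(collect_sound, separate_characters, transcribe, display):
    sound = collect_sound()                                  # S201: collect the sound signal
    character_signals = separate_characters(sound)           # S202: separate character voice signals
    texts = [transcribe(sig) for sig in character_signals]   # S203: transcribe each into text
    display(texts)                                           # S204: display in at least one mode
    return texts
```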
In some situations, the number of characters transcribed from the voice signal is large; for example, in a meeting or a negotiation, the text corresponding to the final transcription is long, and a user would need to spend a lot of time reading it to understand the intended meaning. To solve this problem, optionally, the method further comprises: acquiring a text corresponding to the characters; and inputting the text into a summary generation model to generate a summary of the text. Optionally, the summary generation model is a pre-trained text-to-text conversion model, for example an encoder-decoder structure composed of RNN, GRU or LSTM units. The acquired text is converted into input vectors; the encoder composed of RNN, GRU or LSTM units encodes the input vectors to obtain hidden states; attention information is obtained by calculating a correlation matrix; the decoder composed of RNN, GRU or LSTM units decodes them; and a softmax layer outputs the text at each time step, so that the outputs at all time steps form another text corresponding to the input text. With this structure, the summary generation model is trained on a training data set consisting of texts and their corresponding summaries, so that the parameters of the summary generation model are adjusted to output the summary of an input text. Through these steps, a summary of the meaning expressed by all the characters in one conversation is obtained, which is convenient for the user for subsequent information retrieval, information confirmation and the like.
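A compact PyTorch sketch of such an encoder-decoder with dot-product attention and a softmax output layer is shown below; vocabulary handling, the training loop and beam search are omitted, and all sizes are illustrative assumptions rather than values from the disclosure.

```python
# Sketch: GRU encoder, dot-product attention, GRU decoder with softmax output.
import torch
import torch.nn as nn


class Seq2SeqSummarizer(nn.Module):
    def __init__(self, vocab_size, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encode the input text into hidden states.
        enc_out, enc_h = self.encoder(self.embed(src_ids))          # (B, S, H)
        dec_h = enc_h
        logits = []
        for t in range(tgt_ids.size(1)):
            tok = self.embed(tgt_ids[:, t:t + 1])                   # (B, 1, E)
            # Dot-product attention over the encoder states (the correlation matrix).
            scores = torch.bmm(dec_h.transpose(0, 1), enc_out.transpose(1, 2))  # (B, 1, S)
            attn = torch.softmax(scores, dim=-1)
            context = torch.bmm(attn, enc_out)                      # (B, 1, H)
            dec_out, dec_h = self.decoder(torch.cat([tok, context], dim=-1), dec_h)
            logits.append(self.out(dec_out))                        # softmax applied in the loss
        return torch.cat(logits, dim=1)                             # (B, T, vocab)
```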
Since automatic transcription has a certain transcription error rate, in one embodiment, the present disclosure also provides a step of error correction. Optionally, the voice transcription method further includes: receiving a selection signal of a character voice signal;
and highlighting the characters corresponding to the character voice signal. Optionally, the character voice signal is displayed as a voice waveform. When proofreading, a user may select the character voice signal of a certain time period through a human-computer interaction interface such as a mouse, a keyboard or a touch screen; the characters corresponding to the selected period of the character voice signal are then highlighted on the display device, for example by highlighting them or filling them with a color. The user can then judge whether the corresponding characters are correct by listening to the sound of that character voice signal; if not, a modification instruction is received through the human-computer interaction interface to modify the characters.
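A small sketch of this lookup step follows: given the time range selected on the waveform, the transcript segments overlapping it are found so they can be highlighted. The (start, end, text) segment structure is an assumed representation, not one mandated by the disclosure.

```python
# Sketch: map a selected waveform time range to the transcript segments to highlight.
def segments_to_highlight(segments, sel_start, sel_end):
    """segments: list of (start_sec, end_sec, text) produced during transcription."""
    return [text for start, end, text in segments
            if start < sel_end and end > sel_start]   # keep any segment overlapping the selection
```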
As another implementation, the voice transcription method optionally further includes: receiving a selection signal of the characters; and highlighting the character voice signal corresponding to the characters. Optionally, the character voice signal is displayed as a voice waveform. When proofreading, a user may select a segment of characters through a human-computer interaction interface such as a mouse, a keyboard or a touch screen; the waveform of the character voice corresponding to the selected characters is then highlighted on the display device, for example by highlighting or coloring the waveform. The user can then judge whether the corresponding characters are correct by listening to the sound of that waveform; if not, a modification instruction is received through the human-computer interaction interface to modify the characters.
The embodiments of the disclosure disclose a voice transcription method comprising the following steps: collecting a sound signal; separating at least one character voice signal from the sound signal; transcribing the at least one character voice signal into text; and displaying the text in at least one display mode. By separating character voice signals from the collected sound and transcribing them into text, the method solves technical problems of prior-art schemes for recording spoken content, such as inconvenient retrieval, instability and inaccuracy.
Although the steps in the above method embodiments are described in the above order, it should be clear to those skilled in the art that the steps of the embodiments of the present disclosure are not necessarily performed in that order and may also be performed in other orders, such as reversed, in parallel or interleaved. Moreover, on the basis of the above steps, those skilled in the art may add other steps, and these obvious modifications or equivalents should also fall within the protection scope of the present disclosure; they are not described here again.
Fig. 5 is a schematic structural diagram of an embodiment of a speech transcription apparatus provided in an embodiment of the present disclosure, and as shown in fig. 5, the apparatus 500 includes: a collection module 501, a voice separation module 502, a transcription module 503, and a presentation module 504. Wherein the content of the first and second substances,
the acquisition module 501 is used for acquiring sound signals;
a voice separation module 502 for separating at least one character voice signal from the sound signal;
a transcription module 503, configured to transcribe the at least one character voice signal into text;
a display module 504, configured to display the text in at least one display manner.
Further, the voice transcription apparatus 500 further includes:
and the enhancement module is used for carrying out enhancement processing on the sound signal.
Further, the voice separation module 502 is further configured to:
converting the multi-channel sound signal into a plurality of single-channel sound signals;
determining at least one single-channel voice signal from the plurality of single-channel sound signals;
and determining at least one character voice signal according to the at least one single-channel voice signal.
Further, the voice separation module 502 is further configured to:
and inputting each single-channel sound signal in the plurality of single-channel sound signals into a deep learning model to obtain at least one single-channel sound signal.
Further, the voice separation module 502 is further configured to:
calculating the similarity between the at least one single-channel voice signal;
recognizing a plurality of single-channel voice signals with similarity higher than a similarity threshold value as the same role voice signal;
and recognizing a plurality of single-channel voice signals with the similarity lower than a similarity threshold value as different character voice signals.
Further, the transcription module 503 is further configured to:
extracting voice characteristics of the at least one character voice signal;
and inputting the voice characteristics into a voice recognition model to obtain characters corresponding to the at least one character voice signal.
Further, the display module 504 is further configured to:
displaying the characters in a real-time display area; and/or
displaying the characters transcribed before the characters in a history display area; and/or
And adding subtitles formed by the characters to the displayed information in the information display area.
Further, the voice transcription apparatus 500 further includes:
the abstract generating module is used for acquiring a text corresponding to the characters; inputting the text into a summary generation model to generate a summary of the text.
Further, the voice transcription apparatus 500 further includes:
the first proofreading module is used for receiving a selection signal of a character voice signal; and highlighting the characters corresponding to the character voice signal.
Further, the voice transcription apparatus 500 further includes:
the second proofreading module is used for receiving the selection signal of the characters; and highlighting the character voice signal corresponding to the character.
The apparatus shown in fig. 5 can perform the method of the embodiment shown in fig. 2-4, and the detailed description of this embodiment can refer to the related description of the embodiment shown in fig. 2-4. The implementation process and technical effect of the technical solution refer to the descriptions in the embodiments shown in fig. 2 to fig. 4, and are not described herein again.
Referring now to FIG. 6, a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: the voice transcription method described in any of the above embodiments is performed.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or by hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is only of the preferred embodiments of the present disclosure and the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the above features, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with features of similar function disclosed (but not limited to those disclosed) in the present disclosure.

Claims (12)

1. A method of voice transcription, comprising:
collecting a sound signal;
separating at least one character voice signal from the sound signal;
transcribing the at least one character voice signal into text;
and displaying the characters in at least one display mode.
2. The voice transcription method of claim 1, wherein said separating at least one character voice signal from the sound signal comprises:
converting the multi-channel sound signal into a plurality of single-channel sound signals;
determining at least one single-channel voice signal from the plurality of single-channel sound signals;
and determining at least one character voice signal according to the at least one single-channel voice signal.
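As an illustration only, and not as part of the claims, the channel-splitting and channel-selection steps of claim 2 can be sketched in Python/NumPy as follows; the function names and the is_speech classifier are hypothetical stand-ins, since the claim does not prescribe any particular implementation.

import numpy as np

def split_channels(multi_channel: np.ndarray) -> list[np.ndarray]:
    # Split a (samples, channels) multi-channel sound signal into per-channel 1-D signals.
    return [multi_channel[:, ch] for ch in range(multi_channel.shape[1])]

def select_voice_channels(channels: list[np.ndarray], is_speech) -> list[np.ndarray]:
    # Keep only the single-channel signals that a (hypothetical) speech/non-speech
    # classifier flags as containing a voice, as in claims 2 and 3.
    return [sig for sig in channels if is_speech(sig)]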
3. The voice transcription method of claim 2, wherein said determining at least one single-channel voice signal from the plurality of single-channel sound signals comprises:
inputting each single-channel sound signal in the plurality of single-channel sound signals into a deep learning model to obtain the at least one single-channel voice signal.
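Claim 3 leaves the deep learning model unspecified; the PyTorch snippet below is only an assumed placeholder showing one plausible shape of such a speech/non-speech scorer, not the model of the disclosure. A channel would then be kept when the returned score exceeds a chosen cut-off, e.g. 0.5 (also an assumption).

import torch
import torch.nn as nn

class SpeechDetector(nn.Module):
    # Tiny 1-D CNN that scores whether a single-channel sound signal contains speech.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=16), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(16, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, samples); add a channel dimension for Conv1d.
        return self.net(x.unsqueeze(1)).squeeze(-1)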
4. The voice transcription method of claim 2, wherein said determining at least one character voice signal according to the at least one single-channel voice signal comprises:
calculating the similarity between the single-channel voice signals;
recognizing a plurality of single-channel voice signals with a similarity higher than a similarity threshold value as the same character voice signal;
and recognizing a plurality of single-channel voice signals with a similarity lower than the similarity threshold value as different character voice signals.
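The similarity grouping of claim 4 could, for example, be realised as a greedy clustering over speaker embeddings. In the sketch below, how the embeddings are computed and the 0.75 threshold are assumptions, not values given in the disclosure.

import numpy as np

def group_by_speaker(embeddings: list[np.ndarray], threshold: float = 0.75) -> list[int]:
    # Assign a character (role) id to each single-channel voice signal, using
    # cosine similarity between speaker embeddings.
    labels: list[int] = []
    centroids: list[np.ndarray] = []
    for emb in embeddings:
        emb = emb / np.linalg.norm(emb)
        sims = [float(np.dot(emb, c)) for c in centroids]
        if sims and max(sims) >= threshold:
            labels.append(int(np.argmax(sims)))   # similarity high enough: same character
        else:
            centroids.append(emb)                 # similarity too low: new character
            labels.append(len(centroids) - 1)
    return labels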
5. The voice transcription method as claimed in claim 1, wherein said transcribing the at least one character voice signal into text comprises:
extracting voice features of the at least one character voice signal;
and inputting the voice features into a speech recognition model to obtain characters corresponding to the at least one character voice signal.
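A minimal sketch of claim 5, assuming MFCCs (via librosa) as the voice features and a generic asr_model object with a decode method as the speech recognition model; neither the feature type nor the model interface is specified by the disclosure.

import numpy as np
import librosa  # assumed available for feature extraction

def transcribe_character_signal(signal: np.ndarray, sr: int, asr_model) -> str:
    # Extract acoustic features from one character voice signal and decode them
    # with a speech recognition model (hypothetical API).
    features = librosa.feature.mfcc(y=signal.astype(np.float32), sr=sr, n_mfcc=13)
    return asr_model.decode(features)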
6. The voice transcription method of claim 1, wherein said displaying the characters in at least one display mode comprises:
displaying the characters in a real-time display area; and/or
displaying characters transcribed before the characters in a history display area; and/or
adding subtitles formed by the characters to information displayed in an information display area.
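Purely as an illustration of the three display modes named in claim 6, a container for the real-time area, the history area and the subtitles might look like the following; all names are invented for the sketch.

from dataclasses import dataclass, field

@dataclass
class TranscriptView:
    realtime_area: str = ""                                  # newest transcribed characters
    history_area: list[str] = field(default_factory=list)    # characters transcribed earlier
    subtitles: list[str] = field(default_factory=list)       # captions added to displayed information

    def show(self, text: str) -> None:
        # The real-time area always holds the newest text; the text it replaces
        # moves into the history area, and a subtitle line is also produced.
        if self.realtime_area:
            self.history_area.append(self.realtime_area)
        self.realtime_area = text
        self.subtitles.append(text)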
7. The voice transcription method as claimed in claim 1, wherein the method further comprises:
acquiring a text corresponding to the characters;
inputting the text into a summary generation model to generate a summary of the text.
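One possible, but by no means the only, realisation of the summary generation model in claim 7 is an off-the-shelf summarisation pipeline such as the one in Hugging Face transformers; the library, the default model and the length limits below are assumptions outside the disclosure.

from transformers import pipeline  # assumed installed, with a default summarisation model

def summarize_transcript(text: str) -> str:
    # Feed the accumulated transcript text to a summary generation model.
    summarizer = pipeline("summarization")
    result = summarizer(text, max_length=120, min_length=30, do_sample=False)
    return result[0]["summary_text"]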
8. The voice transcription method as claimed in claim 1, wherein the method further comprises:
receiving a selection signal of a character voice signal;
and highlighting the characters corresponding to the character voice signal.
9. The voice transcription method as claimed in claim 1, wherein the method further comprises:
receiving a selection signal of the characters;
and highlighting the character voice signal corresponding to the characters.
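Claims 8 and 9 describe a two-way link between a character voice signal and its transcribed characters. A minimal, hypothetical index supporting both lookup directions could be:

class CharacterTextIndex:
    def __init__(self):
        self.text_by_character: dict[str, list[str]] = {}

    def add(self, character_id: str, text: str) -> None:
        # Record which character voice signal produced which characters.
        self.text_by_character.setdefault(character_id, []).append(text)

    def texts_for_character(self, character_id: str) -> list[str]:
        # Claim 8: selecting a character voice signal highlights its characters.
        return self.text_by_character.get(character_id, [])

    def characters_for_text(self, text: str) -> list[str]:
        # Claim 9: selecting characters highlights the corresponding character voice signal.
        return [c for c, texts in self.text_by_character.items() if text in texts]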
10. A speech transcription device, comprising:
the acquisition module is used for acquiring sound signals;
the voice separation module is used for separating at least one character voice signal from the sound signal;
the transcription module is used for transcribing the at least one character voice signal into characters;
and the display module is used for displaying the characters in at least one display mode.
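For illustration, the four modules of claim 10 map naturally onto four collaborating objects; every method name below is an assumption rather than an interface defined in the disclosure.

class VoiceTranscriber:
    def __init__(self, collector, separator, transcriber, presenter):
        self.collector = collector      # acquisition module: collects sound signals
        self.separator = separator      # voice separation module: yields (character_id, voice_signal) pairs
        self.transcriber = transcriber  # transcription module: voice signal -> characters
        self.presenter = presenter      # display module: shows characters in a display mode

    def run_once(self) -> None:
        sound = self.collector.record()
        for character_id, voice in self.separator.separate(sound):
            text = self.transcriber.transcribe(voice)
            self.presenter.show(character_id, text)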
11. An electronic device, comprising: a memory for storing computer readable instructions; and
a processor for executing the computer readable instructions, such that the processor, when executing the instructions, implements the voice transcription method of any one of claims 1-9.
12. A non-transitory computer readable storage medium storing computer readable instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 1-9.
CN202010735724.6A 2020-07-28 2020-07-28 Voice transcription method and device and electronic equipment Pending CN111883135A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010735724.6A CN111883135A (en) 2020-07-28 2020-07-28 Voice transcription method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN111883135A true CN111883135A (en) 2020-11-03

Family

ID=73200836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010735724.6A Pending CN111883135A (en) 2020-07-28 2020-07-28 Voice transcription method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111883135A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100055168A (en) * 2008-11-17 2010-05-26 한국전자통신연구원 Speaker recognition method in multi-channel speaker recognition system
US20110038423A1 (en) * 2009-08-12 2011-02-17 Samsung Electronics Co., Ltd. Method and apparatus for encoding/decoding multi-channel audio signal by using semantic information
CN108630193A (en) * 2017-03-21 2018-10-09 北京嘀嘀无限科技发展有限公司 Audio recognition method and device
CN110047478A (en) * 2018-01-16 2019-07-23 中国科学院声学研究所 Multicenter voice based on space characteristics compensation identifies Acoustic Modeling method and device
CN108416565A (en) * 2018-01-25 2018-08-17 北京云知声信息技术有限公司 Minutes method
CN108564952A (en) * 2018-03-12 2018-09-21 新华智云科技有限公司 The method and apparatus of speech roles separation
CN110858476A (en) * 2018-08-24 2020-03-03 北京紫冬认知科技有限公司 Sound collection method and device based on microphone array
CN110867178A (en) * 2018-08-28 2020-03-06 中国科学院声学研究所 Multi-channel far-field speech recognition method
CN110767235A (en) * 2019-11-14 2020-02-07 北京中电慧声科技有限公司 Voice transcription processing device with role separation function and control method
CN111048095A (en) * 2019-12-24 2020-04-21 苏州思必驰信息科技有限公司 Voice transcription method, equipment and computer readable storage medium
CN111415675A (en) * 2020-02-14 2020-07-14 北京声智科技有限公司 Audio signal processing method, device, equipment and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530452A (en) * 2020-11-23 2021-03-19 北京蓦然认知科技有限公司 Post-filtering compensation method, device and system
CN112530452B (en) * 2020-11-23 2024-06-28 北京海云捷迅科技股份有限公司 Post-filtering compensation method, device and system
CN112562681A (en) * 2020-12-02 2021-03-26 腾讯科技(深圳)有限公司 Speech recognition method and apparatus, and storage medium
CN112562681B (en) * 2020-12-02 2021-11-19 腾讯科技(深圳)有限公司 Speech recognition method and apparatus, and storage medium
CN112652300A (en) * 2020-12-24 2021-04-13 百果园技术(新加坡)有限公司 Multi-party speech sound identification method, device, equipment and storage medium
CN112652300B (en) * 2020-12-24 2024-05-17 百果园技术(新加坡)有限公司 Multiparty speech sound recognition method, device, equipment and storage medium
CN113808592A (en) * 2021-08-17 2021-12-17 百度在线网络技术(北京)有限公司 Method and device for transcribing call recording, electronic equipment and storage medium
CN113674755A (en) * 2021-08-19 2021-11-19 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and medium
CN113674755B (en) * 2021-08-19 2024-04-02 北京百度网讯科技有限公司 Voice processing method, device, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN111402855B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
JP6903129B2 (en) Whispering conversion methods, devices, devices and readable storage media
US11514886B2 (en) Emotion classification information-based text-to-speech (TTS) method and apparatus
CN111883135A (en) Voice transcription method and device and electronic equipment
KR20210114518A (en) End-to-end voice conversion
CN108108357B (en) Accent conversion method and device and electronic equipment
CN110570853A (en) Intention recognition method and device based on voice data
WO2014120291A1 (en) System and method for improving voice communication over a network
CN114203163A (en) Audio signal processing method and device
CN114338623B (en) Audio processing method, device, equipment and medium
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN112053702B (en) Voice processing method and device and electronic equipment
Gupta et al. Speech feature extraction and recognition using genetic algorithm
WO2024114303A1 (en) Phoneme recognition method and apparatus, electronic device and storage medium
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN111276132A (en) Voice processing method, electronic equipment and computer readable storage medium
CN112216270B (en) Speech phoneme recognition method and system, electronic equipment and storage medium
CN116312561A (en) Method, system and device for voice print recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system
CN116108176A (en) Text classification method, equipment and storage medium based on multi-modal deep learning
CN116343765A (en) Method and system for automatic context binding domain specific speech recognition
US11043212B2 (en) Speech signal processing and evaluation
Tzudir et al. Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients
CN114664288A (en) Voice recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination