CN111243594A - Method and device for converting audio into text

Method and device for converting audio into text

Info

Publication number
CN111243594A
CN111243594A (application CN201811436600.7A)
Authority
CN
China
Prior art keywords
information
speaker
audio
terminal
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811436600.7A
Other languages
Chinese (zh)
Inventor
汪德召
丁德辉
唐启明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hytera Communications Corp Ltd
Original Assignee
Hytera Communications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hytera Communications Corp Ltd filed Critical Hytera Communications Corp Ltd
Priority to CN201811436600.7A
Publication of CN111243594A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a method and a device for converting audio into text. The method comprises the following steps: acquiring audio information of a speaker and recording identification information corresponding to the speaker; converting the audio information into text information; and matching the text information with the identification information corresponding to the speaker to obtain target text information. In this way, the converted text information is accurately matched with its speaker, the text information of each speaker can be obtained and displayed, audio can be accurately converted into text even in complex and mixed-audio environments, and the real-time performance and accuracy of audio-to-text conversion are improved.

Description

Method and device for converting audio into text
Technical Field
The present invention relates to the field of audio processing technologies, and in particular, to a method and an apparatus for converting audio into text.
Background
With the development of communication technology, remote conferences or training sessions are now held on many occasions via communication terminals, so that participants no longer need to gather in a designated place at the same time. During such a session, participants convey the content they wish to express through speech, but owing to various factors in the call, a participant sometimes cannot hear the speaker's audio clearly.
At present, methods for converting audio into text already exist, but they all have certain drawbacks: for example, dedicated voice acquisition equipment is required to collect the audio information, and the corresponding speaker can only be located by recognizing features such as timbre.
Disclosure of Invention
In view of the above problems, the present invention provides a method and an apparatus for converting audio into text, so as to improve the real-time performance and accuracy of audio-to-text conversion.
To achieve this purpose, the invention provides the following technical solutions:
A method for converting audio into text, comprising:
acquiring audio information of a speaker and recording identification information corresponding to the speaker;
converting the audio information into text information;
and matching the text information with the identification information corresponding to the speaker to obtain target text information.
Optionally, the method further comprises:
and storing the target text information and displaying the target text information on a terminal corresponding to the participant.
Optionally, the acquiring audio information of a speaker and recording identification information corresponding to the speaker includes:
acquiring audio information of a speaker;
determining identification information of the speaker according to terminal information corresponding to the speaker, wherein the terminal information comprises a network transmission protocol, port information and terminal number information;
recording the identification information of the speaker.
Optionally, the matching the text information with the identification information corresponding to the speaker to obtain target text information includes:
binding and marking the text information and the identification information corresponding to the speaker to realize matching of the text information and the corresponding speaker;
and generating target text information according to the text information after the binding marks, wherein the target text information comprises the text information and the identity information of a speaker corresponding to the text information, and the identity information is derived from the identification information of the speaker.
Optionally, the storing the target text information and displaying the target text information on a terminal corresponding to the participant includes:
judging whether the participant has a designated speaker to be displayed, if so, displaying the target text information of the designated speaker on the terminal of the participant;
if not, displaying the target text information of all speakers on the terminal corresponding to the participant;
and storing the target text information.
Optionally, the method further comprises:
and displaying the target text on a terminal corresponding to the participant, and simultaneously playing audio information of the speaker corresponding to the target text on the terminal of the participant.
Optionally, the method further comprises:
in response to a narrowband terminal joining the conversation in which the speaker participates: if the narrowband terminal joins in half-duplex mode, displaying the target text and the audio information corresponding to the target text to the narrowband terminal when it is detected that the narrowband terminal is not speaking;
and if the narrowband terminal joins in full-duplex mode, displaying the target text and the audio information corresponding to the target text to the narrowband terminal in real time.
An apparatus for converting audio into text, comprising:
an acquisition unit, configured to acquire audio information of a speaker and record identification information corresponding to the speaker;
a conversion unit, configured to convert the audio information into text information;
and a matching unit, configured to match the text information with the identification information corresponding to the speaker to obtain target text information.
A storage medium readable by a computing device, the storage medium storing a program which, when executed by the computing device, implements the method of converting audio into text as described above.
An apparatus, the apparatus comprising:
a memory for storing data and programs;
a processor coupled to the memory, wherein the processor, when executing the program, implements the method for converting audio into text as described above.
Compared with the prior art, the invention provides a method and a device for converting audio into text: audio information of a speaker is first acquired and identification information corresponding to the speaker is recorded; the audio information is converted into text information; and the text information is matched with the identification information corresponding to the speaker to obtain target text information. In this way, the converted text information is accurately matched with its speaker, the text information of each speaker can be obtained and displayed, audio can be accurately converted into text even in complex and mixed-audio environments, and the real-time performance and accuracy of audio-to-text conversion are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that other drawings can be obtained by those skilled in the art from the provided drawings without creative effort.
Fig. 1 is a flowchart illustrating a method for converting audio into text according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a session system provided in this embodiment;
fig. 3 is a schematic structural diagram of audio-to-text conversion according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an instant messenger according to an embodiment of the present invention;
fig. 5 is an effect diagram of a video conference scene terminal display provided in an embodiment of the present invention;
fig. 6 is an effect diagram of an audio conference scene terminal display according to an embodiment of the present invention;
fig. 7 is an effect diagram of displaying conference content by a narrowband terminal according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an apparatus for converting audio into text according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first" and "second," and the like in the description and claims of the present invention and the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not set forth for a listed step or element but may include steps or elements not listed.
An embodiment of the present invention provides a method for converting audio into text. Referring to fig. 1, the method may include the following steps:
s11, acquiring the audio information of the speaker and recording the identification information corresponding to the speaker.
During an audio conference or audio training session, the participants form a conversation. When someone speaks, the participant holding the speaking right acts as the speaker and sends out audio information; the corresponding service server then obtains the audio information of the speaker and records the identification information corresponding to the speaker at that moment.
Specifically, the identification information of the speaker may be determined according to the terminal information corresponding to the speaker, because the speaker uploads his or her speech as audio information to the service server through the corresponding terminal, and the service server distributes the audio information to the terminals of the other participants; the terminal information of each participant is therefore distinct. The terminal information includes a network transmission protocol, port information and terminal number information. For example, each call has a unique combination of IP address and port that indicates the channel address of the audio transmission and corresponds to the calling and called numbers in the call protocol, so at the server side the terminal of a call indicates the identity of the participant by the combination of IP, port and number.
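By way of a non-limiting illustration, the following Python sketch shows one way the recorded correspondence between terminal information and speaker identity might be organised; the class, field and method names are assumptions introduced for this example and are not taken from the embodiment.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class TerminalInfo:
    """Terminal information of one call leg (field names are illustrative)."""
    protocol: str   # network transmission protocol, e.g. "SIP/RTP"
    ip: str         # channel address of the audio transmission
    port: int
    number: str     # calling/called number in the call protocol


class SpeakerRegistry:
    """Maps the unique (IP, port, number) combination of a call leg to the
    speaker identity recorded by the service server."""

    def __init__(self) -> None:
        self._registry: Dict[Tuple[str, int, str], str] = {}

    def register(self, terminal: TerminalInfo, speaker_id: str) -> None:
        # Record the identification information for the speaker's terminal.
        self._registry[(terminal.ip, terminal.port, terminal.number)] = speaker_id

    def identify(self, terminal: TerminalInfo) -> Optional[str]:
        # Resolve the source of an audio stream back to its speaker.
        return self._registry.get((terminal.ip, terminal.port, terminal.number))


registry = SpeakerRegistry()
registry.register(TerminalInfo("SIP/RTP", "10.0.0.12", 5004, "8001"), "Zhang San")
print(registry.identify(TerminalInfo("SIP/RTP", "10.0.0.12", 5004, "8001")))  # Zhang San
```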
It should be noted that, if a video conference mode is adopted, video images are transmitted and the audio information needs to be extracted from the video. That is, in the embodiment of the present invention, whatever mode the session adopts, the audio information must be acquired first, which provides the basis for the subsequent text conversion.
S12, converting the audio information into text information;
After the audio information is obtained, it is usually converted into text information by a media server. Because the collected or received audio information contains a certain amount of interference, the voice audio in the audio information is extracted first, speech recognition is performed on it, and the corresponding text information is finally obtained after conversion. Interference here refers to sounds other than the participants' speech, such as applause and background music.
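As a hedged illustration of step S12, the sketch below separates interference filtering from recognition; the `is_speech` and `recognize` callables stand in for a voice-activity detector and a speech-recognition engine on the media server and are assumptions, not components named by the embodiment.

```python
from typing import Callable, List


def convert_audio_to_text(
    audio_frames: List[bytes],
    is_speech: Callable[[bytes], bool],
    recognize: Callable[[bytes], str],
) -> str:
    """Keep only voice audio, drop interference such as applause or background
    music, then run speech recognition on what remains."""
    # 1. Extract the voice audio: discard frames classified as interference.
    voice_frames = [frame for frame in audio_frames if is_speech(frame)]
    # 2. Run speech recognition on the remaining voice audio.
    recognized = [recognize(frame) for frame in voice_frames]
    # 3. Join the partial results into the converted text information.
    return " ".join(part for part in recognized if part)


# Trivial stand-ins: the b"clap" frame is treated as interference and dropped.
frames = [b"hello", b"clap", b"world"]
print(convert_audio_to_text(frames,
                            is_speech=lambda f: f != b"clap",
                            recognize=lambda f: f.decode()))  # "hello world"
```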
And S13, matching the text information with the identification information corresponding to the speaker to obtain target text information.
Because the identification information corresponding to the speaker is recorded while the audio information of the speaker is acquired, the converted text information needs to be matched with the corresponding identification information so as to correspond to each speaker.
Optionally, the matching process may be:
binding and marking the text information and the identification information corresponding to the speaker to realize matching of the text information and the corresponding speaker;
and generating target text information according to the text information after the binding marks, wherein the target text information comprises the text information and the identity information of a speaker corresponding to the text information, and the identity information is derived from the identification information of the speaker.
Therefore, the target text information contains not only the converted text information but also the speaker's identity information. For example, the final target text information may take the form "Zhang San: the plan is expected to be submitted next Friday", meaning that the current speaker is Zhang San and the text converted from this speaker's audio is "the plan is expected to be submitted next Friday". Because the text information is matched with the identification information that represents the identity of the corresponding speaker, in a mixed-speech scenario the participants can still attribute the spoken content to each speaker through the text, which helps them follow a speaker's content in time when the audio quality is poor or the speaking environment is noisy.
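A minimal sketch of the binding and marking described above might look as follows; the `TargetText` structure and its `render` format are illustrative assumptions that merely reproduce the "Zhang San: ..." display format of the example.

```python
from dataclasses import dataclass


@dataclass
class TargetText:
    """Converted text bound to the identity of the speaker who produced it."""
    speaker_identity: str   # derived from the speaker's identification information
    text: str               # text converted from the speaker's audio

    def render(self) -> str:
        # Display format used in the embodiment's example, e.g.
        # "Zhang San: the plan is expected to be submitted next Friday".
        return f"{self.speaker_identity}: {self.text}"


def bind_text_to_speaker(text: str, speaker_identity: str) -> TargetText:
    # Binding and marking: attach the speaker's identity to the converted text
    # so each sentence stays attributable even when several people speak.
    return TargetText(speaker_identity=speaker_identity, text=text)


print(bind_text_to_speaker(
    "the plan is expected to be submitted next Friday", "Zhang San").render())
```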
Compared with the prior art, the invention provides a method for converting audio into text: audio information of a speaker is first acquired and identification information corresponding to the speaker is recorded; the audio information is converted into text information; and the text information is matched with the identification information corresponding to the speaker to obtain target text information. In this way, the converted text information is accurately matched with its speaker, the text information of each speaker can be obtained and displayed, audio can be accurately converted into text even in complex and mixed-audio environments, and the real-time performance and accuracy of audio-to-text conversion are improved.
Referring to fig. 2, a schematic structural diagram of a conversation system provided in an embodiment of the present invention includes three participants, A, B and C, and implements, through a service server and a media server, synchronous storage and forwarding of the text converted from each speaker's speech. Instant messaging here means that the terminals of the three participants can selectively receive different combinations of voice, video and text.
In fig. 2, the target text information is stored and displayed, and only the display content of one terminal is shown in the figure, and the display contents of the other participant terminals may be the same as or different from that of the terminal.
If a participant has designated speakers to be displayed, only the target text information of those designated speakers is displayed on that participant's terminal. For example, when A and B are in the same branch office and hold an audio conference with C at another location, A may choose to receive only C's audio and corresponding text rather than B's, because A can talk with B face to face at any time.
Meanwhile, the server stores all of the converted target text information; in a conference scenario a conference record can further be generated and stored, or distributed to the corresponding participants, so that they can conveniently retrieve and read it later.
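The following sketch combines the designated-speaker display rule with storage of the full conference record; the store and method names are assumptions introduced only for this example.

```python
from typing import Dict, List, Optional, Tuple

# (speaker_identity, text) pairs stand in for items of target text information.
TargetItem = Tuple[str, str]


class TargetTextStore:
    """Stores every piece of target text for the conference record and shows
    each participant either all speakers or only the designated ones."""

    def __init__(self) -> None:
        self._record: List[TargetItem] = []           # full conference record
        self._designated: Dict[str, List[str]] = {}   # participant -> speakers to show

    def designate(self, participant: str, speakers: List[str]) -> None:
        self._designated[participant] = speakers

    def add(self, speaker: str, text: str) -> None:
        self._record.append((speaker, text))          # all target text is stored

    def display_for(self, participant: str) -> List[str]:
        wanted: Optional[List[str]] = self._designated.get(participant)
        items = [i for i in self._record if i[0] in wanted] if wanted else self._record
        return [f"{speaker}: {text}" for speaker, text in items]


# A only wants C's speech, as in the branch-office example; B sees everything.
store = TargetTextStore()
store.designate("A", ["C"])
store.add("B", "local status update")
store.add("C", "remote status report")
print(store.display_for("A"))   # ['C: remote status report']
print(store.display_for("B"))   # both entries
```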
Optionally, in the embodiment of the present invention, the method further includes displaying the target text on a terminal corresponding to the participant, and simultaneously playing audio information of the speaker corresponding to the target text on the terminal of the participant.
It should be noted that, by default, the text and the audio may be displayed and played synchronously. Alternatively, the server may respond according to a participation mode preset by the participant; for example, in an audio training scenario, if a participant cannot conveniently receive audio because of the limitations of the venue or working environment, only the text information may be displayed. Of course, it is equally possible to receive only the audio information and not display the text, that is, audio, video and text may be combined and presented according to the needs of each participant, which is not described in detail here.
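One possible way to model the per-participant combination of audio, video and text is sketched below; the `DeliveryPreference` structure is an assumption, not a structure defined by the embodiment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeliveryPreference:
    """Per-participant reception mode: audio, video and text may be combined
    as the participant's situation allows."""
    audio: bool = True
    video: bool = True
    text: bool = True


def media_to_send(pref: DeliveryPreference) -> List[str]:
    # The server forwards only the media types the participant has opted into;
    # by default the text is displayed while the audio is played synchronously.
    return [name for name, enabled in
            (("audio", pref.audio), ("video", pref.video), ("text", pref.text))
            if enabled]


# A trainee in a noisy workshop who can only read the text:
print(media_to_send(DeliveryPreference(audio=False, video=False, text=True)))  # ['text']
```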
Referring to fig. 3 on the basis of fig. 2, when the audio information is converted into text, the media server performs real-time text conversion on each voice stream it receives and transmits the converted text to the service server. Since the service server is directly in communication with the participant terminals, it can determine which speaker a stream belongs to (for example, by the number of the communication terminal) according to the source of the audio stream, and it then places a binding tag on the text stream converted from that speaker's voice so as to distinguish the text corresponding to the words each speaker has said.
Referring to fig. 4, a schematic diagram of instant messaging according to an embodiment of the present invention, voice and text can be sent to instant messaging tools such as a PC dispatching console, an Android dispatching console, an LTE terminal and a PDT handheld terminal. Each tool can selectively receive audio, video and text according to its device capability and the needs of the participant, and the displayed text remains consistent with the voice.
Referring to fig. 5, an effect diagram displayed on a terminal in a video conference scenario according to an embodiment of the present invention, the participants in the video conference include A, B, C, D and E. The media server performs real-time text conversion on the audio in the audio/video streams transmitted by the participants and fuses the corresponding text onto the picture of the corresponding speaker in the fused video screen. Specifically, the media server has a fixed fusion-screen template according to the number of video channels, fuses the multiple video channels into one audio/video stream, and synchronously transmits this stream to the terminals of the participants; the effect displayed on each participant terminal is then as shown in fig. 5.
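The fused-screen idea can be sketched as follows, assuming a simple square grid template; the grid calculation and return format are illustrative and do not reflect the actual fusion template of the media server.

```python
import math
from typing import Dict, List, Tuple


def build_fusion_layout(
    participants: List[str],
    latest_caption: Dict[str, str],
) -> List[Tuple[str, Tuple[int, int], str]]:
    """Assign each participant a pane in a square grid template and attach the
    text converted from that participant's audio to his or her own pane."""
    cols = math.ceil(math.sqrt(len(participants)))   # fixed template per channel count
    layout = []
    for index, name in enumerate(participants):
        cell = (index // cols, index % cols)          # (row, column) of the pane
        caption = latest_caption.get(name, "")        # caption fused onto this pane
        layout.append((name, cell, f"{name}: {caption}" if caption else ""))
    return layout


# Five participants as in fig. 5; only C has spoken so far.
print(build_fusion_layout(["A", "B", "C", "D", "E"], {"C": "remote status report"}))
```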
If the conference is audio only, a subtitle box is displayed, and the dialog box beside each member shows the words spoken by that member in real time; the effect is shown in fig. 6.
In the embodiment of the invention, narrowband terminals are also supported in the audio/video conference, and the conference information can be followed through text; fig. 7 shows the effect of displaying the conference content on a narrowband terminal.
A narrowband terminal can join the conference in two modes: half-duplex and full-duplex. When it joins in half-duplex mode and does not hold the speaking right, it can only listen to the conference conversation and see the transmitted text; if the user needs to speak, the user presses the PTT (Push-To-Talk) key to apply for the speaking right, can speak after obtaining it but cannot receive the audio and text of others while speaking, and receives them again after releasing the speaking right. A terminal that joins in full-duplex mode can hear the speakers and see the text in real time, and can speak at any time. Whichever way the narrowband terminal joins, in a video conference its video channel is closed, so it receives no video, only audio and text.
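The narrowband delivery rule described above can be summarised in a small decision function; the mode enumeration and parameter names are assumptions for this sketch.

```python
from enum import Enum


class JoinMode(Enum):
    HALF_DUPLEX = "half-duplex"
    FULL_DUPLEX = "full-duplex"


def should_deliver_to_narrowband(mode: JoinMode, has_talk_permit: bool) -> bool:
    """A full-duplex member always receives audio and text in real time; a
    half-duplex member receives them only while it does not hold the talk
    permit, i.e. while it is not speaking."""
    if mode is JoinMode.FULL_DUPLEX:
        return True
    return not has_talk_permit   # half-duplex: deliver only when not speaking


# A half-duplex member presses PTT and obtains the talk permit: delivery pauses.
print(should_deliver_to_narrowband(JoinMode.HALF_DUPLEX, has_talk_permit=True))   # False
print(should_deliver_to_narrowband(JoinMode.HALF_DUPLEX, has_talk_permit=False))  # True
print(should_deliver_to_narrowband(JoinMode.FULL_DUPLEX, has_talk_permit=True))   # True
```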
Therefore, the method for converting audio into text in the embodiment of the invention converts each collected voice into text in real time and keeps each voice correctly attributed even under audio mixing, which ensures both real-time performance and accuracy; information that needs attention can be clearly distinguished in the conference even under mixed audio; and narrowband terminals are supported in the audio/video conference, so conference information can still be followed through text even when sound cannot be played under special conditions.
Corresponding to the method for converting audio into text provided by the embodiment of the present invention, an embodiment of the present invention further provides an apparatus for converting audio into text. Referring to fig. 8, the apparatus may include:
the acquiring unit 10 is used for acquiring audio information of a speaker and recording identification information corresponding to the speaker;
a conversion unit 20, configured to convert the audio information into text information;
a matching unit 30, configured to match the text information with the identification information corresponding to the speaker, so as to obtain target text information.
Optionally, the method further comprises:
and the display unit is used for storing the target text information and displaying the target text information on the terminal corresponding to the participant.
Optionally, the obtaining unit 10 includes:
the acquiring subunit is used for acquiring the audio information of the speaker;
a determining subunit, configured to determine, according to terminal information corresponding to the speaker, identification information of the speaker, where the terminal information includes a network transmission protocol, port information, and terminal number information;
and the recording subunit is used for recording the identification information of the speaker.
Optionally, the matching unit 30 comprises:
the matching subunit is used for binding and marking the text information and the identification information corresponding to the speaker to realize matching of the text information and the corresponding speaker;
and the generating subunit is used for generating target text information according to the text information after the binding marks, wherein the target text information comprises the text information and the identity information of the speaker corresponding to the text information, and the identity information is derived from the identification information of the speaker.
Optionally, the method further comprises:
the judgment unit is used for judging whether the participant has a designated speaker to be displayed, and if so, displaying the target text information of the designated speaker on the terminal of the participant;
if not, displaying the target text information of all speakers on the terminal corresponding to the participant;
and storing the target text information.
Optionally, the method further comprises:
and the synchronization unit is used for displaying the target text on the terminal corresponding to the participant and simultaneously playing the audio information of the speaker corresponding to the target text on the terminal of the participant.
Optionally, the method further comprises:
a first response unit, configured to respond to a narrowband terminal joining the conversation in which the speaker participates and, if the narrowband terminal joins in half-duplex mode, display the target text and the audio information corresponding to the target text to the narrowband terminal when it is detected that the narrowband terminal is not speaking;
and a second response unit, configured to display the target text and the corresponding audio information to the narrowband terminal in real time if the narrowband terminal joins in full-duplex mode.
The embodiment of the invention provides a device for converting audio into text: the acquisition unit acquires audio information of a speaker and records identification information corresponding to the speaker; the conversion unit converts the audio information into text information; and the matching unit matches the text information with the identification information corresponding to the speaker to obtain target text information. In this way, the converted text information is accurately matched with its speaker, the text information of each speaker can be obtained and displayed, audio can be accurately converted into text even in complex and mixed-audio environments, and the real-time performance and accuracy of audio-to-text conversion are improved.
An embodiment of the present invention provides a storage medium readable by a computing device, where the storage medium stores a program, and when the program is executed by the computing device, the method for converting audio into text is implemented.
An embodiment of the present invention provides an apparatus, including:
a memory for storing data and programs;
a processor coupled with the memory and implementing the following steps when the processor executes the program:
acquiring audio information of a speaker and recording identification information corresponding to the speaker;
converting the audio information into text information;
and matching the text information with the identification information corresponding to the speaker to obtain target text information.
Further, the steps further include:
and storing the target text information and displaying the target text information on a terminal corresponding to the participant.
Further, the acquiring audio information of a speaker and recording identification information corresponding to the speaker includes:
acquiring audio information of a speaker;
determining identification information of the speaker according to terminal information corresponding to the speaker, wherein the terminal information comprises a network transmission protocol, port information and terminal number information;
recording the identification information of the speaker.
Further, the matching the text information with the identification information corresponding to the speaker to obtain target text information includes:
binding and marking the text information and the identification information corresponding to the speaker to realize matching of the text information and the corresponding speaker;
and generating target text information according to the text information after the binding marks, wherein the target text information comprises the text information and the identity information of a speaker corresponding to the text information, and the identity information is derived from the identification information of the speaker.
Further, the storing the target text information and displaying the target text information on a terminal corresponding to the participant includes:
judging whether the participant has a designated speaker to be displayed, if so, displaying the target text information of the designated speaker on the terminal of the participant;
if not, displaying the target text information of all speakers on the terminal corresponding to the participant;
and storing the target text information.
Further, the steps further include:
and displaying the target text on a terminal corresponding to the participant, and simultaneously playing audio information of the speaker corresponding to the target text on the terminal of the participant.
Further, the steps further include:
in response to a narrowband terminal joining the conversation in which the speaker participates: if the narrowband terminal joins in half-duplex mode, displaying the target text and the audio information corresponding to the target text to the narrowband terminal when it is detected that the narrowband terminal is not speaking;
and if the narrowband terminal joins in full-duplex mode, displaying the target text and the audio information corresponding to the target text to the narrowband terminal in real time.
The device herein may be a server, a PC, a PAD, a mobile phone, etc.
With the storage medium and the computing device provided by the embodiments of the invention, audio information of a speaker is first acquired and identification information corresponding to the speaker is recorded; the audio information is converted into text information; and the text information is matched with the identification information corresponding to the speaker to obtain target text information. In this way, the converted text information is accurately matched with its speaker, the text information of each speaker can be obtained and displayed, audio can be accurately converted into text even in complex and mixed-audio environments, and the real-time performance and accuracy of audio-to-text conversion are improved.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for converting audio into text, comprising:
acquiring audio information of a speaker and recording identification information corresponding to the speaker;
converting the audio information into text information;
and matching the text information with the identification information corresponding to the speaker to obtain target text information.
2. The method of claim 1, further comprising:
and storing the target text information and displaying the target text information on a terminal corresponding to the participant.
3. The method of claim 1, wherein the obtaining audio information of a speaker and recording identification information corresponding to the speaker comprises:
acquiring audio information of a speaker;
determining identification information of the speaker according to terminal information corresponding to the speaker, wherein the terminal information comprises a network transmission protocol, port information and terminal number information;
recording the identification information of the speaker.
4. The method of claim 1, wherein matching the text information with identification information corresponding to the speaker to obtain target text information comprises:
binding and marking the text information and the identification information corresponding to the speaker to realize matching of the text information and the corresponding speaker;
and generating target text information according to the text information after the binding marks, wherein the target text information comprises the text information and the identity information of a speaker corresponding to the text information, and the identity information is derived from the identification information of the speaker.
5. The method of claim 2, wherein storing and displaying the target text information on the terminal corresponding to the participant comprises:
judging whether the participant has a designated speaker to be displayed, if so, displaying the target text information of the designated speaker on the terminal of the participant;
if not, displaying the target text information of all speakers on the terminal corresponding to the participant;
and storing the target text information.
6. The method of claim 2, further comprising:
and displaying the target text on a terminal corresponding to the participant, and simultaneously playing audio information of the speaker corresponding to the target text on the terminal of the participant.
7. The method of claim 1, further comprising:
in response to a narrowband terminal joining the conversation in which the speaker participates: if the narrowband terminal joins in half-duplex mode, displaying the target text and the audio information corresponding to the target text to the narrowband terminal when it is detected that the narrowband terminal is not speaking;
and if the narrowband terminal joins in full-duplex mode, displaying the target text and the audio information corresponding to the target text to the narrowband terminal in real time.
8. An apparatus for converting audio into text, comprising:
an acquisition unit, configured to acquire audio information of a speaker and record identification information corresponding to the speaker;
a conversion unit, configured to convert the audio information into text information;
and a matching unit, configured to match the text information with the identification information corresponding to the speaker to obtain target text information.
9. A storage medium readable by a computing device, the storage medium storing a program that, when executed by the computing device, implements the method of any of claims 1-7.
10. An apparatus, the apparatus comprising:
a memory for storing data and programs;
a processor coupled with the memory and implementing the method of any of claims 1-7 when the processor executes the program.
CN201811436600.7A 2018-11-28 2018-11-28 Method and device for converting audio into text Pending CN111243594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811436600.7A CN111243594A (en) 2018-11-28 2018-11-28 Method and device for converting audio into text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811436600.7A CN111243594A (en) 2018-11-28 2018-11-28 Method and device for converting audio into text

Publications (1)

Publication Number Publication Date
CN111243594A true CN111243594A (en) 2020-06-05

Family

ID=70865436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811436600.7A Pending CN111243594A (en) Method and device for converting audio into text

Country Status (1)

Country Link
CN (1) CN111243594A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1636384A (en) * 2002-02-20 2005-07-06 思科技术公司 Method and system for conducting conference calls with optional voice to text translation
CN102812732A (en) * 2010-02-10 2012-12-05 哈里公司 Simultaneous conference calls with a speech-to-text conversion function
WO2017124294A1 (en) * 2016-01-19 2017-07-27 王晓光 Conference recording method and system for network video conference
CN107578777A (en) * 2016-07-05 2018-01-12 阿里巴巴集团控股有限公司 Word-information display method, apparatus and system, audio recognition method and device
CN106057193A (en) * 2016-07-13 2016-10-26 深圳市沃特沃德股份有限公司 Conference record generation method based on telephone conference and device
CN108111797A (en) * 2016-11-24 2018-06-01 北京中创视讯科技有限公司 Participant's treating method and apparatus
CN108259801A (en) * 2018-01-19 2018-07-06 广州视源电子科技股份有限公司 Audio and video data display method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112532912A (en) * 2020-11-20 2021-03-19 北京搜狗科技发展有限公司 Video processing method and device and electronic equipment
CN114155841A (en) * 2021-11-15 2022-03-08 安徽听见科技有限公司 Voice recognition method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107911646B (en) Method and device for sharing conference and generating conference record
CN102017513B (en) Method for real time network communication as well as method and system for real time multi-lingual communication
US9160551B2 (en) Analytic recording of conference sessions
US10574827B1 (en) Method and apparatus of processing user data of a multi-speaker conference call
US20140254820A1 (en) Methods and devices to generate multiple-channel audio recordings
US11782674B2 (en) Centrally controlling communication at a venue
CN109361527B (en) Voice conference recording method and system
WO2016127691A1 (en) Method and apparatus for broadcasting dynamic information in multimedia conference
CN111243594A (en) Method and device for converting audio into text
US20120164986A1 (en) Method and apparatus for multipoint call service in mobile terminal
WO2015078105A1 (en) Method and system for processing audio of synchronous classroom
US20080316945A1 (en) Ip telephone terminal and telephone conference system
CN109802968B (en) Conference speaking system
JP2009118316A (en) Voice communication device
CN114531425B (en) Processing method and processing device
CN111798872B (en) Processing method and device for online interaction platform and electronic equipment
CN112825551B (en) Video conference important content prompting and transferring storage method and system
CN107566340B (en) Conference auxiliary communication method and storage medium and device thereof
CN112910827B (en) Multi-party interaction system, method and device, electronic equipment and storage medium
US11764984B2 (en) Teleconference method and teleconference system
CN113037610B (en) Voice data processing method and device, computer equipment and storage medium
US20240056279A1 (en) Communication system
CN117750275A (en) Method, device, equipment and storage medium for audio mixing of audio device
CN116633908A (en) Transmission connection construction method and system
JP2022113375A (en) Information processing method and monitoring system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200605)