CN112185354A - Voice text display method, device, equipment and storage medium - Google Patents

Voice text display method, device, equipment and storage medium

Info

Publication number
CN112185354A
Authority
CN
China
Prior art keywords
voice
determining
target
volume
text content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010980844.2A
Other languages
Chinese (zh)
Inventor
余逸尘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Tonghuashun Intelligent Technology Co Ltd
Original Assignee
Zhejiang Tonghuashun Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Tonghuashun Intelligent Technology Co Ltd filed Critical Zhejiang Tonghuashun Intelligent Technology Co Ltd
Priority to CN202010980844.2A priority Critical patent/CN112185354A/en
Publication of CN112185354A publication Critical patent/CN112185354A/en
Priority to US17/447,068 priority patent/US20220084525A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/103 Formatting, i.e. changing of presentation of documents
    • G06F 40/109 Font handling; Temporal or kinetic typography
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the invention discloses a method, a device, equipment and a storage medium for displaying voice text. The method comprises the following steps: determining first position information of a sound source according to speech collected by a microphone array; converting the speech into text content and determining a target font size corresponding to the text content according to the volume of the speech; and displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size. The method for displaying voice text provided by the embodiment of the invention can associate the voice text with the position and volume of the speaker, thereby improving the display effect of the voice text.

Description

Voice text display method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of voice recognition, in particular to a method, a device, equipment and a storage medium for displaying a voice text.
Background
Speech visualization may be understood as converting speech into text by means of automatic speech recognition techniques, so that the content of the speech is directly visible in text form. Speaker recognition, in turn, identifies the speaker corresponding to a given segment of audio based on its audio characteristics. However, in a multi-person conversation scene, neither approach can associate the speech with the speaker's position and volume, so the display effect of the voice text is poor.
Disclosure of Invention
The invention provides a method, a device, equipment and a storage medium for displaying voice text, which can associate the voice text with the position and volume of the speaker and improve the display effect of the voice text.
In a first aspect, an embodiment of the present invention provides a method for displaying a speech text, including:
determining first position information of a sound source according to the voice collected by the microphone array;
converting the voice into text content, and determining the size of a target font corresponding to the text content according to the volume of the voice;
and displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size.
Further, there are at least three microphone arrays arranged at different positions, and determining the first position information of the sound source according to the speech collected by the microphone arrays includes:
acquiring voice phase differences acquired by each microphone array;
first position information of the sound source relative to the target microphone array is determined based on the speech phase difference.
Further, the first position information includes a distance between an audio source and the target microphone array, and determining the volume of the speech includes:
determining a first volume of the speech when acquired by a target microphone array;
determining a volume of speech generated from the sound source based on a set volume attenuation formula according to a distance between the sound source and the target microphone array.
Further, determining a target font size corresponding to the text content according to the volume of the voice includes:
acquiring a reference font size corresponding to the reference volume;
determining a ratio of a volume of the speech to the reference volume;
and determining the target font size corresponding to the text content according to the proportion and the reference font size.
Further, the volume of the speech is characterized by an amplitude; and determining the target font size corresponding to the text content according to the volume of the speech includes:
acquiring the maximum amplitude and the minimum amplitude in the voice contained in the current conversation scene;
normalizing the amplitude of each voice according to the maximum amplitude and the minimum amplitude;
and determining the target font size corresponding to each voice according to the amplitude after the normalization processing.
Further, displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size includes:
acquiring second position information of the target microphone array in a set three-dimensional coordinate system;
determining target position information of the sound source in the set three-dimensional coordinate system according to the first position information and the second position information;
displaying the text content in the target position at the target font size.
Further, after determining the size of the target font corresponding to the text content according to the volume of the voice, the method further includes:
determining emotion information of a sound source according to the voice;
determining the color of the target font according to the emotion information;
correspondingly, displaying the text content in the target position in the target font size includes:
displaying the text content in the target location in the target font size and the color.
In a second aspect, an embodiment of the present invention further provides a device for displaying a speech text, including:
the first position information determining module is used for determining first position information of a sound source according to the voice collected by the microphone array;
the target font size determining module is used for converting the voice into text content and determining the target font size corresponding to the text content according to the volume of the voice;
and the text content display module is used for displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size.
Further, there are at least three microphone arrays arranged at different positions, and the first position information determining module is further configured to:
acquiring voice phase differences acquired by each microphone array;
first position information of the sound source relative to the target microphone array is determined based on the speech phase difference.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for displaying voice text according to an embodiment of the invention when executing the program.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processing apparatus, implements the method for displaying a speech text according to the embodiment of the present invention.
The embodiment of the invention discloses a method, a device, equipment and a storage medium for displaying voice text. The method for displaying voice text provided by the embodiment of the invention can associate the voice text with the position and volume of the speaker, thereby improving the display effect of the voice text.
Drawings
Fig. 1 is a flowchart of a method for displaying a speech text according to a first embodiment of the present invention;
FIG. 2 is a diagram illustrating the effect of displaying a voice text according to a first embodiment of the present invention;
fig. 3 is a schematic structural diagram of a device for displaying a speech text according to a second embodiment of the present invention;
fig. 4 is a schematic structural diagram of a computer device in the third embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a method for displaying a speech text according to an embodiment of the present invention, where the embodiment is applicable to a case where a text converted from speech is displayed, and the method may be executed by a speech text display apparatus, where the apparatus may be composed of hardware and/or software, and may be generally integrated in a device having a function of displaying a speech text, where the device may be an electronic device such as a server or a server cluster. As shown in fig. 1, the method specifically includes the following steps:
step 110, determining first position information of a sound source according to the voice collected by the microphone array.
The number of microphone arrays may be at least three, and the microphone arrays are arranged at different positions in the current conversation scene. A conversation scene may be understood as a venue for a multi-person conversation, such as a conference room or a classroom. The first position information may be understood as the position of the sound source relative to the target microphone array, which may be any one of the microphone arrays; it includes the distance between the sound source and the target microphone array.
In this embodiment, the manner of determining the first position information of the sound source according to the speech acquired by the microphone array may be: acquiring voice phase differences acquired by each microphone array; first position information of the audio source relative to the target microphone array is determined based on the speech phase difference.
The voice phase difference can be determined according to the time delay of the voice signals received by each microphone array.
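As an illustration only (not part of the patent), the time delay between two microphone arrays can be estimated from the cross-correlation of their recorded signals; the following minimal sketch assumes single-channel NumPy signals sampled at the same rate, and all names in it are hypothetical.

    import numpy as np

    def estimate_delay(sig_a: np.ndarray, sig_b: np.ndarray, sample_rate: int) -> float:
        """Estimate the arrival-time difference (in seconds) of the same speech
        at two microphone arrays from the peak of their cross-correlation."""
        corr = np.correlate(sig_a, sig_b, mode="full")
        # Zero lag sits at index len(sig_b) - 1; the peak offset from it is the lag in samples.
        lag = np.argmax(corr) - (len(sig_b) - 1)
        return lag / sample_rate

Multiplying the delay by the speed of sound gives the path-length difference that localization methods build on.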
Optionally, the first position information of the sound source may also be determined from the speech collected by the microphone arrays as follows: determining the distance between the sound source and each microphone array; taking each microphone array as the center of a sphere whose radius is the corresponding distance; obtaining the intersection point of the spheres, which is the position of the sound source; and taking the position of the intersection point relative to the target microphone array as the first position information.
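As a sketch of the sphere-intersection idea described above (assuming the world coordinates of the microphone arrays and the source-to-array distances are already known; a linear least-squares solution is one common way to find the intersection, and four or more arrays make it well determined):

    import numpy as np

    def locate_source(mic_positions: np.ndarray, distances: np.ndarray) -> np.ndarray:
        """Find the point whose distance to each microphone array matches the
        measured distances, i.e. the common intersection of the spheres.

        mic_positions: (N, 3) array of world coordinates of the arrays.
        distances:     (N,)  array of source-to-array distances.
        """
        m0, d0 = mic_positions[0], distances[0]
        # Subtracting the first sphere equation |p - m0|^2 = d0^2 from the others
        # removes the quadratic term and leaves a linear system in p.
        A = 2.0 * (mic_positions[1:] - m0)
        b = (d0 ** 2 - distances[1:] ** 2
             + np.sum(mic_positions[1:] ** 2, axis=1) - np.sum(m0 ** 2))
        position, *_ = np.linalg.lstsq(A, b, rcond=None)
        return position

The result can then be expressed relative to the target microphone array by subtracting that array's coordinates, giving the first position information.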
Step 120, converting the voice into text content, and determining a target font size corresponding to the text content according to the volume of the voice.
The collected speech can be converted into text content by Automatic Speech Recognition (ASR).
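The patent does not prescribe a particular recognizer. Purely as an illustration, the open-source SpeechRecognition package for Python can turn a recorded WAV segment into text; the file path and language code below are assumptions.

    import speech_recognition as sr

    def speech_to_text(wav_path: str) -> str:
        """Convert a recorded speech segment into text content via ASR."""
        recognizer = sr.Recognizer()
        with sr.AudioFile(wav_path) as source:
            audio = recognizer.record(source)  # read the whole segment
        # Any ASR backend could be used here; Google's free web API is one example.
        return recognizer.recognize_google(audio, language="zh-CN")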
The volume of the speech can be determined from the amplitude of the speech signal; here, the volume of the speech refers to the volume at which it was produced by the sound source.
In this embodiment, the manner of determining the volume of the voice may be: determining a first volume of speech when collected by a target microphone array; the volume of speech produced by the source of sound is determined based on a set volume attenuation formula based on the distance of the source of sound from the target microphone array.
The set volume attenuation formula can be obtained by fitting volume test measurements. Illustratively, microphone arrays are placed at different distances from a sound source, and the relationship between volume and propagation distance is established from the volume of the speech collected by each microphone array at its distance, yielding the volume attenuation formula.
Specifically, the first volume and the distance between the sound source and the target microphone array are substituted into a set volume attenuation formula to obtain the volume when the voice is generated by the sound source.
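The attenuation formula itself is obtained by experimental fitting; as a stand-in only, the sketch below assumes free-field spherical spreading (a loss of 20 * log10(d / d_ref) dB, i.e. about 6 dB per doubling of distance) and a 1 m reference distance.

    import math

    def volume_at_source(measured_db: float, distance_m: float, ref_distance_m: float = 1.0) -> float:
        """Back out the volume (in dB) at the reference distance from the sound source,
        given the first volume measured at the target microphone array.

        Assumes the inverse-distance law; a formula fitted from calibration
        measurements would replace this in practice.
        """
        return measured_db + 20.0 * math.log10(distance_m / ref_distance_m)

For example, 55 dB measured 2 m away corresponds to roughly 61 dB at 1 m from the source.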
Optionally, the manner of determining the target font size corresponding to the text content according to the volume of the voice may be: acquiring a reference font size corresponding to the reference volume; determining the ratio of the volume of the voice to the reference volume; and determining the target font size corresponding to the text content according to the proportion and the reference font size.
The reference volume can be preset; for example, the average volume at which people speak can be obtained through big-data analysis and used as the reference volume, and a corresponding reference font size is assigned to it. Font size may be expressed as a type (point) size.
Specifically, the ratio of the volume of the collected speech to the reference volume is calculated, and the target font size is obtained by multiplying this ratio by the reference font size. For example, assuming the reference volume is A0 and the reference font size is B0, if the volume of the captured speech is A1, the target font size is B1 = B0 × A1 / A0.
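A minimal sketch of this ratio rule (the reference values are placeholders, not values from the patent):

    def target_font_size(volume: float, ref_volume: float = 60.0, ref_font_size: float = 14.0) -> float:
        """Scale the reference font size by the ratio of the speech volume to the
        reference volume: B1 = B0 * A1 / A0."""
        return ref_font_size * volume / ref_volume

    # e.g. target_font_size(75.0) -> 14 * 75 / 60 = 17.5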
Optionally, the volume of the speech is represented by amplitude, and the manner of determining the target font size corresponding to the text content according to the volume of the speech may be: acquiring the maximum amplitude and the minimum amplitude in the voice contained in the current conversation scene; normalizing the amplitude of each voice according to the maximum amplitude and the minimum amplitude; and determining the target font size corresponding to each voice according to the amplitude after the normalization processing.
The normalization formula is G = (S - Smin) / (Smax - Smin), where S is the amplitude of the collected speech, Smax is the maximum amplitude among the voices, Smin is the minimum amplitude among the voices, and G is the normalized amplitude. In this embodiment, a font size range may be set, in which the voice with the largest amplitude corresponds to the largest font in the range and the voice with the smallest amplitude corresponds to the smallest font in the range; the font size corresponding to an acquired voice is obtained by multiplying its normalized amplitude by the span of the range and adding the smallest font size. For example, assuming a font size range of 10-25, i.e. a span of 15, and a normalized amplitude of 0.6, the acquired speech corresponds to a font size of 10 + 15 × 0.6 = 19. The advantage of this approach is that it prevents the text from being rendered too large or too small because of an ill-chosen reference volume, which would spoil the appearance of the displayed text content.
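A minimal sketch of the normalization rule, using the 10-25 font size range from the example above (the range itself is configurable):

    def font_size_from_amplitude(amplitude: float, amp_min: float, amp_max: float,
                                 font_min: float = 10.0, font_max: float = 25.0) -> float:
        """Map a speech amplitude to a font size by min-max normalization:
        G = (S - Smin) / (Smax - Smin), size = font_min + (font_max - font_min) * G."""
        g = (amplitude - amp_min) / (amp_max - amp_min)
        return font_min + (font_max - font_min) * g

    # e.g. with a normalized amplitude of 0.6 this yields 10 + 15 * 0.6 = 19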
And step 130, displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size.
Specifically, the process of displaying the text content in the set three-dimensional coordinate system according to the first position information and the target font size may be: acquiring second position information of the target microphone array in a set three-dimensional coordinate system; determining target position information of the sound source in a set three-dimensional coordinate system according to the first position information and the second position information; the text content is displayed in the target position at the target font size.
Assuming that the position information of the target microphone array in the set three-dimensional coordinate system is (x1, y1, z1) and the position information of the sound source relative to the target microphone array is (x2, y2, z2), the position information of the sound source in the set three-dimensional coordinate system is (x2-x1, y2-y1, z2-z1). Illustratively, fig. 2 is an effect diagram of the voice text display in this embodiment. As shown in fig. 2, in the conversation scene the scheme of this embodiment accurately displays what the teacher and each student said, and reflects the students' positions and speaking volumes.
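A minimal sketch of the display step, using matplotlib's 3D axes as a stand-in for the set three-dimensional coordinate system (the coordinates, text and font size below are placeholder values):

    import matplotlib.pyplot as plt

    def show_text_in_3d(text: str, position: tuple, font_size: float, color: str = "black") -> None:
        """Render recognized text at the sound source's position in a 3D scene."""
        fig = plt.figure()
        ax = fig.add_subplot(projection="3d")
        ax.set_xlim(0, 10); ax.set_ylim(0, 10); ax.set_zlim(0, 3)
        x, y, z = position
        ax.text(x, y, z, text, fontsize=font_size, color=color)
        plt.show()

    # e.g. show_text_in_3d("Hello everyone", (3.2, 4.5, 1.6), 19)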
Optionally, after determining the size of the target font corresponding to the text content according to the volume of the voice, the method further includes the following steps: determining emotion information of a sound source according to the voice; and determining the color of the target font according to the emotional information.
The emotion information of the sound source may be determined by inputting the speech into a trained emotion recognition model for analysis. The emotion recognition model can be trained on a large number of voice samples labeled with emotion information. In this embodiment, a correspondence between emotion information and font color may be established in advance; illustratively, happy corresponds to red, sad corresponds to blue, and so on.
Specifically, after the font color corresponding to the speech is obtained, the text content is displayed at the target position in the target font size and the corresponding color, so that the text content reflects the position, volume and emotion of the speaker.
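A minimal sketch of the emotion-to-color lookup (the labels and colors are illustrative; the trained emotion recognition model itself is assumed to exist elsewhere):

    # Hypothetical mapping from recognized emotion labels to font colors.
    EMOTION_COLORS = {"happy": "red", "sad": "blue", "neutral": "black"}

    def font_color_for_emotion(emotion: str) -> str:
        """Return the display color for a recognized emotion, defaulting to black."""
        return EMOTION_COLORS.get(emotion, "black")

The returned color can then be passed as the color argument of the 3D display sketch above.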
According to the technical scheme of this embodiment, first position information of a sound source is determined according to the speech collected by the microphone array, the speech is converted into text content, the target font size corresponding to the text content is determined according to the volume of the speech, and finally the text content is displayed in a set three-dimensional coordinate system according to the first position information and the target font size. The method for displaying voice text provided by the embodiment of the invention can associate the voice text with the position and volume of the speaker, thereby improving the display effect of the voice text.
Example two
Fig. 3 is a schematic structural diagram of a device for displaying a speech text according to a second embodiment of the present invention. As shown in fig. 3, the apparatus includes: a first location information determining module 210, a target font size determining module 220, and a text content display module 230.
A first position information determining module 210, configured to determine first position information of a sound source according to a voice collected by the microphone array;
a target font size determining module 220, configured to convert the voice into text content, and determine a target font size corresponding to the text content according to the volume of the voice;
and the text content display module 230 is configured to display the text content in the set three-dimensional coordinate system according to the first position information and the target font size.
Optionally, there are at least three microphone arrays disposed at different positions, and the first position information determining module 210 is further configured to:
acquiring voice phase differences acquired by each microphone array;
first position information of the audio source relative to the target microphone array is determined based on the speech phase difference.
Optionally, the first position information includes a distance between the sound source and the target microphone array, and determining the volume of the speech includes:
determining a first volume of speech when collected by a target microphone array;
the volume of speech produced by the source of sound is determined based on a set volume attenuation formula based on the distance of the source of sound from the target microphone array.
Optionally, the target font size determining module 220 is further configured to:
acquiring a reference font size corresponding to the reference volume;
determining the ratio of the volume of the voice to the reference volume;
and determining the target font size corresponding to the text content according to the proportion and the reference font size.
Optionally, the target font size determining module 220 is further configured to:
acquiring the maximum amplitude and the minimum amplitude in the voice contained in the current conversation scene;
normalizing the amplitude of each voice according to the maximum amplitude and the minimum amplitude;
and determining the target font size corresponding to each voice according to the amplitude after the normalization processing.
Optionally, the text content display module 230 is further configured to:
acquiring second position information of the target microphone array in a set three-dimensional coordinate system;
determining target position information of the sound source in a set three-dimensional coordinate system according to the first position information and the second position information;
the text content is displayed in the target position at the target font size.
Optionally, the method further includes: a font color determination module to: determining emotion information of a sound source according to the voice; and determining the color of the target font according to the emotional information.
Optionally, the text content display module 230 is further configured to:
the text content is displayed in the target position with the target font size and color.
The device can execute the methods provided by all the embodiments of the invention, and has corresponding functional modules and beneficial effects for executing the methods. For details not described in detail in this embodiment, reference may be made to the methods provided in all the foregoing embodiments of the present invention.
EXAMPLE III
Fig. 4 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. FIG. 4 illustrates a block diagram of a computer device 312 suitable for use in implementing embodiments of the present invention. The computer device 312 shown in FIG. 4 is only an example and should not impose any limitation on the functionality or scope of use of embodiments of the present invention. The device 312 is typically a computing device that undertakes the display of voice text.
As shown in FIG. 4, computer device 312 is in the form of a general purpose computing device. The components of computer device 312 may include, but are not limited to: one or more processors 316, a storage device 328, and a bus 318 that couples the various system components including the storage device 328 and the processors 316.
Bus 318 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Computer device 312 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 312 and includes both volatile and nonvolatile media, removable and non-removable media.
Storage 328 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 330 and/or cache Memory 332. The computer device 312 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 334 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, and commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk-Read Only Memory (CD-ROM), a Digital Video disk (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 318 by one or more data media interfaces. Storage 328 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
Program 336 having a set (at least one) of program modules 326 may be stored, for example, in storage 328, such program modules 326 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which may comprise an implementation of a network environment, or some combination thereof. Program modules 326 generally carry out the functions and/or methodologies of embodiments of the invention as described herein.
The computer device 312 may also communicate with one or more external devices 314 (e.g., keyboard, pointing device, camera, display 324, etc.), with one or more devices that enable a user to interact with the computer device 312, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 312 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 322. Also, computer device 312 may communicate with one or more networks (e.g., a Local Area Network (LAN), Wide Area Network (WAN), etc.) and/or a public Network, such as the internet, via Network adapter 320. As shown, network adapter 320 communicates with the other modules of computer device 312 via bus 318. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer device 312, including but not limited to: microcode, device drivers, Redundant processing units, external disk drive Arrays, disk array (RAID) systems, tape drives, and data backup storage systems, to name a few.
The processor 316 executes various functional applications and data processing, such as implementing the display method of voice text provided by the above-described embodiments of the present invention, by executing programs stored in the storage 328.
Example four
Embodiments of the present invention provide a computer-readable storage medium on which a computer program is stored, the program, when executed by a processing apparatus, implementing a method for displaying a speech text as in embodiments of the present invention. The computer readable medium of the present invention described above may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with digital data communication in any form or medium (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: determining first position information of a sound source according to the voice collected by the microphone array; converting the voice into text content, and determining the size of a target font corresponding to the text content according to the volume of the voice; and displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of an element does not in some cases constitute a limitation on the element itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method for displaying a speech text, comprising:
determining first position information of a sound source according to the voice collected by the microphone array;
converting the voice into text content, and determining the size of a target font corresponding to the text content according to the volume of the voice;
and displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size.
2. The method of claim 1, wherein the microphone arrays are at least three and are disposed at different positions, and the determining the first position information of the sound source according to the voices collected by the microphone arrays comprises:
acquiring voice phase differences acquired by each microphone array;
first position information of the sound source relative to the target microphone array is determined based on the speech phase difference.
3. The method of claim 2, wherein the first location information comprises a distance of an audio source from the target microphone array, and wherein determining the volume of the speech comprises:
determining a first volume of the speech when acquired by a target microphone array;
determining a volume of speech generated from the sound source based on a set volume attenuation formula according to a distance between the sound source and the target microphone array.
4. The method of claim 1, wherein determining a target font size corresponding to the text content according to the volume of the speech comprises:
acquiring a reference font size corresponding to the reference volume;
determining a ratio of a volume of the speech to the reference volume;
and determining the target font size corresponding to the text content according to the proportion and the reference font size.
5. The method of claim 1, wherein the volume of the speech is characterized by an amplitude; determining the size of a target font corresponding to the text content according to the volume of the voice, wherein the step comprises the following steps:
acquiring the maximum amplitude and the minimum amplitude in the voice contained in the current conversation scene;
normalizing the amplitude of each voice according to the maximum amplitude and the minimum amplitude;
and determining the target font size corresponding to each voice according to the amplitude after the normalization processing.
6. The method of claim 2, wherein displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size comprises:
acquiring second position information of the target microphone array in a set three-dimensional coordinate system;
determining target position information of the sound source in the set three-dimensional coordinate system according to the first position information and the second position information;
displaying the text content in the target position at the target font size.
7. The method of claim 6, after determining a target font size corresponding to the text content according to the volume of the voice, further comprising:
determining emotion information of a sound source according to the voice;
determining the color of the target font according to the emotion information;
correspondingly, displaying the text content in the target position in the target font size includes:
displaying the text content in the target location in the target font size and the color.
8. A device for displaying phonetic text, comprising:
the first position information determining module is used for determining first position information of a sound source according to the voice collected by the microphone array;
the target font size determining module is used for converting the voice into text content and determining the target font size corresponding to the text content according to the volume of the voice;
and the text content display module is used for displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size.
9. A computer device, the device comprising: comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of displaying a phonetic text according to any one of claims 1-7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by processing means, carries out a method of displaying a phonetic text according to any one of claims 1 to 7.
CN202010980844.2A 2020-09-17 2020-09-17 Voice text display method, device, equipment and storage medium Pending CN112185354A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010980844.2A CN112185354A (en) 2020-09-17 2020-09-17 Voice text display method, device, equipment and storage medium
US17/447,068 US20220084525A1 (en) 2020-09-17 2021-09-08 Systems and methods for voice audio data processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010980844.2A CN112185354A (en) 2020-09-17 2020-09-17 Voice text display method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112185354A (en) 2021-01-05

Family

ID=73920319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010980844.2A Pending CN112185354A (en) 2020-09-17 2020-09-17 Voice text display method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112185354A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819758A (en) * 2009-12-22 2010-09-01 中兴通讯股份有限公司 System of controlling screen display by voice and implementation method
CN104967717A (en) * 2015-05-26 2015-10-07 努比亚技术有限公司 Noise reduction method and apparatus in terminal voice interaction mode
KR20160055337A (en) * 2014-11-07 2016-05-18 삼성전자주식회사 Method for displaying text and electronic device thereof
CN106772247A (en) * 2016-11-30 2017-05-31 努比亚技术有限公司 A kind of terminal and sound localization method
CN109784128A (en) * 2017-11-14 2019-05-21 幻视互动(北京)科技有限公司 Mixed reality intelligent glasses with text and language process function
CN110673819A (en) * 2019-09-18 2020-01-10 联想(北京)有限公司 Information processing method and electronic equipment
CN111128180A (en) * 2019-11-22 2020-05-08 北京理工大学 Auxiliary dialogue system for hearing-impaired people
CN111464827A (en) * 2020-04-20 2020-07-28 玉环智寻信息技术有限公司 Data processing method and device, computing equipment and storage medium
CN111597828A (en) * 2020-05-06 2020-08-28 Oppo广东移动通信有限公司 Translation display method and device, head-mounted display equipment and storage medium
CN111627456A (en) * 2020-05-13 2020-09-04 广州国音智能科技有限公司 Noise elimination method, device, equipment and readable storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819758A (en) * 2009-12-22 2010-09-01 中兴通讯股份有限公司 System of controlling screen display by voice and implementation method
KR20160055337A (en) * 2014-11-07 2016-05-18 삼성전자주식회사 Method for displaying text and electronic device thereof
CN104967717A (en) * 2015-05-26 2015-10-07 努比亚技术有限公司 Noise reduction method and apparatus in terminal voice interaction mode
CN106772247A (en) * 2016-11-30 2017-05-31 努比亚技术有限公司 A kind of terminal and sound localization method
CN109784128A (en) * 2017-11-14 2019-05-21 幻视互动(北京)科技有限公司 Mixed reality intelligent glasses with text and language process function
CN110673819A (en) * 2019-09-18 2020-01-10 联想(北京)有限公司 Information processing method and electronic equipment
CN111128180A (en) * 2019-11-22 2020-05-08 北京理工大学 Auxiliary dialogue system for hearing-impaired people
CN111464827A (en) * 2020-04-20 2020-07-28 玉环智寻信息技术有限公司 Data processing method and device, computing equipment and storage medium
CN111597828A (en) * 2020-05-06 2020-08-28 Oppo广东移动通信有限公司 Translation display method and device, head-mounted display equipment and storage medium
CN111627456A (en) * 2020-05-13 2020-09-04 广州国音智能科技有限公司 Noise elimination method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
US20180366107A1 (en) Method and device for training acoustic model, computer device and storage medium
US9672829B2 (en) Extracting and displaying key points of a video conference
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN109614934B (en) Online teaching quality assessment parameter generation method and device
CN109325091B (en) Method, device, equipment and medium for updating attribute information of interest points
CN110473525B (en) Method and device for acquiring voice training sample
CN108962282A (en) Speech detection analysis method, apparatus, computer equipment and storage medium
US9324325B2 (en) Converting data between users during a data exchange session
CN110808034A (en) Voice conversion method, device, storage medium and electronic equipment
CN109410918B (en) Method and device for acquiring information
Mirzaei et al. Combining augmented reality and speech technologies to help deaf and hard of hearing people
US9613140B2 (en) Real-time audio dictionary updating system
CN112261456A (en) Voice bullet screen display method, device, equipment and storage medium
Mirzaei et al. Audio-visual speech recognition techniques in augmented reality environments
CN111343410A (en) Mute prompt method and device, electronic equipment and storage medium
CN112364144B (en) Interaction method, device, equipment and computer readable medium
JP2023059937A (en) Data interaction method and device, electronic apparatus, storage medium and program
CN111785247A (en) Voice generation method, device, equipment and computer readable medium
CN111354362A (en) Method and device for assisting hearing-impaired communication
US10650803B2 (en) Mapping between speech signal and transcript
CN112309389A (en) Information interaction method and device
CN112837672A (en) Method and device for determining conversation affiliation, electronic equipment and storage medium
CN112185354A (en) Voice text display method, device, equipment and storage medium
CN112242143A (en) Voice interaction method and device, terminal equipment and storage medium
CN113312928A (en) Text translation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination