CN112185354A - Voice text display method, device, equipment and storage medium - Google Patents

Voice text display method, device, equipment and storage medium

Info

Publication number
CN112185354A
Authority
CN
China
Prior art keywords
voice
determining
target
volume
text content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010980844.2A
Other languages
Chinese (zh)
Inventor
余逸尘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Tonghuashun Intelligent Technology Co Ltd
Original Assignee
Zhejiang Tonghuashun Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Tonghuashun Intelligent Technology Co Ltd filed Critical Zhejiang Tonghuashun Intelligent Technology Co Ltd
Priority to CN202010980844.2A priority Critical patent/CN112185354A/en
Publication of CN112185354A publication Critical patent/CN112185354A/en
Priority to US17/447,068 priority patent/US20220084525A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/103 Formatting, i.e. changing of presentation of documents
    • G06F 40/109 Font handling; Temporal or kinetic typography
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the invention discloses a method, a device, equipment and a storage medium for displaying voice text. The method comprises the following steps: determining first position information of a sound source according to speech collected by a microphone array; converting the speech into text content and determining a target font size corresponding to the text content according to the volume of the speech; and displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size. The method for displaying voice text provided by the embodiment of the invention can associate the voice text with the position and volume of the speaker, thereby improving the display effect of the voice text.

Description

Voice text display method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of voice recognition, in particular to a method, a device, equipment and a storage medium for displaying a voice text.
Background
Speech visualization may be understood as converting speech into text by means of automatic speech recognition techniques, so that the content of the speech is directly visible in text form. Speaker recognition, in turn, identifies the speaker corresponding to a given segment of audio based on its audio characteristics. However, in a multi-person conversation scene, neither approach can associate the speech with the speaker's position and volume, so the display effect of the voice text is poor.
Disclosure of Invention
The invention provides a method, a device, equipment and a storage medium for displaying voice text, which can associate the voice text with the position and volume of the speaker and improve the display effect of the voice text.
In a first aspect, an embodiment of the present invention provides a method for displaying a speech text, including:
determining first position information of a sound source according to the voice collected by the microphone array;
converting the voice into text content, and determining the size of a target font corresponding to the text content according to the volume of the voice;
and displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size.
Further, there are at least three microphone arrays arranged at different positions, and determining the first position information of the sound source according to the speech collected by the microphone arrays includes:
acquiring voice phase differences acquired by each microphone array;
first position information of the sound source relative to the target microphone array is determined based on the speech phase difference.
Further, the first position information includes a distance between an audio source and the target microphone array, and determining the volume of the speech includes:
determining a first volume of the speech when acquired by a target microphone array;
determining a volume of speech generated from the sound source based on a set volume attenuation formula according to a distance between the sound source and the target microphone array.
Further, determining a target font size corresponding to the text content according to the volume of the voice includes:
acquiring a reference font size corresponding to the reference volume;
determining a ratio of a volume of the speech to the reference volume;
and determining the target font size corresponding to the text content according to the proportion and the reference font size.
Further, the volume of the speech is characterized by an amplitude; and determining the target font size corresponding to the text content according to the volume of the speech includes:
acquiring the maximum amplitude and the minimum amplitude in the voice contained in the current conversation scene;
normalizing the amplitude of each voice according to the maximum amplitude and the minimum amplitude;
and determining the target font size corresponding to each voice according to the amplitude after the normalization processing.
Further, displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size includes:
acquiring second position information of the target microphone array in a set three-dimensional coordinate system;
determining target position information of the sound source in the set three-dimensional coordinate system according to the first position information and the second position information;
displaying the text content in the target position at the target font size.
Further, after determining the size of the target font corresponding to the text content according to the volume of the voice, the method further includes:
determining emotion information of a sound source according to the voice;
determining the color of the target font according to the emotion information;
correspondingly, displaying the text content in the target position in the target font size includes:
displaying the text content in the target location in the target font size and the color.
In a second aspect, an embodiment of the present invention further provides a device for displaying a speech text, including:
the first position information determining module is used for determining first position information of a sound source according to the voice collected by the microphone array;
the target font size determining module is used for converting the voice into text content and determining the target font size corresponding to the text content according to the volume of the voice;
and the text content display module is used for displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size.
Further, there are at least three microphone arrays arranged at different positions, and the first position information determining module is further configured to:
acquiring voice phase differences acquired by each microphone array;
first position information of the sound source relative to the target microphone array is determined based on the speech phase difference.
In a third aspect, an embodiment of the present invention further provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for displaying voice text according to an embodiment of the invention when executing the program.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processing apparatus, implements the method for displaying a speech text according to the embodiment of the present invention.
The embodiment of the invention discloses a method, a device, equipment and a storage medium for displaying voice text. The method for displaying voice text provided by the embodiment of the invention can associate the voice text with the position and volume of the speaker, thereby improving the display effect of the voice text.
Drawings
Fig. 1 is a flowchart of a method for displaying a speech text according to a first embodiment of the present invention;
FIG. 2 is a diagram illustrating the effect of displaying a voice text according to a first embodiment of the present invention;
fig. 3 is a schematic structural diagram of a device for displaying a speech text according to a second embodiment of the present invention;
fig. 4 is a schematic structural diagram of a computer device in the third embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a method for displaying a speech text according to an embodiment of the present invention, where the embodiment is applicable to a case where a text converted from speech is displayed, and the method may be executed by a speech text display apparatus, where the apparatus may be composed of hardware and/or software, and may be generally integrated in a device having a function of displaying a speech text, where the device may be an electronic device such as a server or a server cluster. As shown in fig. 1, the method specifically includes the following steps:
step 110, determining first position information of a sound source according to the voice collected by the microphone array.
The number of microphone arrays may be at least three, and the microphone arrays are arranged at different positions in the current conversation scene. A conversation scene may be understood as a venue for a multi-person conversation, such as a conference room or a classroom. The first position information may be understood as the position of the sound source relative to the target microphone array, which may be any one of the microphone arrays; it includes the distance between the sound source and the target microphone array.
In this embodiment, the manner of determining the first position information of the sound source according to the speech acquired by the microphone array may be: acquiring voice phase differences acquired by each microphone array; first position information of the audio source relative to the target microphone array is determined based on the speech phase difference.
The voice phase difference can be determined according to the time delay of the voice signals received by each microphone array.
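As an illustration only (not part of the patent), the time delay between two microphone arrays can be estimated from the cross-correlation of their recorded signals; the following minimal sketch assumes single-channel NumPy signals sampled at the same rate, and all names in it are hypothetical.

    import numpy as np

    def estimate_delay(sig_a: np.ndarray, sig_b: np.ndarray, sample_rate: int) -> float:
        """Estimate the arrival-time difference (in seconds) of the same speech
        at two microphone arrays from the peak of their cross-correlation."""
        corr = np.correlate(sig_a, sig_b, mode="full")
        # Zero lag sits at index len(sig_b) - 1; the peak offset from it is the lag in samples.
        lag = np.argmax(corr) - (len(sig_b) - 1)
        return lag / sample_rate

Multiplying the delay by the speed of sound gives the path-length difference that localization methods build on.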
Optionally, the first position information of the sound source may also be determined from the speech collected by the microphone arrays as follows: determining the distance between the sound source and each microphone array; taking each microphone array as the center of a sphere whose radius is the corresponding distance; obtaining the intersection point of the spheres, which is the position of the sound source; and taking the position of the intersection point relative to the target microphone array as the first position information.
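As a sketch of the sphere-intersection idea described above (assuming the world coordinates of the microphone arrays and the source-to-array distances are already known; a linear least-squares solution is one common way to find the intersection, and four or more arrays make it well determined):

    import numpy as np

    def locate_source(mic_positions: np.ndarray, distances: np.ndarray) -> np.ndarray:
        """Find the point whose distance to each microphone array matches the
        measured distances, i.e. the common intersection of the spheres.

        mic_positions: (N, 3) array of world coordinates of the arrays.
        distances:     (N,)  array of source-to-array distances.
        """
        m0, d0 = mic_positions[0], distances[0]
        # Subtracting the first sphere equation |p - m0|^2 = d0^2 from the others
        # removes the quadratic term and leaves a linear system in p.
        A = 2.0 * (mic_positions[1:] - m0)
        b = (d0 ** 2 - distances[1:] ** 2
             + np.sum(mic_positions[1:] ** 2, axis=1) - np.sum(m0 ** 2))
        position, *_ = np.linalg.lstsq(A, b, rcond=None)
        return position

The result can then be expressed relative to the target microphone array by subtracting that array's coordinates, giving the first position information.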
Step 120, converting the voice into text content, and determining a target font size corresponding to the text content according to the volume of the voice.
The collected speech can be converted into text content by Automatic Speech Recognition (ASR).
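The patent does not prescribe a particular recognizer. Purely as an illustration, the open-source SpeechRecognition package for Python can turn a recorded WAV segment into text; the file path and language code below are assumptions.

    import speech_recognition as sr

    def speech_to_text(wav_path: str) -> str:
        """Convert a recorded speech segment into text content via ASR."""
        recognizer = sr.Recognizer()
        with sr.AudioFile(wav_path) as source:
            audio = recognizer.record(source)  # read the whole segment
        # Any ASR backend could be used here; Google's free web API is one example.
        return recognizer.recognize_google(audio, language="zh-CN")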
The volume of the speech can be determined from the amplitude of the speech signal; here, the volume of the speech refers to the volume at which it was produced by the sound source.
In this embodiment, the manner of determining the volume of the voice may be: determining a first volume of speech when collected by a target microphone array; the volume of speech produced by the source of sound is determined based on a set volume attenuation formula based on the distance of the source of sound from the target microphone array.
The set volume attenuation formula can be obtained by fitting volume test measurements. Illustratively, microphone arrays are placed at different distances from a sound source, and the relationship between volume and propagation distance is established from the volume of the speech collected by each microphone array at its distance, yielding the volume attenuation formula.
Specifically, the first volume and the distance between the sound source and the target microphone array are substituted into a set volume attenuation formula to obtain the volume when the voice is generated by the sound source.
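The attenuation formula itself is obtained by experimental fitting; as a stand-in only, the sketch below assumes free-field spherical spreading (a loss of 20 * log10(d / d_ref) dB, i.e. about 6 dB per doubling of distance) and a 1 m reference distance.

    import math

    def volume_at_source(measured_db: float, distance_m: float, ref_distance_m: float = 1.0) -> float:
        """Back out the volume (in dB) at the reference distance from the sound source,
        given the first volume measured at the target microphone array.

        Assumes the inverse-distance law; a formula fitted from calibration
        measurements would replace this in practice.
        """
        return measured_db + 20.0 * math.log10(distance_m / ref_distance_m)

For example, 55 dB measured 2 m away corresponds to roughly 61 dB at 1 m from the source.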
Optionally, the manner of determining the target font size corresponding to the text content according to the volume of the voice may be: acquiring a reference font size corresponding to the reference volume; determining the ratio of the volume of the voice to the reference volume; and determining the target font size corresponding to the text content according to the proportion and the reference font size.
The reference volume can be preset; for example, the average volume at which people speak can be obtained through big-data analysis and used as the reference volume, and a corresponding reference font size is assigned to it. Font size may be expressed as a type (point) size.
Specifically, the ratio of the volume of the collected speech to the reference volume is calculated, and the target font size is obtained by multiplying this ratio by the reference font size. For example, assuming the reference volume is A0 and the reference font size is B0, if the volume of the captured speech is A1, the target font size is B1 = B0 × A1 / A0.
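A minimal sketch of this ratio rule (the reference values are placeholders, not values from the patent):

    def target_font_size(volume: float, ref_volume: float = 60.0, ref_font_size: float = 14.0) -> float:
        """Scale the reference font size by the ratio of the speech volume to the
        reference volume: B1 = B0 * A1 / A0."""
        return ref_font_size * volume / ref_volume

    # e.g. target_font_size(75.0) -> 14 * 75 / 60 = 17.5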
Optionally, the volume of the speech is represented by amplitude, and the manner of determining the target font size corresponding to the text content according to the volume of the speech may be: acquiring the maximum amplitude and the minimum amplitude in the voice contained in the current conversation scene; normalizing the amplitude of each voice according to the maximum amplitude and the minimum amplitude; and determining the target font size corresponding to each voice according to the amplitude after the normalization processing.
The normalization formula is G = (S - Smin) / (Smax - Smin), where S is the amplitude of the collected speech, Smax is the maximum amplitude among the voices, Smin is the minimum amplitude among the voices, and G is the normalized amplitude. In this embodiment, a font size range may be set, in which the voice with the largest amplitude corresponds to the largest font in the range and the voice with the smallest amplitude corresponds to the smallest font in the range; the font size corresponding to an acquired voice is obtained by multiplying its normalized amplitude by the span of the range and adding the smallest font size. For example, assuming a font size range of 10-25, i.e. a span of 15, and a normalized amplitude of 0.6, the acquired speech corresponds to a font size of 10 + 15 × 0.6 = 19. The advantage of this approach is that it prevents the text from being rendered too large or too small because of an ill-chosen reference volume, which would spoil the appearance of the displayed text content.
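A minimal sketch of the normalization rule, using the 10-25 font size range from the example above (the range itself is configurable):

    def font_size_from_amplitude(amplitude: float, amp_min: float, amp_max: float,
                                 font_min: float = 10.0, font_max: float = 25.0) -> float:
        """Map a speech amplitude to a font size by min-max normalization:
        G = (S - Smin) / (Smax - Smin), size = font_min + (font_max - font_min) * G."""
        g = (amplitude - amp_min) / (amp_max - amp_min)
        return font_min + (font_max - font_min) * g

    # e.g. with a normalized amplitude of 0.6 this yields 10 + 15 * 0.6 = 19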
And step 130, displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size.
Specifically, the process of displaying the text content in the set three-dimensional coordinate system according to the first position information and the target font size may be: acquiring second position information of the target microphone array in a set three-dimensional coordinate system; determining target position information of the sound source in a set three-dimensional coordinate system according to the first position information and the second position information; the text content is displayed in the target position at the target font size.
Assuming that the position information of the target microphone array in the set three-dimensional coordinate system is (x1, y1, z1) and the position information of the sound source relative to the target microphone array is (x2, y2, z2), the position information of the sound source in the set three-dimensional coordinate system is (x2-x1, y2-y1, z2-z1). Illustratively, fig. 2 is an effect diagram of the voice text display in this embodiment. As shown in fig. 2, in the conversation scene the scheme of this embodiment accurately displays what the teacher and each student said, and reflects the students' positions and speaking volumes.
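A minimal sketch of the display step, using matplotlib's 3D axes as a stand-in for the set three-dimensional coordinate system (the coordinates, text and font size below are placeholder values):

    import matplotlib.pyplot as plt

    def show_text_in_3d(text: str, position: tuple, font_size: float, color: str = "black") -> None:
        """Render recognized text at the sound source's position in a 3D scene."""
        fig = plt.figure()
        ax = fig.add_subplot(projection="3d")
        ax.set_xlim(0, 10); ax.set_ylim(0, 10); ax.set_zlim(0, 3)
        x, y, z = position
        ax.text(x, y, z, text, fontsize=font_size, color=color)
        plt.show()

    # e.g. show_text_in_3d("Hello everyone", (3.2, 4.5, 1.6), 19)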
Optionally, after determining the size of the target font corresponding to the text content according to the volume of the voice, the method further includes the following steps: determining emotion information of a sound source according to the voice; and determining the color of the target font according to the emotional information.
The emotion information of the sound source may be determined by inputting the speech into a trained emotion recognition model for analysis. The emotion recognition model can be trained on a large number of voice samples labeled with emotion information. In this embodiment, a correspondence between emotion information and font color may be established in advance; illustratively, happy corresponds to red, sad corresponds to blue, and so on.
Specifically, after the font color corresponding to the speech is obtained, the text content is displayed at the target position in the target font size and the corresponding color, so that the text content reflects the position, volume and emotion of the speaker.
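A minimal sketch of the emotion-to-color lookup (the labels and colors are illustrative; the trained emotion recognition model itself is assumed to exist elsewhere):

    # Hypothetical mapping from recognized emotion labels to font colors.
    EMOTION_COLORS = {"happy": "red", "sad": "blue", "neutral": "black"}

    def font_color_for_emotion(emotion: str) -> str:
        """Return the display color for a recognized emotion, defaulting to black."""
        return EMOTION_COLORS.get(emotion, "black")

The returned color can then be passed as the color argument of the 3D display sketch above.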
According to the technical scheme of this embodiment, first position information of a sound source is determined according to the speech collected by the microphone array, the speech is converted into text content, the target font size corresponding to the text content is determined according to the volume of the speech, and finally the text content is displayed in a set three-dimensional coordinate system according to the first position information and the target font size. The method for displaying voice text provided by the embodiment of the invention can associate the voice text with the position and volume of the speaker, thereby improving the display effect of the voice text.
Example two
Fig. 3 is a schematic structural diagram of a device for displaying a speech text according to a second embodiment of the present invention. As shown in fig. 3, the apparatus includes: a first location information determining module 210, a target font size determining module 220, and a text content display module 230.
A first position information determining module 210, configured to determine first position information of a sound source according to a voice collected by the microphone array;
a target font size determining module 220, configured to convert the voice into text content, and determine a target font size corresponding to the text content according to the volume of the voice;
and the text content display module 230 is configured to display the text content in the set three-dimensional coordinate system according to the first position information and the target font size.
Optionally, there are at least three microphone arrays disposed at different positions, and the first position information determining module 210 is further configured to:
acquiring voice phase differences acquired by each microphone array;
first position information of the audio source relative to the target microphone array is determined based on the speech phase difference.
Optionally, the first position information includes a distance between the sound source and the target microphone array, and determining the volume of the speech includes:
determining a first volume of speech when collected by a target microphone array;
the volume of speech produced by the source of sound is determined based on a set volume attenuation formula based on the distance of the source of sound from the target microphone array.
Optionally, the target font size determining module 220 is further configured to:
acquiring a reference font size corresponding to the reference volume;
determining the ratio of the volume of the voice to the reference volume;
and determining the target font size corresponding to the text content according to the proportion and the reference font size.
Optionally, the target font size determining module 220 is further configured to:
acquiring the maximum amplitude and the minimum amplitude in the voice contained in the current conversation scene;
normalizing the amplitude of each voice according to the maximum amplitude and the minimum amplitude;
and determining the target font size corresponding to each voice according to the amplitude after the normalization processing.
Optionally, the text content display module 230 is further configured to:
acquiring second position information of the target microphone array in a set three-dimensional coordinate system;
determining target position information of the sound source in a set three-dimensional coordinate system according to the first position information and the second position information;
the text content is displayed in the target position at the target font size.
Optionally, the method further includes: a font color determination module to: determining emotion information of a sound source according to the voice; and determining the color of the target font according to the emotional information.
Optionally, the text content display module 230 is further configured to:
the text content is displayed in the target position with the target font size and color.
The device can execute the methods provided by all the embodiments of the invention, and has corresponding functional modules and beneficial effects for executing the methods. For details not described in detail in this embodiment, reference may be made to the methods provided in all the foregoing embodiments of the present invention.
EXAMPLE III
Fig. 4 is a schematic structural diagram of a computer device according to a third embodiment of the present invention. FIG. 4 illustrates a block diagram of a computer device 312 suitable for use in implementing embodiments of the present invention. The computer device 312 shown in FIG. 4 is only an example and should not impose any limitation on the functionality or scope of use of embodiments of the present invention. The device 312 is typically a computing device that undertakes the display of voice text.
As shown in FIG. 4, computer device 312 is in the form of a general purpose computing device. The components of computer device 312 may include, but are not limited to: one or more processors 316, a storage device 328, and a bus 318 that couples the various system components including the storage device 328 and the processors 316.
Bus 318 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Computer device 312 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 312 and includes both volatile and nonvolatile media, removable and non-removable media.
Storage 328 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 330 and/or cache Memory 332. The computer device 312 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 334 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, and commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk-Read Only Memory (CD-ROM), a Digital Video disk (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 318 by one or more data media interfaces. Storage 328 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
Program 336 having a set (at least one) of program modules 326 may be stored, for example, in storage 328, such program modules 326 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which may comprise an implementation of a network environment, or some combination thereof. Program modules 326 generally carry out the functions and/or methodologies of embodiments of the invention as described herein.
The computer device 312 may also communicate with one or more external devices 314 (e.g., keyboard, pointing device, camera, display 324, etc.), with one or more devices that enable a user to interact with the computer device 312, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 312 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 322. Also, computer device 312 may communicate with one or more networks (e.g., a Local Area Network (LAN), Wide Area Network (WAN), etc.) and/or a public Network, such as the internet, via Network adapter 320. As shown, network adapter 320 communicates with the other modules of computer device 312 via bus 318. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer device 312, including but not limited to: microcode, device drivers, Redundant processing units, external disk drive Arrays, disk array (RAID) systems, tape drives, and data backup storage systems, to name a few.
The processor 316 executes various functional applications and data processing, such as implementing the display method of voice text provided by the above-described embodiments of the present invention, by executing programs stored in the storage 328.
Example four
Embodiments of the present invention provide a computer-readable storage medium on which a computer program is stored, the program, when executed by a processing apparatus, implementing a method for displaying a speech text as in embodiments of the present invention. The computer readable medium of the present invention described above may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may interconnect with digital data communication in any form or medium (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: determining first position information of a sound source according to the voice collected by the microphone array; converting the voice into text content, and determining the size of a target font corresponding to the text content according to the volume of the voice; and displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of an element does not in some cases constitute a limitation on the element itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method for displaying a speech text, comprising:
determining first position information of a sound source according to the voice collected by the microphone array;
converting the voice into text content, and determining the size of a target font corresponding to the text content according to the volume of the voice;
and displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size.
2. The method of claim 1, wherein the microphone arrays are at least three and are disposed at different positions, and the determining the first position information of the sound source according to the voices collected by the microphone arrays comprises:
acquiring voice phase differences acquired by each microphone array;
first position information of the sound source relative to the target microphone array is determined based on the speech phase difference.
3. The method of claim 2, wherein the first location information comprises a distance of an audio source from the target microphone array, and wherein determining the volume of the speech comprises:
determining a first volume of the speech when acquired by a target microphone array;
determining a volume of speech generated from the sound source based on a set volume attenuation formula according to a distance between the sound source and the target microphone array.
4. The method of claim 1, wherein determining a target font size corresponding to the text content according to the volume of the speech comprises:
acquiring a reference font size corresponding to the reference volume;
determining a ratio of a volume of the speech to the reference volume;
and determining the target font size corresponding to the text content according to the proportion and the reference font size.
5. The method of claim 1, wherein the volume of the speech is characterized by an amplitude; determining the size of a target font corresponding to the text content according to the volume of the voice, wherein the step comprises the following steps:
acquiring the maximum amplitude and the minimum amplitude in the voice contained in the current conversation scene;
normalizing the amplitude of each voice according to the maximum amplitude and the minimum amplitude;
and determining the target font size corresponding to each voice according to the amplitude after the normalization processing.
6. The method of claim 2, wherein displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size comprises:
acquiring second position information of the target microphone array in a set three-dimensional coordinate system;
determining target position information of the sound source in the set three-dimensional coordinate system according to the first position information and the second position information;
displaying the text content in the target position at the target font size.
7. The method of claim 6, after determining a target font size corresponding to the text content according to the volume of the voice, further comprising:
determining emotion information of a sound source according to the voice;
determining the color of the target font according to the emotion information;
correspondingly, displaying the text content in the target position in the target font size includes:
displaying the text content in the target location in the target font size and the color.
8. A device for displaying phonetic text, comprising:
the first position information determining module is used for determining first position information of a sound source according to the voice collected by the microphone array;
the target font size determining module is used for converting the voice into text content and determining the target font size corresponding to the text content according to the volume of the voice;
and the text content display module is used for displaying the text content in a set three-dimensional coordinate system according to the first position information and the target font size.
9. A computer device, the device comprising: comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of displaying a phonetic text according to any one of claims 1-7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by processing means, carries out a method of displaying a phonetic text according to any one of claims 1 to 7.
CN202010980844.2A 2020-09-17 2020-09-17 Voice text display method, device, equipment and storage medium Pending CN112185354A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010980844.2A CN112185354A (en) 2020-09-17 2020-09-17 Voice text display method, device, equipment and storage medium
US17/447,068 US20220084525A1 (en) 2020-09-17 2021-09-08 Systems and methods for voice audio data processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010980844.2A CN112185354A (en) 2020-09-17 2020-09-17 Voice text display method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112185354A (en) 2021-01-05

Family

ID=73920319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010980844.2A Pending CN112185354A (en) 2020-09-17 2020-09-17 Voice text display method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112185354A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819758A (en) * 2009-12-22 2010-09-01 中兴通讯股份有限公司 System of controlling screen display by voice and implementation method
CN104967717A (en) * 2015-05-26 2015-10-07 努比亚技术有限公司 Noise reduction method and apparatus in terminal voice interaction mode
KR20160055337A (en) * 2014-11-07 2016-05-18 삼성전자주식회사 Method for displaying text and electronic device thereof
CN106772247A (en) * 2016-11-30 2017-05-31 努比亚技术有限公司 A kind of terminal and sound localization method
CN109784128A (en) * 2017-11-14 2019-05-21 幻视互动(北京)科技有限公司 Mixed reality intelligent glasses with text and language process function
CN110673819A (en) * 2019-09-18 2020-01-10 联想(北京)有限公司 Information processing method and electronic equipment
CN111128180A (en) * 2019-11-22 2020-05-08 北京理工大学 Auxiliary dialogue system for hearing-impaired people
CN111464827A (en) * 2020-04-20 2020-07-28 玉环智寻信息技术有限公司 Data processing method and device, computing equipment and storage medium
CN111597828A (en) * 2020-05-06 2020-08-28 Oppo广东移动通信有限公司 Translation display method and device, head-mounted display equipment and storage medium
CN111627456A (en) * 2020-05-13 2020-09-04 广州国音智能科技有限公司 Noise elimination method, device, equipment and readable storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101819758A (en) * 2009-12-22 2010-09-01 中兴通讯股份有限公司 System of controlling screen display by voice and implementation method
KR20160055337A (en) * 2014-11-07 2016-05-18 삼성전자주식회사 Method for displaying text and electronic device thereof
CN104967717A (en) * 2015-05-26 2015-10-07 努比亚技术有限公司 Noise reduction method and apparatus in terminal voice interaction mode
CN106772247A (en) * 2016-11-30 2017-05-31 努比亚技术有限公司 A kind of terminal and sound localization method
CN109784128A (en) * 2017-11-14 2019-05-21 幻视互动(北京)科技有限公司 Mixed reality intelligent glasses with text and language process function
CN110673819A (en) * 2019-09-18 2020-01-10 联想(北京)有限公司 Information processing method and electronic equipment
CN111128180A (en) * 2019-11-22 2020-05-08 北京理工大学 Auxiliary dialogue system for hearing-impaired people
CN111464827A (en) * 2020-04-20 2020-07-28 玉环智寻信息技术有限公司 Data processing method and device, computing equipment and storage medium
CN111597828A (en) * 2020-05-06 2020-08-28 Oppo广东移动通信有限公司 Translation display method and device, head-mounted display equipment and storage medium
CN111627456A (en) * 2020-05-13 2020-09-04 广州国音智能科技有限公司 Noise elimination method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
US20180366107A1 (en) Method and device for training acoustic model, computer device and storage medium
US9672829B2 (en) Extracting and displaying key points of a video conference
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN109614934B (en) Online teaching quality assessment parameter generation method and device
CN109325091B (en) Method, device, equipment and medium for updating attribute information of interest points
CN110473525B (en) Method and device for acquiring voice training sample
CN108962282A (en) Speech detection analysis method, apparatus, computer equipment and storage medium
US9324325B2 (en) Converting data between users during a data exchange session
CN110808034A (en) Voice conversion method, device, storage medium and electronic equipment
CN109410918B (en) Method and device for acquiring information
Mirzaei et al. Combining augmented reality and speech technologies to help deaf and hard of hearing people
US9613140B2 (en) Real-time audio dictionary updating system
CN112261456A (en) Voice bullet screen display method, device, equipment and storage medium
Mirzaei et al. Audio-visual speech recognition techniques in augmented reality environments
CN111343410A (en) Mute prompt method and device, electronic equipment and storage medium
CN112364144B (en) Interaction method, device, equipment and computer readable medium
JP2023059937A (en) Data interaction method and device, electronic apparatus, storage medium and program
CN111785247A (en) Voice generation method, device, equipment and computer readable medium
CN111354362A (en) Method and device for assisting hearing-impaired communication
US10650803B2 (en) Mapping between speech signal and transcript
CN112309389A (en) Information interaction method and device
CN112837672A (en) Method and device for determining conversation affiliation, electronic equipment and storage medium
CN112185354A (en) Voice text display method, device, equipment and storage medium
CN112242143A (en) Voice interaction method and device, terminal equipment and storage medium
CN113312928A (en) Text translation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination