CN109754808B - Method, device, computer equipment and storage medium for converting voice into text - Google Patents


Info

Publication number
CN109754808B
CN109754808B (application CN201811526588.9A)
Authority
CN
China
Prior art keywords
voice information
voice
preset
text
converting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811526588.9A
Other languages
Chinese (zh)
Other versions
CN109754808A (en)
Inventor
胡大兵
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201811526588.9A priority Critical patent/CN109754808B/en
Publication of CN109754808A publication Critical patent/CN109754808A/en
Application granted granted Critical
Publication of CN109754808B publication Critical patent/CN109754808B/en

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a method, a device, computer equipment and a storage medium for converting voice into text, comprising the following steps: acquiring voice information to be processed; segmenting the voice information according to a preset sentence-breaking rule; and converting the segmented voice information into text. Segmenting the voice information by the preset sentence-breaking rule and then converting the segments into text increases the readability of the resulting text, avoiding unnecessary misreading or ambiguity.

Description

Method, device, computer equipment and storage medium for converting voice into text
Technical Field
The embodiment of the invention relates to the field of finance, and in particular to a method, a device, computer equipment and a storage medium for converting voice into text.
Background
Speech recognition is a rapidly developing technology with broad application scenarios across industry, household appliances, communications, automotive electronics, medical care, home services, consumer electronics, and other fields. The fields involved in speech recognition technology include signal processing, pattern recognition, probability and information theory, speech production and auditory mechanisms, artificial intelligence, and so on.
In the prior art, speech can be converted into text by speech recognition. However, when a user speaks continuously, or when several people talk in some scenarios, the converted text has no sentence breaks and the speakers cannot be distinguished, so the text resulting from the conversion may be ambiguous or misleading.
Disclosure of Invention
The embodiment of the invention provides a method, a device, computer equipment and a storage medium for converting voice into text.
In order to solve the above technical problem, the embodiment of the invention adopts the following technical scheme: a method for converting voice into text is provided, comprising the following steps:
acquiring voice information to be processed;
segmenting the voice information according to a preset sentence-breaking rule;
and converting the segmented voice information into text.
Optionally, the segmenting the voice information according to a preset sentence-breaking rule includes:
detecting a decibel value in the voice information;
when the decibel value in the voice information is smaller than a preset decibel value, taking the position of the decibel value smaller than the preset decibel value as a first segmentation point of the voice information;
and segmenting the voice information according to the first segmentation point.
Optionally, taking the position where the decibel value is smaller than the preset decibel value as the first segmentation point of the voice information includes:
determining the duration over which the decibel value of the voice information remains below the preset decibel value;
when the duration is longer than a preset duration, acquiring the voice segment whose decibel value is smaller than the preset decibel value;
and taking any moment within the voice segment as the first segmentation point.
Optionally, the segmenting the voice information according to a preset sentence-breaking rule includes:
judging whether a timbre change occurs in the voice information;
when a timbre change occurs in the voice information, taking the position of the timbre change as a second segmentation point of the voice information;
and segmenting the voice information according to the second segmentation point.
Optionally, the converting the segmented voice information into text includes:
performing timbre marking on the segmented voice information, segments with the same timbre receiving the same mark;
converting the marked segmented voice into text through preset voice conversion software;
and performing role marking on the converted text according to the timbre marks.
Optionally, the converting the segmented voice information into text includes:
converting the segmented voice into target text through preset voice conversion software;
acquiring mood keywords in the target text;
searching a preset information table for punctuation marks having a mapping relation with the mood keywords, and adding the punctuation marks to the target text.
Optionally, the acquiring the voice information to be processed includes:
collecting voice information of a user;
and carrying out noise reduction processing on the voice information according to preset processing software.
In order to solve the above technical problem, an embodiment of the present invention further provides a voice-to-text device, including:
the acquisition module is used for acquiring the voice information to be processed;
the processing module is used for segmenting the voice information according to a preset sentence-breaking rule;
and the execution module is used for converting the segmented voice information into text.
Optionally, the processing module includes:
the first processing sub-module is used for detecting decibel values in the voice information;
the second processing sub-module is used for taking the position of which the decibel value is smaller than a preset decibel value as a first segmentation point of the voice information when the decibel value in the voice information is smaller than the preset decibel value;
and the first execution sub-module is used for segmenting the voice information according to the first segmentation point.
Optionally, the second processing sub-module includes:
the third processing sub-module is used for determining the duration over which the decibel value of the voice information remains below the preset decibel value;
the first acquisition submodule is used for acquiring a voice fragment with a decibel value smaller than a preset decibel value when the voice time length is longer than the preset time length;
and the second execution sub-module is used for taking any time in the voice fragment as the first segmentation point.
Optionally, the processing module includes:
a fourth processing sub-module, configured to judge whether a timbre change occurs in the voice information;
a fifth processing sub-module, configured to take, when a timbre change occurs in the voice information, the position of the timbre change as a second segmentation point of the voice information;
and the third execution sub-module is used for segmenting the voice information according to the second segmentation point.
Optionally, the execution module includes:
a sixth processing sub-module, configured to perform timbre marking on the segmented voice information, segments with the same timbre receiving the same mark;
a seventh processing sub-module, configured to convert the marked segmented speech into text through preset speech conversion software;
and the fourth execution sub-module is used for performing role marking on the converted text according to the timbre marks.
Optionally, the execution module includes:
an eighth processing sub-module, configured to convert the segmented speech into target text through preset speech conversion software;
the second acquisition sub-module is used for acquiring the mood keywords in the target text;
and the fifth execution sub-module is used for searching a preset information table for punctuation marks having a mapping relation with the mood keywords, and adding the punctuation marks to the target text.
Optionally, the acquiring module includes:
the third acquisition sub-module is used for acquiring voice information of the user;
and the ninth processing sub-module is used for carrying out noise reduction processing on the voice information according to preset processing software.
In order to solve the above technical problem, an embodiment of the present invention further provides a computer device, including a memory and a processor, where the memory stores computer readable instructions which, when executed by the processor, cause the processor to execute the steps of the above voice-to-text method.
To solve the above technical problem, an embodiment of the present invention further provides a storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above voice-to-text method.
The beneficial effects of the embodiment of the invention are as follows: the voice information is segmented according to a preset sentence-breaking rule, and the segmented voice information is converted into text; segmenting the text in this way increases its readability and avoids unnecessary misreading or ambiguity.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and that a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a basic flow diagram of a method for converting voice into text according to an embodiment of the present invention;
fig. 2 is a basic flow diagram of a method for segmenting speech information according to a preset sentence-breaking rule according to an embodiment of the present invention;
FIG. 3 is a basic flow diagram of a method for taking a position whose decibel value is smaller than a preset decibel value as a first segmentation point of voice information according to an embodiment of the present invention;
FIG. 4 is a basic flow diagram of a method for segmenting speech information according to a preset sentence-breaking rule according to an embodiment of the present invention;
FIG. 5 is a basic flow diagram of a method for converting segmented speech information into text according to an embodiment of the present invention;
FIG. 6 is a basic flow diagram of a method for converting segmented speech information into text according to an embodiment of the present invention;
FIG. 7 is a basic block diagram of a voice-to-text device according to an embodiment of the present invention;
FIG. 8 is a basic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
To enable those skilled in the art to better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.
Some of the flows described in the specification, claims, and figures of the present invention include a plurality of operations that appear in a particular order. It should be understood, however, that the operations may be performed out of the order in which they appear herein, or in parallel; sequence numbers such as 101 and 102 merely distinguish different operations and do not by themselves represent any execution order. In addition, the flows may include more or fewer operations, which may be performed sequentially or in parallel. It should also be noted that the terms "first" and "second" herein distinguish different messages, devices, modules, and so on; they do not represent a sequence, nor do they require that the "first" and "second" items be of different types.
The embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
Examples
As used herein, a "terminal" includes both a device of a wireless signal receiver having no transmitting capability and a device of receiving and transmitting hardware having receiving and transmitting hardware capable of performing bi-directional communications over a bi-directional communication link, as will be appreciated by those skilled in the art. Such a device may include: a cellular or other communication device having a single-line display or a multi-line display or a cellular or other communication device without a multi-line display; a PCS (Personal Communications Service, personal communication system) that may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (PersonalDigital Assistant ) that can include a radio frequency receiver, pager, internet/intranet access, web browser, notepad, calendar and/or GPS (Global Positioning System ) receiver; a conventional laptop and/or palmtop computer or other appliance that has and/or includes a radio frequency receiver. As used herein, "terminal," "terminal device" may be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or adapted and/or configured to operate locally and/or in a distributed fashion, to operate at any other location(s) on earth and/or in space. The "terminal" and "terminal device" used herein may also be a communication terminal, a network access terminal, and a music/video playing terminal, for example, may be a PDA, a MID (Mobile Internet Device ), and/or a mobile phone with a music/video playing function, and may also be a smart tv, a set top box, and other devices.
The client terminal in this embodiment is the above-described terminal.
Specifically, referring to fig. 1, fig. 1 is a basic flow chart of the voice-to-text method according to the present embodiment.
As shown in fig. 1, the voice-to-text method comprises the following steps:
s1100, acquiring voice information to be processed;
the voice information to be processed is the voice information to be converted into the text information, and in general, in order to improve the accuracy of text conversion, the voice information to be processed is generally the voice information subjected to noise reduction processing. Specifically, acquiring the voice information to be processed includes: and collecting voice information of a user, and carrying out noise reduction treatment on the voice information according to preset processing software.
The voice information of the user can be recorded through a voice recording module built into the terminal, downloaded, or received from other terminals. For noise reduction, preset audio processing software such as Adobe Audition CS or VinylStudio can be used.
S1200, segmenting the voice information according to a preset sentence-breaking rule;
the preset sentence breaking rule is a preset rule for segmenting the voice information, for example, the voice information is divided into a plurality of segments according to the pause position in the voice information, and when voices of a plurality of characters appear in the voice information, the voice information is segmented according to tone.
S1300, converting the segmented voice information into characters.
In the embodiment of the invention, the segmented voice information can be converted into text through conversion software built into the terminal, for example SwiftScribe.
According to the voice-to-text method above, the voice information is segmented by the preset sentence-breaking rule and the segments are converted into text; this segmentation increases the readability of the text and avoids unnecessary misreading or ambiguity.
In practical application, when a user inputs voice information through a terminal, pauses occur according to the user's speaking habits. To break the voice information into sentences that follow these natural pauses, an embodiment of the present invention provides a method for segmenting the voice information according to a preset sentence-breaking rule; fig. 2 is a basic flow diagram of this method.
Specifically, as shown in fig. 2, step S1200 specifically includes the following steps:
s1211, detecting a decibel value in the voice information;
s1212, when the decibel value in the voice information is smaller than a preset decibel value, taking the position with the decibel value smaller than the preset decibel value as a first segmentation point of the voice information;
in the embodiment of the invention, the terminal detects the decibel value of the voice information, such as a sound meter 2.0,Digital Sound Meter, through preset decibel detection software.
The preset decibel value is set in advance. When the user pauses, the decibel value of the voice information drops; to account for ambient noise, the preset decibel value can be set above the decibel level of the ambient noise but below the level of the user's normal speech.
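The threshold choice described above can be sketched as a simple interpolation between the measured noise floor and the normal speech level. This is only an illustration; the `margin` fraction is an assumed tuning parameter, not something the patent specifies.

```python
def choose_threshold(noise_floor_db, speech_level_db, margin=0.3):
    """Pick a preset decibel value above the ambient-noise floor but
    below the normal speech level, a fraction `margin` of the way up.
    Both inputs are decibel levels, with noise_floor_db the quieter."""
    assert noise_floor_db < speech_level_db
    return noise_floor_db + margin * (speech_level_db - noise_floor_db)
```

With a noise floor of -60 dB and speech around -20 dB, this yields a threshold of -48 dB: comfortably above the noise, comfortably below speech.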
The embodiment of the invention also provides a method for taking the position with the decibel value smaller than the preset decibel value as the first segmentation point of the voice information, as shown in fig. 3, fig. 3 is a basic flow diagram of the method for taking the position with the decibel value smaller than the preset decibel value as the first segmentation point of the voice information, which is provided by the embodiment of the invention.
Specifically, as shown in fig. 3, step S1212 includes the steps of:
s12121, judging the voice duration of the voice information, wherein the decibel value of the voice duration is smaller than a preset decibel value;
in practical application, because there is a time interval between each word when people speak, in general, a pause after the complete sentence is expressed is the segmentation in the embodiment of the present invention, so in the process of determining the first segmentation point, when the db value at a certain moment in the voice information is less than the preset db value, it is determined whether the duration of the voice segment is greater than the preset duration when the db values in the voice segment taking the moment as the starting point are all less than the preset db value. The voice duration in the embodiment of the invention is the duration of the voice segment with the decibel value smaller than the preset decibel value.
S12122, when the voice duration is longer than the preset duration, acquiring the voice segment whose decibel value is smaller than the preset decibel value;
In the embodiment of the invention, acquiring the voice segment whose decibel value is smaller than the preset decibel value only requires acquiring the time span that the segment occupies in the voice information.
S12123, taking any moment in the voice segment as the first segmentation point.
For example, suppose the decibel-detection software detects that the level at the 2 s position of the voice information drops below the preset decibel value, and that from that starting point the 2-3 s segment stays below the preset decibel value, its duration of 1 s exceeding the preset duration of 0.5 s. Any moment within the 2-3 s segment, for example 2.5 s or 3 s, can then be taken as the first segmentation point.
In this way the first segmentation point can be determined accurately, avoiding arbitrary sentence breaks.
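As a rough sketch of steps S1211 through S12123, the following Python fragment scans per-frame decibel values for silent runs longer than a preset duration and returns one split point per run. The frame length, both thresholds, and the choice of the run's midpoint as the split point are all illustrative assumptions, not values from the patent.

```python
import math

FRAME_MS = 100        # analysis frame length (assumed)
DB_THRESHOLD = -40.0  # "preset decibel value": below this counts as a pause
MIN_PAUSE_MS = 500    # "preset duration": a pause must last at least this long

def frame_db(samples):
    """RMS level of one frame in dBFS, for samples scaled to [-1, 1]."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -120.0 if rms == 0 else 20.0 * math.log10(rms)

def pause_split_points(frames):
    """Return one split index per below-threshold run longer than
    MIN_PAUSE_MS. `frames` is a list of per-frame dB values; the patent
    allows any instant inside a qualifying run, so the midpoint is used."""
    splits, run_start = [], None
    for i, db in enumerate(frames + [0.0]):  # loud sentinel closes a trailing run
        if db < DB_THRESHOLD and run_start is None:
            run_start = i                    # a quiet run begins
        elif db >= DB_THRESHOLD and run_start is not None:
            if (i - run_start) * FRAME_MS >= MIN_PAUSE_MS:
                splits.append((run_start + i) // 2)  # midpoint of the pause
            run_start = None
    return splits
```

A 600 ms quiet run qualifies and yields one split point; a 200 ms gap between words is ignored, matching the "complete sentence" reasoning above.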
S1213, segmenting the voice information according to the first segmentation point.
In the embodiment of the invention, the voice information is divided into several voice segments at the first segmentation points, i.e. at moments chosen within the segments whose decibel values are below the preset decibel value. It should be noted that, after segmentation, the voice segments are ordered according to their original positions in the voice information, so as to maintain continuity.
In practical applications, voice information often contains multi-person conversations, such as interviews and conference recordings. To make the converted text easier to follow in this case, the embodiment of the present invention provides a further method for segmenting the voice information according to the preset sentence-breaking rule; fig. 4 is a basic flow diagram of this method.
Specifically, as shown in fig. 4, step S1220 includes the steps of:
s1221, judging whether tone color change occurs in voice information;
in the embodiment of the invention, the built-in tone detection software can be utilized to detect whether tone change exists in the voice information, such as polyphosphine software and the like.
S1222, when a timbre change occurs in the voice information, taking the position of the timbre change as a second segmentation point of the voice information;
s1223, segmenting the voice information according to the second segmentation point.
In the embodiment of the invention, when a timbre change occurs in the voice information, the moment of the timbre change is extracted and taken as the second segmentation point of the voice information.
It should be noted that, in practice, the same voice information may fall under both the embodiment shown in fig. 2 and the present embodiment, i.e. first and second segmentation points may exist simultaneously. In that case the voice information is segmented according to both the first and the second segmentation points, and the resulting voice segments are ordered as they appear in the voice information, so that the text after segmentation does not become disordered.
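Combining both kinds of split points while preserving the original order amounts to simple bookkeeping, which might be sketched as follows. Times in milliseconds are an assumption for illustration.

```python
def segment(duration_ms, pause_points, timbre_points):
    """Cut the interval [0, duration_ms) at the union of pause-based
    (first) and timbre-based (second) split points, returning the
    resulting segments as (start, end) pairs in temporal order."""
    cuts = sorted(set(pause_points) | set(timbre_points))
    bounds = [0] + [c for c in cuts if 0 < c < duration_ms] + [duration_ms]
    return list(zip(bounds, bounds[1:]))
```

Sorting the merged cut list is what keeps the segments in their original order even when the two kinds of segmentation points interleave.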
When several roles (i.e. timbres) appear in the voice information, it is divided into segments at the second segmentation points. After these segments are converted into text, readers may confuse the words of the different roles because the roles are not identified. To solve this problem, the embodiment of the invention provides a method for converting the segmented voice information into text; fig. 5 is a basic flow diagram of this method.
Specifically, as shown in fig. 5, step S1300 includes the steps of:
s1311, performing tone marking on segmented voice information with the same tone;
when the voice information includes a plurality of characters, i.e., different timbres, the timbres are marked, for example, the voice information includes two characters a and B, a is marked with a, and B is marked with B, so that the characters are distinguished.
S1312, converting the marked segmented voice into text through preset voice conversion software;
S1313, performing role marking on the converted text according to the timbre marks.
In the embodiment of the invention, each voice segment is converted into text in order by the built-in voice conversion software, and the role mark is recorded at the head of each text segment, so that readers can see clearly which role spoke each segment, improving the readability of the text.
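A minimal sketch of the timbre-marking and role-labelling flow of S1311 through S1313: each distinct timbre gets a letter, and the letter is prepended to that segment's converted text. The timbre identifiers here stand in for the output of a timbre detector, which this sketch does not implement.

```python
def label_segments(segments):
    """Attach a role label (A, B, ...) per distinct timbre.

    `segments` is a list of (timbre_id, text) pairs in original order,
    where timbre_id is an opaque value produced by timbre detection and
    text is the output of the speech-to-text conversion."""
    labels, out = {}, []
    for timbre, text in segments:
        if timbre not in labels:
            labels[timbre] = chr(ord("A") + len(labels))  # next unused letter
        out.append(f"{labels[timbre]}: {text}")
    return out
```

Because labels are assigned in order of first appearance and segments are never reordered, the output reads like a script, with each role's lines attributed consistently.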
In practical application, to further increase the readability of the converted text and give readers a good reading experience, the embodiment of the present invention provides another method for converting the segmented voice information into text; fig. 6 is a basic flow diagram of this method.
Specifically, as shown in fig. 6, step S1300 includes the steps of:
s1321, converting segmented voice into target characters through preset voice conversion software;
the voice conversion software includes SwiftScribe, IBM Viavoice and other software.
S1322, acquiring mood keywords in the target text;
In the embodiment of the invention, a mood-word database is preset on the terminal. To obtain the mood keywords of the target text, the terminal compares the words in the mood-word database with the words in the target text; when the target text contains a word that also appears in the mood-word database, that word is extracted and taken as a mood keyword of the target text.
S1323, searching a preset information table for punctuation marks having a mapping relation with the mood keywords, and adding the punctuation marks to the target text.
The information table records the correspondence between mood words and punctuation marks. For example, when "what" appears in a sentence, the sentence is usually a question and should end with "?"; when "ah" appears at the end of a sentence, the sentence is usually an exclamation and should end with "!". Accordingly, in the information table the mood word "what" has a mapping relation with "?", and "ah" has a mapping relation with "!".
In practical applications, the emotion expressed by a mood word depends on context; for example, "ah" may also indicate a question. To add punctuation marks more accurately, the structure and contextual semantics of the sentence can also be analyzed to determine the sentence-final punctuation, which is not described further here.
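The keyword-to-punctuation lookup of S1322 and S1323 might be sketched as below. The table entries are illustrative English stand-ins for the Chinese mood words the patent has in mind, and as the paragraph above notes, a production system would also weigh sentence structure and context rather than rely on the first keyword alone.

```python
# Hypothetical "information table": mood keyword -> sentence-final punctuation.
MOOD_PUNCTUATION = {
    "what": "?",  # interrogative mood word
    "why": "?",
    "wow": "!",   # exclamatory mood word
}

def punctuate(sentence):
    """Append the punctuation mapped from the first mood keyword found
    in the sentence, defaulting to a period when none matches."""
    body = sentence.rstrip(".?!")         # drop any punctuation already present
    for word in body.lower().split():
        if word in MOOD_PUNCTUATION:
            return body + MOOD_PUNCTUATION[word]
    return body + "."
```

So a converted sentence containing "what" is terminated with a question mark, while one with no mood keyword simply gets a period.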
In order to solve the above technical problem, the embodiment of the invention also provides a voice-to-text device. Referring to fig. 7, fig. 7 is a basic block diagram of the voice-to-text device according to the present embodiment.
As shown in fig. 7, the voice-to-text device includes: an acquisition module 2100, a processing module 2200, and an execution module 2300. The acquisition module 2100 is configured to acquire voice information to be processed; the processing module 2200 is configured to segment the voice information according to a preset sentence-breaking rule; and the execution module 2300 is configured to convert the segmented voice information into text.
The voice-to-text device segments the voice information by a preset sentence-breaking rule and converts the segments into text; this segmentation increases the readability of the text and avoids unnecessary misreading or ambiguity.
In some embodiments, the processing module comprises: the first processing sub-module is used for detecting decibel values in the voice information; the second processing sub-module is used for taking the position of which the decibel value is smaller than a preset decibel value as a first segmentation point of the voice information when the decibel value in the voice information is smaller than the preset decibel value; and the first execution sub-module is used for segmenting the voice information according to the first segmentation point.
In some embodiments, the second processing sub-module comprises: a third processing sub-module, configured to determine the duration over which the decibel value of the voice information remains below the preset decibel value; a first acquisition sub-module, configured to acquire, when the duration is longer than the preset duration, the voice segment whose decibel value is smaller than the preset decibel value; and a second execution sub-module, configured to take any moment in the voice segment as the first segmentation point.
In some embodiments, the processing module comprises: a fourth processing sub-module, configured to judge whether a timbre change occurs in the voice information; a fifth processing sub-module, configured to take, when a timbre change occurs in the voice information, the position of the timbre change as a second segmentation point of the voice information; and a third execution sub-module, configured to segment the voice information according to the second segmentation point.
In some embodiments, the execution module comprises: a sixth processing sub-module, configured to perform timbre marking on the segmented voice information, segments with the same timbre receiving the same mark; a seventh processing sub-module, configured to convert the marked segmented speech into text through preset speech conversion software; and a fourth execution sub-module, configured to perform role marking on the converted text according to the timbre marks.
In some embodiments, the execution module comprises: an eighth processing sub-module, configured to convert the segmented speech into target text through preset voice conversion software; a second acquisition sub-module, configured to acquire a mood keyword in the target text; and a fifth execution sub-module, configured to look up, in a preset information table, the punctuation mark having a mapping relation with the mood keyword, and to add the punctuation mark to the target text.
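The preset information table above maps mood keywords to punctuation marks. A minimal sketch follows; the table contents (romanized Chinese mood particles) and the fallback to a period are illustrative assumptions, since the patent does not list the table's entries.

```python
# Hypothetical mapping table: mood keyword -> punctuation mark.
# Keys are romanized mood particles for illustration only.
PUNCT_TABLE = {
    "ma": "?",   # interrogative particle -> question mark
    "ba": ".",   # suggestion particle -> period
    "a": "!",    # exclamatory particle -> exclamation mark
}


def add_punctuation(text, keyword):
    """Append the punctuation mark mapped to the mood keyword.

    Defaults to a period when the keyword has no mapping in the table.
    """
    return text + PUNCT_TABLE.get(keyword, ".")
```

For example, a sentence ending in the interrogative particle receives a question mark, while an unmapped keyword falls back to a period.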
In some embodiments, the acquisition module comprises: a third acquisition sub-module, configured to collect voice information of the user; and a ninth processing sub-module, configured to perform noise reduction on the voice information using preset processing software.
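The patent leaves the noise reduction method to the preset processing software; one of the simplest possibilities is a noise gate that zeroes samples below an estimated noise floor. This sketch is an assumption for illustration, not the patent's method.

```python
def noise_gate(samples, noise_floor):
    """Zero out samples whose magnitude is below the estimated noise floor.

    samples: sequence of normalized amplitude values (e.g. in [-1, 1]).
    noise_floor: magnitude threshold below which a sample is treated as noise.
    """
    return [s if abs(s) >= noise_floor else 0 for s in samples]
```

Real systems would typically use spectral subtraction or a trained denoiser instead, but the gate illustrates the pre-processing step's place in the pipeline.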
To solve the above technical problems, an embodiment of the present invention further provides a computer device. Refer specifically to fig. 8, which is a basic structural block diagram of the computer device of this embodiment.
Fig. 8 schematically shows the internal structure of the computer device. As shown in fig. 8, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected by a system bus. The non-volatile storage medium of the computer device stores an operating system, a database, and computer readable instructions; the database may store a control information sequence, and the computer readable instructions, when executed by the processor, cause the processor to implement the speech-to-text method. The processor of the computer device provides the computing and control capabilities that support the operation of the entire device. The memory of the computer device may store computer readable instructions that, when executed by the processor, cause the processor to perform the speech-to-text method. The network interface of the computer device is used to communicate with a terminal. It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely a block diagram of some of the structures associated with the present application and does not limit the computer device to which the present application may be applied; a particular computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
The processor in this embodiment is configured to execute the specific contents of the acquisition module 2100, the processing module 2200, and the execution module 2300 in fig. 7, and the memory stores the program code and the various types of data required to execute these modules. The network interface is used for data transmission with a user terminal or a server. The memory in this embodiment stores the program code and data required to execute all the sub-modules of the speech-to-text method, and the server can call this program code and data to execute the functions of all the sub-modules.
The computer device segments the voice information according to a preset sentence-breaking rule and converts the segmented voice information into text; segmenting the text in this way improves its readability and avoids unnecessary misreading or ambiguity.
The present invention also provides a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the speech-to-text method of any of the embodiments described above.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored in a computer-readable storage medium, which, when executed, may include the steps of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.

Claims (9)

1. A method for converting speech into text, comprising the steps of:
acquiring voice information to be processed;
segmenting the voice information according to a preset sentence-breaking rule;
converting the segmented voice information into text;
wherein the step of segmenting the voice information according to the preset sentence-breaking rule comprises:
detecting the voice information to obtain a decibel value and a tone corresponding to the voice information;
segmenting the voice information according to the decibel value, the tone, and the preset sentence-breaking rule;
and wherein the step of converting the segmented voice information into text comprises:
converting the segmented speech into target text through preset voice conversion software;
acquiring a mood keyword in the target text;
looking up, in a preset information table, punctuation marks having a mapping relation with the mood keyword, to obtain preselected punctuation marks;
analyzing sentence structure and context semantics, and determining a target punctuation mark from the preselected punctuation marks;
and adding the target punctuation mark to the target text.
2. The method for converting speech into text according to claim 1, wherein segmenting the voice information according to the preset sentence-breaking rule comprises:
detecting a decibel value in the voice information;
when the decibel value in the voice information is smaller than a preset decibel value, taking the position where the decibel value is smaller than the preset decibel value as a first segmentation point of the voice information;
and segmenting the voice information according to the first segmentation point.
3. The method for converting speech into text according to claim 2, wherein taking the position where the decibel value is smaller than the preset decibel value as the first segmentation point of the voice information comprises:
determining the duration of speech in the voice information whose decibel value is smaller than the preset decibel value;
when the duration is longer than a preset duration, acquiring the voice segment whose decibel value is smaller than the preset decibel value;
and taking any moment within the voice segment as the first segmentation point.
4. The method for converting speech into text according to claim 1, wherein segmenting the voice information according to the preset sentence-breaking rule comprises:
judging whether tone color change occurs in the voice information;
when the tone color change occurs in the voice information, taking the position of the tone color change as a second segmentation point of the voice information;
and segmenting the voice information according to the second segmentation point.
5. The method for converting speech into text according to claim 4, wherein converting the segmented voice information into text comprises:
applying a tone mark to segmented voice information having the same tone color;
converting the marked segmented speech into text through preset voice conversion software;
and labeling the converted text according to the tone marks.
6. The method for converting speech into text according to claim 1, wherein acquiring the voice information to be processed comprises:
collecting voice information of a user;
and performing noise reduction on the voice information using preset processing software.
7. A speech-to-text apparatus, comprising:
the acquisition module is used for acquiring the voice information to be processed;
the processing module is used for segmenting the voice information according to a preset sentence-breaking rule;
the execution module is used for converting the segmented voice information into text;
wherein, when converting the segmented voice information into text, the execution module is specifically configured to:
convert the segmented speech into target text through preset voice conversion software;
acquire a mood keyword in the target text;
look up, in a preset information table, punctuation marks having a mapping relation with the mood keyword, to obtain preselected punctuation marks;
analyze sentence structure and context semantics, and determine a target punctuation mark from the preselected punctuation marks;
and add the target punctuation mark to the target text;
the processing module is specifically configured to:
detecting the voice information to obtain a decibel value and a tone corresponding to the voice information;
and segmenting the voice information according to the decibel value, the tone, and the preset sentence-breaking rule.
8. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions that, when executed by the processor, cause the processor to perform the steps of the speech-to-text method of any of claims 1 to 6.
9. A storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the speech-to-text method of any one of claims 1 to 6.
CN201811526588.9A 2018-12-13 2018-12-13 Method, device, computer equipment and storage medium for converting voice into text Active CN109754808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811526588.9A CN109754808B (en) 2018-12-13 2018-12-13 Method, device, computer equipment and storage medium for converting voice into text

Publications (2)

Publication Number Publication Date
CN109754808A CN109754808A (en) 2019-05-14
CN109754808B true CN109754808B (en) 2024-02-13

Family

ID=66403800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811526588.9A Active CN109754808B (en) 2018-12-13 2018-12-13 Method, device, computer equipment and storage medium for converting voice into text

Country Status (1)

Country Link
CN (1) CN109754808B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112151042A (en) * 2019-06-27 2020-12-29 中国电信股份有限公司 Voiceprint recognition method, device and system and computer readable storage medium
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition
CN110827825A (en) * 2019-11-11 2020-02-21 广州国音智能科技有限公司 Punctuation prediction method, system, terminal and storage medium for speech recognition text
CN113408996A (en) * 2020-03-16 2021-09-17 上海博泰悦臻网络技术服务有限公司 Schedule management method, schedule management device and computer readable storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178790A (en) * 2006-11-10 2008-05-14 胡鹏 Method for realizing synergic listen and type recording method by intelligent virtual punctuate
CN102522084A (en) * 2011-12-22 2012-06-27 广东威创视讯科技股份有限公司 Method and system for converting voice data into text files
CN102903361A (en) * 2012-10-15 2013-01-30 Itp创新科技有限公司 Instant call translation system and instant call translation method
CN104050160A (en) * 2014-03-12 2014-09-17 北京紫冬锐意语音科技有限公司 Machine and human translation combined spoken language translation method and device
CN104142915A (en) * 2013-05-24 2014-11-12 腾讯科技(深圳)有限公司 Punctuation adding method and system
CN105609107A (en) * 2015-12-23 2016-05-25 北京奇虎科技有限公司 Text processing method and device based on voice identification
CN106504746A (en) * 2016-10-28 2017-03-15 普强信息技术(北京)有限公司 A kind of method for extracting structuring traffic information from speech data
CN106656767A (en) * 2017-01-09 2017-05-10 武汉斗鱼网络科技有限公司 Method and system for increasing new anchor retention
CN106971723A (en) * 2017-03-29 2017-07-21 北京搜狗科技发展有限公司 Method of speech processing and device, the device for speech processes
CN108132995A (en) * 2017-12-20 2018-06-08 北京百度网讯科技有限公司 For handling the method and apparatus of audio-frequency information
CN108141498A (en) * 2015-11-25 2018-06-08 华为技术有限公司 A kind of interpretation method and terminal
CN108831481A (en) * 2018-08-01 2018-11-16 平安科技(深圳)有限公司 Symbol adding method, device, computer equipment and storage medium in speech recognition

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100452171C (en) * 2004-11-01 2009-01-14 英业达股份有限公司 Speech waveform processing system and method
CN108447486B (en) * 2018-02-28 2021-12-03 科大讯飞股份有限公司 Voice translation method and device


Also Published As

Publication number Publication date
CN109754808A (en) 2019-05-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant