CN116631400A - Voice-to-text method and device, computer equipment and storage medium


Info

Publication number
CN116631400A
Authority
CN
China
Prior art keywords
word
text
marking
marked
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310834751.2A
Other languages
Chinese (zh)
Inventor
黄杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202310834751.2A
Publication of CN116631400A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates


Abstract

The embodiments of this application provide a voice-to-text method and device, a computer device, and a storage medium, belonging to the technical field of financial technology. The method comprises the following steps: acquiring voice data; performing content recognition on the voice data to obtain an original text; extracting key information from the original text according to preset keyword features to obtain selected key information, where the selected key information comprises word information to be marked and marking effect information; screening a word to be marked from the original text according to the word information to be marked; screening a target marking operation from preset candidate marking operations according to the marking effect information; and marking the word to be marked in the original text according to the target marking operation to obtain a target text. The embodiments of this application can generate text in which the key sentences are marked, saving the labor of manual marking.

Description

Voice-to-text method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of financial technology, and in particular, to a method and apparatus for converting voice into text, a computer device, and a storage medium.
Background
With the development of financial technology, voice data is obtained in the financial industry by recording customer service interactions and sales calls, and the voice data is then converted into text so that key content can be extracted for customers to review, thereby improving service quality. For example, in the insurance industry, insurance service personnel explain service plans to customers orally; to make later reference convenient for customers, the service plan is generated as text from the service personnel's voice data.
In the related art, voice-to-text conversion directly recognizes the content of the voice data and generates text. If marking effects need to be added in the generated text to the key sentences of the voice data, the key sentences must be highlighted. For example, after voice data is converted into a service plan in the insurance industry, insurance service personnel must add marking effects to the key content of the plan so that customers focus on it when reading. However, adding marking effects to the key sentences of the generated text by hand consumes manual editing time. How to automatically generate text with marked key sentences from voice data has therefore become a technical problem to be solved.
Disclosure of Invention
The main purpose of the embodiments of this application is to provide a voice-to-text method and device, a computer device, and a storage medium, aiming to automatically generate text with marked key sentences from voice data and thereby save labor.
To achieve the above object, a first aspect of an embodiment of the present application provides a method for converting speech to text, the method including:
acquiring voice data;
performing content recognition on the voice data to obtain an original text;
extracting key information from the original text according to preset key word characteristics to obtain selected key information; wherein the selected key information comprises: word information to be marked and marking effect information;
screening words to be marked from the original text according to the word information to be marked;
screening out target marking operations from preset candidate marking operations according to the marking effect information;
and marking the words to be marked in the original text according to the target marking operation to obtain a target text.
In some embodiments, the extracting the key information from the original text according to the preset key word features to obtain the selected key information includes:
dividing word characteristics of the original text to obtain original word characteristics;
screening target word features from the original word features according to the keyword features;
and extracting text content from the original text according to the target word characteristics to obtain the selected key information.
In some embodiments, the extracting text content from the original text according to the target word feature to obtain the selected key information includes:
extracting sentences of the target word characteristics from the original text to obtain candidate key sentences; wherein the candidate key sentences include: key words and marking effect words;
and constructing the selected key information according to the key words and the marking effect words.
In some embodiments, the marking the word to be marked in the original text according to the target marking operation to obtain a target text includes:
extracting sentences containing the words to be marked from the original text to obtain sentences to be marked;
marking the word to be marked of the statement to be marked according to the target marking operation to obtain a target statement;
and replacing the statement to be marked in the original text with the target statement to obtain the target text.
In some embodiments, the extracting the sentence containing the word to be tagged from the original text to obtain the sentence to be tagged includes:
selecting a sentence range according to the word to be marked in the original text to obtain a sentence selection range; the sentence selection range is the range of the sentence before the candidate key sentence in the original text;
And selecting the sentence to be marked from the original text according to the sentence selection range.
In some embodiments, after the word to be marked in the original text is marked according to the target marking operation, the method further includes:
acquiring the position information of punctuation marks before the candidate key sentences in the target text to obtain symbol position information;
and eliminating punctuation marks of the target text according to the mark position information, and eliminating the candidate key sentences in the target text to update the target text.
In some embodiments, before extracting the key information from the original text according to the preset key word characteristics to obtain the selected key information, the method further includes:
the construction of the keyword features specifically comprises the following steps:
acquiring a preset marking rule; wherein the marking rule includes: a reference word feature, a marker word feature, and an effect word feature;
and combining the reference word feature, the marker word feature, and the effect word feature to obtain the keyword feature.
To achieve the above object, a second aspect of an embodiment of the present application provides a device for converting speech to text, the device including:
The data acquisition module is used for acquiring voice data;
the content recognition module is used for carrying out content recognition on the voice data to obtain an original text;
the information extraction module is used for extracting key information from the original text according to preset key word characteristics to obtain selected key information; wherein the selected key information comprises: word information to be marked and marking effect information;
the word screening module is used for screening words to be marked from the original text according to the word information to be marked;
the operation screening module is used for screening out target marking operations from preset candidate marking operations according to the marking effect information;
and the word processing module is used for carrying out marking processing on the word to be marked in the original text according to the target marking operation to obtain a target text.
To achieve the above object, a third aspect of the embodiments of the present application proposes a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the method according to the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect.
According to the voice-to-text method and device, the computer device, and the storage medium provided by this application, voice data is converted into an original text; word information to be marked and marking effect information are extracted from the original text so that words to be marked can be screened out of the original text and a target marking operation can be screened out of the candidate marking operations; and the words to be marked in the original text are automatically marked with effects according to the target marking operation. Text with marked key sentences is thereby generated, saving the labor of adding marking effects to text by hand. In the insurance industry, a service plan that already carries marking effects can therefore be generated directly, which makes it convenient for customers to review the key content and reduces the workload of insurance service personnel in producing service plans.
Drawings
FIG. 1 is a flowchart of a voice-to-text method according to an embodiment of the present application;
FIG. 2 is a flowchart of a voice-to-text method according to another embodiment of the present application;
FIG. 3 is a flowchart of step S103 in FIG. 1;
FIG. 4 is a flowchart of step S303 in FIG. 3;
FIG. 5 is a schematic diagram of a code mapping relationship in a voice-to-text method according to an embodiment of the present application;
FIG. 6 is a flowchart of step S106 in FIG. 1;
FIG. 7 is a flowchart of step S601 in FIG. 6;
FIG. 8 is a flowchart of a voice-to-text method according to another embodiment of the present application;
FIG. 9 is a schematic structural diagram of a voice-to-text device according to an embodiment of the present application;
FIG. 10 is a schematic diagram of the hardware structure of a computer device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
First, several nouns involved in the present application are parsed:
Artificial intelligence (artificial intelligence, AI): a new technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Natural language processing (natural language processing, NLP): NLP is a branch of artificial intelligence and an interdisciplinary field of computer science and linguistics, often called computational linguistics; it processes, understands, and applies human languages (e.g., Chinese, English). Natural language processing includes syntactic parsing, semantic analysis, discourse understanding, and the like. It is commonly used in machine translation, recognition of handwritten and printed characters, speech recognition and text-to-speech conversion, intent recognition, information extraction and filtering, text classification and clustering, and public-opinion analysis and opinion mining; NLP involves data mining related to language processing, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, and linguistic research related to computational linguistics.
Automatic speech recognition (Automatic Speech Recognition, ASR): a technology that converts human speech into text. It takes speech as its research object and, through speech signal processing and pattern recognition, lets a machine automatically recognize and understand human spoken language. Speech recognition technology lets a machine convert speech signals into corresponding text or commands through a process of recognition and understanding. Speech recognition is a broad interdisciplinary field closely related to acoustics, phonetics, linguistics, information theory, pattern recognition theory, and neurobiology.
Rendering (Render): the process by which software generates an image from a model; in video editing software the term describes computing the effects in a project to generate the final video output. Rendering is applied in special effects for computer and video games and films, as well as in visual design, with each application weighing its own characteristics and technologies.
In voice-to-text technology, a recording device records the speaker's words to obtain voice data, the voice data is transmitted to a server, and the server converts the voice data into text. For example, in the insurance industry, service conversations and sales calls between insurance service personnel and customers are recorded to obtain voice data, which is then converted into text so that service personnel can extract the key content to build a service plan. In other financial fields, a financial service plan can likewise be generated by voice input, without being produced by hand. To ensure that customers can focus on the key content when reviewing, the key content in the text must be marked with effects that emphasize the speaker's intent. In the related art, after voice data is converted into text, the key words in the text are marked by hand according to the speaker's intent so as to highlight the key content. Manual marking, however, takes time and effort.
Based on this, the embodiments of this application provide a voice-to-text method and device, a computer device, and a storage medium. After voice data is converted into text, word information to be marked and marking effect information are extracted from the text in order to select the words to be marked and the target marking operation; the words to be marked in the original text are then automatically marked with effects according to the target marking operation, generating text with marked key sentences and saving the labor of manual marking. For insurance service personnel in the insurance industry, a service plan that already carries marking effects can thus be generated simply by speaking, which makes it convenient for customers to review the plan and reduces the personnel's workload.
The method and apparatus for converting voice into text, the computer device and the storage medium provided by the embodiments of the present application are specifically described by the following embodiments, and the method for converting voice into text in the embodiments of the present application is first described.
The embodiments of this application may acquire and process related data based on artificial intelligence technology. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiment of the application provides a voice-to-text method, which relates to the technical fields of artificial intelligence and financial science and technology. The voice-to-text method provided by the embodiment of the application can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop, etc.; the server side can be configured as an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and a cloud server for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligent platforms and the like; the software may be an application or the like that implements a voice-to-text method, but is not limited to the above.
The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It should be noted that, in each specific embodiment of the present application, when related processing is required according to user information, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first, and the collection, use, processing, and the like of the data comply with related laws and regulations and standards. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through popup or jump to a confirmation page and the like, and after the independent permission or independent consent of the user is definitely acquired, the necessary relevant data of the user for enabling the embodiment of the application to normally operate is acquired.
Fig. 1 is an optional flowchart of a method for converting speech to text according to an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S106.
Step S101, voice data is obtained;
step S102, carrying out content recognition on voice data to obtain an original text;
step S103, extracting key information from the original text according to preset key word characteristics to obtain selected key information; wherein, the selected key information comprises: word information to be marked and marking effect information;
step S104, screening out words to be marked from the original text according to the word information to be marked;
step S105, screening out target marking operations from preset candidate marking operations according to marking effect information;
and S106, marking the words to be marked in the original text according to the target marking operation to obtain the target text.
In steps S101 to S106 shown in this embodiment of the application, the original text is obtained by performing content recognition on the acquired voice data; the word information to be marked and the marking effect information are extracted from the original text according to the keyword features; the word to be marked is screened from the original text according to the word information to be marked; and the target marking operation is screened from the candidate marking operations according to the marking effect information, so that the target text is obtained by applying the marking effect to the word to be marked in the original text. In this way, after voice data is automatically converted into text, the key sentences in the text are marked automatically to generate target text with marked key sentences, which not only saves the labor and time of manual editing but also speeds up the generation of differentiated content in the target text.
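Before examining each step, the following is a minimal, runnable Python sketch of the whole flow on an already-recognized text. It assumes the marking command has the English form "mark the above <word> in <effect>"; all identifiers here are illustrative assumptions, not the patent's own.

```python
import re

MARK_CODES = {  # candidate marking operations (step S105)
    "red":       '<span style="color: rgb(255, 0, 0);">{}</span>',
    "bold":      "<strong>{}</strong>",
    "underline": '<span style="text-decoration: underline;">{}</span>',
    "italic":    "<em>{}</em>",
}

# keyword feature: reference word + key word + marking effect word (step S103)
KEY_PATTERN = re.compile(
    r"mark the above (?P<word>.+?) in (?P<effect>red|bold|underline|italic)")

def mark_key_words(original_text: str) -> str:
    match = KEY_PATTERN.search(original_text)       # selected key information
    if match is None:
        return original_text                        # nothing to mark
    word, effect = match.group("word"), match.group("effect")
    marked = MARK_CODES[effect].format(word)        # target marking operation
    text = original_text.replace(word, marked, 1)   # step S106: mark the word
    return text.replace(match.group(0), "").rstrip(" ,")

print(mark_key_words(
    "A good way of speaking must be one with clear articulation, "
    "mark the above clear articulation in red"))
```

Later sections sketch the individual pieces of this flow in more detail.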
In step S101 of some embodiments, the voice data may be extracted from a preset voice database, or acquired in real time by other means; this is not limited here. If the voice data is acquired in real time, a recording device records the speaker's words to obtain the voice data. The recording device may be built into the terminal, or the voice data may be transmitted directly to the terminal or server after being collected by the recording device. The recording device may be any of: a mobile phone, a notebook computer, or a voice recorder. A recording icon with a volume indicator is displayed on the device's interface; the volume indicator reflects the speaker's volume in real time, so the speaker can tell from it whether voice data is being recorded.
It should be noted that the voice data collected by the recording device may be processed further on the terminal, or sent to a server for further processing. The recording device may send the voice data to the terminal or server over a wired or wireless connection; the wireless connection may be any of Bluetooth, GPRS, or WIFI.
For example, in an insurance-industry scenario, when insurance service personnel explain an insurance service plan to a customer on site, a recording device records the personnel's voice during the explanation to obtain the voice data. If the explanation takes place by telephone, the call recording serves as the voice data. If the explanation takes place by teleconference, the audio track of the conference video can be extracted as the voice data.
In step S102 of some embodiments, after the voice data is acquired, it needs to be converted into text. The voice data is converted into the original text by automatic speech recognition, which recognizes the textual content of the voice data.
It should be noted that converting the voice data into the original text means that a speech recognition program recognizes the content of the voice data to obtain the original text.
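The patent does not prescribe a particular ASR engine. As one hedged illustration, the open-source SpeechRecognition package for Python can perform this step; the file name and language code below are assumptions.

```python
import speech_recognition as sr  # third-party package: SpeechRecognition

recognizer = sr.Recognizer()
# Hypothetical recording of a service-plan explanation (the output of step S101).
with sr.AudioFile("service_plan_explanation.wav") as source:
    audio = recognizer.record(source)

# Google's free web speech API is one of several back ends the package wraps.
original_text = recognizer.recognize_google(audio, language="zh-CN")
print(original_text)
```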
In some embodiments, prior to step S103, the speech-to-text method further comprises: and constructing keyword features.
It should be noted that, before the selected key information is extracted from the original text, the keyword features must be constructed; that is, which word features in the original text identify selected key information is defined in advance, which improves the flexibility of customizing the keyword features.
Referring to FIG. 2, in some embodiments, constructing the keyword features may include, but is not limited to, steps S201 to S202:
step S201, obtaining a preset marking rule; wherein the marking rule includes: a reference word feature, a marker word feature, and an effect word feature;
step S202, combining the reference word feature, the marker word feature, and the effect word feature to obtain the keyword feature.
In step S201 of some embodiments, a preset marking rule is obtained. The marking rule includes a reference word feature, a marker word feature, and an effect word feature: the reference word feature characterizes reference words, the marker word feature characterizes the words to be marked, and the effect word feature determines what marking effect is applied to the words to be marked.
For example, the reference word feature may be "the above", the marker word feature "key word", and the effect word feature "marking effect word". Which part of a sentence in the original text constitutes selected key information can then be determined from the reference word feature, the key word, and the effect word feature. The reference word feature can be customized as required; it may be a word such as "the above", "the preceding", or "the previous sentence", and the selection range of the words to be marked can be determined from it.
In step S202 of some embodiments, the keyword feature is obtained by combining the reference word feature, the marker word feature, and the effect word feature, so that words satisfying the keyword feature can be searched out of the original text as selected key information.
For example, the reference word feature, the marker word feature, and the effect word feature are combined into "the above + key word + marking effect word" as a keyword feature, or into "the preceding + key word + marking effect word" as a keyword feature.
In steps S201 to S202 shown in this embodiment of the application, the reference word feature, the marker word feature, and the effect word feature are acquired and combined into the keyword feature, so that words satisfying the keyword feature are extracted from the original text as the selected key information.
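As a sketch of how such a combination might look in code, the three feature word sets can be joined into a single pattern. The feature word lists below are assumed English renderings; none of them come from the patent itself.

```python
import re

# Hypothetical feature word lists making up the marking rule (step S201).
REFERENCE_WORDS = ["the above", "the preceding", "the previous sentence"]
EFFECT_WORDS    = ["red", "bold", "underline", "italic"]

# Step S202: combine reference word + key word + effect word into one keyword
# feature, expressed here as a regular expression with named groups.
KEYWORD_FEATURE = re.compile(
    r"mark (?P<ref>" + "|".join(map(re.escape, REFERENCE_WORDS)) + r") "
    r"(?P<key>.+?) in (?P<effect>" + "|".join(EFFECT_WORDS) + r")")

m = KEYWORD_FEATURE.search("mark the above clear articulation in red")
print(m.group("ref"), "|", m.group("key"), "|", m.group("effect"))
# -> the above | clear articulation | red
```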
Referring to FIG. 3, in some embodiments, step S103 may include, but is not limited to, steps S301 to S303:
step S301, dividing word characteristics of an original text to obtain the original word characteristics;
step S302, screening target word features from the original word features according to the keyword features;
and step S303, extracting text content from the original text according to the target word characteristics to obtain selected key information.
In step S301 of some embodiments, the terms in the original text are subjected to word feature classification, so as to obtain the original word feature of each term in the original text, so as to determine whether the terms in the original text match the keyword features according to the original word features.
It should be noted that, before word feature division, a word feature database is constructed in advance to determine the word feature corresponding to each word. After the original text is generated, each word in the original text is segmented, and the matching word feature is then retrieved from the word feature database for each word as its original word feature, so that the original word feature of each word in the original text is determined simply.
For example, suppose a sentence in the original text is "A good way of speaking must be a way of speaking with clear articulation; mark the above clear articulation in red." Word feature division of this text determines that "way of speaking" and "clear articulation" have the marker word feature, "must be" and "the above" have the reference word feature, and "mark red" has the effect word feature. In an insurance-industry scenario, if the original text contains "The age range of the customers served by insurance product A is 50-60 years old; mark the above 50-60 years old in red", then "50-60 years old" is determined to have the marker word feature and "the above" the reference word feature.
In steps S302 and S303 of some embodiments, the original word features are matched against the keyword feature, and those that match are taken as the target word features. The corresponding text content is then extracted from the original text according to the target word features as the selected key information, so that the selected key information is screened out simply.
For example, if the keyword feature is "the above + key word + marking effect word", the target word features screened from the original word features according to the keyword feature are "reference word feature + marker word feature + effect word feature", and the corresponding text extracted from the original text according to the target word features, "mark the above clear articulation in red", is taken as the selected key information. In the insurance-industry scenario, the corresponding text "mark the above 50-60 years old in red" is extracted from the original text as the selected key information.
In steps S301 to S303 shown in this embodiment of the application, the original word features are obtained by word feature division of each word in the original text, and the original word features that match the keyword feature are screened out as the target word features. The corresponding content is then screened from the original text according to the target word features as the selected key information, so that the selected key information is easy to screen and the word information to be marked and the marking effect information can be derived from it.
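A minimal sketch of steps S301 to S303 follows, with a small hand-built lexicon standing in for the word feature database described above; every term and name is a hypothetical example.

```python
# Hypothetical word feature lexicon (the "word feature database" above).
WORD_FEATURES = {
    "the above":          "reference word feature",
    "way of speaking":    "marker word feature",
    "clear articulation": "marker word feature",
    "red":                "effect word feature",
}

def divide_word_features(text: str) -> list[tuple[str, str]]:
    """Step S301: look up the word feature of each known term, ordered by
    where the term appears in the text."""
    found = [(text.find(t), t, f) for t, f in WORD_FEATURES.items() if t in text]
    return [(t, f) for _, t, f in sorted(found)]

def matches_keyword_feature(features: list[tuple[str, str]]) -> bool:
    """Steps S302-S303: the keyword feature 'reference word + key word +
    effect word' is matched when all three kinds of feature are present."""
    kinds = {f for _, f in features}
    return {"reference word feature", "marker word feature",
            "effect word feature"} <= kinds

sentence = "mark the above clear articulation in red"
original_features = divide_word_features(sentence)
print(original_features, matches_keyword_feature(original_features))
```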
Referring to FIG. 4, in some embodiments, step S303 may include, but is not limited to, steps S401 to S402:
step S401, extracting sentences of target word characteristics from the original text to obtain candidate key sentences; wherein the candidate key sentences include: key words and marking effect words;
step S402, constructing selected key information according to the key words and the marking effect words.
In step S401 of some embodiments, a candidate key sentence is obtained by first extracting the sentence corresponding to the target word features from the original text. Since the target word features are "reference word feature + marker word feature + effect word feature", a sentence whose feature order matches the target word features is found in the original text as the candidate key sentence; the key word matching the marker word feature and the marking effect word matching the effect word feature are then selected from the candidate key sentence.
For example, if a sentence in the original text is "Today I want to make an important point: a good way of speaking must be one with clear articulation. Mark the above clear articulation in red.", then, according to the target word features "reference word feature + marker word feature + effect word feature", the candidate key sentence matched in the original text is "mark the above clear articulation in red". In the insurance-industry scenario, the determined candidate key sentence is "mark the above 50-60 years old in red". Selecting candidate key sentences that match the target word features from the original text in this way makes their screening easy.
It should be noted that, after the candidate key sentence is selected, it is delimited in the original text, that is, the candidate key sentence is marked out in the original text. For example, a sentence in the original text becomes "A good way of speaking must be one with clear articulation[, mark the above clear articulation in red]".
In step S402 of some embodiments, since the candidate key sentence includes a key word and a marking effect word, the key word is taken as the word information to be marked and the marking effect word as the marking effect information. For example, in the candidate key sentence above, the key word is "clear articulation" and the marking effect word is "mark red". The marking effect word may be any of "mark red", "bold", "underline", or "italic". Taking the key word and the marking effect word as the selected key information thus facilitates its determination.
In steps S401 to S402 shown in this embodiment of the application, sentences matching the target word features are selected from the original text as candidate key sentences, and the key words and marking effect words in the candidate key sentences are taken as the selected key information, so that the selected key information is easy to acquire and the words to be marked and the marking effect are then determined from it.
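Building on the pattern sketched earlier, steps S401 and S402 can be realized by pulling the candidate key sentence and its two components into a small structure; the field names below are assumptions, not the patent's terminology.

```python
import re

KEY_PATTERN = re.compile(
    r"mark the above (?P<key_word>.+?) in "
    r"(?P<effect_word>red|bold|underline|italic)")

def extract_selected_key_info(original_text: str) -> dict | None:
    m = KEY_PATTERN.search(original_text)
    if m is None:
        return None
    return {
        "candidate_key_sentence": m.group(0),           # step S401
        "word_info_to_mark":      m.group("key_word"),  # step S402
        "marking_effect_info":    m.group("effect_word"),
        "span": m.span(),  # kept so the sentence can be delimited and later removed
    }

text = ("A good way of speaking must be one with clear articulation, "
        "mark the above clear articulation in red")
print(extract_selected_key_info(text))
```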
In step S104 of some embodiments, after the word information to be marked is determined, a word search is performed in the original text according to it. To save search time, a search range is determined in the original text relative to the candidate key sentence, namely the sentence immediately preceding the candidate key sentence. The matching word within that range is searched for according to the word information to be marked and taken as the word to be marked, so that the word to be marked is found faster.
For example, if the word information to be marked is "clear articulation", the sentence preceding the candidate key sentence is determined as the search range; if that sentence is "A good way of speaking must be one with clear articulation", then "clear articulation" within it is determined as the word to be marked. The search range is also determined by the reference word: if the reference word is "the preceding", the whole content before the candidate key sentence is the search range; if the reference word is "the previous two sentences", the content of the two sentences before the candidate key sentence is the search range.
In step S105 of some embodiments, after the word to be marked is found, it is marked. The target marking operation is screened from the preset candidate marking operations according to the marking effect information. Specifically, a target marking code is screened from the candidate marking codes according to the marking effect information, the target marking code is combined with the word to be marked, and the combination is displayed on a preset display, achieving on the word to be marked the marking effect corresponding to the marking effect information.
For example, if the marking effect information is "mark red", the corresponding candidate marking code is found from a preset code mapping relation according to the marking effect information. The candidate marking code corresponding to "mark red" is '<span style="color: rgb(255, 0, 0);">word to be marked</span>'. If the marking effect information is "bold", the corresponding candidate marking code is '<strong>word to be marked</strong>'. If the marking effect information is "underline", the corresponding candidate marking code is '<span style="text-decoration: underline;">word to be marked</span>'. If the marking effect information is "italic", the corresponding candidate marking code is '<em>word to be marked</em>'. The marking effect information may be any of "mark red", "bold", "underline", or "italic". The preset code mapping relation is shown in FIG. 5; the corresponding candidate marking code can be found from it according to the marking effect information and taken as the target marking code to mark the word to be marked. Determining the candidate marking code from the code mapping relation according to the marking effect information thus makes marking the word to be marked convenient.
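The code mapping relation of FIG. 5 can be held as a plain dictionary from marking effect information to an HTML template; this sketch only restates the four mappings listed above, and the function name is an assumption.

```python
# Code mapping relation (cf. FIG. 5): marking effect information -> HTML
# wrapper; "{}" is the slot for the word to be marked.
CODE_MAPPING = {
    "red":       '<span style="color: rgb(255, 0, 0);">{}</span>',
    "bold":      "<strong>{}</strong>",
    "underline": '<span style="text-decoration: underline;">{}</span>',
    "italic":    "<em>{}</em>",
}

def target_marking_code(effect: str, word: str) -> str:
    """Step S105: screen the target marking code and combine it with the word."""
    return CODE_MAPPING[effect].format(word)

print(target_marking_code("red", "clear articulation"))
# -> <span style="color: rgb(255, 0, 0);">clear articulation</span>
```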
Referring to FIG. 6, in some embodiments, step S106 may include, but is not limited to, steps S601 to S603:
step S601, extracting sentences containing words to be marked from an original text to obtain sentences to be marked;
step S602, marking the words to be marked of the sentences to be marked according to the target marking operation to obtain target sentences;
step S603, the sentences to be marked in the original text are replaced by target sentences, and the target text is obtained.
In step S601 of some embodiments, after the word to be marked is determined, the sentence carrying it is first extracted from the original text as the sentence to be marked, so that the word is marked directly within that sentence rather than processing the whole original text occurrence by occurrence. The sentence to be marked is the sentence located immediately before the candidate key sentence.
For example, if the word to be marked is "clear articulation" and the sentences containing it in the original text are sentence 4 of paragraph 2, sentence 2 of paragraph 3, and sentence 5 of paragraph 4, and the sentence immediately before the candidate key sentence is sentence 4 of paragraph 2, then that sentence is determined as the sentence to be marked.
In step S602 of some embodiments, after the sentence to be marked is determined, that is, after the scope to be marked in the original text is delimited, marking is not applied to every occurrence of the word to be marked in the whole original text; instead, the sentence to be marked is selected first, and the word to be marked within it is then marked according to the target marking operation.
For example, suppose the original text contains the sentence "A good way of speaking must be one with clear articulation. Mark the above clear articulation in red." The sentence to be marked is determined to be "A good way of speaking must be one with clear articulation", and marking it yields the target sentence with the marking effect, in which "clear articulation" is rendered in red. If the target marking operation is adding an underline, the target sentence is "A good way of speaking must be one with clear articulation", with "clear articulation" underlined. In the insurance-industry scenario, if a sentence in the original text is "The age range of the customers served by insurance product A is 50-60 years old; mark the above 50-60 years old in red", the generated text is "The age range of the customers served by insurance product A is 50-60 years old", with "50-60 years old" in a red font. Key content can thus be highlighted when an insurance service plan is generated, making it easy for customers to consult and improving the quality of insurance service.
In step S603 of some embodiments, after the marking of the word to be marked is completed, the target sentence is obtained; the sentence to be marked in the original text is replaced by the target sentence, yielding the target text with marked key words, so that after the voice is converted into text the key words in the text are marked automatically.
For example, a sentence in the original text is "today, i want to say an emphasis, and a good speaking mode must be the one that is clearly spoken. The target text is obtained after the target sentence is replaced by the original text, the same sentence in the target text is' today, I want to say an important point, and a good speaking mode is necessarily taughtClear mouth teethA kind of electronic device. ".
In steps S601 to S603 shown in this embodiment of the application, the sentence to be marked is determined first, and the word to be marked within it is then marked according to the target marking operation. Delimiting the sentence to be marked avoids marking every occurrence of the word in the whole original text, making the key word marking in the original text more accurate.
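A sketch of steps S601 to S603, assuming the sentence to be marked and the marked word HTML have already been determined as above; all names are illustrative.

```python
def replace_with_target_sentence(original_text: str, sentence_to_mark: str,
                                 word_to_mark: str, marked_word_html: str) -> str:
    """Steps S601-S603: mark the word only inside the delimited sentence,
    then swap the resulting target sentence back into the original text."""
    target_sentence = sentence_to_mark.replace(word_to_mark, marked_word_html)
    return original_text.replace(sentence_to_mark, target_sentence, 1)

text = ("Today I want to make an important point: a good way of speaking must "
        "be one with clear articulation, mark the above clear articulation in red")
print(replace_with_target_sentence(
    text,
    "a good way of speaking must be one with clear articulation",
    "clear articulation",
    '<span style="color: rgb(255, 0, 0);">clear articulation</span>'))
```

Note that the marking command itself is left in place here; its removal is the subject of steps S801 to S802 below.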
Referring to FIG. 7, in some embodiments, step S601 includes, but is not limited to, steps S701 to S702:
Step S701, selecting a sentence range according to the word to be marked in the original text to obtain a sentence selection range; the sentence selection range is the range of the sentence before the candidate key sentence in the original text;
step S702, selecting sentences to be marked from the original text according to the sentence selection range.
In step S701 of some embodiments, in order to make the labeling process of the word to be labeled more accurate, a sentence selection range is determined according to the word to be labeled in the original text, where the sentence selection range is a range of a sentence preceding the candidate key sentence in the original text. The sentence selection range is determined simply, and the sentence containing the word to be marked is not required to be searched in full text as the sentence to be marked.
It should be noted that, the sentence selection range is also determined according to the reference word, and if the reference word is "all of the foregoing", then all of the contents located before the candidate key sentence are used as the sentence selection range. If the term is "previous sentence", the sentence located in the previous sentence of the candidate key sentence is used as the sentence selection range. If the term is "previous paragraph", the content of the previous paragraph of the candidate key sentence is used as the sentence selection range.
In step S702 of some embodiments, after the sentence selection range is determined, the sentence to be marked, which contains the word to be marked, is selected from the original text according to that range. When the delimited range is the previous sentence, that sentence is selected directly as the sentence to be marked instead of marking the word throughout the whole original text, which makes the marking more accurate and more efficient.
In the steps S701 to S702 shown in the embodiment of the present application, the range of the sentence preceding the candidate key sentence in the original text is determined as the sentence selection range, and then the sentence to be marked is selected from the original text according to the sentence selection range, so that the sentence to be marked is selected more simply. Meanwhile, after the statement to be marked is determined, the marking processing range is limited, so that the marking is more accurate and more efficient.
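The mapping from reference word to sentence selection range can be sketched as follows; the reference word strings are assumed English renderings, and the "previous paragraph" case described above would be handled analogously.

```python
def sentence_selection_range(sentences: list[str], key_index: int,
                             reference_word: str) -> list[str]:
    """Step S701: choose the sentences before the candidate key sentence
    (at position key_index) that may contain the word to be marked."""
    if reference_word == "all of the foregoing":
        return sentences[:key_index]
    if reference_word == "the previous sentence":
        return sentences[max(0, key_index - 1):key_index]
    if reference_word == "the previous two sentences":
        return sentences[max(0, key_index - 2):key_index]
    return []

sentences = ["Today I want to make an important point",
             "a good way of speaking must be one with clear articulation",
             "mark the above clear articulation in red"]
print(sentence_selection_range(sentences, 2, "the previous sentence"))
# -> ['a good way of speaking must be one with clear articulation']
```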
Referring to FIG. 8, in some embodiments, after step S106, the voice-to-text method may further include, but is not limited to, steps S801 to S802:
step S801, obtaining the position information of the punctuation mark preceding the candidate key sentence in the target text to obtain symbol position information;
step S802, eliminating the punctuation mark of the target text according to the symbol position information, and eliminating the candidate key sentence in the target text, so as to update the target text.
In step S801 of some embodiments, after the target text is obtained by replacing the sentence to be marked in the original text with the target sentence, the target text still contains the candidate key sentence and a superfluous punctuation mark. The position information of the punctuation mark preceding the candidate key sentence in the target text must first be acquired as the symbol position information, so that the position of the superfluous punctuation mark is determined from it.
For example, if the candidate key sentence is sentence 4 of paragraph 2, the position before sentence 4 of paragraph 2 is taken as the symbol position information, and the punctuation mark there, ",", is selected.
In step S802 of some embodiments, the punctuation mark in the target text is located according to the symbol position information and removed, and the candidate key sentence in the target text is removed as well, so as to update the target text and obtain the final text.
For example, if the target text is "Today I want to make an important point: a good way of speaking must be one with clear articulation, underline the above clear articulation." with "clear articulation" underlined, then removing the punctuation mark corresponding to the symbol position information and the candidate key sentence "underline the above clear articulation" yields the updated text "Today I want to make an important point: a good way of speaking must be one with clear articulation.", with "clear articulation" underlined.
In steps S801 to S802 shown in this embodiment of the application, after the target text is constructed, the punctuation mark corresponding to the symbol position information is removed from the target text, and the candidate key sentence is removed as well, updating the target text into the final text. Removing the redundant sentence and punctuation produces a final text that matches the speaker's intent, so that a reader browsing the final text grasps that intent more intuitively.
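A sketch of steps S801 to S802: locate the candidate key sentence in the target text, step back over the punctuation just before it, and cut both out. The helper name and the treatment of "," and ";" as preceding symbols are assumptions.

```python
def remove_marking_command(target_text: str, candidate_key_sentence: str) -> str:
    """Steps S801-S802: drop the punctuation mark immediately before the
    candidate key sentence, then drop the sentence itself."""
    pos = target_text.find(candidate_key_sentence)  # symbol position information
    if pos == -1:
        return target_text
    start = pos
    while start > 0 and target_text[start - 1] in " ,;":
        start -= 1                                  # step back over ", " etc.
    return target_text[:start] + target_text[pos + len(candidate_key_sentence):]

marked = ("Today I want to make an important point: a good way of speaking must "
          'be one with <span style="text-decoration: underline;">clear '
          "articulation</span>, underline the above clear articulation.")
print(remove_marking_command(marked, "underline the above clear articulation"))
# -> ...clear articulation</span>.
```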
It should be noted that the final text is displayed by combining the target marking code with the word to be marked into a target code and loading the target code in a preset display or browser. The preset browser may be an APP client or a web client. For example, if the target code is 'A good way of speaking must be one with <span style="color: rgb(255, 0, 0);">clear articulation</span>', the preset browser displays "A good way of speaking must be one with clear articulation" with "clear articulation" in red. Likewise, if the target code is 'The age range of the customers served by insurance product A is <span style="color: rgb(255, 0, 0);">50-60 years old</span>', the browser displays "The age range of the customers served by insurance product A is 50-60 years old" with "50-60 years old" in a red font.
Specifically, a text editor is provided in the preset browser; the text editor parses the target code into the corresponding style content, marks the word to be marked according to that style content to obtain the final text, and renders the final text for display.
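To illustrate what the browser-side text editor ultimately receives, the sketch below wraps a complete target code in a minimal HTML page; the file name and page skeleton are assumptions.

```python
from pathlib import Path

target_code = ("A good way of speaking must be one with "
               '<span style="color: rgb(255, 0, 0);">clear articulation</span>.')

# A browser (APP or web client) rendering this file shows "clear articulation" in red.
html_page = f"<!DOCTYPE html><html><body><p>{target_code}</p></body></html>"
Path("final_text.html").write_text(html_page, encoding="utf-8")
```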
In this embodiment of the application, the speaker's words are first recorded by a recording device to obtain voice data, and the voice data is sent to a server wirelessly. The server then performs content recognition on the voice data by automatic speech recognition to obtain the original text.
After the voice data is converted into the original text, each word in the original text is segmented, the word feature of each word is determined from the word feature database to obtain the original word features, and the target word features are selected from the original word features according to the reference word feature, the marker word feature, and the effect word feature. The key word and marking effect word matching the target word features are extracted from the original text; the key word is taken as the word information to be marked and the marking effect word as the marking effect information.
The sentence preceding the candidate key sentence in the original text is then determined as the word search range according to the word information to be marked, and the matching word within that range is found as the word to be marked. The sentence selection range in the original text, that is, the sentence range for marking, is further determined according to the word to be marked, and the sentence to be marked is selected from the original text according to that range.
Next, the corresponding candidate marking code is determined from the code mapping relation according to the marking effect information as the target marking code; the target marking code and the word to be marked in the sentence to be marked are combined into the target code, which is displayed by the preset browser so that the word to be marked in the sentence to be marked is marked and the target sentence obtained. The sentence to be marked in the original text is replaced by the target sentence to obtain the target text; the position of the punctuation mark preceding the candidate key sentence in the target text is acquired as the symbol position information, the corresponding punctuation mark is removed according to it, and the candidate key sentence is removed as well, updating the target text into the final text.
In this way, after voice data is automatically converted into text, the selected key information is extracted from the original text according to the preset keyword features, and the words to be marked in the original text are automatically marked according to the selected key information, generating target text with marked key words and saving the labor of editing text by hand. By enabling voice data to be converted into text that carries marking effects, the method can be applied to the general field of voice-to-text conversion.
In particular, for the insurance service industry, where a service plan contains much key content that customers need to review, a service plan carrying marking effects can be generated from the service personnel's dictation alone, reducing the workload of producing the plan, making it convenient for customers to review, and improving the customer service experience.
Referring to FIG. 9, an embodiment of the present application further provides a device for converting voice into text, which can implement the above voice-to-text method, where the device includes:
a data acquisition module 901, configured to acquire voice data;
a content recognition module 902, configured to perform content recognition on the voice data to obtain an original text;
an information extraction module 903, configured to extract key information from the original text according to preset keyword features to obtain selected key information; wherein the selected key information comprises: word information to be marked and marking effect information;
a word screening module 904, configured to screen the word to be marked from the original text according to the word information to be marked;
an operation screening module 905, configured to screen out a target marking operation from preset candidate marking operations according to the marking effect information;
a word processing module 906, configured to perform marking processing on the word to be marked in the original text according to the target marking operation to obtain a target text.
The specific implementation of the voice-to-text device is substantially the same as the specific embodiment of the voice-to-text method described above, and will not be described herein.
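As an illustrative aid, the following minimal sketch shows one way the modules of fig. 9 could be composed in code; the callable types and their signatures are assumptions made for this sketch, not part of the disclosed device.

```python
from typing import Callable, Tuple

# Each module of fig. 9 is modeled as a callable. These type aliases and
# signatures are illustrative assumptions, not the disclosed implementation.
Recognizer = Callable[[bytes], str]                    # content recognition module 902
Extractor = Callable[[str], Tuple[str, str]]           # information extraction module 903
WordScreener = Callable[[str, str], str]               # word screening module 904
OperationScreener = Callable[[str], Callable[[str, str], str]]  # operation screening module 905


class SpeechToTextDevice:
    def __init__(self, recognize: Recognizer, extract: Extractor,
                 screen_word: WordScreener, screen_operation: OperationScreener):
        self.recognize = recognize
        self.extract = extract
        self.screen_word = screen_word
        self.screen_operation = screen_operation

    def convert(self, voice_data: bytes) -> str:
        # Module 901 has already acquired voice_data; module 902 recognizes it.
        original_text = self.recognize(voice_data)
        # Module 903 yields word information to be marked and marking effect information.
        word_info, effect_info = self.extract(original_text)
        # Module 904 screens the word to be marked from the original text.
        word_to_mark = self.screen_word(original_text, word_info)
        # Module 905 screens the target marking operation.
        mark = self.screen_operation(effect_info)
        # Module 906 applies the marking processing to obtain the target text.
        return mark(original_text, word_to_mark)
```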
The embodiment of the application also provides computer equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor implements the voice-to-text method when executing the computer program. The computer equipment can be any intelligent terminal, including a tablet computer, a vehicle-mounted computer, and the like.
Referring to fig. 10, fig. 10 illustrates a hardware structure of a computer device according to another embodiment, where the computer device includes:
the processor 1001 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 1002 may be implemented in the form of a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM). The memory 1002 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present application are implemented in software or firmware, the relevant program codes are stored in the memory 1002 and invoked by the processor 1001 to execute the voice-to-text method of the embodiments of the present application;
an input/output interface 1003 for implementing information input and output;
the communication interface 1004 is configured to implement communication interaction between this device and other devices, either in a wired manner (e.g., USB, network cable) or in a wireless manner (e.g., mobile network, Wi-Fi, Bluetooth);
a bus 1005 for transferring information between the various components of the device (e.g., the processor 1001, the memory 1002, the input/output interface 1003, and the communication interface 1004);
wherein the processor 1001, the memory 1002, the input/output interface 1003, and the communication interface 1004 are communicatively connected to one another inside the device through the bus 1005.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program, when executed by a processor, implements the voice-to-text method.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments of the present application provide a method, a device, computer equipment, and a storage medium for converting voice into text. Voice data is converted into an original text, and the content of the original text that matches the keyword features is taken as the selected key information. Words in the original text are then automatically marked according to the selected key information, so as to generate a text with its key sentences marked; no manual text editing is required, saving the labor of marking key content by hand. A text with a marking effect is thus generated from voice data. For the insurance service industry in particular, an insurance service scheme with a marking effect can be generated merely by dictating the scheme, which makes it convenient for clients to review the key content and reduces the workload of insurance service personnel.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by persons skilled in the art that the embodiments of the application are not limited to what is illustrated in the figures; more or fewer steps than those shown may be included, certain steps may be combined, or the steps may be divided differently.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A is present, only B is present, and both A and B are present, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be singular or plural.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method of the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and are not thereby limiting the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method for converting speech to text, the method comprising:
acquiring voice data;
performing content recognition on the voice data to obtain an original text;
extracting key information from the original text according to preset key word characteristics to obtain selected key information; wherein the selected key information comprises: word information to be marked and marking effect information;
screening words to be marked from the original text according to the word information to be marked;
screening out target marking operations from preset candidate marking operations according to the marking effect information;
and marking the words to be marked in the original text according to the target marking operation to obtain a target text.
2. The method according to claim 1, wherein the extracting key information from the original text according to the preset key word features to obtain the selected key information includes:
performing word feature division on the original text to obtain original word features;
screening target word features from the original word features according to the keyword features;
and extracting text content from the original text according to the target word characteristics to obtain the selected key information.
3. The method according to claim 2, wherein said extracting text content from said original text according to said target word characteristics to obtain said selected key information comprises:
extracting sentences matching the target word features from the original text to obtain candidate key sentences; wherein the candidate key sentences include: key words and marking effect words;
and constructing the selected key information according to the key words and the marking effect words.
4. A method according to any one of claims 1 to 3, wherein the marking the word to be marked in the original text according to the target marking operation to obtain target text includes:
extracting sentences containing the words to be marked from the original text to obtain sentences to be marked;
marking the word to be marked in the sentence to be marked according to the target marking operation to obtain a target sentence;
and replacing the sentence to be marked in the original text with the target sentence to obtain the target text.
5. The method according to claim 4, wherein extracting the sentence containing the word to be marked from the original text to obtain the sentence to be marked comprises:
selecting a sentence range in the original text according to the word to be marked to obtain a sentence selection range; wherein the sentence selection range is the range of the sentence preceding the candidate key sentence in the original text;
and selecting the sentence to be marked from the original text according to the sentence selection range.
6. The method of claim 3, wherein after the word to be marked in the original text is marked according to the target marking operation, the method further comprises:
acquiring the position information of punctuation marks before the candidate key sentences in the target text to obtain symbol position information;
and eliminating the punctuation marks in the target text according to the symbol position information, and eliminating the candidate key sentences in the target text, so as to update the target text.
7. The method of claim 5, wherein before extracting the key information from the original text according to the preset key word characteristics to obtain the selected key information, the method further comprises:
the construction of the keyword features specifically comprises the following steps:
acquiring a preset marking rule; wherein the marking rule includes: an index word feature, a marker word feature, and an effect word feature;
and combining the index word feature, the marker word feature, and the effect word feature to obtain the keyword features.
8. A speech-to-text apparatus, the apparatus comprising:
the data acquisition module is used for acquiring voice data;
the content recognition module is used for carrying out content recognition on the voice data to obtain an original text;
the information extraction module is used for extracting key information from the original text according to preset key word characteristics to obtain selected key information; wherein the selected key information comprises: word information to be marked and marking effect information;
the word screening module is used for screening words to be marked from the original text according to the word information to be marked;
the operation screening module is used for screening out target marking operations from preset candidate marking operations according to the marking effect information;
and the word processing module is used for carrying out marking processing on the word to be marked in the original text according to the target marking operation to obtain a target text.
9. A computer device, characterized in that it comprises a memory storing a computer program and a processor implementing the speech-to-text method according to any of claims 1 to 7 when the computer program is executed by the processor.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the speech-to-text method of any one of claims 1 to 7.
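Purely as an illustrative aid following the claims, and not as part of the claim language itself, the sketch below shows one way the keyword-feature construction recited in claim 7 and the feature screening of claims 2 and 3 could be realized; the class, the functions, and the word lists are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KeywordFeatures:
    index_words: frozenset    # index word feature, e.g. "please"
    marker_words: frozenset   # marker word feature, e.g. "highlight"
    effect_words: frozenset   # effect word feature, e.g. "bold"

def build_keyword_features() -> KeywordFeatures:
    # Claim 7: acquire the preset marking rule and combine its index word,
    # marker word, and effect word features into the keyword features.
    return KeywordFeatures(
        index_words=frozenset({"please"}),
        marker_words=frozenset({"highlight", "mark"}),
        effect_words=frozenset({"bold", "underline", "red"}),
    )

def screen_target_word_features(original_words, kf: KeywordFeatures):
    # Claims 2-3: after word feature division of the original text, screen
    # the target word features according to the keyword features.
    vocabulary = kf.index_words | kf.marker_words | kf.effect_words
    return [w for w in original_words if w.lower() in vocabulary]

print(screen_target_word_features(
    "Please highlight premium in bold".split(), build_keyword_features()))
# -> ['Please', 'highlight', 'bold']
```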
CN202310834751.2A 2023-07-07 2023-07-07 Voice-to-text method and device, computer equipment and storage medium Pending CN116631400A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310834751.2A CN116631400A (en) 2023-07-07 2023-07-07 Voice-to-text method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310834751.2A CN116631400A (en) 2023-07-07 2023-07-07 Voice-to-text method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116631400A true CN116631400A (en) 2023-08-22

Family

ID=87617304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310834751.2A Pending CN116631400A (en) 2023-07-07 2023-07-07 Voice-to-text method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116631400A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117540730A (en) * 2023-10-10 2024-02-09 Peng Cheng Laboratory Text labeling method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination