CN113378530A - Voice editing method and device, equipment and medium - Google Patents

Voice editing method and device, equipment and medium

Info

Publication number
CN113378530A
CN113378530A
Authority
CN
China
Prior art keywords
voice
editing
word
text
edited
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110717969.0A
Other languages
Chinese (zh)
Inventor
殷元江 (Yin Yuanjiang)
高发宝 (Gao Fabao)
马添翼 (Ma Tianyi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qiwei Visual Media Technology Co., Ltd.
Original Assignee
Beijing Qiwei Visual Media Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qiwei Visual Media Technology Co., Ltd.
Priority to CN202110717969.0A
Publication of CN113378530A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure provides a voice editing method and apparatus, device and medium. The voice editing method comprises the following steps: determining at least one candidate word from a text to be edited according to received first voice information in a voice editing mode; marking each candidate word in the at least one candidate word with a corresponding identifier; determining a target identifier from the identifiers of the at least one candidate word according to received second voice information; and editing the target word marked by the target identifier in the at least one candidate word according to received third voice information.

Description

Voice editing method and device, equipment and medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a voice editing method, apparatus, electronic device, computer-readable storage medium, and computer program product.
Background
Speech input, also called voice-controlled input, is an input method that automatically recognizes a user's speech as text. Existing voice input software supports only one-shot input: the text is generated once the user finishes speaking. However, the accuracy of speech input is easily affected by environmental noise, the user's accent, homophones, and the like, so the recognized text is often not the text the user intended. In that case, the user has to correct the text by manual input, which is cumbersome, inconvenient, and degrades the user experience. In public environments, touch-based manual input may also pose a health risk, and it is an outright obstacle for users who cannot conveniently type.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides a voice editing method, apparatus, electronic device, computer-readable storage medium, and computer program product.
According to an aspect of the present disclosure, there is provided a voice editing method including: determining at least one candidate word from a text to be edited according to received first voice information in a voice editing mode; marking each candidate word in the at least one candidate word with a corresponding identifier; determining a target identifier from the identifiers of the at least one candidate word according to received second voice information; and editing the target word marked by the target identifier in the at least one candidate word according to received third voice information.
According to another aspect of the present disclosure, there is also provided a voice editing apparatus including: a first positioning module configured to determine at least one candidate word from a text to be edited according to received first voice information in a voice editing mode; a marking module configured to mark each candidate word in the at least one candidate word with a corresponding identifier; a second positioning module configured to determine a target identifier from the identifiers of the at least one candidate word according to received second voice information; and an editing module configured to edit the target word marked by the target identifier in the at least one candidate word according to received third voice information.
According to another aspect of the present disclosure, there is also provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program which, when executed by the at least one processor, implements the voice editing method described above.
According to another aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the voice editing method described above.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the voice editing method described above.
According to one or more embodiments of the present disclosure, at least one candidate word is determined from the text to be edited according to first voice information, each candidate word is marked with a corresponding identifier, a target identifier is determined according to second voice information, and the target word marked by the target identifier is edited according to third voice information. A user can therefore accurately locate and edit the position to be edited (i.e., the target word) merely by issuing voice instructions. The method is simple and convenient to operate, requires no manual input, avoids the health risks and inconvenience of manual input, and improves the user experience.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the present disclosure and, together with the description, serve to explain exemplary implementations of those embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 shows a flow diagram of a voice editing method according to an embodiment of the present disclosure;
FIGS. 2A-2L illustrate schematic diagrams of exemplary voice editing interfaces according to an embodiment of the present disclosure;
FIG. 3 shows a block diagram of a voice editing apparatus according to an embodiment of the present disclosure; and
FIG. 4 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
FIG. 1 shows a flowchart of a voice editing method 100 according to an embodiment of the present disclosure. The method 100 may be performed in an electronic device; that is, the execution subject of the method 100 may be the electronic device. More specifically, input method software installed in the electronic device may instruct a processor of the electronic device to perform the method 100. In some embodiments, the electronic device may be any type of mobile computing device, including but not limited to a mobile computer, a mobile phone, or a smart wearable device (e.g., a smart watch, smart glasses, etc.). In other embodiments, the electronic device may be any type of stationary computing device, including but not limited to a desktop computer or a server computer. Embodiments of an electronic device for performing the method 100 are described in detail below.
As shown in FIG. 1, the voice editing method 100 may include: step 110, in a voice editing mode, determining at least one candidate word from a text to be edited according to received first voice information; step 120, marking each candidate word in the at least one candidate word with a corresponding identifier; step 130, determining a target identifier from the identifiers of the at least one candidate word according to received second voice information; and step 140, editing the target word marked by the target identifier in the at least one candidate word according to received third voice information.
According to the embodiments of the present disclosure, at least one candidate word is determined from the text to be edited according to the first voice information, each candidate word is marked with a corresponding identifier, the target identifier is determined according to the second voice information, and the target word marked by the target identifier is edited according to the third voice information, so that the user can accurately locate the position to be edited (i.e., the target word) and edit it merely by issuing voice instructions.
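To make the four-step flow concrete, the following is a minimal, illustrative Python sketch. It assumes the three voice inputs have already been transcribed by an upstream speech recognizer, and it substitutes exact string matching for the homophone or similarity matching described below; all names (voice_edit, query, chosen_ids) are hypothetical and are not part of the disclosure.

    import re

    def voice_edit(text: str, query: str, chosen_ids: list[int],
                   operation: str, edit_word: str) -> str:
        # Sketch of steps 110-140, assuming speech is pre-transcribed.
        # Step 110: candidate words are the occurrences of the query (a real
        # system would match homophones or similar-sounding words instead).
        spans = [m.span() for m in re.finditer(re.escape(query), text)]
        # Step 120: the i-th candidate is marked with identifier i + 1.
        # Steps 130/140: edit the chosen candidates, right to left so that
        # earlier spans keep their character offsets.
        for ident in sorted(chosen_ids, reverse=True):
            start, end = spans[ident - 1]
            if operation == "modify":
                text = text[:start] + edit_word + text[end:]
            elif operation == "add":  # the disclosure also allows "before"
                text = text[:end] + edit_word + text[end:]
            elif operation == "delete":
                text = text[:start] + text[end:]
        return text

For example, voice_edit("a mosquito bit a mosquito", "mosquito", [2], "modify", "text") rewrites only the second occurrence.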
The various steps of method 100 are described in detail below.
Referring to fig. 1, in step 110, in a voice editing mode, at least one candidate word is determined from a text to be edited according to received first voice information.
According to some embodiments, the voice editing mode may be entered in response to a voice editing trigger command issued by the user. That is, when the user speaks a voice editing trigger command, the electronic device receives it (e.g., through a voice input device such as a microphone) and enters the voice editing mode in response.
In some embodiments, the voice editing trigger command may be any preset voice command, such as "text modification", "text editing", "enter editing mode", and the like. In other embodiments, the voice editing trigger command may also be any voice containing a preset keyword, for example, the preset keyword is "edit", and accordingly, any voice containing "edit" issued by the user is regarded as the voice editing trigger command, and the voice editing trigger command may be, for example, "enter editing mode", "enter voice editing mode", "open editing mode", "perform text editing", and the like.
The voice editing mode is used to edit text through the user's voice. According to some embodiments, the electronic device may provide other working modes in addition to the voice editing mode, such as a voice input mode. In the voice input mode, the electronic device performs speech recognition on the speech uttered by the user to generate corresponding text.
Different working modes can be switched between by corresponding voice commands. For example, the voice input mode is entered in response to a voice input trigger command, and the voice editing mode is entered in response to a voice editing trigger command. Similar to the voice editing trigger command, the voice input trigger command may be any preset voice command, such as "voice input", "voice-controlled input", or "voice typing"; it may also be any voice containing a preset keyword, such as "enter voice input mode", "open voice input mode", or "perform voice input", each of which contains the preset keyword "voice input". In some embodiments, besides the voice input mode and the voice editing mode, other working modes may be provided, such as an expression input mode or a skin setting mode, each entered through a corresponding voice command.
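Such keyword-based switching can be sketched as below; the trigger phrases and mode names are illustrative assumptions, since the disclosure leaves the concrete commands configurable.

    # Hypothetical trigger keywords mapped to working modes. The disclosure
    # allows any preset command or keyword; these strings are examples only.
    TRIGGERS = {
        "voice input": "voice_input_mode",
        "edit": "voice_editing_mode",
        "expression": "expression_input_mode",
    }

    def dispatch_mode(transcript: str, current_mode: str) -> str:
        # Any utterance containing a trigger keyword switches the mode;
        # anything else leaves the current working mode unchanged.
        for keyword, mode in TRIGGERS.items():
            if keyword in transcript.lower():
                return mode
        return current_mode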
In a typical scenario, the user inputs speech in the voice input mode, and the electronic device performs speech recognition on that speech to generate corresponding text. Due to ambient noise, the user's accent, or other factors, the generated text may be inaccurate and fail to meet the user's expectations. The user can then speak the voice editing trigger command to enter the voice editing mode and execute steps 110 to 140 of the voice editing method 100 of the present disclosure to edit and correct the text generated in the voice input mode. That is, the text to be edited in step 110 is the text generated in the voice input mode, i.e., it is obtained by performing speech recognition on the speech input by the user in the voice input mode.
FIG. 2A illustrates a schematic diagram of an exemplary voice editing interface 200A according to an embodiment of the present disclosure. An exemplary text to be edited is shown in area 210 of the interface 200A. The text was obtained by performing speech recognition on speech input by the user in the voice input mode, and it contains several errors, for example, "text" misrecognized as its homophone "mosquito" (both pronounced "wenzi" in the original Chinese) and misused punctuation marks.
In step 110, when the electronic device receives first voice information input by a user, at least one candidate word is determined from a text to be edited according to the received first voice information.
The first voice information is speech uttered by the user to indicate the word the user wants to edit. For example, if the user wants to edit the word "mosquito" in the text to be edited, the first voice information is the user speaking "mosquito".
In some embodiments, a prompt may be output to the user by text or voice to guide the user to input the first voice information. For example, a voice prompt "please say the word or phrase you want to edit" may be played to the user.
There are a number of ways to determine the at least one candidate word in step 110. According to some embodiments, at least one candidate word may be determined from the text to be edited by means of text matching, namely: performing voice recognition on the first voice information to obtain a set of homophones of the first voice information; and determining at least one word belonging to the set of homophones in the text to be edited as the at least one candidate word.
For example, performing speech recognition on the first voice information yields the set of homophones of its pronunciation "wenzi"; the obtained set may include, for example, "文字" (text) and "蚊子" (mosquito), among other words pronounced "wenzi". The words in the text to be edited that belong to this set are then determined as candidate words. For example, in the text to be edited shown in FIG. 2A, checking whether each word belongs to the homophone set of the first voice information yields three candidate words: the "text" in the first row and the "mosquito" in the third and ninth rows.
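A minimal sketch of this homophone lookup follows, assuming the recognizer returns a pinyin string and that a pronunciation dictionary is available; the one-entry HOMOPHONES table is a hypothetical stand-in for a real lexicon.

    # Hypothetical homophone dictionary keyed by pinyin; a real system would
    # derive this from a pronunciation lexicon rather than hard-code it.
    HOMOPHONES = {
        "wenzi": {"文字", "蚊子"},  # words sharing the pronunciation "wenzi"
    }

    def candidates_by_homophone(text: str, pinyin: str) -> list[tuple[int, str]]:
        # Return (position, word) pairs for every occurrence in the text of
        # a word belonging to the homophone set of the recognized sound.
        hits = []
        for word in HOMOPHONES.get(pinyin, set()):
            start = text.find(word)
            while start != -1:
                hits.append((start, word))
                start = text.find(word, start + 1)
        return sorted(hits)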
According to further embodiments, the at least one candidate word may be determined from the text to be edited by means of speech matching, namely: determining the similarity between the speech of each word in the text to be edited and the first voice information; and determining, based on the determined similarities, at least one word whose similarity is greater than a preset threshold among all words in the text to be edited as the at least one candidate word.
For example, the speech features of each word in the text to be edited and the speech features of the first voice information may be acquired. The similarity between the speech of each word and the first voice information is then computed from the corresponding features (for example, as the cosine similarity between a word's speech feature vector and that of the first voice information), and words whose similarity exceeds a preset threshold are taken as candidate words. The speech feature may be, for example, an audio feature such as MFCC (Mel-Frequency Cepstral Coefficients) or PLP (Perceptual Linear Prediction), or a pinyin feature. The value of the preset threshold can be set by a person skilled in the art according to the actual situation. For example, in the text to be edited shown in FIG. 2A, determining the similarity between the speech of each word and the first voice information again yields three candidate words: the "text" in the first row and the "mosquito" in the third and ninth rows.
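A sketch of the similarity test follows, under the assumption that each word's speech has already been reduced to a fixed-length feature vector (for instance a mean MFCC vector); the 0.8 threshold is an illustrative value, since the disclosure leaves the threshold to the implementer.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Cosine similarity between two speech feature vectors.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def candidates_by_similarity(word_features: dict[str, np.ndarray],
                                 query_feature: np.ndarray,
                                 threshold: float = 0.8) -> list[str]:
        # Words whose speech is similar enough to the first voice information
        # become candidate words; the threshold value here is illustrative.
        return [word for word, feat in word_features.items()
                if cosine_similarity(feat, query_feature) > threshold]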
After at least one candidate word has been determined in step 110, step 120 is performed.
In step 120, each candidate word of the at least one candidate word is marked with a corresponding identifier.
Each candidate word corresponds to an identifier, and different candidate words have different identifiers. The identifier may be any symbol. Since the target identifier is determined from the user's second voice information in the subsequent step 130, the identifier is preferably a character that has a pronunciation, so that the user can specify the target identifier by voice. For example, the identifier may be a number.
For example, under the interface 200A shown in FIG. 2A, the user utters the pronunciation of "mosquito" (i.e., the first voice information), and three candidate words are determined from the text to be edited based on it: the "text" in the first row and the "mosquito" in the third and ninth rows. The identifiers ①, ② and ③ may be used to mark the three candidate words, and the marking result is presented to the user, yielding the interface 200B shown in FIG. 2B. As shown in FIG. 2B, the identifiers corresponding to the "text" in the first row, the "mosquito" in the third row, and the "mosquito" in the ninth row are ①, ② and ③, respectively. According to some embodiments, for better presentation and to help the user select among candidate words, each candidate word and its corresponding identifier may be displayed in a format different from the other words in the text to be edited, for example highlighted, or shown in bold, italics, or a different color.
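A sketch of the marking step follows, using circled digits as identifiers as in FIG. 2B; the assumption that candidates arrive as (start, end) character spans sorted left to right is ours, not the disclosure's.

    def mark_candidates(text: str, spans: list[tuple[int, int]]) -> str:
        # Insert identifier i+1 (as a circled digit) after the i-th candidate.
        marks = "\u2460\u2461\u2462\u2463\u2464"  # ① .. ⑤; five suffice here
        out, prev = [], 0
        for i, (start, end) in enumerate(spans):
            out.append(text[prev:end])
            out.append(marks[i])
            prev = end
        out.append(text[prev:])
        return "".join(out)

For example, mark_candidates("wen zi wen zi", [(0, 6), (7, 13)]) yields "wen zi① wen zi②".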
Based on the labeling of each candidate word at step 120, step 130 is performed.
In step 130, a target identifier is determined from the identifiers of the at least one candidate word according to the received second voice information.
The second voice information is speech uttered by the user to indicate the identifier the user selects. In some embodiments, a prompt may be output to the user by text or voice to guide the user to input the second voice information; for example, a voice prompt "please say the number of the word you select" may be played. Since each identifier corresponds to one candidate word, the user can pinpoint the candidate word to be edited among the plurality of candidate words by uttering second voice information indicating its identifier (hereinafter, the candidate word the user wants to edit is referred to as the "target word").
According to some embodiments, the target identifier and the target word marked by the target identifier may be highlighted in a different format than the other identifiers and the candidate words marked by the other identifiers, for example, the target identifier and the target word marked by the target identifier may be configured to flash at a preset frequency, or the target identifier and the target word marked by the target identifier may be displayed in a larger font size, a bolded font, or the like.
According to some embodiments, when the identifiers are numbers, the number corresponding to the second voice information may be determined as the target identifier from the identifiers of the at least one candidate word by performing speech recognition on the second voice information. For example, under the interface 200B shown in FIG. 2B, the user utters the voice corresponding to the identifier "②" (i.e., the second voice information); the electronic device performs speech recognition on the voice and determines that the corresponding number is 2, so the target identifier is "②". The target identifier "②" and its corresponding target word "mosquito" can be displayed in a blinking manner.
In some embodiments, the second voice information uttered by the user may name multiple identifiers, in which case multiple target identifiers are determined. For example, under the interface 200B shown in FIG. 2B, the user utters the voices corresponding to the identifiers "②" and "③" (i.e., the second voice information); the electronic device performs speech recognition on the voices, determines that the corresponding numbers are 2 and 3, and thus determines two target identifiers, "②" and "③". The target identifiers "②" and "③" and their corresponding target words "mosquito" can be displayed in a blinking manner.
Based on the target identifier determined in step 130, step 140 is performed.
In step 140, the target word marked by the target identifier in the at least one candidate word is edited according to the received third voice information.
The third voice information is speech uttered by the user to indicate the editing operation the user wants to perform on the target word. According to some embodiments, the editing operation and the editing word may be determined by performing speech recognition on the third voice information, and the editing operation is then performed on the target word using the editing word.
In some embodiments, a prompt may be output to the user by text or voice to guide the user to input the third voice information. For example, a voice prompt "please say the editing operation you want to perform" may be played to the user.
Editing operations may include various types of operations such as modify, add, and delete. The editing word is the word used to edit the target word. Accordingly, performing an editing operation on the target word with the editing word may include modifying the target word into the editing word, adding the editing word after or before the target word, or deleting the target word.
In particular, according to some embodiments, the editing operation comprises modification, and accordingly, performing the editing operation on the target word comprises modifying the target word into the editing word. In this case, the third voice information may be, for example, the user uttering "modify" followed by the editing word. For example, the user selects the "mosquito" in the third and ninth rows as target words by uttering the voices corresponding to "②" and "③" (i.e., the second voice information) in the interface 200B shown in FIG. 2B. Subsequently, the user utters "modify, text" (i.e., the third voice information); the electronic device performs speech recognition on the voice, determines that the editing operation is modify and the editing word is "text", and accordingly modifies the target words "mosquito" in the third and ninth rows into the editing word "text", yielding the interface 200C shown in FIG. 2C.
According to further embodiments, the editing operation comprises addition, and accordingly, performing the editing operation on the target word comprises adding the editing word after or before the target word. In this case, the third voice information may be, for example, the user uttering "add" followed by the editing word. Whether the editing word is added after or before the target word may be preset by a person skilled in the art according to the actual situation. For example, the user selects the "text" in the first row as the target word by uttering the voice corresponding to "①" (i.e., the second voice information) in the interface 200B shown in FIG. 2B. Subsequently, the user utters "add, content" (i.e., the third voice information); the electronic device performs speech recognition on the voice, determines that the editing operation is add and the editing word is "content", and accordingly adds the editing word "content" after the "text" in the first row.
It will be appreciated that in some cases the electronic device may be unable to accurately recognize the editing word from the third voice information due to environmental noise, the user's accent, homophones, or other factors. In this case, a list of candidate editing words corresponding to the third voice information may be presented to the user, and the editing word is selected from the list according to fourth voice information input by the user. That is: performing speech recognition on the third voice information to determine the editing operation and a list of candidate editing words; and determining the editing word from the list of candidate editing words according to the received fourth voice information.
For example, the user utters "voice input" (i.e., the first voice information) under the interface 200A shown in FIG. 2A; the electronic device determines the three occurrences of "voice input" in the text to be edited as candidate words, marks them with the identifiers ①, ② and ③, and highlights the candidate words and their identifiers, yielding the interface 200D shown in FIG. 2D. The user then utters "②" (i.e., the second voice information); the electronic device determines that the corresponding target identifier is "②", so the target word is the second "voice input" in the text to be edited. The target identifier "②" and its corresponding target word "voice input" may be displayed in a blinking manner. Subsequently, the user utters "add, all" (i.e., the third voice information); the electronic device determines that the editing operation is add, but the editing word is ambiguous, so it presents a list of candidate editing words, "1. all; 2. pocket; 3. tremble; 4. bean; 5. comma" (near-homophones of "dou" in the original Chinese), as shown in the interface 200E of FIG. 2E, which includes the candidate editing word list 220. The user then utters "1" (i.e., the fourth voice information); the electronic device takes the corresponding candidate "all" as the editing word and adds it after the target word, i.e., the "voice input" marked by the identifier "②", yielding the interface 200F shown in FIG. 2F.
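A sketch of this disambiguation step follows, assuming the fourth voice information arrives pre-transcribed as a digit string; the candidate list and all names are illustrative.

    def choose_edit_word(candidates: list[str], spoken_digit: str) -> str | None:
        # Map the recognized digit ("1", "2", ...) onto the numbered list of
        # candidate editing words; return None if nothing sensible was said.
        try:
            index = int(spoken_digit) - 1
        except ValueError:
            return None
        return candidates[index] if 0 <= index < len(candidates) else None

For example, choose_edit_word(["all", "pocket", "tremble", "bean", "comma"], "1") returns "all".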
According to some embodiments, the method 100 further comprises: checking the symbols in the text to be edited one by one in response to a symbol polling voice command.
The symbol polling voice command can be any preset voice command, such as "symbol polling" or "punctuation polling". When the user speaks the symbol polling voice command, the electronic device receives it and, in response, checks the symbols in the text to be edited in sequence.
According to some embodiments, sequentially checking the symbols in the text to be edited comprises: performing speech recognition on received fifth voice information to determine a target symbol corresponding to the fifth voice information; modifying the current symbol indicated by the cursor into the target symbol in response to the current symbol being different from the target symbol; and moving the cursor to the next symbol in the text to be edited.
For example, the user speaks a symbol polling voice command under the interface 200F shown in FIG. 2F. In response, the electronic device displays a cursor at the first symbol in the text to be edited, yielding the interface 200G shown in FIG. 2G or the interface 200H shown in FIG. 2H, so that the symbols in the text to be edited can be checked in sequence. It should be understood that the cursor can take various forms. For example, in the interface 200G shown in FIG. 2G, the cursor may appear as a vertical line 230 to the left of the symbol (and the vertical line 230 may be configured to blink at a preset frequency); in other embodiments, it may appear as a vertical line to the right of the symbol. As another example, in the interface 200H shown in FIG. 2H, the cursor may appear as a highlighted area 240 overlaying the symbol. The symbol patrol scheme of the present disclosure is illustrated below using the cursor form of FIG. 2G (i.e., a vertical line 230 to the left of the symbol).
In the interface 200G shown in FIG. 2G, the user utters "period" (i.e., the fifth voice information); the electronic device performs speech recognition on the voice and determines that the corresponding target symbol is a period ".". The current symbol indicated by the cursor 230 is a comma ",", which differs from the target symbol, so the current symbol is modified into the target symbol ".", and the cursor 230 then moves to the next symbol in the text to be edited, yielding the interface 200I shown in FIG. 2I. Subsequently, under the interface 200I, the user utters "comma" (the fifth voice information); the electronic device recognizes the voice and determines that the corresponding target symbol is a comma ",". The current symbol indicated by the cursor 230 is a comma ",", the same as the target symbol, so the cursor 230 moves directly to the next symbol in the text to be edited without any modification.
This process repeats until every symbol in the text to be edited has been checked, at which point the cursor 230 is located at the last symbol of the text to be edited, as shown in the interface 200J of FIG. 2J.
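A compact sketch of the patrol loop follows, assuming each fifth voice input has already been recognized into a target punctuation character; the punctuation set below is an illustrative assumption.

    # Punctuation characters the cursor visits; extend as needed.
    PUNCTUATION = set("，。、；：？！,.;:?!")

    def patrol_symbols(text: str, spoken_targets: list[str]) -> str:
        # Visit each symbol in order; replace it only when it differs from
        # the target symbol the user spoke at that position.
        chars = list(text)
        positions = [i for i, ch in enumerate(chars) if ch in PUNCTUATION]
        for pos, target in zip(positions, spoken_targets):
            if chars[pos] != target:
                chars[pos] = target
        return "".join(chars)

For example, patrol_symbols("你好，世界，", ["。", "，"]) corrects the first comma to a period and leaves the second unchanged.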
According to some embodiments, the method 100 further comprises: performing paragraph adjustment on the text to be edited in response to a paragraph adjustment voice command.
The paragraph adjustment voice command can be any preset voice command, such as "paragraph adjustment" or "paragraph setting". When the user speaks a paragraph adjustment voice command, the electronic device receives it and, in response, performs paragraph adjustment on the text to be edited.
According to some embodiments, paragraph adjustment of the text to be edited comprises: determining a paragraph adjustment operation according to received sixth voice information; and performing the paragraph adjustment operation on the text to be edited. Paragraph adjustment operations include, for example, indenting the first line, breaking a line within a paragraph, and adjusting paragraph spacing. By performing speech recognition on the sixth voice information, the corresponding paragraph adjustment operation is determined and then performed on the text to be edited.
For example, the user speaks a paragraph adjustment voice command under the interface 200J shown in FIG. 2J. In response, the electronic device may play the voice prompt "please say the paragraph adjustment operation you want to perform". The user then utters "indent the first line by two characters" (i.e., the sixth voice information); the electronic device performs speech recognition on the voice, determines the corresponding paragraph adjustment operation, and performs it on the text to be edited, i.e., indents the first line by two characters, yielding the interface 200K shown in FIG. 2K. Under the interface 200K, the user utters "start a new line at the eleventh character of the sixth line"; the electronic device recognizes the voice, determines the corresponding paragraph adjustment operation, and performs it on the text to be edited, i.e., starts a new line at the eleventh character of the sixth line, yielding the interface 200L shown in FIG. 2L.
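The two paragraph adjustment operations exercised above can be sketched as small text transforms; the use of full-width spaces for the indent is an assumption suited to Chinese text, not something the disclosure mandates.

    def indent_first_line(paragraph: str, n_chars: int = 2) -> str:
        # "Indent the first line by two characters": prepend full-width spaces.
        return "\u3000" * n_chars + paragraph

    def break_line_at(line: str, char_index: int) -> str:
        # "Start a new line at the n-th character" (1-based), as in the
        # sixth-line example above.
        return line[:char_index - 1] + "\n" + line[char_index - 1:]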
Based on the above embodiments, the voice editing method of the present disclosure enables switching between and operating in different working modes. For example, in the voice input mode, the user's speech is converted into text; after switching to the voice editing mode, the text to be edited can be edited through the user's voice, for example by modifying and/or adding words, checking and correcting punctuation marks, and adjusting paragraphs. This realizes fully voice-driven input and editing, requires no manual operation, avoids the health risks and inconvenience of manual operation, and improves the user experience.
According to another aspect of the present disclosure, a voice editing apparatus is also provided. FIG. 3 shows a schematic diagram of a voice editing apparatus 300 according to an embodiment of the present disclosure. As shown in FIG. 3, the apparatus 300 includes a first positioning module 310, a marking module 320, a second positioning module 330, and an editing module 340.
The first positioning module 310 is configured to determine at least one candidate word from the text to be edited according to the received first voice information in the voice editing mode.
The marking module 320 is configured to mark each candidate word of the at least one candidate word with a corresponding identifier.
The second positioning module 330 is configured to determine a target identifier from the identifiers of the at least one candidate word according to received second voice information.
The editing module 340 is configured to edit the target word marked by the target identifier in the at least one candidate word according to received third voice information.
According to the embodiments of the present disclosure, at least one candidate word is determined from the text to be edited according to the first voice information, each candidate word is marked with a corresponding identifier, the target identifier is determined according to the second voice information, and the target word marked by the target identifier is edited according to the third voice information, so that the user can accurately locate the position to be edited (i.e., the target word) and edit it merely by issuing voice instructions.
It should be understood that the various modules of the apparatus 300 shown in FIG. 3 may correspond to the various steps in the method 100 described with reference to FIG. 1. Thus, the operations, features, and advantages described above with respect to the method 100 are equally applicable to the apparatus 300 and the modules included therein. Certain operations, features, and advantages may not be described in detail herein for the sake of brevity.
Although specific functionality is discussed above with reference to particular modules, it should be noted that the functionality of the various modules discussed herein may be divided into multiple modules and/or at least some of the functionality of multiple modules may be combined into a single module. Performing an action by a particular module discussed herein includes the particular module itself performing the action, or alternatively the particular module invoking or otherwise accessing another component or module that performs the action (or performs the action in conjunction with the particular module). Thus, a particular module that performs an action can include the particular module that performs the action itself and/or another module that the particular module invokes or otherwise accesses that performs the action. For example, the first positioning module 310 and the marking module 320 described above may be combined into a single module in some embodiments.
It should also be appreciated that various techniques may be described herein in the general context of software, hardware elements, or program modules. The various modules described above with respect to FIG. 3 may be implemented in hardware or in hardware combined with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, the modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the first positioning module 310, the marking module 320, the second positioning module 330, and the editing module 340 may be implemented together in a System on a Chip (SoC). The SoC may include an integrated circuit chip (which includes one or more components of a processor (e.g., a central processing unit (CPU), microcontroller, microprocessor, digital signal processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry), and may optionally execute received program code and/or include embedded firmware to perform functions.
According to another aspect of the present disclosure, there is also provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program which, when executed by the at least one processor, implements the voice editing method described above.
According to another aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the voice editing method described above.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the voice editing method described above.
Referring to FIG. 4, a block diagram of an electronic device 400, which may be a server or a client of the present disclosure and is an example of a hardware device applicable to aspects of the present disclosure, will now be described. The term electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 4, the device 400 includes a computing unit 401 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 402 or a computer program loaded from a storage unit 408 into a random access memory (RAM) 403. In the RAM 403, various programs and data required for the operation of the device 400 can also be stored. The computing unit 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
A number of components in the device 400 are connected to the I/O interface 405, including: an input unit 406, an output unit 407, a storage unit 408, and a communication unit 409. The input unit 406 may be any type of device capable of inputting information to the device 400; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. The output unit 407 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 408 may include, but is not limited to, magnetic or optical disks. The communication unit 409 allows the device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as Bluetooth(TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 401 may be any of a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 401 performs the various methods and processing steps described above, such as steps 110 to 140 in FIG. 1. For example, in some embodiments, the voice editing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into the RAM 403 and executed by the computing unit 401, one or more steps of the voice editing method described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the voice editing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements in the embodiments or examples may be omitted or replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure, and various elements in the embodiments or examples may be combined in various ways. As technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims (19)

1. A method of speech editing comprising:
determining at least one candidate word from a text to be edited according to received first voice information in a voice editing mode;
marking each candidate word of the at least one candidate word with a corresponding identifier;
determining a target identifier from the identifiers of the at least one candidate word according to the received second voice information; and
editing a target word marked by the target identifier in the at least one candidate word according to the received third voice information.
2. The method of claim 1, further comprising:
entering the voice editing mode in response to a voice editing trigger command.
3. The method of claim 1, wherein the text to be edited is obtained by performing speech recognition on speech input by a user in a speech input mode.
4. The method of claim 3, further comprising:
entering the voice input mode in response to a voice input trigger command.
5. The method according to any one of claims 1-4, wherein the determining at least one candidate word from the text to be edited comprises:
performing voice recognition on the first voice information to obtain a set of homophones of the first voice information; and
determining at least one word belonging to the set of homophones in the text to be edited as the at least one candidate word.
6. The method according to any one of claims 1-4, wherein the determining at least one candidate word from the text to be edited comprises:
respectively determining the similarity between the voice of each word in the text to be edited and the first voice information; and
determining, based on the determined similarities, at least one word whose similarity is greater than a preset threshold among all words in the text to be edited as the at least one candidate word.
7. The method of any of claims 1-4, wherein the identifier is a number, and
wherein the determining a target identifier comprises: determining a number corresponding to the second voice information from the identifiers of the at least one candidate word as the target identifier by performing voice recognition on the second voice information.
8. The method of any of claims 1-4, wherein the editing of the target word marked by the target identifier in the at least one candidate word comprises:
determining an editing operation and an editing word by performing voice recognition on the third voice information; and
executing the editing operation on the target word by using the editing word.
9. The method of claim 8, wherein the determining an editing operation and an editing word comprises:
determining the editing operation and a list of candidate editing words by performing voice recognition on the third voice information; and
determining the editing word from the list of candidate editing words according to the received fourth voice information.
10. The method of claim 8, wherein the editing operation comprises a modification, and
wherein the performing the editing operation on the target word comprises: modifying the target word into the editing word.
11. The method of claim 8, wherein the editing operation comprises adding, and
wherein the performing the editing operation on the target word comprises: adding the editing word after or before the target word.
12. The method of any of claims 1-4, further comprising:
checking the symbols in the text to be edited in sequence in response to a symbol polling voice command.
13. The method of claim 12, wherein the sequentially checking the symbols in the text to be edited comprises:
performing voice recognition on received fifth voice information to determine a target symbol corresponding to the fifth voice information;
modifying the current symbol to the target symbol in response to the current symbol indicated by the cursor being different from the target symbol; and
moving the cursor to the next symbol in the text to be edited.
14. The method of any of claims 1-4, further comprising:
performing paragraph adjustment on the text to be edited in response to a paragraph adjustment voice command.
15. The method of claim 14, wherein the paragraph adjustment on the text to be edited comprises:
determining a paragraph adjustment operation according to received sixth voice information; and
performing the paragraph adjustment operation on the text to be edited.
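
The claims leave the concrete paragraph operations open. A minimal sketch, assuming two hypothetical operations: "merge" joins a paragraph with the one after it, and "delete" removes it.

    def adjust_paragraphs(paragraphs: list[str], operation: str, index: int) -> list[str]:
        """Apply a paragraph adjustment recognized from the sixth
        voice information (claim 15) to the paragraph at `index`."""
        if operation == "merge" and index + 1 < len(paragraphs):
            merged = paragraphs[index] + paragraphs[index + 1]
            return paragraphs[:index] + [merged] + paragraphs[index + 2:]
        if operation == "delete":
            return paragraphs[:index] + paragraphs[index + 1:]
        return paragraphs

    # adjust_paragraphs(["第一段。", "第二段。"], "merge", 0) -> ["第一段。第二段。"]
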
16. A voice editing apparatus, comprising:
a first positioning module configured to determine at least one candidate word from a text to be edited according to received first voice information in a voice editing mode;
a tagging module configured to tag each of the at least one candidate word with a respective identifier;
a second positioning module configured to determine a target identifier from the identifiers of the at least one candidate word according to received second voice information; and
an editing module configured to edit, according to received third voice information, the target word labeled by the target identifier in the at least one candidate word.
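
Read as software, claim 16 maps onto four cooperating components. A structural skeleton only; the class and method names are this sketch's invention, not the patent's:

    class VoiceEditingApparatus:
        """Each method stands in for one module of the claimed apparatus."""

        def locate_candidates(self, first_voice_info, text_to_edit):
            """First positioning module: find the candidate words."""
            raise NotImplementedError

        def tag_candidates(self, candidates):
            """Tagging module: attach an identifier to each candidate."""
            raise NotImplementedError

        def locate_target(self, second_voice_info, identifiers):
            """Second positioning module: pick the target identifier."""
            raise NotImplementedError

        def edit_target(self, third_voice_info, target_word):
            """Editing module: apply the spoken edit to the target word."""
            raise NotImplementedError
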
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program that, when executed by the at least one processor, implements the method of any one of claims 1-15.
18. A non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-15.
19. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-15.
CN202110717969.0A 2021-06-28 2021-06-28 Voice editing method and device, equipment and medium Pending CN113378530A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110717969.0A CN113378530A (en) 2021-06-28 2021-06-28 Voice editing method and device, equipment and medium

Publications (1)

Publication Number Publication Date
CN113378530A (en) 2021-09-10

Family

ID=77579592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110717969.0A Pending CN113378530A (en) 2021-06-28 2021-06-28 Voice editing method and device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113378530A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024130384A1 (en) * 2022-12-22 2024-06-27 Huawei Technologies Co., Ltd. System, method and device for multimodal text editing

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101238508A (en) * 2005-08-05 2008-08-06 Microsoft Corporation Redictation of misrecognized words using a list of alternatives
CN103645876A (en) * 2013-12-06 2014-03-19 Baidu Online Network Technology (Beijing) Co., Ltd. Voice input method and device
CN105068982A (en) * 2015-08-26 2015-11-18 Baidu Online Network Technology (Beijing) Co., Ltd. Input content modification method and apparatus
CN105654946A (en) * 2014-12-02 2016-06-08 Samsung Electronics Co., Ltd. Method and apparatus for speech recognition
CN106406807A (en) * 2016-09-19 2017-02-15 Beijing Unisound Information Technology Co., Ltd. Method and device for voice correction of characters
CN106782543A (en) * 2017-03-24 2017-05-31 Lenovo (Beijing) Co., Ltd. Information processing method and electronic device
CN106933561A (en) * 2015-12-31 2017-07-07 Beijing Sogou Technology Development Co., Ltd. Voice input method and terminal device
CN106952655A (en) * 2017-02-23 2017-07-14 Shenzhen Gionee Communication Equipment Co., Ltd. Input method and terminal
CN107093423A (en) * 2017-05-27 2017-08-25 Nubia Technology Co., Ltd. Voice input modification method, device and computer-readable storage medium
CN107221328A (en) * 2017-05-25 2017-09-29 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for locating a modification source, computer equipment and computer-readable storage medium
CN107678561A (en) * 2017-09-29 2018-02-09 Baidu Online Network Technology (Beijing) Co., Ltd. Artificial-intelligence-based voice input error correction method and device
CN107861932A (en) * 2017-11-07 2018-03-30 Chengdu Yewang Digital Technology Co., Ltd. Text editing method, device and system, and terminal device
CN108257601A (en) * 2017-11-06 2018-07-06 Guangzhou Dongjing Computer Technology Co., Ltd. Method, device, client terminal device and electronic device for speech recognition of text
CN109119079A (en) * 2018-07-25 2019-01-01 Tianjin ByteDance Technology Co., Ltd. Voice input processing method and device
CN109215661A (en) * 2018-08-30 2019-01-15 Shanghai Yude Communication Technology Co., Ltd. Speech-to-text method, apparatus, device and storage medium
CN109710929A (en) * 2018-12-18 2019-05-03 Kingdee Software (China) Co., Ltd. Correction method and device for speech-recognized text, computer equipment and storage medium
CN111326140A (en) * 2020-03-12 2020-06-23 iFlytek Co., Ltd. Speech recognition result discrimination method, correction method, device, equipment and storage medium
CN112684913A (en) * 2020-12-30 2021-04-20 Vivo Mobile Communication Co., Ltd. Information correction method and device and electronic equipment

Similar Documents

Publication Publication Date Title
US11842045B2 (en) Modality learning on mobile devices
US20230376108A1 (en) Natural human-computer interaction for virtual personal assistant systems
KR102151681B1 (en) Determining conversation states for the language model
US10114809B2 (en) Method and apparatus for phonetically annotating text
US20160328205A1 (en) Method and Apparatus for Voice Operation of Mobile Applications Having Unnamed View Elements
US9128930B2 (en) Method, device and system for providing language service
JP2002116796A (en) Voice processor and method for voice processing and storage medium
CN109785829B (en) Customer service assisting method and system based on voice control
US10534847B2 (en) Automatically generating documents
US11538476B2 (en) Terminal device, server and controlling method thereof
US20190066669A1 (en) Graphical data selection and presentation of digital content
CN113053388A (en) Voice interaction method, device, equipment and storage medium
JP2015069103A (en) Information processing device, control method, and program
US11842726B2 (en) Method, apparatus, electronic device and storage medium for speech recognition
CN113378530A (en) Voice editing method and device, equipment and medium
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
WO2015156011A1 (en) Information processing device, information processing method, and program
CN113743127B (en) Task type dialogue method, device, electronic equipment and storage medium
US11501762B2 (en) Compounding corrective actions and learning in mixed mode dictation
JP2015069101A (en) Information processing device, control method, and program
CN114492456B (en) Text generation method, model training method, device, electronic equipment and medium
CN117636915A (en) Method for adjusting playing progress, related device and computer program product
CN114089841A (en) Text generation method and device, electronic equipment and storage medium
WO2022104297A1 (en) Multimodal input-based data selection and command execution
CN117873619A (en) Method and device for generating product description document, storage medium and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination