CN108831473B - Audio processing method and device

Audio processing method and device

Info

Publication number
CN108831473B
CN108831473B (application CN201810287493.XA)
Authority
CN
China
Prior art keywords
text information
sub
information
input
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810287493.XA
Other languages
Chinese (zh)
Other versions
CN108831473A (en)
Inventor
单震生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN201810287493.XA priority Critical patent/CN108831473B/en
Publication of CN108831473A publication Critical patent/CN108831473A/en
Application granted granted Critical
Publication of CN108831473B publication Critical patent/CN108831473B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M 1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M 1/72436 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for text messaging, e.g. short messaging services [SMS] or e-mails

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

After a first audio input is obtained and first text information corresponding to the first audio input is obtained, second sub-text information is acquired, the second sub-text information being obtained by editing first sub-text information in the first text information. The second sub-text information can be used to update a first relation set, which comprises correspondences between input information and text information and is used to match corresponding text information according to input information. Because the first relation set can be updated with the second sub-text information obtained by editing the first sub-text information in the first text information, the matching information between input information and text information in the first relation set becomes more accurate, and the accuracy of subsequent information matching based on the updated first relation set is further improved.

Description

Audio processing method and device
Technical Field
The present application relates to the technical field of audio processing, and in particular to an audio processing method and apparatus.
Background
Existing speech recognition engines have limited accuracy when recognizing a user's speech as text information. Taking the speech recognition engines on the market as an example, the recognition accuracy is generally about 90%, i.e., there is roughly a 10% probability of misrecognition, which degrades the user's speech input experience.
Disclosure of Invention
The present application discloses the following technical solutions:
an audio processing method, comprising:
obtaining a first audio input;
obtaining first text information corresponding to the first audio input;
acquiring second sub-text information, wherein the second sub-text information is obtained by editing first sub-text information in the first text information;
the second sub-text information can be used for updating a first relation set, and the first relation set comprises a corresponding relation between the input information and the text information and is used for matching the corresponding text information according to the input information.
In the above method, preferably, the correspondence between input information and text information comprises at least one of the following: a correspondence between audio input information and text information, or a correspondence between text input information and text information;
wherein updating the first relation set comprises at least one of:
determining corresponding first sub-audio data in the first audio input according to first sub-text information, and using the first sub-audio data as first audio input information; updating the first relation set according to the corresponding relation formed by the first audio input information and the second sub-text information;
alternatively,
determining character information corresponding to second sub-text information according to the second sub-text information, wherein the character information is information capable of being input to form the second sub-text information and is used as first text input information; and updating the first relation set according to the corresponding relation formed by the first text input information and the second sub-text information.
In the above method, preferably, if the first relation set includes the correspondence between audio input information and text information, the first relation set can be updated based on an update of a second relation set; the second relation set comprises correspondences between text input information and text information;
wherein the updating process in which the first relationship set is updated based on the updating of the second relationship set comprises:
acquiring a newly added corresponding relation of a second relation set, wherein the newly added corresponding relation is a corresponding relation between second text input information and third sub-text information, and the second text input information is character information corresponding to the third sub-text information;
obtaining second sub-audio data corresponding to the third sub-text information;
and updating the first relation set according to the corresponding relation formed by the second sub-audio data and the third sub-text information.
Preferably, the method further includes, after obtaining the updated first relationship set, the step of:
obtaining a second audio input;
performing speech recognition on the second audio input based on the updated first set of relationships.
In the method, preferably, when the updated first relationship set includes a first corresponding relationship between the first sub-audio data and the second sub-text information and a second corresponding relationship between the first sub-audio data and the first sub-text information, and the second audio input includes the first sub-audio data:
responding to the second audio input with second text information including the second sub-text information if the second audio input and the first audio input satisfy a predetermined condition;
if the second audio input and the first audio input do not satisfy the predetermined condition, selecting, according to matching priority, matched sub-text information from the first correspondence and the second correspondence as a part of third text information with which to respond to the second audio input;
the first sub-audio data is audio data corresponding to the first sub-text information in the first audio input.
In the above method, preferably, the second audio input and the first audio input satisfying the predetermined condition includes at least one of:
the time interval between the input time of the second audio input and the input time of the first audio input is less than a preset time length;
the input position of the second audio input and the input position of the first audio input satisfy the same input attribute.
In the above method, preferably, when the first set of relationships is updated, the input attribute information of the first audio input is obtained, and the first set of relationships is updated based on the input attribute information;
the first relationship set includes a corresponding relationship of input information, text information, and input attribute information.
In the above method, preferably, the obtaining of the first audio input, the obtaining of the first text information corresponding to the first audio input, and the obtaining of the second sub-text information comprise:
the terminal equipment collects first audio input; the terminal equipment performs voice recognition on the first audio input to obtain first text information corresponding to the first audio input; the terminal equipment collects second sub-text information so that the first relation set is updated by the terminal equipment through the second sub-text information;
alternatively,
the method comprises the steps that a server receives first audio input collected by terminal equipment; the server performs voice recognition on the first audio input to obtain first text information corresponding to the first audio input, and sends the first text information to the terminal equipment; and the server receives the second sub-text information collected by the terminal equipment, so that the first relation set is updated by using the second sub-text information at the server.
An audio processing apparatus comprising:
a first obtaining unit for obtaining a first audio input;
a second obtaining unit, configured to obtain first text information corresponding to the first audio input;
a third obtaining unit, configured to obtain second sub-text information, where the second sub-text information is obtained by editing first sub-text information in the first text information;
the second sub-text information can be used for updating a first relation set, and the first relation set comprises a corresponding relation between the input information and the text information and is used for matching the corresponding text information according to the input information.
An audio processing apparatus comprising:
a memory for at least storing a first relation set, wherein the first relation set comprises correspondences between input information and text information and is used for matching corresponding text information according to input information;
a processor to perform the following operations:
obtaining a first audio input;
obtaining first text information corresponding to the first audio input;
acquiring second sub-text information, wherein the second sub-text information is obtained by editing first sub-text information in the first text information;
wherein the second sub-text information is usable to update the first set of relationships.
According to the above solutions, after the first audio input is obtained and the first text information corresponding to the first audio input is obtained, second sub-text information obtained by editing first sub-text information in the first text information is acquired; the second sub-text information can be used to update the first relation set, which comprises correspondences between input information and text information and is used to match corresponding text information according to input information. Because the first relation set can be updated with the second sub-text information obtained by editing the first sub-text information, the matching information between input information and text information in the first relation set becomes more accurate, and subsequent information matching based on the updated first relation set achieves higher accuracy.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present application, and other drawings can be derived from them by those skilled in the art without creative effort.
Fig. 1 is a flowchart of an audio processing method according to a first embodiment of the present application;
Figs. 2-3 are flowcharts of audio processing methods provided in the second embodiment of the present application;
Fig. 4 is a flowchart of updating a first relation set based on an update of a second relation set according to a third embodiment of the present application;
Fig. 5 is a flowchart of an audio processing method according to a fourth embodiment of the present application;
Fig. 6 is a flowchart of an audio processing method according to a fifth embodiment of the present application;
Figs. 7-8 are flowcharts of audio processing methods applied to different execution subjects according to a sixth embodiment of the present application;
Fig. 9 is a schematic structural diagram of an audio processing apparatus according to a seventh embodiment of the present application;
Fig. 10 is a schematic structural diagram of an audio processing apparatus according to a thirteenth embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
The present application provides an audio processing method and apparatus, which will be described below with reference to a plurality of embodiments.
Referring to Fig. 1, a flowchart of the first embodiment of an audio processing method provided in the present application is shown. The method may be applied to various terminal devices, such as a smartphone, a tablet computer (PAD), a Personal Digital Assistant (PDA), a notebook, a desktop, or a kiosk, or to various general-purpose or special-purpose servers. As shown in Fig. 1, in this embodiment the audio processing method includes:
step 101, obtaining a first audio input.
The first audio input may be audio information input by a user into a corresponding application of the terminal device based on the user's actual needs; for example, it may be audio information input into the information input box of a chat tool using the chat tool's audio capture function, or a recording entered into the device as speech through the device's audio capture hardware, such as a microphone. The first audio input may also be an audio file in any of various formats existing on the device, for example a file in mp3, wma, rm, or wav format on the terminal device or the server.
The audio information corresponding to the first audio input in this application is preferably speech audio information.
In this step, the obtaining of the first audio input may be that an execution subject (e.g., a terminal device) of the method acquires the first audio input through a preset audio acquisition device, or may also obtain the first audio input from a specified path, or may also receive the first audio input transmitted by another device, which is not limited in this embodiment.
And 102, obtaining first text information corresponding to the first audio input.
The first text information may be text information obtained by performing speech recognition on the audio information corresponding to the first audio input by using a speech recognition engine based on a speech recognition technology.
When the voice recognition technology is used for carrying out voice recognition on the audio information corresponding to the first audio input, the voice recognition on the first audio input can be realized by matching the first audio input with a preset corresponding relation set (such as a voice recognition library) of the audio input information and text information.
As a possible manner, the matching may be matching of the overall audio of the first audio input with the corresponding relationship set of the audio input information and the text information, which is applicable to a case where the input content corresponding to the first audio input is short, for example, only some basic words (such as "china" and "hello"), in which case, by matching the overall audio of the first audio input with the corresponding relationship set of the audio input information and the text information, the first text information corresponding to the first audio input can be identified.
As another possible manner, the matching may also be matching of each part of sub audio data of the first audio input with a corresponding relationship set of audio input information and text information, where the manner is suitable for a case where input content corresponding to the first audio input is complex, such as a long speech sentence, a speech passage, or a speech file, and in such a case, each part of sub audio data of the first audio input needs to be respectively matched with a corresponding relationship set of audio input information and text information, and each text information obtained through respective matching is sequentially spliced/combined, so as to obtain the first text information corresponding to the first audio input.
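As an illustration of the two matching modes just described, the following minimal Python sketch treats the relation set as a lookup table keyed by audio data; the names (RelationSet, recognize) and the lookup-table simplification are illustrative assumptions, not part of the patent, and real acoustic matching would replace the dictionary lookup.
    from typing import Dict, List

    class RelationSet:
        """Correspondence set of audio input information and text information."""
        def __init__(self, correspondences: Dict[bytes, str]):
            self._map = correspondences  # audio data -> matched text information

        def match(self, audio: bytes) -> str:
            # Stand-in for acoustic matching against the correspondence set.
            return self._map.get(audio, "")

    def recognize(first_audio_input: bytes, segments: List[bytes],
                  relations: RelationSet) -> str:
        # Mode 1: match the overall audio (short inputs such as basic words).
        whole = relations.match(first_audio_input)
        if whole:
            return whole
        # Mode 2: match each piece of sub-audio data separately and splice
        # the matched texts in order to form the first text information.
        return "".join(relations.match(segment) for segment in segments)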
The first text information corresponding to the first audio input may be obtained by the execution subject of the method performing speech recognition on the first audio input, or by receiving first text information obtained through speech recognition performed by another device.
After the first text information corresponding to the first audio input is obtained, the first text information may be displayed, for example, the first text information is displayed in a chat window of a chat tool, the first text information is displayed in an information search bar of a search tool, the first text information is displayed in a corresponding information registration bar, or the first text information is displayed in a corresponding text editing interface, and the like, where specific display conditions are determined according to actual application scenarios.
103, obtaining second sub-text information, wherein the second sub-text information is obtained by editing first sub-text information in the first text information;
the second sub-text information can be used for updating a first relation set, and the first relation set comprises a corresponding relation between the input information and the text information and is used for matching the corresponding text information according to the input information. The first set of relationships may include, but is not limited to, correspondences between audio input information and textual information.
Because speech recognition engines generally suffer from insufficient recognition accuracy, especially when recognizing similar-sounding words, the recognition result required by the user may not be obtained; that is, the first text information corresponding to the first audio input may contain first sub-text information that was misrecognized and does not meet the user's requirement. For example, the user's actual first audio input is "the crested ibis is a rare bird", but after speech recognition the obtained first text information contains misrecognized first sub-text information (near-homophones of "crested ibis" and "rare bird"), which does not meet the user's requirement.
For this case, in general, the user is required to perform a corresponding editing operation on the first sub-text information so as to correct it to the second sub-text information required by the user. For example, the cursor is positioned at the position of the first sub-text information in the first text information, the first sub-text information is deleted, and the required second sub-text information is input (which may be input by an input method or by handwriting, etc.) at the position, or a different character in the first sub-text information is revised to obtain the second sub-text information, etc.
When the user obtains the second sub-text information by editing the first sub-text information in the first text information, this step acquires that second sub-text information. The acquisition may be realized by the execution subject of the method collecting the user's editing information, or by receiving second sub-text information collected by another device, which is not limited in this embodiment.
Since the second sub-text information better meets the user's requirement, or at least the requirement in the current application scenario, the first relation set may be updated based on the second sub-text information, for example by adding a corresponding correspondence between input information and text information to the first relation set.
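The patent does not specify how the edited span is recovered from the user's editing operation; one plausible way, sketched below under that assumption, is to diff the displayed first text information against the edited result to obtain the (first sub-text, second sub-text) pairs.
    import difflib

    def extract_edits(first_text: str, edited_text: str):
        """Return (first_sub_text, second_sub_text) pairs implied by the user's edit."""
        a, b = first_text.split(), edited_text.split()
        pairs = []
        for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
            if op == "replace":  # a misrecognized span corrected in place
                pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
        return pairs

    # extract_edits("the crested ibex is rare", "the crested ibis is rare")
    # -> [("ibex", "ibis")]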
By updating the first relation set based on the second sub-text information, the matching information of the input information and the text information in the first relation set can be more suitable for the user requirements, and further, the subsequent information matching (such as voice recognition) based on the updated first relation set can be realized, so that higher matching accuracy can be achieved.
According to the foregoing solution, after obtaining the first audio input and obtaining the first text information corresponding to the first audio input, the audio processing method according to this embodiment obtains the second sub-text information obtained by editing the first sub-text information in the first text information, where the second sub-text information can be used to update the first relationship set, and the first relationship set includes a correspondence between the input information and the text information, and is used to match the corresponding text information according to the input information. The first relation set can be updated by utilizing the second sub-text information obtained by editing the first sub-text information in the first text information, so that the matching information of the input information and the text information in the first relation set can be more accurate, and subsequently, when information matching processing (such as voice recognition) is carried out based on the updated first relation set, the accuracy of information matching can be further improved.
In the second embodiment of the present application, the content of the corresponding relationship between the input information and the text information included in the first relationship set is described, and on this basis, an implementation process for updating the first relationship set based on the second sub-text information is provided.
In this embodiment, the correspondence between the input information and the text information included in the first relationship set includes at least one of the following correspondences:
the corresponding relation between the audio input information and the text information;
or the correspondence between the text input information and the text information.
That is, as one possible implementation, the first relation set may include only correspondences between audio input information and text information, for example only correspondences between spoken characters/words and written characters/words; in this case the first relation set can serve as a speech recognition library for speech recognition.
As another possible implementation manner, the first relationship set may also only include the correspondence between the text input information and the text information, for example, only include the correspondence between pinyin characters and text words/phrases, and in this case, the first relationship set may be used as a text input method recognition library for recognizing text input.
As another possible implementation manner, the first relationship set may include both the foregoing relationships, so that the first relationship set may provide a speech recognition library for performing speech recognition and a text input method recognition library for performing text input recognition.
On this basis, referring to fig. 2, a flowchart of an audio processing method according to a second embodiment of the present application is provided. For the case that the correspondence between the input information and the text information only includes the correspondence between the audio input information and the text information, as shown in fig. 2, the implementation process of updating the first relationship set based on the second sub-text information specifically includes the following steps:
step 201, determining corresponding first sub-audio data in the first audio input according to the first sub-text information, and using the first sub-audio data as first audio input information.
The first sub-audio data is audio data corresponding to the edited first sub-text information in the first audio input.
The first sub-audio data may be colloquially understood as the audio data in the first audio input that was misrecognized, or whose recognition result did not satisfy the user, when the first audio input was speech-recognized. For example, in the above example where the user's actual first audio input "the crested ibis is a rare bird" was misrecognized, the recognition results for "crested ibis" and "rare bird" are erroneous and need to be edited; accordingly, the first sub-audio data is the audio data corresponding to those misrecognized segments.
Step 202, updating the first relation set according to the corresponding relation formed by the first audio input information and the second sub-text information.
As described above, the second sub-text information is obtained by editing the first sub-text information, for example "crested ibis" obtained by editing a misrecognized near-homophone, or "rare bird" obtained likewise. The second sub-text information therefore better meets the user's requirement, or at least the requirement in the current application scenario, and better matches the first sub-audio data (the first audio input information). In view of this, a correspondence between the first audio input information (the first sub-audio data) and the second sub-text information may be established, and the first relation set may be updated based on the established correspondence.
Updating the first relation set based on the correspondence between the first audio input information and the second sub-text information proceeds as follows. When no correspondence between the first audio input information and the second sub-text information exists in the first relation set, the correspondence is added to the first relation set; in this case the second sub-text information is essentially added to the first relation set as a new candidate text for the first audio input information. For example, if the first audio input information already corresponds to first candidate text information and second candidate text information in the first relation set, the second sub-text information may be added as third candidate text information for the first audio input information.
Alternatively, when the correspondence between the first audio input information and the second sub-text information already exists in the first relation set, the attributes of the candidate text information corresponding to the first audio input information are adjusted, for example by increasing the use count and use frequency of the second sub-text information, or by directly raising the priority of the second sub-text information to the highest priority based on a most-recently-used rule and correspondingly lowering the priorities of the other candidate text information.
For example, consider the correspondence between the "crested ibis" audio data and the "crested ibis" text. If the correspondence does not exist in the speech recognition library, it may be added, i.e., the "crested ibis" text is added to the speech recognition library as one candidate text for the "crested ibis" audio data. If the correspondence already exists, then since the "crested ibis" text was used this time, its use frequency and use count in the speech recognition library can be increased accordingly, or its priority can be raised to the highest priority directly based on the most-recently-used principle while the priorities of the other candidate texts are lowered.
Subsequently, when information matching is performed on input information based on the updated first relation set, for example when speech recognition is performed on audio data, the corresponding text information may be matched according to the use count, use frequency, or context of each candidate text corresponding to the audio data in the first relation set, or, based directly on the most-recently-used principle, the second sub-text information corresponding to the audio data may be matched as the highest-priority candidate text.
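The update and matching behavior of steps 201-202 can be summarized in the following sketch; the data layout, the use-count field, and the most-recently-used policy switch are illustrative assumptions rather than the patent's prescribed structures.
    import time
    from collections import defaultdict

    class CandidateStore:
        """First relation set: input information -> candidate text information."""
        def __init__(self):
            # input key (e.g. sub-audio data) -> {candidate text: [use count, last used]}
            self._candidates = defaultdict(dict)

        def update(self, input_key, text):
            stats = self._candidates[input_key].setdefault(text, [0, 0.0])
            stats[0] += 1           # increase use count / use frequency
            stats[1] = time.time()  # record most recent use

        def match(self, input_key, most_recently_used=True):
            stats = self._candidates.get(input_key)
            if not stats:
                return None
            # Priority by most recent use, or by use count, per the policies above.
            index = 1 if most_recently_used else 0
            return max(stats, key=lambda text: stats[text][index])

    speech_lib = CandidateStore()
    speech_lib.update(b"<crested-ibis-audio>", "crested ibis")  # user's edit
    assert speech_lib.match(b"<crested-ibis-audio>") == "crested ibis"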
The processing of steps 201 and 202 updates the speech recognition library according to the user's edits of the audio input recognition result, so that the correspondence information in the speech recognition library becomes more accurate, and the recognition accuracy of subsequent speech recognition based on the updated library can be effectively improved.
Referring to fig. 3, which is a flowchart of another audio processing method provided in the second embodiment of the present application, for a case that a corresponding relationship between the input information and the text information only includes a corresponding relationship between the text input information and the text information, as shown in fig. 3, an implementation process of updating the first relationship set based on the second sub-text information specifically includes the following steps:
step 301, determining character information corresponding to second sub-text information according to the second sub-text information, where the character information is information capable of being input to form the second sub-text information, and using the character information as first text input information.
The character information corresponding to the second sub-text information may be, but is not limited to, the pinyin characters corresponding to the second sub-text information, such as "zhuhuan" corresponding to "crested ibis", "zhenqin" corresponding to "rare bird", and the like.
Step 302, updating the first relation set according to the corresponding relation formed by the first text input information and the second sub-text information.
As described above, the second sub-text information is obtained by editing the first sub-text information and therefore better meets the user's requirement, or at least the requirement in the current application scenario, and better matches the first sub-audio data in the first audio input. In view of this, a correspondence between the first text input information, i.e., the character information, and the second sub-text information is established, and the first relation set is updated based on that correspondence.
Specifically, when no correspondence between the first text input information and the second sub-text information exists in the first relation set, the correspondence may be added to the first relation set. Alternatively, when the correspondence already exists, the attributes of the candidate text information corresponding to the first text input information are adjusted, for example by increasing the use count and use frequency of the second sub-text information, or by directly raising its priority to the highest priority based on the most-recently-used rule and correspondingly lowering the priorities of the other candidate text information.
For example, consider the correspondence between the pinyin characters "zhuhuan" and the "crested ibis" text. If the correspondence does not exist in the text input method recognition library, it may be added, i.e., a new candidate text "crested ibis" is added for the pinyin "zhuhuan". If the correspondence already exists, then since the candidate text "crested ibis" was used this time, its use frequency and use count for the pinyin "zhuhuan" can be increased accordingly, or its priority can be raised to the highest priority directly based on the most-recently-used principle while the priorities of the other candidate texts are lowered.
Subsequently, when information matching is performed on input information based on the updated first relation set, for example when character recognition is performed on input pinyin, the corresponding text may be matched according to the use count, use frequency, or context of each candidate text corresponding to the pinyin in the first relation set, or, based on the most-recently-used principle, the second sub-text information corresponding to the pinyin may be matched as the highest-priority candidate text.
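Under the same assumptions, the CandidateStore sketch above applies unchanged when the first relation set serves as the text input method recognition library, with pinyin characters as the input key (a hypothetical usage example):
    ime_lib = CandidateStore()
    ime_lib.update("zhuhuan", "crested ibis")  # correspondence learned from the edit
    assert ime_lib.match("zhuhuan") == "crested ibis"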
The processing of steps 301 and 302 updates the text input method recognition library according to the user's edits of the audio input recognition result, breaking the barrier in conventional technology whereby speech recognition and text input recognition are isolated from each other. In a specific implementation, when the speech recognition library is updated, for example based on the user editing the first sub-text information corresponding to the first sub-audio data in the first audio input into the second sub-text information, the update of the text input method recognition library is triggered as well. This links the speech recognition library and the text input method recognition library, so that the text input method recognition library can be updated in tandem with updates of the speech recognition library, and can refresh its lexicon without relying on the user first typing the characters or on updates delivered over the network. By using the user's edits of the audio input recognition result as a reference and applying them to the update of the text input method recognition library, the correspondence information in that library becomes more accurate, and the accuracy of text input recognition based on the updated library can be effectively improved.
For the case where the correspondences between input information and text information include both the correspondence between audio input information and text information and the correspondence between text input information and text information, i.e., the first relation set provides both a speech recognition library for speech recognition and a text input method recognition library for text input recognition, the two processing procedures above can be combined: when the user edits an audio input recognition result, both recognition libraries are updated based on the related editing information.
In this embodiment, when the user edits the recognition result of an audio input, the first relation set is updated based on the related editing information, which effectively improves the accuracy of the correspondences between input information and text information in the first relation set; subsequent information matching based on the updated first relation set (such as speech recognition or text input recognition) is accordingly more accurate. In addition, this embodiment updates the text input method recognition library according to the user's edits of the audio input recognition result, breaking the barrier in conventional technology whereby speech recognition and text input recognition are isolated from each other.
In a third embodiment of the present application, another possible implementation manner of updating the first relationship set is provided, where if the first relationship set includes a corresponding relationship between audio input information and text information, the first relationship set can also be updated based on an update of a second relationship set, and the second relationship set includes a corresponding relationship between text input information and text information.
The first set of relationships in this embodiment may be understood colloquially to include at least a speech recognition library, while the second set of relationships includes a text input method recognition library.
As shown in fig. 4, the update process in which the first relationship set is updated based on the update of the second relationship set includes:
step 401, obtaining a new corresponding relationship of a second relationship set, where the new corresponding relationship is a corresponding relationship between second text input information and third sub-text information, and the second text input information is character information corresponding to the third sub-text information.
The newly added correspondence of the second relation set may be, but is not limited to, a correspondence newly added to the second relation set based on the user's edits of a text input recognition result.
The following description is given by way of example.
When a user inputs the pinyin characters "qiyong" through the text input method, suppose the text input method recognition library matches one homophonic word as the recognition result, but it is not the word the user requires, and the user edits it into the required homophone. If the correspondence between the pinyin "qiyong" and the edited text information does not exist in the text input method recognition library, i.e., the edited text is not yet a candidate for the pinyin "qiyong" in the library, the correspondence between the pinyin "qiyong" and the edited text information can be added, so that this correspondence is newly added to the text input method recognition library.
After the correspondence is newly added to the text input method recognition library of the second relation set, the newly added correspondence can be obtained, and the first relation set is updated according to it.
And 402, obtaining second sub-audio data corresponding to the third sub-text information.
After the newly added corresponding relationship of the second relationship set is obtained, the second sub-audio data corresponding to the third sub-text information in the corresponding relationship can be obtained.
In this embodiment, the second sub-audio data may be audio data obtained by performing voice simulation on the third sub-text information. For example, for the newly added correspondence of the second relation set, pinyin "qiyong" → the edited text information, voice simulation may be performed on that text information, thereby obtaining the second sub-audio data corresponding to it.
Step 403, updating the first relationship set according to the corresponding relationship formed by the second sub-audio data and the third sub-text information.
After the second sub-audio data corresponding to the third sub-text information is obtained, a corresponding relationship between the second sub-audio data and the third sub-text information may be established, and the first relationship set may be updated based on the corresponding relationship.
Updating the first relation set based on the correspondence between the second sub-audio data and the third sub-text information proceeds as follows. When no correspondence between the second sub-audio data and the third sub-text information exists in the first relation set, the correspondence may be added to the first relation set. Alternatively, when the correspondence already exists, the attributes of the candidate text information corresponding to the second sub-audio data may be adjusted, for example by increasing the use count and use frequency of the third sub-text information, or by raising its priority to the highest priority based on the most-recently-used rule and correspondingly lowering the priorities of the other candidate text information.
For example, for the correspondence between the simulated audio data of the edited word and its text, if the correspondence does not exist in the speech recognition library, it may be added. If it already exists, the use frequency and use count of the corresponding candidate text in the speech recognition library can be adjusted, or the priority of that candidate text can be raised to the highest priority directly based on the most-recently-used principle while the priorities of the other candidate texts are correspondingly lowered.
Subsequently, when information matching is performed on input information based on the updated first relation set, for example when speech recognition is performed on audio data, the corresponding text information may be matched according to the use count, use frequency, or context of each candidate text corresponding to the audio data, or, based directly on the most-recently-used principle, the third sub-text information corresponding to the audio data may be matched as the highest-priority candidate text.
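Steps 401-403 can be sketched as follows; synthesize_audio is a hypothetical stand-in for whatever voice simulation facility is used, and the store is the illustrative CandidateStore from the second embodiment's sketch.
    def propagate_new_correspondence(pinyin, third_sub_text, speech_lib, synthesize_audio):
        # Step 401: the newly added correspondence of the second relation set,
        # pinyin -> third sub-text information, has been obtained.
        # Step 402: obtain second sub-audio data by voice simulation.
        second_sub_audio = synthesize_audio(third_sub_text)
        # Step 403: update the first relation set (the speech recognition library)
        # with the correspondence second sub-audio data -> third sub-text information.
        speech_lib.update(second_sub_audio, third_sub_text)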
Combining the scheme of this embodiment with the scheme of the second embodiment realizes bidirectional communication between the speech recognition library and the text input method recognition library. Both libraries can be updated based on the user's edits of a speech recognition result, and both can likewise be updated based on the user's edits of a text input recognition result; equivalently, an update of the speech recognition library can trigger an update of the text input method recognition library, and vice versa. This bidirectional linkage between the two recognition libraries breaks the barrier in conventional technology whereby speech recognition and text input recognition are isolated from each other, and can effectively improve the information matching accuracy of both libraries.
Referring to fig. 5, it is a flowchart of a fourth embodiment of an audio processing method provided in the present application, in this embodiment, as shown in fig. 5, after obtaining an updated first relationship set, the method may further include:
step 501, obtaining a second audio input.
The second audio input may likewise be audio information input by the user into a corresponding application of the terminal device based on actual needs; illustratively, it may be audio information input into the information input box of a chat tool using the chat tool's audio capture function, or a recording entered into the device as speech through the device's audio capture hardware. The second audio input may also be an audio file in any of various formats existing on the device, for example a file in mp3, wma, rm, or wav format on the terminal device or the server.
The second audio input may be audio information collected by the execution subject of the method (e.g., a terminal device) through a preset audio collecting device, audio information obtained from a specified path, or audio information received from another device, which is not limited in this embodiment.
Step 502, performing speech recognition on the second audio input based on the updated first relationship set.
After obtaining the second audio input, speech recognition may be performed on the second audio input based on the updated first set of relationships.
When speech recognition is performed on the second audio input based on the updated first relation set, as one possible implementation, each piece of second sub-text information included in the first relation set may be treated directly, based on the most-recently-used principle, as the highest-priority candidate text during audio matching. Thus, if first sub-text information of first sub-audio data in the second audio input has corresponding second sub-text information in the first relation set, that second sub-text information (i.e., the information obtained when the first sub-audio data was last edited) may be matched directly with the first sub-audio data.
If no second sub-text information corresponding to the first sub-audio data exists in the first relation set, the second sub-text information obtained by the user editing the first sub-text information corresponding to the first sub-audio data can be acquired according to the method introduced in this application, and the first relation set is updated based on it, so that the corresponding second sub-text information can be matched for the first sub-audio data the next time a second audio input is recognized.
Before that, i.e., while no second sub-text information corresponding to the first sub-audio data exists, the first sub-text information corresponding to the first sub-audio data may be generated as follows: the first sub-text information comprises a plurality of characters (written characters, as distinct from pinyin characters); a first part of the first sub-audio data is matched against the first relation set to obtain the first character, a second part is matched to obtain the second character, and so on until the whole of the first sub-audio data has been matched; finally, the characters matched for the parts of the first sub-audio data are spliced/combined in order, forming the first sub-text information corresponding to the first sub-audio data.
As another possible implementation, if the first sub-audio data already corresponds to existing candidate text information in the first relation set, such as first candidate text information and second candidate text information, and the first relation set has been updated based on the second sub-text information of the first sub-audio data so that the second sub-text information is added as third candidate text information, then, when speech recognition is performed on the first sub-audio data based on the updated first relation set, one of the candidate text information items corresponding to the first sub-audio data may be selected based on a predetermined policy to match the first sub-audio data: for example, based on use frequency, use count, or context information, the candidate with the highest use frequency or use count, or the candidate that fits the context information, is selected and matched with the first sub-audio data.
In this embodiment, speech recognition is performed on the second audio input based on the updated first relation set. Since the matching information between input information and text information in the updated first relation set is more accurate, speech recognition of the second audio input using the updated first relation set achieves higher accuracy.
The fifth embodiment of the present application further provides an implementation of step 502 (performing speech recognition on the second audio input based on the updated first relation set) for the case where the updated first relation set includes a first correspondence between the first sub-audio data and the second sub-text information and a second correspondence between the first sub-audio data and the first sub-text information, and the second audio input includes the first sub-audio data. As shown in Fig. 6, step 502 may be implemented by the following processing:
step 5021, if the second audio input and the first audio input meet a preset condition, responding to the second audio input by adopting second text information comprising second sub-text information.
Wherein the second audio input and the first audio input satisfying a predetermined condition comprise at least one of: the time interval between the input time of the second audio input and the input time of the first audio input is less than a preset time length; the input position of the second audio input and the input position of the first audio satisfy the same input attribute.
That is, if the input time of the second audio input is close to that of the first audio input, i.e., the interval between the two input times is less than the preset duration, or if the input position of the second audio input and the input position of the first audio input satisfy the same input attribute, the first sub-audio data appearing in the second audio input is considered to have the same text-matching requirement as the first sub-audio data appearing in the first audio input. Therefore, after the second sub-text information (obtained by the user editing the first sub-text information corresponding to the first sub-audio data in the first audio input) has been acquired and the first relation set has been updated based on it, when speech recognition continues on the second audio input using the updated first relation set, the second sub-text information can be preferentially selected to match the first sub-audio data appearing in the second audio input; that is, the second audio input is responded to with second text information that includes the second sub-text information.
For example, in a scenario of continuous speech audio input (assuming the input intervals between the pieces of speech audio are all smaller than the preset duration), if second sub-text information is obtained after the user edits the first sub-text information corresponding to first sub-audio data in one piece of audio (for example, the first sentence), and the first relation set is updated based on it, then, when speech recognition is subsequently performed on other audio in this continuous-input scenario based on the updated first relation set, the second sub-text information is preferentially used to match the first sub-audio data whenever it appears again in that audio.
The input position of the second audio input and the input position of the first audio input satisfying the same input properties may include, but are not limited to: the input position of the second audio input and the input position of the first audio input are the same input area of the same application, or different input areas of the same application, or matching input areas of different applications.
Specifically, for example, the input position of the second audio input and the input position of the first audio input are both user name input boxes of a certain application, such as a user using an application multiple times, entering a scene of a name into a name input box of the application at each use, in this case, the name entered by the user into the name input box generally corresponds to a particular Chinese combination, and, for that case, after editing first text information of first sub-audio data (such as audio corresponding to a certain word in the name audio) included in first audio input (such as name audio input for the first time to the name input box of the application) to obtain second sub-text information, when the first sub-audio data also appears in a subsequent second audio input (such as the name audio again entered into the name input box of the application), the second sub-text information may be preferentially employed to match first sub-audio data present in the second audio input.
For another example, the input position of the second audio input and the input position of the first audio input are different chat windows of the chat tool, and for a scene based on different chat windows and different people chatting in the chat tool, the same text characters are also used when a name of a certain person is involved, for example, when the name of a friend is chatted with different friends through a plurality of windows, wherein the same text characters are used for the name of the friend when the names of the friend are talked in the plurality of windows, so that, for this case, when the first text information of the first sub-audio data (for example, the audio corresponding to a certain word in the name audio) included in the first audio input (for example, the name audio first typed into a certain chat window) is edited to obtain the second sub-text information, when the first sub-audio data also appears in the subsequent second audio input (for example, the name audio typed into other chat windows), the second sub-text information may be preferentially employed to match first sub-audio data present in the second audio input.
For another example, the input position of the second audio input and the input position of the first audio input are user name input boxes in different applications/occasions. When names are input to name input boxes in different applications/occasions, the input names generally correspond to a special Chinese combination, so that, for this case, after first text information of first sub-audio data (such as audio corresponding to a certain word in the name audio) included in a first audio input (such as name audio input to the name input box of a first application) is edited to obtain second sub-text information, when the first sub-audio data also appears in a subsequent second audio input (such as name audio input to the name input box of a second application), the second sub-text information can be preferentially adopted to match the first sub-audio data appearing in the second audio input.
In the above three examples, the first audio input may be, for example, "stale" audio input to a name input box of an application or a chat window of a chat tool, and when the first match is made, since the high-priority character is "sunken ship", the stale "audio is matched with the literal text" sunken ship ", on the basis of which, the user modifies the literal text" sunken ship "into the literal text" stale "through a corresponding editing operation, so that, when the user inputs" stale "audio again to the name input box of the same application or a different chat window of the chat tool, the literal text" stale "is preferentially matched with the audio.
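A minimal sketch of this predetermined-condition check, with the threshold value and the position fields assumed purely for illustration:

```python
from dataclasses import dataclass

MAX_INTERVAL = 60.0  # preset time length in seconds (assumed value)

@dataclass(frozen=True)
class InputPosition:
    app_id: str     # which application received the input
    area_kind: str  # e.g. "name_input_box" or "chat_window"

def same_input_attribute(p1: InputPosition, p2: InputPosition) -> bool:
    # Same area of the same app, different areas of the same app, or
    # matching areas of different apps all share the same area kind here.
    return p1.area_kind == p2.area_kind

def satisfies_condition(t1: float, p1: InputPosition,
                        t2: float, p2: InputPosition) -> bool:
    return abs(t2 - t1) < MAX_INTERVAL or same_input_attribute(p1, p2)
```

Either prong alone suffices, mirroring the "at least one of" wording above.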
In practical implementations, when the first relation set is updated, relevant input attribute information of the first audio input may also be obtained, and the first relation set then updated in combination with that input attribute information.

Specifically, the input progress of the first audio input, the page content and/or cursor position, the time, the geographic position, and so on can be analyzed to obtain input attribute information such as the application type, input area, context, time information, geographic position, and occasion corresponding to the first audio input. On this basis, the first relation set may be updated using both the input attribute information and the second sub-text information obtained by editing the first sub-text information corresponding to the first sub-audio data in the first audio input, so that the first relation set records correspondences among input information, text information, and input attribute information. When speech recognition is later performed on the second audio input based on the updated first relation set, text matching for the first sub-audio data in the second audio input can then take the input attribute information into account.
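A toy sketch of such attribute-aware updating, under the assumption (not stated in the patent) that the relation set is keyed by the sub-audio data together with its input attributes:

```python
# (sub_audio_key, input_attributes) -> edited candidate text
relation_set: dict = {}

def update_with_attributes(sub_audio_key: str, second_sub_text: str,
                           attributes: tuple) -> None:
    """Store the edited text keyed by both the audio and the attributes
    (application type, input area, context, time, location, occasion)."""
    relation_set[(sub_audio_key, attributes)] = second_sub_text

# The same audio can then map to different text in different contexts:
update_with_attributes("name_audio_01", "stale", ("chat_app", "name_input_box"))
update_with_attributes("name_audio_01", "sunken ship", ("notes_app", "body_text"))
```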
step 5022, if the second audio input and the first audio input do not meet the predetermined condition, select the matching sub-text information from the first correspondence and the second correspondence, according to matching priority, as a part of third text information with which to respond to the second audio input.

The two inputs failing to satisfy the predetermined condition means, for example, that the time interval between their input times is longer than the preset time length, or that their input positions do not share the same input attribute. In this case, occurrences of the same sub-audio data in different audio inputs are not considered strongly correlated and do not necessarily carry the same text-matching requirement. Accordingly, the sub-text information with the highest matching priority may be selected from the first correspondence and the second correspondence as a part of the third text information with which to respond to the second audio input (for example, taking the candidate text information with the highest usage count or usage frequency as the high-priority text information, or taking the candidate text information that matches the context information as the high-priority text information).
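Tying steps 5021 and 5022 together, a hedged dispatch sketch (reusing the assumed structures from the earlier sketches) could read:

```python
def recognize_sub_audio(sub_audio_key: str, first_corr: dict,
                        second_corr: dict, condition_met: bool,
                        priority) -> str:
    """first_corr maps the sub-audio to the edited second sub-text;
    second_corr maps it to the original first sub-text; priority(text)
    returns a sortable matching priority for a candidate text."""
    if condition_met:                          # step 5021
        return first_corr[sub_audio_key]       # prefer the edited text
    candidates = [first_corr[sub_audio_key],   # step 5022
                  second_corr[sub_audio_key]]
    return max(candidates, key=priority)       # highest matching priority

# e.g. recognize_sub_audio("a1", {"a1": "stale"}, {"a1": "sunken ship"},
#                          False, priority=len) picks by a toy priority.
```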
In this embodiment, different speech recognition processing is applied to an audio input depending on whether its input attributes satisfy the condition, so that input attribute information is taken into account when determining the text-matching requirement during speech recognition, giving the speech recognition of the audio input higher accuracy.

Referring to fig. 7 and fig. 8, which are process flow diagrams of the audio processing method under different execution subjects according to a sixth embodiment of the present application: as shown in fig. 7, the audio processing method of the present application may be applied to a terminal device, which may be any of various intelligent terminals or computer devices such as a smart phone, a tablet computer, a personal digital assistant, a notebook, a desktop, or a kiosk. When the terminal device is adopted as the execution subject, the audio processing method may include the following processing:
step 701, the terminal device collects a first audio input.
Step 702, the terminal device performs voice recognition on the first audio input to obtain first text information corresponding to the first audio input.
Step 703, the terminal device collects second sub-text information, so that the terminal device updates the first relationship set by using the second sub-text information;
that is, the terminal device may be an execution subject of the audio processing method, and the acquisition of the audio input data, the speech recognition, and the update processing of the first relationship set may be executed at the terminal device.
The terminal device can collect the corresponding audio input data directly through an audio collection device such as a microphone, or collect the user's audio input data within an application through the audio collection function that the application provides. On this basis, speech recognition is performed on the collected audio input data using a first relation set comprising correspondences between audio input information and text information and/or between text input information and text information. After the user edits the first sub-text information corresponding to the first sub-audio data in the audio input data to obtain second sub-text information, the first relation set is updated based on the second sub-text information, so that subsequent speech recognition of the user's audio input data can be performed with the updated first relation set.
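A minimal sketch of steps 701 to 703 on the terminal side; every helper here is a stand-in rather than a real device API:

```python
def capture_audio() -> bytes:
    return b"\x01\x02"                 # stand-in for recorded samples

def speech_recognize(audio: bytes, relation_set: dict) -> str:
    return relation_set.get(audio, "<unrecognized>")

def collect_user_edit(text: str):
    return None                        # stand-in: no edit in this demo

def terminal_flow(relation_set: dict) -> str:
    audio = capture_audio()                        # step 701
    text = speech_recognize(audio, relation_set)   # step 702
    edit = collect_user_edit(text)                 # step 703
    if edit is not None:
        sub_audio, second_sub_text = edit
        relation_set[sub_audio] = second_sub_text  # update the relation set
    return text
```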
As shown in fig. 8, the audio processing method of the present application may also be applied to a server, which may be various general-purpose or dedicated servers, and when the server is adopted as an execution subject, the audio processing method may include the following processing procedures:
step 801, a server receives a first audio input collected by a terminal device;
step 802, the server performs voice recognition on the first audio input to obtain first text information corresponding to the first audio input, and sends the first text information to the terminal device;
step 803, the server receives the second sub-text information collected by the terminal device, so that the server updates the first relationship set by using the second sub-text information.
That is, the server may be an execution subject of the audio processing method, and the acquisition of the audio input data, the speech recognition, and the update processing of the first relationship set may be executed at the server.
It should be noted that, when the server is used as the execution subject of the audio processing method, it generally needs to work in concert with the user's terminal device. Specifically, the terminal device acts as the front-end collection and display device: it collects the user's audio input data, collects the second sub-text information obtained when the user edits the first sub-text information corresponding to the first sub-audio data in that audio input data, and displays the related information. After receiving the audio input data from the terminal device, the server performs speech recognition on it using the first relation set, and obtains the second sub-text information to update the first relation set, so that subsequent speech recognition of the user's audio input data can be based on the updated first relation set.

In practical applications, a terminal device or a server can be selected as the execution subject according to the requirements at hand. If the terminal device is selected, its speech recognition function and its updating of the first relation set do not depend on a server, so normal speech recognition and relation-set updating remain available without a network connection. If a server is selected, the user's terminal device must access the network and connect to the server, but the server can update the first relation set based on big-data results aggregated from many users; the matching information between input information and text information in the first relation set can therefore be more comprehensive and accurate, and the speech recognition results for audio data can achieve higher accuracy.
Fig. 9 is a schematic structural diagram of a seventh embodiment of an audio processing apparatus provided in the present application, where the apparatus may be applied to various terminal devices such as a smart phone, a tablet computer, a personal digital assistant, a notebook, a desktop, or a kiosk, or may also be applied to various general or special servers. As shown in fig. 9, in this embodiment, the audio processing apparatus includes:
the memory 901 is configured to store at least a first relationship set, where the first relationship set includes a corresponding relationship between input information and text information, and is configured to match corresponding text information according to the input information.
A processor 902 for performing the following operations:
obtaining a first audio input;
obtaining first text information corresponding to the first audio input;
acquiring second sub-text information, wherein the second sub-text information is obtained by editing first sub-text information in the first text information; wherein the second sub-text information is usable to update the first set of relationships.
The first audio input may be audio information that a user inputs to a corresponding application of the terminal device based on actual needs; for example, audio information entered into the information input box of a chat tool using the chat tool's audio capture function, or a recording file entered into the device by voice through an audio capture device of the device, such as a microphone. The first audio input may also be an audio file already present on the device, for example an audio file in mp3, wma, rm, wav, or another format on the terminal device or the server.
The audio information corresponding to the first audio input in this application is preferably speech audio information.
The first audio input may be obtained by the execution subject of the method of the present application (such as a terminal device) capturing it through a preset audio capture device, or by obtaining it from a specified path, or by receiving it from another device; this embodiment places no limitation on the manner of obtaining.
The first text information may be text information obtained by performing speech recognition on the audio information corresponding to the first audio input by using a speech recognition engine based on a speech recognition technology.
When the voice recognition technology is used for carrying out voice recognition on the audio information corresponding to the first audio input, the voice recognition on the first audio input can be realized by matching the first audio input with a preset corresponding relation set (such as a voice recognition library) of the audio input information and text information.
As one possible manner, the matching may be performed between the overall audio of the first audio input and the correspondence set of audio input information and text information. This is applicable when the input content corresponding to the first audio input is short, for example only some basic words (such as "china" or "hello"); in that case, matching the overall audio against the correspondence set is enough to identify the first text information corresponding to the first audio input.

As another possible manner, the matching may be performed between each portion of sub-audio data of the first audio input and the correspondence set of audio input information and text information. This is suitable when the input content corresponding to the first audio input is complex, such as a long spoken sentence, a spoken passage, or a speech file. In such a case, each portion of sub-audio data is matched against the correspondence set separately, and the pieces of text information obtained from the individual matches are spliced/combined in sequence to obtain the first text information corresponding to the first audio input.
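A toy sketch of this segment-wise matching and splicing, with the segmenter and the relation set deliberately simplified (real segmentation would be acoustic, not byte-based):

```python
def split_into_segments(audio: bytes) -> list:
    # Toy split into fixed-size chunks; a stand-in for acoustic segmentation.
    return [audio[i:i + 2] for i in range(0, len(audio), 2)]

def recognize_long_input(audio: bytes, relation_set: dict) -> str:
    pieces = []
    for segment in split_into_segments(audio):
        pieces.append(relation_set.get(segment, "?"))  # per-segment match
    return "".join(pieces)  # splice/combine the matches in sequence
```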
The first text information corresponding to the first audio input may be obtained by the execution subject of the method performing speech recognition on the first audio input itself, or by receiving first text information produced by another device's speech recognition.
After the first text information corresponding to the first audio input is obtained, the first text information may be displayed, for example, the first text information is displayed in a chat window of a chat tool, the first text information is displayed in an information search bar of a search tool, the first text information is displayed in a corresponding information registration bar, or the first text information is displayed in a corresponding text editing interface, and the like, where specific display conditions are determined according to actual application scenarios.
The second sub-text information can be used for updating a first relation set, and the first relation set comprises a corresponding relation between the input information and the text information and is used for matching the corresponding text information according to the input information. The first set of relationships may include, but is not limited to, correspondences between audio input information and textual information.
Because a speech recognition engine generally suffers from limited recognition accuracy, especially for similar-sounding words, it may fail to produce the result the user requires; that is, the first text information corresponding to the first audio input may contain first sub-text information that was recognized incorrectly and does not meet the user's needs. For example, the user's actual first audio input is "the crested ibis is rare", but speech recognition yields first text information built from similar-sounding words, rendered in this translation as "congratulation" and "curiosity"; the first text information thus contains the erroneous first sub-text information "congratulation" and "curiosity" and does not meet the user's requirement.

In this case, the user generally needs to perform a corresponding editing operation on the first sub-text information to correct it into the required second sub-text information. For example, the cursor is positioned at the first sub-text information within the first text information, the first sub-text information is deleted, and the required second sub-text information is entered at that position (via an input method, by handwriting, and so on); or the differing characters within the first sub-text information are revised to obtain the second sub-text information.

When the user obtains the second sub-text information by editing the first sub-text information in the first text information, the second sub-text information is acquired. It may be acquired by the execution subject of the method capturing the user's editing information, or by receiving second sub-text information collected by another device; this embodiment is not limited in this respect.
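One conceivable way (not specified by the patent) to recover the edited span is to diff the displayed first text information against the user's final text, for example with Python's standard difflib:

```python
import difflib

def extract_edit(first_text: str, edited_text: str):
    """Return (first_sub_text, second_sub_text) for the replaced span,
    or None if the user made no replacement."""
    matcher = difflib.SequenceMatcher(None, first_text, edited_text)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            return first_text[i1:i2], edited_text[j1:j2]
    return None

# extract_edit("abc DEF ghi", "abc XYZ ghi") -> ("DEF", "XYZ")
```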
Since the second sub-text information is more suitable for the user requirement or at least more suitable for the user requirement in the current application scenario, the first relationship set may be updated based on the second sub-text information. Such as adding the corresponding relationship between the input information and the text information to the first relationship set.
By updating the first relation set based on the second sub-text information, the matching information of the input information and the text information in the first relation set can be more suitable for the user requirements, and further, the subsequent information matching (such as voice recognition) based on the updated first relation set can be realized, so that higher matching accuracy can be achieved.
As can be seen from the above solution, after obtaining the first audio input and the first text information corresponding to it, the audio processing apparatus provided in this embodiment obtains the second sub-text information produced by editing the first sub-text information in the first text information. The second sub-text information can be used to update the first relation set, which contains correspondences between input information and text information for matching text information to input information. Because the first relation set can be updated with the second sub-text information, the matching information between input information and text information in the set becomes more accurate, and subsequent information matching (such as speech recognition) based on the updated first relation set achieves correspondingly higher accuracy.
In the eighth embodiment of the present application, the content of the corresponding relationship between the input information and the text information included in the first relationship set is described, and on this basis, the implementation process of updating the first relationship set based on the second sub-text information by the processor 902 is provided.
In this embodiment, the correspondence between the input information and the text information included in the first relationship set includes at least one of the following correspondences:
the corresponding relation between the audio input information and the text information;
or,

the correspondence between the text input information and the text information.
That is, as one possible implementation, the first relation set may contain only correspondences between audio input information and text information, such as correspondences between spoken words/phrases and written words/phrases; in this case, the first relation set can serve as a speech recognition library for speech recognition.
As another possible implementation manner, the first relationship set may also only include the correspondence between the text input information and the text information, for example, only include the correspondence between pinyin characters and text words/phrases, and in this case, the first relationship set may be used as a text input method recognition library for recognizing text input.
As another possible implementation manner, the first relationship set may include both the foregoing relationships, so that the first relationship set may provide a speech recognition library for performing speech recognition and a text input method recognition library for performing text input recognition.
On this basis, for the case that the correspondence between the input information and the text information only includes the correspondence between the audio input information and the text information, the implementation process of the processor 902 updating the first relationship set based on the second sub-text information is specifically as follows:
determining corresponding first sub-audio data in the first audio input according to first sub-text information, and using the first sub-audio data as first audio input information; and updating the first relation set according to the corresponding relation formed by the first audio input information and the second sub-text information.
The first sub-audio data is audio data corresponding to the edited first sub-text information in the first audio input.
The first sub-audio data may be colloquially understood as the audio data within the first audio input that was recognized incorrectly, or whose recognition result fell short of the user's expectation, when the first audio input was speech-recognized. In the above example, where the user's actual first audio input "the crested ibis is rare" was recognized with the similar-sounding results rendered as "congratulation" and "curiosity", those recognition results for "crested ibis" and "rare" are erroneous and need editing; the first sub-audio data is accordingly the audio data corresponding to those mis-recognized words.

As described above, the second sub-text information is obtained by editing the first sub-text information, such as "crested ibis" obtained by editing "congratulation", or "rare" obtained by editing "curiosity". The second sub-text information therefore better fits the user's requirement, or at least the requirement in the current application scenario, and better fits the first sub-audio data (the first audio input information). In view of this, a correspondence between the first audio input information (the first sub-audio data) and the second sub-text information may be established, and the first relation set updated based on that correspondence.

Updating the first relation set based on this correspondence proceeds as follows. When no correspondence between the first audio input information and the second sub-text information exists in the first relation set, the correspondence is added to the set; in this case, the second sub-text information is essentially added as a new candidate text information item for the first audio input information. For example, if the first audio input information already corresponds to first candidate text information and second candidate text information in the first relation set, the second sub-text information may be added as third candidate text information for the first audio input information.

Alternatively, when the correspondence already exists in the first relation set, the attributes of the candidate text information corresponding to the first audio input information are adjusted: for example, the usage count and usage frequency of the second sub-text information are updated, or, based on a most-recent-use rule, the priority of the second sub-text information is raised directly to the highest priority and the priorities of the other candidate text information items are lowered accordingly.

For example, consider the correspondence between the "crested ibis" audio data and the "crested ibis" text. If the correspondence does not exist in the speech recognition library, it may be added; that is, a candidate text "crested ibis" is added for the "crested ibis" audio data in the library. If the correspondence already exists, this use of the "crested ibis" text increases that candidate's usage frequency/count, so the usage frequency and usage count of the candidate "crested ibis" can be increased accordingly; or, based on the most-recent-use rule, the priority of the candidate "crested ibis" can be raised directly to the highest priority, with the priorities of the other candidates lowered accordingly.
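A compact sketch of this add-or-promote update rule, using an assumed mapping from audio keys to per-candidate usage counts:

```python
def update_relation(relations: dict, audio_key: str, new_text: str) -> None:
    candidates = relations.setdefault(audio_key, {})  # text -> usage count
    if new_text not in candidates:
        candidates[new_text] = 1      # add as a new candidate text
    else:
        candidates[new_text] += 1     # bump the usage count/frequency
    # A most-recent-use policy could instead record a timestamp or move
    # new_text to the front of an ordered candidate list.

library: dict = {}
update_relation(library, "crested_ibis_audio", "crested ibis")
```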
Subsequently, when information matching is performed on input information based on the updated first relation set, for example when speech recognition is performed on audio data, the corresponding text information may be matched according to the usage count, usage frequency, or context of each candidate text for that audio data; or, based directly on the most-recent-use rule, the second sub-text information corresponding to the audio data may be taken as the highest-priority candidate and matched with the audio data.

This processing updates the speech recognition library according to how the user edits audio input recognition results, making the correspondence information in the library more accurate and thereby effectively improving the recognition accuracy of subsequent speech recognition based on the updated library.
For the case that the correspondence between the input information and the text information only includes the correspondence between the text input information and the text information, the implementation process of the processor 902 updating the first relationship set based on the second sub-text information specifically includes the following steps:
determining character information corresponding to second sub-text information according to the second sub-text information, wherein the character information is information capable of being input to form the second sub-text information and is used as first text input information; and updating the first relation set according to the corresponding relation formed by the first text input information and the second sub-text information.
The character information corresponding to the second sub-text information may be, but is not limited to, the pinyin characters corresponding to it, such as "zhuhuan" corresponding to "crested ibis", or "zhenqin" corresponding to "rare bird".
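For illustration, the pinyin characters could be derived with the third-party pypinyin package (one possible helper; the patent mandates no particular library):

```python
from pypinyin import lazy_pinyin  # pip install pypinyin

def text_input_key(second_sub_text: str) -> str:
    # Expected: "朱鹮" (crested ibis) -> "zhuhuan"; "珍禽" (rare bird) -> "zhenqin"
    return "".join(lazy_pinyin(second_sub_text))

def update_text_input_library(library: dict, second_sub_text: str) -> None:
    key = text_input_key(second_sub_text)  # first text input information
    library.setdefault(key, []).append(second_sub_text)
```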
As described above, the second sub-text information is obtained by editing the first sub-text information, such as "crested ibis" obtained by editing "congratulation", or "rare" obtained by editing "curiosity", so it better fits the user's requirement (or at least the requirement in the current application scene) and better matches the first sub-audio data in the first audio input. In view of this, a correspondence between the first text input information, namely the character information, and the second sub-text information is established, and the first relation set is updated based on that correspondence.

Updating the first relation set based on the correspondence between the first text input information and the second sub-text information proceeds as follows: when no such correspondence exists in the first relation set, the correspondence may be added to it; or, when the correspondence already exists, the attributes of the candidate text information corresponding to the first text input information are adjusted, for example by updating the usage count and usage frequency of the second sub-text information, or by raising its priority directly to the highest priority based on the most-recent-use rule and lowering the priorities of the other candidates accordingly.

For example, consider the correspondence between the pinyin characters "zhuhuan" and the text "crested ibis". If the correspondence does not exist in the character input method recognition library, it may be added; that is, a new candidate text "crested ibis" is added for the pinyin characters "zhuhuan" in the library. If the correspondence already exists, this use of the text "crested ibis" increases that candidate's usage frequency/count, so the usage frequency or usage count of the candidate "crested ibis" corresponding to "zhuhuan" can be increased accordingly; or the candidate's priority can be raised directly to the highest priority based on the most-recent-use rule, with the priorities of the other candidates lowered accordingly.

Subsequently, when information matching is performed on input information based on the updated first relation set, for example when character recognition is performed on entered pinyin characters, the corresponding text can be matched according to the usage count, usage frequency, or context of each candidate text for those pinyin characters; or, based directly on the most-recent-use rule, the second sub-text information corresponding to the pinyin characters can be taken as the highest-priority candidate and matched with them.

The above processing updates the character input method recognition library according to how the user edits audio input recognition results, breaking the barrier in conventional technology whereby speech recognition and character input recognition are isolated from each other. In a specific implementation, updating the speech recognition library (for example, based on the user editing the first sub-text information corresponding to the first sub-audio data in the first audio input into the second sub-text information) triggers an update of the character input method recognition library. The two libraries are thereby linked: the character input method recognition library is updated in step with updates to the speech recognition library, so it can refresh its lexicon without requiring the characters to be typed first or the lexicon to be updated over a network. Using the user's edits to audio input recognition results as a reference for updating the character input method recognition library makes the correspondence information in that library more accurate, effectively improving the accuracy of character input recognition based on the updated library.

For the case in which the correspondences between input information and text information include both correspondences between audio input information and text information and correspondences between text input information and text information, that is, the first relation set provides both a speech recognition library for speech recognition and a character input method recognition library for character input recognition, the two processing procedures above can be combined, so that when the user edits an audio input recognition result, both recognition libraries are updated based on the related editing information.

In this embodiment, when the user edits the recognition result of an audio input, the first relation set is updated based on the related editing information, which effectively improves the accuracy of the correspondence information between input information and text information in the set; subsequent information matching based on the updated first relation set (such as speech recognition or character input recognition) therefore achieves higher accuracy. In addition, this embodiment updates the character input method recognition library according to the user's edits to audio input recognition results, breaking the barrier in conventional technology whereby speech recognition and character input recognition are isolated from each other.
In the ninth embodiment of the present application, another possible implementation manner of updating the first relationship set is provided, wherein if the first relationship set includes a corresponding relationship between audio input information and text information, the first relationship set can be updated based on an update of a second relationship set, and the second relationship set includes a corresponding relationship between text input information and text information.
The first set of relationships in this embodiment may be understood colloquially to include at least a speech recognition library, while the second set of relationships includes a text input method recognition library.
On this basis, the update process by which the processor 902 updates the first set of relationships based on the update of the second set of relationships includes:
acquiring a newly added corresponding relation of a second relation set, wherein the newly added corresponding relation is a corresponding relation between second text input information and third sub-text information, and the second text input information is character information corresponding to the third sub-text information; obtaining second sub-audio data corresponding to the third sub-text information; and updating the first relation set according to the corresponding relation formed by the second sub-audio data and the third sub-text information.
The new correspondence relationship of the second relationship set may be, but is not limited to, a correspondence relationship newly added to the second relationship set based on an editing condition of the user on the character input recognition result.
The following description is given by way of example.
Suppose a user enters the pinyin characters "qiyong" through the character input method, and the character input method recognition library matches the character recognition result rendered in this translation as "activating". This is not the result the user requires, so the user edits it into the homophonous word rendered as "start up". If no correspondence between the pinyin characters "qiyong" and the character information "start up" exists in the character input method recognition library, that is, "start up" is not yet a candidate for "qiyong" there, then the correspondence between "qiyong" and "start up" can be added, so that the character input method recognition library gains this new correspondence.
After the second relation set is added with the corresponding relation in the character input method identification library, the added corresponding relation can be obtained, and the first relation set is updated according to the added corresponding relation.
After the newly added corresponding relationship of the second relationship set is obtained, the second sub-audio data corresponding to the third sub-text information in the corresponding relationship can be obtained.
In this embodiment, the second sub-audio data may be audio data obtained by performing voice simulation on the third sub-text information, for example, for a newly added corresponding relationship of the second relationship set: the pinyin character "qiyong" → the text information "start up", and the text information "start up" may be subjected to voice simulation, so that the second sub-audio data corresponding to the text information "start up" is obtained.
After the second sub-audio data corresponding to the third sub-text information is obtained, a corresponding relationship between the second sub-audio data and the third sub-text information may be established, and the first relationship set may be updated based on the corresponding relationship.
The first relationship set is updated based on the correspondence between the second sub-audio data and the third sub-text information, and specifically, the correspondence between the second sub-audio data and the third sub-text information may be added to the first relationship set when the correspondence between the second sub-audio data and the third sub-text information does not exist in the first relationship set; or, in a case that the corresponding relationship between the second sub-audio data and the third sub-text information already exists in the first relationship set, the attribute of the corresponding candidate text information of the second sub-audio data in the first relationship set may be adjusted, for example, the number of times of use and the frequency of use of the third sub-text information are adjusted, or the priority of the third sub-text information is adjusted to the highest priority based on the most recent usage rule directly, and the priorities of other candidate text information are reduced accordingly.
For example, consider the correspondence between the "start up" audio data and the "start up" text. If the correspondence does not exist in the speech recognition library, it may be added. If it already exists, the usage frequency and usage count of the candidate text "start up" corresponding to the "start up" audio data can be adjusted, or the priority of that candidate raised directly to the highest priority based on the most-recent-use rule, with the priorities of the other candidate text information lowered accordingly.
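A hedged sketch of this linkage, with the speech synthesis step reduced to a stand-in function (a real system would invoke a TTS engine):

```python
def synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # stand-in for voice simulation (TTS)

def propagate(new_pairs, speech_library: dict) -> None:
    """Push correspondences newly added to the text input method library
    (second relation set) into the speech recognition library (first set)."""
    for pinyin_key, third_sub_text in new_pairs:   # e.g. ("qiyong", "start up")
        second_sub_audio = synthesize(third_sub_text)  # second sub-audio data
        speech_library.setdefault(second_sub_audio, []).append(third_sub_text)

propagate([("qiyong", "start up")], {})
```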
Subsequently, when information matching is performed on the input information based on the updated first relationship set, for example, when speech recognition is performed on the audio data, corresponding text information may be matched according to information such as the number of times of use, the frequency of use, or the context of each candidate text corresponding to the audio data in the first relationship set, or based directly on a recent usage principle, third sub-text information corresponding to the audio data in the first relationship set may be used as a candidate text with the highest priority to be matched with the audio data.
Combining the scheme of this embodiment with that of the previous embodiment achieves two-way communication between the speech recognition library and the character input method recognition library: the speech recognition library can be updated based on the user's edits to speech recognition results, the character input method recognition library can be updated based on the user's edits to character input recognition results, and an update of either library can trigger an update of the other. This two-way linkage breaks the barrier in conventional technology whereby speech recognition and character input recognition are isolated from each other, and can effectively improve the information matching accuracy of both libraries.

In the tenth embodiment of the present application, after obtaining the updated first relation set, the processor 902 may further be configured to:
obtaining a second audio input; performing speech recognition on the second audio input based on the updated first set of relationships.
The second audio input may likewise be audio information that the user inputs to a corresponding application of the terminal device based on actual needs; for example, audio information entered into the information input box of a chat tool using the chat tool's audio capture function, or a recording file entered into the device by voice through an audio capture device such as a microphone. The second audio input may also be an audio file already present on the device, for example an audio file in mp3, wma, rm, wav, or another format on the terminal device or the server.

The second audio input may be obtained by the execution subject of the method of the present application (such as a terminal device) through a preset audio collecting device, or obtained from a specified path, or received from another device; this embodiment is not limited in this respect.
After obtaining the second audio input, speech recognition may be performed on the second audio input based on the updated first set of relationships.
When speech recognition is performed on the second audio input based on the updated first relation set, one possible implementation is to treat each piece of second sub-text information in the first relation set as the highest-priority candidate for audio matching, based on the most-recent-use rule. Then, if the first sub-audio data in the second audio input has corresponding second sub-text information in the first relation set, that second sub-text information (i.e., the information obtained when the first sub-audio data was last edited) may be matched with the first sub-audio data directly.

If the first relation set contains no second sub-text information corresponding to the first sub-audio data, the second sub-text information obtained when the user edits the first sub-text information corresponding to the first sub-audio data can be acquired according to the method introduced in this application, and the first relation set updated based on it, so that the corresponding second sub-text information can be matched to the first sub-audio data the next time such an audio input is recognized.
Before that, that is, while no second sub-text information corresponding to the first sub-audio data exists, the first sub-text information corresponding to the first sub-audio data may be generated as follows: the text information in the first relation set comprises individual characters (as distinct from pinyin characters); a first part of the first sub-audio data is matched with a first one of those characters according to the first relation set, a second part with a second character, and so on, until the whole of the first sub-audio data has been matched; finally, the characters matched to the parts of the first sub-audio data are spliced/combined in sequence to form the first sub-text information corresponding to the first sub-audio data.
As another possible implementation, suppose that in the first relation set the first sub-audio data already corresponds to existing candidate text information, for example to first candidate text information, second candidate text information, and so on. If the first relation set is updated based on the second sub-text information of the first sub-audio data, so that the second sub-text information is added as third candidate text information of the first sub-audio data, then, when speech recognition is performed on the first sub-audio data based on the updated first relation set, one of the candidate text information items corresponding to the first sub-audio data may be selected according to a predetermined policy and matched with the first sub-audio data. The policy may be based on usage frequency, usage count, or context information; for example, the candidate text information with the highest usage frequency or usage count, or the candidate text information that accords with the context information, is selected and matched with the first sub-audio data.

In this embodiment, speech recognition is performed on the second audio input based on the updated first relation set. Since the matching information between the input information and the text information in the updated first relation set is more accurate, speech recognition of the second audio input using the updated set achieves higher accuracy.

The eleventh embodiment of the present application further provides an implementation process by which the processor 902 performs speech recognition on the second audio input based on the updated first relation set. In the case that the updated first relation set includes a first correspondence between the first sub-audio data and the second sub-text information and a second correspondence between the first sub-audio data and the first sub-text information, and the second audio input includes the first sub-audio data, the processor 902 may implement the speech recognition through the following processing:

If the second audio input and the first audio input satisfy a predetermined condition, respond to the second audio input with second text information that includes the second sub-text information; if they do not satisfy the predetermined condition, select the matching sub-text information from the first correspondence and the second correspondence, according to matching priority, as a part of third text information with which to respond to the second audio input.
Wherein the second audio input and the first audio input satisfying a predetermined condition comprise at least one of: the time interval between the input time of the second audio input and the input time of the first audio input is less than a preset time length; the input position of the second audio input and the input position of the first audio satisfy the same input attribute.
That is, if the input time of the second audio input and the input time of the first audio input are relatively close, the time interval between the input times of the second audio input and the first audio input is less than the preset time length, or the input position of the second audio input and the input position of the first audio input satisfy the same input attribute, it is considered that the first sub audio data appearing in the second audio input and the first sub audio data appearing in the first audio input have the same text information matching requirement, so that after obtaining the second sub text information obtained by editing the first sub text information corresponding to the first sub audio data in the first audio input by the user, and updating the first relation set based on the second sub text information, when continuing to perform voice recognition on the second audio input by using the updated first relation set, the second sub text information can be preferentially selected to match the first sub audio data appearing in the second audio input, that is, the second audio input is responded to with second text information including second sub-text information.
For example, in a case where voice audio input is continuously performed (assuming that input time intervals of respective voice audio in a certain scene are all smaller than the preset time length), if second sub-text information obtained after a user edits first sub-text information corresponding to first sub-audio data in a certain piece of audio (for example, a first sentence) is obtained, and the first relation set is updated based on the second sub-text information, then subsequently, when speech recognition is performed on other audio in the continuously input scene based on the updated first relation set, if the first sub-audio data appears again in the other audio, the second sub-text information is preferentially used to match the first sub-audio data in the other audio.
The input position of the second audio input and the input position of the first audio input satisfying the same input properties may include, but are not limited to: the input position of the second audio input and the input position of the first audio input are the same input area of the same application, or different input areas of the same application, or matching input areas of different applications.
Specifically, for example, the input position of the second audio input and the input position of the first audio input are both user name input boxes of a certain application, such as a user using an application multiple times, entering a scene of a name into a name input box of the application at each use, in this case, the name entered by the user into the name input box generally corresponds to a particular Chinese combination, and, for that case, after editing first text information of first sub-audio data (such as audio corresponding to a certain word in the name audio) included in first audio input (such as name audio input for the first time to the name input box of the application) to obtain second sub-text information, when the first sub-audio data also appears in a subsequent second audio input (such as the name audio again entered into the name input box of the application), the second sub-text information may be preferentially employed to match first sub-audio data present in the second audio input.
For another example, the input position of the second audio input and the input position of the first audio input are different chat windows of the chat tool, and for a scene based on different chat windows and different people chatting in the chat tool, the same text characters are also used when a name of a certain person is involved, for example, when the name of a friend is chatted with different friends through a plurality of windows, wherein the same text characters are used for the name of the friend when the names of the friend are talked in the plurality of windows, so that, for this case, when the first text information of the first sub-audio data (for example, the audio corresponding to a certain word in the name audio) included in the first audio input (for example, the name audio first typed into a certain chat window) is edited to obtain the second sub-text information, when the first sub-audio data also appears in the subsequent second audio input (for example, the name audio typed into other chat windows), the second sub-text information may be preferentially employed to match first sub-audio data present in the second audio input.
For another example, the input position of the second audio input and the input position of the first audio input are user-name input boxes in different applications or on different occasions. When names are input into the name input boxes of different applications or occasions, the input name generally corresponds to one specific character combination. Accordingly, after the first sub-text information of first sub-audio data included in a first audio input (such as the name audio input into the name input box of a first application) is edited to obtain second sub-text information, when the first sub-audio data also appears in a subsequent second audio input (such as the name audio input into the name input box of a second application), the second sub-text information can be preferentially adopted to match the first sub-audio data appearing in the second audio input.
In the above three examples, the first audio input may be, for example, audio for a name entered into the name input box of an application or a chat window of a chat tool, rendered in this translation as "stale" (the strings "stale" and "sunken ship" stand for near-homophones in the original Chinese). At the first match, since the high-priority text is "sunken ship", the audio is matched with the literal text "sunken ship"; the user then modifies "sunken ship" into the literal text "stale" through a corresponding editing operation. Consequently, when the user inputs the "stale" audio again into the name input box of the same application, or into a different chat window of the chat tool, the literal text "stale" is preferentially matched with the audio.
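A hedged sketch of the three input-position cases above follows; the position fields and the "area kind" label are illustrative assumptions rather than terms defined by this application:

```python
def same_input_attribute(pos_a: dict, pos_b: dict) -> bool:
    """pos_* = {"app": str, "area": str, "area_kind": str}.

    True when the two input positions are (1) the same input area of the
    same application, (2) different input areas of the same application,
    or (3) matching kinds of input area (e.g., both name input boxes) in
    different applications.
    """
    if pos_a["app"] == pos_b["app"]:
        return True  # covers cases (1) and (2)
    # case (3): different applications, but matching kinds of input area
    return pos_a["area_kind"] == pos_b["area_kind"]


print(same_input_attribute(
    {"app": "app_a", "area": "login_name", "area_kind": "name_box"},
    {"app": "app_b", "area": "signup_name", "area_kind": "name_box"}))  # True
```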
In practical implementations, when updating the first relation set, relevant input attribute information of the first audio input may also be obtained, and the first relation set may then be updated in combination with that input attribute information.
Specifically, the input process of the first audio input, the page content and/or cursor position, the time, the geographic position, and the like can be analyzed to obtain input attribute information such as the application type, input area, context, time information, geographic position, and/or occasion corresponding to the first audio input. On this basis, the first relation set may be updated based on both the input attribute information and the second sub-text information obtained by editing the first sub-text information corresponding to the first sub-audio data in the first audio input. The first relation set then comprises correspondences among input information, text information, and input attribute information, so that when speech recognition is performed on the second audio input based on the updated first relation set, text matching of the first sub-audio data in the second audio input can be performed in combination with the input attribute information.
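As a minimal sketch, assuming a simple list-backed store, the following shows one way the (input information, text information, input attribute information) correspondences might be recorded and consulted; all field names are assumptions:

```python
import time
from typing import Optional


def update_first_relation_set(relation_set: list, sub_audio_id: str,
                              second_sub_text: str, attrs: dict) -> None:
    """Append a correspondence tagged with input attribute information.

    attrs may carry, e.g., application type, input area, context,
    time information, geographic position, occasion.
    """
    relation_set.append({
        "input": sub_audio_id,    # the first sub-audio data
        "text": second_sub_text,  # the user-edited second sub-text info
        "attrs": attrs,           # input attribute information
        "updated_at": time.time(),
    })


def lookup_with_attrs(relation_set: list, sub_audio_id: str,
                      current_attrs: dict) -> Optional[str]:
    """Prefer an entry whose stored attributes match the current input."""
    candidates = [e for e in relation_set if e["input"] == sub_audio_id]
    for e in candidates:
        if e["attrs"].get("input_area") == current_attrs.get("input_area"):
            return e["text"]
    return candidates[-1]["text"] if candidates else None


rs: list = []
update_first_relation_set(rs, "sub_audio_1", "stale",
                          {"input_area": "name_box", "app_type": "chat"})
print(lookup_with_attrs(rs, "sub_audio_1", {"input_area": "name_box"}))
```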
If the second audio input and the first audio input do not satisfy the predetermined condition, for example the time interval between their input times is longer than the preset time length, or their input positions do not have the same input attribute, then the same sub-audio data appearing in the two audio inputs is considered not strongly correlated and does not necessarily have the same text-matching requirement. For this case, the second audio input may be responded to by selecting, according to matching priority, the sub-text information with the highest priority from the first correspondence and the second correspondence as a part of the third text information (for example, taking the candidate text information with the highest usage count or frequency as the high-priority text information, or taking the candidate text information that matches context information as the high-priority text information).
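The fallback selection by matching priority could be sketched as follows, with usage count and context overlap as the illustrative priority signals mentioned above; the scoring rule is an assumption, not a mandated formula:

```python
def select_by_priority(candidates: list, context_tokens: set) -> str:
    """candidates: [{"text": str, "uses": int, "context": set}, ...]"""
    def score(c: dict) -> tuple:
        context_hits = len(c["context"] & context_tokens)
        return (context_hits, c["uses"])  # context match first, then usage
    return max(candidates, key=score)["text"]


# The frequently used candidate loses here because the other candidate
# matches the current context information better.
best = select_by_priority(
    [{"text": "candidate_a", "uses": 10, "context": {"idiom"}},
     {"text": "candidate_b", "uses": 2, "context": {"name", "friend"}}],
    context_tokens={"friend", "name"},
)
print(best)  # "candidate_b"
```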
In this embodiment, different speech recognition processing is performed on an audio input depending on whether its input attributes satisfy the condition, so that input attribute information is taken into account when determining the text-matching requirement during speech recognition, giving the speech recognition of the audio input higher accuracy.
In a twelfth embodiment of the present application, the audio processing procedure of the audio processing apparatus is described for the cases where different execution entities are adopted. The audio processing apparatus of the present application may be applied to a terminal device, which may be a smart phone, a tablet computer, a personal digital assistant, a notebook computer, a desktop computer, or an all-in-one computer. When the audio processing apparatus is applied to a terminal device, its audio processing procedure may include:
collecting a first audio input at the terminal device; performing voice recognition on the first audio input at the terminal device to obtain first text information corresponding to the first audio input; and collecting second sub-text information at the terminal device, so that the first relation set is updated by the terminal device using the second sub-text information.
That is, the acquisition of audio input data, the speech recognition, and the update processing of the first relationship set are performed at the terminal device.
The terminal device can collect the corresponding audio input data directly through an audio capture device such as a microphone (mic), or collect the user's audio input data within an application through an audio capture function that the application provides. On this basis, voice recognition is performed on the collected audio input data using a first relation set comprising the correspondence between audio input information and text information and/or the correspondence between text input information and text information. After the user edits the first sub-text information corresponding to the first sub-audio data in the audio input data to obtain second sub-text information, the first relation set is updated based on the second sub-text information, so that subsequent voice recognition of the user's audio input data can be performed based on the updated first relation set.
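The on-terminal flow can be illustrated with the following self-contained Python sketch; collect_audio() and the dictionary-backed relation set are hypothetical stand-ins, not a real device API:

```python
def collect_audio() -> list:
    # Stand-in for capture via a microphone; returns sub-audio IDs.
    return ["sub_audio_1", "sub_audio_2"]


def recognize(audio: list, relation_set: dict) -> str:
    # Match each piece of sub-audio data and splice the results in order.
    return "".join(relation_set.get(a, "?") for a in audio)


def on_device_pipeline(relation_set: dict) -> str:
    audio = collect_audio()
    first_text = recognize(audio, relation_set)
    print("first text information:", first_text)
    # Suppose the user edits the text matched for "sub_audio_1":
    relation_set["sub_audio_1"] = "edited"  # update the first relation set
    return recognize(audio, relation_set)   # later recognition uses the edit


print(on_device_pipeline({"sub_audio_1": "raw", "sub_audio_2": "text"}))
```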
The audio processing apparatus of the present application may also be applied to a server, which may be any of various general-purpose or special-purpose servers. When applied to a server, the audio processing procedure of the audio processing apparatus may include:
receiving, at the server, a first audio input collected by a terminal device; performing voice recognition on the first audio input at the server to obtain first text information corresponding to the first audio input, and sending the first text information to the terminal device; and receiving, at the server, the second sub-text information collected by the terminal device, so that the first relation set is updated at the server using the second sub-text information.
That is, the speech recognition and the update processing of the first relation set may also be performed at the server, with the audio input data collected by the terminal device.
It should be noted that, when the audio processing apparatus is applied to a server, the server generally needs to be used together with a user's terminal device. Specifically, the terminal device serves as the front-end collection and display apparatus: it collects the user's audio input data, collects the second sub-text information obtained when the user edits the first sub-text information corresponding to the first sub-audio data in that audio input data, and displays the related information. After receiving the audio input data from the terminal device, the server performs voice recognition on it using the first relation set, obtains the second sub-text information, and updates the first relation set accordingly, so that subsequent voice recognition of the user's audio input data can be performed based on the updated first relation set.
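A minimal sketch of this terminal/server division of labor follows; the endpoint paths and JSON fields are assumptions, not a protocol defined by this application:

```python
import json

RELATION_SET = {"sub_audio_1": "raw"}  # server-side first relation set


def handle_request(path: str, body: str) -> str:
    data = json.loads(body)
    if path == "/recognize":
        # Server-side speech recognition against the first relation set.
        text = "".join(RELATION_SET.get(a, "?") for a in data["audio"])
        return json.dumps({"first_text": text})  # sent back to the terminal
    if path == "/edit":
        # Terminal reports (first sub-audio data, second sub-text info).
        RELATION_SET[data["sub_audio"]] = data["second_sub_text"]
        return json.dumps({"ok": True})
    return json.dumps({"error": "unknown path"})


print(handle_request("/recognize", '{"audio": ["sub_audio_1"]}'))
print(handle_request("/edit",
      '{"sub_audio": "sub_audio_1", "second_sub_text": "edited"}'))
print(handle_request("/recognize", '{"audio": ["sub_audio_1"]}'))
```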
In practical application, whether the audio processing apparatus is applied to a terminal device or to a server can be chosen according to different requirements. If it is applied to the terminal device, the voice recognition function and the first-relation-set updating function do not depend on a server, so normal voice recognition and relation-set updating can be provided even without a network connection. If it is applied to the server, the user's terminal device must access the network and connect to the server through it; the advantage of this mode is that the first relation set can be updated based on big-data results from many users, making the matching information between input information and text information in the first relation set more comprehensive and accurate, and thus giving the voice recognition results higher accuracy.
Fig. 10 is a schematic structural diagram of a thirteenth embodiment of the audio processing apparatus provided in the present application. The apparatus may be applied to various terminal devices such as a smart phone, a tablet computer, a personal digital assistant, a notebook computer, a desktop computer, or an all-in-one computer, or to various general-purpose or special-purpose servers. As shown in Fig. 10, in the present embodiment, the audio processing apparatus includes:
a first obtaining unit 1001 for obtaining a first audio input.
The first audio input may be audio information that the user inputs into a corresponding application of the terminal device based on actual needs; for example, audio information input into the information input box of a chat tool using the chat tool's audio capture function, or a recording file entered into the device by voice through an audio capture device of the device, such as a microphone (mic). The first audio input may also be an audio file in any of various formats already existing on the device, for example an mp3, wma, rm, or wav file on the terminal device or the server.
The audio information corresponding to the first audio input in this application is preferably speech audio information.
The first audio input may be obtained by the execution subject of the method of the present application (such as a terminal device) capturing it through a preset audio capture device, or by obtaining it from a specified path, or by receiving a first audio input transmitted by another device; this implementation is not limited here.
A second obtaining unit 1002, configured to obtain first text information corresponding to the first audio input.
The first text information may be text information obtained by performing speech recognition on the audio information corresponding to the first audio input by using a speech recognition engine based on a speech recognition technology.
When voice recognition is performed on the audio information corresponding to the first audio input, it can be realized by matching the first audio input against a preset correspondence set (such as a speech recognition library) between audio input information and text information.
As one possible manner, the matching may be performed on the overall audio of the first audio input against the correspondence set between audio input information and text information. This is applicable when the input content corresponding to the first audio input is short, for example only some basic words (such as "china" or "hello"); in that case, matching the overall audio against the correspondence set suffices to identify the first text information corresponding to the first audio input.
As another possible manner, the matching may be performed on each piece of sub-audio data of the first audio input against the correspondence set between audio input information and text information. This is suitable when the input content corresponding to the first audio input is complex, such as a long spoken sentence, a spoken passage, or a speech file; in such a case, each piece of sub-audio data of the first audio input is matched against the correspondence set separately, and the pieces of text information obtained from the respective matches are spliced/combined in order to obtain the first text information corresponding to the first audio input.
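The two matching modes can be sketched as follows, assuming a toy correspondence set keyed by tuples of audio units; the short-input threshold is an illustrative assumption:

```python
def recognize_first_text(audio_units: list, relation_set: dict,
                         short_input_limit: int = 2) -> str:
    whole = tuple(audio_units)
    if len(audio_units) <= short_input_limit and whole in relation_set:
        return relation_set[whole]  # whole-audio match for short inputs
    # piecewise match of each piece of sub-audio data, then splice in order
    return "".join(relation_set.get((u,), "?") for u in audio_units)


rs = {("ni", "hao"): "hello", ("zhong",): "middle", ("guo",): "country"}
print(recognize_first_text(["ni", "hao"], rs))           # whole match
print(recognize_first_text(["zhong", "guo", "ni"], rs))  # spliced match
```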
The first text information corresponding to the first audio input may be obtained by the execution subject of the method performing speech recognition on the first audio input itself, or by receiving first text information that another device obtained through speech recognition.
After the first text information corresponding to the first audio input is obtained, it may be displayed, for example in the chat window of a chat tool, in the information search bar of a search tool, in a corresponding information registration bar, or in a corresponding text editing interface; the specific display depends on the actual application scenario.
A third obtaining unit 1003, configured to obtain second sub-text information, where the second sub-text information is obtained by editing first sub-text information in the first text information;
The second sub-text information can be used for updating a first relation set, and the first relation set comprises a corresponding relation between the input information and the text information and is used for matching the corresponding text information according to the input information. The first set of relationships may include, but is not limited to, correspondences between audio input information and textual information.
Because speech recognition engines generally suffer from limited recognition accuracy, especially when distinguishing similar sounds, the recognition result the user requires may not be obtained; that is, the first text information corresponding to the first audio input may contain first sub-text information that was recognized incorrectly and does not meet the user's requirement. For example, the user's actual first audio input is "the crested ibis is rare", but after speech recognition the first text information obtained is rendered here as "congratulation or curiosity" (in the original Chinese, the misrecognized strings are near-homophones of the intended words); the first sub-text information "congratulation" and "curiosity" are recognition errors and do not meet the user's requirement.
In this case, the user generally needs to perform a corresponding editing operation on the first sub-text information to correct it into the second sub-text information they require. For example, the user positions the cursor at the first sub-text information within the first text information, deletes it, and inputs the required second sub-text information at that position (via an input method, by handwriting, etc.), or revises the differing characters in the first sub-text information to obtain the second sub-text information.
When the user obtains the second sub-text information by editing the first sub-text information in the first text information, the second sub-text information is obtained. This may be realized by the execution subject of the method collecting the user's editing information, or by receiving second sub-text information collected by another device; this embodiment is not limited in this respect.
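One simple way to derive the (first sub-text information, second sub-text information) pair from the text before and after the user's edit is to trim the common prefix and suffix; this heuristic is an illustration only, not a method prescribed by this application:

```python
def extract_correction(before: str, after: str) -> tuple:
    # Trim the common prefix.
    i = 0
    while i < min(len(before), len(after)) and before[i] == after[i]:
        i += 1
    # Trim the common suffix, without overlapping the prefix.
    j = 0
    while (j < min(len(before), len(after)) - i
           and before[len(before) - 1 - j] == after[len(after) - 1 - j]):
        j += 1
    # (first sub-text as recognized, second sub-text as edited)
    return before[i:len(before) - j], after[i:len(after) - j]


print(extract_correction("good morming", "good morning"))  # ('m', 'n')
```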
Since the second sub-text information better fits the user's requirement, or at least better fits it in the current application scenario, the first relation set may be updated based on the second sub-text information, for example by adding the corresponding relationship between the input information and the text information to the first relation set.
By updating the first relation set based on the second sub-text information, the matching information between input information and text information in the first relation set can be made to fit the user's requirements better, so that subsequent information matching (such as voice recognition) based on the updated first relation set can achieve higher matching accuracy.
As can be seen from the above solution, after obtaining the first audio input and the first text information corresponding to it, the audio processing apparatus provided in this embodiment obtains the second sub-text information produced by editing the first sub-text information in the first text information. The second sub-text information can be used to update the first relation set, which comprises the correspondence between input information and text information and is used to match corresponding text information according to input information. Because the first relation set can be updated using the second sub-text information, the matching information between input information and text information in the first relation set becomes more accurate, and subsequent information matching processing (such as voice recognition) based on the updated first relation set further improves matching accuracy.
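Mirroring units 1001 to 1003 of Fig. 10, a hedged object sketch of the apparatus might look as follows; the method bodies are illustrative stand-ins, not the patented implementation:

```python
class AudioProcessingApparatus:
    def __init__(self):
        self.first_relation_set: dict = {}  # input info -> text info

    def obtain_first_audio_input(self) -> list:   # first obtaining unit 1001
        return ["sub_audio_1"]

    def obtain_first_text(self, audio: list) -> str:  # second unit 1002
        return "".join(self.first_relation_set.get(a, "?") for a in audio)

    def obtain_second_sub_text(self, sub_audio: str,
                               second_sub_text: str) -> None:
        # third obtaining unit 1003: record the user's edit and use it
        # to update the first relation set
        self.first_relation_set[sub_audio] = second_sub_text


app = AudioProcessingApparatus()
audio = app.obtain_first_audio_input()
print(app.obtain_first_text(audio))   # "?" before any edit
app.obtain_second_sub_text("sub_audio_1", "edited")
print(app.obtain_first_text(audio))   # "edited" after the update
```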
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
For convenience of description, the above system or apparatus is described as being divided into various modules or units by function. Of course, when implementing the present application, the functionality of the units may be implemented in one or more pieces of software and/or hardware.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method according to the embodiments or some parts of the embodiments of the present application.
Finally, it is further noted that, herein, relational terms such as first, second, third, fourth, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (11)

1. An audio processing method, comprising:
obtaining a first audio input;
obtaining first text information corresponding to the first audio input;
acquiring second sub-text information, wherein the second sub-text information is obtained by editing first sub-text information in the first text information;
the second sub-text information can be used for updating a first relation set, the first relation set comprises corresponding relations of the input information and the text information, and the first relation set is used for matching the corresponding text information according to the input information;
wherein the correspondence between the input information and the text information comprises a correspondence between audio input information and text information, and the first set of relationships can be updated based on an update of the second set of relationships; the second relationship set includes a correspondence between the text input information and the text information.
2. The method of claim 1,
wherein the corresponding relationship between the input information and the text information comprises at least one of the following corresponding relationships: the corresponding relation between the audio input information and the text information; or, the corresponding relation between the text input information and the text information;
wherein the updating the first set of relationships comprises at least one of:
determining corresponding first sub-audio data in the first audio input according to first sub-text information, and using the first sub-audio data as first audio input information; updating the first relation set according to the corresponding relation formed by the first audio input information and the second sub-text information;
alternatively,
determining character information corresponding to second sub-text information according to the second sub-text information, wherein the character information is information capable of being input to form the second sub-text information and is used as first text input information; and updating the first relation set according to the corresponding relation formed by the first text input information and the second sub-text information.
3. The method of claim 1,
wherein, in a case where the first relationship set includes a correspondence between the audio input information and the text information, an update process in which the first relationship set is updated based on an update of the second relationship set includes:
acquiring a newly added corresponding relation of a second relation set, wherein the newly added corresponding relation is a corresponding relation between second text input information and third sub-text information, and the second text input information is character information corresponding to the third sub-text information;
obtaining second sub-audio data corresponding to the third sub-text information;
and updating the first relation set according to the corresponding relation formed by the second sub-audio data and the third sub-text information.
4. The method of claim 1, wherein the obtaining a first audio input, the obtaining first text information corresponding to the first audio input, and the obtaining second sub-text information comprise:
the terminal device collects the first audio input; the terminal device performs voice recognition on the first audio input to obtain the first text information corresponding to the first audio input; and the terminal device collects the second sub-text information, so that the first relation set is updated by the terminal device using the second sub-text information;
alternatively,
a server receives the first audio input collected by a terminal device; the server performs voice recognition on the first audio input to obtain the first text information corresponding to the first audio input, and sends the first text information to the terminal device; and the server receives the second sub-text information collected by the terminal device, so that the first relation set is updated at the server using the second sub-text information.
5. An audio processing method, comprising:
obtaining a first audio input;
obtaining first text information corresponding to the first audio input;
acquiring second sub-text information, wherein the second sub-text information is obtained by editing first sub-text information in the first text information;
the second sub-text information can be used for updating a first relation set, the first relation set comprises corresponding relations of the input information and the text information, and the first relation set is used for matching the corresponding text information according to the input information;
after obtaining the updated first set of relationships, further comprising:
obtaining a second audio input;
performing speech recognition on the second audio input based on the updated first set of relationships;
wherein, in case the updated first set of relationships comprises a first correspondence of first sub-audio data and second sub-text information and a second correspondence of first sub-audio data and first sub-text information, and the second audio input comprises the first sub-audio data:
responding to the second audio input with second textual information including second sub-textual information if the second audio input and the first audio input satisfy a predetermined condition;
if the second audio input and the first audio input do not meet the predetermined condition, selecting, according to matching priority, matched sub-text information from the first correspondence and the second correspondence as a part of third text information to respond to the second audio input;
the first sub-audio data is audio data corresponding to the first sub-text information in the first audio input.
6. The method of claim 5, wherein the second audio input and the first audio input satisfying a predetermined condition comprises at least one of:
the time interval between the input time of the second audio input and the input time of the first audio input is less than a preset time length;
the input position of the second audio input and the input position of the first audio input satisfy the same input attribute.
7. The method of claim 6, wherein, when updating the first relation set, input attribute information of the first audio input is obtained, and the first relation set is updated based on the input attribute information;
the first relationship set includes a corresponding relationship of input information, text information, and input attribute information.
8. An audio processing apparatus comprising:
a first obtaining unit for obtaining a first audio input;
the second acquisition unit is used for acquiring first text information corresponding to the first audio input;
a third obtaining unit, configured to obtain second sub-text information, where the second sub-text information is obtained by editing first sub-text information in the first text information;
the second sub-text information can be used for updating a first relation set, the first relation set comprises corresponding relations of the input information and the text information, and the first relation set is used for matching the corresponding text information according to the input information;
wherein the correspondence between the input information and the text information comprises a correspondence between audio input information and text information, and the first set of relationships can be updated based on an update of the second set of relationships; the second relationship set includes a correspondence between the text input information and the text information.
9. An audio processing apparatus comprising:
a first obtaining unit that obtains a first audio input;
the second acquisition unit is used for acquiring first text information corresponding to the first audio input;
a third acquiring unit that acquires second sub-text information obtained by editing first sub-text information in the first text information;
the second sub-text information can be used for updating a first relation set, the first relation set comprises corresponding relations of the input information and the text information, and the first relation set is used for matching the corresponding text information according to the input information;
the apparatus, after obtaining the updated first set of relationships, is further configured to:
obtaining a second audio input;
performing speech recognition on the second audio input based on the updated first set of relationships;
wherein, in case the updated first set of relationships comprises a first correspondence of first sub-audio data and second sub-text information and a second correspondence of first sub-audio data and first sub-text information, and the second audio input comprises the first sub-audio data:
responding to the second audio input with second textual information including second sub-textual information if the second audio input and the first audio input satisfy a predetermined condition;
if the second audio input and the first audio input do not meet the predetermined condition, selecting, according to matching priority, matched sub-text information from the first correspondence and the second correspondence as a part of third text information to respond to the second audio input;
the first sub-audio data is audio data corresponding to the first sub-text information in the first audio input.
10. An audio processing apparatus comprising:
a memory for at least storing a first relation set, wherein the first relation set comprises the corresponding relation between input information and text information, and the first relation set is used for matching the corresponding text information according to the input information;
a processor to perform the following operations:
obtaining a first audio input;
obtaining first text information corresponding to the first audio input;
acquiring second sub-text information, wherein the second sub-text information is obtained by editing first sub-text information in the first text information;
wherein the second sub-text information can be used to update the first relation set;
wherein the correspondence between the input information and the text information comprises a correspondence between audio input information and text information, and the first set of relationships can be updated based on an update of the second set of relationships; the second relationship set includes a correspondence between the text input information and the text information.
11. An audio processing apparatus comprising:
a memory for at least storing a first relation set, wherein the first relation set comprises the corresponding relation between input information and text information, and the first relation set is used for matching the corresponding text information according to the input information;
a processor to perform the following operations:
obtaining a first audio input;
obtaining first text information corresponding to the first audio input;
acquiring second sub-text information, wherein the second sub-text information is obtained by editing first sub-text information in the first text information;
the second sub-text information can be used for updating a first relation set, the first relation set comprises corresponding relations of the input information and the text information, and the first relation set is used for matching the corresponding text information according to the input information;
the processor, after obtaining the updated first set of relationships, is further configured to:
obtaining a second audio input;
performing speech recognition on the second audio input based on the updated first set of relationships;
wherein, in case the updated first set of relationships comprises a first correspondence of first sub-audio data and second sub-text information and a second correspondence of first sub-audio data and first sub-text information, and the second audio input comprises the first sub-audio data:
responding to the second audio input with second textual information including second sub-textual information if the second audio input and the first audio input satisfy a predetermined condition;
if the second audio input and the first audio input do not meet the predetermined condition, selecting, according to matching priority, matched sub-text information from the first correspondence and the second correspondence as a part of third text information to respond to the second audio input;
the first sub-audio data is audio data corresponding to the first sub-text information in the first audio input.
CN201810287493.XA 2018-03-30 2018-03-30 Audio processing method and device Active CN108831473B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810287493.XA CN108831473B (en) 2018-03-30 2018-03-30 Audio processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810287493.XA CN108831473B (en) 2018-03-30 2018-03-30 Audio processing method and device

Publications (2)

Publication Number Publication Date
CN108831473A CN108831473A (en) 2018-11-16
CN108831473B true CN108831473B (en) 2021-08-17

Family

ID=64155177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810287493.XA Active CN108831473B (en) 2018-03-30 2018-03-30 Audio processing method and device

Country Status (1)

Country Link
CN (1) CN108831473B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109947955A (en) * 2019-03-21 2019-06-28 深圳创维数字技术有限公司 Voice search method, user equipment, storage medium and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1764944A (en) * 2003-03-26 2006-04-26 皇家飞利浦电子股份有限公司 Speech recognition system
CN101432801A (en) * 2006-02-23 2009-05-13 日本电气株式会社 Speech recognition dictionary making supporting system, speech recognition dictionary making supporting method, and speech recognition dictionary making supporting program
CN106328146A (en) * 2016-08-22 2017-01-11 广东小天才科技有限公司 Video subtitle generation method and apparatus
CN106384593A (en) * 2016-09-05 2017-02-08 北京金山软件有限公司 Voice information conversion and information generation method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130246041A1 (en) * 2012-03-19 2013-09-19 Marc Alexander Costa Systems and methods for event and incident reporting and management
CN104123930A (en) * 2013-04-27 2014-10-29 华为技术有限公司 Guttural identification method and device
CN103903615B (en) * 2014-03-10 2018-11-09 联想(北京)有限公司 A kind of information processing method and electronic equipment


Also Published As

Publication number Publication date
CN108831473A (en) 2018-11-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant