CN111951790A - Voice processing method, device, terminal and storage medium - Google Patents

Voice processing method, device, terminal and storage medium Download PDF

Info

Publication number
CN111951790A
CN111951790A · CN202010849414.7A
Authority
CN
China
Prior art keywords
user
target
voice
voice data
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010849414.7A
Other languages
Chinese (zh)
Inventor
田植良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010849414.7A priority Critical patent/CN111951790A/en
Publication of CN111951790A publication Critical patent/CN111951790A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/0631 Creating reference templates; Clustering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice processing method, apparatus, terminal and storage medium. The method includes: obtaining target voice data to be recognized; obtaining the target user to which the target voice data belongs; and performing voice recognition on the target voice data with a voice recognition model corresponding to the target user to obtain target text data corresponding to the target voice data. The voice recognition model is obtained by training a universal recognition model with a plurality of first voice samples of the target user that carry text labels, and the universal recognition model is obtained by training an initially constructed universal recognition model with a plurality of second voice samples that carry text labels.

Description

Voice processing method, device, terminal and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech processing method, apparatus, terminal, and storage medium.
Background
Many social applications provide a voice-to-text function. In general, the speech-to-text back end uses the same speech recognition model to convert speech into text.
However, since that speech recognition model is shared by all users, it applies the same conversion to different users, so the speech conversion can be inaccurate.
Disclosure of Invention
In view of the above, the present application provides a voice processing method, apparatus, terminal and storage medium to improve accuracy of voice processing.
To achieve the above object, in one aspect, the present application provides a speech processing method, including:
obtaining target voice data to be identified;
obtaining a target user to which the target voice data belongs;
performing voice recognition on the target voice data by using a voice recognition model corresponding to the target user to obtain target text data corresponding to the target voice data;
the voice recognition model is obtained by training a universal recognition model by utilizing a plurality of first voice samples with text labels of the target user, and the universal recognition model is obtained by training an initially constructed universal recognition model by utilizing a plurality of second voice samples with text labels.
In one possible implementation manner, obtaining a target user to which the target voice data belongs includes:
obtaining each preset first voice data in a first voice set, wherein each first voice data corresponds to one affiliated user respectively;
performing voice processing on the target voice data and the first voice data by using a user classification model to obtain a target user to which the target voice data belongs, wherein the first voice data corresponding to the target user and the target voice data meet preset similar conditions;
the user classification model is obtained by training an initially constructed user classification model by utilizing a plurality of voice sample groups with user classification labels, wherein the voice sample groups comprise two third voice samples, and the user classification labels represent whether the two third voice samples in the voice sample groups belong to the same user.
Optionally, the first voice data corresponding to the target user and the target voice data meet a preset similar condition, including:
the similarity between the first voice data corresponding to the target user and the target voice data is greater than or equal to a preset similarity threshold;
and/or,
the similarity between the first voice data corresponding to the target user and the target voice data is the largest.
Optionally, the user classification model at least includes a convolutional neural network layer, a fully-connected layer, and a classification layer;
the convolutional neural network layer is used for respectively extracting voice features of the target voice data and the first voice data to obtain a first voice feature corresponding to the target voice data and a second voice feature corresponding to the first voice data;
the fully-connected layer is used for performing feature interaction processing on the first voice feature and the second voice feature to obtain a feature interaction result;
and the classification layer is used for generating a classification result according to the feature interaction result, and the classification result represents whether the target voice data and the first voice data belong to the same user.
In one possible implementation, obtaining each first voice data in the first voice set includes:
acquiring a first voice set stored on a terminal, wherein the terminal is a device needing voice recognition on the target voice data;
and obtaining each preset first voice data in the first voice set.
Optionally, in a case that none of the first voice data in the first voice set satisfies the similar condition with the target voice data, the method further includes:
acquiring each second voice data in a second voice set stored on a server, wherein each second voice data corresponds to a belonging user, the server is a device capable of carrying out data transmission with a terminal, and the terminal is a device needing voice recognition on the target voice data;
and performing voice processing on the target voice data and the second voice data by using the user classification model to obtain a target user to which the target voice data belongs, wherein the second voice data corresponding to the target user and the target voice data meet the similar condition.
In one possible implementation manner, obtaining a target user to which the target voice data belongs includes:
performing voice recognition on the target voice data by using a user recognition model to obtain a target user to which the target voice data belongs;
and the user recognition model is obtained by training the initially constructed user recognition model by utilizing a plurality of fourth voice samples with user labels.
In another aspect, the present application further provides a speech processing apparatus, including:
a voice obtaining unit for obtaining target voice data to be recognized;
a user obtaining unit, configured to obtain a target user to which the target speech data belongs;
the voice recognition unit is used for carrying out voice recognition on the target voice data by utilizing a voice recognition model corresponding to the target user so as to obtain target text data corresponding to the target voice data;
the voice recognition model is obtained by training a universal recognition model by utilizing a plurality of first voice samples with text labels of the target user, and the universal recognition model is obtained by training an initially constructed universal recognition model by utilizing a plurality of second voice samples with text labels.
In another aspect, the present application further provides a terminal, including:
a processor and a memory;
wherein the processor is configured to execute a program stored in the memory;
the memory is used for storing a program for implementing at least the speech processing method as defined in any of the above.
In yet another aspect, the present application further provides a storage medium having stored therein computer-executable instructions that, when loaded and executed by a processor, implement a speech processing method as described in any one of the above.
According to the above scheme, in the speech processing method, apparatus, terminal and storage medium provided by the present application, after the target speech data to be recognized is obtained, the target user to which the target speech data belongs is obtained, and speech recognition is then performed on the target speech data with the speech recognition model corresponding to that target user. The speech recognition model corresponding to the target user is obtained by training a universal recognition model with a plurality of first speech samples of the target user carrying text labels, and the universal recognition model is obtained by training an initially constructed universal recognition model with a plurality of second speech samples carrying text labels. On this basis, the speech recognition model corresponding to the target user matches the pronunciation characteristics of the target user better than the universal recognition model, so the target text data obtained by recognizing the target speech data with the target user's speech recognition model is more accurate than text data obtained with the universal recognition model. In other words, the trained universal recognition model is trained again, individually, with the speech samples of the target user, yielding a personalized speech recognition model for that user, and using this model improves the accuracy of speech recognition on the target user's target speech data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a block diagram of a speech processing system according to an embodiment of the present application;
FIGS. 2-4 respectively illustrate exemplary diagrams of a speech processing system according to embodiments of the present application;
fig. 5 is a schematic diagram illustrating a hardware component structure of a terminal implementing speech processing according to an embodiment of the present application;
FIG. 6 is a flow chart illustrating a method of speech processing according to an embodiment of the present application;
FIG. 7 is a diagram illustrating the logical architecture of a user classification model in an embodiment of the present application;
FIGS. 8-10 respectively show application diagrams in embodiments of the present application;
fig. 11 is a schematic diagram illustrating a configuration of an embodiment of a speech processing apparatus according to an embodiment of the present application.
Detailed Description
In the application of voice interaction, a voice recognition model can be used to realize voice recognition of voice data of a user. The user here refers to a user to which the voice data received on the electronic device belongs, such as two users of a chat application, and further, a client in a customer service system, and the like.
The inventor of the present application found through research that there are currently various schemes for performing voice recognition on voice data to obtain the corresponding text data. In one scheme, all users share one voice-to-text model: the model does not distinguish between users, and every user's speech is converted to text with the same sequence model, which may be built with a recurrent neural network or a Transformer. In another scheme, one voice-to-text model is used per device: the model distinguishes between devices, all speech on the same device is treated as belonging to the same user, and all users on that device convert speech to text with the same model. Neither scheme distinguishes individual users, yet different users may differ in accent, dialect or language habits, so using the same model for different users gives poor voice-to-text accuracy.
The inventor further found that the speech data of different users differ in how individual characters are pronounced and in pronunciation habits. Therefore, to improve speech recognition accuracy, the speech recognition model shared by multiple users can be further trained separately with the speech samples of a single user, yielding a personalized speech recognition model for each user. Each personalized model can then accurately recognize the speech data of its corresponding user, which improves speech recognition accuracy.
For ease of understanding, a system to which the solution of the present application is applied is described herein, and reference is made to fig. 1, which is a schematic diagram illustrating a component architecture of a speech processing system of the present application.
As can be seen from fig. 1, the system may include a server 10 and a terminal 20, which are communicatively connected through a network.
The server 10 may be a background server, and the terminal 20 may be a client device such as a mobile phone, a pad or a computer. The user may collect and receive voice data through the terminal 20, and of course the terminal may also play and output voice data. On this basis, the terminal 20 may obtain received or collected target voice data, obtain the target user to which the target voice data belongs, and then perform voice recognition on the target voice data with the voice recognition model corresponding to the target user to obtain the corresponding target text data. Here, the voice recognition model corresponding to the target user is obtained by the server 10 training a universal recognition model with a plurality of first voice samples of the target user carrying text labels, and the universal recognition model may be obtained by training an initially constructed universal recognition model with a plurality of second voice samples carrying text labels. The server 10, such as the background server of a chat application, may also transmit and store voice data, and may store the trained universal recognition model and the voice recognition model of the target user.
It should be noted that, in another implementation, the speech processing system of the present application may include only the terminal 20 and no server 10, with the storage and model-training functions of the server 10 integrated into the terminal 20. In that case the terminal 20 trains the initially constructed universal recognition model in advance with a plurality of second voice samples carrying text labels, and then trains the universal recognition model with a plurality of first voice samples of the target user carrying text labels to obtain the voice recognition model corresponding to the target user. The terminal 20 may then obtain received or collected target voice data, obtain the target user to which the target voice data belongs, and perform voice recognition on the target voice data with the voice recognition model corresponding to the target user to obtain the corresponding target text data;
or, the speech processing system may include only the server 10, with the terminal-side function of performing speech recognition with the speech recognition model of the target user integrated into the server 10. In that case the server 10 trains the initially constructed universal recognition model in advance with a plurality of second voice samples carrying text labels, and trains the universal recognition model with a plurality of first voice samples of the target user carrying text labels to obtain the voice recognition model corresponding to the target user. The server 10 may then obtain the target voice data collected or received on the terminal 20, obtain the target user to which the target voice data belongs, perform voice recognition on the target voice data with the voice recognition model corresponding to the target user to obtain the corresponding target text data, and return the target text data to the terminal 20;
alternatively, the speech processing system may include both the terminal 20 and a server, the difference from the foregoing being that the model-training function of the server 10 is integrated into the terminal 20 and the terminal 20 performs voice recognition on the target voice data with the voice recognition model corresponding to the target user. On this basis, the terminal 20 trains the initially constructed universal recognition model in advance with a plurality of second voice samples carrying text labels, trains the universal recognition model with a plurality of first voice samples of the target user carrying text labels to obtain the voice recognition model corresponding to the target user, and stores the trained voice recognition models for subsequent invocation. After collecting or receiving target voice data, the terminal 20 obtains the target user to which the target voice data belongs and calls the trained voice recognition model corresponding to that target user to perform voice recognition on the target voice data, obtaining the corresponding target text data.
Taking the interaction between the user a and the user B in fig. 2 as an example, after the mobile phone terminal acquires the target voice data of the mobile phone user a through the chat application, the mobile phone terminal invokes the pre-trained voice recognition model of the user a on the mobile phone to perform voice recognition on the target voice data to obtain the target text data of the user a, and if the mobile phone terminal receives the target voice data sent by the friend user B through the chat application, the mobile phone terminal invokes the pre-trained voice recognition model of the user B on the mobile phone to perform voice recognition on the target voice data to obtain the target text data of the user B;
for another example, after the mobile phone terminal collects the target voice data of the mobile phone user a through the chat application, the mobile phone terminal invokes the voice recognition model of the user a stored in the mobile phone and pre-trained by the server, and performs voice recognition on the target voice data of the user a by using the voice recognition model of the user a to obtain the target text data of the user a, and after the mobile phone terminal receives the target voice data sent by the friend user B through the chat application, the mobile phone terminal invokes the voice recognition model of the user B stored in the mobile phone and pre-trained by the server, and performs voice recognition on the target voice data of the user B by using the voice recognition model of the user B to obtain the target text data of the user B, as shown in fig. 3;
for another example, after the mobile phone terminal collects the target voice data of the mobile phone user a through the chat application, the mobile phone terminal requests the server to perform voice recognition on the target voice data of the user a, the server finds the pre-trained voice recognition model of the user a after receiving the request, and performs voice recognition on the target voice data of the user a through the voice recognition model of the user a to obtain the target text data of the user a and transmits the target text data of the user a to the mobile phone terminal, and if the mobile phone terminal receives the target voice data sent by the user B through the chat application, the server is requested to perform voice recognition on the target voice data of the user B, and the server finds the pre-trained voice recognition model of the user B after receiving the request, and performs voice recognition on the target voice data of the user B through the voice recognition model of the user B, to obtain target text data of the user B and transmit the target text data of the user B to the mobile phone terminal, as shown in fig. 4.
In order to implement the corresponding voice processing function on the terminal or the server, a program for implementing the corresponding function needs to be stored in the memory of the terminal or the server. In order to facilitate understanding of the hardware configuration of the terminal or the server, the following description will be given by taking the terminal as an example. As shown in fig. 5, which is a schematic structural diagram of a terminal of the present application, the terminal 20 in this embodiment may include: a processor 201, a memory 202, a communication interface 203, an input unit 204, a display 205 and a communication bus 206.
The processor 201, the memory 202, the communication interface 203, the input unit 204, and the display 205 all communicate with each other through the communication bus 206.
In this embodiment, the processor 201 may be a Central Processing Unit (CPU), an application-specific integrated circuit, a digital signal processor, a field-programmable gate array, or another programmable logic device.
The processor 201 may call a program stored in the memory 202. Specifically, the processor 201 may perform operations performed by the terminal in the following embodiments of the voice processing method.
The memory 202 is used for storing one or more programs, which may include program codes including computer operation instructions, and in this embodiment, the memory stores at least the programs for implementing the following functions:
obtaining target voice data to be identified;
obtaining a target user to which the target voice data belongs;
performing voice recognition on the target voice data by using a voice recognition model corresponding to the target user to obtain target text data corresponding to the target voice data;
the voice recognition model is obtained by training a universal recognition model by utilizing a plurality of first voice samples with text labels of the target user, and the universal recognition model is obtained by training an initially constructed universal recognition model by utilizing a plurality of second voice samples with text labels.
In one possible implementation, the memory 202 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as model training, etc.), and the like; the storage data area may store data created during use of the computer, such as speech samples, speech data, trained speech recognition models and generic recognition models, and so forth.
Further, the memory 202 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device or another non-volatile solid-state storage device.
The communication interface 203 may be an interface of a communication module, such as an interface of a GSM module.
Of course, the structure of the terminal shown in fig. 5 is not limited to the terminal in the embodiment of the present application, and the terminal may include more or less components than those shown in fig. 5 or some components in combination in practical applications. It is to be understood that the hardware composition of the server may refer to that of the terminal in fig. 5.
It should be noted that the server 10 in this embodiment may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
That is to say, the server 10 in the present application may be a cloud server, and the technical solution of the present application may be implemented with cloud technology. Cloud technology refers to a hosting technology that unifies resources such as hardware, software and networks in a wide area network or a local area network to realize the calculation, storage, processing and sharing of data. It is the general name of the network technology, information technology, integration technology, management platform technology, application technology and so on applied in the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support: the background services of technical network systems, such as video websites, picture websites and other web portals, require large amounts of computing and storage resources. With the development of the internet industry, every item may come to have its own identification mark that needs to be transmitted to a background system for logical processing; data of different levels are processed separately, and all kinds of industry data require strong system background support, which can only be realized through cloud computing.
Among them, cloud computing (cloud computing) is a computing mode that distributes computing tasks over a resource pool formed by a large number of computers, so that various application systems can acquire computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user as being infinitely expandable and available at any time, available on demand, expandable at any time, and paid for on-demand.
As a basic capability provider of cloud computing, a cloud computing resource pool (cloud platform for short) generally called as an Infrastructure as a Service (IaaS) platform is established, and multiple types of virtual resources are deployed in the resource pool and are selectively used by external clients. The cloud computing resource pool mainly comprises: computing devices (which are virtualized machines, including operating systems), storage devices, and network devices.
According to the logic function division, a PaaS (Platform as a Service) layer can be deployed on an IaaS (Infrastructure as a Service) layer, a SaaS (Software as a Service) layer is deployed on the PaaS layer, and the SaaS can be directly deployed on the IaaS. PaaS is a platform on which software runs, such as a database, a web container, etc. SaaS is a variety of business software, such as web portal, sms, and mass texting. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.
The server 10 mentioned in the foregoing embodiment is a server capable of performing cloud computing on a cloud platform, and can be used to implement speech processing and model training in the present application.
With the above generality in mind, referring to fig. 6, a flowchart of an embodiment of a speech processing method according to the present application is shown, where the method in this embodiment may include:
s601: and obtaining target voice data to be identified.
The target voice data to be recognized may be voice data acquired by a voice acquisition component such as a microphone on the terminal, or may be voice data sent by other terminals received by a data transmission component such as WiFi or a mobile communication network on the terminal.
In this embodiment, the target voice data may be voice data of the user a collected by a microphone on the terminal, or may be voice data sent by the user B received by the chat application on the terminal.
S602: and obtaining a target user to which the target voice data belongs.
The target user refers to a pronunciation user of the target voice data, such as user a or user B.
In a specific implementation, the target user to which the target voice data belongs may be obtained in this embodiment through a model constructed based on an algorithm such as voice recognition, as follows:
in one implementation, S602 may obtain the target user to which the target voice data belongs by:
First, each preset first voice data in a first voice set is obtained. The first voice set may be a voice set stored on the local terminal, the local terminal being the device that needs to recognize the target voice data in this embodiment. Therefore, in this embodiment the first voice set stored on the terminal may be obtained first, and then each preset first voice data in that set is obtained. The first voice data may be voice data preset for users that already have a trained voice recognition model, and each first voice data corresponds to the user it belongs to. For example, a first voice set is stored in advance on the mobile phone; it contains first voice data pre-recorded by user A (which may, of course, also be extracted from user A's historical pronunciation data) and first voice data pre-recorded by user B, corresponding to user A and user B respectively;
it should be noted that the first voice data in the first voice set may be acquired by collecting historical voice data on the terminal to obtain the first voice data of one or more users.
And then, carrying out voice processing on the target voice data and the first voice data by using the user classification model to obtain a target user to which the target voice data belongs, wherein the obtained first voice data corresponding to the target user and the target voice data meet preset similar conditions. That is to say, in this embodiment, the user classification model is used to perform speech processing on the target user speech data and each first speech data, so as to determine the first speech data that satisfies the similar condition with the target user speech data, and the user to which the first speech data that satisfies the similar condition with the target user speech data belongs is the target user.
The user classification model is obtained by training an initially constructed user classification model, built on a classification algorithm, with a plurality of voice sample groups carrying user classification labels. As a training sample of the user classification model, one voice sample group contains two third voice samples, and the user classification label of the group represents whether the two third voice samples belong to the same user. Therefore, when the user classification model is trained in this embodiment, the two third voice samples of a group are used as the input sample and the group's user classification label as the output sample. After the input sample is fed into the user classification model, the model's classification test result for that input is obtained and compared with the user classification label; the model parameters are then adjusted with the difference between the classification test result and the user classification label so that the loss function value of the model decreases. As training repeats, the loss function value gradually decreases until it no longer changes, at which point training of the user classification model is finished.
It should be noted that the user classification labels in the output samples may be characterized by yes/no symbols, while the classification test result of the model for an input sample may be a probability value between 0 and 1, i.e. the probability that the two third voice samples belong to the same user. If the probability value of the classification test result is greater than or equal to a probability threshold, the result is taken to mean that the two third voice samples belong to the same user. The classification test result is then compared with the user classification label; if they differ, the model parameters of the user classification model are adjusted according to the probability value in the classification test result, until the classification test result matches the user classification label and the loss function of the user classification model converges, at which point training of the user classification model is complete.
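Purely for illustration, a minimal training sketch of such a pair-wise user classification model is given below in Python/PyTorch; the function name, data-loader layout, learning rate and epoch count are illustrative assumptions and are not specified by the present application.

import torch
import torch.nn as nn

def train_user_classifier(model: nn.Module,
                          pair_loader,          # yields (voice_a, voice_b, same_user_label)
                          epochs: int = 10,
                          lr: float = 1e-3) -> nn.Module:
    """Train a pair-wise classifier that predicts whether the two voice
    samples of a 'voice sample group' belong to the same user."""
    criterion = nn.BCELoss()                    # classification result is a probability in [0, 1]
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for voice_a, voice_b, label in pair_loader:
            prob_same_user = model(voice_a, voice_b)          # forward pass on the sample pair
            loss = criterion(prob_same_user, label.float())   # compare with the user classification label
            optimizer.zero_grad()
            loss.backward()                                    # adjust parameters to reduce the loss
            optimizer.step()
    return model

The loop simply repeats the compare-and-adjust step described above until the loss stops decreasing in practice.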
Therefore, based on the trained user classification model, this embodiment performs speech processing on the target speech data together with each piece of first speech data in turn, so as to determine the user to which the first speech data meeting the preset similar condition with the target speech data belongs, i.e. the target user to which the target speech data belongs. For example, after a piece of voice, i.e. the target voice data, is obtained on the mobile phone, the first voice data of user A and of user B in the first voice set stored on the phone are obtained. The user classification model first processes user A's first voice data together with the target voice data to obtain the probability that they belong to the same user (or a decision on whether they do), then processes user B's first voice data together with the target voice data in the same way, and so on until the first voice data of every user in the first voice set has been classified against the target voice data. From the results produced by the user classification model for the first voice data of user A, user B and so on, whether classification decisions or same-user probability values, the user whose first voice data satisfies the similar condition with the target voice data is determined to be the target user; for example, if user A's first voice data and the target voice data satisfy the similar condition, user A is determined to be the target user to which the target voice data belongs.
Specifically, the preset similar condition between the first voice data corresponding to the target user and the target voice data may be that their similarity is greater than or equal to a preset similarity threshold. The similarity can be characterized by the probability value produced when the user classification model processes the first voice data together with the target voice data. Thus, when that probability value is greater than or equal to the probability threshold, the first voice data and the target voice data satisfy the preset similar condition, i.e. they are determined to belong to the same user, namely the target user;
for example, classifying the first voice data of each user in the first voice set and the target voice data on the mobile phone, and then determining that the probability that the first voice data of the user a and the target voice data belong to the same user is greater than a probability threshold according to a result obtained by performing voice processing on the first voice data of the users such as the user a, the user B and the like and the target voice data by using a user classification model, that is, the similarity between the first voice data of the user a and the target voice data is greater than a similarity threshold, and at this time, determining that the user a is the target user to which the target voice data belongs;
optionally, the preset similar condition may instead be that the similarity between the first voice data corresponding to the target user and the target voice data is the largest. It should be noted that no first voice data in the first voice set may exceed the similarity threshold, or the first voice data corresponding to the target user may exceed it; in either case the first voice data with the maximum similarity to the target voice data can be selected as the voice data satisfying the similar condition. The user to which that maximum-similarity first voice data belongs, i.e. the user whose first voice data has the largest same-user probability with the target voice data, is then determined to be the target user to which the target voice data belongs;
for example, the first voice data of each user in the first voice set and the target voice data are classified on the mobile phone, and then according to the results obtained by performing voice processing on the first voice data of the users, such as the user a and the user B, and the target voice data by using the user classification model, the probability that the first voice data of the user a and the target voice data belong to the same user is determined to be the maximum, that is, the similarity between the first voice data of the user a and the target voice data is the maximum, and at this time, the user a is determined to be the target user to which the target voice data belongs.
In a specific implementation, the user classification model at least includes a convolutional neural network layer, a fully-connected layer, and a classification layer, as shown in fig. 7:
the convolutional neural network layer, also called the convolutional layer, may be constructed based on a convolutional neural network (CNN) and is mainly used to extract voice features from the target voice data and the first voice data input to the user classification model, yielding the first voice feature corresponding to the target voice data and the second voice feature corresponding to the first voice data;
based on the output features of the convolutional layer, the fully-connected layer is mainly used to perform feature interaction processing on the first voice feature and the second voice feature to obtain a feature interaction result, from which the classification layer then generates the classification result, which represents whether the target voice data and the first voice data belong to the same user. For example, the classification result may be characterized by a probability value, namely the probability that the target voice data and the first voice data belong to the same user.
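For concreteness, a minimal PyTorch sketch of this three-part structure is shown below; the input format (spectrogram-like tensors), feature dimensions and kernel sizes are illustrative assumptions rather than values taken from this application.

import torch
import torch.nn as nn

class UserClassifier(nn.Module):
    """Pair-wise classifier: a convolutional layer extracts voice features,
    a fully-connected layer performs feature interaction, and a
    classification layer outputs the same-user probability."""
    def __init__(self, in_channels: int = 1, feat_dim: int = 128):
        super().__init__()
        # convolutional neural network layer: shared feature extractor
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
            nn.Linear(16 * 8 * 8, feat_dim), nn.ReLU(),
        )
        # fully-connected layer: feature interaction on the concatenated pair
        self.interaction = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        # classification layer: probability that both inputs belong to the same user
        self.classifier = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, voice_a: torch.Tensor, voice_b: torch.Tensor) -> torch.Tensor:
        feat_a = self.conv(voice_a)              # first voice feature
        feat_b = self.conv(voice_b)              # second voice feature
        interaction = self.interaction(torch.cat([feat_a, feat_b], dim=-1))
        return self.classifier(interaction).squeeze(-1)

Sharing one convolutional extractor for both inputs mirrors the description above, in which the same convolutional layer produces the first and second voice features.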
It should be noted that, when obtaining the target user to which the target speech data belongs from the first speech data in the first speech set, it may happen that none of the first speech data satisfies the similar condition with the target speech data, for example when the similarity between every first speech data and the target speech data is smaller than the similarity threshold, i.e. the probability of belonging to the same user is smaller than the probability threshold. Therefore, in order to still obtain the target user in this embodiment, the following may be done when none of the first speech data in the first speech set satisfies the similar condition with the target speech data:
First, each second voice data in a second voice set stored on a server is obtained. The second voice set may be a voice set stored on a server capable of data transmission with the terminal; for example, the local terminal is the device that needs to recognize the target voice data in this embodiment, and the server is a cloud server. Thus, in this embodiment, when the probability that any first voice data in the terminal's first voice set belongs to the same user as the target voice data is not greater than the probability threshold, each second voice data may be obtained from the second voice set stored on the server. The second voice data may be voice data preset for users that already have a trained voice recognition model, and each second voice data corresponds to the user it belongs to. For example, a first voice set containing first voice data pre-recorded by user A (or extracted from user A's historical pronunciation data) and by user B, corresponding to user A and user B respectively, is pre-stored on the mobile phone, while a second voice set containing second voice data pre-recorded by user C and by user D, corresponding to user C and user D respectively, is pre-stored on the server;
and then, carrying out voice processing on the target voice data and the second voice data by using the user classification model in the previous text, so as to obtain a target user to which the target voice data belongs, wherein the second voice data corresponding to the obtained target user and the target voice data meet preset similar conditions. That is to say, in this embodiment, the user classification model is used to perform speech processing on the target user speech data and each piece of second speech data, so as to determine second speech data that satisfies a similar condition with the target user speech data, for example, the similarity between the target user speech data is greater than or equal to the similarity threshold or the second speech data with the largest similarity, and the user to which the second speech data that satisfies the similar condition with the target user speech data belongs is the target user.
For example, after a piece of voice, i.e. the target voice data, is obtained on the mobile phone, suppose the similarity between the target voice data and each of the first voice data of user A and of user B stored on the phone is smaller than the similarity threshold, i.e. the probabilities that they belong to the same user as the target voice data are all smaller than the probability threshold, so none of the stored first voice data satisfies the similar condition with the target voice data. To ensure reliable voice recognition, this embodiment then obtains the second voice data of user C and of user D stored on the cloud server corresponding to the chat application on the phone. The user classification model processes user C's second voice data together with the target voice data to obtain the probability that they belong to the same user (or a decision on whether they do), then does the same for user D's second voice data, and so on until the second voice data of every user in the second voice set has been classified against the target voice data. From the results produced by the user classification model for the second voice data of user C, user D and so on, whether classification decisions or same-user probability values, the user whose second voice data satisfies the similar condition with the target voice data is determined to be the target user. For example, if user C's second voice data and the target voice data satisfy the similar condition, e.g. their similarity is greater than or equal to the similarity threshold or is the largest (in other words, the same-user probability value is greater than or equal to the threshold or is the largest), user C is determined to be the target user to which the target voice data belongs.
Further, in this embodiment, when none of the first voice data in the first voice set satisfies the similar condition with the target voice data, the target user to which the target voice data belongs may be treated as a new user on the current terminal. If some second voice data in the second voice set satisfies the similar condition with the target voice data, that second voice data is stored into the first voice set as first voice data, so that the next time voice-to-text processing is needed, voice data satisfying the similar condition with the new target voice data can be found in the first voice set on the terminal, without having to search the second voice set on the server with the user classification model.
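A sketch of this terminal-then-server fallback, including caching the matched second voice data into the terminal's first voice set, might look as follows; all interfaces and names here are assumptions for illustration.

from typing import Optional

import torch
import torch.nn as nn

def _best_match(target_voice: torch.Tensor, voice_set: dict,
                classifier: nn.Module, threshold: float) -> Optional[str]:
    """Return the user whose stored voice data satisfies the similar condition, else None."""
    best_user, best_prob = None, 0.0
    with torch.no_grad():
        for user_id, voice in voice_set.items():
            prob = classifier(target_voice, voice).item()   # same-user probability
            if prob > best_prob:
                best_user, best_prob = user_id, prob
    return best_user if best_prob >= threshold else None

def identify_with_fallback(target_voice: torch.Tensor,
                           local_first_set: dict,       # terminal side: user id -> first voice data
                           server_second_set: dict,     # server side:   user id -> second voice data
                           classifier: nn.Module,
                           threshold: float = 0.5) -> Optional[str]:
    """Try the terminal's first voice set; if nothing satisfies the similar
    condition, fall back to the second voice set stored on the server."""
    classifier.eval()
    user = _best_match(target_voice, local_first_set, classifier, threshold)
    if user is not None:
        return user
    # no local match: treat the speaker as a new user on this terminal and
    # look for a matching second voice data item on the server
    user = _best_match(target_voice, server_second_set, classifier, threshold)
    if user is not None:
        # cache the matched voice data locally so the next lookup stays on the terminal
        local_first_set[user] = server_second_set[user]
    return user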
In another implementation, S602 may also obtain the target user to which the target voice data belongs by:
firstly, performing voice recognition on target voice data by using a user recognition model to obtain a target user to which the target voice data belongs, wherein the user recognition model is obtained by training an initially constructed user recognition model by using a plurality of fourth voice samples with user labels.
It should be noted that the user recognition model may be constructed based on a recognition algorithm, and then after training of the voice sample, the voice feature of the target voice data may be recognized, so as to obtain the target user to which the target voice data belongs.
Specifically, in this embodiment, the fourth voice samples may be obtained by collecting and sampling voice data available on the network, the server or the terminal, and each fourth voice sample carries a user label representing the user's pronunciation type or identity. For example, the user label may be represented by a sequence of 0s and 1s: the label [0, 0, 0, 0, 0, 0, 1] indicates that the fourth voice sample belongs to user B among [Sichuan-accented user, northeast-accented user, Cantonese-accented user, English-speaking user, American-English-speaking user, user A, user B]. On this basis, each fourth voice sample is used as an input sample of the user recognition model and its user label as the output sample. After a fourth voice sample is input, the user recognition model performs voice recognition on it to obtain a recognition test result representing the probability that the sample belongs to each user; the user with the maximum probability in the recognition test result is compared with the user label of the output sample, and the model parameters of the user recognition model are adjusted according to the comparison result so that the loss function decreases. As more fourth voice samples are used for training, the loss function of the user recognition model gradually decreases and stabilizes, at which point training of the user recognition model is complete.
Based on this, in this embodiment, after the target voice data is obtained on the terminal, the target voice data may be input into the user identification model, and the user identification model may output a user identification result, where the user identification result represents a probability value that the target voice data belongs to each user, and a user corresponding to the maximum probability value is a target user to which the target voice data belongs.
For example, after a piece of speech, i.e. the target speech data, is obtained on the mobile phone, the user recognition model on the phone performs speech recognition on it to obtain the probability that the target speech data belongs to each user in [a Sichuan-accented user, a northeast-accented user, a Cantonese-accented user, an English-speaking user, an American-English-speaking user, and user B]; the user with the highest probability value is then determined to be the target user to which the target speech data belongs, for example the Cantonese-accented user if that probability is highest.
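A minimal sketch of such a user recognition model treats user identification as multi-class classification over a fixed list of candidate users or pronunciation types; the encoder structure, dimensions and the seven-class example below are illustrative assumptions.

import torch
import torch.nn as nn

class UserRecognizer(nn.Module):
    """Multi-class model: maps one piece of voice data to a probability
    distribution over known users (or pronunciation types)."""
    def __init__(self, feat_dim: int = 128, num_users: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(),
            nn.Linear(16 * 8 * 8, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim, num_users)   # one logit per candidate user

    def forward(self, voice: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(voice))        # logits; softmax gives per-user probabilities

# usage sketch: the user with the highest probability is taken as the target user
# model = UserRecognizer()
# probs = torch.softmax(model(target_voice), dim=-1)
# target_user_index = probs.argmax(dim=-1)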
S603: and performing voice recognition on the target voice data by using the voice recognition model corresponding to the target user to obtain target text data corresponding to the target voice data.
The target text data comprises at least one text sentence, and each text sentence comprises at least one word, such as "what did you eat at noon" or "yesterday's hot pot was really delicious".
Specifically, the voice recognition model in this embodiment is obtained by training a general recognition model using a plurality of first voice samples with text labels of the target user, and the general recognition model is obtained by training an initially constructed general recognition model using a plurality of second voice samples with text labels.
The initially constructed general recognition model may be a deep learning model constructed based on a Transformer mechanism. The second voice samples used for the initial training contain voice samples with text labels from a plurality of users and do not distinguish between users. Specifically, the second voice samples may be obtained by collecting and sampling voice data and the corresponding text data from a network, a server, or a terminal.
Specifically, when the initially constructed general recognition model is trained with the second voice samples, each second voice sample is used as an input sample and its text label as the output sample. After the general recognition model performs voice recognition on the second voice sample to obtain test text data, the text in the test text data is compared with the text in the text label, and the model parameters of the general recognition model are adjusted according to the comparison result so that the loss function of the general recognition model decreases. As more second voice samples are used to train the general recognition model, its loss function gradually decreases and tends to be stable, at which point the training of the general recognition model is completed.
Therefore, the trained general recognition model in this embodiment can perform voice recognition on voice data to obtain the corresponding text data, but it does not distinguish users: the voice data of different users all receive the same text conversion treatment.
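To make the Transformer-based speech-to-text training concrete, here is a minimal PyTorch sketch that encodes mel-spectrogram frames in time order with a Transformer encoder and trains against character labels with a CTC loss. The CTC objective, feature dimensions, and variable names are illustrative assumptions: the patent only states that a Transformer mechanism encodes the speech in time sequence and outputs the converted text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralRecognitionModel(nn.Module):
    """Shared (user-agnostic) speech-to-text model: Transformer encoder over mel frames."""
    def __init__(self, n_mels: int = 80, d_model: int = 256, vocab_size: int = 4000):
        super().__init__()
        self.input_proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.output_proj = nn.Linear(d_model, vocab_size)   # vocab includes a CTC blank at index 0

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels) -> log-probabilities (batch, time, vocab)
        hidden = self.encoder(self.input_proj(mel))
        return F.log_softmax(self.output_proj(hidden), dim=-1)

model = GeneralRecognitionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
ctc = nn.CTCLoss(blank=0)

# One training step on a toy batch of "second voice samples" with text labels.
mel = torch.randn(2, 300, 80)                         # 2 utterances, 300 frames each
targets = torch.randint(1, 4000, (2, 20))             # character ids of the text labels
input_lengths = torch.tensor([300, 300])
target_lengths = torch.tensor([20, 18])
log_probs = model(mel).transpose(0, 1)                # CTC expects (time, batch, vocab)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
optimizer.step()
```

The per-user fine-tuning described below reuses exactly this structure and objective; only the data it continues training on changes.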
Therefore, in the embodiment, after the target user to which the target voice data to be recognized belongs is obtained, the voice recognition model corresponding to the target user can be called to perform voice recognition on the target voice data, so that text conversion is performed by using the personalized voice recognition model, and the obtained target text data is more accurate than the text data obtained by performing text conversion on the target voice data by using the general recognition model.
Specifically, a plurality of first voice samples with text labels can be acquired for each user from the voice data and the corresponding text data of that user appearing in a network, a terminal, or a server. For each user, the user's voice data is taken as the first voice samples and the corresponding text data as the text labels, and these first voice samples are used to further train the general recognition model, thereby obtaining a voice recognition model corresponding to each user. The voice recognition model of each user is thus trained on the personalized pronunciation characteristics of the corresponding user. Therefore, when the target voice data of the target user needs to be converted into text, only the voice recognition model corresponding to the target user needs to be called to perform voice recognition on the target voice data. Because text conversion is performed with this personalized voice recognition model, the accuracy of the obtained target text data is higher than that of text data obtained by converting the target voice data with the general recognition model.
For example, after a piece of voice, that is, the target voice data, is collected on the mobile phone, the first voice data of users such as user A and user B in the first voice set on the mobile phone is classified against the target voice data. If, according to the results of the user classification model processing the first voice data of user A and user B together with the target voice data, the probability that the first voice data of user A and the target voice data belong to the same user is greater than the probability threshold, that is, the similarity between the first voice data of user A and the target voice data is greater than the similarity threshold, then user A is determined to be the target user to which the target voice data belongs. The voice recognition model corresponding to user A is then called to perform voice recognition on the target voice data and obtain the target text data. Since the called voice recognition model of user A is a personalized recognition model obtained by training the initially trained general recognition model with a plurality of first voice samples with text labels of user A, the accuracy of the finally obtained target text data is higher than that of text data obtained by converting the target voice data with the general recognition model;
for another example, after a piece of voice, i.e., the target voice data, is obtained on the mobile phone, if the similarity between the target voice data and each of the first voice data of user A and user B stored on the mobile phone is smaller than the similarity threshold, that is, the probabilities that the stored first voice data of user A and of user B belong to the same user as the target voice data are both smaller than the probability threshold, it is determined that the first voice data stored on the mobile phone does not satisfy the similar condition with the target voice data. In order to ensure the reliability of voice recognition, in this embodiment the second voice data of user C and of user D stored on the cloud server corresponding to the chat application on the mobile phone is obtained. The user classification model is then used to process the second voice data of user C together with the target voice data to obtain the probability that they belong to the same user, or to determine directly whether they belong to the same user; the same is done for the second voice data of user D, and so on, until the second voice data of every user in the second voice set has been classified against the target voice data. From the classification results or probability values produced by the user classification model for the second voice data of users such as user C and user D, the user whose second voice data satisfies the similar condition with the target voice data is determined as the target user to which the target voice data belongs. For example, if the second voice data of user C satisfies the similar condition, e.g., its similarity with the target voice data is greater than or equal to the similarity threshold, or the probability that it belongs to the same user as the target voice data is greater than or equal to the probability threshold, user C is determined to be the target user. Based on this, the voice recognition model corresponding to user C is called to perform voice recognition on the target voice data, the target text data of user C is obtained, and the target text data can be output directly on the mobile phone for the mobile phone user to view. Since the called voice recognition model of user C is a personalized recognition model obtained by training the initially trained general recognition model with a plurality of voice samples with text labels of user C, the accuracy of the finally obtained target text data is higher than that of text data obtained by converting the target voice data with the general recognition model.
For another example, after a piece of voice, i.e., the target voice data, is obtained on the mobile phone, voice recognition is performed on the target voice data by using the user recognition model on the mobile phone to obtain the probability that the target voice data belongs to each user in [Sichuan pronunciation user, northeast pronunciation user, Cantonese pronunciation user, English pronunciation user, American pronunciation user, user A, user B]. The user with the highest probability value is determined as the target user to which the target voice data belongs; for example, if the Cantonese pronunciation user has the highest probability value, the Cantonese pronunciation user is taken as the target user. Based on this, the voice recognition model corresponding to the Cantonese pronunciation user performs voice recognition on the target voice data to obtain the target text data corresponding to the target voice data. Since the called voice recognition model corresponding to the Cantonese pronunciation user is a personalized recognition model obtained by training the initially trained general recognition model with a plurality of voice samples with text labels of Cantonese pronunciation users, the accuracy of the finally obtained target text data is higher than that of text data obtained by converting the target voice data with the general recognition model.
It should be noted that, in the case where none of the voice data in the first voice set and the second voice set satisfies the similar condition with the target voice data, or the user recognition model cannot recognize the target user, in this embodiment the target user to which the target voice data belongs may be treated as a new user on the current terminal. At this time, the trained general recognition model may be used to perform voice recognition on the target voice data to obtain the corresponding target text data, thereby ensuring the reliability of the voice-to-text processing.
Meanwhile, in this embodiment, the voice data of the new user and the corresponding text labels may also be collected; for example, the new user is invited to record voice data and write down the corresponding text, so as to obtain first voice samples of the new user. These first voice samples are then used to train the general recognition model to obtain a voice recognition model corresponding to the new user, and the new user's voice data is stored as first voice data in the first voice set. In this way, the next time voice-to-text processing is required, voice data satisfying the similar condition with the new target voice data can be found in the first voice set on the terminal, without having to obtain, through the user classification model, voice data satisfying the similar condition from the second voice set on the server.
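As a rough illustration of this fallback-and-enrollment path, here is a sketch under stated assumptions: `general_model` stands for the trained general recognition model, `fine_tune` is a hypothetical helper that produces a personalized model from (voice, text) supervision pairs, and `first_voice_set` is the per-user voice store on the terminal; none of these names come from the patent itself.

```python
from typing import Callable, Dict, List, Tuple

def handle_unknown_speaker(target_voice,
                           general_model: Callable,              # trained general recognition model
                           fine_tune: Callable,                  # hypothetical: (model, samples) -> personalized model
                           collect_supervision: Callable[[], List[Tuple]],  # invite the user to record voice + text
                           first_voice_set: Dict[str, object],   # first voice data stored on the terminal
                           user_models: Dict[str, Callable]) -> str:
    """No stored voice satisfies the similar condition and the user recognition model
    cannot identify the speaker: transcribe with the general model now, then enroll
    the speaker as a new user so future requests can be matched on the terminal."""
    target_text = general_model(target_voice)        # reliable fallback transcription

    first_voice_samples = collect_supervision()      # new user's voice data plus written text
    new_user_id = f"user_{len(first_voice_set) + 1}" # illustrative id scheme
    user_models[new_user_id] = fine_tune(general_model, first_voice_samples)
    first_voice_set[new_user_id] = target_voice      # store as first voice data in the first voice set
    return target_text
```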
As can be seen from the above solution, in this embodiment, after the target voice data to be recognized is obtained, the target user to which the target voice data belongs is obtained, and voice recognition is then performed on the target voice data by using the voice recognition model corresponding to the target user. The voice recognition model corresponding to the target user is obtained by training a general recognition model with a plurality of first voice samples with text labels of the target user, and the general recognition model is obtained by training an initially constructed general recognition model with a plurality of second voice samples with text labels. The voice recognition model corresponding to the target user therefore conforms better to the pronunciation characteristics of the target user than the general recognition model does, and the target text data obtained by performing voice recognition on the target voice data with the voice recognition model corresponding to the target user is more accurate than the text data obtained with the general recognition model. Therefore, in this embodiment, the trained general recognition model is personalized again with the voice samples of the target user to obtain a personalized voice recognition model for the target user, and this voice recognition model improves the accuracy of voice recognition on the target voice data of the target user.
For ease of understanding, the following describes an example of the present solution in a practical application:
Firstly, the scheme can be applied to various kinds of social software configured on the terminal, such as a chat application. The scheme includes the following implementation modules:
module one, a voice conversion text model, i.e. the general recognition model in the foregoing:
the module constructs a voice-to-text model based on all user data, and specifically, a second voice sample, namely voice data with text labels of all users, can be read in the published data set. Specifically, the input of the speech-to-text submodel of the module is the speech of a plurality of users, the model encodes the speech according to a time sequence through a Transformer mechanism, as shown in fig. 8, and the output is the converted text.
Module two: fine-tuning the model for different users to obtain per-user models, i.e., the voice recognition model corresponding to each user described above:
this module will customize the speech to text model for a single user. Specifically, the speech-to-text model obtained by the first module is finely adjusted on the individual training data of each user, so that a personalized model is obtained. Therefore, after model training in the first module, the voice-to-character models of all users provide initial values of model parameters for each personalized voice-to-character model, so that the personalized models can be trained well quickly under the supervision data of a small number of single users. This part of training requires collecting training data of speech and text of a single user for supervised training.
Specifically, in this embodiment, the general recognition model obtained by module one is fine-tuned on the supervision data of each user. The structure of the general recognition model is unchanged; its parameters are taken as the initial parameters of the personalized model, and training then continues with the supervision data of a single user, so that a personalized model corresponding to each single user is obtained, as shown in fig. 9 for the personalized models of user 1 and user 2. Assume that supervision data of K users is collected: for example, K individuals, who may be K chat users, are invited to speak, their voice data is recorded, and each user writes out the corresponding text, yielding supervision data that may cover different accents, regions, or dialects, i.e., the first voice samples. Based on this, K personalized voice recognition models are obtained after fine-tuning.
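A minimal PyTorch sketch of this fine-tuning step is shown below; the small learning rate, the epoch count, and the `ctc_step` callable (one training step of the module-one objective) are illustrative assumptions rather than values given in the patent.

```python
import copy
from typing import Callable, Iterable

import torch
import torch.nn as nn

def fine_tune_for_user(shared_model: nn.Module,
                       user_batches: Iterable,   # one user's (speech, text) supervision batches
                       ctc_step: Callable,       # runs one step of the module-one objective, returns the loss
                       epochs: int = 3) -> nn.Module:
    """Module two: keep the structure of the general recognition model, take its parameters
    as the initial parameters of the personalized model, and continue training on the
    supervision data of a single user."""
    personalized = copy.deepcopy(shared_model)     # same structure, shared initial parameters
    optimizer = torch.optim.Adam(personalized.parameters(), lr=1e-5)  # small LR: fine-tuning only
    personalized.train()
    for _ in range(epochs):
        for batch in user_batches:
            optimizer.zero_grad()
            loss = ctc_step(personalized, batch)
            loss.backward()
            optimizer.step()
    return personalized
```

Calling this once for each of the K collected users, with that user's first voice samples, yields the K personalized voice recognition models, all initialized from the single shared model of module one.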
Module three: the user determination model, i.e., the user classification model described above:
The user determination model of this module is used for judging whether two pieces of voice data belong to the same user.
In order to distinguish different users who use voice-to-text conversion on the same device, the present solution must be able to tell different users apart. Specifically, the user determination model receives voice data from two users, performs feature extraction on the voice data with a convolutional neural network, performs feature interaction on the extracted features through fully connected layers, and finally determines whether the two users are the same person, as shown in fig. 10. Positive examples are sample pairs in which the two voices come from the same person, and negative examples are sample pairs in which they come from different persons. After training on the positive and negative example samples, a user determination model capable of determining whether two pieces of voice data belong to the same user is obtained.
In the scheme, the user determination model gives the probability that the two voices belong to the same person (the last fully connected layer outputs a single numerical value); if the probability exceeds a probability threshold, the two voices are considered to come from the same person, and if it does not, they are considered to come from different persons.
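A minimal PyTorch sketch of such a user determination model follows: a shared convolutional feature extractor, fully connected interaction layers ending in a single value, and a sigmoid probability compared against a threshold. The exact layer sizes, the binary cross-entropy training objective, and the 0.5 threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UserDeterminationModel(nn.Module):
    """Judges whether two pieces of voice data belong to the same user."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        # Convolutional feature extraction, shared by both inputs.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Fully connected layers for feature interaction; the last layer outputs one value.
        self.interaction = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, voice_a: torch.Tensor, voice_b: torch.Tensor) -> torch.Tensor:
        fa = self.cnn(voice_a).squeeze(-1)             # (batch, 128)
        fb = self.cnn(voice_b).squeeze(-1)             # (batch, 128)
        score = self.interaction(torch.cat([fa, fb], dim=-1))
        return torch.sigmoid(score).squeeze(-1)        # probability of "same user"

model = UserDeterminationModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
bce = nn.BCELoss()

# One training step: positive pairs (same person) labeled 1, negative pairs labeled 0.
voice_a, voice_b = torch.randn(4, 80, 200), torch.randn(4, 80, 200)
labels = torch.tensor([1.0, 1.0, 0.0, 0.0])
loss = bce(model(voice_a, voice_b), labels)
loss.backward()
optimizer.step()

# Decision rule with an (assumed) probability threshold of 0.5.
same_user = model(voice_a[:1], voice_b[:1]) > 0.5
```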
Thus, 1 shared model (the general recognition model in module one), K personalized models (the voice recognition models in module two), and 1 discriminant model (the user determination model in module three) are obtained in the present application. In the specific voice-to-text application, the implementation scheme is as follows:
when a device receives input voice, the discriminant model of module three is called to determine whether the current input voice belongs to a person recorded under the device. If so, the personalized model corresponding to that person is used directly for the voice conversion; if not, the new user is recorded under the device and the flow turns to the next step;
module three is called again to find, among the K users collected in module two, the user most similar to the current user, and that user's model (which comes from module two) is used as the model of the current user, so that the voice of the current user is processed with the determined personalized model to obtain the corresponding text.
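Putting the three modules together, the per-utterance dispatch might look like the following sketch; the `same_user_prob` callable stands in for the module-three discriminator, and the dictionaries, threshold value, and enrollment step are assumptions made for illustration only.

```python
from typing import Callable, Dict

def transcribe(voice,                                           # features of the current input voice
               device_voices: Dict[str, object],                # user id -> reference voice recorded on this device
               collected_voices: Dict[str, object],             # the K users collected in module two
               user_models: Dict[str, Callable],                # user id -> personalized speech-to-text model
               same_user_prob: Callable[[object, object], float],  # module-three discriminator
               threshold: float = 0.5) -> str:
    # Step 1: is the speaker someone already recorded under this device?
    for user_id, reference in device_voices.items():
        if same_user_prob(voice, reference) >= threshold:
            return user_models[user_id](voice)          # use that person's personalized model directly

    # Step 2: otherwise record the new user under the device and fall back to the
    # most similar of the K users collected in module two, reusing that user's model.
    best_user = max(collected_voices,
                    key=lambda uid: same_user_prob(voice, collected_voices[uid]))
    device_voices["new_user"] = voice                    # enroll the new user's voice on the device
    return user_models[best_user](voice)
```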
In another aspect, the present application further provides a speech processing apparatus, as shown in fig. 11, which shows a schematic composition diagram of an embodiment of the speech processing apparatus of the present application, where the apparatus of the present embodiment may be applied to a terminal, and the apparatus may include:
a voice obtaining unit 1101 for obtaining target voice data to be recognized;
a user obtaining unit 1102 configured to obtain a target user to which the target speech data belongs;
a voice recognition unit 1103, configured to perform voice recognition on the target voice data by using a voice recognition model corresponding to the target user, so as to obtain target text data corresponding to the target voice data;
the voice recognition model is obtained by training a universal recognition model by utilizing a plurality of first voice samples with text labels of the target user, and the universal recognition model is obtained by training an initially constructed universal recognition model by utilizing a plurality of second voice samples with text labels.
In an implementation manner, the user obtaining unit 1102 is specifically configured to:
obtaining each preset first voice data in a first voice set, wherein each first voice data corresponds to one affiliated user respectively; performing voice processing on the target voice data and the first voice data by using a user classification model to obtain a target user to which the target voice data belongs, wherein the first voice data corresponding to the target user and the target voice data meet preset similar conditions;
the user classification model is obtained by training an initially constructed user classification model by utilizing a plurality of voice sample groups with user classification labels, wherein the voice sample groups comprise two third voice samples, and the user classification labels represent whether the two third voice samples in the voice sample groups belong to the same user.
Optionally, the first voice data corresponding to the target user and the target voice data meet a preset similar condition, including:
the similarity between the first voice data corresponding to the target user and the target voice data is greater than or equal to a preset similarity threshold;
and/or the similarity between the first voice data corresponding to the target user and the target voice data is maximum.
Optionally, the user classification model at least includes a convolutional neural network layer, a fully connected layer, and a classification layer;
the convolutional neural network layer is used for respectively extracting voice features of the target voice data and the first voice data to obtain a first voice feature corresponding to the target voice data and a second voice feature corresponding to the first voice data;
the fully connected layer is used for performing feature interaction processing on the first voice feature and the second voice feature to obtain a feature interaction result;
and the classification layer is used for generating a classification result according to the feature interaction result, and the classification result represents whether the target voice data and the first voice data belong to the same user.
Optionally, when obtaining each piece of first voice data in the first voice set, the user obtaining unit 1102 may proceed in the following manner:
acquiring a first voice set stored on a terminal, wherein the terminal is a device needing voice recognition on the target voice data; and obtaining each preset first voice data in the first voice set.
Optionally, in a case that each of the first speech data in the first speech set and the target speech data do not satisfy the similar condition, the user obtaining unit 1102 is further configured to:
acquiring each second voice data in a second voice set stored on a server, wherein each second voice data corresponds to a belonging user, the server is a device capable of carrying out data transmission with a terminal, and the terminal is a device needing voice recognition on the target voice data;
and performing voice processing on the target voice data and the second voice data by using the user classification model to obtain a target user to which the target voice data belongs, wherein the second voice data corresponding to the target user and the target voice data meet the similar condition.
In an implementation manner, the user obtaining unit 1102 is specifically configured to:
performing voice recognition on the target voice data by using a user recognition model to obtain a target user to which the target voice data belongs; and the user recognition model is obtained by training the initially constructed user recognition model by utilizing a plurality of fourth voice samples with user labels.
On the other hand, an embodiment of the present application further provides a storage medium, where computer-executable instructions are stored in the storage medium, and when the computer-executable instructions are loaded and executed by a processor, the voice processing method executed by the terminal side in any of the above embodiments is implemented.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims (10)

1. A method of speech processing, comprising:
obtaining target voice data to be identified;
obtaining a target user to which the target voice data belongs;
performing voice recognition on the target voice data by using a voice recognition model corresponding to the target user to obtain target text data corresponding to the target voice data;
the voice recognition model is obtained by training a universal recognition model by utilizing a plurality of first voice samples with text labels of the target user, and the universal recognition model is obtained by training an initially constructed universal recognition model by utilizing a plurality of second voice samples with text labels.
2. The method of claim 1, wherein obtaining the target user to which the target voice data belongs comprises:
obtaining each preset first voice data in a first voice set, wherein each first voice data corresponds to one affiliated user respectively;
performing voice processing on the target voice data and the first voice data by using a user classification model to obtain a target user to which the target voice data belongs, wherein the first voice data corresponding to the target user and the target voice data meet preset similar conditions;
the user classification model is obtained by training an initially constructed user classification model by utilizing a plurality of voice sample groups with user classification labels, wherein the voice sample groups comprise two third voice samples, and the user classification labels represent whether the two third voice samples in the voice sample groups belong to the same user.
3. The method according to claim 2, wherein the first voice data corresponding to the target user and the target voice data satisfy a preset similarity condition, which includes:
the similarity between the first voice data corresponding to the target user and the target voice data is greater than or equal to a preset similarity threshold;
and/or,
and the similarity between the first voice data corresponding to the target user and the target voice data is maximum.
4. The method of claim 2, wherein the user classification model comprises at least a convolutional neural network layer, a fully connected layer, and a classification layer;
the convolutional neural network layer is used for respectively extracting voice features of the target voice data and the first voice data to obtain a first voice feature corresponding to the target voice data and a second voice feature corresponding to the first voice data;
the fully connected layer is used for performing feature interaction processing on the first voice feature and the second voice feature to obtain a feature interaction result;
and the classification layer is used for generating a classification result according to the feature interaction result, and the classification result represents whether the target voice data and the first voice data belong to the same user.
5. The method of claim 2, wherein obtaining each first speech data in the first set of speech comprises:
acquiring a first voice set stored on a terminal, wherein the terminal is a device needing voice recognition on the target voice data;
and obtaining each preset first voice data in the first voice set.
6. The method according to claim 2, wherein in a case where each of the first speech data in the first speech set and the target speech data do not satisfy the similarity condition, the method further comprises:
acquiring each second voice data in a second voice set stored on a server, wherein each second voice data corresponds to a belonging user, the server is a device capable of carrying out data transmission with a terminal, and the terminal is a device needing voice recognition on the target voice data;
and performing voice processing on the target voice data and the second voice data by using the user classification model to obtain a target user to which the target voice data belongs, wherein the second voice data corresponding to the target user and the target voice data meet the similar condition.
7. The method of claim 1, wherein obtaining the target user to which the target voice data belongs comprises:
performing voice recognition on the target voice data by using a user recognition model to obtain a target user to which the target voice data belongs;
and the user recognition model is obtained by training the initially constructed user recognition model by utilizing a plurality of fourth voice samples with user labels.
8. A speech processing apparatus, comprising:
a voice obtaining unit for obtaining target voice data to be recognized;
a user obtaining unit, configured to obtain a target user to which the target speech data belongs;
the voice recognition unit is used for carrying out voice recognition on the target voice data by utilizing a voice recognition model corresponding to the target user so as to obtain target text data corresponding to the target voice data;
the voice recognition model is obtained by training a universal recognition model by utilizing a plurality of first voice samples with text labels of the target user, and the universal recognition model is obtained by training an initially constructed universal recognition model by utilizing a plurality of second voice samples with text labels.
9. A terminal, comprising:
a processor and a memory;
wherein the processor is configured to execute a program stored in the memory;
the memory is adapted to store a program for implementing at least a speech processing method according to any of the preceding claims 1-7.
10. A storage medium having stored thereon computer-executable instructions which, when loaded and executed by a processor, carry out a speech processing method according to any one of claims 1 to 7.
CN202010849414.7A 2020-08-21 2020-08-21 Voice processing method, device, terminal and storage medium Pending CN111951790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010849414.7A CN111951790A (en) 2020-08-21 2020-08-21 Voice processing method, device, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN111951790A true CN111951790A (en) 2020-11-17

Family

ID=73359529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010849414.7A Pending CN111951790A (en) 2020-08-21 2020-08-21 Voice processing method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN111951790A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09127975A (en) * 1995-10-30 1997-05-16 Ricoh Co Ltd Speaker recognition system and information control method
CN102915731A (en) * 2012-10-10 2013-02-06 百度在线网络技术(北京)有限公司 Method and device for recognizing personalized speeches
CN104167208A (en) * 2014-08-08 2014-11-26 中国科学院深圳先进技术研究院 Speaker recognition method and device
CN110111780A (en) * 2018-01-31 2019-08-09 阿里巴巴集团控股有限公司 Data processing method and server
CN109119071A (en) * 2018-09-26 2019-01-01 珠海格力电器股份有限公司 A kind of training method and device of speech recognition modeling

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700766A (en) * 2020-12-23 2021-04-23 北京猿力未来科技有限公司 Training method and device of voice recognition model and voice recognition method and device
CN112700766B (en) * 2020-12-23 2024-03-19 北京猿力未来科技有限公司 Training method and device of voice recognition model, and voice recognition method and device
CN112735381A (en) * 2020-12-29 2021-04-30 四川虹微技术有限公司 Model updating method and device

Similar Documents

Publication Publication Date Title
KR102633499B1 (en) Fully supervised speaker diarization
CN111428010B (en) Man-machine intelligent question-answering method and device
CN110209812B (en) Text classification method and device
CN111444382B (en) Audio processing method and device, computer equipment and storage medium
CN112530408A (en) Method, apparatus, electronic device, and medium for recognizing speech
US20180286429A1 (en) Intelligent truthfulness indicator association
US11004449B2 (en) Vocal utterance based item inventory actions
CN112053692B (en) Speech recognition processing method, device and storage medium
CN113094481A (en) Intention recognition method and device, electronic equipment and computer readable storage medium
CN111951790A (en) Voice processing method, device, terminal and storage medium
KR20230175258A (en) End-to-end speaker separation through iterative speaker embedding
CN116863935B (en) Speech recognition method, device, electronic equipment and computer readable medium
CN111639162A (en) Information interaction method and device, electronic equipment and storage medium
JP2020042131A (en) Information processor, information processing method and program
CN113254620B (en) Response method, device and equipment based on graph neural network and storage medium
CN114267345A (en) Model training method, voice processing method and device
CN112309384A (en) Voice recognition method, device, electronic equipment and medium
CN113393842A (en) Voice data processing method, device, equipment and medium
CN116884402A (en) Method and device for converting voice into text, electronic equipment and storage medium
CN112069786A (en) Text information processing method and device, electronic equipment and medium
CN103474063B (en) Voice identification system and method
CN113782014A (en) Voice recognition method and device
CN114970470A (en) Method and device for processing file information, electronic equipment and computer readable medium
US20220180865A1 (en) Runtime topic change analyses in spoken dialog contexts
CN111899718A (en) Method, apparatus, device and medium for recognizing synthesized speech

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination