CN115171660A - Voiceprint information processing method and device, electronic equipment and storage medium - Google Patents

Voiceprint information processing method and device, electronic equipment and storage medium

Info

Publication number
CN115171660A
Authority
CN
China
Prior art keywords
voiceprint
information
embedded code
embedded
terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210657997.2A
Other languages
Chinese (zh)
Inventor
朱绍明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210657997.2A priority Critical patent/CN115171660A/en
Publication of CN115171660A publication Critical patent/CN115171660A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a voiceprint information processing method and apparatus, an electronic device, and a storage medium, applicable to scenarios including but not limited to maps, navigation, autonomous driving, the Internet of Vehicles, and vehicle-road coordination. The method includes the following steps: parsing first audio information to obtain first voiceprint information of a first target object; processing the first voiceprint information through a voiceprint recognition service process to obtain a first voiceprint embedded code corresponding to the first voiceprint information; searching for a voiceprint embedded code set matching the first terminal identification code; calculating the similarity between the first voiceprint embedded code and each voiceprint embedded code in the set of voiceprint embedded codes; and updating the voiceprint embedded codes in the set according to the reliability parameter corresponding to the first voiceprint information. This improves the accuracy of voiceprint recognition using voiceprint embedded codes, giving the user a better experience.

Description

Voiceprint information processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to voiceprint information processing technologies, and in particular, to a voiceprint information processing method and apparatus, an electronic device, and a storage medium.
Background
The key technologies of Speech Technology are Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is a development direction of future human-computer interaction, and speech has become one of the most convenient interaction modes. However, because a user's state, physiological development stage, and spoken language vary, the voice of the same user often changes, so a terminal cannot recognize the user's voice instructions promptly and accurately, which degrades the user's experience with voice interaction.
Disclosure of Invention
In view of this, embodiments of the present invention provide a voiceprint information processing method and apparatus, an electronic device, and a storage medium, which ensure the accuracy of voiceprint embedded codes by updating them in the voiceprint embedded code set, improve the accuracy of voiceprint recognition using those codes, and spare the user the tedious steps of manually updating voiceprint embedded codes, thereby providing a better user experience.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a voiceprint information processing method, which comprises the following steps:
acquiring first audio information of a first target object through a first terminal;
analyzing the first audio information to obtain first voiceprint information of the first target object;
processing the first voiceprint information through a voiceprint identification service process to obtain a first voiceprint embedded code corresponding to the first voiceprint information;
searching a voiceprint embedded code set matched with the first terminal identification code according to the equipment identification code of the first terminal and the version identification code of the voiceprint recognition service process;
calculating the similarity between the first voiceprint embedded code and each voiceprint embedded code in the voiceprint embedded code set;
when the similarity is greater than or equal to a similarity threshold value, calculating a reliability parameter corresponding to the first voiceprint information;
and updating the voiceprint embedded codes in the voiceprint embedded code set according to the reliability parameters corresponding to the first voiceprint information.
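The matching step in the method above can be sketched as follows. The patent does not specify the similarity measure or threshold, so cosine similarity and the 0.8 threshold are illustrative assumptions, not the claimed implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_match(query, embedding_set, threshold=0.8):
    """Return (index, similarity) of the closest stored embedding
    whose similarity reaches the threshold, or None otherwise."""
    best = None
    for i, stored in enumerate(embedding_set):
        sim = cosine_similarity(query, stored)
        if sim >= threshold and (best is None or sim > best[1]):
            best = (i, sim)
    return best
```

If `best_match` returns `None`, no stored voiceprint embedded code matches and the update step is skipped; otherwise the matched entry becomes a candidate for the reliability-gated update described above.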
An embodiment of the present invention further provides a voiceprint information processing apparatus, where the apparatus includes:
the information transmission module is used for acquiring first audio information of a first target object through a first terminal;
the information processing module is used for analyzing the first audio information to obtain first voiceprint information of the first target object;
the information processing module is used for processing the first voiceprint information through a voiceprint recognition service process to obtain a first voiceprint embedded code corresponding to the first voiceprint information;
the information processing module is used for searching a voiceprint embedded code set matched with the first terminal identification code according to the equipment identification code of the first terminal and the version identification code of the voiceprint identification service process;
the information processing module is used for calculating the similarity between the first voiceprint embedded code and each voiceprint embedded code in the voiceprint embedded code set;
the information processing module is used for calculating a reliability parameter corresponding to the first voiceprint information when the similarity is greater than or equal to a similarity threshold;
and the information processing module is used for updating the voiceprint embedded codes in the voiceprint embedded code set according to the reliability parameters corresponding to the first voiceprint information.
In the above scheme,
the information processing module is used for configuring a first terminal identification code for the first terminal;
the information processing module is used for configuring a second target object identification code for a second target object through the voiceprint recognition service process when the audio information of the second target object is acquired through the first terminal, and establishing a mapping relation between the second target object identification code and the first terminal identification code;
the information processing module is used for acquiring at least 2 pieces of audio information of the second target object through the voiceprint recognition service process;
the information processing module is used for calculating second voiceprint embedded codes corresponding to the at least 2 pieces of audio information through the voiceprint recognition service process;
the information processing module is used for calculating the average value of second voiceprint embedded codes corresponding to the at least 2 pieces of audio information respectively;
and the information processing module is used for storing the average value of the second voiceprint embedded code in the voiceprint embedded code set and marking the average value through the first terminal identification code, the second target object identification code and the version identification code of the voiceprint identification service process.
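The enrollment steps above (computing the element-wise average of at least two embeddings of the same speaker and storing it as the reference) can be sketched as:

```python
def enroll_average_embedding(embeddings):
    """Element-wise mean of several voiceprint embeddings of the
    same speaker, stored as the reference embedding for that speaker."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]
```

Averaging over multiple utterances smooths out per-utterance variation, which is why the scheme requires at least two pieces of audio information before storing the second voiceprint embedded code.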
In the above scheme,
the information processing module is used for triggering a voice information recognition model through the voiceprint recognition service process;
the information processing module is used for extracting pinyin corresponding to each character in the first voiceprint information and intonation corresponding to each character in the first voiceprint information through the voice information recognition model according to the recognition environment of the target voice information;
the information processing module is used for determining a single character pronunciation feature vector of each character level in the first voiceprint information according to the pinyin corresponding to each character in the first voiceprint information and the tone corresponding to each character in the first voiceprint information;
the information processing module is used for combining and processing the single character pronunciation feature vector corresponding to each character in the first voiceprint information through a word and voice encoder network in the voice information recognition model to form a statement-level pronunciation feature vector;
and the information processing module is used for taking the pronunciation feature vector at the statement level as the first voiceprint embedded code.
In the above scheme,
the information processing module is used for performing channel conversion processing on the first audio information to form single-channel audio data;
the information processing module is used for carrying out short-time Fourier transform on the single-channel audio data based on a windowing function corresponding to the voice information recognition model to form a corresponding Mel frequency spectrogram;
the information processing module is used for determining a corresponding input triple sample based on the Mel frequency spectrogram and inputting the input triple sample into a voice information recognition model;
the information processing module is used for processing the input triple samples in a crossed manner through the convolution layer and the maximum pooling layer of the voice information recognition model to obtain the down-sampling results of different input triple samples;
and the information processing module is used for carrying out normalization processing on the downsampling results of the different input triple samples through a full connection layer of the voice information recognition model to obtain the first voiceprint embedded code.
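The final normalization step of this pipeline can be illustrated in isolation. The patent does not name the normalization applied after the fully connected layer; L2 (unit-length) normalization, a common choice before embedding comparison, is assumed here:

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale an embedding vector to unit length so that later
    similarity comparisons are independent of its magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]
```

After normalization, the cosine similarity of two embeddings reduces to a plain dot product, which simplifies the matching step.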
In the above scheme,
the information processing module is configured to obtain a first text recognition result corresponding to the first audio information, and similarity corresponding to the identification information of the first target object and the first voiceprint information;
the information processing module is used for searching a text frequency information list of the first target object according to the identification information of the first target object;
the information processing module is used for calculating the display times of the first text recognition result in the text frequency information list;
the information processing module is used for calculating the frequency sum in the text frequency information list;
and the information processing module is used for taking the ratio of the display times to the sum of the frequency as the reliability parameter corresponding to the first voiceprint information when the similarity is greater than or equal to a similarity threshold value and the display times of the first text recognition result in the text frequency information list are greater than or equal to 1.
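The reliability parameter described above — the ratio of the display count of the first text recognition result to the sum of frequencies in the target object's text frequency list — can be sketched as:

```python
def reliability_parameter(display_count, frequency_sum):
    """Reliability of a recognition result: how often this text has
    been seen relative to all texts recorded for the same user.
    Returns 0.0 when the text has never been displayed."""
    if display_count < 1 or frequency_sum <= 0:
        return 0.0
    return display_count / frequency_sum
```

A frequently repeated instruction (e.g. a habitual navigation command) thus yields a high reliability parameter, making its voiceprint a safer basis for updating the stored embedded code.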
In the above scheme,
the information processing module is configured to update a text frequency information list of the first target object and update the display times of the first text recognition result in the text frequency information list when the reliability parameter corresponding to the first voiceprint information is 0.
In the above scheme,
the information processing module is configured to obtain an original voiceprint embedded code corresponding to the similarity when the reliability parameter corresponding to the first voiceprint information is greater than or equal to a reliability parameter threshold and the number of characters of the first text recognition result corresponding to the first audio information is greater than or equal to a character number threshold;
the information processing module is used for acquiring a first weight parameter of the original voiceprint embedded code and a second weight parameter of the first voiceprint embedded code;
the information processing module is configured to calculate a weighted average of the original voiceprint embedded code and the first voiceprint embedded code according to the first weight parameter and the second weight parameter, so as to obtain a third voiceprint embedded code;
and the information processing module is used for updating the original voiceprint embedded codes in the voiceprint embedded code set through the third voiceprint embedded code.
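The weighted-average update above can be sketched as follows. The weight values are illustrative assumptions; the patent introduces first and second weight parameters but does not fix their values:

```python
def update_embedding(original, new, w_orig=0.7, w_new=0.3):
    """Replace the stored embedding with a weighted average of the
    original and the newly observed embedding (the 'third voiceprint
    embedded code'). Weights 0.7/0.3 are illustrative only."""
    total = w_orig + w_new
    return [(w_orig * o + w_new * n) / total for o, n in zip(original, new)]
```

Weighting the original embedding more heavily lets the stored voiceprint drift gradually toward the user's current voice without being overwritten by a single atypical utterance.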
In the above scheme,
the information processing module is used for storing the voiceprint embedded code set in a cloud server;
the information processing module is used for detecting the processing authority of a second terminal when first audio information of a first target object is acquired through the second terminal;
and when the second terminal meets the requirement of the processing authority, storing the voiceprint embedded code set in the cloud server into the second terminal.
An embodiment of the present invention further provides an electronic device, where the electronic device includes:
a memory for storing executable instructions;
and a processor, configured to implement the preceding voiceprint information processing method when executing the executable instructions stored in the memory.
An embodiment of the present invention further provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the preceding voiceprint information processing method.
The embodiment of the invention has the following beneficial effects:
the embodiment of the invention collects first audio information of a first target object through a first terminal; analyzing the first audio information to obtain first voiceprint information of the first target object; processing the first voiceprint information through a voiceprint identification service process to obtain a first voiceprint embedded code corresponding to the first voiceprint information; searching a voiceprint embedded code set matched with the identification code of the first terminal according to the equipment identification code of the first terminal and the version identification code of the voiceprint recognition service process; calculating the similarity between the first voiceprint embedded code and each voiceprint embedded code in the set of voiceprint embedded codes; when the similarity is greater than or equal to a similarity threshold value, calculating a reliability parameter corresponding to the first voiceprint information; and updating the voiceprint embedded codes in the voiceprint embedded code set according to the reliability parameters corresponding to the first voiceprint information. From this, can be through updating the voiceprint embedded code in the set of voiceprint embedded codes, guarantee the accuracy of voiceprint embedded code, improve the accuracy that utilizes voiceprint embedded code to carry out voiceprint recognition, reduce the loaded down with trivial details step that the user manually updated voiceprint embedded code simultaneously, make the user obtain better use and experience.
Drawings
FIG. 1 is a schematic diagram of an environment for processing voiceprint information according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of an alternative voiceprint information processing method according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an alternative structure of a speech information recognition model according to an embodiment of the present invention;
fig. 5 is a schematic flow chart of an alternative voiceprint information processing method according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating the processing of the speech information recognition model for audio according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating a process of updating a voiceprint embedded code according to an embodiment of the present invention;
fig. 8 is a schematic view of a usage scenario of a voiceprint information processing method according to an embodiment of the present invention;
fig. 9 is a schematic flowchart of an alternative voiceprint information processing method according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention; all other embodiments obtainable by a person skilled in the art without creative effort fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Before the embodiments of the present invention are described in further detail, the terms and expressions used in them are explained; the following explanations apply to these terms and expressions.
1) Short-Time Fourier Transform (STFT): a mathematical transform related to the Fourier transform that determines the frequency and phase of local sections of a time-varying signal.
2) Mel spectrum (MBF, Mel Bank Features): because the raw spectrogram is large, it is usually passed through Mel-scale filter banks to obtain a sound feature of a suitable size, the Mel spectrum.
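The Mel scale used by these filter banks maps frequency in Hz onto a perceptual scale. The standard HTK-style conversion formula can be sketched as:

```python
import math

def hz_to_mel(f_hz):
    """HTK-style Hz-to-mel conversion: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse conversion from mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Mel filter bank edge frequencies are typically placed at equal spacing on the mel scale and then converted back to Hz with `mel_to_hz`, which concentrates filters at the low frequencies where human hearing is most discriminative.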
3) Neural Network (NN): an Artificial Neural Network (ANN), in machine learning and cognitive science, is a mathematical or computational model that imitates the structure and function of biological neural networks (the central nervous system of animals, especially the brain) and is used for estimating or approximating functions.
4) Speech Recognition (SR): also known as Automatic Speech Recognition (ASR), Computer Speech Recognition (CSR), or Speech-to-Text (STT); its goal is to automatically convert human speech content into corresponding text using a computer.
5) Terminal: including but not limited to a common terminal and a dedicated terminal, where the common terminal maintains a long connection and/or a short connection with the sending channel, and the dedicated terminal maintains a long connection with the sending channel.
6) Client: the carrier of a specific function in a terminal; for example, a mobile client (APP) is the carrier of a specific function in a mobile terminal, such as performing a voice wake-up function.
7) Mini Program: a program developed in a front-end language (e.g., JavaScript) that implements a service within a HyperText Markup Language (HTML) page. It is downloaded by a client (e.g., a browser, or any client embedding a browser core) over a network such as the internet and interpreted and executed in the client's browser environment, saving an installation step on the client. For example, a mini program in the terminal can be woken up through a voice instruction, so that mini programs implementing services such as ticket purchase, task processing, and data display can be downloaded and run in a social network client.
Fig. 1 is a schematic view of a usage scenario of a voiceprint information processing method according to an embodiment of the present invention. Referring to fig. 1, the terminals (including terminal 10-1 and terminal 10-2) are provided with clients capable of executing different functions, through which the terminals can obtain and browse information from the corresponding server 200 over the network 300. The terminals connect to the server 200 through the network 300, which may be a wide area network, a local area network, or a combination of the two, with data transmission over wireless links. The terminals can be woken up by a user's voice instruction. Specifically, the key technologies of speech technology include automatic speech recognition, speech synthesis, and voiceprint recognition. Speech technology can be applied to an electronic device to wake it up, i.e., the voice wake-up technology. Typically, voice wake-up is implemented by setting a fixed wake-up word: after the user speaks the wake-up word, the speech recognition function on the terminal enters a working state; otherwise, the terminal remains dormant.
The smart-device wake-up method provided by the embodiments of the present application is implemented based on Artificial Intelligence (AI). AI is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
In the embodiments of the present application, the artificial intelligence software technologies mainly involved include the above-mentioned speech processing technology and machine learning. For example, Automatic Speech Recognition (ASR) within Speech Technology may be involved, including speech signal preprocessing, speech signal frequency-domain analysis, speech signal feature extraction, speech signal feature matching/recognition, speech training, and so on.
For example, machine Learning (ML) may be involved, which is a multi-domain cross discipline, involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and so on. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine Learning generally includes techniques such as Deep Learning (Deep Learning), which includes artificial Neural networks (artificial Neural networks), such as Convolutional Neural Networks (CNN), recurrent Neural Networks (RNN), deep Neural Networks (DNN), and the like.
It can be understood that the method can be applied to an intelligent device, which may be any device with a voice wake-up function: for example, an intelligent terminal, a smart home device (such as a smart speaker or smart washing machine), a smart wearable device (such as a smart watch), a vehicle-mounted intelligent central control system (in which mini programs executing different tasks in the terminal are woken up by voice instructions), an AI-powered medical device (woken up and triggered by voice instructions), and the like.
As an example, the terminals (including terminal 10-1 and terminal 10-2) are used to deploy a voiceprint information processing apparatus implementing the voiceprint information processing method provided by the present invention, so as to: collect first audio information of a first target object through a first terminal; parse the first audio information to obtain first voiceprint information of the first target object; process the first voiceprint information through a voiceprint recognition service process to obtain a first voiceprint embedded code corresponding to the first voiceprint information; search for a voiceprint embedded code set matching the identification code of the first terminal according to the device identification code of the first terminal and the version identification code of the voiceprint recognition service process; calculate the similarity between the first voiceprint embedded code and each voiceprint embedded code in the set; when the similarity is greater than or equal to a similarity threshold, calculate a reliability parameter corresponding to the first voiceprint information; and update the voiceprint embedded codes in the set according to that reliability parameter, finally recognizing and executing the voice wake-up instruction.
The structure of the voiceprint information processing apparatus according to the embodiments of the present invention is described in detail below. The apparatus may be implemented in various forms, such as a dedicated terminal with a voiceprint information processing function, or a mobile phone or tablet computer with such a function (for example, the terminals in fig. 1 above). Fig. 2 is a schematic diagram of the composition of a voiceprint information processing apparatus according to an embodiment of the present invention. It can be understood that fig. 2 shows only an exemplary structure of the apparatus, not its entire structure, and a part or all of the structure shown in fig. 2 may be implemented as required.
The voiceprint information processing device provided by the embodiment of the invention comprises: at least one processor 201, memory 202, user interface 203, and at least one network interface 204. The various components in the voiceprint information processing apparatus are coupled together by a bus system 205. It will be appreciated that the bus system 205 is used to enable communications among the components. The bus system 205 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 205 in fig. 2.
The user interface 203 may include, among other things, a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, or a touch screen.
It will be appreciated that the memory 202 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The memory 202 in the embodiments of the present invention is capable of storing data to support the operation of the terminal (e.g., 10-1). Examples of such data include: any computer program, such as an operating system and application programs, for operating on a terminal (e.g., 10-1). The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application program may include various application programs.
In some embodiments, the voiceprint information processing apparatus provided by the embodiment of the present invention may be implemented by a combination of hardware and software. As an example, the voice processing model provided by the embodiment of the present invention may run on a processor in the form of a hardware decoding processor, which is programmed to execute the semantic processing method of the voice processing model provided by the embodiment of the present invention. For example, a processor in the form of a hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
As an example of implementing the voiceprint information processing apparatus provided by the embodiment of the present invention by combining software and hardware, the apparatus may be directly embodied as a combination of software modules executed by the processor 201. The software modules may be located in a storage medium, the storage medium is located in the memory 202, and the processor 201 reads the executable instructions included in the software modules in the memory 202 and, in combination with necessary hardware (for example, including the processor 201 and other components connected to the bus 205), completes the semantic processing method of the voice processing model provided by the embodiment of the present invention.
By way of example, the Processor 201 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor or the like.
As an example of the voiceprint information processing apparatus provided by the embodiment of the present invention implemented by hardware, the apparatus provided by the embodiment of the present invention may be implemented by directly using the processor 201 in the form of a hardware decoding processor, for example, a semantic processing method for implementing the voice processing model provided by the embodiment of the present invention may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, programmable Logic Devices (PLDs), complex Programmable Logic Devices (CPLDs), field Programmable Gate Arrays (FPGAs), or other electronic components.
The memory 202 in the embodiment of the present invention is used to store various types of data to support the operation of the voiceprint information processing apparatus. Examples of such data include: any executable instructions for operating on the voiceprint information processing apparatus; for example, a program implementing the semantic processing method of the sound processing model of the embodiments of the present invention may be contained in the executable instructions.
In other embodiments, the voiceprint information processing apparatus provided by the embodiment of the present invention may be implemented in software. Fig. 2 illustrates the voiceprint information processing apparatus stored in the memory 202, which may be software in the form of programs, plug-ins, and the like, comprising a series of modules. As an example of the program stored in the memory 202, the voiceprint information processing apparatus may include the following software modules: an information transmission module 2081 and an information processing module 2082. When the software modules in the voiceprint information processing apparatus are read into RAM by the processor 201 and executed, the semantic processing method of the sound processing model provided in the embodiment of the present invention is implemented. The functions of each software module of the voiceprint information processing apparatus in the embodiment of the present invention are described below, specifically including:
the information transmission module 2081 is configured to collect first audio information of a first target object through a first terminal.
The information processing module 2082 is configured to analyze the first audio information to obtain first voiceprint information of the first target object.
The information processing module 2082 is configured to process the first voiceprint information through a voiceprint recognition service process, so as to obtain a first voiceprint embedded code corresponding to the first voiceprint information.
The information processing module 2082 is configured to search a voiceprint embedded code set matched with the first terminal identification code according to the device identification code of the first terminal and the version identification code of the voiceprint recognition service process.
The information processing module 2082 is configured to calculate a similarity between the first voiceprint embedded code and each voiceprint embedded code in the set of voiceprint embedded codes.
The information processing module 2082 is configured to calculate a reliability parameter corresponding to the first voiceprint information when the similarity is greater than or equal to a similarity threshold.
The information processing module 2082 is configured to update the voiceprint embedded codes in the voiceprint embedded code set according to the reliability parameter corresponding to the first voiceprint information.
According to the electronic device shown in fig. 2, in one aspect of the present application, the present application further provides a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute different embodiments and combinations of embodiments provided in various alternative implementations of the voiceprint information processing method described above.
The voiceprint information processing method provided by the embodiment of the present invention is explained with reference to the electronic device 20 shown in fig. 2, and before describing the voiceprint information processing method provided by the present invention, the defects of the related art are first described.
The theoretical basis for voiceprint recognition is that every voice has unique characteristics by which different human voices can be effectively distinguished. These unique characteristics are determined primarily by two factors. The first is the size of the vocal cavity, specifically including the throat, nasal cavity, and oral cavity; the shape, size, and location of these organs determine the tension of the vocal cords and the range of vocal frequencies. Therefore, even when different people say the same words, the frequency distribution of their voices differs: some voices sound low and full, others clear and bright. Each person's vocal cavity is different and, like a fingerprint, gives each person's voice unique characteristics.
The second factor that determines voice characteristics is the manner in which the vocal organs are manipulated, including the lips, teeth, tongue, soft palate, and palatal muscles, which interact to produce clear speech. This manner of coordination is learned after birth through communication with the people around us. In the process of learning to speak, a person gradually forms his or her own voiceprint characteristics by imitating the speaking styles of the different people nearby.
However, because a user's physical state varies (e.g., damaged vocal cords), physiological development proceeds (e.g., voice change during adolescence), and different types of language are used (the same word pronounced differently in different dialects), the voice of the same user often changes. The terminal then cannot promptly and accurately recognize the user's voice instructions, which degrades the user's voice-interaction experience; and if the user frequently updates the voiceprint embedded code manually, the user's operational burden increases.
To overcome the above defects, refer to fig. 3, which is an optional flowchart of a voiceprint information processing method provided in an embodiment of the present invention. It can be understood that the steps shown in fig. 3 may be executed by various electronic devices running a voiceprint information processing apparatus, for example, a terminal, a server, or a server cluster with a voiceprint information processing function. When the voiceprint information processing apparatus runs in a terminal, an instant messaging client in the terminal or an applet in a vehicle head unit may be triggered to process audio information, thereby increasing the speed of voiceprint information processing. The user may also operate the electronic device through a wake-up word in a voice instruction, and the electronic device executes the task matched with the audio information features. A dedicated device carrying the voiceprint information processing apparatus may be deployed in the terminal shown in fig. 1 to execute the corresponding software modules of the voiceprint information processing apparatus shown in fig. 2 above, and the user may obtain and display task information through a corresponding client. The method is described below with reference to the steps shown in fig. 3.
Step 301: the voiceprint information processing device collects first audio information of a first target object through a first terminal.
In some embodiments of the present invention, before performing step 301, the terminal needs to perform voiceprint information registration on the received voice command, so as to determine from which user the voice command originates after receiving the voice command, and specifically, performing voiceprint information registration may be implemented by:
configuring a first terminal identification code for the first terminal; when audio information of a second target object is acquired through the first terminal, configuring a second target object identification code for the second target object through the voiceprint recognition service process, and establishing a mapping relationship between the second target object identification code and the first terminal identification code; acquiring at least 2 pieces of audio information of the second target object through the voiceprint recognition service process; calculating, through the voiceprint recognition service process, second voiceprint embedded codes respectively corresponding to the at least 2 pieces of audio information; calculating the average value of the second voiceprint embedded codes respectively corresponding to the at least 2 pieces of audio information; and storing the average value of the second voiceprint embedded codes in the voiceprint embedded code set, marking it with the first terminal identification code, the second target object identification code, and the version identification code of the voiceprint recognition service process. For example, taking the first terminal as a smart speaker, the user may voice-control the electronic device through a corresponding voice instruction and have it execute the task matched with the audio information features, instead of using traditional manual operation. Specifically, for various operations of different types of electronic devices, corresponding wake-up words may be configured in advance; the user only needs to speak the wake-up word corresponding to the required task operation in a voice instruction to control the electronic device to execute the corresponding operation.
For example, when the electronic device is an in-vehicle intelligent central control system whose wake-up word is "turn on music", the smart device may collect audio data at any time. The electronic device collects the audio data "turn on music", recognizes whether "turn on music" is the wake-up word, and executes the task matched with the audio information features, so that the electronic device plays songs. Here, the second target object may be any user of the smart speaker (denoted a speaker user). A speaker user initiates user creation on the first terminal, and the recognition service allocates a unique user identification code u, which is bound to the unique identification code D of the first terminal; the speaker user's audio is then enrolled. The speaker user records 3 segments of audio of a fixed text on the first terminal; these are stored in the audio storage service and marked with "device unique identification code + user unique identification code". The voiceprint recognition service process uses the 3 enrolled audio segments, computes three voiceprint embedded codes through the speech information recognition model, calculates the average of the 3 voiceprint embedded codes as the speaker user's final voiceprint embedded code e, stores it in the embedded-code storage service, and marks it with "device unique identification code D + user unique identification code u + voiceprint recognition service version number vn", thereby completing the speaker user's voiceprint registration process.
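The registration flow above — averaging the embedded codes of at least two enrolment recordings and keying the result by device, user, and service version — can be sketched as follows. The function `embed_fn`, the tuple key layout, and the in-memory dictionary store are illustrative assumptions, not the actual storage services of this embodiment:

```python
import numpy as np

def register_voiceprint(store, device_id, user_id, version, clips, embed_fn):
    """Enrol a speaker: embed each clip, average the embeddings, and store
    the result under a (device, user, service-version) key."""
    if len(clips) < 2:
        raise ValueError("registration requires at least 2 audio clips")
    embeddings = [np.asarray(embed_fn(clip), dtype=float) for clip in clips]
    final = np.mean(embeddings, axis=0)           # final voiceprint embedded code e
    store[(device_id, user_id, version)] = final  # marked with D + u + vn
    return final
```

A real deployment would persist the averaged embedding in the embedded-code storage service rather than an in-memory dictionary.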
Step 302: and the voiceprint information processing device analyzes the first audio information to obtain first voiceprint information of the first target object.
Step 303: and the voiceprint information processing device processes the first voiceprint information through a voiceprint recognition service process to obtain a first voiceprint embedded code corresponding to the first voiceprint information.
With continued reference to fig. 4, fig. 4 is an optional structural schematic diagram of the speech information recognition model in the embodiment of the present invention, in which the encoder includes N = 6 identical layers, each layer containing two sub-layers. The first sub-layer is a multi-head attention layer, followed by a simple fully connected layer. A residual connection and normalization are applied to each sub-layer.
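As a rough illustration of one such encoder layer (a multi-head attention sub-layer and a fully connected sub-layer, each with a residual connection and normalization), the following NumPy sketch uses toy dimensions; it is a generic Transformer-style layer, not the patent's actual model:

```python
import numpy as np

d_model, n_heads, d_ff = 64, 4, 256   # toy sizes, not the model's actual dimensions
d_head = d_model // n_heads

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo):
    n = x.shape[0]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    def split(t):  # (n, d_model) -> (n_heads, n, d_head)
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))   # scaled dot-product
    out = (att @ v).transpose(1, 0, 2).reshape(n, d_model)      # concatenate heads
    return out @ Wo

def encoder_layer(x, params):
    Wq, Wk, Wv, Wo, W1, b1, W2, b2 = params
    x = layer_norm(x + multi_head_attention(x, Wq, Wk, Wv, Wo))  # sub-layer 1 + residual
    ff = np.maximum(0.0, x @ W1 + b1) @ W2 + b2                  # fully connected sub-layer
    return layer_norm(x + ff)                                    # sub-layer 2 + residual
```

The full encoder of fig. 4 would stack N = 6 such layers, each with its own parameters.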
Referring to fig. 5 in conjunction with the model structure shown in fig. 4, fig. 5 is an optional flowchart of the speech information recognition method according to the embodiment of the present invention, and it can be understood that the steps shown in fig. 5 may be executed by various electronic devices operating the speech information recognition apparatus to obtain the word characteristic vector and the font characteristic vector corresponding to the speech information to be recognized, and specifically include the following steps:
step 501: and extracting pinyin corresponding to each character in the voice information to be recognized and the tone corresponding to each character in the voice information to be recognized through a word-tone encoder network in the voice information recognition model according to the recognition environment of the target voice information.
Step 502: and determining a single character pronunciation feature vector of each character level in the voice information to be recognized according to the pinyin corresponding to each character in the voice information to be recognized and the tone corresponding to each character in the voice information to be recognized.
Step 503: and combining the single character pronunciation feature vectors corresponding to each character in the speech information to be recognized through a word pronunciation encoder network in the speech information recognition model to form the sentence-level pronunciation feature vectors.
Step 504: and taking the pronunciation feature vector of the statement level as the first voiceprint embedded code.
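Steps 501-504 can be illustrated with the following toy sketch. The pronunciation lexicon, the randomly initialized embedding tables, and the mean pooling that stands in for the word-pronunciation encoder network are all hypothetical simplifications, not the embodiment's actual components:

```python
import numpy as np

# Hypothetical pronunciation lexicon: character -> (pinyin, tone).
# A real system would use a full pronunciation dictionary.
LEXICON = {"打": ("da", 3), "开": ("kai", 1), "音": ("yin", 1), "乐": ("yue", 4)}

DIM = 16
rng = np.random.default_rng(42)
# Hypothetical embedding tables for pinyin syllables and the 5 Mandarin tones.
syllable_vec = {s: rng.standard_normal(DIM)
                for s in sorted({p for p, _ in LEXICON.values()})}
tone_vec = {t: rng.standard_normal(DIM) for t in range(1, 6)}

def char_pronunciation_vector(ch):
    """Steps 501-502: pinyin + tone -> character-level pronunciation vector."""
    pinyin, tone = LEXICON[ch]
    return syllable_vec[pinyin] + tone_vec[tone]

def sentence_embedding(text):
    """Steps 503-504: combine character vectors into a sentence-level vector
    (mean pooling stands in for the word-pronunciation encoder network)."""
    return np.mean([char_pronunciation_vector(c) for c in text], axis=0)

emb = sentence_embedding("打开音乐")  # "turn on music"
```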
In some embodiments of the present invention, during pronunciation recognition processing, the sentence-level pronunciation encoder uses a 4-layer Transformer model whose input is the output of the word-level pronunciation encoder. It should be noted that a Gated Recurrent Unit (GRU) network is a model with fewer parameters than an LSTM that can still process sequence information very well; the fused features are then input into a feedforward neural network to process the effective information of the other features. Treating wrong-character recognition as a problem of predicting an occurrence probability, a sigmoid (logistic) function is used as the output layer, and the loss function is the standard cross-entropy loss; see formula 1:
L = -(1/N) Σᵢ [yᵢ·log(ŷᵢ) + (1 - yᵢ)·log(1 - ŷᵢ)]    (formula 1)

where yᵢ is the ground-truth label, ŷᵢ is the sigmoid output, and N is the number of samples.
The GRU layer is used to extract depth features; it can also be omitted and replaced by several stacked feedforward neural network layers, which can likewise effectively process and fuse the features.
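The sigmoid output layer and standard cross-entropy loss of formula 1 can be written as a short NumPy sketch (a generic binary cross-entropy implementation, not code from this embodiment):

```python
import numpy as np

def sigmoid(z):
    """Logistic output layer mapping a score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Standard binary cross-entropy of formula 1, averaged over samples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # keep log() finite
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))
```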
In some embodiments of the present invention, the obtaining manner of the first voiceprint embedded code further includes:
performing channel conversion processing on the first audio to form single-channel audio data; performing a short-time Fourier transform on the single-channel audio data based on the windowing function corresponding to the speech information recognition model to form a corresponding mel spectrogram; determining a corresponding input triple sample based on the mel spectrogram and inputting the input triple sample into the speech information recognition model; processing the input triple samples alternately through the convolution layers and the maximum pooling layers of the speech information recognition model to obtain down-sampling results for the different input triple samples; and normalizing the down-sampling results of the different input triple samples through a fully connected layer of the speech information recognition model to obtain the first voiceprint embedded code. Fig. 6 is a schematic diagram of the processing of audio by the speech information recognition model in the embodiment of the present invention. Feature extraction of the speech information recognition model may be implemented through a VGGish network (a Visual Geometry Group audio model). For example, for audio in user video monitoring information, an audio file may be extracted and a corresponding mel spectrogram obtained for it; audio features are then extracted from the mel spectrogram through the VGGish network, and the extracted vectors are cluster-encoded through NetVLAD (Net Vector of Locally Aggregated Descriptors) to obtain the audio feature vectors. NetVLAD stores the distance between each feature point and its nearest cluster center and uses it as a new feature.
Because the collected audio information may contain environmental noise, in order to better calculate the first voiceprint embedded code, the audio may first be resampled to 16 kHz single-channel audio. A short-time Fourier transform is then applied with a 25 ms periodic Hann window and a 10 ms frame shift to obtain the corresponding spectrogram. The mel spectrum is calculated by mapping the spectrogram onto a 64th-order mel filter bank, where the mel bins cover 125-7500 Hz. Computing log(mel-spectrum + 0.01) yields a stable log-mel spectrum, where the 0.01 offset avoids taking the logarithm of 0. The obtained features are then framed into non-overlapping 0.96 s examples, each containing 96 frames of 10 ms duration with 64 mel bands per frame. In this way the corresponding mel spectrogram is extracted, and a clean first voiceprint embedded code is finally obtained by processing the mel spectrogram, ensuring the accuracy of the first voiceprint embedded code.
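A minimal NumPy sketch of this preprocessing (16 kHz mono input, 25 ms Hann window, 10 ms hop, 64 mel bands over 125-7500 Hz, log(mel + 0.01)) is shown below. The triangular mel filterbank is the common textbook construction, assumed rather than taken from this embodiment, and the 0.96 s example framing is omitted for brevity:

```python
import numpy as np

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels, fmin, fmax):
    # Triangular filters spaced evenly on the mel scale (textbook construction).
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):          # rising edge
            fb[m - 1, k] = (k - l) / (c - l)
        for k in range(c, r):          # falling edge
            fb[m - 1, k] = (r - k) / (r - c)
    return fb

def log_mel_spectrogram(audio, sr=16000, n_mels=64, fmin=125.0, fmax=7500.0):
    win, hop, n_fft = int(0.025 * sr), int(0.010 * sr), 512  # 25 ms Hann, 10 ms shift
    window = np.hanning(win)
    n_frames = 1 + (len(audio) - win) // hop
    frames = np.stack([audio[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))   # magnitude spectrogram
    mel = spec @ mel_filterbank(sr, n_fft, n_mels, fmin, fmax).T
    return np.log(mel + 0.01)                             # 0.01 offset avoids log(0)
```

One second of 16 kHz audio yields a (98, 64) log-mel matrix, from which the 96-frame examples described above would be cut.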
Step 304: and the voiceprint information processing device searches a voiceprint embedded code set matched with the first terminal identification code according to the equipment identification code of the first terminal and the version identification code of the voiceprint identification service process.
In some embodiments of the invention, the version identification code of the voiceprint recognition service process may be used to indicate where the voiceprint embedded code set is stored. For example, when the electronic device is an in-vehicle intelligent central control system and the version identification code of the voiceprint recognition service process is version 1.1 (domestic version) or version 19.0.1 (overseas version), the voiceprint embedded code set is stored only in the storage device of the in-vehicle intelligent central control system for executing the voiceprint information processing method provided by the present application. When the version identification code of the voiceprint recognition service process is version 3.1 (domestic version) or version 21.0.1 (overseas version), the voiceprint embedded code set is stored not only in the storage device of the in-vehicle intelligent central control system but also in the corresponding cloud network (a cloud server cluster) for executing the voiceprint information processing method provided by the present application. When the user changes to a new vehicle, the stored voiceprint embedded code set can be obtained from the cloud network and applied to the new in-vehicle intelligent central control system, avoiding the drawback of the user manually copying the voiceprint embedded code set and facilitating the user's use.
In some embodiments of the present invention, when the user needs to use the cloud network to store the voiceprint embedded code set, the version identification code of the voiceprint recognition service process may be adjusted by purchasing an upgrade of the voiceprint recognition service process, for example, upgrading version 1.1 (domestic version) to version 3.1 (domestic version) or above, or upgrading version 19.0.1 (overseas version) to version 21.0.1 (overseas version) or above, so as to meet the user's needs. Meanwhile, the user can flexibly select the domestic or overseas version of the voiceprint recognition service process according to the region of use, so as to meet that region's legal requirements on voiceprint information collection.
Step 305: the voiceprint information processing apparatus calculates a similarity between the first voiceprint embedded code and each voiceprint embedded code in the set of voiceprint embedded codes.
Still taking the smart speaker of the preceding embodiment as an example, speaker user A performs verification audio entry. The speaker user records a segment of audio on the first terminal and initiates voiceprint verification to the voiceprint service, carrying the device unique identification code D_A and the user unique identification code u_A. The voiceprint embedded code is then computed: the voiceprint recognition service process calculates the voiceprint embedded code e_A of the audio through the speech recognition model, and at the same time, according to the unique identification code D_A of the first terminal and the version number v_n of the voiceprint recognition service process, obtains the embedded codes E = {e1, e2, ..., en} of all speaker users on the first terminal. The voiceprint recognition service process calculates the cosine similarity between the embedded code e_A and each of the voiceprint embedded codes E of all speaker users of the first terminal, and takes the maximum similarity value C, whose corresponding voiceprint embedded code is e_i.
In some embodiments of the present invention, the similarity threshold T takes the value 0.6. If C is greater than or equal to the threshold T = 0.6, verification succeeds, and the voiceprint recognition service process returns the speaker user identification code u_i corresponding to e_i, together with the similarity value C, to the first terminal. If C is smaller than the threshold T, verification fails; the voiceprint recognition service process returns a verification-failure prompt to the first terminal and may prompt the smart speaker user to trigger another verification method to change the voiceprint embedded code.
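The verification step above — maximum cosine similarity over the device's enrolled embedded codes, gated by the 0.6 threshold — can be sketched as follows; the dictionary of enrolled users is an illustrative stand-in for the embedded-code storage service:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(query_embedding, enrolled, threshold=0.6):
    """enrolled: dict of speaker-user id -> voiceprint embedded code on this
    device. Returns (user id, C) on success, (None, C) on failure."""
    best_user, best_sim = None, -1.0
    for uid, emb in enrolled.items():
        sim = cosine_similarity(query_embedding, emb)
        if sim > best_sim:
            best_user, best_sim = uid, sim
    if best_sim >= threshold:
        return best_user, best_sim   # verification succeeded
    return None, best_sim            # verification failed
```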
Step 306: and when the similarity is greater than or equal to a similarity threshold value, the voiceprint information processing device calculates a reliability parameter corresponding to the first voiceprint information.
In some embodiments of the present invention, since voiceprint information changes with the user's physiological development, when calculating the reliability parameter corresponding to the first voiceprint information — specifically, when two or more voiceprint embedded codes in the voiceprint embedded code set have a similarity greater than or equal to the similarity threshold — different verification policies may be triggered according to the version identification code of the voiceprint recognition service process, in order to ensure the security of changing the voiceprint embedded code. When the voiceprint recognition service process is version 3.1 or above (domestic version) or version 21.0.1 or above (overseas version), the user's face recognition information may be sent to the cloud server network, and when the face recognition information passes the verification of the cloud server network, the reliability parameter corresponding to the first voiceprint information is calculated to update the voiceprint embedded code. When the voiceprint recognition service process is below version 3.1 (domestic version) or below version 21.0.1 (overseas version), the cloud server network cannot be used and only the local voiceprint embedded codes can be updated; therefore, to ensure the security of changing the voiceprint embedded code, the first audio information needs to be collected again and the similarity between the first voiceprint embedded code and each voiceprint embedded code in the voiceprint embedded code set recalculated.
In some embodiments of the present invention, calculating the reliability parameter corresponding to the first voiceprint information may be implemented by:
acquiring a first text recognition result corresponding to the first audio information, identification information of the first target object, and the similarity corresponding to the first voiceprint information; searching the text frequency information list of the first target object according to the identification information of the first target object; calculating the number of occurrences of the first text recognition result in the text frequency information list; calculating the sum of the frequencies in the text frequency information list; and when the similarity is greater than or equal to the similarity threshold and the number of occurrences of the first text recognition result in the text frequency information list is greater than or equal to 1, taking the ratio of the number of occurrences to the frequency sum as the reliability parameter corresponding to the first voiceprint information. In combination with the structure of the speech recognition model shown in fig. 4 above, the speech recognition model may perform text recognition on the audio information, or the audio information may be converted through the text-to-speech conversion server to obtain the corresponding audio information feature set; the audio information feature set is processed through the first neural network to determine feature vectors equal in number to the test feature frames, and the feature vectors are averaged to extract the corresponding wake-up word features.
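The reliability computation described above — the ratio R = m/N, gated by the similarity threshold and by the recognized text appearing at least once in the list — can be sketched as (the `Counter` representation of the text frequency list is an illustrative assumption):

```python
from collections import Counter

def reliability(text, freq_list, similarity, threshold=0.6):
    """R = m / N: m is the number of occurrences of the recognized text in the
    target object's text frequency list, N the sum of all frequencies in it."""
    m = freq_list[text]
    n = sum(freq_list.values())
    if similarity >= threshold and m >= 1:
        return m / n
    return 0.0
```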
The embodiment of the present invention may be implemented in combination with cloud technology or blockchain network technology. Cloud technology refers to a hosting technology that unifies series of resources such as hardware, software, and networks in a wide area network or local area network to realize the calculation, storage, processing, and sharing of data. It can also be understood as the general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like applied on the basis of the cloud computing business model. Background services of technical network systems require a large amount of computing and storage resources, such as video websites, photo websites, and other portal websites, so cloud technology needs the support of cloud computing.
It should be noted that cloud computing is a computing mode, and distributes computing tasks on a resource pool formed by a large number of computers, so that various application systems can obtain computing power, storage space and information services as required. The network that provides the resources is called the "cloud". Resources in the "cloud" appear to the user as being infinitely expandable and available at any time, available on demand, expandable at any time, and paid for on-demand. As a basic capability provider of cloud computing, a cloud computing resource pool platform, which is called an Infrastructure as a Service (IaaS) for short, is established, and multiple types of virtual resources are deployed in a resource pool and are used by external clients selectively. The cloud computing resource pool mainly comprises: a computing device (which may be a virtualized machine, including an operating system), a storage device, and a network device.
In some embodiments of the present invention, a TTS server in the cloud may generate N different wake-up word voices (pronunciations) from a wake-up text, forming feature vectors of different frame lengths. For example, a user may arbitrarily modify the audio information according to different usage scenarios, and the TTS server converts each character contained in the audio information into a syllable identifier according to a pronunciation dictionary so as to extract the corresponding wake-up word features.
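The character-to-syllable conversion performed by the TTS server can be sketched as follows (the pronunciation dictionary entries, the `<unk>` fallback, and the function name are illustrative assumptions, not the patent's pronunciation dictionary):

```python
# Hypothetical pronunciation dictionary: character -> syllable identifier.
PRONUNCIATION_DICT = {"你": "ni3", "好": "hao3", "小": "xiao3", "度": "du4"}

def text_to_syllables(wake_text: str, unknown: str = "<unk>") -> list:
    """Map each character of the wake-up text to a syllable identifier,
    as the TTS server does before synthesizing wake-word pronunciations."""
    return [PRONUNCIATION_DICT.get(ch, unknown) for ch in wake_text]

syllables = text_to_syllables("你好")  # → ['ni3', 'hao3']
```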
Step 307: and the voiceprint information processing device updates the voiceprint embedded codes in the voiceprint embedded code set according to the reliability parameters corresponding to the first voiceprint information.
Referring to fig. 7, fig. 7 is a schematic diagram of a process of updating a voiceprint embedded code in an embodiment of the present invention, which specifically includes the following steps:
step 701: and when the reliability parameter corresponding to the first voiceprint information is greater than or equal to a reliability parameter threshold value, and the number of characters of the first text recognition result corresponding to the first audio information is greater than or equal to a character number threshold value, acquiring the original voiceprint embedded code corresponding to the similarity.
Step 702: and acquiring a first weight parameter of the original voiceprint embedded code and a second weight parameter of the first voiceprint embedded code.
Step 703: and calculating a weighted average of the original voiceprint embedded code and the first voiceprint embedded code according to the first weight parameter and the second weight parameter to obtain a third voiceprint embedded code.
Step 704: and updating the original voiceprint embedded code in the voiceprint embedded code set with the third voiceprint embedded code.
In connection with the preceding embodiment, the first terminal calculates a reliability value R of the voiceprint recognition result, 0 < R < 1, from the text recognition result T of the audio information, the identification u_i of the target object, and the similarity value C, and sends R and T to the voiceprint recognition service process. R is calculated as: R = m/N, where m is the number of times T occurs in L, L is the text frequency information list of the first target object, and N is the sum of the frequencies in the text frequency information list.
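The reliability calculation R = m/N can be sketched as follows (the example text frequency list and function name are illustrative; only the formula comes from the text):

```python
from collections import Counter

def reliability(text: str, freq_list: Counter) -> float:
    """R = m / N: m is how often the recognized text T appears in the target
    object's text frequency list L; N is the sum of all frequencies in L."""
    total = sum(freq_list.values())
    if total == 0:
        return 0.0
    return freq_list[text] / total  # missing keys in a Counter count as 0

# Illustrative text frequency list for one target object.
L = Counter({"play some music": 6, "open the map": 3, "hello": 1})
R = reliability("play some music", L)  # → 0.6
```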
The voiceprint recognition service process judges whether to update the voiceprint embedded code according to the obtained reliability value R and the verification audio text recognition result T. In some embodiments of the present invention, the reliability threshold may be 0.7, that is: when R is greater than 0.7 and the number of words of T is greater than 10, the embedded code is updated; otherwise, no update is performed.
In some embodiments of the present invention, the first weight parameter is 0.9 and the second weight parameter is 0.1; the weighted average yields a new embedded code e_i' = 0.1 × e_r + 0.9 × e_i, and the e_i stored in the embedded code storage service is then replaced with e_i'.
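The weighted-average update of this embodiment can be sketched as follows (a hypothetical helper; only the 0.9/0.1 weights and the update formula come from the text):

```python
import numpy as np

def update_embedding(e_orig, e_new, w_orig=0.9, w_new=0.1):
    """Weighted average of the stored embedded code and the newly extracted
    one: e' = w_new * e_new + w_orig * e_orig (weights from the embodiment)."""
    e_orig = np.asarray(e_orig, dtype=float)
    e_new = np.asarray(e_new, dtype=float)
    return w_new * e_new + w_orig * e_orig

ei = np.array([1.0, 0.0])          # stored embedded code e_i
er = np.array([0.0, 1.0])          # newly extracted embedded code e_r
ei_new = update_embedding(ei, er)  # → array([0.9, 0.1])
```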
Step 705: and when the reliability parameter corresponding to the first voiceprint information is 0, updating the text frequency information list of the first target object, that is, updating the number of times the first text recognition result appears in the text frequency information list.
The voiceprint information processing method provided by the present application is described below by taking the wake-up process of a vehicle-mounted system in a vehicle-mounted use environment as an example. Fig. 8 is a schematic diagram of a use scene of the voiceprint information processing method provided by the embodiment of the present invention. The method can serve clients of various types in the form of a cloud service (for example, packaged in a vehicle-mounted terminal or packaged in different mobile electronic devices). The user interface includes a personal view picture for observing the task information processing environment in the instant client from the first-person view of different types of users, and further includes a task control component and an information display component. Through the user interface, the information display component displays the tasks matched with the wake-up voice features and the corresponding wake-up words; based on the result of the wake-up judgment, the user interface uses the information display component to display the task processing result of the electronic device that matches the wake-up voice feature, so as to realize information interaction between the electronic device and the user. For example, through a voice instruction containing the wake-up word, the user can trigger the vehicle-mounted system to execute a music playing function, or wake up the map applet in the vehicle-mounted WeChat for use.
Specifically, referring to fig. 9, fig. 9 is a schematic view of an optional flow chart of the voiceprint information processing method provided in the embodiment of the present invention, which specifically includes:
step 901: and acquiring the audio information of the user through the vehicle-mounted terminal, and analyzing the audio information to obtain the voiceprint information of the user.
Step 902: and processing the voiceprint information through a voiceprint recognition service process to obtain a voiceprint embedded code corresponding to the voiceprint information, and searching a voiceprint embedded code set matched with the first terminal identification code according to the equipment identification code of the first terminal and the version identification code of the voiceprint recognition service process.
Step 903: and calculating the similarity between the voiceprint embedded code and each voiceprint embedded code in the set of voiceprint embedded codes.
Step 904: and when the similarity is greater than or equal to the similarity threshold, calculating the reliability parameter corresponding to the voiceprint information, and updating the voiceprint embedded codes in the voiceprint embedded code set.
Step 905: and obtaining a corresponding voice instruction judgment result through the updated voiceprint embedded code so as to determine whether to awaken the vehicle-mounted terminal.
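Steps 903–904 do not specify how the similarity is computed; a common choice for comparing embedded codes is cosine similarity. A minimal sketch under that assumption (the 0.7 threshold echoes the earlier embodiment; the function names and stored set are illustrative):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedded codes."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_voiceprint(query, embedding_set, threshold=0.7):
    """Compare a query embedded code against each code in the set; return
    (best_index, best_similarity), with best_index None below threshold."""
    sims = [cosine_similarity(query, e) for e in embedding_set]
    best = int(np.argmax(sims))
    return (best if sims[best] >= threshold else None, sims[best])

store = [np.array([1.0, 0.0]), np.array([0.6, 0.8])]
idx, sim = match_voiceprint(np.array([0.6, 0.8]), store)  # idx → 1, sim → 1.0
```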
The beneficial technical effects are as follows:
the embodiment of the invention collects first audio information of a first target object through a first terminal; analyzes the first audio information to obtain first voiceprint information of the first target object; processes the first voiceprint information through a voiceprint recognition service process to obtain a first voiceprint embedded code corresponding to the first voiceprint information; searches a voiceprint embedded code set matched with the first terminal identification code according to the equipment identification code of the first terminal and the version identification code of the voiceprint recognition service process; calculates the similarity between the first voiceprint embedded code and each voiceprint embedded code in the voiceprint embedded code set; calculates, when the similarity is greater than or equal to a similarity threshold, a reliability parameter corresponding to the first voiceprint information; and updates the voiceprint embedded codes in the voiceprint embedded code set according to the reliability parameter corresponding to the first voiceprint information. In this way, the voiceprint embedded codes in the voiceprint embedded code set can be updated automatically, which ensures the accuracy of the voiceprint embedded codes, improves the accuracy of voiceprint recognition performed with them, and eliminates the tedious steps of manually updating the voiceprint embedded codes, giving the user a better use experience; meanwhile, different versions of the voiceprint recognition service process can be flexibly selected according to the service environment, further improving the user experience.
The above description is intended to be illustrative only, and should not be taken as limiting the scope of the invention, which is intended to include all such modifications, equivalents, and improvements as fall within the true spirit and scope of the invention.

Claims (12)

1. A voiceprint information processing method, characterized in that the method comprises:
acquiring first audio information of a first target object through a first terminal;
analyzing the first audio information to obtain first voiceprint information of the first target object;
processing the first voiceprint information through a voiceprint identification service process to obtain a first voiceprint embedded code corresponding to the first voiceprint information;
searching a voiceprint embedded code set matched with the identification code of the first terminal according to the equipment identification code of the first terminal and the version identification code of the voiceprint recognition service process;
calculating the similarity between the first voiceprint embedded code and each voiceprint embedded code in the set of voiceprint embedded codes;
when the similarity is greater than or equal to a similarity threshold value, calculating a reliability parameter corresponding to the first voiceprint information;
and updating the voiceprint embedded codes in the voiceprint embedded code set according to the reliability parameters corresponding to the first voiceprint information.
2. The method of claim 1, further comprising:
configuring a first terminal identification code for the first terminal;
when the audio information of a second target object is acquired through the first terminal, configuring a second target object identification code for the second target object through the voiceprint recognition service process, and establishing a mapping relation between the second target object identification code and the first terminal identification code;
acquiring at least 2 pieces of audio information of the second target object through the voiceprint recognition service process;
calculating second voiceprint embedded codes corresponding to the at least 2 pieces of audio information respectively through the voiceprint recognition service process;
calculating the average value of second voiceprint embedded codes corresponding to the at least 2 pieces of audio information respectively;
and storing the average value of the second voiceprint embedded codes in the voiceprint embedded code set, and marking the average value through the first terminal identification code, the second target object identification code and the version identification code of the voiceprint recognition service process.
3. The method according to claim 1, wherein the processing the first voiceprint information by the voiceprint recognition service process to obtain a first voiceprint embedded code corresponding to the first voiceprint information includes:
triggering a voice information recognition model through the voiceprint recognition service process;
extracting pinyin corresponding to each character in the first voiceprint information and intonation corresponding to each character in the first voiceprint information through the voice information recognition model according to the recognition environment of the target voice information;
determining a single character pronunciation feature vector of each character level in the first voiceprint information according to the pinyin corresponding to each character in the first voiceprint information and the intonation corresponding to each character in the first voiceprint information;
combining the single character pronunciation feature vectors corresponding to each character in the first voiceprint information through a word and pronunciation encoder network in the voice information recognition model to form a statement-level pronunciation feature vector;
and taking the pronunciation feature vector of the statement level as the first voiceprint embedded code.
4. The method of claim 1, further comprising:
performing channel conversion processing on the first audio information to form single-channel audio data;
based on a windowing function corresponding to the voice information recognition model, carrying out short-time Fourier transform on the single-channel audio data to form a corresponding Mel frequency spectrogram;
determining a corresponding input triple sample based on the Mel frequency spectrogram, and inputting the input triple sample into a voice information recognition model;
processing the input triple samples in a crossed manner through the convolution layer and the maximum pooling layer of the voice information recognition model to obtain the down-sampling results of different input triple samples;
and normalizing the downsampling results of the different input triple samples through a full connection layer of the voice information identification model to obtain the first voiceprint embedded code.
5. The method according to claim 1, wherein when the similarity is greater than or equal to a similarity threshold, calculating a reliability parameter corresponding to the first voiceprint information includes:
acquiring a first text recognition result corresponding to the first audio information, identification information of the first target object and similarity corresponding to the first voiceprint information;
searching a text frequency information list of the first target object according to the identification information of the first target object;
calculating the number of times the first text recognition result appears in the text frequency information list;
calculating the sum of the frequency in the text frequency information list;
and when the similarity is greater than or equal to a similarity threshold and the number of times the first text recognition result appears in the text frequency information list is greater than or equal to 1, taking the ratio of the number of appearances to the frequency sum as the reliability parameter corresponding to the first voiceprint information.
6. The method of claim 5, further comprising:
and when the reliability parameter corresponding to the first voiceprint information is 0, updating the text frequency information list of the first target object and updating the number of times the first text recognition result appears in the text frequency information list.
7. The method according to claim 1, wherein said updating the voiceprint embedded codes in the voiceprint embedded code set according to the reliability parameter corresponding to the first voiceprint information comprises:
when the reliability parameter corresponding to the first voiceprint information is greater than or equal to a reliability parameter threshold value, and the number of characters of a first text recognition result corresponding to the first audio information is greater than or equal to a character number threshold value, acquiring an original voiceprint embedded code corresponding to the similarity;
acquiring a first weight parameter of the original voiceprint embedded code and a second weight parameter of the first voiceprint embedded code;
calculating a weighted average of the original voiceprint embedded code and the first voiceprint embedded code according to the first weight parameter and the second weight parameter to obtain a third voiceprint embedded code;
and updating the original voiceprint embedded codes in the voiceprint embedded code set through the third voiceprint embedded code.
8. The method of claim 1, further comprising:
storing the set of voiceprint embedded codes in a cloud server;
when first audio information of a first target object is acquired through a second terminal, detecting the processing authority of the second terminal;
and when the second terminal meets the requirement of the processing authority, storing the voiceprint embedded code set in the cloud server into the second terminal.
9. A voiceprint information processing apparatus, characterized in that the apparatus comprises:
the information transmission module is used for acquiring first audio information of a first target object through a first terminal;
the information processing module is used for analyzing the first audio information to obtain first voiceprint information of the first target object;
the information processing module is used for processing the first voiceprint information through a voiceprint recognition service process to obtain a first voiceprint embedded code corresponding to the first voiceprint information;
the information processing module is used for searching a voiceprint embedded code set matched with the identification code of the first terminal according to the equipment identification code of the first terminal and the version identification code of the voiceprint recognition service process;
the information processing module is used for calculating the similarity between the first voiceprint embedded code and each voiceprint embedded code in the voiceprint embedded code set;
the information processing module is used for calculating a reliability parameter corresponding to the first voiceprint information when the similarity is greater than or equal to a similarity threshold;
and the information processing module is used for updating the voiceprint embedded codes in the voiceprint embedded code set according to the reliability parameters corresponding to the first voiceprint information.
10. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the voiceprint information processing method of any one of claims 1 to 8 when executing the executable instructions stored by the memory.
11. A computer-readable storage medium storing executable instructions, wherein the executable instructions, when executed by a processor, implement the voiceprint information processing method of any one of claims 1 to 8.
12. A computer program product comprising a computer program or instructions for implementing a voiceprint information processing method according to any one of claims 1 to 8 when executed by a processor.
CN202210657997.2A 2022-06-10 2022-06-10 Voiceprint information processing method and device, electronic equipment and storage medium Pending CN115171660A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210657997.2A CN115171660A (en) 2022-06-10 2022-06-10 Voiceprint information processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210657997.2A CN115171660A (en) 2022-06-10 2022-06-10 Voiceprint information processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115171660A true CN115171660A (en) 2022-10-11

Family

ID=83484904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210657997.2A Pending CN115171660A (en) 2022-06-10 2022-06-10 Voiceprint information processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115171660A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116992422A (en) * 2023-09-05 2023-11-03 腾讯科技(深圳)有限公司 Biological data processing method, apparatus, device and computer readable storage medium
CN116992422B (en) * 2023-09-05 2024-01-09 腾讯科技(深圳)有限公司 Biological data processing method, apparatus, device and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN111739521B (en) Electronic equipment awakening method and device, electronic equipment and storage medium
CN111312245B (en) Voice response method, device and storage medium
JP2023542685A (en) Speech recognition method, speech recognition device, computer equipment, and computer program
CN111968618A (en) Speech synthesis method and device
CN113643693B (en) Acoustic model conditioned on sound characteristics
US20230035504A1 (en) Audio processing method and apparatus, vocoder, electronic device, computer-readable storage medium, and computer program product
CN113393828A (en) Training method of voice synthesis model, and voice synthesis method and device
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
CN115171660A (en) Voiceprint information processing method and device, electronic equipment and storage medium
CN115132195B (en) Voice wakeup method, device, equipment, storage medium and program product
CN116884386A (en) Speech synthesis method, speech synthesis apparatus, device, and storage medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113539239B (en) Voice conversion method and device, storage medium and electronic equipment
CN114446268B (en) Audio data processing method, device, electronic equipment, medium and program product
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
CN116978359A (en) Phoneme recognition method, device, electronic equipment and storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN113012681A (en) Awakening voice synthesis method based on awakening voice model and application awakening method
CN114283828A (en) Training method of voice noise reduction model, voice scoring method, device and medium
CN112150103A (en) Schedule setting method and device and storage medium
CN116959421B (en) Method and device for processing audio data, audio data processing equipment and medium
CN113555006B (en) Voice information identification method and device, electronic equipment and storage medium
CN112420022B (en) Noise extraction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination