CA3062359A1 - Method and device for voice information verification, electronic equipment and storage medium - Google Patents

Method and device for voice information verification, electronic equipment and storage medium

Info

Publication number
CA3062359A1
Authority
CA
Canada
Prior art keywords
voice
verified
verification code
verification
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3062359A
Other languages
French (fr)
Inventor
Huan CHEN
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
10353744 Canada Ltd
Original Assignee
10353744 Canada Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 10353744 Canada Ltd filed Critical 10353744 Canada Ltd
Publication of CA3062359A1
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/22 Interactive procedures; Man-machine interfaces
    • G10L 17/24 Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/18 Artificial neural networks; Connectionist approaches
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H04L 63/083 Network architectures or network communication protocols for network security for authentication of entities using passwords
    • H04L 63/0838 Network architectures or network communication protocols for network security for authentication of entities using passwords using one-time-passwords
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H04L 63/0861 Network architectures or network communication protocols for network security for authentication of entities using biometrical features, e.g. fingerprint, retina-scan

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Provided are an apparatus, a method, an electronic device and a computer readable memory for voice information verification to automatically judge whether a voice is an authentic human voice or a machine-generated voice. The method comprises: obtaining a voice verification code that has undergone disturbance processing and a voice to be verified sent by a user terminal regarding the verification code; judging whether the voice to be verified is a non-machine voice or not in response to matching between the voice to be verified and the voice verification code; and judging that the voice to be verified has passed the verification if the voice to be verified is a non-machine voice. The disclosed embodiments can effectively verify the identity of voice information to reduce attacks on a system by machine-synthesized voices and thereby improve system security.

Description

METHOD AND DEVICE FOR VOICE INFORMATION VERIFICATION, ELECTRONIC EQUIPMENT AND STORAGE MEDIUM
TECHNICAL FIELD
[0001] The present disclosure relates to the technical field of artificial intelligence, particularly to a voice information verification method, a voice information verification apparatus, an electronic device and a computer readable memory medium.
BACKGROUND ART
[0002] With the development of computer technology, malicious behaviors such as account theft, fake accounts and online scams have appeared on many Apps (Applications) and websites, posing a security risk to the normal operation of the Apps and the websites. Therefore, it is necessary to verify the identity of every kind of account to stop stolen accounts or fake accounts from committing network misconduct and to safeguard the interests of real users.
[0003] It needs to be explained that the information disclosed in the foregoing Background Art section is only intended to deepen understanding of the background of the present disclosure, so it may include information that does not constitute prior art known to those of ordinary skill in the art.
SUMMARY
[0004] The present disclosure provides a voice information verification method, a voice information verification apparatus, an electronic device and a computer readable memory medium, thereby overcoming at least to some extent the problem of low security of identity verification owing to defects of the prior art.
[0005] Other features and advantages of the present disclosure will be evident through the following detailed description, or partially learnt through practice of the present disclosure.
[0006] According to one aspect of the present disclosure, a voice information verification method is provided, comprising: obtaining a verification code and a voice to be verified sent by a user terminal regarding the verification code; judging whether the voice to be verified is of non-machine or not in response to match between the voice to be verified and the verification code; and judging that the voice to be verified has passed the verification if the voice to be verified is of non-machine.
[0007] In an exemplary embodiment of the present disclosure, judging whether the voice to be verified is of non-machine or not if the voice to be verified is matched with the verification code comprises: converting the voice to be verified into a target spectrogram if the voice to be verified is matched with the verification code; analyzing the target spectrogram by a convolutional neural network model to obtain a man machine classification result of the target spectrogram; and determining whether the voice to be verified is of non-machine or not based on the man machine classification result.
[0008] In an exemplary embodiment of the present disclosure, obtaining a verification code comprises: obtaining a verification code that has undergone disturbance processing.
[0009] In an exemplary embodiment of the present disclosure, obtaining a verification code that has undergone disturbance processing comprises: obtaining a preset text and converting the preset text into a target picture; and processing the target picture by one or more of deformation, discoloring, fuzzification and increase of noisy points to generate an image verification code that has undergone disturbance processing.
[0010] In an exemplary embodiment of the present disclosure, before response to match between the voice to be verified and the verification code, the method further comprises:
detecting a length of the voice to be verified; judging that the voice to be verified has failed the verification and returning a failure prompt message to the user terminal if the length of the voice to be verified is smaller than a preset length; and converting the voice to be verified into a text to be verified and matching the text to be verified with the verification code if the length of the voice to be verified is greater than or equal to the preset length.
[0011] In an exemplary embodiment of the present disclosure, converting the voice to be verified into a text to be verified comprises: pre-processing the voice to be verified by one or more of sound track conversion, pre-emphasis, speech enhancement and blank deletion; and converting the pre-processed voice to be verified into the text to be verified by a time delay neural network model.
[0012] In an exemplary embodiment of the present disclosure, the verification code comprises a text verification code, matching the text to be verified with the verification code comprises: matching the text to be verified with the text verification code to obtain a typo proportion of the text to be verified; and judging whether the voice to be verified is of non-machine or not if the voice to be verified is matched with the verification code comprises:
judging whether the voice to be verified is of non-machine or not if the typo proportion is lower than a threshold of matching.
[0013] According to one aspect of the present disclosure, a voice information verification apparatus is provided, comprising: an information obtaining module, for obtaining a verification code and a voice to be verified sent by a user terminal regarding the verification code; a man machine judgment module, for judging whether the voice to be verified is of non-machine or not in response to match between the voice to be verified and the verification code; and a voice verification module, for determining that the voice to be verified has passed the verification if the voice to be verified is of non-machine.
[0014] According to one aspect of the present disclosure, an electronic device is provided, comprising: a processor; and a memory, for storing executable instructions of the processor;
wherein the processor is configured to execute the method in any of the foregoing descriptions by executing the executable instructions.
[0015] According to one aspect of the present disclosure, a computer readable memory medium is provided, and stores a computer program. When being executed by a processor, the computer program achieves the method in any of the foregoing descriptions.
[0016] The exemplary embodiments of the present disclosure have the following beneficial effects:
[0017] A verification code is matched with a voice to be verified sent by a user terminal regarding the verification code. Whether the successfully matched voice to be verified is of non-machine or not is judged. If the judgment result is of non-machine, a verification result of the voice to be verified is obtained. On the one hand, the match between a voice to be verified and a verification code and the verification of man machine judgment can verify the consistency of the user identity and, meanwhile, reduce attacks of machine synthesized voices on a system and improve the security of the voice information verification method. On the other hand, in this exemplary embodiment, the user does not need to input voice registration information in advance, nor does the system need to store the user's voice print characteristic information, thereby reducing the use cost of the voice information verification method, simplifying the user's operation flow, reducing the resource occupation of the system and raising efficiency.
[0018] It should be understood that the foregoing general description and subsequent detailed description are only exemplary and explanatory and cannot limit the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
[0019] The accompanying drawings here are incorporated into the Description, form a part of the Description, show embodiments that conform to the present disclosure and are used together with the Description to explain the theory of the present disclosure.
Obviously, the accompanying drawings in the following description show only some embodiments of the present disclosure, and those of ordinary skill in the art can obtain other accompanying drawings from these accompanying drawings without creative work.
[0020] FIG. 1 shows a schematic diagram of system architecture of an operating environment of this exemplary embodiment;
[0021] FIG. 2 schematically shows a step diagram of a voice information verification method in this exemplary embodiment;
[0022] FIG. 3 schematically shows a sub-flow chart of a voice information verification method in this exemplary embodiment;
[0023] FIG. 4 schematically shows a flow chart of a method for obtaining a verification code in this exemplary embodiment;
[0024] FIG. 5 schematically shows a sub-flow chart of an alternative voice information verification method in this exemplary embodiment;
[0025] FIG. 6 schematically shows a flow chart of a voice information verification method in this exemplary embodiment;
[0026] FIG. 7 schematically shows a structural block diagram of a voice information verification apparatus in this exemplary embodiment;
[0027] FIG. 8 schematically shows an electronic device for achieving the foregoing method in this exemplary embodiment;
[0028] FIG. 9 schematically shows a computer readable memory medium for achieving the foregoing method in this exemplary embodiment.
DETAILED DESCRIPTION
[0029] Exemplary implementation manners are now described more comprehensively with reference to the accompanying drawings. However, exemplary implementation manners can be implemented in various forms and should not be understood as being limited to the examples set forth herein; on the contrary, the provision of these implementation manners makes the present disclosure more comprehensive and complete and comprehensively conveys the conception of the exemplary implementation manners to those skilled in the art. The described characteristics, structures or features can be combined in one or more implementation manners in any appropriate way.
[0030] In a solution of the prior art, voice verification is conducted by matching the user's own voice print characteristics. The user needs to input voice registration information during registration, from which voice print information is extracted, so that when the user logs on later, the voice verification information is matched against the voice print characteristics. However, in this solution, if the voice registration information or voice verification information is synthesized by a machine, for example voice verification information synthesized from original voices of a user after the user's daily voice information is crawled, the server can hardly recognize it. What is more, the user's voice print characteristic information needs to be stored, increasing the resource occupation of the system.
[0031] In view of the foregoing problem, an exemplary embodiment of the present disclosure firstly provides a voice information verification method. The voice information verification method can be applied in a scenario of verifying the identity of a user by a verification code when the user logs on to an App or a web page, confirms a payment, modifies a password or carries out other sensitive operations.
[0032] FIG. 1 shows a schematic diagram of system architecture of an operating environment of this exemplary embodiment. As shown in FIG. 1, this system 110 may comprise a user terminal 111, a network 112 and a server 113. Here, the user terminal 111 can be various kinds of terminal devices used by the user such as personal computer, tablet computer, smart phone or wearable device, and sends a collected voice to be verified regarding a verification code to the server 113 via the network 112; the server 113 can obtain the voice to be verified from the user terminal 111 and perform voice information verification of the voice to be verified.
[0033] It should be understood that the number of devices shown in FIG. 1 is exemplary only. According to the actual needs, any number of the user terminals 111 or the networks 112 can be set and the server 113 can also be a server cluster composed of a plurality of servers.
[0034] Based on the foregoing description, the method in this exemplary embodiment can be applied on the server 113 shown in FIG. 1.
[0035] Below, this exemplary embodiment is further described with reference to FIG. 2. As shown in FIG. 2, the voice information verification method may comprise the following steps S210 to S230:
[0036] Step S210, obtaining a verification code as well as a voice to be verified sent by a user terminal regarding the verification code.
[0037] Here, the verification code can be a text verification code, a voice verification code, an image verification code, etc., and can be generated by a specific program of a server or obtained from another server. The voice to be verified is a voice sent by a user according to the content of a verification code, such as a voice recorded and uploaded when the user reads the content of the verification code. The user terminal can collect a voice to be verified regarding a verification code through a trigger operation of a user, and then send the voice to be verified to a server. For example, when a voice to be verified is input, a user can click on a specific control to input this voice to be verified, or select an existing sound recording file as the voice to be verified.
[0038] Step S220, judging whether the voice to be verified is of non-machine or not in response to match between the voice to be verified and the verification code.
[0039] In this exemplary embodiment, whether the voice to be verified matches the verification code can be detected first. For various types of verification codes, there may be a plurality of methods for matching a voice to be verified with a verification code. For example, if the verification code is a text verification code, matching can be conducted by comparison with the text in the text verification code; if the verification code is a voice verification code, matching can be conducted by comparing the voice verification code with the voice print characteristics of the voice to be verified. At step S220, the match between the voice to be verified and the verification code is equivalent to the primary verification; it is mainly used to verify identity consistency and can recognize circumstances where the user's account is stolen and is logged in or operated on a terminal other than the bound terminal.
[0040] In this exemplary embodiment, if the foregoing primary verification is successful, the voice to be verified is subjected to man machine judgment, which is called the secondary judgment. When judging whether a voice to be verified is of machine or non-machine, the voice print characteristics of the voice to be verified can be analyzed and matched, for example by comparing whether the voice print characteristics are the same as or similar to the voice print characteristics of a machine synthesized sound, or whether the voice print characteristics match the voice print characteristics in the user's sound database.
[0041] Step S230, determining that the voice to be verified has passed the verification if the voice to be verified is of non-machine.
[0042] The secondary verification of a voice to be verified can eliminate machine synthesized voices. It can thereby be determined that the user sending the voice to be verified has received a correct verification code and is a real user, and that the voice to be verified eventually passes the verification.
[0043] In an exemplary embodiment, step S220 may comprise the following steps:
[0044] converting the voice to be verified into a target spectrogram if the voice to be verified is matched with the verification code;
[0045] analyzing the target spectrogram by a convolutional neural network model to obtain a man machine classification result of the target spectrogram; and
[0046] determining whether the voice to be verified is of non-machine or not based on the man machine classification result.
[0047] The target spectrogram is a spectrogram corresponding to the voice to be verified.
Typically there are two types of spectrograms: the first is an instantaneous spectrogram with frequency as the x-coordinate and signal energy as the y-coordinate, and a voice to be verified can be converted into a sequence of a plurality of instantaneous spectrograms; the other is a continuous spectrogram with time as the x-coordinate and superposed frequency as the y-coordinate, and a voice to be verified can be converted into one continuous spectrogram. After a server obtains a voice to be verified, the server can convert the voice to be verified into a voice file in an appropriate format, perform time domain analysis or frequency domain analysis of the voice file, and draw a target spectrogram. Clearly, the target spectrogram contains the spectrum features of the voice to be verified, and the spectrum features reflect the voice print characteristics of the voice to be verified. Therefore, through processing by a convolutional neural network model, the voice print characteristics can be recognized and compared with the voice print characteristics of machine synthesized voices learnt during model training or the voice print characteristics of real human voices, and whether the target spectrogram is of machine or non-machine, i.e., the man machine classification result of the voice to be verified, can be judged, thereby completing the secondary verification.
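By way of illustration only, the following Python sketch shows one possible realization of this step: the waveform is converted into a log-magnitude spectrogram with a short-time Fourier transform and fed to a small convolutional classifier that outputs a machine judgment probability. The window sizes, network architecture and function names are assumptions made for the sketch and are not taken from the disclosure.

    import numpy as np
    import torch
    import torch.nn as nn
    from scipy.signal import stft

    def voice_to_spectrogram(samples: np.ndarray, sample_rate: int) -> torch.Tensor:
        # Short-time Fourier transform, then log-magnitude: shape (freq_bins, frames)
        _, _, z = stft(samples, fs=sample_rate, nperseg=512, noverlap=256)
        log_mag = np.log1p(np.abs(z))
        # Shape (1, 1, F, T): one sample, one channel, treated as an image by the CNN
        return torch.from_numpy(log_mag).float().unsqueeze(0).unsqueeze(0)

    class SpectrogramCNN(nn.Module):
        # Tiny binary classifier over spectrogram "images"
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(32, 1)

        def forward(self, x):
            # Returns the machine judgment probability in [0, 1]
            return torch.sigmoid(self.classifier(self.features(x).flatten(1)))

    # Usage: prob = SpectrogramCNN()(voice_to_spectrogram(samples, 16000)).item()
    # The voice is treated as non-machine when prob does not exceed the probability threshold.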
[0048] FIG. 3 shows a flow chart of a voice information verification method in this exemplary embodiment. As shown in FIG. 3, after a voice to be verified is obtained, matching with a verification code can be conducted first. If the matching fails, a result of verification failure can be output directly. If the matching is successful, the voice to be verified is converted into a target spectrogram, a machine judgment probability of the target spectrogram is then output by a convolutional neural network, and whether the probability is greater than a probability threshold is judged. If the probability is greater than the probability threshold, the verification fails; if the probability is not greater than the probability threshold, a result of final success in verification can be output.
[0049] Based on the foregoing description, this exemplary embodiment matches a verification code with a voice to be verified sent by a user terminal regarding the verification code, judges whether the successfully matched voice to be verified is of non-machine or not, and obtains a verification result of the voice to be verified if the judgment result is of non-machine.
On the one hand, the match between a voice to be verified and a verification code and the verification of man machine judgment can verify the consistency of the user identity and, meanwhile, reduce attacks of machine synthesized voices on a system and improve the security of the voice information verification method. On the other hand, in this exemplary embodiment, the user does not need to input voice registration information in advance, nor does the system need to store the user's voice print characteristic information, thereby reducing the use cost of the voice information verification method, simplifying the user's operation flow, reducing the resource occupation of the system and raising efficiency.
[0050] In an exemplary embodiment, the foregoing convolutional neural network model can implement training through the following steps:
[0051] obtaining a plurality of sample voices and category tags of the sample voices.
[0052] converting the sample voices into sample spectrograms.
[0053] carrying out training using the sample spectrograms and the category tags, and obtaining a convolutional neural network model.
[0054] Here, the sample voices can be historical verification voices, and may also include some machine synthesized voices; the category tags can be manually labeled tags indicating whether each sample voice is "machine" or "non-machine".
[0055] The convolutional neural network model takes sample spectrograms as input, and outputs classification results of the sample spectrograms. By adjusting the model parameters, the output classification results can be made increasingly close to the category tags. During training, the sample spectrograms and the category tags can be divided into a training set and a verification set (e.g., at a ratio of 8:2). Here, the training set is used to train the model; setting an initial learning rate, a preset number of learning iterations and a learning rate reduction percentage can make the model converge faster. The verification set is used to verify the training effect of the model. If the accuracy of the model on the verification set reaches a specific standard, it can be considered that the training is completed.
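A minimal training sketch for such a classifier is given below, assuming PyTorch, spectrogram tensors already prepared as in the earlier sketch, and an 8:2 training/verification split; the batch size, learning rate schedule and 95 % accuracy criterion are illustrative assumptions rather than values from the disclosure.

    import torch
    from torch.utils.data import DataLoader, TensorDataset, random_split

    def train_classifier(model, spectrograms, labels, epochs=20, lr=1e-3, target_acc=0.95):
        # Split (spectrogram, tag) pairs into training and verification sets at a ratio of 8:2
        dataset = TensorDataset(spectrograms, labels.float())
        n_train = int(0.8 * len(dataset))
        train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
        train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
        val_loader = DataLoader(val_set, batch_size=32)

        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        # Reduce the learning rate by a fixed percentage after a preset number of epochs
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
        loss_fn = torch.nn.BCELoss()

        for _ in range(epochs):
            model.train()
            for x, y in train_loader:
                optimizer.zero_grad()
                loss_fn(model(x).squeeze(1), y).backward()
                optimizer.step()
            scheduler.step()

            # Verify the training effect on the verification set
            model.eval()
            correct = total = 0
            with torch.no_grad():
                for x, y in val_loader:
                    pred = (model(x).squeeze(1) > 0.5).float()
                    correct += (pred == y).sum().item()
                    total += y.numel()
            if total and correct / total >= target_acc:
                break  # accuracy on the verification set reached the chosen standard
        return model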
[0056] Through analysis of the target spectrogram by the convolutional neural network model that has completed training, a man machine classification result of the target spectrogram can be obtained, i.e., the target spectrogram is of machine or non-machine.
[0057] The man machine classification of the target spectrogram of a voice to be verified by a convolutional neural network model is equivalent to the secondary verification of the voice to be verified and is mainly used to recognize machine synthesized voices.
Therefore, when the man machine classification result of the target spectrogram is of non-machine, it can be considered that the voice to be verified is a real human voice, thereby finally judging that the voice to be verified has passed the verification.
[0058] In an exemplary embodiment, the man machine classification of a target spectrogram can be achieved through a machine judgment probability output by a convolutional neural network model. When a convolutional neural network model analyzes a target spectrogram, it can output a machine judgment probability based on the level of similarity between the target spectrogram and a characteristic spectrogram of machine synthesized voices. The higher the probability, the more likely the target spectrogram is of machine. If the machine judgment probability is greater than a probability threshold, it can be considered that the man machine classification result of the target spectrogram is machine. The probability threshold can be set and adjusted according to training and actual application conditions to accurately differentiate target spectrograms of machine and non-machine.
[0059] In an exemplary embodiment, obtaining a verification code at step S210 may comprise: obtaining a verification code that has undergone disturbance processing.
[0060] Here, disturbance processing can be the addition of information that disturbs recognition of the verification code, thereby raising the difficulty of machine recognition. For example: if the verification code is a text verification code, disturbance processing can be the addition of a disturbing character to the text verification code; if the verification code is a voice verification code, disturbance processing can be the addition of disturbing music and sound waves at a specific frequency to the voice verification code; if the verification code is an image verification code, disturbance processing can be a change of the image color of the image verification code, a reduction of the image resolution or a disorganization of the image composition sequence of the image verification code; if the verification code is a text message verification code, disturbance processing can be the addition of out-of-order letters or numerical strings.
[0061] In an exemplary embodiment, obtaining a verification code added with disturbance information may comprise the following steps:
[0062] obtaining a preset text and converting the preset text into a target picture; and
[0063] processing the target picture by one or more of deformation, discoloring, fuzzification and increase of noisy points to generate an image verification code that has undergone disturbance processing.
[0064] FIG. 4 shows a flow chart of obtaining a verification code as described above. As shown in FIG. 4, firstly, a preset text is obtained. The preset text can be obtained from a text library and may also be a generated random number, etc. The obtained preset text is converted into a target picture. The target picture can be a picture containing the text information of a verification code, such as a picture containing the characters of a text verification code, or a picture of image verification code fragments. During conversion into a target picture, word art of the verification code can be generated and inlaid in a specific background to obtain the target picture; alternatively, as shown in FIG. 4, the characters of the verification code are split, a word picture is generated for each character, the sequence is shuffled, and the word pictures are spliced into a target picture. The generated target picture can be cropped as needed and then undergo disturbance processing. The deformation can be stretching, rotation, liquefaction and other processing of the picture; the discoloring refers to changing the color gradation parameters of the target picture, for example converting the target picture into a grey-scale map and adjusting the brightness distribution; the fuzzification refers to reducing the resolution of the target picture, for example conducting local pixel compression of the target picture; and the increase of noisy points can be the addition of grains that hinder recognition to the target picture. In addition, disturbing elements can also be randomly added to the target picture, for example horizontal or vertical lines. Those skilled in the art will readily appreciate that other disturbance processing methods should also be included in the scope of protection of the present disclosure.
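The image disturbance step can be illustrated with the Pillow library as follows; the font, jitter ranges, number of noisy points and blur radius are arbitrary assumptions chosen for the sketch, not parameters from the disclosure.

    import random
    from PIL import Image, ImageDraw, ImageFilter, ImageFont

    def make_disturbed_code(preset_text: str, size=(160, 60)) -> Image.Image:
        img = Image.new("RGB", size, "white")
        draw = ImageDraw.Draw(img)
        font = ImageFont.load_default()

        # Render each character with vertical jitter and a random dark shade
        # (a light form of deformation plus discoloring)
        x = 10
        for ch in preset_text:
            y = 20 + random.randint(-8, 8)
            draw.text((x, y), ch, fill=(random.randint(0, 120),) * 3, font=font)
            x += 18

        # Disturbing elements: random lines across the picture
        for _ in range(3):
            p1 = (random.randrange(size[0]), random.randrange(size[1]))
            p2 = (random.randrange(size[0]), random.randrange(size[1]))
            draw.line([p1, p2], fill="gray", width=1)

        # Noisy points scattered over the picture
        for _ in range(200):
            draw.point((random.randrange(size[0]), random.randrange(size[1])),
                       fill=(random.randint(0, 255),) * 3)

        # Fuzzification: a mild blur lowers the effective resolution
        return img.filter(ImageFilter.GaussianBlur(radius=1))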
[0065] In an exemplary embodiment, before response to match between the voice to be verified and the verification code, the voice information verification method may further comprise the following steps:
[0066] detecting a length of the voice to be verified.
[0067] judging that the voice to be verified has failed the verification and returning a failure prompt message to the user terminal if the length of the voice to be verified is smaller than a preset length; and
[0068] converting the voice to be verified into a text to be verified and matching the text to be verified with the verification code if the length of the voice to be verified is greater than or equal to a preset length.
[0069] Here, the length of the voice to be verified can be a duration or a file size of the voice to be verified. Considering that an excessively short voice to be verified might be caused by a wrong input, leading to meaningless verification and affecting the work efficiency of the server, a preset length can be set as a criterion for the length of a voice to be verified to filter out the foregoing condition. For example, if the preset length is set to 2 s, voices of less than 2 s will be automatically filtered out by the server and a failure prompt message will be returned; for another example, if the preset length is set to 5 KB, then voices to be verified of less than 5 KB will be automatically filtered out by the server.
[0070] After it is detected that the length of a voice to be verified is greater than or equal to the preset length, the voice to be verified can be converted into a text to be verified. Here, the text to be verified is information to be verified in text form corresponding to the content of the voice to be verified, and the conversion can be achieved by a voice-to-text conversion tool.
[0071] The foregoing preset length can be considered as a lower limit of the length of a voice to be verified. It needs to be supplemented that there might be sound recording files wrongly input or wrongly sent by the user, resulting in excessively complex or lengthy voices to be verified being received by the server. This may cause processing difficulty for the server and add meaningless work. Therefore, an upper limit of the length of a voice to be verified can also be set to filter out the foregoing excessively complex or lengthy voices to be verified. When the length of a voice exceeds the upper limit of voice length, a result of verification failure can be output, and a failure prompt message is returned. Further, an upper limit of the recording time of a voice to be verified can be set in a client program. When the upper limit is exceeded, voice input is terminated automatically, and the input voice is sent to the server for voice information verification.
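A trivial sketch of this length check follows; the 2-second lower limit comes from the example in the text, while the 30-second upper limit is an assumed value.

    MIN_DURATION_S = 2.0    # lower limit, from the 2 s example above
    MAX_DURATION_S = 30.0   # assumed upper limit for overly long recordings

    def voice_length_acceptable(duration_s: float) -> bool:
        # Outside this range the server outputs a verification failure
        # and returns a failure prompt message to the user terminal.
        return MIN_DURATION_S <= duration_s <= MAX_DURATION_S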
[0072] Further, a voice to be verified can be converted into a text to be verified through the following steps:
[0073] pre-processing the voice to be verified by one or more of sound track conversion, pre-emphasis, speech enhancement and blank deletion; and
[0074] converting the pre-processed voice to be verified into the text to be verified by a time delay neural network model.
[0075] Considering that the voice to be verified obtained by the server might contain noise and other influencing factors, posing a barrier to the processing and recognition of the voice to be verified, the voice to be verified can be preprocessed when it is converted into a text to be verified. Here, the sound track conversion means that if the voice characteristics to be extracted from a voice to be verified do not differentiate sound tracks, a multi-track voice to be verified can be converted into a single track; the pre-emphasis can preserve the voice signals of a voice to be verified within a specific frequency range, facilitating the server's analysis of the information of the voice to be verified; the speech enhancement can filter noise out of a voice to be verified to extract pure voice signals; the blank deletion refers to removing fragments without actual signals from a voice to be verified. For example, when a user is interrupted or is thinking during input of a voice to be verified, the voice to be verified might contain noise or other blank and invalid voice fragments. The blank deletion can also reduce the duration and file size of the voice to be verified and lower the processing volume of the server.
When outside noise is too loud, for example when a user is in heavy foot traffic or is affected by other sound devices, voice activity detection can be used, before a voice to be verified is processed, to judge whether the voice to be verified contains any voice information that can be verified. If not, a result of verification failure can be output. Further, those skilled in the art will readily appreciate that other preprocessing methods should also be included in the scope of protection of the present disclosure.
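The pre-processing steps can be sketched in Python as follows; speech enhancement (denoising) is omitted, and the pre-emphasis coefficient, frame length and silence threshold are assumed values rather than values from the disclosure.

    import numpy as np

    def preprocess_voice(samples: np.ndarray, alpha: float = 0.97,
                         frame_len: int = 400, silence_threshold: float = 1e-3) -> np.ndarray:
        # Sound track conversion: average a multi-track signal down to a single track
        if samples.ndim == 2:
            samples = samples.mean(axis=1)

        # Pre-emphasis: y[n] = x[n] - alpha * x[n-1] boosts higher-frequency content
        emphasized = np.append(samples[0], samples[1:] - alpha * samples[:-1])

        # Blank deletion: drop frames whose mean energy falls below the silence threshold
        frames = [emphasized[i:i + frame_len]
                  for i in range(0, len(emphasized) - frame_len, frame_len)]
        kept = [f for f in frames if np.mean(f ** 2) > silence_threshold]
        return np.concatenate(kept) if kept else emphasized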
[0076] After preprocessing, a voice to be verified can be input to a time delay neural network model. The time delay neural network model can frame and recognize the voice to be verified and eventually convert the voice to be verified into a corresponding text to be verified.
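The disclosure does not detail the time delay neural network itself; purely as an illustration of the layer type it refers to, a single time-delay layer can be expressed in PyTorch as a dilated 1-D convolution over acoustic feature frames. The dimensions below are assumptions, and a complete speech-to-text model would stack such layers behind a feature extractor and in front of an output decoder.

    import torch
    import torch.nn as nn

    class TDNNLayer(nn.Module):
        # One time-delay layer: a dilated 1-D convolution over acoustic feature frames
        def __init__(self, in_dim: int, out_dim: int, context: int = 5, dilation: int = 1):
            super().__init__()
            self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=context, dilation=dilation)
            self.act = nn.ReLU()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, feature_dim, num_frames) -> (batch, out_dim, reduced num_frames)
            return self.act(self.conv(x))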
[0077] In an exemplary embodiment, the foregoing verification code may comprise a text verification code and matching the text to be verified with the verification code may comprise the following steps:
[0078] matching the text to be verified with the text verification code to obtain a typo proportion of the text to be verified.
[0079] Accordingly, step S220 may comprise the following steps:
[0080] judging whether the voice to be verified is of non-machine or not if the typo proportion is lower than a threshold of matching.
[0081] Here, the typo proportion can be a percentage of the number of characters that are not matched successfully to the total number of characters. There may be a plurality of matching methods, such as forward matching, reverse matching or two-way matching. The matching results likely vary with the selected matching methods. For example, if a text verification code is "it is a lovely day today", a text to be verified is "the gas is lovely today"
and forward matching is adopted, the typo proportion of the matching result will be higher than the typo proportion from two-way matching. Under normal conditions, two-way matching can raise the accuracy of the server's matching calculation, but it also adds to the server's computing tasks and has higher requirements for the server's configuration. Forward matching and reverse matching have lower requirements for the computing power of the server and a smaller processing volume. An appropriate matching method can be selected according to the actual condition.
[0082] The threshold of matching can be a set acceptable upper limit of the typo proportion of matching. Considering that there might be errors when a server converts a voice to be verified into a text to be verified and matches the text to be verified with a text verification code, a certain inconsistency between a text to be verified and a text verification code is allowed. A threshold of matching can be set according to the actual condition. For example, when a verification code is shorter or there is less disturbing information, a higher threshold of matching can be set; when a verification code is longer or the content is more complex, a lower threshold of matching can be set. This embodiment does not specifically limit the threshold of matching. When the typo proportion is lower than the threshold of matching, it can be considered that the text to be verified is successfully matched with the verification code, i.e., the voice to be verified is successfully matched with the verification code and passes the primary verification, and the secondary verification can be started, i.e., man machine judgment is conducted on the voice to be verified.
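A simple forward-matching sketch of the typo proportion computation and the threshold comparison is given below; the 0.2 threshold is an assumed value, since the disclosure leaves the threshold to the actual condition.

    def typo_proportion(text_to_verify: str, text_code: str) -> float:
        # Forward matching: compare characters position by position;
        # any unmatched tail of the longer string also counts as typos.
        mismatches = sum(1 for a, b in zip(text_to_verify, text_code) if a != b)
        mismatches += abs(len(text_to_verify) - len(text_code))
        return mismatches / max(len(text_code), len(text_to_verify), 1)

    MATCHING_THRESHOLD = 0.2  # assumed value; tune per code length and disturbance level

    def primary_verification_passed(text_to_verify: str, text_code: str) -> bool:
        return typo_proportion(text_to_verify, text_code) < MATCHING_THRESHOLD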
[0083] FIG. 5 shows a sub-flow chart of a voice information verification method in this exemplary embodiment. As shown in FIG. 5, after a voice to be verified is obtained, whether the length of the obtained voice to be verified reaches the preset length or not can be judged at first.
If the length of the obtained voice to be verified is less than the preset length, a result of verification failure can be output directly; if the length of the obtained voice to be verified is up to the preset length, the voice to be verified can be preprocessed, and converted by a time delay neural network model into a text to be verified. Then the text to be verified is matched with a verification code, and whether the typo proportion is lower than the threshold of matching or not is judged. If the typo proportion is not lower than the threshold of matching, a result of verification failure can be output; if the typo proportion is lower than the threshold of matching, the voice to be verified can be converted into a target spectrogram, for subsequent verification.
[0084] In an exemplary embodiment, converting a voice to be verified into a target spectrogram may comprise: converting the voice to be verified into a target spectrogram by the short-time Fourier transform.
[0085] The short-time Fourier transform can convert complex sound signals to the frequency domain, so that the time domain signal characteristics of a voice to be verified can be analyzed according to the spectral characteristics. For example, a plurality of instantaneous fragments can be extracted from the voice to be verified and arranged in time order. Each fragment is converted into a frequency-energy image, thereby obtaining an ordered sequence of a plurality of target spectrograms, which can subsequently be processed using a convolutional neural network model.
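A short sketch of this short-time Fourier transform step, producing a time-ordered sequence of instantaneous frequency-energy spectra as described above, follows; the window length and overlap are assumed values.

    import numpy as np
    from scipy.signal import stft

    def instantaneous_spectra(samples: np.ndarray, sample_rate: int):
        # One frequency-energy curve per short frame, ordered by time
        freqs, _, z = stft(samples, fs=sample_rate, nperseg=512, noverlap=384)
        energy = np.abs(z) ** 2            # shape (freq_bins, frames)
        return [(freqs, energy[:, i]) for i in range(energy.shape[1])]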
[0086] FIG. 6 shows a flow chart of a voice information verification method in this exemplary embodiment. As shown in FIG. 6, a server can obtain a preset text from a text library and obtain a verification code through disturbance processing. A user terminal obtains a voice to be verified input by a user, preprocesses the voice to be verified, then converts the voice to be verified into a text to be verified by a time delay neural network model, and calculates a typo proportion to match the text to be verified and the verification code. If the matching is successful, the voice to be verified can be converted into a target spectrogram by the short-time Fourier transform. Further, the target spectrogram is analyzed by a convolutional neural network model that has been trained and tested to obtain a man machine classification result. If the man machine classification result is of non-machine, a result of final successful verification is output.
[0087] Exemplary embodiments of the present disclosure further provide a voice information verification apparatus. As shown in FIG. 7, the apparatus 700 may comprise: an information obtaining module 710, for obtaining a verification code and a voice to be verified sent by a user terminal regarding the verification code; a man machine judgment module 720, for judging whether the voice to be verified is of non-machine or not in response to match between the voice to be verified and the verification code; and a voice verification module 730, for determining that the voice to be verified has passed the verification if the voice to be verified is of non-machine.
[0088] In an exemplary embodiment, the man machine judgment module may comprise: a spectrogram conversion unit, for converting a voice to be verified into a target spectrogram if the voice to be verified is matched with a verification code; a spectrogram analysis unit, for analyzing the target spectrogram by a convolutional neural network model to obtain a man machine classification result of the target spectrogram; and a voice judgment unit, for determining whether the voice to be verified is of non-machine or not based on the man machine classification result.
[0089] In an exemplary embodiment, the information obtaining module may be further used for obtaining a verification code that has undergone disturbance processing.
[0090] In an exemplary embodiment, the information obtaining module may comprise: a text obtaining unit, for obtaining a preset text and converting the preset text into a target picture;
a picture processing unit, for processing the target picture by one or more of deformation, discoloring, fuzzification and increase of noisy points to generate an image verification code that has undergone disturbance processing.
[0091] In an exemplary embodiment, the voice information verification apparatus may further comprise: a voice length detection unit, for detecting the length of a voice to be verified;
a preset length judgment unit, for judging that the voice to be verified has failed the verification and returning a failure prompt message to the user terminal if the length of the voice to be verified is smaller than a preset length, and for converting the voice to be verified into a text to be verified and matching the text to be verified with the verification code if the length of the voice to be verified is greater than or equal to the preset length.
[0092] In an exemplary embodiment, the preset length judgment unit may further comprise:
a preprocessing subunit, for pre-processing the voice to be verified by one or more of sound track conversion, pre-emphasis, speech enhancement and blank deletion; and a model processing unit, for converting the pre-processed voice to be verified into the text to be verified by a time delay neural network model.
[0093] In an exemplary embodiment, the verification code may further comprise a text verification code, and the preset length judgment unit may further comprise: a text matching unit, for matching the text to be verified with the text verification code to obtain a typo proportion of the text to be verified; and a man machine judgment module, which can be used to judge whether the voice to be verified is of non-machine or not if the typo proportion is lower than a threshold of matching.
[0094] Details of the foregoing modules / units have been elaborated in embodiments of the corresponding method, so they are not described here again.
[0095] In an exemplary embodiment of the present disclosure, an electronic device that can achieve the foregoing method is further provided.
[0096] Those skilled in the art should understand that various aspects of the present disclosure can be implemented as systems, methods or program products, so the various aspects of the present disclosure can be specifically implemented in the following forms: an implementation manner of complete hardware, an implementation manner of complete software (including firmware, microcode, etc.), or an implementation manner combining hardware and software. They are collectively referred to as "circuits", "modules" or "systems" here.
[0097] Below an electronic device 800 according to this exemplary embodiment of the present disclosure is described with reference to FIG. 8. The electronic device 800 shown in FIG.
8 is only an example and should not limit in any way the functions and scope of use of embodiments of the present invention.
[0098] As shown in FIG. 8, the electronic device 800 is manifested in the form of a general-purpose computing device. The components of the electronic device 800 may include, without limitation: the foregoing at least one processing unit 810, the foregoing at least one memory unit 820, a bus 830 connecting various system components (including the memory unit 820 and the processing unit 810), and a display unit 840.
[0099] The memory unit stores program codes, which can be executed by the processing unit 810 so that the processing unit 810 implements the steps according to exemplary implementation manners of the present invention as described in the foregoing "exemplary method" section of the Description. For example, the processing unit 810 can implement steps S210 to S230 as shown in FIG. 2.
[0100] The memory unit 820 may comprise a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM) 821 and/or a cache memory unit 822, and may further comprise a read only memory unit (ROM) 823.
[0101] The memory unit 820 may further comprise a program / utility tool 824 comprising a group of (at least one) program modules 825. Such program module 825 includes without limitation: operating system, one or more application programs, other program modules and program data. One or a combination of these examples may include realization of a network environment.
[0102] The bus 830 can represent one or more of several kinds of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
[0103] Alternatively, the electronic device 800 can also communicate with one or more peripherals 1000 (such as keyboard, pointing device, and Bluetooth device), and can also communicate with one or more communication devices that enable a user to interact with the electronic device 800, and/or communicate with any device (such as router and modem) that enables the electronic device 800 to communicate with one or more other computing devices.
Such communication can proceed via an I/O interface 850. Further, the electronic device 800 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network, such as the Internet) via a network adapter 860. As shown in the figure, the network adapter 860 communicates with other modules of the electronic device 800 via the bus 830. It should be understood that, though not shown in the figure, other hardware and/or software modules can be used in combination with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, peripheral disk drive arrays, RAID systems, tape drives and data backup storage systems.
[0104] From description of the foregoing implementation manners, those skilled in the art can easily understand that the implementation manners described here can be achieved by means of software or by means of software combined with necessary hardware.
Therefore, a technical solution according to an implementation manner of the present disclosure can be embodied in form of a software product. The software product can be stored in a non-volatile memory medium (it can be a CD-ROM, a USB disk, or a mobile hard disk, or others) or in a network and comprises a number of instructions so that a computing device (it can be a personal computer, a server, a network device, or others) implements the foregoing method according to an exemplary embodiment of the present disclosure.
[0105] In the exemplary embodiments of the present disclosure, a computer readable memory medium is further provided and stores program products that can achieve the foregoing method of the Description. In some possible implementation manners, all aspects of the present invention can also be achieved as a form of program product, including program codes. When the program product runs on a terminal device, the program codes are used to make the terminal device implement the steps according to exemplary implementation manners of the present invention as described in the foregoing "exemplary method" section of the Description.
[0106] As shown in FIG. 9, a program product 900 used to achieve the foregoing method according to exemplary embodiments of the present invention is described. The program product 900 may adopt portable compact disk read only memory (CD-ROM) and include program codes and can run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited to this. In this document, a readable memory medium can be any tangible medium containing or storing programs, which can be used by an instruction execution system, apparatus or device or can be used in combination with an instruction execution system, apparatus or device.
[0107] The program product can adopt any combination of one or more readable media. The readable media can be readable signal media or readable memory media. The readable memory media, for example, can be, without limitation, electric, magnetic, optical, electromagnetic, infrared or semiconductor systems, apparatuses or devices, or any combination thereof. More concrete examples of the readable memory media (a non-exhaustive list) include:
electric connection with one or more conductors, portable disk, hard disk, random access memory (RAM), read only memory (ROM), erasable programmable read-only memory (EPROM
or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical memory module, magnetic memory module or any appropriate combination thereof.
[0108] Computer readable signal media may include data signals transmitted in base band or as part of carriers, and carry readable program codes. Such transmitted data signals can adopt various forms, including but not limited to: electromagnetic signals, optical signals or any appropriate combination thereof. The readable signal media may also be any readable media except readable memory media. Such readable media can send, transmit or transfer programs used by an instruction execution system, apparatus or device or used in combination with an instruction execution system, apparatus or device.
[0109] The program codes contained on a readable medium can be transferred by any medium, including but not limited to: wireless, wired, optical cable and RF, or any appropriate combination of the foregoing.
[0110] Any combination of one or more programming languages can be used to write program codes used to implement operations of the present disclosure. The programming languages include object-oriented programming languages, such as Java and C++, and further include conventional procedural programming languages, such as the "C" language or similar programming languages. Program codes can be totally executed on a user's computing device, partially executed on a user's device, executed as an independent software package, partially executed on a user's computing device and partially executed on a remote computing device, or totally executed on a remote computing device or a server. Under a circumstance involving a remote computing device, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computing device (e.g., connected via the Internet through an Internet service provider).
[0111] Further, the foregoing accompanying drawings are only a schematic description of the processing included in the method according to exemplary embodiments of the present disclosure, and are not intended to set any limitation. It can be easily understood that the processing shown in the foregoing accompanying drawings does not indicate or limit a time order of the processing. Further, it can also be easily understood that the processing may be executed synchronously or asynchronously, for example in a plurality of modules.
[0112] It should be noted that although the detailed description above has mentioned several modules or units of the device used to implement actions, such division is not compulsory. In fact, according to the exemplary embodiments of the present disclosure, the characteristics and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the characteristics and functions of one module or unit described above can be further divided and embodied by a plurality of modules or units.
[0113] After considering the Description and practicing the invention disclosed here, those skilled in the art can easily think of other embodiments of the present disclosure. The present disclosure is intended to cover any modification, use or adaptive change of the present disclosure.
These modifications, uses or adaptive changes follow the general principle of the present disclosure and include the common general knowledge or conventional technical means in the technical field not disclosed by the present disclosure. The Description and embodiments are exemplary only. The real scope and spirit of the present disclosure are stated by the following Claims.
[0114] It should be understood that the present disclosure is not limited to the exact structure described above and shown in the accompanying drawings, and may have various modifications and changes without departing from its scope. The scope of the present disclosure is limited only by the attached Claims.

Claims (10)

What is claimed is:
1. A voice information verification method, comprising:
obtaining a verification code and a voice to be verified sent by a user terminal regarding the verification code;
judging whether the voice to be verified is of non-machine or not in response to match between the voice to be verified and the verification code; and determining that the voice to be verified has passed the verification if the voice to be verified is of non-machine.
2. The method according to claim 1, wherein judging whether the voice to be verified is of non-machine or not comprises:
converting the voice to be verified into a target spectrogram;
analyzing the target spectrogram by a convolutional neural network model to obtain a man machine classification result of the target spectrogram; and determining whether the voice to be verified is of non-machine or not based on the man machine classification result.
3. The method according to claim 1, wherein obtaining a verification code comprises:
obtaining a verification code that has undergone disturbance processing.
4. The method according to claim 3, wherein obtaining a verification code that has undergone disturbance processing comprises:
obtaining a preset text and converting the preset text into a target picture;
and processing the target picture by one or more of deformation, discoloring, fuzzification and increase of noisy points to generate an image verification code that has undergone disturbance processing.
5. The method according to claim 1, wherein before response to match between the voice to be verified and the verification code, the method further comprises:
detecting a length of the voice to be verified;
judging that the voice to be verified has failed the verification and returning a failure prompt message to the user terminal if the length of the voice to be verified is smaller than a preset length; and converting the voice to be verified into a text to be verified and matching the text to be verified with the verification code if the length of the voice to be verified is greater than or equal to the preset length.
6. The method according to claim 5, wherein the converting the voice to be verified into a text to be verified comprises:
pre-processing the voice to be verified by one or more of sound track conversion, pre-emphasis, speech enhancement and blank deletion; and converting the pre-processed voice to be verified into the text to be verified by a time delay neural network model.
7. The method according to claim 5, wherein the verification code comprises a text verification code, and matching the text to be verified with the verification code comprises:
matching the text to be verified with the text verification code to obtain a typo proportion of the text to be verified; and judging whether the voice to be verified is of non-machine origin or not in response to a match between the voice to be verified and the verification code comprises:
judging whether the voice to be verified is of non-machine origin or not if the typo proportion is lower than a matching threshold.
8. A voice information verification apparatus, comprising:
an information obtaining module, for obtaining a verification code and a voice to be verified sent by a user terminal regarding the verification code;
a man-machine judgment module, for judging whether the voice to be verified is of non-machine origin or not in response to a match between the voice to be verified and the verification code; and a voice verification module, for determining that the voice to be verified has passed the verification if the voice to be verified is of non-machine origin.
9. An electronic device, comprising:
a processor; and a memory, for storing executable instructions of the processor;
the processor being configured to execute the method in any of claims 1-7 by executing the executable instructions.
10. A computer-readable storage medium, storing a computer program, wherein, when executed by a processor, the computer program implements the method of any one of claims 1-7.
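
For purposes of illustration only, the following is a minimal sketch, in Python, of the verification flow recited in claims 1, 5 and 7: the voice to be verified is rejected if it is shorter than a preset length, otherwise it is converted into a text to be verified, matched against the text verification code by its typo proportion, and only then submitted to the man-machine judgment. All names and numeric values (speech_to_text, is_human_voice, MIN_DURATION_S, TYPO_THRESHOLD) are hypothetical and are not taken from the disclosure.

```python
# Minimal sketch of the verification flow in claims 1, 5 and 7.
# speech_to_text and is_human_voice are hypothetical callables supplied by
# the caller; the thresholds below are assumed example values.

MIN_DURATION_S = 1.0     # assumed preset length of the voice to be verified
TYPO_THRESHOLD = 0.2     # assumed matching threshold for the typo proportion


def typo_proportion(recognized: str, verification_code: str) -> float:
    """Proportion of mismatched characters, based on Levenshtein distance."""
    m, n = len(recognized), len(verification_code)
    if n == 0:
        return 1.0
    prev = list(range(n + 1))            # classic dynamic-programming table
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if recognized[i - 1] == verification_code[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n] / n


def verify(voice_duration_s: float, voice_waveform, verification_code: str,
           speech_to_text, is_human_voice) -> bool:
    # Claim 5: fail the verification if the voice is shorter than the preset length.
    if voice_duration_s < MIN_DURATION_S:
        return False
    # Claim 5: convert the voice to be verified into a text to be verified.
    recognized = speech_to_text(voice_waveform)
    # Claim 7: match the text against the text verification code.
    if typo_proportion(recognized, verification_code) >= TYPO_THRESHOLD:
        return False
    # Claim 1: only a voice of non-machine origin passes the verification.
    return is_human_voice(voice_waveform)
```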
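Similarly, a minimal sketch of the man-machine judgment of claim 2 is given below: the voice to be verified is converted into a target spectrogram and classified by a convolutional neural network model. The network layout, hyper-parameters and the use of PyTorch are illustrative assumptions; the disclosure does not prescribe a specific architecture.

```python
# Minimal sketch of claim 2: spectrogram conversion plus a small CNN that
# outputs a man-machine classification result (0 = machine, 1 = non-machine).
import torch
import torch.nn as nn


def to_spectrogram(waveform: torch.Tensor, n_fft: int = 512,
                   hop_length: int = 128) -> torch.Tensor:
    """Log-magnitude spectrogram of a mono waveform, shape (1, freq, time)."""
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs().log1p().unsqueeze(0)


class ManMachineCNN(nn.Module):
    """Binary classifier over the target spectrogram."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 2)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(spec).flatten(1))


def is_human_voice(waveform: torch.Tensor, model: ManMachineCNN) -> bool:
    spec = to_spectrogram(waveform).unsqueeze(0)   # add batch dimension
    with torch.no_grad():
        logits = model(spec)
    return bool(logits.argmax(dim=1).item() == 1)
```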
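A possible, non-limiting illustration of claims 3-4 follows: a preset text is converted into a target picture and then disturbed by deformation (a small rotation), fuzzification (Gaussian blur) and an increase of noisy points, using the Pillow imaging library. Picture size, colours and disturbance strengths are assumptions made only for this example.

```python
# Minimal sketch of claims 3-4: generate an image verification code that has
# undergone disturbance processing.
import random
from PIL import Image, ImageDraw, ImageFilter


def make_disturbed_captcha(preset_text: str, size=(160, 60)) -> Image.Image:
    # Convert the preset text into a target picture.
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    draw.text((10, 20), preset_text, fill="black")

    # Deformation: a small random rotation.
    img = img.rotate(random.uniform(-8, 8), fillcolor="white")

    # Fuzzification: a light Gaussian blur.
    img = img.filter(ImageFilter.GaussianBlur(radius=0.8))

    # Increase of noisy points: scatter random dark pixels over the picture.
    draw = ImageDraw.Draw(img)
    for _ in range(200):
        x = random.randrange(size[0])
        y = random.randrange(size[1])
        draw.point((x, y), fill=(random.randint(0, 120),) * 3)
    return img
```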
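Finally, a brief sketch of two of the pre-processing steps listed in claim 6, pre-emphasis and blank deletion, is shown below; the coefficient and energy threshold are assumed values, and the subsequent conversion to text by a time-delay neural network model is not reproduced here.

```python
# Minimal sketch of two pre-processing steps from claim 6.
import numpy as np


def pre_emphasis(signal: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1]."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])


def delete_blanks(signal: np.ndarray, frame_len: int = 400,
                  energy_threshold: float = 1e-4) -> np.ndarray:
    """Delete (near-)silent frames whose mean energy is below the threshold."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal), frame_len)]
    kept = [f for f in frames if np.mean(f ** 2) >= energy_threshold]
    return np.concatenate(kept) if kept else signal
```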
CA3062359A 2018-12-13 2019-11-22 Method and device for voice information verification, electronic equipment and storage medium Pending CA3062359A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811527293.3 2018-12-13
CN201811527293.3A CN109493872B (en) 2018-12-13 2018-12-13 Voice information verification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CA3062359A1 true CA3062359A1 (en) 2020-06-13

Family

ID=65710079

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3062359A Pending CA3062359A1 (en) 2018-12-13 2019-11-22 Method and device for voice information verification, electronic equipment and storage medium

Country Status (3)

Country Link
CN (1) CN109493872B (en)
CA (1) CA3062359A1 (en)
WO (1) WO2020119448A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493872B (en) * 2018-12-13 2021-12-14 北京三快在线科技有限公司 Voice information verification method and device, electronic equipment and storage medium
CN110689885B (en) * 2019-09-18 2023-05-23 平安科技(深圳)有限公司 Machine synthesized voice recognition method, device, storage medium and electronic equipment
CN111128115B (en) * 2019-12-11 2022-09-16 北京声智科技有限公司 Information verification method and device, electronic equipment and storage medium
CN111613230A (en) * 2020-06-24 2020-09-01 泰康保险集团股份有限公司 Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
CN112308379A (en) * 2020-09-30 2021-02-02 音数汇元(上海)智能科技有限公司 Service order evaluation method, device, equipment and storage medium for home care
CN112309404B (en) * 2020-10-28 2024-01-19 平安科技(深圳)有限公司 Machine voice authentication method, device, equipment and storage medium
CN112927413A (en) * 2021-01-20 2021-06-08 联仁健康医疗大数据科技股份有限公司 Medical registration method, medical registration device, medical registration equipment and storage medium
CN112507316A (en) * 2021-02-08 2021-03-16 北京远鉴信息技术有限公司 User verification method and device, readable storage medium and electronic equipment
CN112948788B (en) * 2021-04-13 2024-05-31 杭州网易智企科技有限公司 Voice verification method, device, computing equipment and medium
CN115273859B (en) * 2021-04-30 2024-05-28 清华大学 Safety testing method and device for voice verification device
CN113836509B (en) * 2021-09-23 2024-03-01 百度在线网络技术(北京)有限公司 Information acquisition method, device, electronic equipment and storage medium
CN114422161B (en) * 2021-11-08 2024-04-16 江苏鑫合易家信息技术有限责任公司 Method and system for generating personalized scene verification code according to meteorological information
CN114627881B (en) * 2022-04-01 2022-10-04 上海财安金融服务集团股份有限公司 Voice call processing method and system based on artificial intelligence
CN115514550A (en) * 2022-09-15 2022-12-23 中国电信股份有限公司 Interactive verification method and device based on voice tone and electronic equipment
CN115565539B (en) * 2022-11-21 2023-02-07 中网道科技集团股份有限公司 Data processing method for realizing self-help correction terminal anti-counterfeiting identity verification

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8494854B2 (en) * 2008-06-23 2013-07-23 John Nicholas and Kristin Gross CAPTCHA using challenges optimized for distinguishing between humans and machines
CN104036780B (en) * 2013-03-05 2017-05-24 阿里巴巴集团控股有限公司 Man-machine identification method and system
US9263055B2 (en) * 2013-04-10 2016-02-16 Google Inc. Systems and methods for three-dimensional audio CAPTCHA
CN104104664A (en) * 2013-04-11 2014-10-15 腾讯科技(深圳)有限公司 Method, server, client and system for verifying verification code
US9843583B2 (en) * 2014-09-05 2017-12-12 Excalibur Ip, Llc System and method for authentication across devices
CN106330915A (en) * 2016-08-25 2017-01-11 百度在线网络技术(北京)有限公司 Voice verification processing method and device
KR101957277B1 (en) * 2017-02-14 2019-03-12 윤종식 System and method for coding with voice recognition
CN108877813A (en) * 2017-05-12 2018-11-23 阿里巴巴集团控股有限公司 The methods, devices and systems of man-machine identification
CN107147499A (en) * 2017-05-17 2017-09-08 刘光明 The method and system verified using phonetic entry
CN109493872B (en) * 2018-12-13 2021-12-14 北京三快在线科技有限公司 Voice information verification method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN109493872A (en) 2019-03-19
WO2020119448A1 (en) 2020-06-18
CN109493872B (en) 2021-12-14

Similar Documents

Publication Publication Date Title
CA3062359A1 (en) Method and device for voice information verification, electronic equipment and storage medium
US10957339B2 (en) Speaker recognition method and apparatus, computer device and computer-readable medium
JP6096333B2 (en) Method, apparatus and system for verifying payment
US9300672B2 (en) Managing user access to query results
CN109801638B (en) Voice verification method, device, computer equipment and storage medium
CN113724695B (en) Electronic medical record generation method, device, equipment and medium based on artificial intelligence
CN107103903A (en) Acoustic training model method, device and storage medium based on artificial intelligence
WO2018087764A1 (en) Phonetically configurable means of user authentication
CN109726372B (en) Method and device for generating work order based on call records and computer readable medium
CN109462482B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and computer readable storage medium
CN114627856A (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
CN111598122B (en) Data verification method and device, electronic equipment and storage medium
CN110704618A (en) Method and device for determining standard problem corresponding to dialogue data
CN114494935A (en) Video information processing method and device, electronic equipment and medium
CN112270325A (en) Character verification code recognition model training method, recognition method, system, device and medium
CN110232927B (en) Speaker verification anti-spoofing method and device
CN110704614B (en) Information processing method and device for predicting user group type in application
CN112201254A (en) Non-sensitive voice authentication method, device, equipment and storage medium
WO2020252880A1 (en) Reverse turing verification method and apparatus, storage medium, and electronic device
US20220366901A1 (en) Intelligent Interactive Voice Recognition System
CN115376498A (en) Speech recognition method, model training method, device, medium, and electronic apparatus
CN110910905A (en) Mute point detection method and device, storage medium and electronic equipment
CN114913871A (en) Target object classification method, system, electronic device and storage medium
CN114398487A (en) Method, device, equipment and storage medium for outputting reference information of online session
CN110083807B (en) Contract modification influence automatic prediction method, device, medium and electronic equipment

Legal Events

Date Code Title Description
EEER Examination request

Effective date: 20220916
