CN114726635B - Authority verification method and device, electronic equipment and medium - Google Patents


Info

Publication number
CN114726635B
CN114726635B (application CN202210395953.7A)
Authority
CN
China
Prior art keywords
call
user
text
sentence
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210395953.7A
Other languages
Chinese (zh)
Other versions
CN114726635A (en)
Inventor
蒋超
吕一宁
何选基
黄辰
张伟鹏
李岩
丁科
李宇飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202210395953.7A priority Critical patent/CN114726635B/en
Publication of CN114726635A publication Critical patent/CN114726635A/en
Application granted granted Critical
Publication of CN114726635B publication Critical patent/CN114726635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F 21/31 User authentication
    • G06F 21/32 User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 Network architectures or network communication protocols for network security
    • H04L 63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H04L 63/0861 Network architectures or network communication protocols for network security for authentication of entities using biometrical features, e.g. fingerprint, retina-scan

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the present application provide a permission verification method and apparatus, an electronic device, and a medium, aimed at increasing the speed of verifying the permission of the user executing an order. The method comprises the following steps: converting call voice generated during execution of the order into call text; determining, based on the call text, the user role of the call user corresponding to each voice channel in the call voice; recognizing voice data of a target call user in the call voice to obtain biometric features of the target call user; and verifying the target call user's permission to execute the order based on the biometric features and the user information of the system executing user.

Description

Authority verification method and device, electronic equipment and medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a method and apparatus for verifying authority, an electronic device, and a medium.
Background
With the development of Internet technology, some service platforms provide online services, some of which must be carried out offline by a person. For example, in a logistics delivery scenario, a dispatcher is required to deliver the items of a delivery order offline; the dispatcher is then the executing user of the order. In a ride-hailing scenario, a driver is required to drive to pick up the passengers; the driver is then the executing user of the ride-hailing order.
To ensure the quality and security of the offline service, the permission of the executing user is generally verified, i.e., whether the executing user is one permitted by the platform. However, current permission verification schemes are inefficient and require a long wait for the verification result, while scenarios with strict timeliness requirements, such as takeaway delivery, demand that an accurate permission verification result be obtained quickly.
Disclosure of Invention
To solve these problems, the present application provides a permission verification method and apparatus, an electronic device, and a medium, which aim to reduce the consumption of storage resources and to obtain an accurate permission verification result in a short time.
In a first aspect of an embodiment of the present disclosure, there is provided a rights verification method, including:
converting call voice generated in the process of executing the order into call text;
determining the user role of the call user corresponding to each voice channel in the call voice based on the call text;
identifying voice data of a target call user in the call voice to obtain biological characteristics of the target call user; the user roles corresponding to the target call users are order execution roles; the biometric features are used to characterize the age and/or sex of the user;
verifying the target call user's permission to execute the order based on the biometric features of the target call user and the user information of the system executing user; wherein the system executing user is the user to whom the system assigned the order.
Optionally, based on the call text, determining a user role of the call user corresponding to each voice channel in the call voice includes:
obtaining sentence vectors corresponding to each text sentence in the call text; wherein, a text sentence corresponds to a voice channel;
determining the probability that the call user corresponding to each voice channel respectively belongs to a plurality of preset user roles based on sentence vectors corresponding to all text sentences belonging to the same voice channel in the call text;
and determining the user roles of the call users corresponding to each voice channel based on the probabilities that the call users corresponding to each voice channel belong to a plurality of preset user roles respectively.
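The three steps above can be sketched as follows. The averaging of sentence vectors and the role prototype vectors are illustrative assumptions; the patent does not specify the scoring model:

```python
import math

def softmax(scores):
    """Normalize raw scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def channel_role_probs(sentence_vectors_by_channel, role_prototypes):
    """For each voice channel, average its sentence vectors and score the
    result against a prototype vector per preset user role (hypothetical)."""
    probs = {}
    for channel, vectors in sentence_vectors_by_channel.items():
        dim = len(vectors[0])
        mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        scores = [sum(m * p for m, p in zip(mean, proto))
                  for proto in role_prototypes.values()]
        probs[channel] = dict(zip(role_prototypes.keys(), softmax(scores)))
    return probs

# Toy example: two channels, two preset roles.
vectors = {
    "01": [[1.0, 0.0], [0.9, 0.1]],   # sentence vectors of channel 01
    "02": [[0.0, 1.0], [0.2, 0.8]],   # sentence vectors of channel 02
}
prototypes = {"dispatcher": [1.0, 0.0], "customer": [0.0, 1.0]}
probs = channel_role_probs(vectors, prototypes)
# Final role per channel: the role with the highest probability.
roles = {ch: max(p, key=p.get) for ch, p in probs.items()}
```

In a real system the prototype scoring would be replaced by a trained classifier head; the per-channel grouping and argmax over role probabilities follow the steps above.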
Optionally, the method further comprises:
determining a conversation scene corresponding to the conversation voice based on sentence vectors corresponding to all text sentences in the conversation text; the conversation scene is a scene of conversation of different user roles;
Based on the call scene, verifying the user roles of the call users corresponding to each voice channel;
identifying voice data of a target call user in the call voice comprises the following steps:
and under the condition that the verification is passed, identifying the voice data of the target call user in the call voice.
Optionally, obtaining a sentence vector corresponding to each text sentence in the call text includes:
acquiring a word vector of each word in each text sentence;
and acquiring sentence vectors corresponding to each text sentence based on the word vectors of each word in each text sentence.
Optionally, based on the word vector of each word in each text sentence, obtaining a sentence vector corresponding to each text sentence includes:
inputting word vectors corresponding to each text sentence into a pre-trained attention model, and obtaining sentence vectors corresponding to each text sentence output by the attention model; the attention model is used for predicting the attention score between the word vector corresponding to each text sentence and the word vectors of other text sentences, and obtaining the sentence vector corresponding to each text sentence based on the attention score;
or, splicing word vectors corresponding to each text sentence to obtain sentence vectors corresponding to each text sentence.
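The two alternatives can be illustrated with the sketch below: a plain scaled dot-product self-attention with mean pooling as a toy stand-in for the pre-trained attention model, and the simpler concatenation variant. Neither is the patent's actual model:

```python
import math

def self_attention_sentence_vector(word_vectors):
    """Scaled dot-product self-attention over a sentence's word vectors,
    followed by mean pooling into one sentence vector (illustrative)."""
    d = len(word_vectors[0])
    scale = math.sqrt(d)
    attended = []
    for q in word_vectors:
        # Attention scores of this word against every word in the sentence.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in word_vectors]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        attended.append([sum(w * v[i] for w, v in zip(weights, word_vectors))
                         for i in range(d)])
    return [sum(row[i] for row in attended) / len(attended) for i in range(d)]

def concat_sentence_vector(word_vectors):
    """The second alternative: splice (concatenate) the word vectors."""
    return [x for v in word_vectors for x in v]

words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy word vectors
sv_attn = self_attention_sentence_vector(words)
sv_cat = concat_sentence_vector(words)
```

Note the dimensional difference: attention pooling keeps the word-vector dimension, while concatenation grows with sentence length, which is why the concatenation variant needs the fusion steps described next.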
Optionally, under the condition that word vectors corresponding to each text sentence are spliced to obtain sentence vectors corresponding to each text sentence, determining probabilities that call users corresponding to each voice channel respectively belong to a plurality of preset user roles based on sentence vectors corresponding to text sentences belonging to the same voice channel in the call text, including:
respectively fusing sentence vectors of each text sentence belonging to the same voice channel in the call text with sentence vectors of text sentences belonging to other voice channels to obtain a first fusion vector corresponding to each text sentence belonging to the same voice channel;
fusing sentence vectors of all text sentences belonging to the same voice channel in the call text to obtain a second fusion vector;
based on the first fusion vector and the second fusion vector belonging to the same voice channel, the probability that the call user corresponding to each voice channel respectively belongs to a plurality of preset user roles is determined.
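A minimal sketch of the first and second fusion vectors follows; element-wise averaging is used as the fusion operation, which is an assumption, since the patent leaves the fusion function unspecified:

```python
def fuse(a, b):
    """Element-wise mean as an illustrative fusion operation."""
    return [(x + y) / 2 for x, y in zip(a, b)]

def channel_fusion_vectors(own_sentences, other_sentences):
    """First fusion vectors: each own-channel sentence vector fused with the
    mean of the other channel's sentence vectors.  Second fusion vector: all
    own-channel sentence vectors fused together."""
    dim = len(own_sentences[0])
    other_mean = [sum(v[i] for v in other_sentences) / len(other_sentences)
                  for i in range(dim)]
    first = [fuse(s, other_mean) for s in own_sentences]
    second = [sum(v[i] for v in own_sentences) / len(own_sentences)
              for i in range(dim)]
    return first, second

own = [[1.0, 0.0], [0.5, 0.5]]    # sentence vectors of one voice channel
other = [[0.0, 1.0]]              # sentence vectors of the other channel
first, second = channel_fusion_vectors(own, other)
```

The role probabilities would then be predicted from `first` and `second` jointly, e.g. by a classifier over their concatenation.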
Optionally, determining the call scene corresponding to the call voice based on sentence vectors corresponding to all text sentences in the call text includes:
determining fusion weights corresponding to sentence vectors corresponding to each text sentence in the call text; the fusion weight characterizes the importance of the text sentence in the call text;
Based on the fusion weight, fusing sentence vectors corresponding to all text sentences in the call text to obtain a third fusion vector;
and determining a conversation scene corresponding to the conversation voice based on the third fusion vector.
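The weighted fusion and scene determination can be sketched as below; the scene prototypes and the dot-product scorer are illustrative assumptions, not the patent's model:

```python
def fuse_with_weights(sentence_vectors, weights):
    """Weighted sum of all sentence vectors in the call text, yielding the
    third fusion vector; weights encode each sentence's importance."""
    total = sum(weights)
    norm = [w / total for w in weights]
    dim = len(sentence_vectors[0])
    return [sum(w * v[i] for w, v in zip(norm, sentence_vectors))
            for i in range(dim)]

def classify_scene(fusion_vector, scene_prototypes):
    """Pick the scene whose prototype vector best matches (dot product) the
    fusion vector; prototype vectors here are hypothetical."""
    return max(scene_prototypes,
               key=lambda s: sum(a * b
                                 for a, b in zip(fusion_vector,
                                                 scene_prototypes[s])))

sentences = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
weights = [0.5, 0.4, 0.1]          # fusion weights per sentence
fused = fuse_with_weights(sentences, weights)
scene = classify_scene(fused, {"delivery_call": [1.0, 0.0],
                               "ride_call": [0.0, 1.0]})
```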
Optionally, based on the call text, determining a user role of the call user corresponding to each voice channel in the call voice includes:
adding placeholders in each text sentence in the call text, wherein different placeholders are added in the text sentences corresponding to different voice channels;
inputting the call text added with the placeholder into a role determination model;
acquiring the user role of the call user corresponding to each voice channel output by the role determination model; wherein the user role corresponding to each voice channel is determined by the role determination model based on the vectors at that channel's placeholders, and the vector at each placeholder is the sentence vector of the corresponding text sentence.
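The placeholder step can be illustrated as follows; the token format `[ROLE_<channel>]` is hypothetical, the point being only that each voice channel gets its own distinct placeholder:

```python
def add_placeholders(call_text):
    """Prepend a channel-specific placeholder token to every text sentence,
    so a role determination model can read a role vector off each
    placeholder position."""
    tagged = []
    for channel, sentence in call_text:
        tagged.append(f"[ROLE_{channel}] {sentence}")
    return tagged

# (channel identifier, text sentence) pairs from a converted call.
call_text = [("01", "Hello, your takeaway is here"),
             ("02", "OK, I will come down"),
             ("01", "I will put it at the gate")]
tagged = add_placeholders(call_text)
```

The tagged text would then be fed to the role determination model, which outputs one role per distinct placeholder.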
Optionally, the biometric feature is gender, and recognizing the voice data of the target call user in the call voice to obtain the biometric feature of the target call user includes:
filtering out segments of silence and noise in the voice data to obtain clean voice data;
extracting acoustic features from the clean voice data;
inputting the acoustic features into a gender prediction model to obtain the gender of the target call user;
wherein the gender prediction model is obtained by training a preset model with acoustic features of users of different genders as training samples.
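The three-stage pipeline (silence/noise filtering, feature extraction, gender prediction) can be sketched as below. The energy threshold, the toy average-energy feature, and the threshold classifier are all illustrative stand-ins; a real system would use a voice activity detector, MFCC or filterbank features, and a trained neural model:

```python
def filter_silence(frames, energy_threshold=0.01):
    """Drop frames whose mean energy falls below a threshold, a crude
    stand-in for the silence/noise filtering step."""
    def energy(frame):
        return sum(s * s for s in frame) / len(frame)
    return [f for f in frames if energy(f) > energy_threshold]

def mean_feature(frames):
    """Toy acoustic feature: average frame energy of the clean voice data."""
    return sum(sum(s * s for s in f) / len(f) for f in frames) / len(frames)

def predict_gender(feature, threshold=0.5):
    """Placeholder for the trained gender prediction model: a single
    threshold on the toy feature (purely illustrative)."""
    return "female" if feature > threshold else "male"

# Toy audio: two silent/noisy frames and two voiced frames.
frames = [[0.0, 0.0], [0.9, 0.8], [0.001, 0.0], [1.0, 0.7]]
clean = filter_silence(frames)
gender = predict_gender(mean_feature(clean))
```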
Optionally, the method further comprises:
obtaining authority verification results respectively corresponding to a plurality of conversation voices generated in the process of executing the order;
and determining whether the target call user has the permission to execute the order or not based on each permission verification result.
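Combining the per-call verification results can be sketched as a majority vote; the pass-ratio threshold is an assumption, since the patent does not fix the combination rule:

```python
def aggregate_verdict(results, min_pass_ratio=0.5):
    """Combine the permission verification results of several call voices
    from the same order into a final decision by majority vote."""
    if not results:
        return False            # no evidence: do not grant permission
    passed = sum(1 for r in results if r)
    return passed / len(results) >= min_pass_ratio

# Three calls were recorded during the order; two passed verification.
authorized = aggregate_verdict([True, True, False])
```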
In a second aspect of the embodiments of the present disclosure, there is provided a rights verification apparatus, the apparatus including:
the conversion module is used for converting the call voice generated in the process of executing the order into a call text;
the role determining module is used for determining the user role of the call user corresponding to each voice channel in the call voice based on the call text;
the voice recognition module is used for recognizing voice data of a target call user with the user role as an order execution role in the call voice to obtain biological characteristics of the target call user, wherein the biological characteristics are used for representing the age and/or sex of the user;
And the permission verification module is used for performing permission verification on the permission of the target call user to execute the order based on the biological characteristics and the user information of the system execution user to which the order is distributed.
In a third aspect of the disclosed embodiments, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the rights verification method according to the aspect.
Furthermore, an embodiment of the present application provides a computer readable storage medium storing a computer program for causing a processor to execute the rights verification method according to the first aspect.
In the embodiments of the present application, the call voice generated during order execution is converted into call text. Based on the call text, the voice channel of the target call user holding the order execution role is first determined; the voice data of the target call user is then extracted and recognized to obtain the target call user's gender and/or age. These are checked against the user information of the system executing user corresponding to the order, so that it can be verified whether the target call user's gender and/or age matches the recorded user information of the system executing user, and thus whether the target call user has permission to execute the order.
By adopting the technical scheme provided by the embodiment of the application, the method has at least the following advantages:
First, role recognition on the call text makes it possible to pick out, from calls involving several roles, the call voice belonging to the executing user, so the executing user's voice can be obtained accurately through text processing alone. Second, once the executing user's call voice is identified, biometric recognition can be run on it directly; because the biometric features are attributes such as gender and age, no voiceprint matching is needed, and therefore no authorized user's voice has to be stored in advance, which greatly reduces the storage requirement and saves storage resources.
Third, by combining role recognition based on call text with biometric classification based on voice data, permission verification is split into two tasks: a role recognition task over the call text and a biometric recognition task over the voice data. Biometric recognition operates on a single channel's voice data and, compared with voiceprint matching, is far more efficient, so the executing user's permission can be verified and a verification result obtained quickly. Moreover, when neural network models are used for these tasks, each model can be trained independently to improve its accuracy, so that in complex voice call scenarios the requirement for accurate identity verification can be met within a shorter iteration cycle.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the related technical descriptions will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an implementation scenario of a rights verification method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for verifying authority according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a call text according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating steps for determining a user role for a call user in accordance with an embodiment of the present application;
FIG. 5 is a flowchart illustrating steps for determining probabilities that a call user corresponding to each voice channel belongs to a plurality of preset user roles, respectively, according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating steps for determining a call scenario according to an embodiment of the present application;
fig. 7 is a schematic diagram of a rights verification apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
In verifying the permissions of offline service personnel (executing users), both the speed and the accuracy of verification matter: an accurate verification result should be obtained in as short a time as possible. In the related art, voiceprint matching has been proposed for permission verification. It requires collecting the call voice of the user executing the order and pre-storing the voices of all users with execution permission, so that the executing user's call voice can be matched against the stored voices. Although this can determine fairly accurately whether the actual executing user is the system executing user, it has the following problems:
First, offline service generally involves calls with several users, so accurately isolating the executing user's call voice is difficult. Second, when the platform has many users, storing the audio of every user with execution permission consumes a large amount of storage resources. Third, voiceprint matching must be continuously optimized; it is difficult, its accuracy is low, and its optimization cycle is long. Fourth, since each user's call voice must be voiceprint-matched, verification itself also takes a long time. In summary, permission verification based on voiceprint matching in the related art takes long, and an accurate verification result cannot be obtained in a short time.
In view of this, the present application proposes a permission verification method that combines role recognition based on call text with biometric classification based on voice data, splitting the permission verification task into two subtasks to reduce implementation difficulty. In complex voice call scenarios, the authentication requirement can thus be met with fewer resources within a shorter iteration cycle, and an accurate verification result can be obtained in a short time.
Referring to fig. 1, a schematic diagram of an implementation scenario of the present application is shown. As shown in fig. 1, a service platform assigns an order to registered user 1, a user who is registered with the platform and holds order execution permission. During offline execution, user 2 acts as the actual executing user of the order; the order also involves other users 3, 4 and 5, with whom user 2 may communicate while executing the order, producing several call voices. The server platform collects a call voice, converts it into call text, identifies the voice channel of user 2 (the user executing the order), recognizes the voice data of that channel to determine user 2's age and gender, and then compares them with the age and gender recorded in registered user 1's user information to determine whether user 2 is registered user 1.
Next, a rights verification method of the present application will be described in detail, with reference to fig. 2, which shows a flowchart of steps of the rights verification method of the present application, as shown in fig. 2, and may specifically include the following steps:
step S201: and converting the call voice generated in the process of executing the order into call text.
In this embodiment, the order may refer to an order requiring personnel to perform on-line service, such as a take-out order, a network about vehicle order, and the like. The platform distributes the order to the system executing user, and when receiving the order receiving information returned by the system executing user for the order, the platform indicates that the order starts to be executed, and when receiving the order completing information returned by the system executing user, the platform indicates that the order is executed.
Accordingly, the process from the time the platform receives order taking information for the order until the platform receives order completion information for the order is referred to as an order being executed process. After the order is distributed to the system execution user, the call voice generated in the process of executing the order can be monitored in real time.
To monitor the call voice generated while the order is being executed, the user information of the users involved in the order can be obtained; this information may include communication details the users have registered with the virtual operator, such as a contact phone number. The communication signals corresponding to these details can then be monitored, and when a call between two or more users of the order is detected, the call voice of that call can be collected.
For example, in the takeaway scenario, if an order is assigned to dispatcher A, placed by user B as the consumer, and involves the meal prepared at merchant C, then dispatcher A, user B and merchant C are the users involved in the order. Calls among these three users can be monitored, and when a call between any two of them is detected, the voice of the call can be recorded to obtain the call voice. It will be appreciated that one call voice in the present application corresponds to one complete call.
Step S202: and determining the user roles of the call users corresponding to each voice channel in the call voice based on the call text.
In this embodiment, each call voice recorded during order execution can be converted into a call text; for the specific conversion process, reference may be made to the related art, which is not limited here. Note that when converting call voice to call text, a "speaker separation" function can be enabled at the same time, i.e., the voice channels are distinguished. In general, one voice channel represents one call user; when a call involves two users, there are two voice channels. Thus each piece of text content in the call text carries a voice channel identifier, and, as described above, each piece of text content points to its corresponding voice channel, uniquely characterizing one of the users in the call.
Referring to fig. 3, a call text for making a call between two users is exemplarily shown, and it should be noted that the call text is a call scene assumed for convenience of illustration, and as shown in fig. 3, 10 pieces of text content are included in the call text, each piece of text content is preceded by a voice channel identifier, where the voice channel identifier characterizes a voice channel to which the piece of text content belongs, a voice channel identifier 01 characterizes one user 1 of the two users, and a voice channel identifier 02 characterizes the other user 2. The user role of the call user corresponding to each voice channel can be determined according to the text content included in each voice channel.
As described above, call voice is recorded when a call is detected between two or more of the users involved in the order, based on their communication information. To protect user privacy during recording, calls between users are kept confidential by the virtual operator: the users' real communication information is masked during the call (the virtual operator exposes only a temporarily generated virtual number, so a user's phone displays a virtual number rather than a real one). A user involved in the order may subsequently (after the first call) reach the other party by dialing the virtual number, in which case only the identity of the called party is known. For example, a rider can pick up the phone and dial a virtual number to reach the customer, but the platform does not know whether the caller is a merchant or a rider. The platform therefore cannot tell, for a recorded call, exactly which two users were talking. For this reason, after the call text is obtained, the user roles of the call users in the call voice are recognized from the call text. On the one hand, this prevents the users' private information from being disclosed; on the other hand, the user roles determined from the call text are merely identity labels within the current order. That is, the application only needs to know each user's role in executing the order; it does not need to obtain the users' personal private information from the call voice, so the security of user information is ensured.
Of course, in order to ensure the privacy of the user, in the application, the call voice generated in the order execution process can still be obtained under the condition that the authority of the user authorized to record the call is obtained.
In this embodiment, a user role is an identity label of a user during order execution. For example, if the order is a takeaway order, its execution involves the merchant, the operator, the dispatcher and the customer, so the user roles involved in the order may include merchant, dispatcher and customer. For another example, if the order is a ride-hailing order, its execution involves drivers, operators and customers, so the user roles may include driver, operator and customer. The user roles involved may thus differ depending on the scenario of the order.
The text content of the call text reflects the content exchanged between two call users during a call, so the user role of the call user corresponding to each voice channel can be determined from the text content attributed to that channel. In implementation, the call keywords of each voice channel in a call can be extracted, and the role of each call user in the order execution process can then be derived from those keywords.

Illustratively, as shown in fig. 3, the keywords in the text content of voice channel 1 include "take away" and "put it at the gate", while the keywords of voice channel 2 include "leave it downstairs", "thank you", and the like. Through semantic processing it can be determined that the user role corresponding to voice channel 1 is the dispatcher and that corresponding to voice channel 2 is the customer.
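As a toy illustration of this keyword-based idea, the sketch below maps per-channel keywords to roles using hand-written keyword lists; the lists and role names are assumptions for a take-out scene, and the application's actual role recognition is model-based as described later.

```python
# Hypothetical keyword lists for a take-out scene; the application's actual
# role recognition is model-based, this rule sketch only illustrates the idea.
ROLE_KEYWORDS = {
    "dispatcher": {"take away", "put it at the gate", "pick up"},
    "customer": {"thank you", "leave it downstairs"},
    "merchant": {"order is ready", "sold out"},
}

def guess_role(channel_keywords):
    """Pick the preset role whose keyword list overlaps the channel's keywords most."""
    scores = {role: len(kws & set(channel_keywords))
              for role, kws in ROLE_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(guess_role(["take away", "put it at the gate"]))  # dispatcher
print(guess_role(["thank you"]))                        # customer
```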
Step S203: identify the voice data of the target call user in the call voice to obtain the biological characteristics of the target call user.
The user roles corresponding to the target call users are order execution roles; the biometric features are used to characterize the age and/or sex of the user.
In this embodiment, the user roles of the call users are recognized based on the call text so as to identify the voice channel of the user who performs the offline service of the order; biometric recognition is then performed on that user's voice data, facilitating the subsequent permission verification.

Specifically, as described above, an order involves a plurality of user roles, among which the order execution role is the user role that performs the offline service of the order. For a call text, when the user role recognized for a voice channel is the order execution role, the corresponding call user is the target call user who performs the offline service. The voice data of the voice channel of the target call user can therefore be extracted from the call voice, and biometric recognition is then performed on the extracted voice data, i.e., the age and/or sex corresponding to the voice data is recognized, thereby obtaining the biological characteristics of the target call user.

As described above, the biological characteristics characterize the age and/or sex of the user, so the present application can recognize the age of the user, the sex of the user, or both from the voice data. When both age and sex are recognized, the subsequent permission verification can check whether both the age and the sex of the target call user match those of the system-executing user, making the permission verification more accurate.
Step S204: verify the authority of the target call user to execute the order based on the biological characteristics and the user information of the system-executing user.

Wherein the system-executing user is the user assigned by the system to execute the order.
After biometric recognition, the age and/or sex of the target call user who performs the offline service of the order is obtained. This can be compared with the age and/or sex recorded in the user information of the system-executing user: if they match, the target call user is taken to be the system-executing user and has the authority to execute the order; if they do not match, the target call user is not the system-executing user and does not have the authority to execute the order.

For example, if the recognized age of the target call user is 20-30 years and the sex is female, and the user information of the system-executing user records an age of 24 and a sex of female, then with high probability the target call user is the system-executing user, so the target call user is determined to have the execution authority.

It should be noted that because the system-executing user undertakes order execution tasks assigned by the platform to provide offline services, the platform generally records the user's personal information at registration, and the platform is authorized to obtain and use this information, so it can be retrieved from the database at verification time.
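The comparison described above can be sketched as follows; the age-range and sex representations are assumptions, since the patent does not fix a concrete data format.

```python
def verify_permission(predicted_age_range, predicted_sex,
                      recorded_age, recorded_sex):
    """Pass when the recognized age range covers the recorded age and the sexes match."""
    low, high = predicted_age_range
    return low <= recorded_age <= high and predicted_sex == recorded_sex

# The example from the text: recognized 20-30 / female vs. recorded 24 / female.
print(verify_permission((20, 30), "female", 24, "female"))  # True
```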
With the technical solution of this embodiment, on the one hand, since role recognition is based on the call text, the voice data of the executing user can be accurately located by processing text data while protecting the call users' private information from being leaked. On the other hand, after the voice data of the target call user is extracted, biometric recognition can be performed on it directly; because the biological characteristics are coarse attributes such as sex and age, no voiceprint matching is needed and no voice of the authorized user needs to be stored in advance, which greatly reduces the storage requirement and saves storage resources. In still another aspect, because role recognition based on call text is combined with biometric classification based on voice data, the authentication is performed by two tasks. Biometric recognition on a single piece of voice data is more efficient than voiceprint matching, so the permission verification of the executing user can be completed quickly and the verification result obtained rapidly. Moreover, when these tasks are implemented with neural network models, the respective models can be trained independently to improve the accuracy of both tasks, so that a high-accuracy identity verification requirement can be met in a short time even in complex voice call scenes.
Of course, in some embodiments, multiple call voices in which target call users participate are generated while the order is executed. A permission verification can then be performed for each call voice, yielding multiple permission verification results, and whether the target call user has the order execution authority can be comprehensively determined by combining these results.
Correspondingly, the permission verification results corresponding to the multiple call voices generated during order execution can be obtained, and whether the target call user has the authority to execute the order is determined from these results.

In this embodiment, a verification result is obtained for each call voice. A result of passing means that the target call user has the authority to execute the order; a result of failing means that the target call user does not have the authority to execute the order.

In this way, among the multiple call voices generated during one order execution, the numbers of passing and failing results are counted. When the passing results outnumber the failing results, the final verification result of the target call user is determined as: has the authority to execute the order; otherwise, the final result is determined as: does not have the authority to execute the order.

With this implementation, whether the target call user has the authority to execute the order is finally determined from the permission verification results of multiple call voices generated during order execution, thereby improving the accuracy of the permission verification.
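A minimal sketch of this majority-vote aggregation, assuming each per-call result is a boolean:

```python
def final_verdict(per_call_results):
    """Majority vote: pass only when passing results strictly outnumber failures."""
    passed = sum(1 for r in per_call_results if r)
    failed = len(per_call_results) - passed
    return passed > failed

print(final_verdict([True, True, False]))  # True
```

Note that under the strict comparison a tie counts as a failure, which is the conservative reading of "greater than" in the text above.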
In the embodiment of the present application, the role recognition based on the call text and the biometric classification based on the voice data split one permission verification task into two subtasks, which are described below in turn.

The first subtask: role recognition based on the call text.

In one embodiment, the user role of the call user corresponding to each voice channel is obtained by processing the sentence vectors of the text sentences in the call text.
Referring to fig. 4, a flowchart showing steps for determining a user role of a call user according to the present application is shown, and as shown in fig. 4, the steps specifically include:
step S401: and obtaining sentence vectors corresponding to each text sentence in the call text.
Wherein each text sentence corresponds to one voice channel.
In this embodiment, a text sentence is generally composed of a plurality of words, so its sentence vector can be obtained by determining the word vector of each word in the text sentence and fusing these word vectors; in the fusion, the word vectors of the words may be concatenated.

Of course, in another embodiment, to refine the fusion of sentence vectors and thereby fuse richer information, the word vector of each word in each text sentence may first be obtained, and the sentence vector of each text sentence may then be derived from the word vectors of its words.

In this embodiment, the call text may be preprocessed, for example by filtering out interjections and auxiliary words such as "oh", "um", and "hello"; word vectors are then obtained for each word of each preprocessed text sentence.
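The preprocessing and fusion above can be sketched as follows; mean fusion stands in for the word-vector fusion (the text also mentions concatenation), and the stopword list and the tiny embedding table are made up for illustration.

```python
# Hypothetical stopword list: interjections/auxiliary words filtered in preprocessing.
STOPWORDS = {"oh", "um", "hello"}

def sentence_vector(words, word_vecs):
    """Filter stopwords, then average the remaining word vectors."""
    kept = [word_vecs[w] for w in words if w not in STOPWORDS and w in word_vecs]
    if not kept:
        return None
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

toy_vecs = {"take": [1.0, 0.0], "away": [0.0, 1.0]}  # made-up embeddings
print(sentence_vector(["hello", "take", "away"], toy_vecs))  # [0.5, 0.5]
```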
Step S402: determine the probability that the call user of each voice channel belongs to each of a plurality of preset user roles, based on the sentence vectors of all text sentences of that voice channel in the call text.

As described in the above embodiments, each text content in the call text points to a corresponding voice channel, i.e., it uniquely characterizes one of the users in a call, so one sentence vector corresponds to one voice channel. To determine the user role of each voice channel's call user, the sentence vectors belonging to the same voice channel can be fused to obtain a fusion vector for each channel, and the probabilities that the channel's call user belongs to the preset user roles are then determined from the fusion vector.

For example, as shown in fig. 3, if 4 of the 10 text sentences in the call text belong to voice channel 1, their sentence vectors can be fused, and the probabilities that the call user of voice channel 1 belongs to the preset user roles are determined from the fused vector.
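A sketch of steps S402-S403 under simplifying assumptions: mean fusion stands in for the channel-level fusion, and a toy linear layer followed by a softmax produces the per-role probabilities (the real classifier is learned).

```python
import math

ROLES = ["merchant", "operator", "dispatcher", "customer"]

def mean_fuse(sentence_vecs):
    """Fuse a channel's sentence vectors (mean stands in for the learned fusion)."""
    dim = len(sentence_vecs[0])
    return [sum(v[i] for v in sentence_vecs) / len(sentence_vecs) for i in range(dim)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def role_probabilities(channel_sentence_vecs, role_weight_rows):
    """Linear head: one (toy) weight row per preset role, softmax over the logits."""
    fused = mean_fuse(channel_sentence_vecs)
    logits = [sum(w * x for w, x in zip(row, fused)) for row in role_weight_rows]
    return dict(zip(ROLES, softmax(logits)))
```

The weight rows here are arbitrary placeholders; in the application they would be parameters of the trained role determination model.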
The preset user roles may be determined by the application scenario of the order; if the order is in a take-out scenario, the preset user roles may include: the merchant, the operator, the dispatcher, and the customer.
Step S403: determine the user role of each voice channel's call user from the probabilities that the call user belongs to the preset user roles.

In practice, determining the user role of each voice channel is a multi-classification task in the present application: the probability that each voice channel belongs to each of the preset user roles is determined, and the user role of the channel's call user is then finally decided from these probabilities.

For the preset user roles of a voice channel, the preset role whose probability exceeds a preset probability threshold may be taken as the channel's user role, where the threshold can be set as needed. Alternatively, the preset role with the highest probability may be taken as the channel's user role.

To improve the recognition accuracy of each voice channel's user role, the sentence vectors of all text sentences in the call text can undergo cross-fusion, so that the text contents of different voice channels are correlated with one another; meanwhile, the sentence vectors of the text sentences within the same voice channel can undergo intra-channel fusion to extract more effective, higher-level semantic features. When recognizing the user role of one voice channel's call user, both the cross-fused features and the channel's intra-channel fused features can then be used.
In one embodiment, when obtaining the sentence vector corresponding to each text sentence, the following two ways may be adopted:
The first way: input the word vectors of each text sentence into a pre-trained attention model, and obtain the sentence vector of each text sentence output by the attention model.

The attention model predicts the attention scores between the word vectors of each text sentence and the word vectors of the other text sentences, and obtains the sentence vector of each text sentence based on these scores.
The attention model may be a BERT model; BERT's attention mechanism learns the associations between words in sentences well, so accuracy can be improved based on context.

In this embodiment, the training samples for the attention model may be call texts. Specifically, during training, the word vector of each word of each text sentence in the call text is input to the model, and the model outputs, for each word, a vector representation fused with the semantic information of the text. In the present application, the attention model learns the attention scores between the word vectors of each text sentence and those of the other text sentences, and then globally fuses the sentence vectors based on these scores, obtaining the sentence vector of each text sentence. In this way, the sentence vector of one text sentence is fused, through the attention mechanism, with the semantic information of all other text sentences.

In this case, since the sentence vector of each text sentence is already fused with the semantic information of all other text sentences, i.e., information is fused globally, the probabilities that each voice channel's call user belongs to the preset user roles can be determined directly from the sentence vectors of that channel's text sentences, improving prediction accuracy.

In a specific implementation, the sentence vectors of the same voice channel can be concatenated and subjected to a max-pooling operation to extract higher-order semantic features, and the probabilities that the channel's call user belongs to the preset user roles are then determined from the max-pooling result.
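The splice-and-max-pool step can be illustrated with an element-wise max over a channel's sentence vectors; this is a simplification of the pooling actually used.

```python
def max_pool_channel(sentence_vecs):
    """Element-wise max over a channel's sentence vectors (a simple max-pooling)."""
    dim = len(sentence_vecs[0])
    return [max(v[i] for v in sentence_vecs) for i in range(dim)]

print(max_pool_channel([[1.0, 5.0], [3.0, 2.0]]))  # [3.0, 5.0]
```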
The second way: concatenate the word vectors of each text sentence to obtain the sentence vector of that text sentence.

In this way, a sentence vector obtained by directly concatenating word vectors carries no information from the other text sentences, so further fusion is needed to improve the accuracy of role prediction. In a specific implementation, referring to fig. 5, a flowchart of the steps for determining the probabilities that each voice channel's call user belongs to the preset user roles is shown; as shown in fig. 5, the method may specifically include the following steps:
Step S501: fuse the sentence vector of each text sentence of one voice channel with the sentence vectors of the text sentences of the other voice channels, obtaining a first fusion vector for each text sentence of that channel.

In this embodiment, after the sentence vector of a text sentence is obtained, it can be fused with the sentence vectors of the texts of the other voice channels to obtain a first fusion vector fused with the other channels.

In a specific implementation, fusion weights between the text sentence and the sentence vectors of the other channels' text sentences can be determined. A fusion weight characterizes the content relevance between the text sentence and another channel's text sentence; in particular, the stronger the relevance, the larger the weight can be. In practice, the fusion weight may be determined from the similarity between the sentence vectors and from the distance between their positions in the call text.

The first fusion vector thus fuses into the current text sentence the content of the other voice channels that is strongly related to it, so that one channel's sentence vectors absorb the semantically related features of the other channels. Such strongly related content reflects how different call users contribute to the call content within one call, and these contributions are closely tied to their respective user roles, which helps determine the user role of each voice channel's call user.
Step S502: fuse the sentence vectors of all text sentences of the same voice channel in the call text to obtain a second fusion vector.

In this embodiment, for the same voice channel, the sentence vectors of its text sentences may be fused either by concatenation or by weighted fusion according to fusion weights between the text sentences, where a weight may be determined from the similarity between the sentence vectors; for example, text contents with high similarity, such as those sharing repeated words, receive higher weights, so that more effective semantic features can be extracted.

Fusing the sentence vectors of all text sentences of the same voice channel extracts the higher-order semantic features of the channel, so that the channel's call content is reflected by these features.
Step S503: based on the first fusion vector and the second fusion vector belonging to the same voice channel, the probability that the call user corresponding to each voice channel respectively belongs to a plurality of preset user roles is determined.
In this embodiment, the user role of each voice channel's call user is determined from the channel's first and second fusion vectors: the two vectors may be concatenated, and the probabilities that the channel's call user belongs to the preset user roles are then determined from the concatenated vector.

With this implementation, when determining the probabilities that one voice channel's call user belongs to the preset user roles, on the one hand the channel's sentence vectors are fused with the semantically related features of the other channels, establishing the content relationship between the channels; on the other hand, the sentence vectors of all text sentences of the channel are fused, extracting the channel's higher-order semantic features. In summary, both the overall text content of the call text and the text content of the single voice channel are used when determining a channel's user role, improving the accuracy of role determination.
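A sketch of steps S501-S503 under stated assumptions: cosine similarity stands in for the learned fusion weights, mean fusion for the channel-level fusion, and the concatenated feature would feed a classifier (not shown).

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def first_fusion(sent_vec, other_channel_vecs):
    """Step S501: mix in the other channel's sentences, weighted by relevance."""
    weights = [max(cosine(sent_vec, v), 0.0) for v in other_channel_vecs]
    total = sum(weights) or 1.0
    dim = len(sent_vec)
    cross = [sum(w * v[i] for w, v in zip(weights, other_channel_vecs)) / total
             for i in range(dim)]
    return [(sent_vec[i] + cross[i]) / 2 for i in range(dim)]

def second_fusion(channel_vecs):
    """Step S502: fuse all sentences of one channel (mean as a stand-in)."""
    dim = len(channel_vecs[0])
    return [sum(v[i] for v in channel_vecs) / len(channel_vecs)
            for i in range(dim)]

def channel_feature(channel_vecs, other_channel_vecs):
    """Step S503 input: concatenate pooled first-fusion and second-fusion vectors."""
    firsts = [first_fusion(v, other_channel_vecs) for v in channel_vecs]
    return second_fusion(firsts) + second_fusion(channel_vecs)

print(channel_feature([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]))
```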
It will be appreciated that during order execution the executing user typically calls the users associated with the order for service, such as the customer or the merchant, but rarely calls other executing users; that is, a call between two users of the same user role normally does not occur. Therefore, to match real conversation scenes and improve the accuracy of role determination, in one embodiment a plurality of call scenes may be preset based on the user roles involved in the order, each call scene being a conversation between different user roles.

Accordingly, the call scene of the call voice and the user roles of the voice channels can be recognized simultaneously based on the call text, and the recognized user roles are then verified against the recognized call scene to check whether they match it, improving the accuracy of user role recognition.

Correspondingly, the call scene corresponding to the call voice is determined from the sentence vectors of all text sentences in the call text, where a call scene is a conversation between different user roles; the user role of each voice channel's call user is then verified based on the call scene.

In this embodiment, when recognizing the call scene, the sentence vectors of all text sentences may be fused into a fusion vector for the call text, which reflects the global text content of the call text. From this vector, the probabilities that the call voice belongs to each of a plurality of preset call scenes are determined; the preset scene whose probability exceeds a probability threshold, or the preset scene with the highest probability, is then taken as the call scene of the call voice.
By way of example, assuming the order is a take-out order, the call scenes include: a dispatcher-customer scene, a dispatcher-merchant scene, and a merchant-customer scene. The probabilities that one call voice of the order belongs to these three scenes can be determined, and the final call scene of the call voice decided from them.

Then, the user role of each voice channel's call user can be verified based on the recognized call scene: specifically, it is checked whether the recognized user roles of the voice channels satisfy the recognized call scene. If so, the verification passes; if not, it fails.

As shown in fig. 3, for example, if the recognized call scene is a dispatcher-customer conversation, and the recognized user role of voice channel 1 is the dispatcher while that of voice channel 2 is the customer, the role recognition result satisfies the call scene and the verification passes. If the role of voice channel 1 is recognized as the dispatcher but that of voice channel 2 as the merchant, the result does not satisfy the call scene and the verification fails.
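The scene-consistency check can be sketched as a set comparison; the scene and role names are illustrative.

```python
# Preset scenes for a take-out order; names are illustrative placeholders.
SCENES = {
    "dispatcher-customer": {"dispatcher", "customer"},
    "dispatcher-merchant": {"dispatcher", "merchant"},
    "merchant-customer": {"merchant", "customer"},
}

def roles_match_scene(scene, channel_roles):
    """Pass only when the recognized roles are exactly the scene's two parties."""
    return set(channel_roles) == SCENES[scene]

print(roles_match_scene("dispatcher-customer", ["dispatcher", "customer"]))  # True
print(roles_match_scene("dispatcher-customer", ["dispatcher", "merchant"]))  # False
```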
If the verification passes, i.e., the user roles of the voice channels' call users satisfy the recognized call scene, the voice data of the target call user whose user role is the order execution role is recognized from the call voice.

If the verification fails, i.e., the user roles do not satisfy the recognized call scene, the voice data of the target call user whose user role is the order execution role need not be recognized.
In one embodiment, when determining the call scene corresponding to the call voice, the sentence vectors of all text sentences of the entire call text may be fused, specifically by context-based feature fusion. This is because, in a call text, a preceding sentence and the following sentence are generally a question-and-answer pair with close association; by fusing preceding and following text sentences, information indicating the dialogue event can be extracted, from which it is established between which two user roles the conversation takes place, yielding the corresponding call scene.

By way of example, the text content of user 1 is "Are you at the address?", the text content of user 2 is "Yes, I am waiting for you to pick up the goods", then the text content of user 1 is "I will arrive to pick up the goods in about 15 minutes", and the text content of user 2 is "Good, we will wait for you". By fusing the context, the information that the dialogue event is "picking up goods" can be extracted, facilitating the subsequent accurate determination of the call scene.

One implementation is: fuse the sentence vectors of all text sentences in order according to the relevance between contextual text sentences, so that the fused vector attends both to the contextual information and to the relevance between the contextual text sentences, thereby accurately determining the call scene.
Specifically, referring to fig. 6, a flowchart illustrating steps for determining a call scenario, as shown in fig. 6, may include the following steps:
Step S601: determine the fusion weight of the sentence vector of each text sentence in the call text, where the fusion weight characterizes the importance of the text sentence within the call text.
In this embodiment, when sentence vectors of all text sentences in a call text are fused, a fusion weight corresponding to the sentence vector of each text sentence may be determined, and the fusion weight may represent the importance degree of the text content of the text sentence in the whole call text.
In implementation, the similarity between the sentence vector of each text sentence and those of all other text sentences can be determined. If the similarity between the current text sentence's vector and all other sentence vectors is low, the sentence's content is unimportant for determining the call scene, i.e., it carries too little information to reflect the main content of the call, so a lower fusion weight can be assigned.

For example, if the content of a text sentence is "No", its similarity to all other text sentences is low; the sentence does not carry enough information to reflect the main content of the call and can therefore be given a low fusion weight. Conversely, a text sentence such as "Put your take-out in the storage cabinet" has higher similarity to the other text sentences and carries enough information (take-out, storage cabinet, put) to reflect the main content of the call, so it can be given a higher fusion weight, fusing in features that carry more effective semantic information for determining the call scene.
Step S602: fuse the sentence vectors of all text sentences in the call text according to the fusion weights to obtain a third fusion vector.
In this embodiment, during fusion, the text sentences are fused by weighting according to their respective fusion weights, yielding the third fusion vector. Because a sentence's fusion weight characterizes its importance within the call text, more content is extracted from sentence vectors that reflect more of the call's content and less from those that reflect little, thereby extracting the effective part of the semantic information.
Step S603: determine the call scene corresponding to the call voice based on the third fusion vector.
In this embodiment, the third fusion vector may be subjected to a maximum pooling operation, so that a call scenario corresponding to the call voice is determined based on a result of the maximum pooling operation.
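Steps S601-S602 can be sketched as follows, with a sentence's average cosine similarity to the other sentences standing in for the learned fusion weight:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def fusion_weights(sent_vecs):
    """Step S601: weight = a sentence's average similarity to all other sentences."""
    raw = []
    for i, v in enumerate(sent_vecs):
        sims = [max(cosine(v, u), 0.0) for j, u in enumerate(sent_vecs) if j != i]
        raw.append(sum(sims) / len(sims) if sims else 1.0)
    total = sum(raw) or 1.0
    return [w / total for w in raw]

def third_fusion(sent_vecs):
    """Step S602: weighted sum of all sentence vectors of the call text."""
    ws = fusion_weights(sent_vecs)
    dim = len(sent_vecs[0])
    return [sum(w * v[i] for w, v in zip(ws, sent_vecs)) for i in range(dim)]
```

With three toy sentence vectors where the third is unlike the others (the "No"-style sentence), the third sentence receives the smallest weight, matching the intuition in the text above.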
With this implementation, the sentence vectors of all text sentences are fused according to how well each text sentence reflects the main call content, so features carrying effective semantic information for determining the call scene are fused in, improving the accuracy of call scene determination.

According to the above embodiments, to determine the call scene, the sentence vectors of all text sentences are fused with weights determined by each sentence's contribution to the entire call text; and to determine a voice channel's user role, the text sentences of different voice channels are fused according to the relevance between one channel's text sentences and those of the other channels.

That is, determining the user roles of the voice channels and determining the call scene both require fusing the sentence vectors of each channel's text sentences, to different degrees. In one embodiment, the recognition of the user roles of the two voice channels from the call text can be achieved by constructing a role determination model; specifically, the role determination model can perform multi-task modeling, simultaneously outputting the call scene of the call voice and the user role of each voice channel.
In the implementation, placeholders can be added into each text sentence in the call text, wherein different placeholders are added into text sentences corresponding to different voice channels; and inputting the call text added with the placeholder into a character determination model; then, the user role of the call user corresponding to each voice channel output by the role model is acquired; the user role corresponding to each voice channel is determined by the role determination model based on the vectors in the placeholders of the voice channel, and the vectors in each placeholder are sentence vectors corresponding to the text sentences.
The role determination model may be integrated with the attention model (the Bert model) described above, that is, connected to the output end of the attention model, so that the sentence vectors output by the attention model can be input directly into the role determination model. The role determination model can then obtain the user role of the call user corresponding to each voice channel and, as described above, the call scene corresponding to the call voice. The Bert model can be trained repeatedly with call texts from different call scenes as training samples to obtain the role determination model.
The Bert model is a self-encoding language model (autoencoder LM) that can extract the relational features of the words in a sentence simultaneously and at multiple different levels, reflecting sentence semantics more comprehensively. On the one hand, the Bert model can extract the relations between different text sentences in the call text and comprehensively reflect the semantics of the call text, meeting the requirement of fusing text sentence vectors to different degrees. On the other hand, using the attention model together with the role determination model enables parallel prediction of the call scene and the user roles.
When the attention model and the role determination model are jointly adopted to determine the user role corresponding to each voice channel, a placeholder can be added before each text sentence after the call text is obtained. A placeholder here refers to a position to be edited, defined in the text in advance, that accommodates the sentence vector obtained when the attention model processes the corresponding text sentence, as described in the embodiments above.
When the role determination model is adopted to determine the user role corresponding to each voice channel, the sentence vectors at the placeholders of the text sentences of the same voice channel can be regrouped and then subjected to a max-pooling operation, yielding the probabilities that the voice channel belongs to each of a plurality of preset user roles.
The role determination model may also be used to determine the call scene corresponding to the call voice. Specifically, the model may have two branches: one for predicting the user role corresponding to each voice channel, the other for predicting the call scene. The call-scene branch adds a placeholder corresponding to the call scene before the call text; this placeholder accommodates the third fusion vector obtained by fusing all text sentences, so that when determining the call scene, a max-pooling operation can be performed on the vector at the call-scene placeholder to obtain the call scene.
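The per-channel role branch can be sketched as follows (a hypothetical illustration, not the patent's actual model: the encoder is abstracted away, and `role_weights` is an assumed linear classification head; the max pooling over each channel's placeholder vectors follows the description above):

```python
import numpy as np

def predict_roles(placeholder_vecs, channel_ids, role_weights):
    """Group placeholder (sentence) vectors by voice channel, max-pool
    each group, and project to per-role probabilities, one distribution
    per channel. Sketch only; the real head is learned jointly."""
    probs = {}
    for ch in set(channel_ids):
        group = np.stack([v for v, c in zip(placeholder_vecs, channel_ids)
                          if c == ch])
        pooled = group.max(axis=0)        # max pooling over the channel's sentences
        logits = role_weights @ pooled    # linear head: (n_roles, dim) @ (dim,)
        e = np.exp(logits - logits.max())
        probs[ch] = e / e.sum()           # probabilities over preset user roles
    return probs
```

The call-scene branch would analogously pool the vector at the single scene placeholder and project it through its own classification head.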
With this embodiment, the attention model (the Bert model) is applied to recognizing both the call roles and the call scene. Owing to its attention mechanism, the attention model fuses global information into the word vector of every word in every input text sentence, so that the sentence vector of each text sentence carries global information. When the subsequent role determination model predicts the call user roles and the call scene, it therefore predicts from sentence vectors fused with global information, improving prediction accuracy.
A second sub-task: a sub-task of voiceprint-based biometric classification.
In this sub-task, a biometric model may be pre-trained, after which the voice data of the target call user is recognized by the biometric model. In one scenario, the biometric feature may be gender; because the voices of different sexes differ considerably, recognition difficulty is reduced and recognition speed is improved, so the permission verification result can be obtained sooner.
In specific implementation, a plurality of acoustic features of speakers of different sexes can be used as training samples to train a preset model into a gender prediction model, and the gender prediction model is then used to recognize the voice data of the target call user.
In specific implementation, segments of silence and noise in the voice data can be filtered out to obtain clean voice data; acoustic features are extracted from the clean voice data; and the acoustic features are input into the gender prediction model to obtain the gender of the target call user.
In this embodiment, voice activity detection (Voice Activity Detection, VAD) may be performed on the voice data to filter out silence and noise, thereby obtaining clean voice data, and acoustic features (Fbank) are then extracted from the clean voice data. The acoustic features can be obtained with reference to related techniques, such as pre-emphasis, framing, windowing, short-time Fourier transform (STFT), mel filtering, and mean removal.
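The Fbank pipeline named above (pre-emphasis, framing, windowing, STFT, mel filtering, mean removal) can be sketched minimally as follows; the parameter values (0.97 pre-emphasis, 25 ms frames with 10 ms hop at 16 kHz, 26 mel bands) are common defaults and an assumption, not values taken from the patent:

```python
import numpy as np

def fbank(signal, sr=16000, n_fft=512, n_mels=26, frame_len=400, hop=160):
    """Minimal Fbank sketch: pre-emphasis, framing, Hamming windowing,
    STFT power spectrum, triangular mel filtering, log, mean removal."""
    # pre-emphasis
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # framing + Hamming window
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # power spectrum via short-time Fourier transform
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # triangular mel filterbank
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = np.floor((n_fft + 1)
                   * imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
                   / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = pts[i], pts[i + 1], pts[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    feats = np.log(power @ fb.T + 1e-10)
    return feats - feats.mean(axis=0)     # per-feature mean removal
```

In production one would normally use a tested library implementation; this sketch only makes the listed processing steps concrete.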
The gender prediction model outputs, for each of male and female, the probability that the gender of the target call user belongs to that sex together with a confidence for that probability. In practice, a sex whose confidence exceeds a preset confidence and whose probability exceeds a preset probability may be determined as the gender of the target call user. For example, if the model predicts that the target call user is male with probability 0.9 at confidence 0.85 and female with probability 0.8 at confidence 0.5, and the preset confidence is 0.8 and the probability threshold is 0.8, the gender of the target call user is determined to be male.
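The double-threshold decision described above can be sketched as follows (a hypothetical helper, not from the patent; it returns `None` when no sex clears both thresholds, leaving the handling of that case open):

```python
def decide_gender(preds, min_conf=0.8, min_prob=0.8):
    """preds: {sex: (probability, confidence)} as output by the model.
    Return the sex whose probability and confidence both clear the
    preset thresholds; None if no sex qualifies."""
    passing = [(p, sex) for sex, (p, c) in preds.items()
               if c >= min_conf and p >= min_prob]
    # if several qualify, take the one with the highest probability
    return max(passing)[1] if passing else None
```

With the patent's example values (male: 0.9 at confidence 0.85; female: 0.8 at confidence 0.5), only the male prediction passes both thresholds.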
With the user role corresponding to each voice channel output by the role determination model, and the gender of the target call user in the call voice output by the gender prediction model, the two models perform different kinds of processing on the call voice. In practical applications, the two models can each be trained offline or online, and independently of each other, so the performance of each can be optimized separately and the two can run in parallel. The iteration period is therefore short, and accurate identity verification can be achieved within a short iteration period even in complex voice call scenes.
Determining the user role and gender of the user with models improves the speed and generalization of identity verification, so the method can be widely applied to scenes requiring execution-permission verification and can give a verification result quickly. This meets the requirement of quickly detecting an illegitimate executing user in the initial stage of the online service, winning more time for subsequent verification processing (such as verification of the executing user or expiration of the delivery time), optimizing the overall online service flow, and improving the security of the online service.
Based on the same inventive concept as the above embodiments, a second aspect of the embodiments of the present disclosure provides a rights verification apparatus. Referring to fig. 7, a schematic structural diagram of the rights verification apparatus is shown; as shown in fig. 7, the apparatus may specifically include the following modules:
the conversion module 701 is configured to convert a call voice generated in the process of executing the order into a call text;
a role determining module 702, configured to determine, based on the call text, a user role of a call user corresponding to each voice channel in the call voice;
a voice recognition module 703, configured to recognize voice data of a target call user in the call voice to obtain biometric features of the target call user; the user role corresponding to the target call user is an order execution role; the biometric features are used to characterize the age and/or sex of the user;
a permission verification module 704, configured to verify the permission of the target call user to execute the order based on the biometric features and user information of the system execution user; wherein the system execution user is the user assigned by the system to execute the order.
Optionally, the role determination module 702 includes:
the sentence vector obtaining unit is used for obtaining sentence vectors corresponding to each text sentence in the call text;
the prediction unit is used for determining the probability that the call user corresponding to each voice channel respectively belongs to a plurality of preset user roles based on sentence vectors corresponding to all text sentences belonging to the same voice channel in the call text;
and the determining unit is used for determining the user roles of the call users corresponding to each voice channel based on the probabilities that the call users corresponding to each voice channel belong to a plurality of preset user roles respectively.
Optionally, the sentence vector obtaining unit is configured to perform the following steps:
acquiring a word vector of each word in each text sentence;
and acquiring sentence vectors corresponding to each text sentence based on the word vectors of each word in each text sentence.
Optionally, based on the word vector of each word in each text sentence, obtaining a sentence vector corresponding to each text sentence, including the following steps:
Inputting word vectors corresponding to each text sentence into a pre-trained attention model, and obtaining sentence vectors corresponding to each text sentence output by the attention model; the attention model is used for predicting the attention score between the word vector corresponding to each text sentence and the word vectors of other text sentences, and obtaining the sentence vector corresponding to each text sentence based on the attention score;
or, splicing word vectors corresponding to each text sentence to obtain sentence vectors corresponding to each text sentence.
Optionally, the apparatus further comprises:
the conversation scene determining module is used for determining a conversation scene corresponding to the conversation voice based on sentence vectors corresponding to all text sentences in the conversation text under the condition that word vectors corresponding to each text sentence are spliced to obtain sentence vectors corresponding to each text sentence; the conversation scene is a scene of conversation of different user roles;
the user role verification module is used for verifying the user roles of the call users corresponding to each voice channel based on the call scene;
the voice recognition module 703 is specifically configured to recognize voice data of a target call user whose user role is an order execution role in the call voice when the verification is passed.
Optionally, the prediction unit includes:
the first fusion subunit is used for respectively fusing sentence vectors of each text sentence belonging to the same voice channel in the call text with sentence vectors of text sentences belonging to other voice channels to obtain a first fusion vector corresponding to each text sentence belonging to the same voice channel;
the second fusion subunit is used for fusing sentence vectors of all text sentences belonging to the same voice channel in the call text to obtain a second fusion vector;
and the prediction subunit is used for determining the probability that the call user corresponding to each voice channel respectively belongs to a plurality of preset user roles based on the first fusion vector and the second fusion vector which belong to the same voice channel.
Optionally, the call scene determining module includes:
the weight determining unit is used for determining fusion weights corresponding to sentence vectors corresponding to each text sentence in the call text; the fusion weight characterizes the importance of the text sentence in the call text;
the fusion unit is used for fusing sentence vectors corresponding to all text sentences in the call text based on the fusion weight to obtain a third fusion vector;
And the determining unit is used for determining a conversation scene corresponding to the conversation voice based on the third fusion vector.
Optionally, the role determination module 702 includes:
a placeholder adding unit, configured to add a placeholder to each text sentence in the call text, where different placeholders are added to text sentences corresponding to different voice channels;
an input unit for inputting the call text with the added placeholders into the role determination model;
the acquisition unit is used for acquiring the user role of the call user corresponding to each voice channel output by the role determination model; the user role corresponding to each voice channel is determined by the role determination model based on the vectors in the placeholders of the voice channel, and the vectors in each placeholder are sentence vectors corresponding to the text sentences.
Optionally, the biometric feature is gender, and the voice recognition module 703 includes:
the filtering unit is used for filtering fragments belonging to silence and noise in the voice data to obtain clean voice data;
a feature extraction unit for extracting acoustic features from the clean speech data;
the input unit is used for inputting the acoustic features into a gender prediction model to obtain the gender of the actual executing user who actually executes the order; the gender prediction model is obtained by training a preset model with a plurality of acoustic features of different sexes as training samples.
Optionally, the apparatus further comprises:
the verification result acquisition module is used for acquiring authority verification results respectively corresponding to a plurality of conversation voices generated in the process of executing the order;
and the permission verification module is used for determining whether the target call user has permission to execute the order or not based on each permission verification result.
The embodiment of the application also provides an electronic device, which may comprise a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the authority verification method described above.
Embodiments of the present application also provide a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform operations implementing the above-described rights verification method of the present application.
In this specification, each embodiment is described in a progressive manner, each embodiment focusing on its differences from the other embodiments; for the identical and similar parts between the embodiments, reference may be made to one another.
It will be apparent to those skilled in the art that embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the application may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or terminal device comprising the element.
The method, device, electronic equipment, and medium for verifying authority provided by the invention have been described in detail above, with specific examples used to illustrate the principle and implementation of the invention; the above examples are only intended to help understand the method and core idea of the invention. Meanwhile, since those skilled in the art may vary the specific embodiments and application scope in accordance with the ideas of the present invention, this description should not be construed as limiting the present invention.

Claims (13)

1. A method of rights verification, the method comprising:
converting call voice generated in the process of executing the order into call text;
determining the user role of the call user corresponding to each voice channel in the call voice based on the call text;
identifying voice data of a target call user in the call voice to obtain biological characteristics of the target call user; the user roles corresponding to the target call users are order execution roles; the biometric features are used to characterize the age and/or sex of the user;
performing authority verification on the authority of the target call user to execute the order based on the biological characteristics of the target call user and the user information of the system execution user; wherein the system executing user is a user assigned by the system executing the order.
2. The method of claim 1, wherein determining the user role of the call user corresponding to each voice channel in the call voice based on the call text comprises:
obtaining sentence vectors corresponding to each text sentence in the call text; wherein each text sentence corresponds to one voice channel;
determining the probability that the call user corresponding to each voice channel respectively belongs to a plurality of preset user roles based on sentence vectors corresponding to all text sentences belonging to the same voice channel in the call text;
and determining the user roles of the call users corresponding to each voice channel based on the probabilities that the call users corresponding to each voice channel belong to a plurality of preset user roles respectively.
3. The method according to claim 2, wherein the method further comprises:
determining a conversation scene corresponding to the conversation voice based on sentence vectors corresponding to all text sentences in the conversation text; the conversation scene is a scene of conversation of different user roles;
based on the call scene, verifying the user roles of the call users corresponding to each voice channel;
identifying voice data of a target call user in the call voice comprises the following steps:
And under the condition that the verification is passed, recognizing the voice data of the target call user.
4. The method of claim 2, wherein obtaining the sentence vector corresponding to each text sentence in the call text comprises:
acquiring a word vector of each word in each text sentence;
and acquiring sentence vectors corresponding to each text sentence based on the word vectors of each word in each text sentence.
5. The method of claim 4, wherein obtaining a sentence vector corresponding to each text sentence based on the word vector of each word in each text sentence, comprises:
inputting word vectors corresponding to each text sentence into a pre-trained attention model, and obtaining sentence vectors corresponding to each text sentence output by the attention model; the attention model is used for predicting the attention score between the word vector of each text sentence and the word vectors of other text sentences, and obtaining the sentence vector corresponding to each text sentence based on the attention score;
or, splicing word vectors corresponding to each text sentence to obtain sentence vectors corresponding to each text sentence.
6. The method according to claim 5, wherein determining the probability that the call user corresponding to each voice channel respectively belongs to a plurality of preset user roles based on the sentence vectors corresponding to the text sentences belonging to the same voice channel in the call text in the case of splicing the word vectors corresponding to each text sentence to obtain the sentence vector corresponding to each text sentence, comprises:
Respectively fusing sentence vectors of each text sentence belonging to the same voice channel in the call text with sentence vectors of text sentences belonging to other voice channels to obtain a first fusion vector corresponding to each text sentence belonging to the same voice channel;
fusing sentence vectors of all text sentences belonging to the same voice channel in the call text to obtain a second fusion vector;
based on the first fusion vector and the second fusion vector belonging to the same voice channel, the probability that the call user corresponding to each voice channel respectively belongs to a plurality of preset user roles is determined.
7. The method according to any one of claims 3-6, wherein determining a call scene corresponding to the call voice based on sentence vectors corresponding to all text sentences in the call text comprises:
determining fusion weights corresponding to sentence vectors corresponding to each text sentence in the call text; the fusion weight characterizes the importance of the text sentence in the call text;
based on the fusion weight, fusing sentence vectors corresponding to all text sentences in the call text to obtain a third fusion vector;
and determining a conversation scene corresponding to the conversation voice based on the third fusion vector.
8. The method according to any one of claims 1-6, wherein determining, based on the call text, a user role of a call user corresponding to each voice channel in the call voice includes:
adding placeholders in each text sentence in the call text, wherein different placeholders are added in the text sentences corresponding to different voice channels;
inputting the call text added with the placeholder into a role determination model;
acquiring the user role of a call user corresponding to each voice channel output by the role determination model; the user role corresponding to each voice channel is determined by the role determination model based on the vectors in the placeholders of the voice channel, and the vectors in each placeholder are sentence vectors corresponding to the text sentences.
9. The method according to any one of claims 1-6, wherein the biometric feature is gender, and the identifying the voice data of the target call user in the call voice to obtain the biometric feature of the target call user includes:
filtering fragments belonging to silence and noise in the voice data to obtain clean voice data;
extracting acoustic features from the clean speech data;
Inputting the acoustic characteristics into a gender prediction model to obtain the gender of the target call user;
the gender prediction model is obtained by training a preset model by taking a plurality of acoustic features of different sexes as training samples.
10. The method according to any one of claims 1-6, further comprising:
obtaining authority verification results respectively corresponding to a plurality of conversation voices generated in the process of executing the order;
and determining whether the target call user has the authority to execute the order or not based on each authority verification result.
11. A rights verification apparatus, said apparatus comprising:
the conversion module is used for converting the call voice generated in the process of executing the order into a call text;
the role determining module is used for determining the user role of the call user corresponding to each voice channel in the call voice based on the call text;
the voice recognition module is used for recognizing voice data of a target call user in the call voice to obtain biological characteristics of the target call user; the user roles corresponding to the target call users are order execution roles; the biometric features are used to characterize the age and/or sex of the user;
The permission verification module is used for verifying permission of the target call user for executing the order based on the biological characteristics and the user information of the system execution user; wherein the system executing user is a user assigned by the system executing the order.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executed implementing the rights verification method of any one of claims 1-10.
13. A computer readable storage medium storing a computer program for causing a processor to perform the rights verification method according to any one of claims 1-10.
CN202210395953.7A 2022-04-15 2022-04-15 Authority verification method and device, electronic equipment and medium Active CN114726635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210395953.7A CN114726635B (en) 2022-04-15 2022-04-15 Authority verification method and device, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN114726635A CN114726635A (en) 2022-07-08
CN114726635B true CN114726635B (en) 2023-09-12

Family

ID=82243069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210395953.7A Active CN114726635B (en) 2022-04-15 2022-04-15 Authority verification method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN114726635B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115208704B (en) * 2022-09-16 2023-01-13 欣诚信息技术有限公司 Identity authentication system and political service application system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106128467A (en) * 2016-06-06 2016-11-16 北京云知声信息技术有限公司 Method of speech processing and device
CN109429523A (en) * 2017-06-13 2019-03-05 北京嘀嘀无限科技发展有限公司 Speaker verification method, apparatus and system
CN111128223A (en) * 2019-12-30 2020-05-08 科大讯飞股份有限公司 Text information-based auxiliary speaker separation method and related device
CN111862977A (en) * 2020-07-27 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice conversation processing method and system
CN111883133A (en) * 2020-07-20 2020-11-03 深圳乐信软件技术有限公司 Customer service voice recognition method, customer service voice recognition device, customer service voice recognition server and storage medium
CN112086098A (en) * 2020-09-22 2020-12-15 福建鸿兴福食品有限公司 Driver and passenger analysis method and device and computer readable storage medium
CN113194210A (en) * 2021-04-30 2021-07-30 中国银行股份有限公司 Voice call access method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10347244B2 (en) * 2017-04-21 2019-07-09 Go-Vivace Inc. Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response


Also Published As

Publication number Publication date
CN114726635A (en) 2022-07-08

Similar Documents

Publication Publication Date Title
JP6677796B2 (en) Speaker verification method, apparatus, and system
US10991366B2 (en) Method of processing dialogue query priority based on dialog act information dependent on number of empty slots of the query
EP3327720B1 (en) User voiceprint model construction method and apparatus
US9904927B2 (en) Funnel analysis
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
CN111159364B (en) Dialogue system, dialogue device, dialogue method, and storage medium
CN109462482B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and computer readable storage medium
CN111696558A (en) Intelligent outbound method, device, computer equipment and storage medium
CN114007131A (en) Video monitoring method and device and related equipment
CN112417128A (en) Method and device for recommending dialect, computer equipment and storage medium
CN110704618A (en) Method and device for determining standard problem corresponding to dialogue data
CN114726635B (en) Authority verification method and device, electronic equipment and medium
US11868453B2 (en) Systems and methods for customer authentication based on audio-of-interest
CN111739537B (en) Semantic recognition method and device, storage medium and processor
CN116049411B (en) Information matching method, device, equipment and readable storage medium
CN110765242A (en) Method, device and system for providing customer service information
CN113724693B (en) Voice judging method and device, electronic equipment and storage medium
US11947872B1 (en) Natural language processing platform for automated event analysis, translation, and transcription verification
Nandakumar et al. Scamblk: A voice recognition-based natural language processing approach for the detection of telecommunication fraud
US20210182342A1 (en) Major point extraction device, major point extraction method, and non-transitory computer readable recording medium
US20220319496A1 (en) Systems and methods for training natural language processing models in a contact center
CN110853674A (en) Text collation method, apparatus, and computer-readable storage medium
CN112820323A (en) Method and system for adjusting priority of response queue based on client voice
US20240126851A1 (en) Authentication system and method
Ceaparu et al. Multifactor voice-based authentication system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant