CN115174748A - Voice call-out method, device, equipment and medium based on semantic recognition - Google Patents

Voice call-out method, device, equipment and medium based on semantic recognition

Info

Publication number
CN115174748A
CN115174748A (application CN202210743094.6A)
Authority
CN
China
Prior art keywords
text
voice
information
user
intention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210743094.6A
Other languages
Chinese (zh)
Inventor
黄石磊
廖晨
陈诚
冯湘
熊霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd filed Critical Shenzhen Raisound Technology Co ltd
Priority to CN202210743094.6A
Publication of CN115174748A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/50Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers ; Centralised arrangements for recording messages
    • H04M3/51Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing
    • H04M3/523Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing with call distribution or queueing
    • H04M3/5232Call distribution algorithms
    • H04M3/5235Dependent on call type or called number [DNIS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/186Templates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1822Parsing for meaning understanding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a voice outbound method, apparatus, device and storage medium based on semantic recognition. The method comprises the following steps: acquiring the telephone number and travel information of a user, and initiating an outbound request to the user based on the telephone number; when the user answers the outbound request, feeding back to the user template voice data generated from the template text corresponding to the travel information; receiving the to-be-processed voice fed back by the user based on the template voice data, and recognizing target semantic information of the to-be-processed voice based on a preset intention text set or a pre-trained semantic recognition model; and matching a target script text from a preset script table based on the target semantic information, converting the target script text into intermediate voice data, and feeding the intermediate voice data back to the user. The method and the device can improve the accuracy of voice recognition during automated outbound calls for epidemiological investigation.

Description

Voice call-out method, device, equipment and medium based on semantic recognition
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for voice call-out based on semantic recognition.
Background
Epidemiological investigation (流调 in Chinese, short for 流行病学调查) aims to learn the places that relevant persons have visited within a certain past time period, the people they have been in contact with, and so on, in order to determine whether those persons are at risk of transmitting or contracting a disease.
When a large number of people need to be investigated, manual investigators dialing calls by hand cannot complete the collection of investigation information quickly because of low efficiency. Although automatic outbound systems exist in the prior art, most of them rely on keyword or key-sentence matching for intention recognition, so they recognize users' spoken answers with low accuracy and cannot carry out multi-round voice interaction; such outbound systems therefore cannot be applied well to automated epidemiological investigation scenarios.
Disclosure of Invention
In view of the above, the present application provides a method, an apparatus, a device and a storage medium for voice outbound calling based on semantic recognition, aiming to solve the technical problem of low voice recognition accuracy during automated epidemiological investigation calls.
In a first aspect, the present application provides a voice outbound method based on semantic recognition, including:
acquiring a telephone number and travel information of a user, and initiating an outbound request to the user based on the telephone number;
when the user responds to the outbound request, feeding back template voice data generated by the template text corresponding to the travel information to the user;
receiving the voice to be processed fed back by the user based on the template voice data, and identifying target semantic information of the voice to be processed based on a preset intention text set or a pre-trained semantic identification model;
matching a target script text from a preset script table based on the target semantic information, converting the target script text into intermediate voice data, and feeding the intermediate voice data back to the user.
Preferably, after receiving the to-be-processed voice fed back by the user based on the template voice data, the method further includes:
and performing echo cancellation processing, noise reduction processing and enhanced amplification processing on the voice to be processed to obtain the preprocessed voice to be processed.
Preferably, the recognizing the target semantic information of the speech to be processed based on a preset intention text set or a pre-trained semantic recognition model includes:
recognizing text information of the voice to be processed based on a pre-trained voice recognition model, and performing similarity calculation on the text information and the intention text set to obtain a similarity value between the text information and each intention text in the intention text set;
judging whether an intention text with a similarity value larger than a preset threshold exists or not;
when the intention texts with the similarity values larger than the preset threshold value exist, selecting the intention text with the maximum similarity value from the intention texts with the similarity values larger than the preset threshold value, and taking the semantics of the intention text with the maximum similarity value as the target semantic information of the voice to be processed;
and when judging that no intention text with the similarity value larger than a preset threshold value exists, inputting the text information into the semantic recognition model to obtain the target semantic information of the voice to be processed.
Preferably, the recognizing the text information of the speech to be processed based on the pre-trained speech recognition model includes:
acquiring identity information of the user, and calling a voice recognition model corresponding to the voice to be processed based on the identity information;
and inputting the voice to be processed into the voice recognition model to obtain the text information of the voice to be processed.
Preferably, the calculating the similarity between the text information and the intention text set to obtain the similarity between the text information and each intention text in the intention text set includes:
respectively converting the text information and each intention text into corresponding sentence vectors;
and respectively calculating the similarity between the sentence vector of the text information and the sentence vector of each intention text by using a similarity algorithm to obtain the similarity value between the text information and each intention text in the intention text set.
Preferably, the matching of the target script text from the preset script table based on the target semantic information includes:
matching the target semantic information with a set of script texts in the script table;
when the target semantic information is successfully matched with any one of the script texts in the script text set, taking the successfully matched script text as the target script text;
and when the target semantic information fails to be matched with the script texts in the script text set, taking a preset general script text as the target script text.
Preferably, after converting the target script text into intermediate voice data and feeding it back to the user, the method further comprises:
and storing the voice to be processed and the intermediate voice data into a preset database.
In a second aspect, the present application provides a voice outbound device based on semantic recognition, including:
an outbound module, configured to acquire the telephone number and travel information of a user, and to initiate an outbound request to the user based on the telephone number;
a first feedback module, configured to feed back to the user, when the user answers the outbound request, the template voice data generated from the template text corresponding to the travel information;
an identification module, configured to receive the to-be-processed voice fed back by the user based on the template voice data, and to recognize target semantic information of the to-be-processed voice based on a preset intention text set or a pre-trained semantic recognition model;
a second feedback module, configured to match a target script text from a preset script table based on the target semantic information, convert the target script text into intermediate voice data, and feed the intermediate voice data back to the user.
In a third aspect, the present application provides an electronic device, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the steps of the voice outbound method based on semantic recognition according to any embodiment of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the voice outbound method based on semantic recognition according to any embodiment of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
according to the voice outbound method, the device, the equipment and the storage medium based on the semantic recognition, the telephone number and the travel information of a user are obtained, an outbound request is initiated to the user based on the telephone number, when the user responds to the outbound request, template voice data generated by a template text corresponding to the travel information is fed back to the user, the voice to be processed fed back by the user based on the template voice data is received, target semantic information of the voice to be processed is recognized based on a preset intention text set or a pre-trained semantic recognition model, a target speech text is matched from a preset speech form according to the target semantic information, and the target speech text is converted into intermediate voice data and then fed back to the user. The semantic information of the user answering the voice can be identified, the accuracy of voice identification in the calling-out process of the automatic flow call is improved, the semantic information is matched with the speech text and is converted into the voice, and then the voice is fed back to the user, and multiple rounds of conversations can be accurately completed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flowchart of a preferred embodiment of the voice outbound method based on semantic recognition of the present application;
FIG. 2 is a block diagram of a preferred embodiment of the voice outbound device based on semantic recognition according to the present application;
FIG. 3 is a schematic view of an electronic device according to a preferred embodiment of the present application;
the implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the descriptions referring to "first", "second", etc. in this application are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that such a combination can be realized by a person skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, the combination should be considered not to exist and does not fall within the protection scope of the present application.
The application provides a voice outbound method based on semantic recognition. Fig. 1 is a flowchart of an embodiment of the voice outbound method based on semantic recognition according to the present application. The method may be performed by an electronic device, which may be implemented by software and/or hardware. The voice outbound method based on semantic recognition comprises the following steps:
step S10: acquiring a telephone number and travel information of a user, and initiating an outbound request to the user based on the telephone number;
step S20: when the user responds to the outbound request, feeding back template voice data generated by the template text corresponding to the travel information to the user;
step S30: receiving the voice to be processed fed back by the user based on the template voice data, and identifying target semantic information of the voice to be processed based on a preset intention text set or a pre-trained semantic identification model;
step S40: matching a target script text from a preset script table based on the target semantic information, converting the target script text into intermediate voice data, and feeding the intermediate voice data back to the user.
The electronic device may be a device equipped with an outbound system. The present solution is described by taking as an example a worker using the outbound system to automatically make telephone inquiries to users who need to be queried; it is understood that the application scenario of the present solution is not limited thereto, and may also be a scenario of automatically making outbound calls when an enterprise performs a marketing campaign.
The telephone number of the user to be queried and the user's travel information are acquired. The travel information may be the geographical location information of places the user has visited within a preset time period (for example, within 2 days); for example, if a user has been to city B within 2 days, the user's geographical location information includes city B. Because the telephone number and the geographical location information of the user are private information, they may be stored in encrypted form in a related database, and only staff with the proper authority can access them to make an outgoing call. An outbound request is initiated to the user to be queried according to the user's telephone number, and when the user responds to the outbound request, that is, answers the outbound call, template voice data is fed back to the user. The template voice data is generated from the template text corresponding to the user's geographical location information, that is, it is voice data converted from the template text; the template text is the text of the questions the user needs to be asked, and different geographical location information has corresponding template texts, for example, whether the user has been to area B1 in city B, whether the user has been to area C1 in city C, and so on.
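As a minimal sketch of this template lookup (the city names, template wording and helper function are illustrative assumptions, not part of the disclosure), the mapping from travel locations to question templates might look like:

    # Illustrative location-to-template mapping; the entries are assumptions.
    TEMPLATE_TEXTS = {
        "city B": "Have you been to area B1 in city B within the past 2 days?",
        "city C": "Have you been to area C1 in city C within the past 2 days?",
    }

    def select_template_texts(travel_locations):
        """Collect the question templates matching the user's travel record."""
        return [TEMPLATE_TEXTS[loc] for loc in travel_locations
                if loc in TEMPLATE_TEXTS]

The selected texts would then be passed to a text-to-speech component to produce the template voice data.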
After the user hears the template voice data, the user can make a corresponding answer, the voice answered by the user is the voice to be processed, and the voice to be processed can be a complete answer, can also be an incomplete answer, and can also be an answer irrelevant to the template voice data.
The to-be-processed voice fed back by the user based on the template voice data is received, and the target semantic information of the to-be-processed voice is recognized to obtain the intention expressed by the user's answer. Specifically, the to-be-processed voice may be converted into text through Automatic Speech Recognition (ASR), and the target semantic information of the text may be recognized using a recognition model trained on the basis of a hidden Markov model.
After the target semantic information of the to-be-processed voice is obtained, a target script text can be matched from a preset script table. For example, if the recognized target semantic information indicates that the user remembers the places visited and has recounted them completely, the next question the user should answer can be matched from the script table; if the recognized target semantic information indicates that the user cannot clearly remember the places visited, a script text reminding the user to think carefully can be matched from the script table, for example, 'please think carefully and state again the places you have been to'.
The target script text is then converted into intermediate voice data and fed back to the user to prompt the user to answer, until the user has finished answering all the questions corresponding to the template voice data.
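A control-flow sketch of this multi-round loop is given below; every callable it takes is an assumed interface to the telephony, ASR, script-matching and TTS components rather than an API named in the disclosure:

    def run_dialogue(questions, text_to_speech, play_voice, record_answer,
                     recognize_semantics, match_script, is_complete):
        """Ask each template question; keep prompting with matched script
        texts until the answer is complete, then move to the next question."""
        for question in questions:
            play_voice(text_to_speech(question))
            while True:
                semantics = recognize_semantics(record_answer())
                if is_complete(semantics):
                    break  # the question is fully answered
                # otherwise remind or re-ask using the matched script text
                play_voice(text_to_speech(match_script(semantics)))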
In one embodiment, after receiving the to-be-processed voice fed back by the user based on the template voice data, the method further comprises:
and performing echo cancellation processing, noise reduction processing and enhanced amplification processing on the voice to be processed to obtain the preprocessed voice to be processed.
The echo cancellation processing may use an echo cancellation method: the magnitude of the echo signal is estimated, and the estimate is then subtracted from the received signal. The noise reduction processing may first cancel the noise using sound of the same frequency and amplitude but opposite phase, and then remove reverberation using a dereverberation audio plug-in or a microphone array. The enhanced amplification processing may employ automatic gain control to amplify the audio. Preprocessing the to-be-processed voice in this way can improve the accuracy of voice recognition.
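A simplified numeric sketch of the three stages, assuming the echo and noise estimates are already available from upstream components (a real system would use adaptive filters rather than fixed estimates):

    import numpy as np

    def preprocess(received, echo_estimate, noise_estimate, target_rms=0.1):
        """Echo cancellation, noise reduction and gain amplification, simplified."""
        # Echo cancellation: subtract the estimated echo from the received signal.
        signal = received - echo_estimate
        # Noise reduction: adding an anti-phase, same-frequency, same-amplitude
        # noise estimate is equivalent to subtracting it.
        signal = signal - noise_estimate
        # Automatic gain control: rescale toward a target RMS level.
        rms = max(float(np.sqrt(np.mean(signal ** 2))), 1e-8)
        return signal * (target_rms / rms)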
In one embodiment, the recognizing the target semantic information of the speech to be processed based on a preset intention text set or a pre-trained semantic recognition model includes:
recognizing text information of the voice to be processed based on a pre-trained voice recognition model, and performing similarity calculation on the text information and the intention text set to obtain a similarity value between the text information and each intention text in the intention text set;
judging whether an intention text with a similarity value larger than a preset threshold exists or not;
when judging that the intention texts with the similarity values larger than the preset threshold exist, selecting the intention text with the maximum similarity value from the intention texts with the similarity values larger than the preset threshold, and taking the semantics of the intention text with the maximum similarity value as the target semantic information of the voice to be processed;
and when judging that no intention text with the similarity value larger than a preset threshold value exists, inputting the text information into the semantic recognition model to obtain the target semantic information of the voice to be processed.
The voice recognition model refers to a pre-trained model for converting voice into text. Because the to-be-processed voice answered by the user may carry an accent, for example a Guangdong, Sichuan or Hunan accent, voice recognition models corresponding to several accents can be pre-trained to improve the accuracy of converting the to-be-processed voice into text. Similarity calculation is performed between the recognized text information and a preset intention text set to obtain a similarity value between the text information and each intention text in the set. The intention text set contains intention texts for various scenarios, and each intention text represents corresponding semantic information; the higher the similarity value between the text information and an intention text, the closer the text information is to the intention of that text. It is then judged whether there is an intention text whose similarity value is greater than a preset threshold, for example greater than 95%. When such intention texts exist, the intention text with the maximum similarity value is selected from them; for example, if there are intention texts with similarity values of 96% and 97% to the text information, the semantics of the intention text with the maximum similarity value (namely, the one corresponding to 97%) is taken as the target semantic information of the to-be-processed voice.
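The selection logic can be sketched as follows; `similarity` and `model_predict` stand for assumed components (the sentence-vector similarity described later and the trained semantic recognition model), and the 0.95 threshold is the example value above:

    def resolve_semantics(text_info, intention_texts, similarity,
                          model_predict, threshold=0.95):
        """intention_texts maps each intention text to its semantics."""
        scores = {t: similarity(text_info, t) for t in intention_texts}
        best_text, best_score = max(scores.items(), key=lambda kv: kv[1])
        if best_score > threshold:
            return intention_texts[best_text]  # semantics of the best match
        return model_predict(text_info)  # nothing above threshold: use the model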
When it is judged that no intention text has a similarity value greater than the preset threshold, for example when no intention text has a similarity value greater than 95%, the text information is input into a pre-trained semantic recognition model to obtain the target semantic information of the to-be-processed voice. The semantic recognition model can be obtained by training a BERT model, and can be trained with the Rasa framework; Rasa is an open-source machine learning framework for building contextual AI assistants and chatbots, and large pre-trained language models such as BERT and XLNet can be embedded in it.
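As an illustration of the model fallback, a fine-tuned BERT intent classifier could be queried as below; the checkpoint path is a hypothetical placeholder rather than a published model, and this sketch uses the Hugging Face transformers pipeline directly instead of the Rasa wrapper mentioned above:

    from transformers import pipeline

    # Hypothetical fine-tuned checkpoint; replace with a real model path.
    classifier = pipeline("text-classification",
                          model="./finetuned-bert-intents")

    def model_predict(text_info):
        # The pipeline returns e.g. [{"label": "visited_B1", "score": 0.97}].
        return classifier(text_info)[0]["label"]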
That is, if the intention text set contains an intention text with a sufficiently high similarity value to the text information of the to-be-processed voice, the semantics of that intention text is taken as the semantics of the text information; if no intention text has a similarity value greater than the preset threshold, the semantic recognition model is used to recognize the target semantic information corresponding to the text information, so that the target semantic information of the to-be-processed voice can be determined accurately.
Further, the recognizing the text information of the speech to be processed based on the pre-trained speech recognition model includes:
acquiring identity information of the user, and calling a voice recognition model corresponding to the voice to be processed based on the identity information;
and inputting the voice to be processed into the voice recognition model to obtain the text information of the voice to be processed.
Because the to-be-processed voice answered by the user may carry an accent, the identity information of the user can be acquired and the voice recognition model corresponding to the to-be-processed voice can be called according to that identity information. For example, if the identity information shows that the user is from Sichuan, the voice recognition model corresponding to the Sichuan accent is called; inputting the to-be-processed voice into this model yields the text information of the to-be-processed voice more accurately.
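The dispatch itself can be as simple as a lookup table; the region names and model identifiers below are assumptions for illustration:

    # Hypothetical accent-to-model mapping.
    ASR_MODELS = {
        "Sichuan": "asr-sichuan-accent",
        "Guangdong": "asr-cantonese-accent",
        "Hunan": "asr-hunan-accent",
    }

    def pick_asr_model(identity_info, default="asr-mandarin"):
        """Select the accent-matched speech recognition model for this user."""
        return ASR_MODELS.get(identity_info.get("home_region"), default)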
In one embodiment, the calculating the similarity between the text information and the intention text set to obtain the similarity between the text information and each intention text in the intention text set includes:
respectively converting the text information and each intention text into corresponding sentence vectors;
and respectively calculating the similarity between the sentence vector of the text information and the sentence vector of each intention text by using a similarity algorithm to obtain the similarity value between the text information and each intention text in the intention text set.
The text information and each intention text are first segmented into words to obtain corresponding word sets; each word is converted into a word vector, and the word vectors are spliced to obtain the sentence vector corresponding to each text. The similarity between the sentence vector of the text information and the sentence vector of each intention text is then calculated with a similarity algorithm (for example, the cosine similarity algorithm), giving the similarity value between the text information and each intention text in the intention text set.
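A sketch of both steps, where `tokenize` and `word_vectors` are assumed components (a Chinese word segmenter and a pre-trained embedding table), and averaging stands in for the splicing of word vectors:

    import numpy as np

    def sentence_vector(text, tokenize, word_vectors):
        """Combine the word vectors of the segmented text into one vector."""
        vecs = [word_vectors[w] for w in tokenize(text) if w in word_vectors]
        return np.mean(vecs, axis=0)

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))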
In one embodiment, the matching of the target script text from the preset script table based on the target semantic information includes:
matching the target semantic information with a set of script texts in the script table;
when the target semantic information is successfully matched with any one of the script texts in the script text set, taking the successfully matched script text as the target script text;
and when the target semantic information fails to be matched with the script texts in the script text set, taking a preset general script text as the target script text.
The target script text corresponding to the target semantic information can be matched through text similarity calculation. When the matching between the target semantic information and the script texts in the script text set fails, the set contains no script text corresponding to the target semantic information; in that case the preset general script text is used as the target script text, and the general script text may, for example, prompt the user to answer again or to speak more slowly.
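A sketch of this matching with the general-script fallback; `similarity` is again an assumed callable and the 0.8 threshold is illustrative:

    GENERAL_SCRIPT = "Sorry, I did not catch that. Could you answer again, a little more slowly?"

    def match_script(target_semantics, script_table, similarity, threshold=0.8):
        """script_table maps a semantic key to the script text to speak."""
        best = max(script_table,
                   key=lambda k: similarity(target_semantics, k), default=None)
        if best is not None and similarity(target_semantics, best) > threshold:
            return script_table[best]
        return GENERAL_SCRIPT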
In one embodiment, after converting the target script text into intermediate voice data and feeding it back to the user, the method further comprises:
and storing the voice to be processed and the intermediate voice data into a preset database.
Storing the to-be-processed voice and the intermediate voice data in the database facilitates subsequent data tracing and manual verification of whether the text information corresponding to the voice is correct.
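A minimal persistence sketch using SQLite; the table schema is an illustrative assumption:

    import sqlite3

    def archive_call_audio(phone, pending_voice, intermediate_voice,
                           db_path="callout.db"):
        """Store both audio streams (raw bytes) for tracing and verification."""
        with sqlite3.connect(db_path) as conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS call_audio ("
                "phone TEXT, pending BLOB, intermediate BLOB, "
                "ts DATETIME DEFAULT CURRENT_TIMESTAMP)")
            conn.execute(
                "INSERT INTO call_audio (phone, pending, intermediate) "
                "VALUES (?, ?, ?)",
                (phone, pending_voice, intermediate_voice))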
Referring to Fig. 2, a functional module diagram of the voice outbound device 100 based on semantic recognition according to the present application is shown.
The voice outbound device 100 based on semantic recognition described herein may be installed in an electronic device. According to the implemented functions, the voice outbound device 100 based on semantic recognition may include an outbound module 110, a first feedback module 120, a recognition module 130 and a second feedback module 140. A module, also referred to as a unit in this application, refers to a series of computer program segments that can be executed by a processor of an electronic device, can perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
outbound module 110: configured to acquire the telephone number and travel information of the user, and to initiate an outbound request to the user based on the telephone number;
the first feedback module 120: configured to feed back to the user, when the user answers the outbound request, the template voice data generated from the template text corresponding to the travel information;
the recognition module 130: configured to receive the to-be-processed voice fed back by the user based on the template voice data, and to recognize target semantic information of the to-be-processed voice based on a preset intention text set or a pre-trained semantic recognition model;
the second feedback module 140: configured to match a target script text from a preset script table based on the target semantic information, convert the target script text into intermediate voice data, and feed the intermediate voice data back to the user.
In one embodiment, the identification module is further configured to:
and performing echo cancellation processing, noise reduction processing and enhanced amplification processing on the voice to be processed to obtain the preprocessed voice to be processed.
In one embodiment, the recognizing the target semantic information of the speech to be processed based on a preset intention text set or a pre-trained semantic recognition model includes:
recognizing text information of the voice to be processed based on a pre-trained voice recognition model, and performing similarity calculation on the text information and the intention text set to obtain a similarity value between the text information and each intention text in the intention text set;
judging whether an intention text with a similarity value larger than a preset threshold exists or not;
when the intention texts with the similarity values larger than the preset threshold value exist, selecting the intention text with the maximum similarity value from the intention texts with the similarity values larger than the preset threshold value, and taking the semantics of the intention text with the maximum similarity value as the target semantic information of the voice to be processed;
and when judging that no intention text with the similarity value larger than a preset threshold value exists, inputting the text information into the semantic recognition model to obtain the target semantic information of the voice to be processed.
In one embodiment, the recognizing the text information of the speech to be processed based on the pre-trained speech recognition model includes:
acquiring identity information of the user, and calling a voice recognition model corresponding to the voice to be processed based on the identity information;
and inputting the voice to be processed into the voice recognition model to obtain the text information of the voice to be processed.
In one embodiment, the calculating the similarity between the text information and the intention text set to obtain the similarity between the text information and each intention text in the intention text set includes:
respectively converting the text information and each intention text into corresponding sentence vectors;
and respectively calculating the similarity between the sentence vector of the text information and the sentence vector of each intention text by using a similarity algorithm to obtain the similarity value between the text information and each intention text in the intention text set.
In one embodiment, the matching of the target script text from the preset script table based on the target semantic information includes:
matching the target semantic information with a set of script texts in the script table;
when the target semantic information is successfully matched with any one of the script texts in the script text set, taking the successfully matched script text as the target script text;
and when the target semantic information fails to be matched with the script texts in the script text set, taking a preset general script text as the target script text.
In one embodiment, the second feedback module is further configured to:
and storing the voice to be processed and the intermediate voice data into a preset database.
Fig. 3 is a schematic diagram of an electronic device 1 according to a preferred embodiment of the present application.
The electronic device 1 includes but is not limited to: a memory 11, a processor 12, a display 13 and a communication interface 14. The electronic device 1 is connected to a network via the communication interface 14. The network may be a wireless or wired network, such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth or Wi-Fi.
The memory 11 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk or a memory of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the electronic device 1. Of course, the memory 11 may also comprise both an internal storage unit and an external storage device of the electronic device 1. In this embodiment, the memory 11 is generally used for storing the operating system installed in the electronic device 1 and various application software, such as the program code of the voice call-out program 10 based on semantic recognition. Further, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 12 is typically used for controlling the overall operation of the electronic device 1, such as performing data interaction or communication related control and processing. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or process data, for example, run the program code of the voice calling program 10 based on semantic recognition.
The display 13 may be referred to as a display screen or display unit. In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch screen, or the like. The display 13 is used for displaying information processed in the electronic device 1 and for displaying a visual work interface.
The communication interface 14 may optionally comprise a standard wired interface and a wireless interface (e.g., a Wi-Fi interface); the communication interface 14 is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
Fig. 3 only shows the electronic device 1 with components 11-14 and the voice call-out program 10 based on semantic recognition; it is to be understood that not all of the shown components are required, and more or fewer components may be implemented instead.
Optionally, the electronic device 1 may further include a user interface, the user interface may include a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further include a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch screen, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
The electronic device 1 may further include a Radio Frequency (RF) circuit, a sensor, an audio circuit, and the like, which are not described in detail herein.
In the above embodiment, the processor 12 may implement the following steps when executing the voice call-out program 10 based on semantic recognition stored in the memory 11:
acquiring a telephone number and travel information of a user, and initiating an outbound request to the user based on the telephone number;
when the user responds to the outbound request, feeding back template voice data generated by the template text corresponding to the travel information to the user;
receiving the voice to be processed fed back by the user based on the template voice data, and identifying target semantic information of the voice to be processed based on a preset intention text set or a pre-trained semantic identification model;
matching a target script text from a preset script table based on the target semantic information, converting the target script text into intermediate voice data, and feeding the intermediate voice data back to the user.
The storage device may be the memory 11 of the electronic device 1, or may be another storage device communicatively connected to the electronic device 1.
For the detailed description of the above steps, please refer to the above description of Fig. 2 regarding the functional block diagram of an embodiment of the voice outbound device 100 based on semantic recognition and of Fig. 1 regarding the flowchart of an embodiment of the voice outbound method based on semantic recognition.
In addition, an embodiment of the present application also provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like. The computer-readable storage medium includes a storage data area and a storage program area; the storage program area stores the voice call-out program 10 based on semantic recognition, which realizes the following operations when executed by a processor:
acquiring a telephone number and travel information of a user, and initiating an outbound request to the user based on the telephone number;
when the user responds to the outbound request, feeding back template voice data generated by the template text corresponding to the travel information to the user;
receiving the voice to be processed fed back by the user based on the template voice data, and identifying target semantic information of the voice to be processed based on a preset intention text set or a pre-trained semantic identification model;
matching a target script text from a preset script table based on the target semantic information, converting the target script text into intermediate voice data, and feeding the intermediate voice data back to the user.
The specific implementation of the computer-readable storage medium of the present application is substantially the same as that of the above-mentioned voice outbound method based on semantic recognition, and is not described herein again.
It should be noted that the above-mentioned serial numbers of the embodiments of the present application are merely for description, and do not represent the merits of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, apparatus, article, or method that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, an electronic device, or a network device) to execute the method according to the embodiments of the present application.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims (10)

1. A voice outbound method based on semantic recognition, characterized in that the method comprises:
acquiring a telephone number and travel information of a user, and initiating an outbound request to the user based on the telephone number;
when the user responds to the outbound request, feeding back template voice data generated by the template text corresponding to the travel information to the user;
receiving the voice to be processed fed back by the user based on the template voice data, and identifying target semantic information of the voice to be processed based on a preset intention text set or a pre-trained semantic identification model;
matching a target script text from a preset script table based on the target semantic information, converting the target script text into intermediate voice data, and feeding the intermediate voice data back to the user.
2. The voice outbound method based on semantic recognition according to claim 1, characterized in that after receiving the to-be-processed voice fed back by the user based on the template voice data, the method further comprises:
and performing echo cancellation processing, noise reduction processing and enhanced amplification processing on the voice to be processed to obtain the preprocessed voice to be processed.
3. The voice outbound method based on semantic recognition according to claim 1 or 2, characterized in that the recognizing of the target semantic information of the voice to be processed based on a preset intention text set or a pre-trained semantic recognition model comprises:
recognizing text information of the voice to be processed based on a pre-trained voice recognition model, and performing similarity calculation on the text information and the intention text set to obtain a similarity value between the text information and each intention text in the intention text set;
judging whether an intention text with a similarity value larger than a preset threshold exists or not;
when the intention texts with the similarity values larger than the preset threshold value exist, selecting the intention text with the maximum similarity value from the intention texts with the similarity values larger than the preset threshold value, and taking the semantics of the intention text with the maximum similarity value as the target semantic information of the voice to be processed;
and when judging that no intention text with the similarity value larger than a preset threshold value exists, inputting the text information into the semantic recognition model to obtain the target semantic information of the voice to be processed.
4. The voice outbound method based on semantic recognition according to claim 3, characterized in that the recognizing of the text information of the voice to be processed based on the pre-trained voice recognition model comprises:
acquiring identity information of the user, and calling a voice recognition model corresponding to the voice to be processed based on the identity information;
and inputting the voice to be processed into the voice recognition model to obtain the text information of the voice to be processed.
5. The voice outbound method based on semantic recognition according to claim 3, characterized in that the performing of similarity calculation between the text information and the intention text set to obtain the similarity value between the text information and each intention text in the intention text set comprises:
respectively converting the text information and each intention text into corresponding sentence vectors;
and respectively calculating the similarity between the sentence vector of the text information and the sentence vector of each intention text by using a similarity algorithm to obtain the similarity value between the text information and each intention text in the intention text set.
6. The voice outbound method based on semantic recognition according to claim 1, characterized in that the matching of a target script text from a preset script table based on the target semantic information comprises:
matching the target semantic information with a set of script texts in the script table;
when the target semantic information is successfully matched with any one of the script texts in the script text set, taking the successfully matched script text as the target script text;
and when the target semantic information fails to be matched with the script texts in the script text set, taking a preset general script text as the target script text.
7. The voice outbound method based on semantic recognition according to claim 1, characterized in that after converting the target script text into intermediate voice data and feeding it back to the user, the method further comprises:
and storing the voice to be processed and the intermediate voice data into a preset database.
8. A voice outbound device based on semantic recognition, characterized in that the device comprises:
an outbound module, configured to acquire the telephone number and travel information of a user, and to initiate an outbound request to the user based on the telephone number;
a first feedback module, configured to feed back to the user, when the user answers the outbound request, the template voice data generated from the template text corresponding to the travel information;
an identification module, configured to receive the to-be-processed voice fed back by the user based on the template voice data, and to recognize target semantic information of the to-be-processed voice based on a preset intention text set or a pre-trained semantic recognition model;
a second feedback module, configured to match a target script text from a preset script table based on the target semantic information, convert the target script text into intermediate voice data, and feed the intermediate voice data back to the user.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the voice outbound method based on semantic recognition according to any one of claims 1 to 7 when executing a program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, carries out the steps of the voice outbound method based on semantic recognition according to any one of claims 1 to 7.
CN202210743094.6A 2022-06-27 2022-06-27 Voice call-out method, device, equipment and medium based on semantic recognition Pending CN115174748A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210743094.6A CN115174748A (en) 2022-06-27 2022-06-27 Voice call-out method, device, equipment and medium based on semantic recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210743094.6A CN115174748A (en) 2022-06-27 2022-06-27 Voice call-out method, device, equipment and medium based on semantic recognition

Publications (1)

Publication Number Publication Date
CN115174748A true CN115174748A (en) 2022-10-11

Family

ID=83487626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210743094.6A Pending CN115174748A (en) 2022-06-27 2022-06-27 Voice call-out method, device, equipment and medium based on semantic recognition

Country Status (1)

Country Link
CN (1) CN115174748A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052646A (en) * 2023-03-06 2023-05-02 北京水滴科技集团有限公司 Speech recognition method, device, storage medium and computer equipment
CN117059082A (en) * 2023-10-13 2023-11-14 北京水滴科技集团有限公司 Outbound call conversation method, device, medium and computer equipment based on large model
CN117059082B (en) * 2023-10-13 2023-12-29 北京水滴科技集团有限公司 Outbound call conversation method, device, medium and computer equipment based on large model

Similar Documents

Publication Publication Date Title
CN111028827B (en) Interaction processing method, device, equipment and storage medium based on emotion recognition
CN109587360B (en) Electronic device, method for coping with tactical recommendation, and computer-readable storage medium
US9742912B2 (en) Method and apparatus for predicting intent in IVR using natural language queries
US9633657B2 (en) Systems and methods for supporting hearing impaired users
CN115174748A (en) Voice call-out method, device, equipment and medium based on semantic recognition
CN110472224B (en) Quality of service detection method, apparatus, computer device and storage medium
US20090304161A1 (en) system and method utilizing voice search to locate a product in stores from a phone
JPWO2016092807A1 (en) SPEAKER IDENTIFYING DEVICE AND FEATURE REGISTRATION METHOD FOR REGISTERED SPEECH
CN111261162B (en) Speech recognition method, speech recognition apparatus, and storage medium
CN109462482B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and computer readable storage medium
US20060069563A1 (en) Constrained mixed-initiative in a voice-activated command system
CN112235470B (en) Incoming call client follow-up method, device and equipment based on voice recognition
CN112836521A (en) Question-answer matching method and device, computer equipment and storage medium
CN110446061B (en) Video data acquisition method and device, computer equipment and storage medium
CN111611365A (en) Flow control method, device, equipment and storage medium of dialog system
CN114817507A (en) Reply recommendation method, device, equipment and storage medium based on intention recognition
CN110750626B (en) Scene-based task-driven multi-turn dialogue method and system
CN108234785B (en) Telephone sales prompting method, electronic device and readable storage medium
CN114238602A (en) Dialogue analysis method, device, equipment and storage medium based on corpus matching
US20060129398A1 (en) Method and system for obtaining personal aliases through voice recognition
US10819849B1 (en) Device, system and method for address validation
JP2014197140A (en) Customer identity verification support system for operator, and method therein
CN110853674A (en) Text collation method, apparatus, and computer-readable storage medium
CN113656566A (en) Intelligent dialogue processing method and device, computer equipment and storage medium
CN110717020B (en) Voice question-answering method, device, computer equipment and storage medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination