WO2022142011A1 - Address recognition method and apparatus, computer device, and storage medium - Google Patents

Address recognition method and apparatus, computer device, and storage medium

Info

Publication number
WO2022142011A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
address
question
text information
entity recognition
Prior art date
Application number
PCT/CN2021/090433
Other languages
English (en)
French (fr)
Inventor
张稳
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022142011A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present application relates to the technical field of speech processing in artificial intelligence, and in particular to an address recognition method, apparatus, computer device and storage medium based on named entity recognition.
  • Human-machine dialogue is an important area of artificial intelligence.
  • Dialogue is a basic human communication skill, and the key to communicating naturally and smoothly in dialogue is understanding the other party's intention.
  • For artificial intelligence, achieving a human-like effect requires the cooperation of various applications and systems.
  • The most critical and most basic step in supporting this function is to correctly identify the intention behind human speech, so that the machine can respond correctly.
  • An existing semantic recognition method constructs a training corpus and trains a deep learning model on it, so that the model can recognize the question-and-answer text information corresponding to the training corpus and thereby determine the actual intention of that text.
  • The purpose of the embodiments of the present application is to propose an address recognition method, apparatus, computer device and storage medium based on named entity recognition, so as to solve the problems that the traditional semantic recognition method cannot be applied to semi-closed human-machine dialogue scenarios and that the deep learning model has weak generalization ability.
  • the embodiment of the present application provides an address identification method based on named entity identification, and adopts the following technical solutions:
  • the target address result is output.
  • the embodiments of the present application also provide an address identification device based on named entity identification, which adopts the following technical solutions:
  • the audio acquisition module is used to receive the question and answer audio data sent by the audio acquisition device;
  • a speech recognition module used for performing speech recognition operation on the question and answer audio data to obtain question and answer text information
  • the address text extraction module is used for performing an address text extraction operation on the question and answer text information to obtain the address text information
  • a vector conversion module for inputting the address text information into the Embedding layer to perform a vector conversion operation to obtain an address text vector
  • a feature expansion module for inputting the question and answer text information and the address text vector into the CNN model to perform a feature expansion operation to obtain an expanded text vector
  • an entity recognition module for inputting the address text vector and the expanded text vector into the trained named entity recognition model for entity recognition operation to obtain a target address result
  • a result output module used for outputting the target address result.
  • the embodiment of the present application also provides a computer device, which adopts the following technical solutions:
  • a memory and a processor are included, and computer-readable instructions are stored in the memory.
  • When the processor executes the computer-readable instructions, it implements the steps of the address recognition method based on named entity recognition as described below:
  • the target address result is output.
  • the embodiments of the present application also provide a computer-readable storage medium, which adopts the following technical solutions:
  • Computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by the processor, the steps of the address recognition method based on named entity recognition as described below are implemented:
  • the target address result is output.
  • The address recognition method based on named entity recognition includes: receiving question-and-answer audio data sent by an audio collection device; performing a speech recognition operation on the question-and-answer audio data to obtain question-and-answer text information; performing an address text extraction operation on the question-and-answer text information to obtain address text information; inputting the address text information into the Embedding layer for a vector conversion operation to obtain an address text vector; inputting the question-and-answer text information and the address text vector into the CNN model for a feature expansion operation to obtain an expanded text vector; inputting the address text vector and the expanded text vector into the trained named entity recognition model for an entity recognition operation to obtain a target address result; and outputting the target address result.
  • The audio information is converted into text information and then into a question-and-answer text vector, which is input into the CNN model, where feature information is combined to obtain the expanded text vector. Finally, the question-and-answer text vector and the expanded text vector are input into the trained named entity recognition model for named entity recognition to obtain the target address result. Because the expanded text vector combines the feature information of the phrase following each token with the feature information of the token itself, it improves the model's generalization ability for extracting entities within a specific range of suffixes without requiring a large amount of data for fitting, which reduces model training cost and improves the model's recognition ability.
  • Fig. 1 is the realization flow chart of the address identification method based on named entity identification provided by the first embodiment of the present application;
  • Fig. 2 is a flow chart of a specific implementation of step S103 in Fig. 1;
  • Fig. 3 is a flow chart of another specific implementation of step S103 in Fig. 1;
  • Fig. 4 is the realization flow chart of obtaining the trained named entity recognition model provided by the first embodiment of the present application.
  • Fig. 5 is a flow chart of a specific implementation manner of step S401 in Fig. 4;
  • Fig. 6 is a flow chart of a specific implementation manner of step S402 in Fig. 4;
  • FIG. 7 is a schematic structural diagram of an address identification device based on named entity identification provided in Embodiment 2 of the present application:
  • FIG. 8 is a schematic structural diagram of a specific implementation manner of the address text extraction module 130 in FIG. 7;
  • FIG. 9 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • FIG. 1 shows a flowchart for realizing the address identification method based on named entity identification provided by Embodiment 1 of the present application. For the convenience of description, only the part related to the present application is shown.
  • the above-mentioned address identification method based on named entity identification includes the following steps:
  • Step S101 Receive the question and answer audio data sent by the audio collection device.
  • The question-and-answer audio data refers to a waveform file obtained by converting the audio signal of a phone call into a waveform signal.
  • the question and answer audio data can be obtained by importing audio signals collected by a microphone, a telephone or other equipment into the computer through a digital audio interface in the computer for recording.
  • Step S102 Perform a voice recognition operation on the question and answer audio data to obtain the question and answer text information.
  • the speech recognition operation is mainly used to convert the above-mentioned collected question and answer audio data into text data.
  • the speech recognition operation can be realized by a pattern matching method.
  • Each word is first spoken in turn, and its feature vector is stored in the template library as a template.
  • In the recognition stage, the feature vector of the input speech is compared in turn with each template in the template library, and the template with the highest similarity is output as the recognition result.
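  • The pattern matching described above can be sketched as a nearest-template lookup. This is a minimal illustration only: the patent does not name a similarity measure, so cosine similarity is an assumption, and the `TEMPLATES` dictionary with its three-dimensional feature vectors is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognize(features, template_library):
    """Return the word whose stored template is most similar to `features`."""
    best_word, best_score = None, float("-inf")
    for word, template in template_library.items():
        score = cosine_similarity(features, template)
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hypothetical template library; real systems store acoustic feature vectors.
TEMPLATES = {
    "yes": [0.9, 0.1, 0.0],
    "no": [0.1, 0.9, 0.1],
    "address": [0.2, 0.2, 0.9],
}

print(recognize([0.85, 0.2, 0.05], TEMPLATES))
```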
  • The text recognized from the speech can be separated by speaker according to the waveform characteristics of each user and presented in a "one question, one answer" form, yielding the question-and-answer text information of the customer service staff and the question-and-answer text information of the user.
  • Step S103 performing an address text extraction operation on the question and answer text information to obtain address text information.
  • In one implementation, the address text extraction operation performs word segmentation on the question-and-answer text information to obtain a plurality of words, and then filters the words against the stop word table to obtain the filtered address text information.
  • In another implementation, the address text extraction operation performs word segmentation on the question-and-answer text information to obtain a plurality of words; filters the words against the stop word table to obtain the words to be confirmed; calculates the first word frequency of each word to be confirmed in the question-and-answer text information; reads the local corpus and calculates the second word frequency of each word to be confirmed in the local corpus; and filters the words to be confirmed according to the product of the first word frequency and the second word frequency to obtain the address text information.
  • Step S104 Input the address text information into the Embedding layer to perform a vector conversion operation to obtain an address text vector.
  • the vector transformation operation refers to inputting the question and answer text information into the Embedding layer for vector transformation to obtain the question and answer text vector.
  • Step S105 Input the question-and-answer text information and the address text vector into the CNN model to perform a feature expansion operation to obtain an expanded text vector.
  • The CNN expands the question-and-answer text vector through a sliding window, that is, it adds contextual feature information to obtain an expanded text vector enriched with that context.
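  • The sliding-window expansion can be illustrated with a simplified one-dimensional convolution over token vectors. This is a sketch under stated assumptions: a real CNN layer uses learned filters, whereas the function `expand_with_context` below (a hypothetical name) applies one fixed scalar weight per window offset and zero-pads at the sentence boundaries.

```python
def expand_with_context(token_vectors, kernel):
    """Mix each token vector with its neighbours inside a sliding window.

    kernel[j] is the weight applied to the token at offset j - len(kernel)//2
    relative to the current position; positions outside the sentence are
    treated as zero vectors.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    dim = len(token_vectors[0]) if token_vectors else 0
    expanded = []
    for i in range(len(token_vectors)):
        out = [0.0] * dim
        for j, weight in enumerate(kernel):
            pos = i + j - half
            if 0 <= pos < len(token_vectors):
                for d in range(dim):
                    out[d] += weight * token_vectors[pos][d]
        expanded.append(out)
    return expanded


# Window of size 3 that keeps the token itself and adds the following token.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(expand_with_context(vectors, kernel=[0.0, 1.0, 1.0]))
```

  • With the kernel `[0.0, 1.0, 1.0]` each output vector carries the token's own features plus those of the token that follows it, which mirrors the idea of giving a suffix more context before it reaches the downstream layers.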
  • Step S106 Input the address text vector and the expanded text vector into the trained named entity recognition model to perform named entity recognition operation, and obtain the target address result.
  • The expanded text vector carrying the added contextual feature information is combined with the original question-and-answer text vector and input into the trained named entity recognition model. Combining the expanded text vector obtained from the CNN model with the question-and-answer text vector obtained from the vector conversion adds contextual feature information and improves the trained model's generalization ability for extracting entities within a specific range of suffixes, especially long-tail address entities (such as "*** Mongolia Autonomous County"), because the CNN sliding window passes more context about a long-tail suffix such as "Mongolia Autonomous County" to the downstream network layers for parameter learning.
  • The *** area given in the customer's answer is extracted by the NER model and then looked up in the national address database, using fuzzy matching on characters and pronunciation, to determine whether the address the customer gave actually exists and which administrative level it belongs to. If the administrative level of the address is a district (county), the city to which it belongs is looked up, and the district (county)-level address in the customer's answer is replaced with that city, completing the text preprocessing.
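  • The lookup and replacement described above might look like the following sketch. It is illustrative only: the miniature `DISTRICT_TO_CITY` table stands in for the national address database, the function names are hypothetical, and fuzzy matching is done with the standard library's `difflib` on characters alone, without the phonetic matching the text also mentions.

```python
import difflib

# Hypothetical miniature address database: district/county -> parent city.
DISTRICT_TO_CITY = {
    "Nanshan District": "Shenzhen",
    "Futian District": "Shenzhen",
    "Haidian District": "Beijing",
}
CITIES = {"Shenzhen", "Beijing"}


def normalize_address(entity, cutoff=0.8):
    """Return (exists, level, city) for an extracted address entity."""
    candidates = list(DISTRICT_TO_CITY) + sorted(CITIES)
    matches = difflib.get_close_matches(entity, candidates, n=1, cutoff=cutoff)
    if not matches:
        return (False, None, None)
    name = matches[0]
    if name in DISTRICT_TO_CITY:
        return (True, "district", DISTRICT_TO_CITY[name])
    return (True, "city", name)


def replace_district_with_city(text, entity):
    """Replace a district-level entity in `text` with the city it belongs to."""
    exists, level, city = normalize_address(entity)
    if exists and level == "district":
        return text.replace(entity, city)
    return text


print(replace_district_with_city("I live in Nanshan Distric", "Nanshan Distric"))
```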
  • Step S107 output the target address result.
  • In this embodiment, the provided address recognition method based on named entity recognition includes: receiving question-and-answer audio data sent by an audio collection device; performing a speech recognition operation on the question-and-answer audio data to obtain question-and-answer text information; performing an address text extraction operation on the question-and-answer text information to obtain address text information; inputting the address text information into the Embedding layer for a vector conversion operation to obtain an address text vector; inputting the question-and-answer text information and the address text vector into the CNN model for a feature expansion operation to obtain an expanded text vector; inputting the address text vector and the expanded text vector into the trained named entity recognition model for an entity recognition operation to obtain a target address result; and outputting the target address result.
  • The audio information is converted into text information and then into a question-and-answer text vector, which is input into the CNN model, where feature information is combined to obtain the expanded text vector. Finally, the question-and-answer text vector and the expanded text vector are input into the trained named entity recognition model for named entity recognition to obtain the target address result. Because the expanded text vector combines the feature information of the phrase following each token with the feature information of the token itself, it improves the model's generalization ability for extracting entities within a specific range of suffixes without requiring a large amount of data for fitting, which reduces model training cost and improves the model's recognition ability.
  • Referring to FIG. 2, a flow chart of a specific implementation of step S103 in FIG. 1 is shown. For convenience of description, only the parts related to the present application are shown.
  • step S103 specifically includes: step S201 and step S202.
  • Step S201 Perform word segmentation on the question and answer text information to obtain a plurality of words.
  • Word segmentation methods fall into two categories. The first is based on string matching (mechanical word segmentation): the string is scanned, and a substring that is identical to a word in the dictionary is treated as a match.
  • This kind of word segmentation usually adds heuristic rules such as "forward/reverse maximum matching" and "long word first".
  • The second category is based on statistics and machine learning: model parameters are trained on observed data (a labeled corpus); in the segmentation stage, the model computes the probability of each candidate segmentation, the most probable one is taken as the final result, and the address text information is obtained word by word.
  • The address text information may be a general term for all of the words obtained and is not necessarily limited to the names of the main words in the question-and-answer text information.
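  • As an illustration of the string-matching approach, forward maximum matching with the "long word first" heuristic can be sketched as follows. The dictionary contents and the function name `forward_max_match` are hypothetical, and characters not covered by the dictionary are emitted as single-character words, which is one common convention rather than something the text specifies.

```python
def forward_max_match(text, dictionary, max_word_len=None):
    """Segment `text` left to right, always taking the longest dictionary word."""
    if max_word_len is None:
        max_word_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first ("long word first").
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical dictionary of address words.
DICTIONARY = {"北京", "北京市", "朝阳", "朝阳区", "建国路"}
print(forward_max_match("我住北京市朝阳区建国路", DICTIONARY))
```

  • Because "北京市" is tried before "北京", the longer administrative name wins, which is the behaviour the "long word first" rule is meant to produce.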
  • Step S202 Perform a filtering operation on the words based on the stop word table to obtain filtered address text information.
  • The words obtained after word segmentation can also be filtered against the stop word table to remove unimportant words (stop words), for example "ah" and "oh".
  • Referring to FIG. 3, a flowchart of another specific implementation of step S103 in FIG. 1 is shown. For convenience of description, only the parts related to the present application are shown.
  • Step S103 specifically includes: step S301, step S302, step S303, step S304 and step S305.
  • Step S301 Perform word segmentation on the question and answer text information to obtain a plurality of words.
  • As in step S201, the word segmentation operation may be based on string matching (mechanical word segmentation), in which the string is scanned and a substring identical to a dictionary word is treated as a match, usually with heuristic rules such as "forward/reverse maximum matching" and "long word first".
  • Alternatively, it may be based on statistics and machine learning: model parameters are trained on observed data (a labeled corpus); in the segmentation stage, the model computes the probability of each candidate segmentation, and the most probable one is taken as the final result.
  • The address text information may be a general term for all of the words obtained and is not necessarily limited to the names of the main words in the question-and-answer text information.
  • Step S302 Perform a filtering operation on words based on the stop word table to obtain filtered words to be confirmed.
  • The words obtained after word segmentation are filtered against the stop word table to remove unimportant words (stop words), for example "ah" and "oh".
  • Step S303 Calculate the first word frequency of each word to be confirmed in the question and answer text information.
  • If a word to be confirmed occurs very frequently in the question-and-answer text information, the probability that it is a stop word is relatively high; the first word frequency is mainly used to determine whether the word to be confirmed is a stop word.
  • Step S304 Read the local corpus, and calculate the second word frequency of each word to be confirmed in the local corpus.
  • Here K2 is the second word frequency, n is the total number of documents in the corpus, and m is the number of documents containing the word. The more common a word is, the closer K2 is to 0; 1 is added to the denominator to avoid a zero denominator, which would occur when no document contains the word. It follows that if a word appears in the input text but is uncommon in the corpus, it is likely to be important in the current input text and is most likely a key word of that text.
  • Step S305 Filter the words to be confirmed according to the product of the first word frequency and the second word frequency to obtain address text information.
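  • Steps S303 to S305 can be sketched as below. The formulas are assumptions chosen to agree with the surrounding definitions: the first word frequency is taken as the word's count divided by the number of words in the text, and the second word frequency as log(n / (m + 1)), the usual inverse document frequency with 1 added to the denominator. Keeping the words whose product reaches a threshold is likewise an assumed filtering rule, and all names are hypothetical.

```python
import math


def first_word_frequency(word, tokens):
    """Share of the tokens in the question-and-answer text equal to `word`."""
    return tokens.count(word) / len(tokens) if tokens else 0.0


def second_word_frequency(word, corpus_docs):
    """Assumed form K2 = log(n / (m + 1)) over the local corpus."""
    n = len(corpus_docs)
    if n == 0:
        raise ValueError("corpus must contain at least one document")
    m = sum(1 for doc in corpus_docs if word in doc)
    return math.log(n / (m + 1))


def filter_candidates(candidates, tokens, corpus_docs, threshold):
    """Keep candidates whose first x second word frequency reaches `threshold`."""
    kept = []
    for word in candidates:
        score = first_word_frequency(word, tokens) * second_word_frequency(word, corpus_docs)
        if score >= threshold:
            kept.append(word)
    return kept


# Hypothetical input text (already segmented) and local corpus.
TOKENS = ["I", "live", "in", "Shenzhen", "in", "Nanshan"]
CORPUS = [{"I", "live", "in"}, {"in", "the"}, {"in", "Beijing"}, {"we", "live"}]
print(filter_candidates(["in", "live", "Shenzhen"], TOKENS, CORPUS, threshold=0.1))
```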
  • In this embodiment, the words are filtered through regular expressions to obtain the words to be confirmed; the first word frequency of each word to be confirmed in the question-and-answer text information is then calculated, followed by its second word frequency in the corpus; finally, the words to be confirmed are filtered according to the first word frequency and the second word frequency to obtain the filtered address text information.
  • Referring to FIG. 4, an implementation flowchart for obtaining the trained named entity recognition model provided by Embodiment 1 of the present application is shown. For convenience of description, only the parts related to the present application are shown.
  • Before step S106, the method further includes: step S401 and step S402.
  • Step S401 Obtain an initial training set and a data set to be identified.
  • The initial training set is obtained by preprocessing the labeled data set as follows: the text in the labeled data set is split into sentences according to the sentence segmentation rules; each sentence is segmented into words according to the preset word table, giving a sentence composed of multiple words, each carrying a label; the word dictionary and the label dictionary are queried to obtain the word ID and label ID of each word, so that the sentence is represented as word IDs and label IDs; and sentences are padded or truncated so that all of them have the specified length.
  • The data set to be identified is obtained by preprocessing the unlabeled data set as follows: the text in the unlabeled data set is split into sentences according to the sentence segmentation rules; each sentence is segmented into words according to the preset word table, giving a sentence composed of multiple words; the word dictionary is queried to obtain the word ID of each word, so that the sentence is represented as word IDs; and sentences are padded or truncated so that all of them have the specified length.
  • Sentences can be split according to the sentence segmentation rules by regular-expression matching.
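  • A regular-expression sentence splitter might look like the sketch below. The rule used here, that a sentence ends at one of the Chinese or Western terminal punctuation marks, is an assumption, since the patent does not list its sentence segmentation rules; as written it would also split at decimal points.

```python
import re

# Assumed rule: a run of non-terminal characters followed by terminal punctuation.
_SENTENCE = re.compile(r"[^。！？!?.]+[。！？!?.]*")


def split_sentences(text):
    """Split `text` into sentences, dropping surrounding whitespace."""
    pieces = (match.strip() for match in _SENTENCE.findall(text))
    return [piece for piece in pieces if piece]


print(split_sentences("你好。请问您的地址是？深圳市。"))
```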
  • Step S402: Perform multiple rounds of training on the initial named entity recognition model, based on the initial training set and the data set to be identified, until it converges, to obtain the trained named entity recognition model. Each round of training includes: performing supervised training on the initial named entity recognition model to obtain a supervised-trained model; using the supervised-trained model to weakly label the data set to be identified; and extracting a subset from this round's weakly labeled data set and combining the subset with the initial training set to form the training set for the next round.
  • In this embodiment, the weak labels that the named entity recognition model assigns to the data set to be identified during training are used as that data set's labeling result, and a subset of it is combined with the initial training set to form the training set for the next round.
  • The size of the data set to be identified can be set as needed, so that the training set used to train the named entity recognition model is expanded accordingly, giving the final model better generalization ability and better recognition performance on the data set to be identified.
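  • The multi-round procedure can be sketched as a small self-training loop. Everything below is a toy stand-in: `train_tagger` replaces real supervised training with a most-frequent-tag lookup, the loop runs a fixed number of rounds instead of testing convergence, and the subset is simply the first `subset_size` weakly labeled sentences, none of which the patent specifies.

```python
from collections import Counter, defaultdict


def train_tagger(labeled_sentences):
    """Toy supervised training: remember the most frequent tag of each word."""
    counts = defaultdict(Counter)
    for words, tags in labeled_sentences:
        for word, tag in zip(words, tags):
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def weak_label(model, sentences, default_tag="O"):
    """Label unlabeled sentences with the current model (weak labels)."""
    return [(words, [model.get(w, default_tag) for w in words]) for words in sentences]


def self_train(initial_training_set, unlabeled_sentences, rounds=3, subset_size=2):
    """Alternate supervised training and weak labeling for a fixed number of rounds."""
    training_set = list(initial_training_set)
    model = None
    for _ in range(rounds):
        model = train_tagger(training_set)              # supervised training
        weak = weak_label(model, unlabeled_sentences)   # weak labels for this round
        subset = weak[:subset_size]                     # assumed selection rule
        training_set = list(initial_training_set) + subset
    return model


initial = [(["I", "live", "in", "Shenzhen"], ["O", "O", "O", "B-LOC"])]
unlabeled = [["Shenzhen", "is", "big"], ["I", "like", "Beijing"]]
model = self_train(initial, unlabeled)
print(model["Shenzhen"])
```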
  • Referring to FIG. 5, a flowchart of a specific implementation of step S401 in FIG. 4 is shown. For convenience of description, only the parts related to the present application are shown.
  • Step S401 specifically includes: step S501, step S502, step S503, step S504, step S505, step S506, step S507, step S508 and step S509.
  • Step S501 Read the local database, and obtain the pre-labeled data set and the unlabeled data set in the local database.
  • the initial training set is a data set obtained by performing the following preprocessing on the labeled data set;
  • the data set to be identified is a data set obtained by performing the following preprocessing on the unlabeled data set.
  • Step S502 Perform sentence segmentation on the text in the pre-labeled data set according to the sentence segmentation rules to obtain a plurality of pre-labeled sentences.
  • Step S503 Perform word segmentation on each pre-labeled sentence based on the preset word table, to obtain a pre-labeled sentence composed of multiple words, and each word carries label information respectively.
  • the word table may be a word table corresponding to the BERT model pre-trained by Google.
  • Step S504 query the word dictionary and the tag dictionary to obtain the word ID and label ID of each word to convert the pre-labeled sentence into a representation in the form of the word ID and the label ID.
  • the word dictionary and the label dictionary may be the word dictionary and the label dictionary corresponding to the BERT model pre-trained by Google.
  • Each word in the word dictionary has a corresponding word ID.
  • word IDs corresponding to unknown words are also set in the word dictionary. That is, if the word ID of a word is queried in the word dictionary, but the word is not recorded in the dictionary, the query feedback result is the word ID corresponding to the unknown word.
  • Each tag in the tag dictionary has a corresponding tag ID.
  • Step S505 unifying the length of the pre-labeled sentences to obtain an initial training set.
  • The length unification operation pads or truncates sentences to a predetermined length. The predetermined length is the preset maximum sentence length, generally 128, meaning that the longest sentence contains 128 words. For example, a sentence shorter than 128 words is padded with 0 at the end up to 128 words, and a sentence longer than 128 words is truncated after the 128th word.
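  • The dictionary lookup of step S504 and the length unification of step S505 can be illustrated together. The padding value 0 and the length 128 come from the text above; the `UNK_ID` value and the tiny `WORD_DICT` are hypothetical placeholders and do not reflect the actual BERT vocabulary.

```python
PAD_ID = 0     # padding value stated in the text
UNK_ID = 1     # hypothetical ID reserved for unknown words
MAX_LEN = 128  # predetermined sentence length from the text


def encode(words, word_dict, max_len=MAX_LEN):
    """Map words to IDs, then truncate or zero-pad to exactly `max_len` entries."""
    ids = [word_dict.get(word, UNK_ID) for word in words]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


# Hypothetical word dictionary.
WORD_DICT = {"深": 2, "圳": 3, "市": 4}
print(encode(["深", "圳", "市", "X"], WORD_DICT, max_len=6))
```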
  • Step S506 Perform sentence segmentation on the text in the unlabeled data set according to the sentence segmentation rules to obtain a plurality of unlabeled sentences.
  • Step S507 Perform word segmentation on each unlabeled sentence based on the preset word table to obtain an unlabeled sentence composed of multiple words.
  • Step S508 Convert the unlabeled sentence into a word identification form based on the word dictionary.
  • Step S509 Unify the length of the unlabeled sentences to obtain the data set to be recognized.
  • Referring to FIG. 6, a flow chart of a specific implementation of step S402 in FIG. 4 is shown. For convenience of description, only the parts related to the present application are shown.
  • Step S402 specifically includes: step S601, step S602, step S603 and step S604.
  • Step S601 Input the sentences of the current round in the data set of the current round into the BERT layer of the BERT-CRF model in the named entity recognition model, and obtain the encoding vectors of the words in the sentences of the current round.
  • Step S602 Input the encoding vector into the CRF layer of the BERT-CRF model, and obtain a probability matrix of the current round of sentences composed of the probability sequences of all labels corresponding to all words in the current round of sentences.
  • Step S603 Obtain the optimal labeling sequence of the probability matrix of each current round of sentences based on the Viterbi algorithm.
  • Step S604 Obtain the identification label of the word according to the optimal labeling sequence, and adjust the parameters of the BERT-CRF model in the named entity recognition model based on the identification label of the word and the label of the word in the annotation data set.
  • The prior art uses a BERT layer followed by a fully connected layer to solve the sequence labeling problem: the output vector of each word is processed by Softmax, so that the value in each dimension represents the probability that the word belongs to a given category. Based on these values, the loss can be calculated and the model can be trained.
  • the present invention replaces the fully connected layer with a CRF layer, and better captures the structural characteristics between tags through the BERT-CRF model.
  • the structure of the BERT-CRF model includes the BERT layer and the CRF layer that are connected in sequence.
  • the words (Word) in the sentence are input into the BERT layer to obtain the encoding vector, and the encoding vector is used as the input of the CRF layer to obtain the probability sequence of all labels corresponding to the words. Then, according to the probability matrix, the Viterbi algorithm is used for decoding to obtain the optimal labeling sequence, and the optimal labeling sequence contains the label (Label) corresponding to the word.
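  • Viterbi decoding over a per-word score matrix can be sketched as follows. This is a generic version of the algorithm, not the patent's implementation: `emissions[t][k]` is the score of label k at position t, `transitions[j][k]` is the score of moving from label j to label k, scores are added (as with log-probabilities), and the example label set O/B/I with its scores is hypothetical.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label index sequence."""
    if not emissions:
        return []
    num_labels = len(emissions[0])
    scores = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for k in range(num_labels):
            # Best previous label for arriving at label k at position t.
            best_prev = max(range(num_labels), key=lambda j: scores[j] + transitions[j][k])
            new_scores.append(scores[best_prev] + transitions[best_prev][k] + emissions[t][k])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best_last = max(range(num_labels), key=lambda k: scores[k])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Labels: 0 = O, 1 = B, 2 = I. The transition O -> I is heavily penalized.
EMISSIONS = [[2.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
TRANSITIONS = [[0.0, 0.0, -100.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(viterbi(EMISSIONS, TRANSITIONS))
```

  • Taking the best label at each position independently would give O followed by I, an invalid pair; the transition scores steer decoding to B followed by I, which is the kind of structure between tags that a CRF layer is meant to capture.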
  • The address recognition method based on named entity recognition includes: receiving the question and answer audio data sent by the audio collection device; performing a speech recognition operation on the question and answer audio data to obtain the question and answer text information; performing an address text extraction operation on the question and answer text information to obtain the address text information; inputting the address text information into the Embedding layer for a vector conversion operation to obtain the address text vector; inputting the question and answer text information and the address text vector into the CNN model for a feature expansion operation to obtain the expanded text vector; inputting the address text vector and the expanded text vector into the trained named entity recognition model for an entity recognition operation to obtain the target address result; and outputting the target address result.
  • In the process of man-machine question answering, after the audio information of the user's reply is obtained, it is converted into text information and then into a question and answer text vector. The question and answer text vector is input into the CNN model, which combines the feature information of the phrases following each token with the feature information of the token itself to obtain the expanded text vector. Finally, the question and answer text vector and the expanded text vector are input into the trained named entity recognition model for named entity recognition, and the target address result is obtained. Because the expanded text vector combines the feature information of the token's following phrases with that of the token, it improves the model's generalization ability for extracting entities with a specific range of suffixes without requiring a large amount of data for fitting, which reduces model training cost while improving the model's recognition ability.
  • the above question and answer audio data may also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain, essentially a decentralized database, is a series of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
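  • As an illustration of data blocks linked by cryptographic methods, the following minimal sketch chains records by SHA-256 hashes. It is an assumption-laden toy, not the storage scheme of this application: the block layout (`payload`, `prev_hash`, `hash`) and the function names are hypothetical, and consensus and distribution are omitted entirely.

```python
import hashlib
import json
from typing import Dict, List


def _digest(payload: str, prev_hash: str) -> str:
    # Hash the block body deterministically (sorted keys, UTF-8 bytes).
    body = {"payload": payload, "prev_hash": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def make_block(payload: str, prev_hash: str) -> Dict[str, str]:
    """Create a block whose hash covers its payload and the previous hash."""
    return {"payload": payload, "prev_hash": prev_hash,
            "hash": _digest(payload, prev_hash)}


def chain_is_valid(chain: List[Dict[str, str]]) -> bool:
    """Check every block's own hash and its link to the preceding block."""
    for index, block in enumerate(chain):
        if block["hash"] != _digest(block["payload"], block["prev_hash"]):
            return False
        if index > 0 and block["prev_hash"] != chain[index - 1]["hash"]:
            return False
    return True
```

  • Changing the payload of any stored block changes its expected hash, so the check fails for that block; this is the anti-counterfeiting property the paragraph above refers to.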
  • The present application may be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
  • the application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer storage media including storage devices.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • the present application provides an embodiment of an address identification device based on named entity recognition, and the device embodiment corresponds to the method embodiment shown in FIG. 1 .
  • the device can be specifically applied to various electronic devices.
  • The address recognition device 100 based on named entity recognition in this embodiment includes: an audio acquisition module 110, a speech recognition module 120, an address text extraction module 130, a vector conversion module 140, a feature expansion module 150, an entity recognition module 160, and a result output module 170. Wherein:
  • An audio acquisition module 110 configured to receive the question and answer audio data sent by the audio acquisition device
  • the speech recognition module 120 is used for performing speech recognition operation on the question and answer audio data to obtain the question and answer text information
  • the address text extraction module 130 is configured to perform an address text extraction operation on the question and answer text information to obtain the address text information
  • the vector conversion module 140 is used to input the address text information into the Embedding layer to perform a vector conversion operation to obtain an address text vector;
  • the feature expansion module 150 is used to input the question and answer text information and the address text vector into the CNN model for feature expansion operation to obtain the expanded text vector;
  • the entity recognition module 160 is used to input the address text vector and the expanded text vector into the trained named entity recognition model to perform entity recognition operation, and obtain the target address result;
  • the result output module 170 is used for outputting the target address result.
  • The question and answer audio data refers to a waveform file obtained by converting the audio signal of a phone call into a waveform signal.
  • the question and answer audio data can be obtained by importing audio signals collected by a microphone, a telephone or other equipment into the computer through a digital audio interface in the computer for recording.
  • the speech recognition operation is mainly used to convert the above-mentioned collected question and answer audio data into text data.
  • Specifically, the speech recognition operation can be realized by a pattern matching method. In the training stage, the user says each word in the vocabulary in turn, and its feature vector is stored in the template library as a template. In the recognition stage, the feature vector of the input speech is compared for similarity with each template in the template library in turn, and the template with the highest similarity is output as the recognition result.
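  • The recognition stage of the pattern matching method can be sketched as below. The source does not name a similarity measure, so cosine similarity over fixed-length feature vectors is an assumption here, as are the names `cosine_similarity`, `recognize`, and the dictionary-based template library; template-based speech systems often align variable-length feature sequences (for example with dynamic time warping) instead.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recognize(feature: Sequence[float],
              templates: Dict[str, Sequence[float]]) -> str:
    """Return the word whose stored template is most similar to the input."""
    return max(templates,
               key=lambda word: cosine_similarity(feature, templates[word]))
```

  • The template library here maps each vocabulary word to the feature vector stored during the training stage; `recognize` simply compares the input against every template and outputs the best match.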
  • The question and answer text information recognized by speech recognition can be separated according to the waveform characteristics of the user, and the text content is displayed in the form of "one question, one answer", thereby obtaining the question and answer text information of the customer service staff and the question and answer text information of the user.
  • the address text extraction operation may be a word segmentation operation on the question and answer text information to obtain a plurality of words, and a filtering operation on the words based on the stop word table to obtain the filtered address text information.
  • The address text extraction operation may also be as follows: perform a word segmentation operation on the question and answer text information to obtain a plurality of words; filter the words based on the stop word table to obtain the filtered words to be confirmed; calculate the first word frequency of each word to be confirmed in the question and answer text information; read the local corpus and calculate the second word frequency of each word to be confirmed in the local corpus; and filter the words to be confirmed according to the product of the first word frequency and the second word frequency to obtain the address text information.
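  • The second extraction variant can be sketched as follows, taking already segmented words as input. Several details are assumptions not stated in the source: word frequencies are computed as relative frequencies, the local corpus is assumed to be a list of address-related words, and words are kept when the product of the two frequencies reaches a hypothetical `threshold`.

```python
from collections import Counter
from typing import Iterable, List, Set


def extract_address_words(words: List[str],
                          stop_words: Set[str],
                          corpus_words: Iterable[str],
                          threshold: float) -> List[str]:
    """Filter segmented words by stop words, then by a word-frequency product."""
    # Step 1: stop-word filtering yields the words to be confirmed.
    candidates = [w for w in words if w not in stop_words]
    if not candidates:
        return []
    first = Counter(candidates)          # counts in the question and answer text
    corpus = Counter(corpus_words)       # counts in the local corpus
    corpus_total = sum(corpus.values()) or 1
    kept = []
    for word in dict.fromkeys(candidates):   # unique words, original order
        first_freq = first[word] / len(candidates)
        second_freq = corpus[word] / corpus_total
        # Assumed rule: keep words whose frequency product is high enough.
        if first_freq * second_freq >= threshold:
            kept.append(word)
    return kept
```

  • A word that never occurs in the local corpus has a second word frequency of zero and is always dropped, which is how this sketch separates ordinary words from address vocabulary.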
  • the vector transformation operation refers to inputting the question and answer text information into the Embedding layer for vector transformation to obtain the question and answer text vector.
  • the CNN performs expansion processing on the obtained question and answer text vector through a sliding window, that is, adding contextual feature information to obtain an expanded text vector with contextual feature information expanded.
  • The expanded text vector, which carries the added contextual feature information, is combined with the original question and answer text vector and input into the trained named entity recognition model. Combining the expanded text vector obtained from the CNN model with the question and answer text vector obtained by vector conversion adds contextual feature information and improves the generalization ability of the trained named entity recognition model for extracting entities with a specific range of suffixes, especially long-tail address entities (such as: *** Mongolia Autonomous County). Through the CNN sliding window, more context information about a long-tail suffix such as "Mongolia Autonomous County" is passed to the downstream network layers for model parameter learning, which improves the generalization ability of the model.
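  • The sliding-window expansion can be illustrated with a toy one-dimensional convolution. This is only a sketch under stated assumptions: a real CNN learns its kernel weights and usually has many channels, whereas here a single fixed `kernel` (a hypothetical parameter) weights each token and the tokens that follow it, and the resulting context vector is concatenated with the original token vector.

```python
from typing import List, Sequence


def expand_with_context(token_vectors: Sequence[Sequence[float]],
                        kernel: Sequence[float]) -> List[List[float]]:
    """Append to each token vector a weighted sum over a forward window.

    kernel[k] weights the token k positions after the current one, so
    len(kernel) is the window size. Positions past the end are zero-padded.
    """
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    window = len(kernel)
    padded = [list(v) for v in token_vectors] + [[0.0] * dim] * (window - 1)
    expanded = []
    for t in range(len(token_vectors)):
        # Slide the window over the current token and its following tokens.
        context = [sum(kernel[k] * padded[t + k][d] for k in range(window))
                   for d in range(dim)]
        expanded.append(list(token_vectors[t]) + context)
    return expanded
```

  • Each output row has twice the input dimension: the first half is the token's own features and the second half summarizes the tokens that follow it, so a suffix such as "Autonomous County" contributes to the representation of the words before it.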
  • The *** area answered by the customer is extracted through the NER model and then looked up in the national address database. The address is retrieved through fuzzy matching of text and pronunciation, and it is judged whether the address stated by the customer actually exists and which administrative level it belongs to. If the administrative level of the address is a district (county), the city to which it belongs is looked up, and the district (county)-level address in the customer's answer text is replaced with that city, which completes the text preprocessing.
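  • A minimal sketch of this lookup and replacement, assuming the address database is available as an in-memory mapping from district (county) names to city names. The mapping, the function name `replace_district_with_city`, and the `cutoff` value are hypothetical; only textual fuzzy matching (via the standard library's `difflib`) is shown, and pronunciation-based matching is left out.

```python
import difflib
from typing import Dict, Optional


def replace_district_with_city(answer: str,
                               entity: str,
                               district_to_city: Dict[str, str],
                               cutoff: float = 0.6) -> Optional[str]:
    """Replace a district-level entity in the answer with its parent city.

    Returns None when no sufficiently similar district exists, i.e. the
    stated address is treated as not found in the (assumed) database.
    """
    matches = difflib.get_close_matches(entity, list(district_to_city),
                                        n=1, cutoff=cutoff)
    if not matches:
        return None
    city = district_to_city[matches[0]]
    return answer.replace(entity, city)
```

  • Returning `None` for an unmatched entity lets the caller decide how to handle an address that does not appear to exist, for example by asking the customer again.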
  • The provided address recognition device based on named entity recognition includes: an audio acquisition module for receiving question and answer audio data sent by an audio collection device; a speech recognition module for performing a speech recognition operation on the question and answer audio data to obtain question and answer text information; an address text extraction module for performing an address text extraction operation on the question and answer text information to obtain address text information; a vector conversion module for inputting the address text information into the Embedding layer for a vector conversion operation to obtain an address text vector; a feature expansion module for inputting the question and answer text information and the address text vector into the CNN model for a feature expansion operation to obtain an expanded text vector; an entity recognition module for inputting the address text vector and the expanded text vector into the trained named entity recognition model for an entity recognition operation to obtain a target address result; and a result output module for outputting the target address result.
  • In the process of man-machine question answering, after the audio information of the user's reply is obtained, it is converted into text information and then into a question and answer text vector. The question and answer text vector is input into the CNN model, which combines the feature information of the phrases following each token with the feature information of the token itself to obtain the expanded text vector. Finally, the question and answer text vector and the expanded text vector are input into the trained named entity recognition model for named entity recognition, and the target address result is obtained. Because the expanded text vector combines the feature information of the token's following phrases with that of the token, it improves the model's generalization ability for extracting entities with a specific range of suffixes without requiring a large amount of data for fitting, which reduces model training cost while improving the model's recognition ability.
  • FIG. 8 a schematic structural diagram of a specific implementation manner of the address text extraction module 130 in FIG. 7 is shown. For the convenience of description, only the parts related to the present application are shown.
  • The foregoing address text extraction module 130 includes: a first word segmentation submodule 131 and a first filtering submodule 132. Wherein:
  • the first word segmentation sub-module 131 is used to perform word segmentation operation on the question and answer text information to obtain a plurality of words;
  • the first filtering sub-module 132 is configured to perform a filtering operation on words based on the stop word table to obtain filtered address text information.
  • The word segmentation operation may be based on string matching (mechanical word segmentation): the string is scanned, and if a substring of the string is found to be the same as a word in the dictionary, it is considered a match.
  • This kind of word segmentation usually adds some heuristic rules, such as "forward/reverse maximum matching", "long word first” and so on.
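  • The "forward maximum matching" heuristic mentioned above can be sketched as follows. The dictionary contents and the `max_len` parameter are illustrative assumptions; the idea is to try the longest dictionary word starting at the current position first ("long word first") and to fall back to a single character when nothing matches.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Segment text by greedily taking the longest dictionary word each time."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                # A single character is always accepted as a fallback.
                words.append(piece)
                i += size
                break
    return words
```

  • For the dictionary {"深圳", "深圳市", "南山", "南山区"}, the input "深圳市南山区" is segmented into "深圳市" and "南山区" rather than the shorter "深圳" and "南山", because longer matches are tried first.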
  • The second category is word segmentation based on statistics and machine learning: the model parameters are trained on the observed data (a labeled corpus); then, in the word segmentation stage, the probability of each candidate segmentation is calculated by the model, and the segmentation with the highest probability is taken as the final result, finally yielding the individual pieces of address text information.
  • the address text information may be a general term for all words, not necessarily the names of main words in the question and answer text information.
  • The address text information obtained after word segmentation can also be filtered according to the stop word table to remove some unimportant words (also called stop words), for example: "ah", "oh", etc.
  • The address text extraction module 130 includes: a second word segmentation submodule, a second filtering submodule, a first word frequency calculation submodule, a second word frequency calculation submodule, and a third filtering submodule. Wherein:
  • the second word segmentation sub-module is used to perform word segmentation operation on the question and answer text information to obtain multiple words
  • the second filtering sub-module is used to perform a filtering operation on words based on the stop word table to obtain filtered words to be confirmed;
  • the first word frequency calculation submodule is used to calculate the first word frequency of each word to be confirmed in the question and answer text information
  • the second word frequency calculation submodule is used to read the local corpus, and calculate the second word frequency of each word to be confirmed in the local corpus;
  • the third filtering sub-module is configured to filter the words to be confirmed according to the product of the first word frequency and the second word frequency to obtain address text information.
  • The above-mentioned address recognition apparatus 100 based on named entity recognition further includes: a training data acquisition module and a multi-round training module. Wherein:
  • the training data acquisition module is used to acquire the initial training set and the data set to be identified;
  • the multi-round training module is used to perform multiple rounds of training operations on the initial named entity recognition model based on the initial training set and the data set to be recognized until it converges, to obtain the trained named entity recognition model; wherein each round of training operations includes: performing supervised training on the initial named entity recognition model based on this round's training set to obtain the supervised-trained initial named entity recognition model; using that model to weakly label the data set to be recognized, obtaining this round's weakly labeled data set; and extracting a subset from the weakly labeled data set obtained in this round and combining the subset with the initial training set into the training set for the next round of training.
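  • The multi-round procedure can be sketched as a generic self-training loop. This is a schematic under explicit assumptions, not the applicant's training code: the model is abstracted behind two caller-supplied functions (`fit` and `predict`, both hypothetical), the subset is chosen by highest prediction confidence, and a fixed number of `rounds` stands in for the convergence check.

```python
from typing import Any, Callable, List, Sequence, Tuple

Example = Tuple[Any, Any]  # (input, label)


def self_train(initial_train: Sequence[Example],
               unlabeled: Sequence[Any],
               fit: Callable[[List[Example]], Any],
               predict: Callable[[Any, Any], Tuple[Any, float]],
               rounds: int,
               subset_size: int) -> Any:
    """Alternate supervised training and weak labeling for several rounds."""
    train_set = list(initial_train)
    model = None
    for _ in range(rounds):
        # Supervised training on this round's training set.
        model = fit(train_set)
        # Weakly label the data set to be recognized: (input, label, confidence).
        weak = [(x, *predict(model, x)) for x in unlabeled]
        # Assumed selection rule: keep the most confident weak labels.
        weak.sort(key=lambda item: item[2], reverse=True)
        subset = [(x, label) for x, label, _ in weak[:subset_size]]
        # Next round trains on the initial training set plus the subset.
        train_set = list(initial_train) + subset
    return model
```

  • In practice `fit` would train the BERT-CRF model and `predict` would return its decoded labels with a confidence score; any stand-in with the same two signatures can be used to exercise the loop.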
  • The above training data acquisition module includes: a training data acquisition submodule, a first sentence segmentation module, a third word segmentation submodule, a first sentence conversion submodule, a first length unification submodule, a second sentence segmentation module, a fourth word segmentation submodule, a second sentence conversion submodule, and a second length unification submodule. Wherein:
  • the training data acquisition sub-module is used to read the local database, and obtain the pre-labeled data set and unlabeled data set in the local database;
  • the first sentence segmentation module is used to perform sentence segmentation operations on the text in the pre-labeled data set according to the sentence segmentation rules to obtain multiple pre-labeled sentences;
  • the third word segmentation sub-module is used to perform word segmentation operation on each pre-labeled sentence based on the preset word table, and obtain a pre-labeled sentence composed of multiple words, each of which has label information;
  • the first sentence conversion sub-module is used to query the word dictionary and the tag dictionary to obtain the word ID and label ID of each word, so as to convert the pre-labeled sentence into a representation in the form of word ID and label ID;
  • the first length unification sub-module is used to unify the length of the pre-labeled sentences to obtain the initial training set;
  • the second sentence segmentation module is used to segment the text in the unlabeled data set according to the sentence segmentation rules to obtain multiple unlabeled sentences;
  • the fourth word segmentation sub-module is used to perform word segmentation operation on each unlabeled sentence based on the preset word table, so as to obtain an unlabeled sentence composed of multiple words;
  • the second sentence conversion sub-module is used to convert the unlabeled sentence into the form of word identification based on the word dictionary;
  • the second length unification sub-module is used to perform length unification operation on unlabeled sentences to obtain the data set to be recognized.
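  • The sentence conversion and length unification steps can be sketched as below. The dictionaries, the padding and unknown-word IDs, and the function name `encode_sentence` are illustrative assumptions; the source only states that words and labels are mapped to IDs and that sentence lengths are unified.

```python
from typing import Dict, List, Tuple


def encode_sentence(words: List[str],
                    labels: List[str],
                    word_to_id: Dict[str, int],
                    label_to_id: Dict[str, int],
                    max_len: int,
                    pad_id: int = 0,
                    unk_id: int = 1,
                    pad_label_id: int = 0) -> Tuple[List[int], List[int]]:
    """Map words and labels to IDs, then truncate or pad to max_len."""
    # Words missing from the dictionary map to the unknown-word ID.
    word_ids = [word_to_id.get(w, unk_id) for w in words][:max_len]
    label_ids = [label_to_id[label] for label in labels][:max_len]
    padding = max_len - len(word_ids)
    return (word_ids + [pad_id] * padding,
            label_ids + [pad_label_id] * padding)
```

  • For an unlabeled sentence the same function can be called with an empty label list, in which case the second returned list consists only of padding and can be ignored.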
  • The above-mentioned multi-round training module specifically includes: a data input submodule, a probability matrix composition submodule, an optimal sequence acquisition submodule, and a parameter adjustment submodule. Wherein:
  • the data input sub-module is used to input the sentences in the current round of the data set into the BERT layer of the BERT-CRF model in the named entity recognition model, and obtain the encoding vectors of the words in the sentences in this round;
  • the probability matrix composition sub-module is used to input the encoding vector into the CRF layer of the BERT-CRF model, and obtain the probability matrix of this round of sentences composed of the probability sequences of all labels corresponding to all words in this round of sentences;
  • the optimal sequence obtaining sub-module is used to obtain the optimal labeling sequence of the probability matrix of each current round of sentences based on the Viterbi algorithm;
  • the parameter adjustment sub-module is used to obtain the identification label of the word according to the optimal labeling sequence, and adjust the parameters of the BERT-CRF model in the named entity recognition model based on the identification label of the word and the label of the word in the annotation data set.
  • The address recognition device based on named entity recognition includes: an audio acquisition module for receiving question and answer audio data sent by an audio collection device; a speech recognition module for performing a speech recognition operation on the question and answer audio data to obtain question and answer text information; an address text extraction module for performing an address text extraction operation on the question and answer text information to obtain address text information; a vector conversion module for inputting the address text information into the Embedding layer for a vector conversion operation to obtain an address text vector; a feature expansion module for inputting the question and answer text information and the address text vector into the CNN model for a feature expansion operation to obtain an expanded text vector; an entity recognition module for inputting the address text vector and the expanded text vector into the trained named entity recognition model for an entity recognition operation to obtain a target address result; and a result output module for outputting the target address result.
  • In the process of man-machine question answering, after the audio information of the user's reply is obtained, it is converted into text information and then into a question and answer text vector. The question and answer text vector is input into the CNN model, which combines the feature information of the phrases following each token with the feature information of the token itself to obtain the expanded text vector. Finally, the question and answer text vector and the expanded text vector are input into the trained named entity recognition model for named entity recognition, and the target address result is obtained. Because the expanded text vector combines the feature information of the token's following phrases with that of the token, it improves the model's generalization ability for extracting entities with a specific range of suffixes without requiring a large amount of data for fitting, which reduces model training cost while improving the model's recognition ability.
  • FIG. 9 is a block diagram of the basic structure of a computer device according to this embodiment.
  • The computer device 200 includes a memory 210, a processor 220, and a network interface 230 that communicate with each other through a system bus. It should be noted that only the computer device 200 with components 210-230 is shown in the figure, but it should be understood that implementation of all of the shown components is not required, and more or fewer components may be implemented instead.
  • The computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, etc.
  • the computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment.
  • the computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
  • the memory 210 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (eg, SD or DX memory, etc.), random access memory (RAM), static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • The computer-readable storage medium may be non-volatile or volatile.
  • the memory 210 may be an internal storage unit of the computer device 200 , such as a hard disk or a memory of the computer device 200 .
  • the memory 210 may also be an external storage device of the computer device 200, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 210 may also include both the internal storage unit of the computer device 200 and its external storage device.
  • The memory 210 is generally used to store the operating system and various application software installed on the computer device 200, such as the computer-readable instructions of the address recognition method based on named entity recognition.
  • the memory 210 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 220 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 220 is typically used to control the overall operation of the computer device 200 .
  • the processor 220 is configured to execute computer-readable instructions stored in the memory 210 or process data, for example, computer-readable instructions for executing the address identification method based on named entity identification.
  • the network interface 230 may include a wireless network interface or a wired network interface, and the network interface 230 is generally used to establish a communication connection between the computer device 200 and other electronic devices.
  • In the address recognition method based on named entity recognition provided by this application, in the process of man-machine question answering, after the audio information of the user's reply is obtained, it is converted into text information and then into a question and answer text vector. The question and answer text vector is input into the CNN model, which combines the feature information of the phrases following each token with the feature information of the token itself to obtain the expanded text vector. Finally, the question and answer text vector and the expanded text vector are input into the trained named entity recognition model for named entity recognition, and the target address result is obtained. Because the expanded text vector combines the feature information of the token's following phrases with that of the token, it improves the model's generalization ability for extracting entities with a specific range of suffixes without requiring a large amount of data for fitting, which reduces model training cost while improving the model's recognition ability.
  • The present application also provides another embodiment, namely a computer-readable storage medium that stores computer-readable instructions, where the computer-readable instructions can be executed by at least one processor to cause the at least one processor to perform the steps of the address recognition method based on named entity recognition described above.
  • In the address recognition method based on named entity recognition provided by this application, in the process of man-machine question answering, after the audio information of the user's reply is obtained, it is converted into text information and then into a question and answer text vector. The question and answer text vector is input into the CNN model, which combines the feature information of the phrases following each token with the feature information of the token itself to obtain the expanded text vector. Finally, the question and answer text vector and the expanded text vector are input into the trained named entity recognition model for named entity recognition, and the target address result is obtained. Because the expanded text vector combines the feature information of the token's following phrases with that of the token, it improves the model's generalization ability for extracting entities with a specific range of suffixes without requiring a large amount of data for fitting, which reduces model training cost while improving the model's recognition ability.
  • The method of the above embodiment can be implemented by means of software plus a necessary general-purpose hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation.
  • The technical solution of the present application, in essence, or the part of it that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, CD-ROM) and includes several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of this application.

Abstract

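An address recognition method, apparatus, computer device and storage medium based on named entity recognition, belonging to the technical field of speech processing in artificial intelligence and also relating to blockchain technology; the user's question and answer audio data can be stored in a blockchain. In this method, because the expanded text vector combines the feature information of the phrases following a token with the feature information of the token itself, the expanded text vector improves the model's generalization ability for extracting entities with a specific range of suffixes without requiring a large amount of data for fitting, which reduces model training cost while improving the model's recognition ability.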

Description

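An address recognition method, apparatus, computer device and storage medium
This application is based on, and claims priority to, the Chinese invention patent application No. 202011609093.X, filed on December 30, 2020 and entitled "An address recognition method, apparatus, computer device and storage medium".
Technical Field
This application relates to the technical field of speech processing in artificial intelligence, and in particular to an address recognition method, apparatus, computer device and storage medium based on named entity recognition.
Background
Human-machine dialogue is an important area of artificial intelligence. Dialogue is a basic communication skill for humans, and the most important factor in achieving natural, smooth communication in a dialogue is understanding the intention behind what the other party says. For artificial intelligence, however, various applications and *** must work together to achieve a human-like effect, and the most critical and most basic step supporting this function is to correctly recognize the intention of human speech, so that the machine can respond correctly.
One existing semantic recognition method constructs a training corpus and trains a deep learning model on it, so that the deep learning model can recognize question and answer text information corresponding to the training corpus and thereby determine the actual intention of the question and answer text information.
However, the applicant has realized that traditional semantic recognition methods are generally not intelligent. Consider a semi-closed human-machine dialogue scenario, for example: the robot asks, "Q: Do you live in city A or city B?" and the customer answers, "In ** district (county)". In such a scenario, a traditional semantic recognition method cannot determine whether ** district (county) belongs to city A or city B, and achieving such precise recognition would require investing a huge amount of additional data to cover semi-closed human-machine dialogue scenarios. It can be seen that traditional semantic recognition methods cannot be applied to semi-closed human-machine dialogue scenarios, and that the generalization ability of the deep learning model is weak.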
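Summary of the Invention
The purpose of the embodiments of this application is to propose an address recognition method, apparatus, computer device and storage medium based on named entity recognition, so as to solve the problem that traditional semantic recognition methods cannot be applied to semi-closed human-machine dialogue scenarios and that the generalization ability of the deep learning model is weak.
In order to solve the above technical problem, an embodiment of this application provides an address recognition method based on named entity recognition, which adopts the following technical solution:
receiving question and answer audio data sent by an audio collection device;
performing a speech recognition operation on the question and answer audio data to obtain question and answer text information;
performing an address text extraction operation on the question and answer text information to obtain address text information;
inputting the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector;
inputting the question and answer text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector;
inputting the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result;
outputting the target address result.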
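In order to solve the above technical problem, an embodiment of this application further provides an address recognition apparatus based on named entity recognition, which adopts the following technical solution:
an audio acquisition module, configured to receive question and answer audio data sent by an audio collection device;
a speech recognition module, configured to perform a speech recognition operation on the question and answer audio data to obtain question and answer text information;
an address text extraction module, configured to perform an address text extraction operation on the question and answer text information to obtain address text information;
a vector conversion module, configured to input the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector;
a feature expansion module, configured to input the question and answer text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector;
an entity recognition module, configured to input the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result;
a result output module, configured to output the target address result.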
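In order to solve the above technical problem, an embodiment of this application further provides a computer device, which adopts the following technical solution:
The computer device includes a memory and a processor, the memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, implements the steps of the address recognition method based on named entity recognition as described below:
receiving question and answer audio data sent by an audio collection device;
performing a speech recognition operation on the question and answer audio data to obtain question and answer text information;
performing an address text extraction operation on the question and answer text information to obtain address text information;
inputting the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector;
inputting the question and answer text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector;
inputting the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result;
outputting the target address result.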
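In order to solve the above technical problem, an embodiment of this application further provides a computer-readable storage medium, which adopts the following technical solution:
The computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions, when executed by a processor, implement the steps of the address recognition method based on named entity recognition as described below:
receiving question and answer audio data sent by an audio collection device;
performing a speech recognition operation on the question and answer audio data to obtain question and answer text information;
performing an address text extraction operation on the question and answer text information to obtain address text information;
inputting the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector;
inputting the question and answer text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector;
inputting the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result;
outputting the target address result.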
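Compared with the prior art, the embodiments of this application mainly have the following beneficial effects:
The address recognition method based on named entity recognition provided by this application includes: receiving Q&A audio data sent by an audio acquisition device; performing a speech recognition operation on the Q&A audio data to obtain Q&A text information; performing an address text extraction operation on the Q&A text information to obtain address text information; inputting the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector; inputting the Q&A text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector; inputting the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result; and outputting the target address result. During human-machine Q&A, after the audio of the user's reply is acquired, the audio is converted into text and then into a Q&A text vector. The Q&A text vector is input into the CNN model, which combines each token's own feature information with the feature information of the phrase that follows it to obtain the expanded text vector. Finally, the Q&A text vector and the expanded text vector are input into the trained named entity recognition model for named entity recognition to obtain the target address result. Because the expanded text vector combines each token's features with those of its following phrase, it improves the model's ability to generalize when extracting entities whose suffixes fall within a specific range, without requiring large amounts of data for fitting, which reduces model training cost while improving the model's recognition capability.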
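Brief Description of the Drawings
To explain the solutions in this application more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show some embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of the address recognition method based on named entity recognition provided in Embodiment 1 of this application;
FIG. 2 is a flowchart of one specific implementation of step S103 in FIG. 1;
FIG. 3 is a flowchart of another specific implementation of step S103 in FIG. 1;
FIG. 4 is a flowchart of obtaining the trained named entity recognition model provided in Embodiment 1 of this application;
FIG. 5 is a flowchart of one specific implementation of step S401 in FIG. 4;
FIG. 6 is a flowchart of one specific implementation of step S402 in FIG. 4;
FIG. 7 is a schematic structural diagram of the address recognition apparatus based on named entity recognition provided in Embodiment 2 of this application;
FIG. 8 is a schematic structural diagram of one specific implementation of the address text extraction module 130 in FIG. 7;
FIG. 9 is a schematic structural diagram of one embodiment of a computer device according to this application.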
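Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the meanings commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for describing specific embodiments and are not intended to limit this application. The terms "include" and "have", and any variations of them, in the specification, claims, and the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second", and so on in the specification, claims, or drawings are used to distinguish different objects, not to describe a specific order.
Reference herein to an "embodiment" means that a specific feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
To help those skilled in the art better understand the solutions of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings.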
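FIG. 1 shows a flowchart of the address recognition method based on named entity recognition provided in Embodiment 1 of this application. For ease of description, only the parts related to this application are shown.
The address recognition method based on named entity recognition includes the following steps:
Step S101: receive Q&A audio data sent by an audio acquisition device.
In this embodiment, the Q&A audio data refers to a waveform file obtained by converting the audio signal of a telephone call into a waveform signal.
In this embodiment, the Q&A audio data can be obtained by importing the audio signal captured by a microphone, telephone, or other device into a computer through the computer's digital audio interface and recording it.
Step S102: perform a speech recognition operation on the Q&A audio data to obtain Q&A text information.
In this embodiment, the speech recognition operation converts the captured Q&A audio data into text data. Specifically, the operation can be implemented by pattern matching: in the training stage, the user speaks each word in the vocabulary once, and the feature vector of each word is stored in a template library as a template; in the recognition stage, the feature vector of the input speech is compared for similarity with each template in the library in turn, and the template with the highest similarity is output as the recognition result.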
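The template-matching recognition described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the source does not name a similarity measure, so cosine similarity is an assumption, and the three-dimensional feature vectors and the `TEMPLATES` library are made-up placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recognize(features, template_library):
    """Return the vocabulary word whose stored template is most similar."""
    return max(
        template_library,
        key=lambda word: cosine_similarity(features, template_library[word]),
    )


# Hypothetical template library built in the training stage:
# one feature vector per vocabulary word.
TEMPLATES = {
    "yes": [0.9, 0.1, 0.0],
    "no": [0.1, 0.8, 0.2],
}

if __name__ == "__main__":
    print(recognize([0.8, 0.2, 0.1], TEMPLATES))
```

A production recognizer would compare sequences of acoustic frames rather than single vectors; the sketch only shows the "compare with every template, keep the best" step.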
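In this embodiment, the recognized Q&A text information can be separated according to the waveform characteristics of the customer service agent and of the user, and the text can be presented in question-then-answer form, so as to obtain the agent's Q&A text information and the user's Q&A text information.
Step S103: perform an address text extraction operation on the Q&A text information to obtain address text information.
In this embodiment, to capture address words that may appear in the Q&A text information, an address text extraction operation is performed on the Q&A text information to obtain the address text information.
In this embodiment, the address text extraction operation may consist of performing word segmentation on the Q&A text information to obtain a plurality of words, and filtering the words against a stop-word list to obtain the filtered address text information.
In this embodiment, the address text extraction operation may alternatively consist of performing word segmentation on the Q&A text information to obtain a plurality of words; filtering the words against a stop-word list to obtain candidate words; computing the first word frequency of each candidate word in the Q&A text information; reading a local corpus and computing the second word frequency of each candidate word in the local corpus; and filtering the candidate words according to the product of the first and second word frequencies to obtain the address text information.
Step S104: input the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector.
In this embodiment, the vector conversion operation refers to inputting the Q&A text information into the Embedding layer for vector conversion to obtain the Q&A text vector.
Step S105: input the Q&A text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector.
In this embodiment, the CNN uses a sliding window to expand the obtained Q&A text vector, that is, to add context feature information, yielding an expanded text vector enriched with context features.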
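To make the sliding-window expansion concrete, here is a minimal pure-Python sketch of a one-dimensional convolution over token vectors. It rests on assumptions the source does not state: the window covers each token plus the tokens that follow it, the sequence is zero-padded at the end, and the kernel weights are fixed by hand. A real CNN would learn its weights and use many filters; `expand_with_context` is a hypothetical name.

```python
def expand_with_context(token_vectors, kernel):
    """Combine each token vector with the vectors of the tokens that follow it.

    token_vectors: list of equal-length lists, one embedding per token.
    kernel: one weight per window position; kernel[0] applies to the token
            itself, kernel[1] to the next token, and so on.
    """
    dim = len(token_vectors[0])
    width = len(kernel)
    # Zero-pad the end so the last tokens still get a full window.
    padded = token_vectors + [[0.0] * dim for _ in range(width - 1)]
    expanded = []
    for i in range(len(token_vectors)):
        window = padded[i:i + width]
        expanded.append([
            sum(kernel[j] * window[j][d] for j in range(width))
            for d in range(dim)
        ])
    return expanded


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    for row in expand_with_context(vectors, kernel=[0.5, 0.25, 0.25]):
        print(row)
```

Each output row has the same dimension as the input embedding, so the expanded vectors can be paired position by position with the original vectors before they are passed downstream.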
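Step S106: input the address text vector and the expanded text vector into the trained named entity recognition model for a named entity recognition operation to obtain a target address result.
In this embodiment, the expanded text vector enriched with context features is combined with the original Q&A text vector and input into the trained named entity recognition model. Combining the expanded text vector produced by the CNN model with the Q&A text vector produced by vector conversion adds context feature information and improves the trained model's ability to generalize when extracting entities whose suffixes fall within a specific range, especially long-tail address entities (for example, *** Mongol Autonomous County): through the CNN sliding window, more context information for a long-tail suffix such as "Mongol Autonomous County" is passed to the downstream network layers for parameter learning, which improves the model's generalization ability.
In this embodiment, the NER model extracts the *** District stated in the customer's reply, and the address is then looked up in a national address library using fuzzy matching on characters and pronunciation to determine whether the stated address actually exists and what its administrative level is. If the administrative level is district (county), the city to which the address belongs is retrieved, and the district- (county-) level address in the customer's reply text is replaced with that city, completing the preprocessing of the text.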
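The lookup-and-replace step can be illustrated with the standard-library `difflib` module. This is only a sketch: the three-entry `DISTRICT_TO_CITY` table stands in for the national address library, the 0.8 cutoff is an arbitrary assumption, and only character-level fuzzy matching is shown (the pronunciation-based matching mentioned above is not implemented here).

```python
import difflib

# Hypothetical miniature address library: district or county -> parent city.
DISTRICT_TO_CITY = {
    "Nanshan District": "Shenzhen",
    "Futian District": "Shenzhen",
    "Tianhe District": "Guangzhou",
}


def resolve_city(extracted, cutoff=0.8):
    """Fuzzy-match an extracted district name and return its city, or None."""
    matches = difflib.get_close_matches(
        extracted, list(DISTRICT_TO_CITY), n=1, cutoff=cutoff
    )
    if not matches:
        return None  # the address is treated as not found
    return DISTRICT_TO_CITY[matches[0]]


def replace_district_with_city(answer, extracted):
    """Replace the district mentioned in the answer with its parent city."""
    city = resolve_city(extracted)
    if city is None:
        return answer
    return answer.replace(extracted, city)


if __name__ == "__main__":
    # "Distrct" simulates a speech-recognition error.
    print(replace_district_with_city("I live in Nanshan Distrct", "Nanshan Distrct"))
```

After the replacement, the reply names a city, so the semi-closed question "city A or city B?" can be answered directly.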
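Step S107: output the target address result.
In this embodiment, the address recognition method based on named entity recognition includes: receiving Q&A audio data sent by an audio acquisition device; performing a speech recognition operation on the Q&A audio data to obtain Q&A text information; performing an address text extraction operation on the Q&A text information to obtain address text information; inputting the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector; inputting the Q&A text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector; inputting the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result; and outputting the target address result. During human-machine Q&A, after the audio of the user's reply is acquired, the audio is converted into text and then into a Q&A text vector. The Q&A text vector is input into the CNN model, which combines each token's own feature information with the feature information of the phrase that follows it to obtain the expanded text vector. Finally, the Q&A text vector and the expanded text vector are input into the trained named entity recognition model for named entity recognition to obtain the target address result. Because the expanded text vector combines each token's features with those of its following phrase, it improves the model's ability to generalize when extracting entities whose suffixes fall within a specific range, without requiring large amounts of data for fitting, which reduces model training cost while improving the model's recognition capability.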
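The data flow of steps S101 to S107 can be summarized in a short sketch. Every stage is passed in as a callable because the source does not specify concrete components; the function and parameter names are hypothetical, and the lambdas in the usage example are toy stand-ins rather than real models.

```python
def recognize_address(audio, speech_to_text, extract_address_text,
                      embed, expand, ner_model):
    """Chain the stages of the method; each stage is supplied by the caller."""
    qa_text = speech_to_text(audio)                      # S102: audio -> Q&A text
    address_text = extract_address_text(qa_text)         # S103: Q&A text -> address text
    address_vectors = embed(address_text)                # S104: Embedding layer
    expanded_vectors = expand(qa_text, address_vectors)  # S105: CNN feature expansion
    # S106: both vectors go to the NER model; returning the result is S107.
    return ner_model(address_vectors, expanded_vectors)


if __name__ == "__main__":
    result = recognize_address(
        b"raw-audio",                                    # S101: received audio
        speech_to_text=lambda a: "I live in Nanshan District",
        extract_address_text=lambda t: t.split()[-2:],
        embed=lambda words: [[float(len(w))] for w in words],
        expand=lambda text, vecs: [[v[0] * 2] for v in vecs],
        ner_model=lambda v, e: {"vectors": v, "expanded": e},
    )
    print(result)
```

Note that the CNN stage receives both the full Q&A text and the address vectors, while the NER stage receives the address vectors together with the expanded vectors, matching the inputs named in steps S105 and S106.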
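Referring next to FIG. 2, a flowchart of one specific implementation of step S103 in FIG. 1 is shown. For ease of description, only the parts related to this application are shown.
In some optional implementations of this embodiment, step S103 specifically includes step S201 and step S202.
Step S201: perform a word segmentation operation on the Q&A text information to obtain a plurality of words.
In this embodiment, word segmentation may be based on string matching, as in mechanical segmentation methods: the string is scanned, and a substring that is identical to a word in the dictionary counts as a match. Segmentation of this kind usually adds heuristic rules such as "forward/backward maximum matching" and "long word first". A second class of methods is based on statistics and machine learning. These methods model Chinese using manually annotated parts of speech and statistical features: model parameters are trained on observed data (an annotated corpus), and at segmentation time the model computes the probability of each possible segmentation and takes the most probable one as the final result, yielding the individual items of address text information. In some embodiments, "address text information" is a collective term for all the resulting words and does not necessarily denote only the main words of the Q&A text information.
Step S202: filter the words against a stop-word list to obtain the filtered address text information.
In this embodiment, after the Q&A text information is segmented, the resulting address text information can also be filtered against a stop-word list to remove unimportant words (also called stop words), such as "啊" and "哦".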
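Steps S201 and S202 can be sketched with forward maximum matching followed by stop-word removal. The dictionary, the stop-word list, and the four-character window are illustrative assumptions; a real system would load much larger resources or use a statistical segmenter.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text greedily, always trying the longest dictionary word first."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so that scanning advances.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def remove_stop_words(words, stop_words):
    """Drop the words that appear in the stop-word list."""
    return [w for w in words if w not in stop_words]


# Hypothetical resources for illustration only.
DICTIONARY = {"我", "住在", "深圳", "南山区"}
STOP_WORDS = {"啊", "哦"}

if __name__ == "__main__":
    segmented = forward_max_match("哦我住在深圳南山区啊", DICTIONARY)
    print(segmented)
    print(remove_stop_words(segmented, STOP_WORDS))
```

Trying the longest candidate first is what implements the "long word first" heuristic; backward maximum matching would scan from the end of the string instead.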
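Referring next to FIG. 3, a flowchart of another specific implementation of step S103 in FIG. 1 is shown. For ease of description, only the parts related to this application are shown.
In some optional implementations of this embodiment, step S103 specifically includes steps S301, S302, S303, S304, and S305.
Step S301: perform a word segmentation operation on the Q&A text information to obtain a plurality of words.
In this embodiment, word segmentation may be based on string matching, as in mechanical segmentation methods: the string is scanned, and a substring that is identical to a word in the dictionary counts as a match. Segmentation of this kind usually adds heuristic rules such as "forward/backward maximum matching" and "long word first". A second class of methods is based on statistics and machine learning. These methods model Chinese using manually annotated parts of speech and statistical features: model parameters are trained on observed data (an annotated corpus), and at segmentation time the model computes the probability of each possible segmentation and takes the most probable one as the final result, yielding the individual items of address text information. In some embodiments, "address text information" is a collective term for all the resulting words and does not necessarily denote only the main words of the Q&A text information.
Step S302: filter the words against a stop-word list to obtain the candidate words.
In this embodiment, after the Q&A text information is segmented, the resulting address text information can also be filtered against a stop-word list to remove unimportant words (also called stop words), such as "啊" and "哦".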
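Step S303: compute the first word frequency of each candidate word in the Q&A text information.
In this embodiment, the more often a word occurs in a passage of text, the more likely it is to be a stop word; the first word frequency is mainly used to judge whether a candidate word is a stop word.
Step S304: read a local corpus and compute the second word frequency of each candidate word in the local corpus.
In this embodiment, some words occur frequently and are also important. To prevent such words from being classified as stop words on the basis of the first word frequency alone, a second frequency is introduced in addition to the number of occurrences in the current text, namely the word's frequency in a corpus, which serves as the second word frequency.
In this embodiment, a corpus is defined to simulate the environment in which the language is used, and the second word frequency is computed from it. Specifically, it can be computed by formula (1):
$$K_2 = \log\frac{n}{m+1} \tag{1}$$
Here $K_2$ is the second word frequency, $n$ is the total number of documents in the corpus, and $m$ is the number of documents that contain the word. The more common a word is, the closer $K_2$ is to 0. The denominator is increased by 1 to avoid a zero denominator, which would otherwise occur when no document in the corpus contains the word. If a word such as "任我行" appears in the input text but is rare in the corpus, so that its $K_2$ is large, the word is probably important in the current input text and is very likely one of its backbone words. Specifically, the product $K_1 \times K_2$, where $K_1$ is the first word frequency, is used to indicate whether a word is likely to be a backbone word. This gives more accurate backbone words, which both reduces the amount of later computation on words and improves the accuracy of entity recognition. Extracting backbone words automatically in this way is also simple, fast, and consistent with actual usage.
Step S305: filter the candidate words according to the product of the first and second word frequencies to obtain the address text information.
In this embodiment, after the Q&A text information is segmented, the words are filtered against the stop-word list using regular expressions to obtain the candidate words. The first word frequency of each candidate word in the Q&A text information is then computed, the second word frequency of each candidate word in the corpus is obtained, and finally the candidate words are filtered according to the product of the first and second word frequencies to obtain the filtered address text information.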
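Steps S303 to S305 can be sketched as a small TF-IDF style filter. The sketch assumes that formula (1) takes the common inverse-document-frequency form K2 = log(n / (m + 1)), which matches the symbol definitions in the text, and that the first word frequency K1 is the word's count divided by the number of words in the text. The toy corpus and the threshold are made up for illustration.

```python
import math
from collections import Counter


def first_word_frequency(words):
    """K1: relative frequency of each word within the Q&A text (assumed form)."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def second_word_frequency(word, corpus_documents):
    """K2 = log(n / (m + 1)): n documents in total, m of them contain the word."""
    n = len(corpus_documents)
    m = sum(1 for doc in corpus_documents if word in doc)
    return math.log(n / (m + 1))


def select_backbone_words(words, corpus_documents, threshold):
    """Keep the words whose K1 * K2 score exceeds the threshold."""
    k1 = first_word_frequency(words)
    return [
        word for word in k1
        if k1[word] * second_word_frequency(word, corpus_documents) > threshold
    ]


# Hypothetical corpus: each document is represented as a set of words.
CORPUS = [
    {"我", "住在", "北京"},
    {"我", "喜欢", "茶"},
    {"我", "住在", "上海"},
    {"天气", "好"},
]

if __name__ == "__main__":
    print(select_backbone_words(["我", "住在", "南山区"], CORPUS, threshold=0.2))
```

In this toy run the word that appears in most corpus documents scores 0, while the word absent from the corpus scores highest, which is the behavior the paragraph above describes for backbone words.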
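Referring next to FIG. 4, a flowchart of obtaining the trained named entity recognition model provided in Embodiment 1 of this application is shown. For ease of description, only the parts related to this application are shown.
In some optional implementations of this embodiment, before step S106, the method further includes step S401 and step S402.
Step S401: obtain an initial training set and a to-be-recognized data set.
In this embodiment, the initial training set is the data set obtained by preprocessing a labeled data set as follows: split the text in the labeled data set into sentences according to a sentence-splitting rule; segment each sentence according to a preset vocabulary to obtain a sentence composed of a plurality of words, each word carrying a label; query a word dictionary and a label dictionary to obtain the word ID and label ID of each word, so that each sentence is represented as word IDs and label IDs; and pad or truncate the sentences so that all sentences have the specified length. The to-be-recognized data set is the data set obtained by preprocessing an unlabeled data set as follows: split the text in the unlabeled data set into sentences according to the sentence-splitting rule; segment each sentence according to the preset vocabulary to obtain a sentence composed of a plurality of words; query the word dictionary to obtain the word ID of each word, so that each sentence is represented as word IDs; and pad or truncate the sentences so that all sentences have the specified length. Splitting into sentences according to the sentence-splitting rule may be done by matching with regular expressions.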
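Regular-expression sentence splitting might look like the sketch below. The source does not give the actual rule, so the set of sentence terminators (Chinese and Western full stops, question marks, and exclamation marks) is an assumption.

```python
import re

# Assumed rule: a sentence is a run of non-terminator characters,
# optionally followed by one sentence-ending punctuation mark.
_SENTENCE = re.compile(r"[^。!?!?.]+[。!?!?.]?")


def split_sentences(text):
    """Split text into sentences, keeping each terminator with its sentence."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("今天下雨。你住在哪里?我住在南山区"):
        print(sentence)
```

Keeping the terminator attached to its sentence preserves the original text, so the sentences can be concatenated back if needed.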
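Step S402: perform multiple rounds of training on an initial named entity recognition model based on the initial training set and the to-be-recognized data set until the model converges, obtaining the trained named entity recognition model. Each round of training includes: performing supervised training on the initial named entity recognition model using the current round's training set to obtain a supervised-trained initial named entity recognition model; using the supervised-trained initial named entity recognition model to label named entities in the to-be-recognized data set, obtaining a weakly labeled to-be-recognized data set; and extracting a subset from the weakly labeled to-be-recognized data set obtained in the current round and combining the subset with the initial training set to form the training set for the next round.
In this embodiment, the weak labels that the named entity recognition model assigns to the to-be-recognized data set during training are treated as that data set's labeling result, and a subset of them is combined with the initial training set to form the training set for the next round. The size of the to-be-recognized data set can be set as needed, so a to-be-recognized data set of that size is used to enlarge the training set of the named entity recognition model. As a result, the final named entity recognition model generalizes better and recognizes entities in the to-be-recognized data set more accurately.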
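The multi-round procedure is a form of self-training, and its control flow can be sketched independently of any particular model. In the sketch below the training, labeling, and subset-selection functions are supplied by the caller, and "the training set stopped changing" is used as a stand-in convergence test; both choices are assumptions, since the source does not define the convergence criterion or how the subset is chosen.

```python
def self_train(initial_train_set, unlabeled_set, train_fn, label_fn,
               select_fn, max_rounds=5):
    """Run self-training rounds and return the last model and training set."""
    train_set = list(initial_train_set)
    model = None
    for _ in range(max_rounds):
        # Supervised training on this round's training set.
        model = train_fn(train_set)
        # Weakly label every item of the to-be-recognized data set.
        weakly_labeled = [(x, label_fn(model, x)) for x in unlabeled_set]
        # Keep a subset of the weak labels and add it to the initial set.
        next_train_set = list(initial_train_set) + select_fn(weakly_labeled)
        if next_train_set == train_set:  # stand-in convergence check
            break
        train_set = next_train_set
    return model, train_set


if __name__ == "__main__":
    # Toy stand-ins: the "model" is a lookup table from word to label.
    model, final_set = self_train(
        initial_train_set=[("深圳", "B-LOC")],
        unlabeled_set=["深圳", "你好"],
        train_fn=lambda pairs: dict(pairs),
        label_fn=lambda m, x: m.get(x, "O"),
        select_fn=lambda weak: [p for p in weak if p[1] != "O"],
    )
    print(model, final_set)
```

The initial training set is re-used in every round, so errors in earlier weak labels do not accumulate beyond the subset selected in the current round.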
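Referring next to FIG. 5, a flowchart of one specific implementation of step S401 in FIG. 4 is shown. For ease of description, only the parts related to this application are shown.
In some optional implementations of this embodiment, step S401 specifically includes steps S501, S502, S503, S504, S505, S506, S507, S508, and S509.
Step S501: read a local database, and obtain a pre-labeled data set and an unlabeled data set from the local database.
In this embodiment, the initial training set is the data set obtained by preprocessing the labeled data set as described below, and the to-be-recognized data set is the data set obtained by preprocessing the unlabeled data set as described below.
Step S502: split the text in the pre-labeled data set into sentences according to a sentence-splitting rule to obtain a plurality of pre-labeled sentences.
Step S503: perform word segmentation on each pre-labeled sentence based on a preset vocabulary to obtain pre-labeled sentences each composed of a plurality of words, each word carrying label information.
In this embodiment, the vocabulary may be the vocabulary of the BERT model pre-trained by Google.
Step S504: query a word dictionary and a label dictionary to obtain the word ID and label ID of each word, so that the pre-labeled sentences are represented as word IDs and label IDs.
In this embodiment, the word dictionary and label dictionary may be those corresponding to the BERT model pre-trained by Google. Every word in the word dictionary has a corresponding word ID. The word dictionary also contains a word ID for unknown words: if a word is looked up in the word dictionary but is not recorded there, the lookup returns the word ID for unknown words. Every label in the label dictionary has a corresponding label ID.
Step S505: perform a length unification operation on the pre-labeled sentences to obtain the initial training set.
In this embodiment, the length unification operation pads or truncates each sentence to a specified length, which is the maximum sentence length and is generally set to 128, meaning a sentence contains at most 128 words. For example, a sentence with fewer than 128 words is padded with 0 at the end up to 128 words, and a sentence with more than 128 words is truncated at the point where it exceeds that length.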
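The ID lookup of step S504 and the length unification of step S505 can be sketched together. The padding value 0 and the length 128 come from the text above; the unknown-word ID of 1 and the tiny dictionary are hypothetical placeholders.

```python
MAX_LEN = 128  # maximum sentence length stated in the text
PAD_ID = 0     # sentences are padded with 0, as stated in the text
UNK_ID = 1     # hypothetical ID reserved for unknown words


def encode_sentence(words, word_to_id, max_len=MAX_LEN):
    """Map words to IDs, then pad or truncate the result to max_len."""
    ids = [word_to_id.get(word, UNK_ID) for word in words]  # unknown -> UNK_ID
    ids = ids[:max_len]                                     # truncate the excess
    return ids + [PAD_ID] * (max_len - len(ids))            # pad the shortfall


if __name__ == "__main__":
    vocab = {"我": 2, "住在": 3}  # hypothetical word dictionary
    print(encode_sentence(["我", "住在", "火星"], vocab, max_len=5))
```

The same function works for labeled sentences if it is called a second time with the label dictionary, as long as the padding label is chosen to match the model's expectations.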
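Step S506: split the text in the unlabeled data set into sentences according to the sentence-splitting rule to obtain a plurality of unlabeled sentences.
Step S507: perform word segmentation on each unlabeled sentence based on the preset vocabulary to obtain unlabeled sentences each composed of a plurality of words.
Step S508: convert the unlabeled sentences into a representation in the form of word IDs based on the word dictionary.
Step S509: perform the length unification operation on the unlabeled sentences to obtain the to-be-recognized data set.
Referring next to FIG. 6, a flowchart of one specific implementation of step S402 in FIG. 4 is shown. For ease of description, only the parts related to this application are shown.
In some optional implementations of this embodiment, step S402 specifically includes steps S601, S602, S603, and S604.
Step S601: input the sentences of the current round's data set into the BERT layer of the BERT-CRF model in the named entity recognition model to obtain the encoding vectors of the words in those sentences.
Step S602: input the encoding vectors into the CRF layer of the BERT-CRF model to obtain, for each sentence of the current round, a probability matrix composed of the probability sequences of all labels corresponding to all words in the sentence.
Step S603: obtain the optimal label sequence of each sentence's probability matrix based on the Viterbi algorithm.
Step S604: obtain the recognized label IDs of the words from the optimal label sequence, and adjust the parameters of the BERT-CRF model in the named entity recognition model based on the recognized label IDs of the words and the label IDs of the words in the labeled data set.
In this embodiment, the prior art solves the sequence labeling problem with a BERT layer followed by a fully connected layer. In named entity recognition, the encoding vectors produced by the BERT layer are mapped to the label set by the fully connected layer, and the output vector of each word is then passed through Softmax, so that the value in each dimension represents the probability that the word belongs to a given class; the loss can be computed and the model trained from these values. This application instead replaces the fully connected layer with a CRF layer, and the BERT-CRF model better captures the structural dependencies between labels. The BERT-CRF model consists of a BERT layer and a CRF layer connected in sequence. The words of a sentence are input into the BERT layer to obtain encoding vectors, which serve as the input of the CRF layer. The CRF layer produces a probability matrix composed of the probability sequences of all labels corresponding to the words, and the probability matrix is then decoded with the Viterbi algorithm to obtain the optimal label sequence, which contains the label corresponding to each word.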
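The Viterbi decoding step can be sketched in plain Python. The sketch works on additive scores (for example, log-probabilities) and assumes that a label-to-label transition matrix is available, as is usual for a CRF layer; the example scores are invented, and BERT itself is not part of the sketch.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions[t][k]: score of label k at position t.
    transitions[j][k]: score of moving from label j to label k.
    """
    num_labels = len(emissions[0])
    scores = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for k in range(num_labels):
            # Best previous label for ending at label k in position t.
            best_prev = max(
                range(num_labels), key=lambda j: scores[j] + transitions[j][k]
            )
            new_scores.append(
                scores[best_prev] + transitions[best_prev][k] + emissions[t][k]
            )
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(num_labels), key=lambda k: scores[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    # Labels: 0 = O, 1 = B-LOC, 2 = I-LOC. The transition O -> I-LOC is penalized.
    emissions = [[2.0, 1.0, 0.0], [0.0, 0.0, 1.5]]
    transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(viterbi(emissions, transitions))
```

In the example, choosing the best label at each position independently would give O followed by I-LOC, an invalid sequence; the transition scores steer the decoder to B-LOC followed by I-LOC, which illustrates the structural dependencies between labels that the CRF layer captures.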
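In summary, the address recognition method based on named entity recognition provided by this application includes: receiving Q&A audio data sent by an audio acquisition device; performing a speech recognition operation on the Q&A audio data to obtain Q&A text information; performing an address text extraction operation on the Q&A text information to obtain address text information; inputting the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector; inputting the Q&A text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector; inputting the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result; and outputting the target address result. During human-machine Q&A, after the audio of the user's reply is acquired, the audio is converted into text and then into a Q&A text vector. The Q&A text vector is input into the CNN model, which combines each token's own feature information with the feature information of the phrase that follows it to obtain the expanded text vector. Finally, the Q&A text vector and the expanded text vector are input into the trained named entity recognition model for named entity recognition to obtain the target address result. Because the expanded text vector combines each token's features with those of its following phrase, it improves the model's ability to generalize when extracting entities whose suffixes fall within a specific range, without requiring large amounts of data for fitting, which reduces model training cost while improving the model's recognition capability.
It should be emphasized that, to further ensure the privacy and security of the Q&A audio data, the Q&A audio data may also be stored in a node of a blockchain.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
This application can be used in many general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices. This application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. This application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
A person of ordinary skill in the art can understand that all or part of the processes of the above method embodiments can be completed by computer-readable instructions that direct the relevant hardware. The computer-readable instructions may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM), or it may be a random access memory (RAM) or the like.
It should be understood that although the steps in the flowcharts of the drawings are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict restriction on the order in which these steps are performed, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be performed at different times, and they are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.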
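Embodiment 2
Referring further to FIG. 7, as an implementation of the method shown in FIG. 1, this application provides an embodiment of an address recognition apparatus based on named entity recognition. This apparatus embodiment corresponds to the method embodiment shown in FIG. 1, and the apparatus can be applied in various electronic devices.
As shown in FIG. 7, the address recognition apparatus 100 based on named entity recognition of this embodiment includes an audio acquisition module 110, a speech recognition module 120, an address text extraction module 130, a vector conversion module 140, a feature expansion module 150, an entity recognition module 160, and a result output module 170, where:
the audio acquisition module 110 is configured to receive Q&A audio data sent by an audio acquisition device;
the speech recognition module 120 is configured to perform a speech recognition operation on the Q&A audio data to obtain Q&A text information;
the address text extraction module 130 is configured to perform an address text extraction operation on the Q&A text information to obtain address text information;
the vector conversion module 140 is configured to input the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector;
the feature expansion module 150 is configured to input the Q&A text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector;
the entity recognition module 160 is configured to input the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result;
the result output module 170 is configured to output the target address result.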
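In this embodiment, the Q&A audio data refers to a waveform file obtained by converting the audio signal of a telephone call into a waveform signal.
In this embodiment, the Q&A audio data can be obtained by importing the audio signal captured by a microphone, telephone, or other device into a computer through the computer's digital audio interface and recording it.
In this embodiment, the speech recognition operation converts the captured Q&A audio data into text data. Specifically, the operation can be implemented by pattern matching: in the training stage, the user speaks each word in the vocabulary once, and the feature vector of each word is stored in a template library as a template; in the recognition stage, the feature vector of the input speech is compared for similarity with each template in the library in turn, and the template with the highest similarity is output as the recognition result.
In this embodiment, the recognized Q&A text information can be separated according to the waveform characteristics of the customer service agent and of the user, and the text can be presented in question-then-answer form, so as to obtain the agent's Q&A text information and the user's Q&A text information.
In this embodiment, to capture address words that may appear in the Q&A text information, an address text extraction operation is performed on the Q&A text information to obtain the address text information.
In this embodiment, the address text extraction operation may consist of performing word segmentation on the Q&A text information to obtain a plurality of words, and filtering the words against a stop-word list to obtain the filtered address text information.
In this embodiment, the address text extraction operation may alternatively consist of performing word segmentation on the Q&A text information to obtain a plurality of words; filtering the words against a stop-word list to obtain candidate words; computing the first word frequency of each candidate word in the Q&A text information; reading a local corpus and computing the second word frequency of each candidate word in the local corpus; and filtering the candidate words according to the product of the first and second word frequencies to obtain the address text information.
In this embodiment, the vector conversion operation refers to inputting the Q&A text information into the Embedding layer for vector conversion to obtain the Q&A text vector.
In this embodiment, the CNN uses a sliding window to expand the obtained Q&A text vector, that is, to add context feature information, yielding an expanded text vector enriched with context features.
In this embodiment, the expanded text vector enriched with context features is combined with the original Q&A text vector and input into the trained named entity recognition model. Combining the expanded text vector produced by the CNN model with the Q&A text vector produced by vector conversion adds context feature information and improves the trained model's ability to generalize when extracting entities whose suffixes fall within a specific range, especially long-tail address entities (for example, *** Mongol Autonomous County): through the CNN sliding window, more context information for a long-tail suffix such as "Mongol Autonomous County" is passed to the downstream network layers for parameter learning, which improves the model's generalization ability.
In this embodiment, the NER model extracts the *** District stated in the customer's reply, and the address is then looked up in a national address library using fuzzy matching on characters and pronunciation to determine whether the stated address actually exists and what its administrative level is. If the administrative level is district (county), the city to which the address belongs is retrieved, and the district- (county-) level address in the customer's reply text is replaced with that city, completing the preprocessing of the text.
In this embodiment, the address recognition apparatus based on named entity recognition includes: an audio acquisition module, configured to receive Q&A audio data sent by an audio acquisition device; a speech recognition module, configured to perform a speech recognition operation on the Q&A audio data to obtain Q&A text information; an address text extraction module, configured to perform an address text extraction operation on the Q&A text information to obtain address text information; a vector conversion module, configured to input the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector; a feature expansion module, configured to input the Q&A text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector; an entity recognition module, configured to input the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result; and a result output module, configured to output the target address result. During human-machine Q&A, after the audio of the user's reply is acquired, the audio is converted into text and then into a Q&A text vector. The Q&A text vector is input into the CNN model, which combines each token's own feature information with the feature information of the phrase that follows it to obtain the expanded text vector. Finally, the Q&A text vector and the expanded text vector are input into the trained named entity recognition model for named entity recognition to obtain the target address result. Because the expanded text vector combines each token's features with those of its following phrase, it improves the model's ability to generalize when extracting entities whose suffixes fall within a specific range, without requiring large amounts of data for fitting, which reduces model training cost while improving the model's recognition capability.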
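Referring next to FIG. 8, a schematic structural diagram of one specific implementation of the address text extraction module 130 in FIG. 7 is shown. For ease of description, only the parts related to this application are shown.
In some optional implementations of this embodiment, the address text extraction module 130 includes a first word segmentation submodule 131 and a first filtering submodule 132, where:
the first word segmentation submodule 131 is configured to perform a word segmentation operation on the Q&A text information to obtain a plurality of words;
the first filtering submodule 132 is configured to filter the words against a stop-word list to obtain the filtered address text information.
In this embodiment, word segmentation may be based on string matching, as in mechanical segmentation methods: the string is scanned, and a substring that is identical to a word in the dictionary counts as a match. Segmentation of this kind usually adds heuristic rules such as "forward/backward maximum matching" and "long word first". A second class of methods is based on statistics and machine learning. These methods model Chinese using manually annotated parts of speech and statistical features: model parameters are trained on observed data (an annotated corpus), and at segmentation time the model computes the probability of each possible segmentation and takes the most probable one as the final result, yielding the individual items of address text information. In some embodiments, "address text information" is a collective term for all the resulting words and does not necessarily denote only the main words of the Q&A text information.
In this embodiment, after the Q&A text information is segmented, the resulting address text information can also be filtered against a stop-word list to remove unimportant words (also called stop words), such as "啊" and "哦".
In some optional implementations of this embodiment, the address text extraction module 130 includes a second word segmentation submodule, a second filtering submodule, a first word frequency calculation submodule, a second word frequency calculation submodule, and a third filtering submodule, where:
the second word segmentation submodule is configured to perform a word segmentation operation on the Q&A text information to obtain a plurality of words;
the second filtering submodule is configured to filter the words against a stop-word list to obtain candidate words;
the first word frequency calculation submodule is configured to compute the first word frequency of each candidate word in the Q&A text information;
the second word frequency calculation submodule is configured to read a local corpus and compute the second word frequency of each candidate word in the local corpus;
the third filtering submodule is configured to filter the candidate words according to the product of the first and second word frequencies to obtain the address text information.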
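In some optional implementations of this embodiment, the address recognition apparatus 100 based on named entity recognition further includes a training data acquisition module and a multi-round training module, where:
the training data acquisition module is configured to obtain an initial training set and a to-be-recognized data set;
the multi-round training module is configured to perform multiple rounds of training on an initial named entity recognition model based on the initial training set and the to-be-recognized data set until the model converges, obtaining the trained named entity recognition model, where each round of training includes: performing supervised training on the initial named entity recognition model using the current round's training set to obtain a supervised-trained initial named entity recognition model; using the supervised-trained initial named entity recognition model to label named entities in the to-be-recognized data set, obtaining a weakly labeled to-be-recognized data set; and extracting a subset from the weakly labeled to-be-recognized data set obtained in the current round and combining the subset with the initial training set to form the training set for the next round.
In some optional implementations of this embodiment, the training data acquisition module includes a training data acquisition submodule, a first sentence-splitting submodule, a third word segmentation submodule, a first sentence conversion submodule, a first length unification submodule, a second sentence-splitting submodule, a fourth word segmentation submodule, a second sentence conversion submodule, and a second length unification submodule, where:
the training data acquisition submodule is configured to read a local database and obtain a pre-labeled data set and an unlabeled data set from the local database;
the first sentence-splitting submodule is configured to split the text in the pre-labeled data set into sentences according to a sentence-splitting rule to obtain a plurality of pre-labeled sentences;
the third word segmentation submodule is configured to perform word segmentation on each pre-labeled sentence based on a preset vocabulary to obtain pre-labeled sentences each composed of a plurality of words, each word carrying label information;
the first sentence conversion submodule is configured to query a word dictionary and a label dictionary to obtain the word ID and label ID of each word, so that the pre-labeled sentences are represented as word IDs and label IDs;
the first length unification submodule is configured to perform a length unification operation on the pre-labeled sentences to obtain the initial training set;
the second sentence-splitting submodule is configured to split the text in the unlabeled data set into sentences according to the sentence-splitting rule to obtain a plurality of unlabeled sentences;
the fourth word segmentation submodule is configured to perform word segmentation on each unlabeled sentence based on the preset vocabulary to obtain unlabeled sentences each composed of a plurality of words;
the second sentence conversion submodule is configured to convert the unlabeled sentences into a representation in the form of word IDs based on the word dictionary;
the second length unification submodule is configured to perform the length unification operation on the unlabeled sentences to obtain the to-be-recognized data set.
In some optional implementations of this embodiment, the multi-round training module specifically includes a data input submodule, a probability matrix composition submodule, an optimal sequence acquisition submodule, and a parameter adjustment submodule, where:
the data input submodule is configured to input the sentences of the current round's data set into the BERT layer of the BERT-CRF model in the named entity recognition model to obtain the encoding vectors of the words in those sentences;
the probability matrix composition submodule is configured to input the encoding vectors into the CRF layer of the BERT-CRF model to obtain, for each sentence of the current round, a probability matrix composed of the probability sequences of all labels corresponding to all words in the sentence;
the optimal sequence acquisition submodule is configured to obtain the optimal label sequence of each sentence's probability matrix based on the Viterbi algorithm;
the parameter adjustment submodule is configured to obtain the recognized label IDs of the words from the optimal label sequence, and to adjust the parameters of the BERT-CRF model in the named entity recognition model based on the recognized label IDs of the words and the label IDs of the words in the labeled data set.
In summary, the address recognition apparatus based on named entity recognition provided by this application includes: an audio acquisition module, configured to receive Q&A audio data sent by an audio acquisition device; a speech recognition module, configured to perform a speech recognition operation on the Q&A audio data to obtain Q&A text information; an address text extraction module, configured to perform an address text extraction operation on the Q&A text information to obtain address text information; a vector conversion module, configured to input the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector; a feature expansion module, configured to input the Q&A text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector; an entity recognition module, configured to input the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result; and a result output module, configured to output the target address result. During human-machine Q&A, after the audio of the user's reply is acquired, the audio is converted into text and then into a Q&A text vector. The Q&A text vector is input into the CNN model, which combines each token's own feature information with the feature information of the phrase that follows it to obtain the expanded text vector. Finally, the Q&A text vector and the expanded text vector are input into the trained named entity recognition model for named entity recognition to obtain the target address result. Because the expanded text vector combines each token's features with those of its following phrase, it improves the model's ability to generalize when extracting entities whose suffixes fall within a specific range, without requiring large amounts of data for fitting, which reduces model training cost while improving the model's recognition capability.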
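To solve the above technical problem, an embodiment of this application further provides a computer device. Refer to FIG. 9, which is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 200 includes a memory 210, a processor 220, and a network interface 230 that are communicatively connected to one another via a system bus. Note that the figure shows only the computer device 200 with components 210 to 230; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions, and that its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device can interact with a user through a keyboard, mouse, remote control, touchpad, voice control device, or similar means.
The memory 210 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical discs, and the like. The computer-readable storage medium may be non-volatile or volatile. In some embodiments, the memory 210 may be an internal storage unit of the computer device 200, such as its hard disk or internal memory. In other embodiments, the memory 210 may be an external storage device of the computer device 200, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device 200. The memory 210 may also include both an internal storage unit of the computer device 200 and an external storage device. In this embodiment, the memory 210 is generally used to store the operating system and the various application software installed on the computer device 200, such as the computer-readable instructions of the address recognition method based on named entity recognition. The memory 210 can also be used to temporarily store various data that has been output or is about to be output.
In some embodiments, the processor 220 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 220 is generally used to control the overall operation of the computer device 200. In this embodiment, the processor 220 is used to run the computer-readable instructions stored in the memory 210 or to process data, for example to run the computer-readable instructions of the address recognition method based on named entity recognition.
The network interface 230 may include a wireless network interface or a wired network interface, and is generally used to establish communication connections between the computer device 200 and other electronic devices.
In the address recognition method based on named entity recognition provided by this application, during human-machine Q&A, after the audio of the user's reply is acquired, the audio is converted into text and then into a Q&A text vector. The Q&A text vector is input into the CNN model, which combines each token's own feature information with the feature information of the phrase that follows it to obtain the expanded text vector. Finally, the Q&A text vector and the expanded text vector are input into the trained named entity recognition model for named entity recognition to obtain the target address result. Because the expanded text vector combines each token's features with those of its following phrase, it improves the model's ability to generalize when extracting entities whose suffixes fall within a specific range, without requiring large amounts of data for fitting, which reduces model training cost while improving the model's recognition capability.
This application also provides another implementation, namely a computer-readable storage medium that stores computer-readable instructions executable by at least one processor, so that the at least one processor performs the steps of the address recognition method based on named entity recognition described above.
In the address recognition method based on named entity recognition provided by this application, during human-machine Q&A, after the audio of the user's reply is acquired, the audio is converted into text and then into a Q&A text vector. The Q&A text vector is input into the CNN model, which combines each token's own feature information with the feature information of the phrase that follows it to obtain the expanded text vector. Finally, the Q&A text vector and the expanded text vector are input into the trained named entity recognition model for named entity recognition to obtain the target address result. Because the expanded text vector combines each token's features with those of its following phrase, it improves the model's ability to generalize when extracting entities whose suffixes fall within a specific range, without requiring large amounts of data for fitting, which reduces model training cost while improving the model's recognition capability.
From the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of this application.
Obviously, the embodiments described above are only some of the embodiments of this application, not all of them. The drawings show preferred embodiments of this application but do not limit its patent scope. This application can be implemented in many different forms; these embodiments are provided so that the disclosure of this application will be understood more thoroughly and comprehensively. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific implementations or replace some of their technical features with equivalents. Any equivalent structure made using the contents of the specification and drawings of this application, and applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.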

Claims (20)

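  1. An address recognition method based on named entity recognition, comprising the following steps:
    receiving Q&A audio data sent by an audio acquisition device;
    performing a speech recognition operation on the Q&A audio data to obtain Q&A text information;
    performing an address text extraction operation on the Q&A text information to obtain address text information;
    inputting the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector;
    inputting the Q&A text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector;
    inputting the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result;
    outputting the target address result.
  2. The address recognition method based on named entity recognition according to claim 1, wherein the step of performing an address text extraction operation on the Q&A text information to obtain address text information specifically comprises:
    performing a word segmentation operation on the Q&A text information to obtain a plurality of words;
    filtering the words against a stop-word list to obtain the filtered address text information.
  3. The address recognition method based on named entity recognition according to claim 1, wherein the step of performing an address text extraction operation on the Q&A text information to obtain address text information specifically comprises:
    performing a word segmentation operation on the Q&A text information to obtain a plurality of words;
    filtering the words against a stop-word list to obtain candidate words;
    computing a first word frequency of each candidate word in the Q&A text information;
    reading a local corpus, and computing a second word frequency of each candidate word in the local corpus;
    filtering the candidate words according to the product of the first word frequency and the second word frequency to obtain the address text information.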
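  4. The address recognition method based on named entity recognition according to claim 1, wherein, before the step of inputting the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result, the method further comprises:
    obtaining an initial training set and a to-be-recognized data set;
    performing multiple rounds of training on an initial named entity recognition model based on the initial training set and the to-be-recognized data set until the model converges, to obtain the trained named entity recognition model; wherein each round of training comprises: performing supervised training on the initial named entity recognition model using the current round's training set to obtain a supervised-trained initial named entity recognition model; using the supervised-trained initial named entity recognition model to label named entities in the to-be-recognized data set to obtain a weakly labeled to-be-recognized data set; and extracting a subset from the weakly labeled to-be-recognized data set obtained in the current round, and combining the subset with the initial training set to form the training set for the next round.
  5. The address recognition method based on named entity recognition according to claim 4, wherein the step of obtaining an initial training set and a to-be-recognized data set specifically comprises:
    reading a local database, and obtaining a pre-labeled data set and an unlabeled data set from the local database;
    splitting the text in the pre-labeled data set into sentences according to a sentence-splitting rule to obtain a plurality of pre-labeled sentences;
    performing a word segmentation operation on each pre-labeled sentence based on a preset vocabulary to obtain pre-labeled sentences each composed of a plurality of words, each word carrying label information;
    querying a word dictionary and a label dictionary to obtain the word ID and label ID of each word, so as to represent the pre-labeled sentences as word IDs and label IDs;
    performing a length unification operation on the pre-labeled sentences to obtain the initial training set;
    splitting the text in the unlabeled data set into sentences according to the sentence-splitting rule to obtain a plurality of unlabeled sentences;
    performing the word segmentation operation on each unlabeled sentence based on the preset vocabulary to obtain unlabeled sentences each composed of a plurality of words;
    converting the unlabeled sentences into a representation in the form of word IDs based on the word dictionary;
    performing the length unification operation on the unlabeled sentences to obtain the to-be-recognized data set.
  6. The address recognition method based on named entity recognition according to claim 4, wherein the step of performing multiple rounds of training on an initial named entity recognition model based on the initial training set and the to-be-recognized data set until the model converges, to obtain the trained named entity recognition model, specifically comprises:
    inputting the sentences of the current round's data set into the BERT layer of a BERT-CRF model in the named entity recognition model to obtain encoding vectors of the words in those sentences;
    inputting the encoding vectors into the CRF layer of the BERT-CRF model to obtain, for each sentence of the current round, a probability matrix composed of the probability sequences of all labels corresponding to all words in the sentence;
    obtaining the optimal label sequence of each sentence's probability matrix based on the Viterbi algorithm;
    obtaining the recognized label IDs of the words from the optimal label sequence, and adjusting the parameters of the BERT-CRF model in the named entity recognition model based on the recognized label IDs of the words and the label IDs of the words in the labeled data set.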
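  7. An address recognition apparatus based on named entity recognition, comprising:
    an audio acquisition module, configured to receive Q&A audio data sent by an audio acquisition device;
    a speech recognition module, configured to perform a speech recognition operation on the Q&A audio data to obtain Q&A text information;
    an address text extraction module, configured to perform an address text extraction operation on the Q&A text information to obtain address text information;
    a vector conversion module, configured to input the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector;
    a feature expansion module, configured to input the Q&A text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector;
    an entity recognition module, configured to input the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result;
    a result output module, configured to output the target address result.
  8. The address recognition apparatus based on named entity recognition according to claim 7, wherein the address text extraction module comprises:
    a first word segmentation submodule, configured to perform a word segmentation operation on the Q&A text information to obtain a plurality of words;
    a first filtering submodule, configured to filter the words against a stop-word list to obtain the filtered address text information.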
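  9. A computer device, comprising a memory and a processor, wherein the memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, implements the following steps of an address recognition method based on named entity recognition:
    receiving Q&A audio data sent by an audio acquisition device;
    performing a speech recognition operation on the Q&A audio data to obtain Q&A text information;
    performing an address text extraction operation on the Q&A text information to obtain address text information;
    inputting the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector;
    inputting the Q&A text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector;
    inputting the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result;
    outputting the target address result.
  10. The computer device according to claim 9, wherein the step of performing an address text extraction operation on the Q&A text information to obtain address text information specifically comprises:
    performing a word segmentation operation on the Q&A text information to obtain a plurality of words;
    filtering the words against a stop-word list to obtain the filtered address text information.
  11. The computer device according to claim 9, wherein the step of performing an address text extraction operation on the Q&A text information to obtain address text information specifically comprises:
    performing a word segmentation operation on the Q&A text information to obtain a plurality of words;
    filtering the words against a stop-word list to obtain candidate words;
    computing a first word frequency of each candidate word in the Q&A text information;
    reading a local corpus, and computing a second word frequency of each candidate word in the local corpus;
    filtering the candidate words according to the product of the first word frequency and the second word frequency to obtain the address text information.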
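  12. The computer device according to claim 9, wherein, before the step of inputting the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result, the steps further comprise:
    obtaining an initial training set and a to-be-recognized data set;
    performing multiple rounds of training on an initial named entity recognition model based on the initial training set and the to-be-recognized data set until the model converges, to obtain the trained named entity recognition model; wherein each round of training comprises: performing supervised training on the initial named entity recognition model using the current round's training set to obtain a supervised-trained initial named entity recognition model; using the supervised-trained initial named entity recognition model to label named entities in the to-be-recognized data set to obtain a weakly labeled to-be-recognized data set; and extracting a subset from the weakly labeled to-be-recognized data set obtained in the current round, and combining the subset with the initial training set to form the training set for the next round.
  13. The computer device according to claim 12, wherein the step of obtaining an initial training set and a to-be-recognized data set specifically comprises:
    reading a local database, and obtaining a pre-labeled data set and an unlabeled data set from the local database;
    splitting the text in the pre-labeled data set into sentences according to a sentence-splitting rule to obtain a plurality of pre-labeled sentences;
    performing a word segmentation operation on each pre-labeled sentence based on a preset vocabulary to obtain pre-labeled sentences each composed of a plurality of words, each word carrying label information;
    querying a word dictionary and a label dictionary to obtain the word ID and label ID of each word, so as to represent the pre-labeled sentences as word IDs and label IDs;
    performing a length unification operation on the pre-labeled sentences to obtain the initial training set;
    splitting the text in the unlabeled data set into sentences according to the sentence-splitting rule to obtain a plurality of unlabeled sentences;
    performing the word segmentation operation on each unlabeled sentence based on the preset vocabulary to obtain unlabeled sentences each composed of a plurality of words;
    converting the unlabeled sentences into a representation in the form of word IDs based on the word dictionary;
    performing the length unification operation on the unlabeled sentences to obtain the to-be-recognized data set.
  14. The computer device according to claim 12, wherein the step of performing multiple rounds of training on an initial named entity recognition model based on the initial training set and the to-be-recognized data set until the model converges, to obtain the trained named entity recognition model, specifically comprises:
    inputting the sentences of the current round's data set into the BERT layer of a BERT-CRF model in the named entity recognition model to obtain encoding vectors of the words in those sentences;
    inputting the encoding vectors into the CRF layer of the BERT-CRF model to obtain, for each sentence of the current round, a probability matrix composed of the probability sequences of all labels corresponding to all words in the sentence;
    obtaining the optimal label sequence of each sentence's probability matrix based on the Viterbi algorithm;
    obtaining the recognized label IDs of the words from the optimal label sequence, and adjusting the parameters of the BERT-CRF model in the named entity recognition model based on the recognized label IDs of the words and the label IDs of the words in the labeled data set.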
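  15. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-readable instructions which, when executed by a processor, implement the following steps of an address recognition method based on named entity recognition:
    receiving Q&A audio data sent by an audio acquisition device;
    performing a speech recognition operation on the Q&A audio data to obtain Q&A text information;
    performing an address text extraction operation on the Q&A text information to obtain address text information;
    inputting the address text information into an Embedding layer for a vector conversion operation to obtain an address text vector;
    inputting the Q&A text information and the address text vector into a CNN model for a feature expansion operation to obtain an expanded text vector;
    inputting the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result;
    outputting the target address result.
  16. The computer-readable storage medium according to claim 15, wherein the step of performing an address text extraction operation on the Q&A text information to obtain address text information specifically comprises:
    performing a word segmentation operation on the Q&A text information to obtain a plurality of words;
    filtering the words against a stop-word list to obtain the filtered address text information.
  17. The computer-readable storage medium according to claim 15, wherein the step of performing an address text extraction operation on the Q&A text information to obtain address text information specifically comprises:
    performing a word segmentation operation on the Q&A text information to obtain a plurality of words;
    filtering the words against a stop-word list to obtain candidate words;
    computing a first word frequency of each candidate word in the Q&A text information;
    reading a local corpus, and computing a second word frequency of each candidate word in the local corpus;
    filtering the candidate words according to the product of the first word frequency and the second word frequency to obtain the address text information.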
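  18. The computer-readable storage medium according to claim 15, wherein, before the step of inputting the address text vector and the expanded text vector into a trained named entity recognition model for an entity recognition operation to obtain a target address result, the steps further comprise:
    obtaining an initial training set and a to-be-recognized data set;
    performing multiple rounds of training on an initial named entity recognition model based on the initial training set and the to-be-recognized data set until the model converges, to obtain the trained named entity recognition model; wherein each round of training comprises: performing supervised training on the initial named entity recognition model using the current round's training set to obtain a supervised-trained initial named entity recognition model; using the supervised-trained initial named entity recognition model to label named entities in the to-be-recognized data set to obtain a weakly labeled to-be-recognized data set; and extracting a subset from the weakly labeled to-be-recognized data set obtained in the current round, and combining the subset with the initial training set to form the training set for the next round.
  19. The computer-readable storage medium according to claim 18, wherein the step of obtaining an initial training set and a to-be-recognized data set specifically comprises:
    reading a local database, and obtaining a pre-labeled data set and an unlabeled data set from the local database;
    splitting the text in the pre-labeled data set into sentences according to a sentence-splitting rule to obtain a plurality of pre-labeled sentences;
    performing a word segmentation operation on each pre-labeled sentence based on a preset vocabulary to obtain pre-labeled sentences each composed of a plurality of words, each word carrying label information;
    querying a word dictionary and a label dictionary to obtain the word ID and label ID of each word, so as to represent the pre-labeled sentences as word IDs and label IDs;
    performing a length unification operation on the pre-labeled sentences to obtain the initial training set;
    splitting the text in the unlabeled data set into sentences according to the sentence-splitting rule to obtain a plurality of unlabeled sentences;
    performing the word segmentation operation on each unlabeled sentence based on the preset vocabulary to obtain unlabeled sentences each composed of a plurality of words;
    converting the unlabeled sentences into a representation in the form of word IDs based on the word dictionary;
    performing the length unification operation on the unlabeled sentences to obtain the to-be-recognized data set.
  20. The computer-readable storage medium according to claim 18, wherein the step of performing multiple rounds of training on an initial named entity recognition model based on the initial training set and the to-be-recognized data set until the model converges, to obtain the trained named entity recognition model, specifically comprises:
    inputting the sentences of the current round's data set into the BERT layer of a BERT-CRF model in the named entity recognition model to obtain encoding vectors of the words in those sentences;
    inputting the encoding vectors into the CRF layer of the BERT-CRF model to obtain, for each sentence of the current round, a probability matrix composed of the probability sequences of all labels corresponding to all words in the sentence;
    obtaining the optimal label sequence of each sentence's probability matrix based on the Viterbi algorithm;
    obtaining the recognized label IDs of the words from the optimal label sequence, and adjusting the parameters of the BERT-CRF model in the named entity recognition model based on the recognized label IDs of the words and the label IDs of the words in the labeled data set.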
PCT/CN2021/090433 2020-12-30 2021-04-28 Address recognition method, apparatus, computer device, and storage medium WO2022142011A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011609093.X 2020-12-30
CN202011609093.XA CN112633003B (zh) 2020-12-30 2020-12-30 Address recognition method, apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022142011A1 true WO2022142011A1 (zh) 2022-07-07

Family

ID=75286641

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090433 WO2022142011A1 (zh) 2020-12-30 2021-04-28 Address recognition method, apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112633003B (zh)
WO (1) WO2022142011A1 (zh)

Also Published As

Publication number Publication date
CN112633003A (zh) 2021-04-09
CN112633003B (zh) 2024-05-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21912773

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21912773

Country of ref document: EP

Kind code of ref document: A1