CN110598066B

CN110598066B - Bank full-name rapid matching method based on word vector expression and cosine similarity

Info

Publication number: CN110598066B
Application number: CN201910851391.0A
Authority: CN
Inventors: 李振; 鲍东岳; 张刚; 尹正
Original assignee: Minsheng Science And Technology Co ltd
Current assignee: Minsheng Science And Technology Co ltd
Priority date: 2019-09-10
Filing date: 2019-09-10
Publication date: 2022-05-10
Anticipated expiration: 2039-09-10
Also published as: CN110598066A

Abstract

The invention provides a rapid bank-name matching method based on word vector representation and cosine similarity. A bank name library is used as the training set, which is trained to obtain a word vector matrix and a training model. The bank name to be matched is then segmented and converted into a word vector. Finally, following the cosine similarity calculation, the word vector of the name to be retrieved is multiplied by the transpose of the word vector matrix, and the bank name is obtained by combining the maximum value of each row of the product matrix with the result of comparing the retrieved name against the training model. To improve speed, the computation is expressed as a matrix multiplication and run in 2 processes simultaneously, finally reaching a speed of 2000 records in 2 s. Because the matrix operation yields, in each row, the cosine similarities between one input and the word vectors of the records in the bank library, it is much faster than using loops.

Description

Bank full-name rapid matching method based on word vector expression and cosine similarity

Technical Field

The invention belongs to the technical field of bank information processing, and particularly relates to a bank full-name rapid matching method based on word vector expression and cosine similarity.

Background

Nowadays, with the rapid growth of small, medium and micro enterprises, banks' corporate (public-to-public) business keeps increasing. It includes enterprise e-banking, unit deposit business, credit business, institutional business, international business, entrusted housing finance, fund clearing, intermediary business, asset recommendation, fund custody and the like. The basic departments and functions inside a bank are savings (private), accounting (corporate) and credit. Accounting is the back office and service department for credit; credit covers the deposit and loan business of units; and all business transactions between units and the bank are carried out through the accounting department. Specifically, corporate business mainly serves customers such as enterprise legal persons and units, and covers cheque, remittance, loan and other services built around corporate accounts. This business suffers from slow large-batch manual retrieval, and the text-similarity matching algorithms currently on the market are slow, so they cannot meet banks' need for fast lookup.

Disclosure of Invention

To solve the existing problems, namely (1) that the bank handles a large number of corporate-business tasks internally and manual retrieval is slow, and (2) that existing text-similarity matching algorithms are slow, the invention provides a rapid bank full-name matching method based on word vector representation and cosine similarity. The method processes a bank full-name library to obtain a training set and trains on it to obtain a word vector matrix and a training model. It then segments and processes the bank full name to be matched. Finally, following the cosine similarity calculation, it multiplies the word vector of the name to be retrieved by the transpose of the word vector matrix, and obtains the bank full name by combining the maximum value of each row of the product matrix with the result of comparing the name to be matched against the training model.

Further, the rapid matching method comprises the following steps:

S1: performing word removal, segmentation and combination processing on the bank full-name library to obtain a training set;

S2: performing word vector processing on the training set to obtain the tf-idf word vector matrix of the training set, normalizing each row, and saving both the tf-idf word vector matrix and the training model;

S3: inputting the bank full name to be matched, performing word removal, segmentation and combination processing on it to obtain a plurality of 2-character phrases, converting the processed bank full name together with the 2-character phrases into a character string, and finally converting the character string into a tf-idf word vector;

S5: multiplying the tf-idf word vector obtained in S3 by the transpose of the training set's tf-idf word vector matrix from S2, and, from the resulting matrix, selecting for each row the bank full name at the position of the maximum value;

S6: comparing the bank full name to be matched against the training model, merging the bank full names from this comparison with those from S5, and outputting the final result.

Further, the word removal, segmentation and combination processing in S1 and S3 is specifically as follows:

Word removal: removing from the bank full name the words that are irrelevant to key information, to reduce the amount of calculation;

Segmentation: segmenting the bank full name from which those words have been removed, to obtain a simplified entry;

Combination: combining the characters of the segmented simplified entry two at a time to obtain a plurality of 2-character phrases;

The combined 2-character phrases and the set of simplified entries in S1 together form the training set.

Further, the words irrelevant to key information that are removed in this processing include, but are not limited to, "limited company", "joint-stock limited company", "bank" and "branch".

Further, the 2-character phrases in S1 and S3 are obtained as follows: any two characters are selected from the simplified entry, and all possible combinations are formed with the characters kept in the order in which they appear in the simplified entry, each combination being one 2-character phrase.
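The word removal, segmentation and combination steps above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, and the Chinese stop-word strings and the sample name 民生银行顺义支行 (Minsheng Bank Shunyi Branch) are assumptions about what the translated text refers to.

```python
from itertools import combinations

# Assumed stop-word strings; the text names these categories only in translation.
STOP_WORDS = ["股份有限公司", "有限公司", "银行", "支行"]


def remove_words(name, stop_words=STOP_WORDS):
    """Word removal: strip strings that carry no key information."""
    # Remove longer strings first so "股份有限公司" does not leave "股份" behind.
    for word in sorted(stop_words, key=len, reverse=True):
        name = name.replace(word, "")
    return name


def segment(entry):
    """Segmentation: split the simplified entry into single characters."""
    return list(entry)


def two_char_phrases(chars):
    """Combination: every pair of characters, kept in their original order."""
    return [a + b for a, b in combinations(chars, 2)]


def preprocess(name):
    """Return the single characters followed by all 2-character phrases."""
    chars = segment(remove_words(name))
    return chars + two_char_phrases(chars)


tokens = preprocess("民生银行顺义支行")
print(tokens)
```

`itertools.combinations` emits pairs in input order, so "民生" is produced but "生民" is not, which matches the requirement that the characters keep their original sequence.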

Further, S2 is specifically as follows: each simplified entry in the training set, together with the 2-character phrases obtained by combining its characters, is converted into a character string, and the character string is then converted into a tf-idf word vector. All the tf-idf word vectors together form an m × n matrix, recorded as tf-idf-train, in which every row vector is normalized so that its modulus equals 1, where m is the total number of samples in the training set and n is the dimension of the word vectors.

Further, the training model in S2 is the specific method for converting text into tf-idf word vectors: tf-idf is computed on the training set to obtain the mapping from each character or phrase to its tf-idf word vector.
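As a rough illustration of S2, the sketch below builds a row-normalized tf-idf matrix in plain Python. The tf and idf definitions used here (relative frequency, and log(N / df)) are assumed textbook forms, since the formulas themselves are not reproduced in this text. The function names and the second training entry are hypothetical.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn the 'training model': a vocabulary index and an idf weight per token."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vocab = {token: i for i, token in enumerate(sorted(doc_freq))}
    # Assumed idf form: log(N / df). Tokens present in every document get weight 0.
    idf = {token: math.log(n_docs / df) for token, df in doc_freq.items()}
    return vocab, idf


def transform(doc, vocab, idf):
    """Convert one tokenized entry into a tf-idf row vector with modulus 1."""
    counts = Counter(token for token in doc if token in vocab)
    total = sum(counts.values())
    row = [0.0] * len(vocab)
    for token, count in counts.items():
        row[vocab[token]] = (count / total) * idf[token]
    norm = math.sqrt(sum(v * v for v in row))
    # An all-zero row (no informative tokens) is returned unchanged.
    return [v / norm for v in row] if norm > 0 else row


# Two tokenized training entries (single characters plus 2-character phrases).
train_docs = [
    ["民", "生", "顺", "义", "民生", "民顺", "民义", "生顺", "生义", "顺义"],
    ["民", "生", "朝", "阳", "民生", "民朝", "民阳", "生朝", "生阳", "朝阳"],
]
vocab, idf = fit_idf(train_docs)
tfidf_train = [transform(doc, vocab, idf) for doc in train_docs]  # m x n matrix
print(len(tfidf_train), len(vocab))
```

With this assumed idf, tokens shared by every entry (here 民, 生 and 民生) receive zero weight, so only the distinguishing characters contribute to the similarity.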

Further, S5 is specifically as follows: the bank full names to be matched are converted into tf-idf word vectors, which form an a × b matrix recorded as tf-idf-test, where a is the number of bank full names to be matched and b is the dimension of the word vectors;

result_cos_sim = tf-idf-test * tf-idf-train.T;

where result_cos_sim is an a × m matrix, and each row of the matrix holds the cosine similarities between one input and the word vectors of the records in the bank library.
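The matrix product and the per-row maximum can be sketched as below. When every row has modulus 1, each dot product in the product matrix equals the cosine similarity of the two rows. Plain Python lists are used only to keep the sketch dependency-free; in practice a numerical library's matrix product would take the place of `matmul_transposed`. The function names, bank names and vectors are hypothetical.

```python
def matmul_transposed(test_rows, train_rows):
    """Compute test @ train.T: one row per input, one column per library record."""
    return [
        [sum(x * y for x, y in zip(query, record)) for record in train_rows]
        for query in test_rows
    ]


def best_matches(similarity, names):
    """For each row, pick the library name at the position of the maximum value."""
    return [names[max(range(len(row)), key=row.__getitem__)] for row in similarity]


# Hypothetical unit-length vectors standing in for tf-idf-train and tf-idf-test.
names = ["Bank A", "Bank B", "Bank C"]
tfidf_train = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
tfidf_test = [[0.8, 0.6, 0.0], [0.0, 1.0, 0.0]]

result_cos_sim = matmul_transposed(tfidf_test, tfidf_train)  # a x m matrix
print(best_matches(result_cos_sim, names))
```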

Further, tf-idf is calculated as follows:

tf-idf = tf × idf, where:

tf calculation formula:

idf calculation formula:

The larger the tf-idf value, the greater the probability that the word is a keyword.

The invention has the following beneficial effects:

1) To increase speed, the computation is converted into a matrix multiplication and run in 2 processes simultaneously, finally reaching a speed of 2000 records in 2 s;

2) Each row of the result holds the cosine similarities between one input and the word vectors of the records in the bank library; obtaining them by matrix operation is much faster than using loops;

3) The input bank full names, which may contain errors, are divided into two batches and run in two processes, doubling the speed.
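The two-batch, two-process arrangement can be sketched as follows, assuming Python's standard `concurrent.futures` module. The toy library, the function names and the split rule are hypothetical. The actual speed-up depends on hardware and batch size, so the doubling stated above should be read as the patent's claim rather than a guarantee.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical toy library: bank names paired with unit-length vectors.
LIBRARY = [("Bank A", [1.0, 0.0]), ("Bank B", [0.0, 1.0])]


def match_batch(batch):
    """Return the best-matching library name for every query vector in a batch."""
    results = []
    for query in batch:
        scores = [sum(q * t for q, t in zip(query, vec)) for _, vec in LIBRARY]
        results.append(LIBRARY[scores.index(max(scores))][0])
    return results


def split_in_two(items):
    """Split the inputs into two batches of near-equal size, preserving order."""
    middle = (len(items) + 1) // 2
    return items[:middle], items[middle:]


def match_in_two_processes(queries):
    """Match both batches in separate processes and rejoin them in input order."""
    first, second = split_in_two(queries)
    with ProcessPoolExecutor(max_workers=2) as pool:
        parts = pool.map(match_batch, [first, second])
        return [name for part in parts for name in part]
```

`match_in_two_processes` is meant to be called from a script's entry point (under an `if __name__ == "__main__":` guard), which process pools require on platforms that start worker processes by spawning.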

Drawings

FIG. 1 is a detailed flow chart of the training steps in the method of the present invention;

FIG. 2 is a diagram of the steps for matching a bank full name in the method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it. On the contrary, the invention is intended to cover alternatives, modifications and equivalents which may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details.

The invention is further described with reference to the following figures and specific examples, which are not intended to be limiting. The following are preferred examples of the present invention:

As shown in FIGS. 1-2, the present invention provides a rapid bank full-name matching method based on word vector representation and cosine similarity. The rapid matching method includes:

The word removal, segmentation and combination processing in S1 and S3 is specifically as follows:

Word removal: removing from the bank full name the characters that are irrelevant to key information, to reduce the amount of calculation;

The characters irrelevant to key information that are removed in this processing include, but are not limited to, "limited company", "joint-stock limited company", "bank" and "branch";

The 2-character phrases in S1 and S3 are obtained as follows: any two characters are selected from the simplified entry, and all possible combinations are formed with the characters kept in the order in which they appear in the simplified entry, each combination being one 2-character phrase;

S2 is specifically as follows: each simplified entry in the training set, together with the 2-character phrases obtained by combining its characters, is converted into a character string, and the character string is then converted into a tf-idf word vector. All the tf-idf word vectors together form an m × n matrix, recorded as tf-idf-train, in which every row vector is normalized so that its modulus equals 1, where m is the total number of samples in the training set and n is the dimension of the word vectors;

The training model in S2 is the specific method for converting text into tf-idf word vectors: tf-idf is computed on the training set to obtain the mapping from each character or phrase to its tf-idf word vector;

S5 is specifically as follows: the bank full names to be matched are converted into tf-idf word vectors, which form an a × b matrix recorded as tf-idf-test, where a is the number of bank full names to be matched and b is the dimension of the word vectors;

result_cos_sim = tf-idf-test * tf-idf-train.T;

where result_cos_sim is an a × m matrix, and each row of the matrix holds the cosine similarities between one input and the word vectors of the records in the bank library;

Further, tf-idf is calculated as follows:

tf-idf = tf × idf, where:

tf calculation formula:

idf calculation formula:

The larger the tf-idf value, the greater the probability that the word is a keyword.

The invention mainly addresses the problem of matching a batch of manually entered bank names that may contain errors (typically about 2000 at a time) to the correct bank full names. It converts all texts into word vectors and then calculates the cosine similarity between the word vectors; to improve speed, this calculation is expressed as a matrix multiplication and run in 2 processes simultaneously. The process can ultimately reach a speed of 2000 records in 2 s.

The formula used in the present invention is as follows:

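1. Cosine similarity calculation formula:

$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}$$

The larger this value, the closer the vectors $\mathbf{x}$ and $\mathbf{y}$ are.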

2. tf-idf calculation:

tf-idf = tf × idf, where:

tf calculation formula:

idf calculation formula:

The larger the tf-idf value, the greater the probability that the word is a keyword.
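The tf and idf formulas themselves are not reproduced in this text. For orientation only, a commonly used textbook form is shown below. It is an assumption and may differ from the patent's exact expressions, for example in smoothing or in the base of the logarithm. Here $n_{t,d}$ is the number of times term $t$ occurs in entry $d$, $N$ is the total number of entries in the library, and $\mathrm{df}_t$ is the number of entries containing $t$; all three symbols are introduced for this illustration.

```latex
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}_t}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t)
```

Under this form, a term that appears in every entry has $\mathrm{idf} = 0$ and therefore contributes nothing to the similarity, while a term confined to few entries is weighted heavily.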

The detailed procedure of an embodiment is explained below, taking "Minsheng Bank Shunyi Branch" as an example:

1. Words irrelevant to key information (such as "limited company", "joint-stock limited company", "bank" and "branch") are removed from the bank full-name library to reduce the amount of calculation. Example: "Minsheng Bank Shunyi Branch" becomes "Minsheng Shunyi".

2. Segmentation: each text is split character by character, for example: "Minsheng Shunyi" becomes "Min, Sheng, Shun, Yi".

3. Constructing new 2-character phrases: people often abbreviate bank names, and character order must be taken into account, because the same two characters in a different order do not form the same abbreviation. The characters of the bank name are therefore combined pairwise, in their original order, into phrases. For example: "Minsheng Shunyi" becomes "MinSheng, MinShun, MinYi, ShengShun, ShengYi, ShunYi".

4. After the first three steps, the result for "Minsheng Bank Shunyi Branch" is: "Min, Sheng, Shun, Yi, MinSheng, MinShun, MinYi, ShengShun, ShengYi, ShunYi".

5. Each record in the bank library is converted into a character string after the first three steps and then into a tf-idf word vector. All the word vectors together form a matrix (137907 × 305628), recorded as tf-idf-train, in which every row vector is normalized so that its modulus equals 1 (the preceding steps are shown in FIG. 1).

6. The input bank full names are converted into tf-idf vectors in the same way. For example, if 2000 names are input, the resulting word vector matrix (2000 × 305628) is recorded as tf-idf-test, and result_cos_sim = tf-idf-test * tf-idf-train.T. By the cosine similarity formula (formula 1), result_cos_sim is a (2000 × 137907) matrix in which each row holds the cosine similarities between one input and the word vectors of the 137907 records in the bank library; computing it as a matrix operation is much faster than using loops.

7. For each row, the bank at the position of the maximum cosine similarity is taken as the match.

8. The input bank full names, which may contain errors, are divided into two batches and run in two processes at the same time, doubling the speed (steps 6-8 are shown in FIG. 2).

The above-described embodiment is only one of the preferred embodiments of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.

Claims

1. A rapid bank full-name matching method based on word vector representation and cosine similarity, characterized in that the rapid matching method processes a bank full-name library to obtain a training set, trains on the training set to obtain a word vector matrix and a training model, then performs segmentation and word vector processing on the bank full name to be matched, and finally, following the cosine similarity calculation, multiplies the word vector of the name to be retrieved by the transpose of the word vector matrix, the bank full name being obtained by combining the maximum value of each row of the product matrix with the result of comparing the name to be matched against the training model; the rapid matching method comprises the following steps:

S4: multiplying the tf-idf word vector obtained in step S3 by the transpose of the training set's tf-idf word vector matrix from step S2, and, from the resulting matrix, selecting for each row the bank at the position of the maximum value as the final output result.

2. The method according to claim 1, wherein the word removal, segmentation and combination processing in S1 and S3 is specifically:

the combined 2-character phrases and the set of simplified entries in S1 together form the training set.

3. The method of claim 2, wherein the words irrelevant to key information that are removed in the processing include, but are not limited to, "limited company", "joint-stock limited company", "bank" and "branch".

4. The method according to claim 2, wherein the 2-character phrases in S1 and S3 are obtained as follows: any two characters are selected from the simplified entry, and all possible combinations are formed with the characters kept in the order in which they appear in the simplified entry, each combination being one 2-character phrase.

5. The method according to claim 2, wherein S2 is specifically: converting each simplified entry in the training set, together with the 2-character phrases obtained by combining its characters, into a character string, then converting the character string into a tf-idf word vector, all the tf-idf word vectors together forming an m × n matrix recorded as tf-idf-train, in which every row vector is normalized so that its modulus equals 1, where m is the total number of samples in the training set and n is the dimension of the word vectors.

6. The method as claimed in claim 5, wherein the training model in S2 is the specific method for converting text into tf-idf word vectors, tf-idf being computed on the training set to obtain the mapping from each character or phrase to its tf-idf word vector.

7. The method according to claim 5, wherein S4 is specifically: converting the bank full names to be matched into tf-idf word vectors, which form an a × b matrix recorded as tf-idf-test, where a is the number of bank full names to be matched and b is the dimension of the word vectors;

result_cos_sim = tf-idf-test * tf-idf-train.T;

where tf-idf-train.T is the transpose of tf-idf-train.

8. The method of claim 6, wherein tf-idf is calculated as follows:

tf-idf = tf × idf, where:

tf calculation formula:

idf calculation formula:

the larger the tf-idf value, the greater the probability that the word is a keyword.