CN110516040A - Method, device and computer storage medium for semantic similarity comparison between texts - Google Patents

Method, device and computer storage medium for semantic similarity comparison between texts

Info

Publication number
CN110516040A
CN110516040A
Authority
CN
China
Prior art keywords
text
vector
processing result
word segmentation
term vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910749686.7A
Other languages
Chinese (zh)
Other versions
CN110516040B (en)
Inventor
祝文博
雷欣
李志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Go Out And Ask (wuhan) Information Technology Co Ltd
Original Assignee
Go Out And Ask (wuhan) Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Go Out And Ask (wuhan) Information Technology Co Ltd
Priority to CN201910749686.7A
Publication of CN110516040A
Application granted
Publication of CN110516040B
Legal status: Active (granted)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method, a device and a computer storage medium for semantic similarity comparison between texts, comprising: determining a first text and a second text; performing word segmentation on the first text and the second text respectively to obtain a corresponding first word segmentation result and second word segmentation result; converting the first word segmentation result and the second word segmentation result into vectors to obtain a corresponding first word vector and second word vector; mapping the first word vector and the second word vector respectively to a high-dimensional space to obtain a corresponding first mapped vector and second mapped vector; and performing a similarity comparison on the first mapped vector and the second mapped vector to obtain a comparison result characterizing the semantic similarity between the first text and the second text.

Description

Method, device and computer storage medium for semantic similarity comparison between texts
Technical field
The present invention relates to the field of natural language processing technology, and in particular to a method, a device and a computer storage medium for semantic similarity comparison between texts.
Background art
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. In natural language processing, computing the similarity between texts is a basic text-processing operation, and the precision of this text similarity, as a preceding operation, directly affects the result of the final operation.
As unstructured data, text is generally treated in computation as an object of unbounded dimensionality, so a structured dimensionality reduction has to be performed before the similarity between texts can be computed. For text dimensionality reduction, commonly used approaches reduce the dimensionality according to word frequency statistics or according to the importance value of words. However, in such dimensionality-reduced comparison of texts, the accuracy of the comparison result is not satisfactory.
Summary of the invention
The present invention provides a method, a device and a computer storage medium for semantic similarity comparison between texts, which can improve the precision of the semantic similarity comparison result between texts.
One aspect of the present invention provides a method for semantic similarity comparison between texts, comprising: determining a first text and a second text; performing word segmentation on the first text and the second text respectively to obtain a corresponding first word segmentation result and second word segmentation result; converting the first word segmentation result and the second word segmentation result into vectors to obtain a corresponding first word vector and second word vector; mapping the first word vector and the second word vector respectively to a high-dimensional space to obtain a corresponding first mapped vector and second mapped vector; and performing a similarity comparison on the first mapped vector and the second mapped vector to obtain a comparison result characterizing the semantic similarity between the first text and the second text.
In one embodiment, converting the first word segmentation result and the second word segmentation result into vectors comprises: converting the first word segmentation result and the second word segmentation result into vectors using the correspondence between segmented words and their term frequency-inverse document frequency (TF-IDF) values.
In one embodiment, mapping the first word vector and the second word vector respectively to the high-dimensional space comprises: determining the target domain corresponding to the textual content of the first text and the second text; obtaining corpus samples of the target domain; training a model with the corpus samples of the target domain to obtain a mapping model corresponding to the target domain; and mapping the first word vector and the second word vector to the high-dimensional space using the mapping model of the target domain.
In one embodiment, the method further comprises: computing a loss function of the mapping model from the corpus samples of the target domain; and updating the mapping model using the result of the loss function.
In one embodiment, the comparison result is a similarity comparison result; correspondingly, performing the similarity comparison on the first mapped vector and the second mapped vector comprises: computing the cosine of the angle between the first mapped vector and the second mapped vector in the high-dimensional space, the cosine value corresponding to the similarity comparison result.
Another aspect of the present invention provides a device for semantic similarity comparison between texts, comprising: a determining module for determining a first text and a second text; a word segmentation module for performing word segmentation on the first text and the second text respectively to obtain a corresponding first word segmentation result and second word segmentation result; a conversion module for converting the first word segmentation result and the second word segmentation result into vectors to obtain a corresponding first word vector and second word vector; a mapping module for mapping the first word vector and the second word vector respectively to a high-dimensional space to obtain a corresponding first mapped vector and second mapped vector; and a comparison module for performing a similarity comparison on the first mapped vector and the second mapped vector to obtain a comparison result characterizing the semantic similarity between the first text and the second text.
In one embodiment, the word segmentation module is specifically configured to convert the first word segmentation result and the second word segmentation result into vectors using the correspondence between segmented words and their TF-IDF values.
In one embodiment, the mapping module comprises: a determining submodule for determining the target domain corresponding to the textual content of the first text and the second text; an obtaining submodule for obtaining corpus samples of the target domain; a training submodule for training a model with the corpus samples of the target domain to obtain a mapping model corresponding to the target domain; and a mapping submodule for mapping the first word vector and the second word vector to the high-dimensional space using the mapping model of the target domain.
In one embodiment, the device further comprises: a computing module for computing a loss function of the mapping model from the corpus samples of the target domain; and an updating module for updating the mapping model using the result of the loss function.
In one embodiment, the comparison result is a similarity comparison result; correspondingly, the comparison module is specifically configured to compute the cosine of the angle between the first mapped vector and the second mapped vector in the high-dimensional space, the cosine value corresponding to the similarity comparison result.
Another aspect of the present invention provides a computer storage medium storing computer-executable instructions which, when executed, perform the method for semantic similarity comparison between texts described in any one of the above embodiments.
With the method, device and computer storage medium for semantic similarity comparison between texts of the embodiments of the present invention, the deep semantics of texts can be compared for similarity. Compared with deciding whether texts are semantically similar by comparing only their shallow semantics, the semantic similarity comparison method of the embodiments of the present invention compares the similarity between texts more precisely, thereby improving the accuracy of the device's semantic comparison.
Brief description of the drawings
Fig. 1 shows a flow diagram of a method for semantic similarity comparison between texts according to an embodiment of the present invention;
Fig. 2 shows a flow diagram of the model training method in the comparison method according to an embodiment of the present invention;
Fig. 3 shows a flow diagram of the model updating method in the comparison method according to an embodiment of the present invention;
Fig. 4 shows a module diagram of a device for semantic similarity comparison between texts according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative work shall fall within the protection scope of the present invention.
Fig. 1 shows a flow diagram of the method for semantic similarity comparison between texts according to an embodiment of the present invention.
Referring to Fig. 1, one aspect of the embodiments of the present invention provides a method for semantic similarity comparison between texts, comprising: step 101, determining a first text and a second text; step 102, performing word segmentation on the first text and the second text respectively to obtain a corresponding first word segmentation result and second word segmentation result; step 103, converting the first word segmentation result and the second word segmentation result into vectors to obtain a corresponding first word vector and second word vector; step 104, mapping the first word vector and the second word vector respectively to a high-dimensional space to obtain a corresponding first mapped vector and second mapped vector; step 105, performing a similarity comparison on the first mapped vector and the second mapped vector to obtain a comparison result characterizing the semantic similarity between the first text and the second text.
With the semantic similarity comparison method provided by the embodiments of the present invention, the deep semantics of texts can be compared for similarity. Compared with deciding whether texts are semantically similar by comparing only their shallow semantics, the semantic similarity comparison method of the embodiments of the present invention compares the similarity between texts more precisely and improves the accuracy of the device's semantic comparison.
In the semantic comparison process, the first text and the second text are determined first. The way the first text and the second text are determined is not restricted: they may be determined by speech recognition, by electronic input, or by handwriting recognition. Specifically, when speech recognition is used, the speech content of the user is obtained first and converted into text corresponding to that speech content; for example, the speech content asks to compare "the weather is fine today" with "the weather is pretty good today", and the device recognizes the text information to obtain a first text corresponding to "the weather is fine today" and a second text corresponding to "the weather is pretty good today". When electronic input is used to determine the first text and the second text, a first dialog box for entering the first text and a second dialog box for entering the second text are displayed in the display interface; the text content entered in the first dialog box is determined as the first text, and the text content entered in the second dialog box is determined as the second text.
After the first text and the second text are determined, word segmentation is performed on each of them. The word segmentation method is not limited here: the device may perform word segmentation on the first text and the second text using a dictionary-based segmentation algorithm or a statistics-based learning algorithm; it should be noted that the first text and the second text are segmented in the same way. For example, when the first text is "the remittance can only be found in the account several working days after the bank transfer" and the second text is "when can the bank remittance be found", segmenting the first text yields the first word segmentation result "bank, remittance, several, working days, after, only, can, find, arrive in account", and segmenting the second text in the same way yields the second word segmentation result "bank, remittance, when, able to, find". The word segmentation here is preferably performed by a segmentation model corresponding to the domain of the first text and the second text.
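As a purely illustrative sketch (not part of the claimed method), the dictionary-based segmentation step could look as follows in Python, assuming the open-source jieba segmenter; the sample sentences are hypothetical stand-ins for the first and second texts.

```python
# Minimal word segmentation sketch, assuming the jieba dictionary-based segmenter.
import jieba

first_text = "银行汇款几个工作日后才能查到到账"   # illustrative first text
second_text = "银行汇款什么时候能够查到"           # illustrative second text

# Both texts must be segmented with the same method.
first_result = jieba.lcut(first_text)
second_result = jieba.lcut(second_text)

print(first_result)   # first word segmentation result, e.g. ['银行', '汇款', '几个', ...]
print(second_result)  # second word segmentation result, e.g. ['银行', '汇款', '什么时候', ...]
```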
After the first word segmentation result and the second word segmentation result are obtained, step 103 is performed: the first word segmentation result and the second word segmentation result are converted into vectors to obtain the corresponding first word vector and second word vector. The embodiments of the present invention do not limit the specific way of vector conversion. In one specific vector conversion, the first word segmentation result contains 9 words in total, so the corresponding first word vector is 9, and the second word segmentation result contains 5 words in total, so the corresponding second word vector is 5. In another vector conversion, the first word segmentation result and the second word segmentation result are first merged to obtain a total bag of words containing the first word segmentation result and the second word segmentation result; the words appearing in the total bag of words are "bank, remittance, several, working days, after, only, can, find, arrive in account, when, able to", 11 words in all, so the length of the bag-of-words vector corresponding to the total bag of words is 11; the first word segmentation result contains 9 words, so the corresponding first word vector is 9/11, and the second word segmentation result contains 5 words, so the corresponding second word vector is 5/11. Here, the first word vector and the second word vector are two-dimensional (plane) vectors.
After the first word vector and the second word vector are obtained, they are mapped respectively to a high-dimensional space to obtain the corresponding first mapped vector and second mapped vector. The mapping of the vectors is realized here by a mapping model, and the resulting first mapped vector and second mapped vector are high-dimensional vectors. Specifically, the mapping model is a sequence semantic embedding (SSE) model. The SSE model is a sentence comparison model: given the first word vector and the second word vector, the SSE model maps the first word vector through an encoder A to obtain the first mapped vector, and maps the second word vector through an encoder B to obtain the second mapped vector. It should be added that encoder A and encoder B are trained in advance on a corpus of the domain of the first text and the second text; the corpus contains a number of labeled similar sentences and a number of labeled dissimilar sentences. Through training, encoder A and encoder B adapt to the semantics of the domain corpus of the first text and the second text, so that encoder A and encoder B can map the plane vectors of sentences of that domain to the high-dimensional space.
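A minimal sketch of the dual-encoder mapping described above, assuming a PyTorch implementation; the feed-forward architecture, layer sizes and variable names are illustrative assumptions rather than a structure specified by the patent.

```python
# Sketch of an SSE-style pair of encoders (A and B) that map low-dimensional
# word vectors into a shared high-dimensional space (assumed architecture).
import torch
import torch.nn as nn

class SSEEncoder(nn.Module):
    def __init__(self, in_dim: int = 2, hidden_dim: int = 128, out_dim: int = 512):
        super().__init__()
        # Simple feed-forward projection standing in for the encoder; the
        # patent does not fix a particular network structure.
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

encoder_a = SSEEncoder()  # maps the first word vector
encoder_b = SSEEncoder()  # maps the second word vector

first_word_vector = torch.tensor([[0.82, 0.24]])   # illustrative plane vector
second_word_vector = torch.tensor([[0.45, 0.18]])  # illustrative plane vector

first_mapped_vector = encoder_a(first_word_vector)    # high-dimensional mapped vector
second_mapped_vector = encoder_b(second_word_vector)  # high-dimensional mapped vector
```

Training encoder A and encoder B on labeled similar and dissimilar sentence pairs of the target domain, as described above, is what makes this mapping domain-specific.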
After the plane vectors are mapped to the high-dimensional space, a similarity comparison is performed on the first mapped vector and the second mapped vector to obtain a comparison result characterizing the semantic similarity between the first text and the second text. The embodiments of the present invention do not limit the method of similarity comparison; through the similarity comparison, the similarity between sentences is compared at the level of deep semantics, which improves the accuracy of the device's semantic comparison.
To facilitate understanding of the above embodiment, a specific scenario embodiment is given below. In this scenario embodiment, a display panel and a device for comparing the semantic similarity between texts are provided, and the display panel and the device are communicatively connected.
First, the device obtains, through the display panel, the first text and the second text to be compared, where the first text is "the remittance can only be found in the account several working days after the bank transfer" and the second text is "when can the bank remittance be found".
Then the device performs word segmentation on the first text to obtain the first word segmentation result "bank, remittance, several, working days, after, only, can, find, arrive in account"; the device performs word segmentation on the second text in the same way to obtain the second word segmentation result "bank, remittance, when, able to, find". The first word segmentation result is converted into a vector to obtain the first word vector corresponding to the first word segmentation result, and the second word segmentation result is converted into a vector in the same way to obtain the second word vector. The first word vector and the second word vector are plane vectors.
Next, the cosine similarity of the first word vector and the second word vector is computed to obtain a comparison result A characterizing the semantic similarity between the first text and the second text.
After that, the first word vector is mapped to the high-dimensional space by the encoder A of an SSE model trained in advance on a banking-domain corpus to obtain the first mapped vector, and the second word vector is mapped to the high-dimensional space by the encoder B of the SSE model trained in advance on the banking-domain corpus to obtain the second mapped vector.
Finally, the cosine similarity of the first mapped vector and the second mapped vector is computed to obtain a comparison result B characterizing the semantic similarity between the first text and the second text.
Comparing result A with result B, result B indicates a larger similarity between the first text and the second text, while result A indicates a smaller similarity.
In the embodiments of the present invention, step 103 comprises: converting the first word segmentation result and the second word segmentation result into vectors using the correspondence between segmented words and their term frequency-inverse document frequency (TF-IDF) values, to obtain the corresponding first word vector and second word vector.
To further improve the accuracy of the device's semantic comparison, when converting the first word segmentation result and the second word segmentation result into vectors, the embodiments of the present invention preferably use the TF-IDF value of each word obtained with the term frequency-inverse document frequency technique to obtain the first word vector and the second word vector.
Specifically, after the first word segmentation result and the second word segmentation result are obtained, they are merged to obtain the total bag of words and the corresponding total bag-of-words vector. Within the total bag of words, the TF-IDF value of each word is obtained using the TF-IDF technique; the TF-IDF value is used to determine the importance of a word in the sentence, and the first word vector and the second word vector are obtained from the TF-IDF values. Compared with a first word vector and a second word vector obtained directly without evaluating word importance, the first word vector and second word vector obtained from TF-IDF values evaluate text similarity more accurately at the level of shallow semantics; in turn, the first mapped vector and second mapped vector obtained by mapping the first word vector and second word vector produced with the term frequency-inverse document frequency technique carry more semantic information, making the comparison result of semantic similarity more accurate.
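The TF-IDF weighting over the total bag of words can be sketched as follows; this is a generic illustration of the standard TF-IDF formula with English placeholder tokens and a common smoothing convention, not the patent's specific implementation.

```python
# Generic TF-IDF sketch over the total bag of words (illustrative only).
import math
from collections import Counter

first_result = ["bank", "remittance", "several", "working_days", "after",
                "only", "can", "find", "arrive_in_account"]
second_result = ["bank", "remittance", "when", "able_to", "find"]

documents = [first_result, second_result]
total_bag_of_words = sorted(set(first_result) | set(second_result))  # 11 words

def tfidf_vector(tokens, documents, vocabulary):
    counts = Counter(tokens)
    n_docs = len(documents)
    vector = []
    for word in vocabulary:
        tf = counts[word] / len(tokens)                  # term frequency
        df = sum(1 for doc in documents if word in doc)  # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1      # smoothed inverse document frequency
        vector.append(tf * idf)
    return vector

first_word_vector = tfidf_vector(first_result, documents, total_bag_of_words)
second_word_vector = tfidf_vector(second_result, documents, total_bag_of_words)
```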
Fig. 2 shows a flow diagram of the model training method in the comparison method according to an embodiment of the present invention.
Referring to Fig. 2, in the embodiments of the present invention, step 104 comprises: step 1041, determining the target domain corresponding to the textual content of the first text and the second text; step 1042, obtaining corpus samples of the target domain; step 1043, training a model with the corpus samples of the target domain to obtain a mapping model corresponding to the target domain; step 1044, mapping the first word vector and the second word vector to the high-dimensional space using the mapping model of the target domain; wherein the corpus samples of the target domain may or may not include the first text and/or the second text.
To improve the accuracy and soundness of the vector mapping process, the vector mapping process of the embodiments of the present invention is supervised to a certain extent, so that the vector mapping of the present invention is targeted at the first text and the second text. The embodiments of the present invention train the SSE model with corpus samples that belong to the same target domain as the first text and the second text, and the SSE model obtains its encoders through this corpus training. It should be noted that the purpose of steps 1041 to 1043 is to make the training of the mapping model targeted; the training process may be carried out before the first text and the second text are obtained, after the first text and the second text are obtained, or while the first text and the second text are being obtained. That is, there is no temporal ordering constraint between any of steps 1041 to 1043 and steps 101 to 103 or step 105.
Specifically, the embodiments of the present invention first determine the target domain corresponding to the textual content of the first text and the second text. It should be noted that when the target domain can be determined before the first text and the second text are input, this determination may be made before the first text and the second text are obtained. For example, assuming the method is used in the banking business, the target domain corresponding to the first text and the second text can be determined to be the banking domain before their specific content is obtained, and a mapping model for mapping the first word vector and the second word vector in a targeted way can be obtained by training the model with corpus samples of the banking domain.
Likewise, when the first text and the second text are obtained first, the embodiments of the present invention may determine the corresponding target domain by analyzing the content of the first text and the second text, then obtain corpus samples of the target domain by information crawling, train the model, and obtain the mapping model for mapping the first word vector and the second word vector in a targeted way.
It should further be added that when the first text and/or the second text are not consistent with the corpus samples used for model training, the first text and/or the second text may, after the device has computed the similarity comparison result between the texts, be used as corpus samples to update the mapping model; the updated mapping model is then used as the mapping model for the next semantic comparison between texts.
Fig. 3 shows a flow diagram of the model updating method in the comparison method according to an embodiment of the present invention.
In the embodiments of the present invention, after step 1043, in which a model is trained with the corpus samples of the target domain to obtain a mapping model corresponding to the target domain, the method further comprises: step 1045, computing the loss function of the mapping model from the corpus samples of the target domain; step 1046, updating the mapping model using the result of the loss function.
Specifically, the mapping model is an SSE model comprising encoder A and encoder B. When the SSE model is updated by training, the corpus samples are a number of labeled similar sentences and a number of labeled dissimilar sentences. The loss function of the SSE model is defined as the cross entropy; the cross entropy is computed from the corpus samples, and the computed cross entropy is then used to update encoder A and encoder B of the SSE model, so that encoder A and encoder B adapt to the target domain of the training corpus samples and can map the plane vectors corresponding to texts of the target domain to the high-dimensional space, forming high-dimensional vectors. It should be noted that steps 1045 and 1046 have no dependence on step 1044 and no temporal ordering with it: after step 1043 is completed, step 1044 may be executed, step 1045 may be executed, or steps 1044 and 1045 may be executed at the same time.
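A minimal sketch, assuming PyTorch and the encoders sketched earlier, of how a cross-entropy loss over labeled similar and dissimilar pairs could drive one update of encoder A and encoder B; the cosine-similarity logit, the scaling factor and the optimizer choice are illustrative assumptions, not details fixed by the patent.

```python
# Assumed training step: cross entropy over labeled similar / dissimilar pairs.
import torch
import torch.nn as nn

def training_step(encoder_a, encoder_b, optimizer, first_vectors, second_vectors, labels):
    """One update of the SSE encoders.
    first_vectors / second_vectors: (batch, in_dim) plane vectors of the corpus pairs;
    labels: 1.0 for labeled similar pairs, 0.0 for labeled dissimilar pairs."""
    bce = nn.BCEWithLogitsLoss()                  # binary cross entropy
    mapped_a = encoder_a(first_vectors)           # high-dimensional mapped vectors
    mapped_b = encoder_b(second_vectors)
    # Scaled cosine similarity of each pair is used as the logit (an assumption).
    logits = nn.functional.cosine_similarity(mapped_a, mapped_b) * 5.0
    loss = bce(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: optimizer = torch.optim.Adam(
#     list(encoder_a.parameters()) + list(encoder_b.parameters()), lr=1e-3)
```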
In the embodiments of the present invention, the comparison result is a similarity comparison result; correspondingly, step 105 comprises: computing the cosine of the angle between the first mapped vector and the second mapped vector in the high-dimensional space, the cosine value corresponding to the similarity comparison result.
The embodiments of the present invention do not limit the way the similarity is compared. The comparison result characterizing the semantic similarity between the first text and the second text may be a similarity comparison result or a dissimilarity comparison result, and the comparison result is output as visualized data, the visualized data being any one or more of text and images. Preferably, the comparison result is a similarity comparison result obtained through cosine similarity. By mapping the plane vectors to the high-dimensional space and comparing cosine similarity in the high-dimensional space, more information is carried than in the two-dimensional space, the comparison result is more diversified, and the accuracy is high.
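The cosine-of-angle comparison in the high-dimensional space reduces to the standard cosine similarity formula; a short sketch with made-up mapped vectors is given below.

```python
# Cosine of the angle between the two mapped vectors (standard formula).
import numpy as np

def cosine_of_angle(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

first_mapped_vector = np.random.rand(512)   # illustrative high-dimensional vector
second_mapped_vector = np.random.rand(512)  # illustrative high-dimensional vector

similarity_comparison_result = cosine_of_angle(first_mapped_vector, second_mapped_vector)
print(similarity_comparison_result)  # cosine values lie in [-1, 1]; closer to 1 means more similar
```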
To facilitate understanding of the above embodiments, another scenario embodiment is given below. In this scenario embodiment, a display panel and a device for comparing the semantic similarity between texts are provided, and the display panel and the device are communicatively connected.
First, the device obtains, through the display panel, the first text and the second text to be compared, where the first text is "the remittance can only be found in the account several working days after the bank transfer" and the second text is "when can the bank remittance be found".
Then the device performs word segmentation on the first text to obtain the first word segmentation result "bank, remittance, several, working days, after, only, can, find, arrive in account"; the device performs word segmentation on the second text in the same way to obtain the second word segmentation result "bank, remittance, when, able to, find". The first word segmentation result and the second word segmentation result are merged to obtain the total bag of words and the bag-of-words vector corresponding to the total bag of words, and the first word vector and the second word vector corresponding to the TF-IDF values are computed from the bag-of-words vector of the total bag of words by the TF-IDF technique.
Next, the cosine similarity of the first word vector and the second word vector is computed; the cosine similarity value characterizing the semantic similarity between the first text and the second text is 0.24, which is taken as comparison result A.
After that, the first word vector is mapped to the high-dimensional space by the encoder A of an SSE model trained in advance on a banking-domain corpus to obtain the first mapped vector, and the second word vector is mapped to the high-dimensional space by the encoder B of the SSE model trained in advance on the banking-domain corpus to obtain the second mapped vector.
Finally, the cosine similarity of the first mapped vector and the second mapped vector is computed; the cosine similarity value characterizing the semantics between the first text and the second text is 0.93, which is taken as comparison result B.
By combining the TF-IDF technique with the SSE model, this method provides an effective and concise way of measuring deep semantics.
Fig. 4 shows a module diagram of the device for semantic similarity comparison between texts according to an embodiment of the present invention.
Referring to Fig. 4, another aspect of the embodiments of the present invention provides a device for semantic similarity comparison between texts, comprising: a determining module 401 for determining a first text and a second text; a word segmentation module 402 for performing word segmentation on the first text and the second text respectively to obtain a corresponding first word segmentation result and second word segmentation result; a conversion module 403 for converting the first word segmentation result and the second word segmentation result into vectors to obtain a corresponding first word vector and second word vector; a mapping module 404 for mapping the first word vector and the second word vector respectively to a high-dimensional space to obtain a corresponding first mapped vector and second mapped vector; and a comparison module 405 for performing a similarity comparison on the first mapped vector and the second mapped vector to obtain a comparison result characterizing the semantic similarity between the first text and the second text.
In the embodiments of the present invention, the word segmentation module 402 is specifically configured to convert the first word segmentation result and the second word segmentation result into vectors using the correspondence between segmented words and their TF-IDF values.
In the embodiments of the present invention, the mapping module 404 comprises: a determining submodule 4041 for determining the target domain corresponding to the textual content of the first text and the second text; an obtaining submodule 4042 for obtaining corpus samples of the target domain; a training submodule 4043 for training a model with the corpus samples of the target domain to obtain a mapping model corresponding to the target domain; and a mapping submodule 4044 for mapping the first word vector and the second word vector to the high-dimensional space using the mapping model of the target domain; wherein the corpus samples of the target domain may or may not include the first text and/or the second text.
In the embodiments of the present invention, the device further comprises: a computing module 406 for computing the loss function of the mapping model from the corpus samples of the target domain; and an updating module 407 for updating the mapping model using the result of the loss function.
In the embodiments of the present invention, the comparison result is a similarity comparison result; correspondingly, the comparison module 405 is configured to compute the cosine of the angle between the first mapped vector and the second mapped vector in the high-dimensional space, the cosine value corresponding to the similarity comparison result.
Another aspect of the embodiments of the present invention provides a computer storage medium storing computer-executable instructions which, when executed, perform the method for semantic similarity comparison between texts of any one of the above embodiments.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a particular feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. Moreover, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, where there is no mutual contradiction, those skilled in the art may combine the features of different embodiments or examples described in this specification.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more, unless otherwise clearly and specifically limited.
The above is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that those familiar with the technical field can readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for semantic similarity comparison between texts, characterized by comprising:
determining a first text and a second text;
performing word segmentation on the first text and the second text respectively to obtain a corresponding first word segmentation result and second word segmentation result;
converting the first word segmentation result and the second word segmentation result into vectors to obtain a corresponding first word vector and second word vector;
mapping the first word vector and the second word vector respectively to a high-dimensional space to obtain a corresponding first mapped vector and second mapped vector;
performing a similarity comparison on the first mapped vector and the second mapped vector to obtain a comparison result for characterizing the semantic similarity between the first text and the second text.
2. The method according to claim 1, characterized in that converting the first word segmentation result and the second word segmentation result into vectors comprises:
converting the first word segmentation result and the second word segmentation result into vectors using the correspondence between segmented words and their term frequency-inverse document frequency (TF-IDF) values.
3. The method according to claim 1, characterized in that mapping the first word vector and the second word vector respectively to the high-dimensional space comprises:
determining the target domain corresponding to the textual content of the first text and the second text;
obtaining corpus samples of the target domain;
training a model with the corpus samples of the target domain to obtain a mapping model corresponding to the target domain;
mapping the first word vector and the second word vector to the high-dimensional space using the mapping model of the target domain.
4. The method according to claim 3, characterized in that the method further comprises:
computing a loss function of the mapping model from the corpus samples of the target domain;
updating the mapping model using the result of the loss function.
5. The method according to claim 1, characterized in that the comparison result is a similarity comparison result;
correspondingly, performing the similarity comparison on the first mapped vector and the second mapped vector comprises: computing the cosine of the angle between the first mapped vector and the second mapped vector in the high-dimensional space, the cosine value corresponding to the similarity comparison result.
6. A device for semantic similarity comparison between texts, characterized by comprising:
a determining module for determining a first text and a second text;
a word segmentation module for performing word segmentation on the first text and the second text respectively to obtain a corresponding first word segmentation result and second word segmentation result;
a conversion module for converting the first word segmentation result and the second word segmentation result into vectors to obtain a corresponding first word vector and second word vector;
a mapping module for mapping the first word vector and the second word vector respectively to a high-dimensional space to obtain a corresponding first mapped vector and second mapped vector;
a comparison module for performing a similarity comparison on the first mapped vector and the second mapped vector to obtain a comparison result for characterizing the semantic similarity between the first text and the second text.
7. The device according to claim 6, characterized in that the word segmentation module is specifically configured to: convert the first word segmentation result and the second word segmentation result into vectors using the correspondence between segmented words and their term frequency-inverse document frequency (TF-IDF) values.
8. The device according to claim 6, characterized in that the mapping module comprises:
a determining submodule for determining the target domain corresponding to the textual content of the first text and the second text;
an obtaining submodule for obtaining corpus samples of the target domain;
a training submodule for training a model with the corpus samples of the target domain to obtain a mapping model corresponding to the target domain;
a mapping submodule for mapping the first word vector and the second word vector to the high-dimensional space using the mapping model of the target domain.
9. The device according to claim 8, characterized in that the device further comprises:
a computing module for computing a loss function of the mapping model from the corpus samples of the target domain;
an updating module for updating the mapping model using the result of the loss function.
10. A computer storage medium, characterized in that computer-executable instructions are stored in the storage medium, and when executed, the instructions perform the method for semantic similarity comparison between texts according to any one of claims 1-5.
CN201910749686.7A 2019-08-14 2019-08-14 Method, device and computer storage medium for semantic similarity comparison between texts Active CN110516040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910749686.7A CN110516040B (en) 2019-08-14 2019-08-14 Method, device and computer storage medium for semantic similarity comparison between texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910749686.7A CN110516040B (en) 2019-08-14 2019-08-14 Method, device and computer storage medium for semantic similarity comparison between texts

Publications (2)

Publication Number Publication Date
CN110516040A true CN110516040A (en) 2019-11-29
CN110516040B CN110516040B (en) 2022-08-05

Family

ID=68625162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910749686.7A Active CN110516040B (en) 2019-08-14 2019-08-14 Method, device and computer storage medium for semantic similarity comparison between texts

Country Status (1)

Country Link
CN (1) CN110516040B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111049683A (en) * 2019-12-11 2020-04-21 中国科学院深圳先进技术研究院 Attention mechanism-based large-scale network group real-time fault prediction method
CN111401928A (en) * 2020-04-01 2020-07-10 支付宝(杭州)信息技术有限公司 Method and device for determining semantic similarity of text based on graph data
CN111611371A (en) * 2020-06-17 2020-09-01 厦门快商通科技股份有限公司 Method, device, equipment and storage medium for matching FAQ based on wide and deep network
CN113011172A (en) * 2021-03-15 2021-06-22 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106557485A (en) * 2015-09-25 2017-04-05 北京国双科技有限公司 A kind of method and device for choosing text classification training set
US20170177703A1 (en) * 2015-12-21 2017-06-22 Ebay Inc. Automatic taxonomy mapping using sequence semantic embedding
CN106777090A (en) * 2016-12-14 2017-05-31 大连交通大学 The medical science big data search method of the Skyline that view-based access control model vocabulary is matched with multiple features
CN106776559A (en) * 2016-12-14 2017-05-31 东软集团股份有限公司 The method and device of text semantic Similarity Measure
CN107220311A (en) * 2017-05-12 2017-09-29 北京理工大学 A kind of document representation method of utilization locally embedding topic modeling
CN109614479A (en) * 2018-10-29 2019-04-12 山东大学 A kind of judgement document's recommended method based on distance vector
CN109411082A (en) * 2018-11-08 2019-03-01 西华大学 A kind of Evaluation of Medical Quality and medical recommended method
CN109271462A (en) * 2018-11-23 2019-01-25 河北航天信息技术有限公司 A kind of taxpayer's tax registration registered address information cluster method based on K-means algorithm model
CN109783806A (en) * 2018-12-21 2019-05-21 众安信息技术服务有限公司 A kind of text matching technique using semantic analytic structure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Tom Kenter et al.: "Short Text Similarity with Word Embeddings", Proceedings of the 24th ACM International on Conference on Information and Knowledge Management *
Zheng Shuai et al.: "A spam short-message filtering algorithm based on multi-dimensional semantic space", 《自动化技术与应用》 (Techniques of Automation and Applications) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111049683A (en) * 2019-12-11 2020-04-21 中国科学院深圳先进技术研究院 Attention mechanism-based large-scale network group real-time fault prediction method
CN111401928A (en) * 2020-04-01 2020-07-10 支付宝(杭州)信息技术有限公司 Method and device for determining semantic similarity of text based on graph data
CN111401928B (en) * 2020-04-01 2022-04-12 支付宝(杭州)信息技术有限公司 Method and device for determining semantic similarity of text based on graph data
CN111611371A (en) * 2020-06-17 2020-09-01 厦门快商通科技股份有限公司 Method, device, equipment and storage medium for matching FAQ based on wide and deep network
CN111611371B (en) * 2020-06-17 2022-08-23 厦门快商通科技股份有限公司 Method, device, equipment and storage medium for matching FAQ based on wide and deep network
CN113011172A (en) * 2021-03-15 2021-06-22 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium
CN113011172B (en) * 2021-03-15 2023-08-22 腾讯科技(深圳)有限公司 Text processing method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110516040B (en) 2022-08-05

Similar Documents

Publication Publication Date Title
CN110516040A (en) Semantic Similarity comparative approach, equipment and computer storage medium between text
AU2018247340B2 (en) Dvqa: understanding data visualizations through question answering
CN111858940B (en) Multi-head attention-based legal case similarity calculation method and system
WO2023020005A1 (en) Neural network model training method, image retrieval method, device, and medium
CN107330011A (en) The recognition methods of the name entity of many strategy fusions and device
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN110688489B (en) Knowledge graph deduction method and device based on interactive attention and storage medium
CN106970912A (en) Chinese sentence similarity calculating method, computing device and computer-readable storage medium
CN111542841A (en) System and method for content identification
CN114298121B (en) Multi-mode-based text generation method, model training method and device
CN112990035B (en) Text recognition method, device, equipment and storage medium
CN113326852A (en) Model training method, device, equipment, storage medium and program product
CN112328761A (en) Intention label setting method and device, computer equipment and storage medium
EP4116859A3 (en) Document processing method and apparatus and medium
CN109871757A (en) A kind of radar signal intra-pulse modulation kind identification method based on joint time-frequency feature
CN111444715A (en) Entity relationship identification method and device, computer equipment and storage medium
CN111666376A (en) Answer generation method and device based on paragraph boundary scan prediction and word shift distance cluster matching
CN111144109B (en) Text similarity determination method and device
CN114445832A (en) Character image recognition method and device based on global semantics and computer equipment
CN114519397B (en) Training method, device and equipment for entity link model based on contrast learning
US11727710B2 (en) Weakly supervised semantic parsing
CN116561274A (en) Knowledge question-answering method based on digital human technology and natural language big model
CN111428486B (en) Article information data processing method, device, medium and electronic equipment
CN112949433B (en) Method, device and equipment for generating video classification model and storage medium
CN113569018A (en) Question and answer pair mining method and device

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant