CN109977995A - Text template recognition methods, device and computer readable storage medium - Google Patents

Info

Publication number
CN109977995A
Authority
CN
China
Prior art keywords
similarity
text
template
text template
matched
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910109887.0A
Other languages
Chinese (zh)
Inventor
刘轲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910109887.0A
Priority to PCT/CN2019/088628 (WO2020164204A1)
Publication of CN109977995A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text template recognition method, comprising: obtaining a pre-set text template and a matched text; calculating a first similarity between the matched text and the pre-set text template according to a word-frequency-based text similarity algorithm; and/or calculating a second similarity between the matched text and the pre-set text template according to a semantics-based text similarity algorithm; and, when the first similarity and/or the second similarity satisfy a pre-set similarity condition, determining that the matched text is a text template similar to the pre-set text template. The invention also proposes a text template identification device and a computer-readable storage medium. The invention can improve the efficiency and accuracy of text template recognition.

Description

Text template recognition methods, device and computer readable storage medium
Technical field
The present invention relates to the field of natural language processing, and in particular to a text template recognition method, a device, and a computer-readable storage medium.
Background art
With the development of Internet technology, people in all walks of life can freely publish and download information through network platforms, so the amount of information on the network keeps growing. Big data analysis examines these massive amounts of network data and extracts the information required. Big data analysis sometimes requires text templates, that is, text information containing certain specific characters. In general, identical or similar pieces of text information correspond to one text template. In the prior art, text templates are usually extracted from assorted information by staff; this is time-consuming and laborious, because staff need a long time to identify and then obtain each text template.
Summary of the invention
The present invention provides a text template recognition method, a device, and a computer-readable storage medium, whose main purpose is to improve the efficiency and accuracy of text template recognition.
To achieve the above object, the present invention provides a text template recognition method, comprising:
Obtaining a pre-set text template and a matched text;
Calculating a first similarity between the matched text and the pre-set text template according to a word-frequency-based text similarity algorithm; and/or
Calculating a second similarity between the matched text and the pre-set text template according to a semantics-based text similarity algorithm;
When the first similarity and/or the second similarity satisfy a pre-set similarity condition, determining that the matched text is a text template similar to the pre-set text template.
Optionally, calculating the first similarity between the matched text and the pre-set text template according to the word-frequency-based text similarity algorithm, and/or calculating the second similarity between the matched text and the pre-set text template according to the semantics-based text similarity algorithm, comprises:
Calculating the first similarity between the matched text and the pre-set text template using a vector space model;
Calculating the second similarity between the matched text and the pre-set text template using an LDA document topic generation model;
The first similarity and the second similarity satisfying the pre-set similarity condition comprises:
Linearly weighting the first similarity and the second similarity to obtain a third similarity between the matched text and the pre-set text template;
Judging whether the third similarity is greater than a third pre-set similarity;
If the third similarity is greater than the third pre-set similarity, determining that the first similarity and the second similarity satisfy the pre-set similarity condition.
Optionally, linearly weighting the first similarity and the second similarity to obtain the third similarity between the matched text and the pre-set text template comprises:
Inputting the first similarity and the second similarity into a predetermined linear weighting formula, and outputting the third similarity between the matched text and the pre-set text template, the predetermined linear weighting formula being:
$$\mathrm{sim}(p, q) = \alpha\,\mathrm{sim}_{LDA}(p, q) + \beta\,\mathrm{sim}_{TFIDF}(p, q)$$
Where $p$ and $q$ are the matched text and the pre-set text template respectively, $\mathrm{sim}_{TFIDF}(p, q)$ is the first similarity, $\mathrm{sim}_{LDA}(p, q)$ is the second similarity, $\mathrm{sim}(p, q)$ is the third similarity, and $\alpha$ and $\beta$ are pre-set weight values.
Optionally, the method further comprises:
Obtaining the weight value used for the linear weighting, comprising:
Assigning a first initial value to the weight value, and calculating the third similarity according to the first initial value;
Judging, by a pre-set clustering algorithm, whether the matched text and the pre-set text template belong to the same category, to obtain a cluster result;
Judging, by the cluster result, whether the third similarity calculated according to the first initial value is accurate;
If the third similarity calculated according to the first initial value is judged to be accurate, determining the first initial value to be the weight value used for the linear weighting;
If the third similarity calculated according to the first initial value is judged to be inaccurate, adjusting the first initial value and again performing the operation of calculating the third similarity according to the first initial value.
Optionally, the first similarity or the second similarity satisfying the pre-set similarity condition comprises:
The first similarity is greater than a first pre-set similarity, or the second similarity is greater than a second pre-set similarity.
In addition, to achieve the above object, the present invention also provides a text template identification device. The device comprises a memory and a processor; the memory stores a text template recognition program that can run on the processor, and the text template recognition program, when executed by the processor, implements the following steps:
Obtaining a pre-set text template and a matched text;
Calculating a first similarity between the matched text and the pre-set text template according to a word-frequency-based text similarity algorithm; and/or
Calculating a second similarity between the matched text and the pre-set text template according to a semantics-based text similarity algorithm;
When the first similarity and/or the second similarity satisfy a pre-set similarity condition, determining that the matched text is a text template similar to the pre-set text template.
Optionally, calculating the first similarity between the matched text and the pre-set text template according to the word-frequency-based text similarity algorithm, and/or calculating the second similarity between the matched text and the pre-set text template according to the semantics-based text similarity algorithm, comprises:
Calculating the first similarity between the matched text and the pre-set text template using a vector space model;
Calculating the second similarity between the matched text and the pre-set text template using an LDA document topic generation model;
The first similarity and the second similarity satisfying the pre-set similarity condition comprises:
Linearly weighting the first similarity and the second similarity to obtain a third similarity between the matched text and the pre-set text template;
Judging whether the third similarity is greater than a third pre-set similarity;
If the third similarity is greater than the third pre-set similarity, determining that the first similarity and the second similarity satisfy the pre-set similarity condition.
Optionally, linearly weighting the first similarity and the second similarity to obtain the third similarity between the matched text and the pre-set text template comprises:
Inputting the first similarity and the second similarity into a predetermined linear weighting formula, and outputting the third similarity between the matched text and the pre-set text template, the predetermined linear weighting formula being:
$$\mathrm{sim}(p, q) = \alpha\,\mathrm{sim}_{LDA}(p, q) + \beta\,\mathrm{sim}_{TFIDF}(p, q)$$
Where $p$ and $q$ are the matched text and the pre-set text template respectively, $\mathrm{sim}_{TFIDF}(p, q)$ is the first similarity, $\mathrm{sim}_{LDA}(p, q)$ is the second similarity, $\mathrm{sim}(p, q)$ is the third similarity, and $\alpha$ and $\beta$ are pre-set weight values.
Optionally, the text template recognition program, when executed by the processor, further implements the following steps:
Obtaining the weight value used for the linear weighting, comprising:
Assigning a first initial value to the weight value, and calculating the third similarity according to the first initial value;
Judging, by a pre-set clustering algorithm, whether the matched text and the pre-set text template belong to the same category, to obtain a cluster result;
Judging, by the cluster result, whether the third similarity calculated according to the first initial value is accurate;
If the third similarity calculated according to the first initial value is judged to be accurate, determining the first initial value to be the weight value used for the linear weighting;
If the third similarity calculated according to the first initial value is judged to be inaccurate, adjusting the first initial value and again performing the operation of calculating the third similarity according to the first initial value.
Optionally, the first similarity or the second similarity satisfying the pre-set similarity condition comprises:
The first similarity is greater than a first pre-set similarity, or the second similarity is greater than a second pre-set similarity.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium. A text template recognition program is stored on the computer-readable storage medium, and the text template recognition program can be executed by one or more processors to implement the steps of the text template recognition method described above.
The text template recognition method, text template identification device, and computer-readable storage medium proposed by the present invention obtain a pre-set text template and a matched text; calculate a first similarity between the matched text and the pre-set text template according to a word-frequency-based text similarity algorithm; and/or calculate a second similarity between the matched text and the pre-set text template according to a semantics-based text similarity algorithm; and, when the first similarity and/or the second similarity satisfy a pre-set similarity condition, determine that the matched text is a text template similar to the pre-set text template. Text templates similar to the pre-set text template can thus be obtained quickly without staff judging each text manually, which improves the efficiency of text template recognition. Moreover, calculating text similarity with the word-frequency-based and/or the semantics-based text similarity algorithm can improve the accuracy of text template recognition.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the text template recognition method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the internal structure of the text template identification device provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the modules of the text template recognition program in the text template identification device provided by an embodiment of the present invention.
The realization of the object, the functional characteristics, and the advantages of the present invention will be further described with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed description of the embodiments
It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.
The present invention provides a text template recognition method. Fig. 1 is a schematic flowchart of the text template recognition method provided by the first embodiment of the present invention. The method can be executed by an electronic device.
In the present embodiment, text template recognition methods includes:
Step S10: pre-set text template and matched text are obtained.
The pre-set text template can be a text template stored in advance in a pre-set storage area (for example, in an electronic device). The pre-set text template can be obtained by a user and stored in the pre-set storage area; alternatively, the pre-set text template is obtained by analyzing several texts with similar wording and extracting the keywords they share.
In one possible embodiment, the pre-set text template is any text template in a text template set. The text template set may contain text templates that all belong to the same class, or it may contain text templates of various different classes. Obtaining the pre-set text template comprises: obtaining the text template set; and obtaining a text template in the text template set.
The matched text is the text that is to be judged as being, or not being, a similar text template. The matched text can consist of one or more sentences.
Step S20: calculating a first similarity between the matched text and the pre-set text template according to a word-frequency-based text similarity algorithm, and/or calculating a second similarity between the matched text and the pre-set text template according to a semantics-based text similarity algorithm.
The word-frequency-based text similarity algorithm calculates the similarity between two texts from the frequency of occurrence of words; the semantics-based text similarity algorithm calculates the similarity between two texts from their semantics.
Specific word-frequency-based and semantics-based text similarity algorithms can be obtained from the prior art and are not described again here.
Optionally, in another embodiment of the invention, calculating the first similarity between the matched text and the pre-set text template according to the word-frequency-based text similarity algorithm, and/or calculating the second similarity between the matched text and the pre-set text template according to the semantics-based text similarity algorithm, comprises:
Calculating the first similarity between the matched text and the pre-set text template using a vector space model;
Calculating the second similarity between the matched text and the pre-set text template using an LDA document topic generation model.
In the present embodiment, the first similarity between the matched text and the pre-set text template is calculated using a vector space model.
Calculating the first similarity between the matched text and the pre-set text template using the vector space model (Vector Space Model, VSM) comprises:
Performing a preprocessing operation on the matched text and the pre-set text template to obtain a preprocessed matched text and a preprocessed pre-set text template, the preprocessing operation including but not limited to word segmentation and stop-word removal (stop words include words, symbols, punctuation, garbled characters, and the like that contribute little to the content of the text, such as "this");
Determining first keywords from the preprocessed matched text according to word frequency, and determining second keywords from the preprocessed pre-set text template according to word frequency, where the first keywords and the second keywords may each include multiple words;
For example, the words whose frequency of occurrence in the preprocessed matched text is greater than a pre-set frequency are determined to be the first keywords.
After the first keywords and the second keywords are determined, calculating the inverse document frequency of the first keywords and the inverse document frequency of the second keywords, and generating a first vector representing the matched text and a second vector representing the pre-set text template;
Here, the inverse document frequency (IDF) is an index for measuring the weight of a keyword.
The inverse document frequency of a keyword can be calculated by the formula $IDF = \log(D / D_w)$, where $D$ is the total number of texts in the sample database and $D_w$ is the number of texts in which the keyword occurs.
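The preprocessing, keyword selection, and IDF steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the stop-word list, the whitespace tokenizer (Chinese text would need a real word segmenter), and the function names `preprocess`, `keywords`, and `idf` are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"the", "a", "of", "is", "to", "this"}


def preprocess(text):
    """Lower-case, split on whitespace, strip punctuation, drop stop words."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


def keywords(tokens, preset_frequency=1):
    """Keep the words that occur more often than the pre-set frequency."""
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n > preset_frequency}


def idf(word, corpus):
    """IDF = log(D / D_w), where corpus is a list of token lists (the sample database)."""
    d_w = sum(1 for doc in corpus if word in doc)
    if d_w == 0:
        raise ValueError(f"{word!r} does not occur in the sample database")
    return math.log(len(corpus) / d_w)
```

A keyword that occurs in every text of the sample database gets an IDF of 0, which is why IDF works as a weight: common words contribute nothing to the vector.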
In the present embodiment, the first vector and the second vector are obtained according to the following formula:
$$D = D(T_1, W_1; T_2, W_2; \ldots; T_n, W_n)$$
Where $T_1$ is a keyword and $W_1$ is the inverse document frequency of that keyword; $T_2$ is another keyword and $W_2$ is the inverse document frequency of that keyword; and so on, $T_n$ is the n-th keyword and $W_n$ is the inverse document frequency of that keyword.
In the vector space model, the content relevance Sim(D1, D2) between two texts is commonly expressed as the cosine of the angle between their vectors. Therefore, after the first vector of the matched text and the second vector of the pre-set text template are obtained, the cosine of the first vector and the second vector is calculated to obtain the first similarity between the matched text and the pre-set text template. The formula for calculating the cosine can be obtained from the prior art and is not described again here.
In the present embodiment, each text is reduced to an N-dimensional vector whose components are the weights of its feature items (keywords). This simplifies the complex relationships between keywords in the text and makes the model computable, so that the first similarity between the matched text and the pre-set text template can be obtained quickly.
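As an illustration of the cosine step, the sketch below stores each text vector as a dictionary mapping keyword to weight and computes the standard cosine of the angle between two such vectors. The dictionary representation, the function name `cosine_similarity`, and the sample weights are assumptions made for the example.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as {keyword: weight}."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector shares nothing with any text
    return dot / (norm_a * norm_b)


# Hypothetical keyword weights for a matched text and a pre-set text template.
matched_vec = {"order": 0.69, "shipped": 1.39}
template_vec = {"order": 0.69, "delivered": 1.39}
first_similarity = cosine_similarity(matched_vec, template_vec)
```

Keywords missing from one vector are treated as having weight 0, so only keywords shared by both texts add to the dot product.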
In the present embodiment, the basic idea of the LDA (Latent Dirichlet Allocation) model is to describe a document as a probability distribution over topics, and to further describe each topic as a probability distribution over terms. How to calculate the second similarity between the matched text and the pre-set text template according to the LDA document topic generation model can be obtained from the prior art and is not described again here.
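The text leaves the comparison of LDA outputs to the prior art. One common choice, shown here purely as an assumption, is to take the topic distributions that an already trained LDA model assigns to the two texts and score them with one minus the Jensen-Shannon divergence (base 2), which lies between 0 and 1. The names `js_divergence` and `topic_similarity` and the input distributions are hypothetical; training the LDA model itself is outside this sketch.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def topic_similarity(theta_p, theta_q):
    """Similarity in [0, 1] between two topic distributions of equal length."""
    if len(theta_p) != len(theta_q):
        raise ValueError("topic distributions must have the same number of topics")
    return 1.0 - js_divergence(theta_p, theta_q)
```

Two texts with identical topic mixtures score 1, and two texts that share no topic at all score 0.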
Step S30: when the first similarity and/or the second similarity satisfy a pre-set similarity condition, determining that the matched text is a text template similar to the pre-set text template.
The pre-set similarity condition can be set in advance.
Optionally, in another embodiment of the invention, the first similarity or the second similarity satisfying the pre-set similarity condition comprises:
The first similarity is greater than a first pre-set similarity, or the second similarity is greater than a second pre-set similarity.
The first pre-set similarity and the second pre-set similarity can be set in advance as required, and their values can be the same or different. For example, the first pre-set similarity is 85% and the second pre-set similarity is 90%; alternatively, the first pre-set similarity and the second pre-set similarity are both 90%.
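The either-threshold condition can be written as a one-line check. The function name `meets_either_threshold` is an assumption for this sketch, and the defaults reuse the 85% and 90% figures from the example above.

```python
def meets_either_threshold(first_sim, second_sim, first_preset=0.85, second_preset=0.90):
    """True when either similarity is strictly greater than its pre-set value."""
    return first_sim > first_preset or second_sim > second_preset
```

The comparison is strict, matching the wording "greater than": a first similarity of exactly 85% does not satisfy the condition on its own.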
Optionally, in another embodiment of the invention, the first similarity and the second similarity satisfying the pre-set similarity condition comprises:
Linearly weighting the first similarity and the second similarity to obtain a third similarity between the matched text and the pre-set text template;
Judging whether the third similarity is greater than a third pre-set similarity;
If the third similarity is greater than the third pre-set similarity, determining that the first similarity and the second similarity satisfy the pre-set similarity condition.
Linear weighting assigns a weight value to each of the first similarity and the second similarity and adds the weighted values to obtain the third similarity.
The third pre-set similarity can be set in advance.
Optionally, in another embodiment of the invention, linearly weighting the first similarity and the second similarity to obtain the third similarity between the matched text and the pre-set text template comprises:
Inputting the first similarity and the second similarity into a predetermined linear weighting formula, and outputting the third similarity between the matched text and the pre-set text template, the predetermined linear weighting formula being:
$$\mathrm{sim}(p, q) = \alpha\,\mathrm{sim}_{LDA}(p, q) + \beta\,\mathrm{sim}_{TFIDF}(p, q)$$
Where $p$ and $q$ are the matched text and the pre-set text template respectively, $\mathrm{sim}_{TFIDF}(p, q)$ is the first similarity, $\mathrm{sim}_{LDA}(p, q)$ is the second similarity, $\mathrm{sim}(p, q)$ is the third similarity, and $\alpha$ and $\beta$ are pre-set weight values.
In the present embodiment, 0 ≤ α ≤ 1, 0 ≤ β ≤ 1, and the sum of α and β is 1.
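Because α and β sum to 1, only α needs to be supplied. The sketch below computes the third similarity from that single weight; the function name `third_similarity` and the sample inputs are assumptions made for the example.

```python
def third_similarity(sim_lda, sim_tfidf, alpha):
    """sim(p, q) = alpha * sim_LDA(p, q) + beta * sim_TFIDF(p, q), with beta = 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    return alpha * sim_lda + beta * sim_tfidf


# Hypothetical inputs: semantic similarity 0.8, word-frequency similarity 0.6.
combined = third_similarity(0.8, 0.6, alpha=0.5)
```

With α = 0 the result is the word-frequency similarity alone, and with α = 1 it is the semantic similarity alone, so the weight moves the combined score between the two.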
Optionally, in another embodiment of the invention, the method further comprises obtaining the weight value used for the linear weighting. Obtaining the weight value used for the linear weighting comprises:
Assigning a first initial value to the weight value, and calculating the third similarity according to the first initial value;
Judging, by a pre-set clustering algorithm, whether the matched text and the pre-set text template belong to the same category, to obtain a cluster result;
Judging, by the cluster result, whether the third similarity calculated according to the first initial value is accurate;
If the third similarity calculated according to the first initial value is judged to be accurate, determining the first initial value to be the weight value used for the linear weighting;
If the third similarity calculated according to the first initial value is judged to be inaccurate, adjusting the first initial value and again performing the operation of calculating the third similarity according to the first initial value.
The above steps are used to obtain the value of α or β.
The cluster result is either that the matched text and the pre-set text template belong to the same category, or that they do not belong to the same category.
The first initial value can be 0.1, and each adjustment can increase it by 0.1. For example, if the weight being obtained is α, α is initially assigned 0.1, so β is 0.9. The third similarity between the matched text and the pre-set text template is calculated according to the predetermined linear weighting formula, and the clustering algorithm judges whether the matched text and the pre-set text template belong to the same category. If the third similarity is less than 50% and the clustering algorithm judges that the matched text and the pre-set text template are not the same category, the third similarity calculated according to the first initial value is determined to be inaccurate. Then α = α + 0.1 is applied, so α is 0.2 and β is 0.8; the third similarity is calculated again according to the predetermined linear weighting formula, and the clustering algorithm again judges whether the matched text and the pre-set text template belong to the same category. If the result is still inaccurate, α = α + 0.1 is applied again, so α is 0.3 and β is 0.7, and the calculation is repeated, and so on, until the optimal values of α and β are found.
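The stepwise search over α can be sketched as a loop. The accuracy test is passed in as a function because the text does not spell it out in full; the `agrees_with_cluster` predicate below, which accepts a weight when the thresholded third similarity matches the cluster result, is only one hypothetical choice and is not taken from the text. The names `search_alpha` and `agrees_with_cluster` and the sample similarities are likewise assumptions.

```python
def search_alpha(sim_lda, sim_tfidf, is_accurate, start=0.1, step=0.1):
    """Try alpha = start, start + step, ... up to 1.0; return the first accurate one."""
    n_steps = int(round((1.0 - start) / step)) + 1
    for i in range(n_steps):
        alpha = round(start + i * step, 10)  # avoid drift such as 0.30000000000000004
        beta = 1.0 - alpha
        third = alpha * sim_lda + beta * sim_tfidf
        if is_accurate(third):
            return alpha, third
    return None  # no candidate weight passed the accuracy test


def agrees_with_cluster(same_cluster, threshold=0.5):
    """Hypothetical accuracy test: thresholded similarity must match the cluster result."""
    return lambda third: (third > threshold) == same_cluster


# Hypothetical similarities for a pair that the clustering step puts in one category.
result = search_alpha(0.9, 0.3, agrees_with_cluster(same_cluster=True))
```

For these inputs the loop rejects α = 0.1, 0.2, and 0.3 (third similarities 0.36, 0.42, and 0.48) and stops at α = 0.4, where the third similarity first exceeds 0.5.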
In the present embodiment, when the matched text is determined to be a text template similar to the pre-set text template, the matched text can be added to the template set of the pre-set text template. In this way, multiple text template sets can be obtained, and the text templates within each text template set are similar to one another.
The text template recognition method proposed by the present embodiment obtains a pre-set text template and a matched text; calculates a first similarity between the matched text and the pre-set text template according to a word-frequency-based text similarity algorithm; and/or calculates a second similarity between the matched text and the pre-set text template according to a semantics-based text similarity algorithm; and, when the first similarity and/or the second similarity satisfy a pre-set similarity condition, determines that the matched text is a text template similar to the pre-set text template. Text templates similar to the pre-set text template can thus be obtained quickly without staff judging each text manually, which improves the efficiency of text template recognition. Moreover, calculating text similarity with the word-frequency-based and/or the semantics-based text similarity algorithm can improve the accuracy of text template recognition.
The present invention also provides a text template identification device. Fig. 2 is a schematic diagram of the internal structure of the text template identification device provided by an embodiment of the present invention.
In the present embodiment, the text template identification device 1 can be a personal computer (PC), or a terminal device such as a smartphone, a tablet computer, or a portable computer. The text template identification device 1 includes at least a memory 11, a processor 12, a network interface 13, and a communication bus 14.
The memory 11 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic storage, a magnetic disk, or an optical disc. In some embodiments the memory 11 can be an internal storage unit of the text template identification device 1, such as the hard disk of the text template identification device 1. In other embodiments the memory 11 can be an external storage device of the text template identification device 1, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) fitted to the text template identification device 1. Further, the memory 11 can include both an internal storage unit and an external storage device of the text template identification device 1. The memory 11 can be used not only to store application software installed on the text template identification device 1 and various types of data, such as the code of the text template recognition program 01, but also to temporarily store data that has been output or is about to be output.
In some embodiments the processor 12 can be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or another data processing chip, and is used to run program code stored in the memory 11 or to process data, for example to execute the text template recognition program 01.
The network interface 13 can optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is typically used to establish a communication connection between the device 1 and other electronic devices.
Communication bus 14 is for realizing the connection communication between these components.
Optionally, the text template identification device 1 can also include a user interface. The user interface can include a display (Display) and an input unit such as a keyboard (Keyboard), and can optionally also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display can be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display can also be referred to as a display screen or display unit, and is used to display the information processed in the text template identification device 1 and to display a visual user interface.
Fig. 2 shows only the text template identification device 1 with the components 11-14 and the text template recognition program 01. Those skilled in the art will understand that the structure shown in Fig. 2 does not limit the text template identification device 1, which can include fewer or more components than illustrated, combine certain components, or arrange the components differently.
In the embodiment of the text template identification device 1 shown in Fig. 2, the text template recognition program 01 is stored in the memory 11; the processor 12 implements the following steps when executing the text template recognition program 01 stored in the memory 11:
Obtain a pre-set text template and a matched text.
The pre-set text template may be a text template stored in advance in a preset storage area (for example, in the electronic equipment). It may be obtained by a user and stored in the preset storage area, or it may be derived by analyzing several texts with similar wording and extracting the keywords those texts share.
In one possible embodiment, the pre-set text template is any one text template in a text template set. The set may contain text templates that all belong to the same class, or text templates of various different classes. Obtaining the pre-set text template includes: obtaining the text template set; and obtaining a text template from the text template set.
The matched text is the text to be judged as to whether it is a similar text template. The matched text may consist of one or more sentences.
Calculate a first similarity between the matched text and the pre-set text template according to a text similarity measurement algorithm based on word frequency, and/or calculate a second similarity between the matched text and the pre-set text template according to a semantic-based text similarity measurement algorithm.
The text similarity measurement algorithm based on word frequency calculates the similarity between two texts from the frequency with which words occur; the semantic-based text similarity measurement algorithm calculates the similarity between two texts from their semantics.
The specific word-frequency-based and semantic-based text similarity measurement algorithms can be obtained from the prior art and are not described again here.
Optionally, in another embodiment of the invention, calculating the first similarity between the matched text and the pre-set text template according to the text similarity measurement algorithm based on word frequency, and calculating the second similarity between the matched text and the pre-set text template according to the semantic-based text similarity measurement algorithm, includes:
Calculating the first similarity between the matched text and the pre-set text template using a vector space model;
Calculating the second similarity between the matched text and the pre-set text template using an LDA document topic generation model.
In the present embodiment, the first similarity between the matched text and the pre-set text template is calculated using the vector space model.
Calculating the first similarity between the matched text and the pre-set text template using the vector space model (Vector Space Model, VSM) includes:
Performing a pre-processing operation on the matched text and the pre-set text template to obtain a pre-processed matched text and a pre-processed pre-set text template, where the pre-processing operation includes but is not limited to word segmentation and stop-word removal (stop words include words, symbols, punctuation, garbled characters and the like that contribute little to the content of the text, such as "this");
Determining first keywords from the pre-processed matched text according to word frequency, and determining second keywords from the pre-processed pre-set text template according to word frequency, where the first keywords and the second keywords may each include multiple words;
For example, the words whose frequency of occurrence in the pre-processed matched text is greater than a preset frequency are determined to be the first keywords.
After the first keywords and the second keywords are determined, calculating the inverse document frequency of the first keywords and the inverse document frequency of the second keywords, and generating a first vector representing the matched text and a second vector representing the pre-set text template;
Here, the inverse document frequency (IDF) is an index for measuring the weight of a keyword.
The inverse document frequency of a keyword can be calculated by the formula IDF = log(D / D_w), where D is the total number of texts in the sample database and D_w is the number of texts in which the keyword occurs.
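The formula IDF = log(D / D_w) can be illustrated with a minimal sketch. The function name, the toy sample database, and the use of the natural logarithm are assumptions made for illustration; the text does not specify the base of the logarithm or how the sample database is stored.

```python
import math


def inverse_document_frequency(keyword, sample_texts):
    """Return log(D / D_w) for one keyword.

    sample_texts is a list of token lists (a hypothetical representation
    of the sample database). D is the number of texts and D_w is the
    number of texts that contain the keyword. The natural logarithm is
    an assumption; the source only writes "log".
    """
    total = len(sample_texts)
    containing = sum(1 for tokens in sample_texts if keyword in tokens)
    if containing == 0:
        raise ValueError("keyword does not occur in the sample database")
    return math.log(total / containing)


# Toy sample database: four already-segmented texts.
sample_texts = [
    ["loan", "approved"],
    ["loan", "rejected"],
    ["card", "approved"],
    ["card", "issued"],
]

idf_loan = inverse_document_frequency("loan", sample_texts)      # log(4 / 2)
idf_issued = inverse_document_frequency("issued", sample_texts)  # log(4 / 1)
```

A keyword that occurs in fewer texts receives a larger weight, which is why "issued" is weighted more heavily than "loan" in this toy database.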
In the present embodiment, the first vector and the second vector are obtained according to the following formula:
D = D(T1, W1; T2, W2; ...; Tn, Wn)
where T1 is a keyword and W1 is the inverse document frequency of that keyword; T2 is another keyword and W2 is the inverse document frequency of that keyword; and so on, Tn is the n-th keyword and Wn is the inverse document frequency of that keyword.
In the vector space model, the content relevance Sim(D1, D2) between two texts is commonly expressed by the cosine of the angle between their vectors. Therefore, after the first vector of the matched text and the second vector of the pre-set text template are obtained, the cosine of the first vector and the second vector is calculated to obtain the first similarity between the matched text and the pre-set text template. The formula for calculating the cosine can be obtained from the prior art and is not described again here.
In the present embodiment, a text is reduced to an N-dimensional vector whose components are the weights of its feature items (keywords). This simplifies the complex relationships between keywords in the text and makes the model computable, so that the first similarity between the matched text and the pre-set text template can be obtained quickly.
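The keyword selection, vector construction and cosine comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the texts are taken to be already segmented into tokens, the IDF weights are supplied as a ready-made dictionary, and all function names and sample values are hypothetical.

```python
import math
from collections import Counter


def keywords_by_frequency(tokens, preset_frequency=0):
    """Keep the words that occur more often than the preset frequency."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count > preset_frequency}


def build_vector(keywords, idf_weights):
    """Represent a text as {T_i: W_i}; unknown keywords get weight 0.0."""
    return {word: idf_weights.get(word, 0.0) for word in keywords}


def cosine_similarity(first_vector, second_vector):
    """Cosine of the angle between two sparse keyword-weight vectors."""
    dot = sum(w * second_vector.get(t, 0.0) for t, w in first_vector.items())
    norm_first = math.sqrt(sum(w * w for w in first_vector.values()))
    norm_second = math.sqrt(sum(w * w for w in second_vector.values()))
    if norm_first == 0.0 or norm_second == 0.0:
        return 0.0  # no weighted keyword on one side: treat as dissimilar
    return dot / (norm_first * norm_second)


# Hypothetical IDF weights, assumed to be computed beforehand.
idf_weights = {"dear": 0.2, "customer": 0.5, "balance": 1.2,
               "loan": 1.5, "overdue": 1.8}

matched_tokens = "dear customer your loan is overdue".split()
template_tokens = "dear customer your balance is overdue".split()

first_vector = build_vector(keywords_by_frequency(matched_tokens), idf_weights)
second_vector = build_vector(keywords_by_frequency(template_tokens), idf_weights)
first_similarity = cosine_similarity(first_vector, second_vector)
```

Storing each vector as a dictionary keeps only the keywords that actually occur, so the two vectors need not be padded to a common length before the cosine is taken.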
In the present embodiment, the basic idea of the LDA (Latent Dirichlet Allocation) model is to describe a document as a probability distribution over topics and, further, to describe each topic as a probability distribution over terms. How to calculate the second similarity between the matched text and the pre-set text template with the LDA document topic generation model can be obtained from the prior art and is not described again here.
When the first similarity and/or the second similarity meets a preset similarity condition, determine that the matched text is a text template similar to the pre-set text template.
The preset similarity condition may be set in advance.
Optionally, in another embodiment of the invention, the first similarity or the second similarity meeting the preset similarity condition includes:
The first similarity is greater than a first preset similarity, or the second similarity is greater than a second preset similarity.
The first preset similarity and the second preset similarity can be set in advance as required, and their values may be the same or different. For example, the first preset similarity is 85% and the second preset similarity is 90%; alternatively, both are 90%.
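The "either threshold" condition can be written as a short check. The sketch below uses the 85% and 90% example values from the text as defaults; the function and parameter names are hypothetical.

```python
def meets_either_threshold(first_similarity, second_similarity,
                           first_threshold=0.85, second_threshold=0.90):
    """True when at least one similarity strictly exceeds its own threshold."""
    return (first_similarity > first_threshold
            or second_similarity > second_threshold)


# The word-frequency score alone is enough here.
accepted = meets_either_threshold(0.86, 0.50)
# Scores exactly equal to the thresholds do not count as "greater than".
rejected = meets_either_threshold(0.85, 0.90)
```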
Optionally, in another embodiment of the invention, the first similarity and the second similarity meeting the preset similarity condition includes:
Performing linear weighting on the first similarity and the second similarity to obtain a third similarity between the matched text and the pre-set text template;
Judging whether the third similarity is greater than a third preset similarity;
If the third similarity is greater than the third preset similarity, determining that the first similarity and the second similarity meet the preset similarity condition.
Linear weighting assigns a weight to each of the first similarity and the second similarity and adds the weighted values to obtain the third similarity.
The third preset similarity may be set in advance.
Optionally, in another embodiment of the invention, performing linear weighting on the first similarity and the second similarity to obtain the third similarity between the matched text and the pre-set text template includes:
Inputting the first similarity and the second similarity into a predetermined linear weighting formula and outputting the third similarity between the matched text and the pre-set text template, the predetermined linear weighting formula being:
sim(p, q) = α·sim_LDA(p, q) + β·sim_TFIDF(p, q),
where p and q are the matched text and the pre-set text template respectively, sim_TFIDF(p, q) is the first similarity, sim_LDA(p, q) is the second similarity, sim(p, q) is the third similarity, and α and β are preset weighted values.
In the present embodiment, 0 ≤ α ≤ 1, 0 ≤ β ≤ 1, and the sum of α and β is 1.
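A minimal sketch of the linear weighting follows, assuming both input similarities are already available as numbers in [0, 1]. Because α and β sum to 1, only α is passed in and β is derived from it; the function names are hypothetical, and the threshold is a required argument because the text gives no value for it.

```python
def third_similarity(sim_lda, sim_tfidf, alpha):
    """Weighted sum alpha * sim_LDA + beta * sim_TFIDF with beta = 1 - alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    return alpha * sim_lda + beta * sim_tfidf


def meets_third_threshold(sim_lda, sim_tfidf, alpha, third_threshold):
    """True when the weighted similarity strictly exceeds the threshold."""
    return third_similarity(sim_lda, sim_tfidf, alpha) > third_threshold


# Equal weights: the result is the mean of the two similarities.
combined = third_similarity(sim_lda=0.8, sim_tfidf=0.6, alpha=0.5)
```

With α = 1 the result depends only on the semantic (LDA) similarity, and with α = 0 only on the word-frequency similarity, so α controls how much each measure contributes.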
Optionally, in another embodiment of the invention, the text template recognizer, when executed by the processor, also implements the following step:
Obtain the weighted value used for linear weighting.
Obtaining the weighted value used for linear weighting includes:
Assigning a first initial value to the weighted value, and calculating the third similarity according to the first initial value;
Judging, by a preset clustering algorithm, whether the matching template and the pre-set text template belong to the same category, to obtain a cluster result;
Judging, by means of the cluster result, whether the third similarity calculated according to the first initial value is accurate;
If the third similarity calculated according to the first initial value is judged to be accurate, determining the first initial value to be the weighted value used for linear weighting;
If the third similarity calculated according to the first initial value is judged to be inaccurate, adjusting the first initial value and performing again the operation of calculating the third similarity according to the first initial value.
The above steps are used to obtain the value of α or β.
The cluster result is either that the matching template and the pre-set text template belong to the same category, or that they do not.
The first initial value may be 0.1, and each adjustment may increase it by 0.1. For example, if the weight being obtained is α, then α is initially assigned 0.1, so that β is 0.9. The third similarity between the matched text and the pre-set text template is calculated according to the predetermined linear weighting formula, and the clustering algorithm judges whether the matching template and the pre-set text template belong to the same category. If the third similarity is less than 50% and the clustering algorithm judges that the matching template and the pre-set text template do not belong to the same category, the third similarity calculated according to the first initial value is determined to be inaccurate. Then α = α + 0.1 is applied, so that α is 0.2 and β is 0.8; the third similarity is calculated again according to the predetermined linear weighting formula, and the clustering algorithm again judges whether the matching template and the pre-set text template belong to the same category. If the result is still inaccurate, α = α + 0.1 is applied again, so that α is 0.3 and β is 0.7, and the calculation is repeated, and so on, until the optimal values of α and β are found.
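The stepwise search for the weight can be sketched as a loop that starts at α = 0.1 and increases α by 0.1 until the third similarity is judged accurate. The clustering step is not implemented here: its outcome is passed in as a boolean, and the accuracy rule is supplied by the caller as a predicate. Both are assumptions of this sketch, as are the function names; the agreement rule shown in the usage part is one hypothetical choice, not a rule taken from the text.

```python
def search_alpha(sim_lda, sim_tfidf, same_category, is_accurate):
    """Try alpha = 0.1, 0.2, ..., 1.0 and return the first accurate pair.

    same_category: cluster result from a separate clustering step.
    is_accurate:   caller-supplied predicate (third_similarity, same_category)
                   that decides whether the score is consistent with the
                   cluster result.
    Returns (alpha, beta), or None when no step is judged accurate.
    """
    for tenths in range(1, 11):
        # Integer steps avoid the drift of repeatedly adding 0.1 as a float.
        alpha = tenths / 10
        beta = (10 - tenths) / 10
        third = alpha * sim_lda + beta * sim_tfidf
        if is_accurate(third, same_category):
            return alpha, beta
    return None


def agrees_with_cluster(third, same_category, cutoff=0.5):
    """Hypothetical rule: accurate when the score and the cluster agree."""
    return (third > cutoff) == same_category


weights = search_alpha(sim_lda=0.9, sim_tfidf=0.2,
                       same_category=True, is_accurate=agrees_with_cluster)
```

In this usage, the weighted score first exceeds the 0.5 cutoff at α = 0.5, which is where the search stops. Passing the accuracy rule as a parameter keeps the loop independent of how the cluster result is compared with the score.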
In the present embodiment, when the matched text is determined to be a text template similar to the pre-set text template, the matched text can be added to the template set of the pre-set text template. In this way, multiple text template sets can be obtained, each of which contains similar text templates.
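Growing the template sets in this way can be sketched as follows. The similarity decision is passed in as a function, and the token-overlap rule used in the example is a hypothetical stand-in for the similarity conditions described earlier, chosen only to keep the sketch self-contained.

```python
def add_to_template_set(matched_text, template_sets, is_similar):
    """Append matched_text to the first set holding a similar template.

    template_sets is a list of lists of texts. Returns the index of the
    set that received the text, or None if no template was similar.
    """
    for index, templates in enumerate(template_sets):
        if any(is_similar(matched_text, template) for template in templates):
            templates.append(matched_text)
            return index
    return None


def token_overlap_similar(text_a, text_b, cutoff=0.5):
    """Hypothetical similarity rule: Jaccard overlap of whitespace tokens."""
    tokens_a, tokens_b = set(text_a.split()), set(text_b.split())
    union = tokens_a | tokens_b
    if not union:
        return False
    return len(tokens_a & tokens_b) / len(union) > cutoff


template_sets = [
    ["dear customer your loan is overdue"],
    ["your order has shipped"],
]
target = add_to_template_set("dear customer your card is overdue",
                             template_sets, token_overlap_similar)
```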
The text template identification device proposed in the present embodiment obtains a pre-set text template and a matched text; calculates the first similarity between the matched text and the pre-set text template according to the text similarity measurement algorithm based on word frequency; and/or calculates the second similarity between the matched text and the pre-set text template according to the semantic-based text similarity measurement algorithm; and, when the first similarity and/or the second similarity meets the preset similarity condition, determines that the matched text is a text template similar to the pre-set text template. Text templates similar to the pre-set text template can thus be obtained quickly without staff judging them manually one by one, which improves the efficiency of text template identification. Moreover, because the text similarity is calculated by the text similarity measurement algorithm based on word frequency and/or the semantic-based text similarity measurement algorithm, the accuracy of text template identification can be improved.
Optionally, in other embodiments, the text template recognizer may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in the present embodiment) to carry out the present invention. A module, as the term is used in the present invention, is a series of computer program instruction segments capable of completing a specific function, and is used to describe the execution process of the text template recognizer in the text template identification device.
For example, Fig. 3 is a schematic diagram of the program modules of the text template recognizer 01 in one embodiment of the text template identification device of the present invention. In this embodiment, the text template recognizer can be divided into an acquisition module 10, a computing module 20 and a determining module 30. Illustratively:
The acquisition module 10 is used to: obtain a pre-set text template and a matched text;
The computing module 20 is used to: calculate the first similarity between the matched text and the pre-set text template according to the text similarity measurement algorithm based on word frequency; and/or calculate the second similarity between the matched text and the pre-set text template according to the semantic-based text similarity measurement algorithm;
The determining module 30 is used to: determine, when the first similarity and/or the second similarity meets the preset similarity condition, that the matched text is a text template similar to the pre-set text template.
The functions or operating steps implemented when the above program modules, such as the acquisition module 10, the computing module 20 and the determining module 30, are executed are substantially the same as in the above embodiment and are not described again here.
In addition, an embodiment of the present invention also proposes a computer readable storage medium on which a text template recognizer is stored. The text template recognizer can be executed by one or more processors to implement the following operations:
Obtaining a pre-set text template and a matched text;
Calculating the first similarity between the matched text and the pre-set text template according to the text similarity measurement algorithm based on word frequency; and/or
Calculating the second similarity between the matched text and the pre-set text template according to the semantic-based text similarity measurement algorithm;
When the first similarity and/or the second similarity meets the preset similarity condition, determining that the matched text is a text template similar to the pre-set text template.
The specific embodiments of the computer readable storage medium of the present invention are essentially the same as the embodiments of the above text template identification device and method, and are not repeated here.
It should be noted that the serial numbers of the above embodiments of the invention are for description only and do not represent the relative merits of the embodiments. The terms "include", "comprise" or any other variant thereof herein are intended to cover a non-exclusive inclusion, so that a process, device, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article or method. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, device, article or method that includes that element.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the scope of the patent. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A text template recognition method, characterized in that the method includes:
Obtaining a pre-set text template and a matched text;
Calculating a first similarity between the matched text and the pre-set text template according to a text similarity measurement algorithm based on word frequency; and/or
Calculating a second similarity between the matched text and the pre-set text template according to a semantic-based text similarity measurement algorithm;
When the first similarity and/or the second similarity meets a preset similarity condition, determining that the matched text is a text template similar to the pre-set text template.
2. The text template recognition method as described in claim 1, characterized in that calculating the first similarity between the matched text and the pre-set text template according to the text similarity measurement algorithm based on word frequency and/or calculating the second similarity between the matched text and the pre-set text template according to the semantic-based text similarity measurement algorithm includes:
Calculating the first similarity between the matched text and the pre-set text template using a vector space model;
Calculating the second similarity between the matched text and the pre-set text template using an LDA document topic generation model;
The first similarity and the second similarity meeting the preset similarity condition includes:
Performing linear weighting on the first similarity and the second similarity to obtain a third similarity between the matched text and the pre-set text template;
Judging whether the third similarity is greater than a third preset similarity;
If the third similarity is greater than the third preset similarity, determining that the first similarity and the second similarity meet the preset similarity condition.
3. The text template recognition method as claimed in claim 2, characterized in that performing linear weighting on the first similarity and the second similarity to obtain the third similarity between the matched text and the pre-set text template includes:
Inputting the first similarity and the second similarity into a predetermined linear weighting formula and outputting the third similarity between the matched text and the pre-set text template, the predetermined linear weighting formula being:
sim(p, q) = α·sim_LDA(p, q) + β·sim_TFIDF(p, q),
where p and q are the matched text and the pre-set text template respectively, sim_TFIDF(p, q) is the first similarity, sim_LDA(p, q) is the second similarity, sim(p, q) is the third similarity, and α and β are preset weighted values.
4. The text template recognition method as claimed in claim 2 or claim 3, characterized in that the method also includes:
Obtaining the weighted value used for linear weighting, including:
Assigning a first initial value to the weighted value, and calculating the third similarity according to the first initial value;
Judging, by a preset clustering algorithm, whether the matching template and the pre-set text template belong to the same category, to obtain a cluster result;
Judging, by means of the cluster result, whether the third similarity calculated according to the first initial value is accurate;
If the third similarity calculated according to the first initial value is judged to be accurate, determining the first initial value to be the weighted value used for linear weighting;
If the third similarity calculated according to the first initial value is judged to be inaccurate, adjusting the first initial value and performing again the operation of calculating the third similarity according to the first initial value.
5. The text template recognition method as described in claim 1, characterized in that the first similarity or the second similarity meeting the preset similarity condition includes:
The first similarity is greater than a first preset similarity, or the second similarity is greater than a second preset similarity.
6. A text template identification device, characterized in that the device includes a memory and a processor, the memory storing a text template recognizer that can run on the processor, and the text template recognizer implementing the following steps when executed by the processor:
Obtaining a pre-set text template and a matched text;
Calculating a first similarity between the matched text and the pre-set text template according to a text similarity measurement algorithm based on word frequency; and/or
Calculating a second similarity between the matched text and the pre-set text template according to a semantic-based text similarity measurement algorithm;
When the first similarity and/or the second similarity meets a preset similarity condition, determining that the matched text is a text template similar to the pre-set text template.
7. The text template identification device as claimed in claim 6, characterized in that calculating the first similarity between the matched text and the pre-set text template according to the text similarity measurement algorithm based on word frequency and/or calculating the second similarity between the matched text and the pre-set text template according to the semantic-based text similarity measurement algorithm includes:
Calculating the first similarity between the matched text and the pre-set text template using a vector space model;
Calculating the second similarity between the matched text and the pre-set text template using an LDA document topic generation model;
The first similarity and the second similarity meeting the preset similarity condition includes:
Performing linear weighting on the first similarity and the second similarity to obtain a third similarity between the matched text and the pre-set text template;
Judging whether the third similarity is greater than a third preset similarity;
If the third similarity is greater than the third preset similarity, determining that the first similarity and the second similarity meet the preset similarity condition.
8. The text template identification device as claimed in claim 7, characterized in that performing linear weighting on the first similarity and the second similarity to obtain the third similarity between the matched text and the pre-set text template includes:
Inputting the first similarity and the second similarity into a predetermined linear weighting formula and outputting the third similarity between the matched text and the pre-set text template, the predetermined linear weighting formula being:
sim(p, q) = α·sim_LDA(p, q) + β·sim_TFIDF(p, q),
where p and q are the matched text and the pre-set text template respectively, sim_TFIDF(p, q) is the first similarity, sim_LDA(p, q) is the second similarity, sim(p, q) is the third similarity, and α and β are preset weighted values.
9. The text template identification device as claimed in claim 7 or 8, characterized in that the text template recognizer, when executed by the processor, also implements the following steps:
Obtaining the weighted value used for linear weighting, including:
Assigning a first initial value to the weighted value, and calculating the third similarity according to the first initial value;
Judging, by a preset clustering algorithm, whether the matching template and the pre-set text template belong to the same category, to obtain a cluster result;
Judging, by means of the cluster result, whether the third similarity calculated according to the first initial value is accurate;
If the third similarity calculated according to the first initial value is judged to be accurate, determining the first initial value to be the weighted value used for linear weighting;
If the third similarity calculated according to the first initial value is judged to be inaccurate, adjusting the first initial value and performing again the operation of calculating the third similarity according to the first initial value.
10. A computer readable storage medium, characterized in that a text template recognizer is stored on the computer readable storage medium, and the text template recognizer can be executed by one or more processors to implement the steps of the text template recognition method as described in any one of claims 1 to 5.
CN201910109887.0A 2019-02-11 2019-02-11 Text template recognition methods, device and computer readable storage medium Pending CN109977995A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910109887.0A CN109977995A (en) 2019-02-11 2019-02-11 Text template recognition methods, device and computer readable storage medium
PCT/CN2019/088628 WO2020164204A1 (en) 2019-02-11 2019-05-27 Text template recognition method and apparatus, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910109887.0A CN109977995A (en) 2019-02-11 2019-02-11 Text template recognition methods, device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN109977995A true CN109977995A (en) 2019-07-05

Family

ID=67076907

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910109887.0A Pending CN109977995A (en) 2019-02-11 2019-02-11 Text template recognition methods, device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN109977995A (en)
WO (1) WO2020164204A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724738B (en) * 2021-08-31 2024-04-23 硅基(昆山)智能科技有限公司 Speech processing method, decision tree model training method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001155027A (en) * 1999-11-26 2001-06-08 Nec Corp Method, system and device for calculating similarity between documents, and recording medium recorded with program for similarity calculation
CN103377239A (en) * 2012-04-26 2013-10-30 腾讯科技(深圳)有限公司 Method and device for calculating inter-textual similarity
CN108829799A (en) * 2018-06-05 2018-11-16 中国人民公安大学 Based on the Text similarity computing method and system for improving LDA topic model
CN109165291A (en) * 2018-06-29 2019-01-08 厦门快商通信息技术有限公司 A kind of text matching technique and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544255B (en) * 2013-10-15 2017-01-11 常州大学 Text semantic relativity based network public opinion information analysis method
CN107992477B (en) * 2017-11-30 2019-03-29 北京神州泰岳软件股份有限公司 Text subject determines method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
华秀丽等: ""语义分析与词频统计相结合的中文文本相似度量方法研究"", 《计算机应用研究》, vol. 29, no. 3, pages 834 - 835 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033640A (en) * 2021-03-16 2021-06-25 深圳棱镜空间智能科技有限公司 Template matching method, device, equipment and computer readable storage medium
CN113033640B (en) * 2021-03-16 2023-08-15 深圳棱镜空间智能科技有限公司 Template matching method, device, equipment and computer readable storage medium
CN113780449A (en) * 2021-09-16 2021-12-10 平安科技(深圳)有限公司 Text similarity calculation method and device, storage medium and computer equipment
CN113780449B (en) * 2021-09-16 2023-08-25 平安科技(深圳)有限公司 Text similarity calculation method and device, storage medium and computer equipment

Also Published As

Publication number Publication date
WO2020164204A1 (en) 2020-08-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination