CN106708969B - Semantic core method for latent semantic vector space model based on document resource topic clustering co-occurrence - Google Patents

Semantic core method for latent semantic vector space model based on document resource topic clustering co-occurrence

Info

Publication number
CN106708969B
CN106708969B (application CN201611095873.0A)
Authority
CN
China
Prior art keywords
matrix
semantic
occurrence
keyword
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611095873.0A
Other languages
Chinese (zh)
Other versions
CN106708969A (en)
Inventor
牛奉高
张亚宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University
Priority to CN201611095873.0A priority Critical patent/CN106708969B/en
Publication of CN106708969A publication Critical patent/CN106708969A/en
Application granted granted Critical
Publication of CN106708969B publication Critical patent/CN106708969B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of semantic kernel methods for semantic vector space models, and in particular relates to a semantic kernel method for a co-occurrence latent semantic vector space model for document resource topic clustering. The invention mainly addresses the problems of existing semantic kernel methods for semantic vector space models: high complexity of semantic information extraction, insufficient extraction of semantic information, high model dimensionality, and high time and space complexity when applied to clustering algorithms. The disclosed method comprises the following steps: first, preprocessing the document data; second, performing word frequency statistics on the extracted keywords for the subsequent construction of the co-occurrence matrix; third, building a vector space model of the documents, using the presence or absence of each keyword in a document as its weight; fourth, constructing the co-occurrence latent semantic vector space model; fifth, constructing the semantic kernel function; and sixth, clustering the documents.

Description

Semantic core method for latent semantic vector space model based on document resource topic clustering co-occurrence
Technical Field
The invention belongs to the technical field of semantic kernel methods for semantic vector space models, and in particular relates to a semantic kernel method for a co-occurrence latent semantic vector space model for document resource topic clustering.
Background
In the big data era, people are confronted with massive unstructured text resources, and clustering, as an unsupervised machine learning method, is one of the main means of mining text resources. Text clustering differs from general data clustering in that the text must first be given a structured data representation. The basic model for text representation is the Vector Space Model (VSM), which maps each document to a high-dimensional sparse vector in the text space, so that the problem of computing semantic similarity between texts during clustering becomes a computation on vectors in a vector space: the similarity between texts is measured by the similarity between their vectors, and clustering is carried out on that basis. However, the VSM ignores the semantic relationships between words, so text similarity is computed inaccurately. The Generalized Vector Space Model (GVSM) was proposed to mine co-occurrence information between words and thereby improve the accuracy of text similarity computation, but it does not remedy the insufficient extraction of semantic information into the text representation vectors. Subsequent research has therefore mostly built Semantic Vector Space Models (SVSM) on top of the VSM or GVSM, using background knowledge such as ontologies or corpora to compute document similarity. However, general ontologies are costly to construct and their domain knowledge is incomplete.
Semantic kernels were first introduced by Siolas G as a kind of kernel function and were used for text mining as the underlying kernel in the support vector machine approach. Research on semantic kernel functions falls into two categories. One category extracts semantic relations between feature words using large-scale ontologies such as WordNet, Wikipedia, and HowNet as background knowledge to construct semantic kernel functions; however, ontology construction is complex and its domain knowledge is incomplete. The other category is based on statistical methods and constructs semantic kernels by mining latent concepts among feature words. Most existing semantic kernel functions applied to text resource clustering are built on the basic vector space model or the generalized vector space model; they extract semantic information insufficiently and yield poor clustering results.
Disclosure of Invention
The invention addresses the problems of existing semantic kernel methods for semantic vector space models (high complexity of semantic information extraction, insufficient extraction of semantic information, high model dimensionality, and high time and space complexity when applied to clustering algorithms) and provides a semantic kernel method for a co-occurrence latent semantic vector space model for text resource topic clustering.
The technical scheme adopted by the invention to solve the problems is as follows:
The semantic kernel method for the co-occurrence latent semantic vector space model for document resource topic clustering comprises the following steps:
The first step: preprocessing of the document data: clean the data, mark the documents, extract the keywords of each document, and retain the correspondence between each keyword and its documents;
The second step: perform word frequency statistics on the extracted keywords and arrange the keywords in descending order of word frequency, for later use in building the co-occurrence matrix;
The third step: construct a vector space model of the documents, using the presence or absence of each keyword in a document as its weight:
d_l = (a_{l1}, a_{l2}, ..., a_{lm})^T ∈ R^m, l = 1, 2, …, n.
where d_l is the representation vector in Euclidean space R^m of the l-th of the n documents; a_{lj} (j = 1, 2, …, m) is the weight of the j-th keyword in the l-th document: a_{lj} = 1 when the j-th keyword is a keyword of document d_l, and a_{lj} = 0 otherwise; l is the document index, n is the total number of documents, m is the number of keywords in the keyword set, R^m is Euclidean space, and T denotes the transpose. The "document-word" matrix of the document collection is A = (a_{lj})_{n×m}.
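As an illustration of this step (not part of the patent text), the following Python/NumPy sketch builds the binary "document-word" matrix A from a toy collection of keyword lists; the names and the toy data are hypothetical:

```python
import numpy as np
from collections import Counter

# Toy collection: each document is represented by its extracted keyword list.
docs = [
    ["library", "reader service", "electronic reading room"],
    ["library", "university", "innovation"],
    ["archive", "museum", "digitization"],
]

# Step two of the method: order the keywords by descending word frequency.
freq = Counter(k for d in docs for k in d)
keywords = [k for k, _ in freq.most_common()]
col = {k: j for j, k in enumerate(keywords)}

# Binary "document-word" matrix A = (a_lj), n x m: a_lj = 1 iff keyword j
# occurs in document l.
n, m = len(docs), len(keywords)
A = np.zeros((n, m))
for l, d in enumerate(docs):
    for k in d:
        A[l, col[k]] = 1.0
```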
The fourth step: constructing a co-occurrence potential semantic vector space model:
(1) Compute the co-occurrence intensity matrix
The co-occurrence matrix between keywords is C = A^T A = (c_{ij})_{m×m}, where, for i ≠ j, c_{ij} is the co-occurrence frequency of the i-th and j-th keywords, and for i = j, c_{ii} is the total frequency of the i-th keyword.
The co-occurrence intensity matrix B = (b_{ij})_{m×m} is then calculated as
b_{ij} = c_{ij} / sqrt(c_{ii} c_{jj}),
where c_{11}, c_{22}, …, c_{mm} are the frequencies of the 1st, 2nd, …, m-th keywords; for i ≠ j, b_{ij} is the co-occurrence strength of the i-th and j-th keywords, and for i = j, b_{ii} = 1, that is, all diagonal elements of the matrix B are 1.
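Continuing the sketch above (again an illustration, not the patent's own code), C and B follow directly from A; the normalization b_ij = c_ij / sqrt(c_ii c_jj) is exactly what makes every diagonal element of B equal to 1:

```python
# Co-occurrence matrix C = A^T A: c_ii is the total frequency of keyword i,
# and c_ij (i != j) counts the documents where keywords i and j co-occur.
C = A.T @ A

# Co-occurrence intensity matrix B: b_ij = c_ij / sqrt(c_ii * c_jj).
# Every keyword in the set occurs at least once, so diag(C) > 0.
d = np.sqrt(np.diag(C))
B = C / np.outer(d, d)
```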
(2) Extraction of co-occurrence information
Let I_{l1} denote the index set of the j with a_{lj} = 1, namely I_{l1} = {j | a_{lj} = 1}. The latent semantic similarity of the l-th document to the j-th keyword is defined as
q_{lj} = max{ b_{jt} : t ∈ I_{l1} },
that is, the largest element of the set {b_{jt}} satisfying t ∈ I_{l1}. When a_{lj} = 1, q_{lj} = 1; when a_{lj} = 0, 0 ≤ q_{lj} < 1.
(3) Co-occurrence latent semantic vector space model (CLSVSM)
φ(d_l) = (q_{l1}, q_{l2}, ..., q_{lm})^T ∈ R^m, l = 1, 2, …, n,
with q_{lj} defined as above. The new "document-word" matrix based on the CLSVSM is:
Q = (q_{lj})_{n×m}.
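A sketch of the supplementation step, under the reading that q_lj is the maximum co-occurrence strength b_jt over the keywords t present in document l (which yields q_lj = 1 whenever a_lj = 1, since b_jj = 1, and leaves q_lj = 0 where no co-occurrence information exists):

```python
# Build the CLSVSM "document-word" matrix Q = (q_lj) from A and B.
Q = np.zeros_like(A)
for l in range(n):
    I_l1 = np.flatnonzero(A[l])       # index set I_l1 = {j | a_lj = 1}
    Q[l] = B[:, I_l1].max(axis=1)     # q_lj = max_{t in I_l1} b_jt
```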
the fifth step: construction of semantic Kernel function
(1) Singular value decomposition of the transpose of the new "chapter-word" matrix
Obtaining Q through matlab software operation according to singular value decomposition theoryTThe decomposition formula (2):
Figure GDA0002194221970000042
wherein QTIs a new "word-piece" matrix of dimension m x n; u, V are singular matrices, are square matrices with dimensions m and n, respectively, and are orthogonal matrices, UUT=I,VVT=I;
Figure GDA0002194221970000043
Is a matrix of dimension m × n, assuming a "word-term" matrix QTIs r, Δ ═ diag (δ)1δ2δ3…δr),δi(i-1, 2, …, r) is a non-zero singular value and is arranged in order of magnitude as δ1≥δ2≥…≥δrCorrelation matrix Q between keywordsTQ=UΣVTTUT=UΣΣTUT=UΛUTSingular matrix U being equal to QTMatrix of orthogonal unit eigenvectors of Q, matrix
Figure GDA0002194221970000044
Is a m x m dimensional square matrix with elements on the diagonal of QTThe characteristic value corresponding to the Q is set,
Figure GDA0002194221970000045
a diagonal matrix composed of non-zero eigenvalues;
(2) Feature extraction and dimension reduction
Select the first k largest eigenvalues, where the size of k depends on the required cumulative contribution rate of the eigenvalues; for a cumulative contribution rate of no less than 90%, k is
k = min{ k : (Σ_{i=1}^{k} λ_i) / (Σ_{i=1}^{r} λ_i) ≥ 0.90 }.
At the same time select the first k columns of the corresponding singular matrices U and V, realizing the dimension reduction of the singular matrices; denote them U_k and V_k respectively. A k-th order approximation of Q^T is then obtained, namely X_k^T = U_k Σ_k V_k^T.
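The patent performs the decomposition in MATLAB; a NumPy equivalent of steps (1) and (2), including the choice of k by cumulative contribution rate, might look as follows (illustrative only, continuing the sketch above):

```python
# SVD of the "word-document" matrix Q^T = U Sigma V^T.
U, sigma, Vt = np.linalg.svd(Q.T, full_matrices=False)

# The eigenvalues of Q^T Q are the squared singular values.
lam = sigma ** 2

# Smallest k whose cumulative contribution rate reaches 90%.
contrib = np.cumsum(lam) / lam.sum()
k = int(np.searchsorted(contrib, 0.90) + 1)

# Dimension-reduced singular matrices.
U_k = U[:, :k]
V_k = Vt[:k, :].T
```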
(3) Semantic kernel based on the CLSVSM
k(d_l, d_s) = (U_k^T φ(d_l))^T (U_k^T φ(d_s)) = φ^T(d_l) U_k U_k^T φ(d_s), l, s = 1, …, n.
The kernel matrix consistent with this semantic kernel function is:
K = Q U_k U_k^T Q^T.
The semantic kernel based on the CLSVSM is abbreviated CLSVSM_K.
The sixth step: document clustering
Represent the documents via the semantic kernel function, take the kernel matrix as the similarity matrix between the documents, and select a clustering algorithm to perform document topic clustering.
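A sketch of the kernel construction and clustering step. Because the kernel has the explicit feature map d_l -> U_k^T φ(d_l), running k-means on those k-dimensional features (as below, with scikit-learn) is equivalent to kernel k-means with the kernel matrix K as the document similarity matrix; the choice of three clusters mirrors the three topics of the embodiment and is otherwise an assumption:

```python
from sklearn.cluster import KMeans

# Kernel matrix K = Q U_k U_k^T Q^T, an n x n document similarity matrix.
P = Q @ U_k          # row l is U_k^T phi(d_l)
K = P @ P.T

# Document topic clustering on the reduced features.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(P)
```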
With this technical scheme, the semantic kernel function extracts richer semantic information than the semantic kernel functions of previous research, avoids background knowledge such as ontologies that are costly to construct and incomplete, and improves the clustering effect by more than 20%; when extracting semantic information it both incorporates synonymy information among the text feature words and reduces the dimension of the feature word space.
Detailed Description
Example 1
The first step: data preprocessing: clean the data, mark the documents, extract the keywords of each document, and retain the correspondence between each keyword and its documents.
The data come from CNKI. According to its classification, 300 documents were selected from each of three topics under information science: "publishing", "library and information science and digital libraries", and "archives and museums". After removing 4 documents without keywords, 896 documents were finally obtained (299 on "publishing", 298 on "library and information science and digital libraries", and 299 on "archives and museums"), with 2509 distinct keywords. That is, the number of documents is n = 896 and the number of keywords is m = 2509. The table below shows the first 20 documents and all of their keywords. In Table 1, LM is the document category, ID is the document number, T1 is the title, and K1-K10 are the keywords of the document.
Table 1: List of documents and corresponding keywords (part)
LM | ID | T1 | K1 | K2 | K3 | K4 | K5 | K6 | K7 | K8 | K9 | K10
(The 20 rows for documents 1001-1020 are not reproduced here: in this text version their Chinese titles and keywords survive only as unrecoverable word-by-word machine translations.)
The second step: construct the keyword space, perform word frequency statistics on the extracted keywords, and arrange the keywords in descending order of word frequency. Table 2 shows the first 20 keywords and their word frequencies from the experiment:
table 2: keyword frequency statistics (part)
The third step: construct a vector space model of the documents, using the presence or absence of each keyword in a document as its weight:
d_l = (a_{l1}, a_{l2}, ..., a_{l,2509})^T ∈ R^{2509}, l = 1, 2, …, 896,
where d_l is the representation vector in Euclidean space R^{2509} of the l-th of the 896 documents (there are 2509 keywords, so the Euclidean space is R^{2509}); a_{lj} (j = 1, 2, …, 2509) is the weight of the j-th keyword in the l-th document; l is the document number; T denotes the transpose; a_{lj} = 1 when the j-th keyword is a keyword of document d_l, and 0 otherwise. The "document-word" matrix of the document collection is A = (a_{lj})_{896×2509}. Table 3 presents, from the Excel data of the experiment, the first 20 rows and first 15 columns of matrix A, whose dimension is 896 × 2509. Row 1 of Table 3 records the 2509 keywords; column 1 records the category information; column 2 records the document IDs; the value 897 at row 1, column 1 refers to the 897 rows used in the Excel sheet.
Table 3: VSM-based "document-word" matrix A (part)
The fourth step: constructing a co-occurrence potential semantic vector space model:
(1) computing co-occurrence intensity matrices
Co-occurrence matrix between keywords C ═ ATA=(cij)2509×2509Table 4 presents some of the results of the experiment for matrix C, where C is when i ≠ jijCo-occurrence frequency of ith keyword and jth keyword, when i equals j, ciiThe total frequency of the ith keyword, i.e., the value on the diagonal. The table has row 1 and column 1 as keywords.
Table 4: keyword co-occurrence matrix C (part)
The co-occurrence intensity matrix B = (b_{ij})_{2509×2509} is then calculated as
b_{ij} = c_{ij} / sqrt(c_{ii} c_{jj}),
where c_{11}, c_{22}, …, c_{2509,2509} are the frequencies of the 1st, 2nd, …, 2509th keywords; for i ≠ j, b_{ij} is the co-occurrence strength of the i-th and j-th keywords, and for i = j, b_{ii} = 1. Table 5 shows part of the matrix B from the experiment. Row 1 and column 1 of the table are keywords.
Table 5: co-occurrence intensity matrix B (part)
(2) Extraction of co-occurrence information
For the entries a_{lj} = 0 of the matrix A, co-occurrence information is supplemented, namely for the entries with value 0 in Table 3. The method is as follows: let I_{l1} denote the index set of the j with a_{lj} = 1, namely I_{l1} = {j | a_{lj} = 1}. The latent semantic similarity of the l-th document to the j-th keyword is
q_{lj} = max{ b_{jt} : t ∈ I_{l1} },
the largest element of the set {b_{jt}} satisfying t ∈ I_{l1}. When a_{lj} = 1, q_{lj} = 1; when a_{lj} = 0, 0 ≤ q_{lj} < 1. The table below shows the q_{lj} for entries with a_{lj} = 0; only the first 20 rows and first 15 columns of the experimental results are shown. Not all entries with a_{lj} = 0 can be supplemented; those that cannot remain 0, and Table 6 shows only the values that could be supplemented. Column 1 of Table 6 records the category information, column 2 the document ID, and row 1 the 2509 keywords.
Table 6: co-occurrence information supplement matrix (part)
(3) Co-occurrence latent semantic vector space model (CLSVSM)
φ(d_l) = (q_{l1}, q_{l2}, ..., q_{l,2509})^T ∈ R^{2509}, l = 1, 2, …, 896,
with q_{lj} as defined above. The results for the new "document-word" matrix Q = (q_{lj})_{896×2509} based on the CLSVSM are shown in the following table, of which only the first 20 rows and 15 columns are given; column 1 records the document category information, column 2 the document ID, and row 1 the 2509 keywords:
table 7: new "chapter-word" matrix Q (part) from CLSVSM
The fifth step: construction of semantic Kernel function
(4) Transpose Q of the corresponding "chapter-word" matrix Q of Table 7TPerforming singular value decomposition
Obtaining Q through matlab software operation according to singular value decomposition theoryTThe decomposition formula (2):
Figure GDA0002194221970000112
to QTThe singular matrices U and V corresponding to the singular value decomposition are shown in tables 8 and 9, and the matrix Σ has a value shown in table 10. Table 8 row 1 and column 1 are keywords; table 9 rows 1 and columns 1 identify documents, Table 10 rows 1 identify documents, and columns 1 identify keywords. Simultaneously solving the matrix QTRank r of 896.
Table 8: singular matrix U (part)
Table 9: singular matrix V (part)
Table 10: matrix Σ (part)
Calculating Σ Σ^T yields the matrix Λ; the first 20 rows and first 15 columns of the experimental result are shown in Table 11. Λ is a square matrix of dimension 2509 × 2509.
Table 11: matrix Λ (part)
(2) Feature extraction and dimension reduction
Select the first k largest eigenvalues. The size of k depends on the required cumulative contribution rate of the eigenvalues; here the cumulative contribution rate is required to be no less than 90%. The sum of the eigenvalues obtained with MATLAB is 7.5457e+03, that is,
Σ_{i=1}^{896} λ_i = 7.5457 × 10^3.
When the cumulative contribution rate of the eigenvalues is no less than 90%, k = 247, namely:
(Σ_{i=1}^{247} λ_i) / (Σ_{i=1}^{896} λ_i) ≥ 0.90.
Therefore the first 247 eigenvalues of the matrix Λ are selected, together with the corresponding first 247 columns of the singular matrices U and V, realizing the dimension reduction of the singular matrices; they are denoted U_247 and V_247 respectively. Similarly, when the cumulative contribution rate of the eigenvalues is required to be no less than 95% and 98%, the values of k are 356 and 468 respectively.
(3) Semantic kernel based on the CLSVSM
k(d_l, d_s) = (U_247^T φ(d_l))^T (U_247^T φ(d_s)) = φ^T(d_l) U_247 U_247^T φ(d_s), l, s = 1, 2, …, 896.
The kernel matrix consistent with this semantic kernel function is:
K = Q U_247 U_247^T Q^T.
The semantic kernel based on the CLSVSM is abbreviated CLSVSM_K.
The kernel matrix K obtained in the experiment is a square matrix of dimension 896 × 896; its first 20 rows and first 15 columns are shown in Table 12. Row 1 and column 1 of Table 12 are the ID information of the documents.
Table 12: Kernel matrix K (part)
The sixth step: document clustering
Represent the documents via the semantic kernel function, take the kernel matrix as the similarity matrix between the documents, and select a clustering algorithm to perform document clustering. The experiment uses the k-means clustering algorithm. The comparison results are shown in Tables 13 and 14:
The experiment compares the clustering results under several clustering schemes, with 22 experiments in total. The results are shown in Table 13.
Table 13: comparison of experimental results of CLSVSM and VSM
The experimental results show that CLSVSM is far superior to VSM, and that CLSVSM performs best when scheme D-I2 is selected.
The semantic kernel of the co-occurrence latent semantic vector space model is then compared with the model itself and with its linear kernel. When constructing the semantic kernel, the parameter k is chosen so that the sum of the first k eigenvalues accounts for 90%, 95%, and 98% of the sum of all eigenvalues; the resulting semantic kernel functions are abbreviated 90%CLSVSM_K, 95%CLSVSM_K, and 98%CLSVSM_K respectively. With the optimal scheme D-I2 selected, each model is run in 50 experiments, and the clustering results are evaluated by the means of three indices (entropy, purity, and F value) over the repeated experiments. The comparison is shown in Table 14.
Table 14: clustering comparisons of different methods
Method | Entropy ↓ | Purity ↑ | F value ↑ | Dimension of feature word space ↓
CLSVSM | 0.596±0.039 | 0.768±0.037 | 0.776±0.034 | 2509
Linear kernel | 0.571±0.016 | 0.791±0.014 | 0.795±0.009 | 2509
90%CLSVSM_K | 0.599±0.017 | 0.785±0.006 | 0.785±0.006 | 247※
95%CLSVSM_K | 0.571±0.043 | 0.801±0.004※ | 0.798±0.004 | 356
98%CLSVSM_K | 0.565±0.003※ | 0.797±0.001 | 0.798±0.001※ | 468
In the table above, ↓ indicates that smaller results are better, and ↑ indicates that larger results are better; the best result in each column is marked with ※. Higher purity and F values indicate better clustering; smaller entropy values are better.
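For reference, the three indices can be computed as follows; the patent does not spell out its formulas, so this sketch assumes the standard external-evaluation definitions of entropy, purity, and (class-weighted best-match) F value:

```python
import numpy as np

def evaluate(labels, classes, n_clusters, n_classes):
    """Entropy, purity, and F value of a clustering against true classes.

    labels and classes are integer indices in [0, n_clusters) and
    [0, n_classes) respectively, one pair per document.
    """
    n = len(labels)
    # Contingency table: rows are clusters, columns are true classes.
    M = np.zeros((n_clusters, n_classes))
    for c, t in zip(labels, classes):
        M[c, t] += 1.0
    cluster_sizes = M.sum(axis=1, keepdims=True)
    class_sizes = M.sum(axis=0)

    # Purity: fraction of documents in the majority class of their cluster.
    purity = M.max(axis=1).sum() / n

    # Entropy: cluster-size-weighted average of per-cluster class entropy.
    p = M / np.maximum(cluster_sizes, 1.0)
    h = -(np.where(p > 0, p * np.log2(np.where(p > 0, p, 1.0)), 0.0)).sum(axis=1)
    entropy = (cluster_sizes[:, 0] / n) @ h

    # F value: for each class, the best F1 over clusters, weighted by class size.
    precision = M / np.maximum(cluster_sizes, 1.0)
    recall = M / np.maximum(class_sizes, 1.0)
    f1 = 2.0 * precision * recall / np.maximum(precision + recall, 1e-12)
    F = (class_sizes / n) @ f1.max(axis=0)
    return entropy, purity, F
```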
Both groups of experimental results show that the co-occurrence latent semantic vector space model greatly improves clustering precision over previous models, and that the semantic kernel constructed from it markedly reduces the dimension of the feature word space while improving clustering precision, thereby lowering the time and space complexity of the clustering algorithm. Applied to text clustering, the method therefore extracts richer semantic information while reducing the dimensionality of the feature word space.

Claims (1)

1. The semantic kernel method for the co-occurrence latent semantic vector space model for document resource topic clustering, characterized by comprising the following steps:
The first step: preprocessing of the document data: clean the data, mark the documents, extract the keywords of each document, and retain the correspondence between each keyword and its documents;
The second step: perform word frequency statistics on the extracted keywords and arrange the keywords in descending order of word frequency, for later use in building the co-occurrence matrix;
The third step: construct a vector space model of the documents, using the presence or absence of each keyword in a document as its weight:
d_l = (a_{l1}, a_{l2}, ..., a_{lm})^T ∈ R^m, l = 1, 2, …, n,
where d_l is the representation vector in Euclidean space R^m of the l-th of the n documents; a_{lj} (j = 1, 2, …, m) is the weight of the j-th keyword in the l-th document: a_{lj} = 1 when the j-th keyword is a keyword of document d_l, and a_{lj} = 0 otherwise; l is the document index, n is the total number of documents, m is the number of keywords in the keyword set, R^m is Euclidean space, and T denotes the transpose; the "document-word" matrix of the document collection is A = (a_{lj})_{n×m};
The fourth step: constructing a co-occurrence potential semantic vector space model:
(1) computing co-occurrence intensity matrices
Co-occurrence matrix between keywords C ═ ATA=(cij)m×mWherein, when i ≠ j, cijCo-occurrence frequency of ith keyword and jth keyword, when i equals j, ciiThe total frequency of the ith keyword;
a co-occurrence intensity matrix B is then calculated,
Figure FDA0002194221960000021
Figure FDA0002194221960000022
wherein, c11,c22,…,cmmFrequency numbers of the 1 st keyword, the 2 nd keyword, … … th keyword and the m th keyword respectively; when i ≠ j, bijIs the co-occurrence strength of the ith keyword and the jth keyword, when i is equal to j, b isii1, that is, all diagonal elements of the matrix B are 1;
(2) Extraction of co-occurrence information
Let I_{l1} denote the index set of the j with a_{lj} = 1, namely I_{l1} = {j | a_{lj} = 1}; the latent semantic similarity of the l-th document to the j-th keyword is
q_{lj} = max{ b_{jt} : t ∈ I_{l1} },
the largest element of the set {b_{jt}} satisfying t ∈ I_{l1}; when a_{lj} = 1, q_{lj} = 1; when a_{lj} = 0, 0 ≤ q_{lj} < 1;
(3) Co-occurrence latent semantic vector space model (CLSVSM)
φ(d_l) = (q_{l1}, q_{l2}, ..., q_{lm})^T ∈ R^m, l = 1, 2, …, n,
with q_{lj} defined as above; the new "document-word" matrix based on the CLSVSM is:
Q = (q_{lj})_{n×m};
the fifth step: construction of semantic Kernel function
(1) Singular value decomposition of the transpose of the new "chapter-word" matrix
Obtaining Q through matlab software operation according to singular value decomposition theoryTThe decomposition formula (2):
Figure FDA0002194221960000032
wherein QTIs a new "word-piece" matrix of dimension m x n; u, V are singular matrices, are square matrices with dimensions m and n, respectively, and are orthogonal matrices, UUT=I,VVT=I;
Figure FDA0002194221960000033
Is a matrix of dimension m × n, assuming a "word-term" matrix QTIs r, Δ ═ diag (δ)1δ2δ3… δr),δi(i-1, 2, …, r) is a non-zero singular value and is arranged in order of magnitude as δ1≥δ2≥…≥δrCorrelation matrix Q between keywordsTQ=UΣVTTUT=UΣΣTUT=UΛUTSingular matrix U being equal to QTMatrix of orthogonal unit eigenvectors of Q, matrix
Figure FDA0002194221960000034
Is a m x m dimensional square matrix with elements on the diagonal of QTThe characteristic value corresponding to the Q is set,
Figure FDA0002194221960000035
a diagonal matrix composed of non-zero eigenvalues;
(2) Feature extraction and dimension reduction
Select the first k largest eigenvalues, where the size of k depends on the required cumulative contribution rate of the eigenvalues; for a cumulative contribution rate of no less than 90%, k is
k = min{ k : (Σ_{i=1}^{k} λ_i) / (Σ_{i=1}^{r} λ_i) ≥ 0.90 };
at the same time select the first k columns of the corresponding singular matrices U and V, realizing the dimension reduction of the singular matrices, denoted U_k and V_k respectively; a k-th order approximation of Q^T is then obtained, namely X_k^T = U_k Σ_k V_k^T;
(3) Semantic kernel based on the CLSVSM
k(d_l, d_s) = (U_k^T φ(d_l))^T (U_k^T φ(d_s)) = φ^T(d_l) U_k U_k^T φ(d_s), l, s = 1, …, n;
the kernel matrix consistent with this semantic kernel function is:
K = Q U_k U_k^T Q^T;
the semantic kernel based on the CLSVSM is abbreviated CLSVSM_K;
The sixth step: document clustering
Represent the documents via the semantic kernel function, take the kernel matrix as the similarity matrix between the documents, and select a clustering algorithm to perform document topic clustering.
CN201611095873.0A 2016-12-02 2016-12-02 Semantic core method for latent semantic vector space model based on document resource topic clustering co-occurrence Active CN106708969B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611095873.0A CN106708969B (en) 2016-12-02 2016-12-02 Semantic core method for latent semantic vector space model based on document resource topic clustering co-occurrence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611095873.0A CN106708969B (en) 2016-12-02 2016-12-02 Semantic core method for latent semantic vector space model based on document resource topic clustering co-occurrence

Publications (2)

Publication Number Publication Date
CN106708969A CN106708969A (en) 2017-05-24
CN106708969B true CN106708969B (en) 2020-01-10

Family

ID=58934486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611095873.0A Active CN106708969B (en) 2016-12-02 2016-12-02 Semantic core method for latent semantic vector space model based on document resource topic clustering co-occurrence

Country Status (1)

Country Link
CN (1) CN106708969B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108933691B (en) * 2017-05-26 2021-09-07 华为技术有限公司 Method for obtaining standard configuration template of network equipment and computing equipment
CN107273474A (en) * 2017-06-08 2017-10-20 成都数联铭品科技有限公司 Autoabstract abstracting method and system based on latent semantic analysis
CN107329954B (en) * 2017-06-29 2020-10-30 浙江工业大学 Topic detection method based on document content and mutual relation
CN108647236B (en) * 2018-03-30 2021-07-13 山东管理学院 Chinese medicine prescription vector space model method and device based on word co-occurrence
CN108647213A (en) * 2018-05-21 2018-10-12 辽宁工程技术大学 A kind of composite key semantic relevancy appraisal procedure based on coupled relation analysis
CN108717411B (en) * 2018-05-23 2022-04-08 安徽数据堂科技有限公司 Questionnaire design auxiliary system based on big data
CN108960296B (en) * 2018-06-14 2022-03-29 厦门大学 Model fitting method based on continuous latent semantic analysis
CN108874755B (en) * 2018-06-28 2020-12-08 电子科技大学 MeSH-based medical literature set similarity measurement method
CN109255026B (en) * 2018-08-23 2021-06-25 云南师范大学 Learning demand analysis method based on common word analysis and cluster analysis
CN109829634B (en) * 2019-01-18 2021-02-26 北京工业大学 Self-adaptive college patent and scientific research team identification method
CN109840325B (en) * 2019-01-28 2020-09-29 山西大学 Text semantic similarity measurement method based on point mutual information
CN109829109B (en) * 2019-01-28 2021-02-02 山西大学 Recommendation method based on co-occurrence analysis
CN111259150B (en) * 2020-01-20 2022-07-19 山西大学 Document representation method based on word frequency co-occurrence analysis


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970729A (en) * 2014-04-29 2014-08-06 河海大学 Multi-subject extracting method based on semantic categories
CN104778204A (en) * 2015-03-02 2015-07-15 华南理工大学 Multi-document subject discovery method based on two-layer clustering

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Bao Jun-Peng et al., "Document copy detection based on kernel method," International Conference on Natural Language Processing and Knowledge Engineering 2003, Proceedings, 2004-03-22, pp. 250-256. *
Georges Siolas et al., "Support Vector Machines based on a semantic kernel for text categorization," Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), 2002-08-06, pp. 205-209. *
牛奉高 et al., "基于共现潜在语义向量空间模型的语义核构建" (Construction of a semantic kernel based on the co-occurrence latent semantic vector space model), 《情报学报》 (Journal of the China Society for Scientific and Technical Information), vol. 36, no. 8, 2017-08-24, pp. 834-842. *
牛奉高 et al., "数字文献资源高维向量表示模型与聚类检验" (A high-dimensional vector representation model for digital document resources and clustering tests), 《情报学报》, vol. 33, no. 10, 2015-01-22, pp. 53-66. *
牛奉高, "数字文献资源高维聚合模型研究" (Research on high-dimensional aggregation models for digital document resources), 《中国博士学位论文全文数据库 信息科技辑》 (China Doctoral Dissertations Full-text Database, Information Science and Technology), no. 6, 2015-06-15, I143-3. *

Also Published As

Publication number Publication date
CN106708969A (en) 2017-05-24

Similar Documents

Publication Publication Date Title
CN106708969B (en) Semantic core method for latent semantic vector space model based on document resource topic clustering co-occurrence
Chen et al. Experimental explorations on short text topic mining between LDA and NMF based Schemes
Li et al. A co-attention neural network model for emotion cause analysis with emotional context awareness
Yu et al. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering
Calvo et al. Emotions in text: dimensional and categorical models
Blacoe et al. A quantum-theoretic approach to distributional semantics
Greene et al. Producing accurate interpretable clusters from high-dimensional data
CN111078852A (en) College leading-edge scientific research team detection system based on machine learning
Pocostales Nuig-unlp at semeval-2016 task 13: A simple word embedding-based approach for taxonomy extraction
Sadr et al. Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms
Dehghan et al. Mining shape of expertise: A novel approach based on convolutional neural network
Kundu et al. A nil-aware answer extraction framework for question answering
Saha et al. Development of a practical system for computerized evaluation of descriptive answers of middle school level students
Alhawarat Extracting topics from the holy Quran using generative models
Li et al. A unified model for document-based question answering based on human-like reading strategy
Subramaniam et al. Modified firefly algorithm and fuzzy c-mean clustering based semantic information retrieval
Darmalaksana et al. Latent semantic analysis and cosine similarity for hadith search engine
Niraula et al. Combining word representations for measuring word relatedness and similarity
Meena et al. Evaluation of the descriptive type answers using hyperspace analog to language and self-organizing map
Wu et al. Multiple hypergraph clustering of web images by miningword2image correlations
AlMahmoud et al. The effect of clustering algorithms on question answering
Ben Hassine et al. A novel imbalanced data classification approach for suicidal ideation detection on social media
Thalor A descriptive answer evaluation system using cosine similarity technique
Zhong et al. A novel feature selection method based on probability latent semantic analysis for Chinese text classification
Gong The assessment research and preventive of student's health by using deep belief networks and restricted boltzmann machine

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant