CN106708969B

CN106708969B - Semantic core method for latent semantic vector space model based on document resource topic clustering co-occurrence

Info

Publication number: CN106708969B
Application number: CN201611095873.0A
Authority: CN
Inventors: 牛奉高; 张亚宇
Original assignee: Shanxi University
Current assignee: Shanxi University
Priority date: 2016-12-02
Filing date: 2016-12-02
Publication date: 2020-01-10
Anticipated expiration: 2036-12-02
Also published as: CN106708969A

Abstract

The invention belongs to the technical field of semantic kernel methods of semantic vector space models, and particularly relates to a semantic kernel method of a latent semantic vector space model based on document resource topic clustering co-occurrence. The invention mainly solves the problems that the semantic information extraction complexity is higher, the semantic information extraction is insufficient, the dimension of the model is higher, the time and space complexity is higher when the current semantic vector space model semantic kernel method is applied to a clustering algorithm, and the like. The invention discloses a semantic core method of a document resource topic clustering co-occurrence potential semantic vector space model, which comprises the following steps: preprocessing literature data; performing word frequency statistics on the extracted keywords for subsequent co-occurrence matrix establishment; thirdly, establishing a vector space model represented by the literature by taking whether the keywords appear in the literature as weights; fourthly, constructing a co-occurrence potential semantic vector space model; fifthly, constructing a semantic kernel function; and sixthly, clustering the literature.

Description

Semantic core method for latent semantic vector space model based on document resource topic clustering co-occurrence

Technical Field

The invention belongs to the technical field of semantic kernel methods of semantic vector space models, and particularly relates to a semantic kernel method of a latent semantic vector space model based on document resource topic clustering co-occurrence.

Background

In the big data era, a large amount of unstructured text resources are brought to people, and clustering as an unsupervised machine learning method is one of main means for achieving text resource mining. Text clustering is different from general data clustering, and text information is subjected to data structure representation firstly. The basic model of text representation is a Vector Space Model (VSM), which maps each document to a high-dimensional sparse vector in text space, so that the semantic similarity calculation problem between texts when performing text clustering can be converted into the calculation of vectors in vector space, that is: and measuring the similarity between texts by calculating the similarity between vectors so as to realize text clustering. However, the VSM ignores semantic relationships from word to word, resulting in inaccurate computation of text similarity. The generalized vector space model (GVSSM) is provided to mine co-occurrence information between words, so that the accuracy of text similarity calculation is improved, but the problem of insufficient extraction of semantic information in a text representation vector cannot be changed. Therefore, the following research is mainly based on VSM or GVSM models, and the Semantic Vector Space Model (SVSM) is constructed by background knowledge such as ontology or corpus to realize the calculation of document similarity. However, the general ontology has high construction cost and incomplete domain knowledge.

Semantic kernels were first introduced by Siolas G as one of the concepts of kernel functions and used for text mining as the underlying kernel in the support vector machine approach. The research of semantic kernel functions has been divided into two categories: one type of research mainly extracts semantic relations between feature words by using large-scale ontologies such as WordNet, Wikipedia and HowNet as knowledge backgrounds to realize the construction of semantic kernel functions, however, ontology knowledge construction complexity is high and domain knowledge is not complete; the other type of research is based on a statistical method, a semantic core is constructed by excavating potential concepts among feature words, most of the existing research of semantic core functions applied to text resource clustering is constructed based on a basic vector space model and a generalized vector space model, semantic information is insufficiently extracted, and clustering effect is poor.

Disclosure of Invention

The invention mainly aims at the problems that the semantic information extraction complexity is high, the semantic information extraction is insufficient, the model dimension is high, the time and space complexity is high when the method is applied to a clustering algorithm and the like in the existing semantic kernel method of a semantic vector space model, and provides a text resource topic clustering co-occurrence potential semantic vector space model semantic kernel method.

The technical scheme adopted by the invention to solve the problems is as follows:

the method for semantic core of the latent semantic vector space model co-existing in document resource topic clustering comprises the following steps:

the first step is as follows: preprocessing of literature data: data cleaning, document marking, extracting key words of each document, and keeping the corresponding relation between the key words and the corresponding documents;

the second step is that: carrying out word frequency statistics on the extracted keywords, and arranging the keywords in descending order according to the word frequency for later use in establishing a co-occurrence matrix;

the third step: and (3) constructing a vector space model represented by the literature by taking whether the keywords appear in the literature as weights:

d_l＝(a_l1,a_l2,...,a_lm)^T∈R^m，l＝1,2,…,n.

wherein: d_lIs that of n documents the first document is in Euclidean space R^mRepresents a vector of (1), a_lj(j ═ 1,2, …, m) is the weight of the jth keyword in the l th document, when the jth keyword is document d_lWhen a keyword of (2)_ljEqual to 1, otherwise 0; l is document serial number, n is document total space, m is number of total keywords in keyword set, R^mFor Euclidean space, T represents transpose operation, and the "article-word" matrix A of the literature set is (a)_lj)_n×m；

The fourth step: constructing a co-occurrence potential semantic vector space model:

(1) computing co-occurrence intensity matrices

Co-occurrence matrix between keywords C ═ A^TA＝(c_ij)_m×mWherein, when i ≠ j, c_ijCo-occurrence frequency of ith keyword and jth keyword, when i equals j, c_iiThe total frequency of the ith keyword;

a co-occurrence intensity matrix B is then calculated,

wherein, c₁₁,c₂₂,…,c_mmFrequency numbers of the 1 st keyword, the 2 nd keyword, … … th keyword and the m th keyword respectively; when i ≠ j, b_ijIs the co-occurrence strength of the ith keyword and the jth keyword, when i is equal to j, b is_ii1, that is, all diagonal elements of the matrix B are 1;

(2) extraction of co-occurrence information

Note a_ljJ has an index set of 1_l1Namely: i is_l1＝{j|a_lj1, this term is calledThe latent semantic similarity of the l document to the j keyword,

set of representations b_jtMeet the condition t ∈ I in }_l1Is recorded as

When a is_ljWhen 1, q_lj1 is ═ 1; when a is_ljWhen 0 is equal to 0, q is not less than 0_lj＜1；

(3) Co-occurrence latent semantic vector space model (CLSVSM)

Wherein:

the new "chapter-word" matrix based on CLSVSM is:

the fifth step: construction of semantic Kernel function

(1) Singular value decomposition of the transpose of the new "chapter-word" matrix

Obtaining Q through matlab software operation according to singular value decomposition theory^TThe decomposition formula (2):

wherein Q^TIs a new "word-piece" matrix of dimension m x n; u, V are singular matrices, are square matrices with dimensions m and n, respectively, and are orthogonal matrices, UU^T＝I，VV^T＝I；

Is a matrix of dimension m × n, assuming a "word-term" matrix Q^TIs r, Δ ═ diag (δ)₁δ₂δ₃…δ_r)，δ_i(i-1, 2, …, r) is a non-zero singular value and is arranged in order of magnitude as δ₁≥δ₂≥…≥δ_rCorrelation matrix Q between keywords^TQ＝UΣV^TVΣ^TU^T＝UΣΣ^TU^T＝UΛU^TSingular matrix U being equal to Q^TMatrix of orthogonal unit eigenvectors of Q, matrix

Is a m x m dimensional square matrix with elements on the diagonal of Q^TThe characteristic value corresponding to the Q is set,

a diagonal matrix composed of non-zero eigenvalues;

(2) feature extraction and dimension reduction

Selecting the first k maximum eigenvalues, wherein the size of k depends on the requirement of the accumulative contribution rate of the eigenvalues, and when the accumulative contribution rate of the desirable eigenvalues is not less than 90%, k is

Simultaneously selecting the front k columns of corresponding singular matrixes U and V, realizing dimension reduction processing on the singular matrixes, and respectively recording the dimension reduction processing as U_kAnd V_kThen a Q is obtained^TK-th order approximation of the matrix, i.e. X_k ^T＝U_kΣ_kV_k ^T；

(3) Semantic core based on CLSVSM

k(d_l,d_s)＝(U_k ^Tφ(d_l))^T(U_k ^Tφ(d_s))＝φ^T(d_l)U_kU_k ^Tφ(d_s)，l,s＝1,…,n

The semantic kernel function obtains a consistent kernel matrix as:

the semantic core based on the CLSVSM is abbreviated as CLSVSM _ K;

and a sixth step: document clustering

And performing semantic kernel function representation on the documents, taking a kernel matrix as a similarity matrix between the documents, and selecting a clustering algorithm to perform document theme clustering.

By adopting the technical scheme, compared with the semantic kernel function in the previous research, the semantic kernel function extracts richer semantic information, avoids background knowledge such as imperfect ontology with higher construction cost and the like, improves the clustering effect by more than 20%, and not only realizes the combination of synonymous information among text feature words but also reduces the dimension of a feature word space when extracting the semantic information.

Detailed Description

Example 1

The first step is as follows: data preprocessing: and cleaning data, marking documents, extracting keywords of each document, and keeping the corresponding relation between the keywords and the corresponding documents.

The data are from CNKI, according to the classification, 300 documents of three subjects of 'publishing', 'book intelligence and digital library' and 'archive and museum' under the information science are respectively selected as analyzed documents, 4 documents without keywords are removed, the total number of the finally obtained documents is 896, wherein 299 documents of 'publishing', 'book intelligence and digital library' 298, 'archive and museum' 299, and 2509 different keywords are obtained. Namely: the number of documents n is 896, the number of keywords m is 2509, and the table below shows the top 20 documents intercepted and all the keywords corresponding to them. In Table 1, LM is the document category, ID is the document number, and k1-k10 are the corresponding keywords of the document.

Table 1: list of documents and corresponding keywords (parts)

LM

ID

T1

K1

K2

K3

K4

K5

K6

K7

K8

K9

K10

Drawing (A) Love of a person

1001

Culture of villages and townsStand on New rural cultural building Function in setting

Towns and towns Culture

Rural area Culture Construction of buildings

Peasants Group of people

Culture Movement of

Article (Chinese character) Transforming Station

Activity device Movable part Shape of Formula (II)

Culture Cause of business Unit of

Base layer Culture Work by

Country Ballast for ballast Political affairs administration Mansion

Basal layer organization

Drawing (A) Love of a person

1002

Public library electricity Reading of sub reading room Person service

Reader's book Service

Electronic device Reading device Chamber

Public Book with detachable cover Shop

Drawing (A) Love of a person

1003

To the reader Library management flow Engineering optimization discussion

Reader's book

Book with detachable cover Shop

Managing Means for

Drawing (A) Love of a person

1004

Self-evergreen collection of languages Versions and pairs thereof Simultaneous speech and material price Value of

Language Self-evergreen Collection (I)

Beijing Official speech

Literature reference Investigation

Modern times Chinese character

Drawing (A) Love of a person

1005

I-way of farm bookstore Exhibition situation, problem and countermeasure-based on

Farmhouse Book room

Book with detachable cover Purchasing

Daily life Maintenance

Managing Mechanism for controlling a motor

Drawing (A) Love of a person

1006

Song carved "five minister notes text Selection of Meng's Ben and Chen Eight lang book relation examination

Five minister drugs Note and text (iii) selection

Ancestor book

Meng's disease Book (I)

Chen Ba Lang book

Do not need to use Showa Chinese character of Asian race Book (I)

Drawing (A) Love of a person

1007

Under the condition of informatization College library net Of network information resources Construction of

Information Transforming

Colleges and universities Book with detachable cover Shop

Network Information Resource(s)

Drawing (A) Love of a person

1008

Independent college of examination and discussion Library reading guide I work Development of (2)

Book with detachable cover Reading guide

University Generating Long and long

Book with detachable cover Shop

Independent of each other College of academic

Drawing (A) Love of a person

1009

The library of colleges and universities is Culture of innovated talents Important base

Innovation of Talents

Colleges and universities Book with detachable cover Shop

Drawing (A) Love of a person

1010

Number of middle school in Shanghai City Teaching of word experiment Status sample survey And is divided into

Number of Chemical and physical therapy Test (experiment)

Experiment of Teaching aid

Data of Analysis of

Drawing (A) Love of a person

1011

Continuing teaching to colleges and universities Opening article of the institute of education Idea of dedicating retrieval lesson Examination

Continue to use Education College of academic

Literature reference Retrieval Lesson

Information Vegetarian food

Drawing (A) Love of a person

1012

How the library should To social media Influence of

Society, its own and other related applications Chemical media Body

New media Body

On-line Media

Book with detachable cover Shop

Drawing (A) Love of a person

1013

Knowledge-based environment College library Establishment of system

University Book with detachable cover Shop system Degree of rotation

Study of Big type Study the picture Book with detachable cover

Knowledge of Managing Mode(s)

Drawing (A) Love of a person

1014

Colleges and universities' library allies Learning under alliance environment Development strategy for librarian Slightly less than

Subject of discipline Librarian

Book with detachable cover Shop couplet Alliance type

Subject of discipline Service

Drawing (A) Love of a person

1015

College library duty For helping students Problem of training

Colleges and universities Book with detachable cover Shop

Working worker Learning aid

Student's desk Training

Drawing (A) Love of a person

1016

Human care perspective Library culture Innovative practices

Book with detachable cover Shop and literature Transforming

Humanity Care

Culture Innovation of

Drawing (A) Love of a person

1017

Book for universities and colleges of professorship Utilization rate of museum's literature Reason for and pair of Policy and plan for making

Book with detachable cover Shop

High-job High special purpose Colleges and universities

Literature reference By using Rate of change

Drawing (A) Love of a person

1018

College library self Assisted service application Analysis of

Colleges and universities Book with detachable cover Shop

Self-help Service

RFID

Drawing (A) Love of a person

1019

Colleges and universities' library clothes Business adult education cash Examination of conditions and countermeasures

Colleges and universities Book with detachable cover Shop

Adult Education

Reader's book Service

Drawing (A) Love of a person

1020

Based on SCI and SSCI And A&Henan of HCI University paper statistics

Henan province University

Study and study Paper (S)

SCI

SSCI

A& HCI

The second step is that: and constructing a keyword space, carrying out word frequency statistics on the extracted keywords, and arranging the keywords in a descending order according to the word frequency. Table 2 shows the first 20 keywords and the corresponding word frequencies in our experimental results:

table 2: keyword frequency statistics (part)

d_l＝(a_l1,a_l2,...,a_l,2509)^T∈R²⁵⁰⁹，l＝1,2,…,896

wherein: d_l896 in the first document, in Euclidean space R²⁵⁰⁹The expression vector in (1) has 2509 keywords, so that the Euclidean space is R²⁵⁰⁹，a_lj(j is 1,2, …,2509) is the weight of the jth keyword in the ith document, l is the document number, T is the transposition operation, and when the jth keyword is the document d_lWhen a keyword_ljEqual to 1, otherwise 0, the "article-word" matrix of the document collection is a ═ a (a)_lj)_896×2509. Table 3 presents the data for matrix a in Excel in the first 20 rows and the first 15 columns of the experiment where the dimension of matrix a is 896 x 2509. 2509 keywords are recorded in row 1 of table 3; the 1 st column records category information; column 2 records the ID of the document; line 1 column 1 position 897 refers to the use of the Excel table 897 line.

Table 3: VSM-based "word-context" matrix A (part)

(1) computing co-occurrence intensity matrices

Co-occurrence matrix between keywords C ═ A^TA＝(c_ij)_2509×2509Table 4 presents some of the results of the experiment for matrix C, where C is when i ≠ j_ijCo-occurrence frequency of ith keyword and jth keyword, when i equals j, c_iiThe total frequency of the ith keyword, i.e., the value on the diagonal. The table has row 1 and column 1 as keywords.

Table 4: keyword co-occurrence matrix C (part)

A co-occurrence intensity matrix B is then calculated,

wherein, c₁₁，c₂₂，…，c_2509，2509Frequency counts of the 1 st, 2 nd, … … th and 2509 th keywords respectively; when i ≠ j, b_ijIs the co-occurrence strength of the ith keyword and the jth keyword, when i is equal to j, b is_iiThe table below shows the results of some experiments in which matrix B is shared in the truncated experiments. The table has row 1 and column 1 as keywords.

Table 5: co-occurrence intensity matrix B (part)

(2) Extraction of co-occurrence information

For a in the matrix A_ljThe portion of 0 is supplemented with co-occurrence information, namely: the co-occurrence information is supplemented for the portion with value 0 in table 3. The method comprises the following steps: note a_ljJ has an index set of 1_l1Namely: i is_l1＝{j|a_lj1, this term is called

For the potential semantic similarity of the l document and the j keyword,

set of representations b_jtMeet the condition t ∈ I in }_i1Is recorded as

When a is_ijWhen 1, q_ij1 is ═ 1; when a is_ijWhen 0 is equal to 0, q is not less than 0_ijLess than 1; the following table is a_ljWhen q is 0_ljHere we only cut the first 20 rows and the first 15 columns of the experimental results. Not all of a_ljThe values can be supplemented when the value is 0, the values of the parts which cannot be supplemented are still 0, and table 6 only shows the values when the values can be supplemented; category information is recorded in column 1 of table 6, document ID is recorded in column 2, and 2509 keywords are recorded in row 1.

Table 6: co-occurrence information supplement matrix (part)

(3) Co-occurrence latent semantic vector space model (CLSVSM)

Wherein:

the results of the new "article-word" matrix based on CLSVSM in the experiments are shown in the following table, where we only cut the first 20 rows and the first 15 columns, where the 1 st column records the document category information, the 2 nd column records the document ID, and the 1 st row records 2509 keywords:

table 7: new "chapter-word" matrix Q (part) from CLSVSM

The fifth step: construction of semantic Kernel function

(4) Transpose Q of the corresponding "chapter-word" matrix Q of Table 7^TPerforming singular value decomposition

to Q^TThe singular matrices U and V corresponding to the singular value decomposition are shown in tables 8 and 9, and the matrix Σ has a value shown in table 10. Table 8 row 1 and column 1 are keywords; table 9 rows 1 and columns 1 identify documents, Table 10 rows 1 identify documents, and columns 1 identify keywords. Simultaneously solving the matrix Q^TRank r of 896.

Table 8: singular matrix U (part)

Table 9: singular matrix V (part)

Table 10: matrix Σ (part)

Calculating sigma-sigma^TThe matrix Λ was obtained and the first 20 rows and the first 15 columns of the experimental results are shown in table 11, where Λ is a square matrix with dimensions 2509 × 2509.

Table 11: matrix Λ (part)

(5) Feature extraction and dimension reduction

And selecting the first k maximum characteristic values. The magnitude of k depends on the cumulative contribution rate requirement of the eigenvalues. Here, the cumulative contribution rate of the feature value is not less than 90%, and the sum of the feature values obtained by MATLAB calculation is 7.5457e +03, that is, the sum of the feature values is

When the cumulative contribution rate of the eigenvalues is not less than 90%, k is 247,

namely:

therefore, the first 247 eigenvalues of the matrix Λ are selected, the corresponding first 247 columns of the singular matrices U and V are simultaneously selected, dimension reduction processing is realized on the singular matrices, and the singular matrices are respectively marked as U₂₄₇. Similarly, when the cumulative contribution rate of the feature value is not less than 95% and 98%, the values of k are 356 and 468, respectively.

(6) Semantic core based on CLSVSM

k(d_l，d_s)＝(U₂₄₇ ^Tφ(d_l))^T(U₂₄₇ ^Tφ(d_s))＝φ^T(d_l)U₂₄₇U₂₄₇ ^Tφ(d_s)，l，s＝1，2，…，896

The semantic kernel function obtains a consistent kernel matrix as:

the semantic kernel based on CLSVSM is abbreviated as CLSVSM _ K.

Nuclear matrix obtained in experiment

The first 20 rows and the first 15 columns of table 12,

is dimension of

896 × 896 square matrix. Line 1 and column 1 of table 12 are the ID information of the document.

Table 12: core matrix

(part)

And a sixth step: document clustering

And performing semantic kernel function representation on the documents, taking a kernel matrix as a similarity matrix between the documents, and selecting a clustering algorithm to perform document clustering. In the experiment, a k-means clustering algorithm is adopted. The results of the experimental comparison are shown in tables 13 and 14:

in the experiment, clustering results under several clustering schemes are respectively compared, and 22 experiments are carried out in total. The results are shown in Table 13.

Table 13: comparison of experimental results of CLSVSM and VSM

The experimental results show that the CLSVSM results are far superior to VSM. And the results of the test for CLSVSM are optimal when selecting scheme D-I2.

Then comparing a semantic core of the co-occurrence potential semantic vector space model with a linear core of the co-occurrence potential semantic vector space model and the co-occurrence potential semantic vector space model, wherein the selection of a parameter K during the construction of the semantic core respectively ensures that the sum of the first K characteristic values accounts for 90%, 95% and 98% of the sum of the characteristic values, the constructed semantic core function table is respectively abbreviated as 90% CLSVSM _ K, 95% CLSVSM _ K and 98% CLSVSM _ K, an optimal scheme D-I2 is selected, each model is subjected to 50 times of experiments, the clustering result is evaluated through the mean values of three indexes of entropy, purity and F value obtained through multiple experiments, and the analysis and comparison result is shown in a table 14.

Table 14: clustering comparisons of different methods

	Entropy value ↓	Purity × ×	F value ↓	Dimension ↓offeature word space
					CLSVSM	0.596±0.039	0.768±0.037	0.776±0.034	2509
Linear kernel	0.571±0.016	0.791±0.014	0.795±0.009	2509
					90％CLSVSM_K	0.599±0.017	0.785±0.006	0.785±0.006	247※
95％CLSVSM_K	0.571±0.043	0.801±0.004※	0.798±0.004	356
					98％CLSVSM_K	0.565±0.003※	0.797±0.001	0.798±0.001※	468

↓ in the above table indicates that the smaller the experiment result is, the better; conversely,. rho.e. indicates that the larger the experimental result, the better. In the table we mark the optimum results of the experiment in asterisks. The higher the purity and the F value are, the better the clustering effect is; conversely, a smaller entropy value is better.

Two groups of experimental results show that the co-occurrence potential semantic vector space model greatly improves the clustering precision compared with the previous model, and the semantic kernels constructed based on the co-occurrence potential semantic vector space model obviously perform dimension reduction processing on the feature word space while improving the clustering progress, so that the complexity of a clustering algorithm on time and space is reduced. Therefore, the method is applied to text clustering to extract richer semantic information, and meanwhile, the dimensionality of the feature word space is reduced.

Claims

1. The method for semantic core of the latent semantic vector space model co-occurrence of document resource topic clusters is characterized by comprising the following steps of:

d_l＝(a_l1,a_l2,...,a_lm)^T∈R^m，l＝1,2,…,n.

(1) computing co-occurrence intensity matrices

a co-occurrence intensity matrix B is then calculated,

(2) extraction of co-occurrence information

set of representations b_jtMeet the condition t ∈ I in }_l1Is recorded as

(3) Co-occurrence latent semantic vector space model (CLSVSM)

Wherein:

the new "chapter-word" matrix based on CLSVSM is:

the fifth step: construction of semantic Kernel function

Is a matrix of dimension m × n, assuming a "word-term" matrix Q^TIs r, Δ ═ diag (δ)₁δ₂δ₃… δ_r)，δ_i(i-1, 2, …, r) is a non-zero singular value and is arranged in order of magnitude as δ₁≥δ₂≥…≥δ_rCorrelation matrix Q between keywords^TQ＝UΣV^TVΣ^TU^T＝UΣΣ^TU^T＝UΛU^TSingular matrix U being equal to Q^TMatrix of orthogonal unit eigenvectors of Q, matrix

a diagonal matrix composed of non-zero eigenvalues;

(2) feature extraction and dimension reduction

(3) Semantic core based on CLSVSM

k(d_l,d_s)＝(U_k ^Tφ(d_l))^T(U_k ^Tφ(d_s))＝φ^T(d_l)U_kU_k ^Tφ(d_s)，l,s＝1,L,n

The semantic kernel function obtains a consistent kernel matrix as:

the semantic core based on the CLSVSM is abbreviated as CLSVSM _ K;

and a sixth step: document clustering