CN107861924A - An e-book content representation method based on a partial reconstruction model - Google Patents

Publication number
CN107861924A
Authority
CN
China
Prior art keywords
book
node
vector
partial reconstruction
reconstruction model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710889265.5A
Other languages
Chinese (zh)
Inventor
张海军
王双
姬玉柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Publication of CN107861924A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention proposes an e-book content representation method based on a partial reconstruction model. The method comprises: A. tree-structure representation: each e-book is divided into pages and each page into paragraphs, so that every e-book is organized as a three-level "e-book -> page -> paragraph" tree; B. node feature representation: a vocabulary is built, word distribution vectors are computed, and principal component analysis is used to reduce the dimensionality of, and compress, the word distribution vectors of the nodes at every level; C. partial reconstruction model building: the information of a parent node is reconstructed from the information of its child nodes, which establishes the partial reconstruction model, and solving the model yields a reconstruction coefficient vector; D. unified vector representation of the tree structure: using the reconstruction coefficient vector obtained in C, each node is fused with its child nodes and the feature vector of the node is updated; E. content-based e-book retrieval and recommendation.

Description

An e-book content representation method based on a partial reconstruction model
Technical field
The invention belongs to the field of text mining systems, and more particularly relates to an e-book content representation method based on a partial reconstruction model. The method takes the e-book itself as its raw input.
Background
In recent years, with the widespread use of mobile reading devices, the number of e-books has kept growing. Designing effective e-book recommendation algorithms that give users accurate and useful recommendations is therefore of considerable importance.
Existing e-book recommendation techniques fall into two broad classes: collaborative filtering and content-based recommendation. Collaborative filtering depends heavily on user behavior: it recommends on the basis of similar preferences among users and requires a certain number of user ratings in the system. If there are too few ratings, or if some books have never been purchased or rated, collaborative filtering cannot recommend effectively. In practice most books have few sales or ratings, which severely limits the method. Content-based recommendation algorithms were developed in response, but they depend on a detailed feature selection process and require specified attribute information to be supplied for every book in advance, after which recommendation is performed with automatic text classification methods. Such content-based recommendation relies only on specific text metadata, not on the content of the e-book itself.
The "bag-of-words" model is a typical approach to representing whole-document content; its aim is to obtain a vector that represents the content of an entire document. However, it relies only on word frequency statistics and ignores the spatial distribution of words, so it has difficulty distinguishing two books whose word frequencies are similar but whose words are distributed differently.
A tree structure is an effective way of organizing and representing data that captures both the hierarchical relationships and the spatial structure within the data. An e-book can therefore be organized as "e-book -> page -> paragraph" to form a three-level tree that reflects the spatial hierarchy of the book and compensates, to some extent, for the bag-of-words model's neglect of spatial information in the text. Data organized as a tree, however, does not lend itself to computing the similarity between samples. The hierarchical information of the tree-structured data must therefore be integrated further into a unified vector representation on which recommendation can be built.
Summary of the invention
The object of the present invention is to provide an e-book content representation method based on a partial reconstruction model that addresses the problems of the prior art described above.
To integrate the hierarchical information of tree-structured data, the present invention proposes a partial reconstruction model based on a cosine-type distance function. A reconstruction coefficient vector is obtained by reconstructing the information of a parent node from the signals of its child nodes, and the local information in the tree-structured data is then integrated. The process runs bottom-up until the tree-structured data has been converted into a unified vector representation that contains the hierarchical information of the tree.
The present invention is achieved through the following technical solution: an e-book content representation method based on a partial reconstruction model, the method comprising the following steps:
A. Tree-structure representation: each e-book is divided into pages, and each page is further divided into paragraphs, so that every e-book forms a three-level "e-book -> page -> paragraph" tree;
B. Node feature representation: a vocabulary is built and word distribution vectors are computed; principal component analysis (PCA) is then used to reduce the dimensionality of, and compress, the word distribution vectors of the nodes at every level, to simplify the subsequent model computation;
C. Partial reconstruction model building: the information of a parent node is reconstructed from the information of its child nodes, which establishes the partial reconstruction model; solving the model yields the reconstruction coefficients;
D. Unified vector representation of the tree structure: using the reconstruction coefficient vector obtained in the partial reconstruction modeling stage, each node is fused with its child nodes and the feature vector of the node is updated; the process proceeds level by level from the bottom up until the e-book data represented by the tree has been compressed into a unified vector representation;
E. Content-based e-book retrieval and recommendation: e-books are retrieved using their unified vector representations, and e-books with related content are recommended to the user by computing similarity.
As a further improvement of the present invention, the tree-structure representation step comprises the following steps:
A1. E-book segmentation: the e-book is split at the paragraph delimiter "\r\n", which divides it into a number of paragraphs;
A2. Page division: adjacent paragraphs are merged until the length of the merged text exceeds a preset minimum page length, at which point a new page is formed. In the present invention the minimum page length threshold is set to 1000;
A3. Paragraph division: each page formed in the previous step is split again at the paragraph delimiter "\r\n", and adjacent pieces are merged until their length exceeds a preset minimum paragraph length, at which point a new paragraph is formed. In the present invention the minimum paragraph length threshold is set to 50.
As a further improvement of the present invention, the node feature representation step comprises the following steps:
B1. Vocabulary building: after text preprocessing operations such as text segmentation, stop-word removal, stemming and spelling correction, the vocabulary of the whole data set is built and word frequency statistics are computed for the e-books in the data set;
B2. Word distribution vector computation: the term frequency-inverse document frequency (tf-idf) model is used to compute the weight of each word, which gives the word distribution vector of each node in the tree;
B3. Feature dimensionality reduction: to keep the computation feasible, principal component analysis (PCA) is used to compress, and reduce the dimensionality of, the weighted word vector of each node in the tree.
As a further improvement of the present invention, the partial reconstruction model building step comprises the following steps:
C1. Building the partial reconstruction model: for a node in the tree that has child nodes, the information of the node is reconstructed from the information of its child nodes, and a cosine-type distance function is used to measure the error with which the child-node information reconstructs the node information;
C2. Solving the partial reconstruction model to obtain the partial reconstruction coefficient vector. The magnitude of a reconstruction coefficient indicates how well the corresponding child node reconstructs the information of its parent node: the larger the coefficient, the stronger the reconstruction ability of that child node.
As a further improvement of the present invention, the unified vector representation step for the tree structure comprises the following steps:
D1. Using the reconstruction coefficient vector obtained by solving the partial reconstruction model, the feature vector of each child node is multiplied by its corresponding reconstruction coefficient, and the results are added, with weighting, to the feature vector of the parent node, which gives the new feature representation of the parent node;
D2. The operation of the previous step is applied to the nodes of the tree level by level from the bottom up, until the e-book data represented by the tree has been compressed into a unified vector representation.
As a further improvement of the present invention, the content-based e-book retrieval and recommendation step comprises the following steps:
E-books are retrieved using their unified vector representations. The cosine distance function is used to compute the similarity between the query sample and the e-book samples in the database, which produces a retrieval list of e-books and thereby realizes content-based e-book recommendation.
The beneficial effects of the present invention are as follows. The e-book content representation method based on a partial reconstruction model organizes each e-book into a three-level "e-book -> page -> paragraph" tree. Compared with the traditional bag-of-words model, this reflects the hierarchical structure of the text and helps strengthen its vector representation. A partial reconstruction model is then built for each node and its child nodes. Traditional reconstruction models measure reconstruction error with the Euclidean distance function; in view of the advantages of the cosine distance function for measuring text similarity, the present invention proposes a partial reconstruction model based on a cosine-type distance function. In this improved algorithm, a node is reconstructed from the signals of its child nodes, the reconstruction error is measured with the cosine-type function, and solving the model yields the partial reconstruction coefficient vector. The model rests on the assumption that the information of a child node is inherited from its parent node, and that the information of the parent node can be expressed as a linear combination of the information of its child nodes. The feature representation of the parent node is then updated to the original feature vector of the node plus the feature vectors of its child nodes multiplied by their corresponding reconstruction coefficients. The updated node thus retains both its own original information and the structural information of the two-level subtree formed by the node and its child nodes. The process proceeds level by level from the bottom up and finally yields the unified vector representation of the e-book, on which the retrieval and recommendation of e-books with related content are built.
Brief description of the drawings
Fig. 1 is a flow chart of the e-book content representation method based on a partial reconstruction model according to the present invention;
Fig. 2 is a structural diagram of the e-book content recommendation system based on a partial reconstruction model according to the present invention.
Detailed description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
The main innovative work of the e-book content representation method based on a partial reconstruction model consists of five parts: 1) tree-structure representation; 2) node feature representation; 3) partial reconstruction model building; 4) unified vector representation of the tree structure; 5) content-based e-book retrieval and recommendation. The first part divides the input e-book to build the three-level "e-book -> page -> paragraph" tree. The second part produces the feature representation of the nodes: it builds the vocabulary, computes the word distribution vectors, and uses principal component analysis (PCA) to reduce the dimensionality of, and compress, the features. The third part builds the partial reconstruction model and obtains the reconstruction coefficient vector. The fourth part integrates the node information from the bottom up and finally forms the unified vector representation. The fifth part realizes content-based e-book retrieval and recommendation.
Fig. 1 shows the flow chart of the e-book content representation method based on a partial reconstruction model provided by the present invention, which is described in detail below:
Step S1, tree-structure representation: the input e-book is divided into pages and paragraphs to build the three-level "e-book -> page -> paragraph" tree. The specific steps are as follows:
First, the e-book is split at the paragraph delimiter "\r\n", which divides it into a number of paragraphs.
Then the pages are formed. Adjacent paragraphs are merged until the length of the merged text exceeds a preset minimum page length, at which point a new page is formed. In the present invention the minimum page length threshold is set to 1000. The last page of a book is not subject to the minimum length.
Next, the paragraphs are formed. Each page newly formed in the previous step is split again at the paragraph delimiter "\r\n" into a number of pieces, and adjacent pieces are merged until their length exceeds a preset minimum paragraph length, at which point a new paragraph is formed. In the present invention the minimum paragraph length threshold is set to 50. The last paragraph of a page is not subject to the minimum length.
Finally, the three-level "e-book -> page -> paragraph" tree is formed. The first level, "e-book", represents the content of the whole book; the second level, "page", expresses that an e-book consists of a number of pages; the third level, "paragraph", expresses that a page consists of a number of paragraphs.
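The splitting and merging of step S1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the text does not state the unit of the thresholds 1000 and 50, so the sketch assumes they are word counts, and the names `merge_until`, `split_raw` and `build_tree` are hypothetical.

```python
def split_raw(text):
    """Split text at the "\r\n" delimiter and drop empty pieces."""
    return [piece for piece in text.split("\r\n") if piece.strip()]


def merge_until(pieces, min_len):
    """Greedily merge adjacent pieces until a unit exceeds min_len words.

    Assumption: length is measured in words. The final unit may be
    shorter, since the last page or paragraph has no minimum length.
    """
    units, current = [], ""
    for piece in pieces:
        current = piece if not current else current + "\r\n" + piece
        if len(current.split()) > min_len:
            units.append(current)
            current = ""
    if current:  # leftover text becomes the last (possibly short) unit
        units.append(current)
    return units


def build_tree(book_text, page_min=1000, para_min=50):
    """Build the three-level "e-book -> page -> paragraph" tree."""
    pages = merge_until(split_raw(book_text), page_min)
    return {
        "text": book_text,
        "pages": [
            {"text": page,
             "paragraphs": merge_until(split_raw(page), para_min)}
            for page in pages
        ],
    }
```

Because the merge is greedy, every page except possibly the last exceeds the page threshold, and the same holds for paragraphs within a page.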
Step S2, node feature representation: features are extracted for the e-book organized as the three-level "e-book -> page -> paragraph" tree. That is, information is extracted from the content corresponding to each node in the tree, and the feature representations of all nodes are mapped into the same semantic space. The specific steps are as follows:
(S21) Vocabulary building:
Text segmentation: keyword extraction first requires the text to be segmented. Because the e-books considered in the present invention are English text, it suffices to remove the punctuation marks and split the text on spaces.
Stop-word removal: words that are used very frequently but carry no real meaning, such as "a", "the" and "are", are discarded.
Stemming: English words occur in many forms, for example through verb inflection, nominalization, and singular and plural changes. Words are therefore reduced to their stems; for example, "read", "reads" and "reading" are treated as the same word in the present invention.
Spelling correction: the text extraction process may introduce misspelled words, so spelling correction is required.
Vocabulary construction: after the preprocessing steps above, the term frequency (tf) and document frequency (df) of each stemmed word are counted and stored: $tf_u$ (the frequency of the u-th word across all documents) and $df_u$ (the number of documents in which the u-th word occurs). The present invention retains only the words that occur more than 5 times, which gives the final vocabulary.
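The preprocessing and counting of step S21 can be sketched in a few lines. The stop-word list and the suffix-stripping `stem` function below are deliberately crude stand-ins (assumptions for illustration, not the stemmer used by the inventors), and spelling correction is omitted.

```python
import re
from collections import Counter

# Illustrative subset only; a real stop-word list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}


def stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words, stem."""
    words = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_vocabulary(documents, min_count=5):
    """Return (vocab, tf, df); vocab keeps words occurring more than min_count times."""
    tf, df = Counter(), Counter()
    for doc in documents:
        tokens = tokenize(doc)
        tf.update(tokens)       # total occurrences across all documents
        df.update(set(tokens))  # number of documents containing the word
    vocab = sorted(word for word, count in tf.items() if count > min_count)
    return vocab, tf, df
```

With this sketch, "read", "reads" and "reading" all map to the stem "read", matching the example in the text.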
(S22) Word distribution vector computation:
Using the vocabulary determined in the previous step, word frequency statistics are computed for the content corresponding to each node in the tree, which gives a word frequency distribution vector. The term frequency-inverse document frequency (tf-idf) model is then used to compute the weight of each word, which gives the weighted word distribution vector of each node in the tree.
For the root node of the tree, i.e. the "e-book" node representing the entire content of the e-book, the word distribution vector can be written as $h_{book} = [n_1, n_2, \ldots, n_{T_{book}}]$, where $n_v$ is the number of times the v-th vocabulary word occurs and $T_{book}$ is the length of the vocabulary. Similarly, the word distribution vectors of the "page" and "paragraph" nodes can be written as $h_{page}$ and $h_{para}$, with the counts taken over the page or paragraph. From the division principle for the e-book tree in S1 it follows that $h_{book} = \sum h_{page}$ over the pages of the book and $h_{page} = \sum h_{para}$ over the paragraphs of the page.
The tf-idf model is used to compute the weighted word distribution vectors. For the "e-book" node, the weighted word distribution vector can be written as:
$$H_{book} = [w_1 n_1, w_2 n_2, \ldots, w_{T_{book}} n_{T_{book}}]$$
where $w_v$ is the inverse document frequency weight of the v-th word, computed from its document frequency and from $N_{book}$, the number of e-books in the data set. $H_{page}$ and $H_{para}$ are obtained in the same way.
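The counting and weighting of step S22 can be illustrated as below. The exact inverse-document-frequency formula is not reproduced in the text, so the sketch assumes the common form log(N_book / df); the function names are hypothetical.

```python
import math
from collections import Counter


def count_vector(tokens, vocab):
    """Word distribution vector: occurrences of each vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def tfidf_vector(counts, vocab, df, n_books):
    """Weight each count by an assumed idf of log(n_books / df[word])."""
    return [n * math.log(n_books / df[word]) for n, word in zip(counts, vocab)]
```

Since pages partition a book and paragraphs partition a page, the count vector of a node is the element-wise sum of the count vectors of its children, which is why all three levels can share one vocabulary and one set of weights.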
(S23) Feature dimensionality reduction:
The word distribution vectors obtained in S22 have dimension $T_{book}$, the length of the vocabulary, which is usually large. In practice, feature vectors of very high dimension hinder both the computation during modeling and the measurement of similarity, so principal component analysis (PCA) is used to reduce the feature dimension.
PCA maps the word distribution vector of a node to a feature vector of lower dimension, which can be written as:
$$F_h = H B$$
where $B$ is the mapping matrix obtained by PCA from the content of the top-level nodes (the "e-book" nodes) in the data set, with dimension $T_{book} \times m_F$; $m_F$ is the dimension of the feature vector obtained after a word distribution vector is compressed by $B$, and the nodes at all three levels of the tree keep feature vectors of the same dimension $m_F$ after compression; $H$ is a word distribution vector obtained in S22 and can be $H_{book}$, $H_{page}$ or $H_{para}$; $F_h$ is the feature vector obtained after $H$ is compressed by $B$.
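A minimal NumPy sketch of this projection is given below. It assumes that B consists of the top m_F principal directions of the centered book-level vectors, obtained from a singular value decomposition; the text does not say how B is computed beyond "by PCA", and the function names are hypothetical.

```python
import numpy as np


def fit_projection(H_books, m_F):
    """Fit the T_book x m_F mapping matrix B on book-level vectors.

    H_books has one weighted word distribution vector per row.
    """
    centered = H_books - H_books.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:m_F].T


def compress(H, B):
    """Apply F_h = H B to book, page or paragraph vectors alike."""
    return H @ B
```

Fitting B once on the book level and reusing it for pages and paragraphs keeps all three levels in the same m_F-dimensional space, as the text requires.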
Step S3, partial reconstruction model building: the information of a parent node is reconstructed from the information of its child nodes, which establishes the partial reconstruction model, and solving the model yields the partial reconstruction vector. The specific steps are as follows:
The information contained in a child node can be divided into two parts: the feature vector that represents the characteristics of the child node itself, and the part of the information inherited from its parent node in the tree.
To describe the information of a parent node more fully, the supplementary information about the parent node that is contained in its child nodes must be extracted. For a parent node, its information is partially reconstructed from the information of its child nodes, and a cosine-type distance function is used to measure the error with which the child-node information reconstructs the node information. The partial reconstruction model can be written as:
$$\min_{\beta}\; d_{\cos}(F_{i,l},\, D_i \beta) + \lambda \beta^{T} \beta \quad \text{subject to } \mathbf{1}^{T} \beta = 1,$$
where $d_{\cos}(\cdot,\cdot)$ denotes the cosine-type distance function; $F_{i,l}$ is the feature vector of the i-th node in the tree, which is located at level $l$; $D_i$ is the matrix formed by the feature vectors of the child nodes of the i-th node, all of which are located at level $(l+1)$, and can be written as $D_i = [F_{1,l+1}, F_{2,l+1}, \ldots, F_{k_{max},l+1}]$ ($k_{max} > 1$ is the number of child nodes); $\lambda$ is a parameter ($\lambda > 0$); $\beta$ is the partial reconstruction vector obtained by solving the model, and its dimension equals $k_{max}$. The k-th child node corresponds to the reconstruction coefficient $\beta_k$ ($k = 1, 2, \ldots, k_{max}$), which reflects the ability of the node corresponding to $F_{k,l+1}$ to reconstruct the signal of its parent node.
To solve the partial reconstruction model, the cosine-type error term is first rewritten as a quadratic form $\beta^{T} W \beta$, where $W$ is a symmetric $k_{max} \times k_{max}$ matrix.
The Lagrangian is then constructed:
$$L(\beta, \mu) = \beta^{T} W \beta + \lambda \beta^{T} \beta + \mu (\mathbf{1}^{T} \beta - 1)$$
Setting $\partial L / \partial \beta = 0$ gives:
$$2 W \beta + 2 \lambda \beta + \mu \mathbf{1} = 0.$$
Solving this equation together with the constraint $\mathbf{1}^{T} \beta = 1$ gives $\mu = -2 \left(\mathbf{1}^{T} (W + \lambda I)^{-1} \mathbf{1}\right)^{-1}$. Substituting $\mu$ back into the equation above gives the partial reconstruction vector:
$$\beta = \Psi / (\mathbf{1}^{T} \Psi),$$
where $\Psi = (W + \lambda I)^{-1} \mathbf{1}$.
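The closed-form solution can be evaluated with a single linear solve. The sketch below takes W as an input, because its definition in terms of the cosine-type distance is not reproduced in the text; only the final formula, β = Ψ / (1ᵀΨ) with Ψ = (W + λI)⁻¹1, comes from the derivation above. The function name is hypothetical.

```python
import numpy as np


def reconstruction_coefficients(W, lam):
    """Return beta = Psi / (1^T Psi), where (W + lam * I) Psi = 1.

    W is the symmetric k_max x k_max matrix of the quadratic form and
    lam > 0 is the regularization parameter.
    """
    k = W.shape[0]
    # Solve the linear system instead of forming the explicit inverse.
    psi = np.linalg.solve(W + lam * np.eye(k), np.ones(k))
    return psi / psi.sum()
```

Two quick sanity checks follow from the derivation: the coefficients always sum to 1, and when W is the zero matrix every child receives the same coefficient 1/k_max.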
Step S4, unified vector representation of the tree structure: using the partial reconstruction vector obtained in step S3, the reconstruction information that the child nodes carry about their parent node is fused with the information of the parent node itself, which strengthens the information representation of the parent node. The specific steps are as follows:
For a node A, the information of its child nodes is fused with the information of node A itself as a weighted combination of $F_A$ and $\sum_{k} \beta_k F_k$, where $F_k$ is the feature vector of the k-th child node of node A; $\mu \in [0, 1]$ is a weight that balances the information fusion between the parent node and its child nodes; $F_A$ is the feature vector of node A itself; and $\hat{F}_A$ is the enhanced information representation of node A obtained after fusing the information of node A with that of its child nodes. In this operation the feature vector $F_A$ of node A is updated to $\hat{F}_A$.
For a tree with $l$ levels in total, the process above proceeds level by level from the bottom up (excluding the bottom level, whose nodes have no child nodes). The feature vectors of all nodes at level $(l-1)$ are updated first, so that they contain both the information of those nodes themselves and the reconstruction information from their child nodes. The feature vectors of the nodes at level $(l-2)$ are updated next, and so on until the feature vector of the root node has been updated. The e-book data originally organized as a tree can then be represented by the updated feature vector of the root node, which completes the unified vector representation of the tree-structured data.
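The bottom-up pass can be sketched as a post-order traversal. The exact fusion formula is not reproduced in the text, so this sketch assumes the convex combination F̂_A = μ·F_A + (1 − μ)·Σ β_k F_k; the dictionary layout, the `coeff_fn` callback and the `uniform` placeholder are likewise hypothetical. In a full pipeline `coeff_fn` would solve the partial reconstruction model of step S3.

```python
def fuse(node, coeff_fn, mu=0.5):
    """Update node["vec"] bottom-up and return the fused vector.

    node is {"vec": [...], "children": [...]}; leaves have no children.
    coeff_fn(parent_vec, child_vecs) returns one coefficient per child.
    """
    children = node.get("children", [])
    if not children:
        return node["vec"]  # bottom level: nothing to fuse
    # Children are updated first, so the traversal is bottom-up.
    child_vecs = [fuse(child, coeff_fn, mu) for child in children]
    beta = coeff_fn(node["vec"], child_vecs)
    dim = len(node["vec"])
    combined = [sum(b * vec[j] for b, vec in zip(beta, child_vecs))
                for j in range(dim)]
    # Assumed fusion rule: convex combination of parent and children.
    node["vec"] = [mu * p + (1 - mu) * c
                   for p, c in zip(node["vec"], combined)]
    return node["vec"]


def uniform(parent_vec, child_vecs):
    """Placeholder coefficients that sum to 1 (equal weight per child)."""
    return [1 / len(child_vecs)] * len(child_vecs)
```

After the call returns, the root vector is the unified representation of the whole book, and every intermediate node holds its own fused vector.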
Step S5, content-based e-book retrieval and recommendation: after step S4, the content of each e-book, originally organized as a tree, is represented by a single unified vector. E-books are retrieved using these unified vector representations: the cosine distance function is used to compute the similarity between the query sample and the e-book samples in the database, which produces a retrieval list of e-books and thereby realizes content-based e-book recommendation.
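Retrieval over the unified vectors reduces to ranking by cosine similarity, as in the minimal sketch below. The library contents and the function names are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend(query_vec, library, top_k=3):
    """Return the top_k titles most similar to query_vec.

    library maps each e-book title to its unified vector.
    """
    ranked = sorted(library.items(),
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [title for title, _ in ranked[:top_k]]
```

For example, with `library = {"a": [1, 0], "b": [1, 1], "c": [0, 1]}`, the query `[2, 0.1]` ranks "a" first and "b" second, because cosine similarity depends on direction and not on vector length.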
Fig. 2 shows an e-book content recommendation system based on a partial reconstruction model. The system comprises:
A tree-structure representation module: each e-book is divided into pages, and each page is further divided into paragraphs, so that every e-book forms a three-level "e-book -> page -> paragraph" tree;
A node feature representation module: a vocabulary is built and word distribution vectors are computed; principal component analysis (PCA) is then used to reduce the dimensionality of, and compress, the word distribution vectors of the nodes at every level, to simplify the subsequent model computation;
A partial reconstruction model building module: the information of a parent node is reconstructed from the information of its child nodes, which establishes the partial reconstruction model; solving the model yields the reconstruction coefficients;
A unified vector representation module for the tree structure: using the reconstruction coefficient vector obtained by the partial reconstruction model building module, each node is fused with its child nodes and the feature vector of the node is updated. The process proceeds level by level from the bottom up until the e-book data represented by the tree has been compressed into a unified vector representation;
A content-based e-book retrieval and recommendation module: e-books are retrieved using their unified vector representations, and e-books with related content are recommended to the user by computing similarity.
The tree-structure representation module comprises:
E-book segmentation: the e-book is split at the paragraph delimiter "\r\n", which divides it into a number of paragraphs;
Page division: adjacent paragraphs are merged until the length of the merged text exceeds a preset minimum page length, at which point a new page is formed. In the present invention the minimum page length threshold is set to 1000;
Paragraph division: each page formed in the previous step is split again at the paragraph delimiter "\r\n", and adjacent pieces are merged until their length exceeds a preset minimum paragraph length, at which point a new paragraph is formed. In the present invention the minimum paragraph length threshold is set to 50.
The node feature representation module comprises:
Vocabulary building: after text preprocessing operations such as text segmentation, stop-word removal, stemming and spelling correction, the vocabulary of the whole data set is built and word frequency statistics are computed for the e-books in the data set;
Word distribution vector computation: the term frequency-inverse document frequency (tf-idf) model is used to compute the weight of each word, which gives the word distribution vector of each node in the tree;
Feature dimensionality reduction: to keep the computation feasible, principal component analysis (PCA) is used to compress, and reduce the dimensionality of, the weighted word vector of each node in the tree.
The partial reconstruction model building module comprises:
Building the partial reconstruction model: for a node in the tree that has child nodes, the information of the node is reconstructed from the information of its child nodes, and a cosine-type distance function is used to measure the error with which the child-node information reconstructs the node information;
Solving the partial reconstruction model to obtain the partial reconstruction coefficient vector. The magnitude of a reconstruction coefficient indicates how well the corresponding child node reconstructs the information of its parent node: the larger the coefficient, the stronger the reconstruction ability of that child node.
The unified vector representation module for the tree structure comprises:
Using the reconstruction coefficient vector obtained by solving the partial reconstruction model, the feature vector of each child node is multiplied by its corresponding reconstruction coefficient, and the results are added, with weighting, to the feature vector of the parent node, which gives the new feature representation of the parent node.
The operation of the previous step is applied to the nodes of the tree level by level from the bottom up, until the e-book data represented by the tree has been compressed into a unified vector representation.
The content-based e-book retrieval and recommendation module includes:
E-books are retrieved using their unified vector representations. The cosine distance function is used to calculate the similarity between the query sample and the e-book samples in the database, producing a retrieval list of e-books and thereby realizing content-based e-book recommendation.
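A minimal sketch of the retrieval step follows, assuming each e-book has already been reduced to a unified vector. The names `retrieve` and `library`, the `top_k` parameter, and the sample vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: similarity with a zero vector is defined as 0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_vec, library, top_k=3):
    """Rank e-books by cosine similarity to the query, highest first."""
    scored = [(title, cosine_similarity(query_vec, vec))
              for title, vec in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Hypothetical database of unified e-book vectors.
library = {
    "book_a": [1.0, 0.0],
    "book_b": [1.0, 1.0],
    "book_c": [0.0, 1.0],
}
results = retrieve([2.0, 0.1], library, top_k=2)
```

The returned list pairs each title with its similarity score; the top entries form the recommendation list for the query e-book.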
The main contributions of the present invention are the following two points. First, e-book content is organized into a tree structure according to its internal structural relationships. This expresses both the content features of the e-book itself and the hierarchical features of its internal structure, and thereby strengthens the information representation of the e-book. Second, a Partial Reconstruction model based on a cosine-type distance function is proposed. Existing similar research measures the reconstruction error with the Euclidean distance function. Considering the advantage of the cosine distance function in measuring text content similarity, the present invention measures the reconstruction error with a cosine-type distance function, which effectively improves the ability of child node information in the tree to reconstruct parent node information.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (6)

  1. An e-book content representation method based on a Partial Reconstruction model, characterized in that the method comprises the following steps:
    A. Tree structure representation: each e-book is divided into several pages, and each page is further divided into several paragraphs, so that each e-book forms a tree of height three, "e-book -> page -> paragraph";
    B. Node feature representation: a vocabulary is built and word distribution vectors are calculated; principal component analysis (PCA) is then used to compress and reduce the dimensionality of the word distribution vectors of the nodes at each level, to facilitate subsequent model computation;
    C. Establishing the Partial Reconstruction model: the information of a parent node is reconstructed from the information of its child nodes, that is, the Partial Reconstruction model is established, and the model is solved to obtain the reconstruction coefficients;
    D. Unified vector representation of the tree structure: according to the reconstruction coefficient vector obtained in step C, the information of a node and its child nodes is merged and the feature vector representation of the node is updated; this process proceeds bottom-up, layer by layer, until the e-book data represented by the tree structure is compressed into a unified vector representation;
    E. Content-based e-book retrieval and recommendation: e-books are retrieved using their unified vector representations, and e-books with related content are recommended to the user by calculating similarity.
  2. The e-book content representation method based on a Partial Reconstruction model according to claim 1, characterized in that step A comprises the following steps:
    A1. E-book segmentation: the e-book is split at the paragraph delimiter "\r\n", dividing one e-book into several paragraphs;
    A2. Page division: adjacent paragraphs are merged until the length of the merged text exceeds a preset minimum page threshold, at which point a new page is formed;
    A3. Paragraph division: each page formed in the previous step is split again at the paragraph delimiter "\r\n", and adjacent paragraphs are merged until their length exceeds a preset minimum paragraph threshold, at which point a new paragraph is formed.
  3. The e-book content representation method based on a Partial Reconstruction model according to claim 1, characterized in that step B comprises the following steps:
    B1. Building the vocabulary: after text preprocessing operations such as text segmentation, stop-word removal, stemming, and spelling correction, the vocabulary of the full data set is established, and word frequency statistics are computed for the e-books in the data set;
    B2. Calculating the word distribution vector: the term frequency-inverse document frequency (tf-idf) model is used to calculate the weight of each word, yielding the word distribution vector of each node in the tree structure;
    B3. Feature dimensionality reduction: to make the computation feasible, principal component analysis (PCA) is used to compress and reduce the dimensionality of the weighted word vector of each node in the tree structure.
  4. The e-book content representation method based on a Partial Reconstruction model according to claim 1, characterized in that step C comprises the following steps:
    C1. Establishing the Partial Reconstruction model: for a node in the tree that has child nodes, the information of the node is reconstructed from the information of its child nodes, and a cosine-type distance function is used to measure the reconstruction error of the child nodes' information with respect to the node's information;
    C2. Solving the Partial Reconstruction model to obtain the Partial Reconstruction coefficient vector, wherein the magnitude of each reconstruction coefficient indicates how well the corresponding child node can reconstruct its parent node's information: the larger the coefficient, the stronger the child node's ability to reconstruct its parent node.
  5. The e-book content representation method based on a Partial Reconstruction model according to claim 1, characterized in that step D comprises the following steps:
    D1. According to the reconstruction coefficient vector obtained by solving the Partial Reconstruction model, the feature vector of each child node is multiplied by its corresponding reconstruction coefficient, and the results are added, as a weighted sum, to the feature vector of the parent node, yielding the new feature representation of the parent node;
    D2. Proceeding bottom-up, the operation of the previous step is performed on the nodes of the tree layer by layer, until the e-book data represented by the tree structure is compressed into a unified vector representation.
  6. The e-book content representation method based on a Partial Reconstruction model according to claim 1, characterized in that step E comprises the following steps:
    E-books are retrieved using their unified vector representations; the cosine distance function is used to calculate the similarity between the query sample and the e-book samples in the database, producing a retrieval list of e-books and thereby realizing content-based e-book recommendation.
CN201710889265.5A 2017-08-17 2017-09-27 A kind of eBook content method for expressing based on Partial Reconstruction model Pending CN107861924A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2017107080797 2017-08-17
CN201710708079 2017-08-17

Publications (1)

Publication Number Publication Date
CN107861924A true CN107861924A (en) 2018-03-30

Family

ID=61698725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710889265.5A Pending CN107861924A (en) 2017-08-17 2017-09-27 A kind of eBook content method for expressing based on Partial Reconstruction model

Country Status (1)

Country Link
CN (1) CN107861924A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956019A (en) * 2019-11-27 2020-04-03 北大方正集团有限公司 List processing system, method, device and computer readable storage medium
CN113568999A (en) * 2021-07-09 2021-10-29 哈尔滨工业大学 Reviewer recommendation method based on tree structure representation

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105117386A (en) * 2015-09-19 2015-12-02 杭州电子科技大学 Semantic association method based on book content structures

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HAIJUN ZHANG ET AL.: "Recommending e-Books by Multi-layer Clustering and Locality Reconstruction", 《2017 IEEE 15TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956019A (en) * 2019-11-27 2020-04-03 北大方正集团有限公司 List processing system, method, device and computer readable storage medium
CN110956019B (en) * 2019-11-27 2021-10-26 北大方正集团有限公司 List processing system, method, device and computer readable storage medium
CN113568999A (en) * 2021-07-09 2021-10-29 哈尔滨工业大学 Reviewer recommendation method based on tree structure representation

Similar Documents

Publication Publication Date Title
CN108491377B (en) E-commerce product comprehensive scoring method based on multi-dimensional information fusion
CN104794212B (en) Context sensibility classification method and categorizing system based on user comment text
CN108427723B (en) Author recommendation method and system based on clustering algorithm and local perception reconstruction model
CN106484664B (en) Similarity calculating method between a kind of short text
CN105468605B (en) Entity information map generation method and device
RU2628436C1 (en) Classification of texts on natural language based on semantic signs
CN110188168A (en) Semantic relation recognition methods and device
CN106776711A (en) A kind of Chinese medical knowledge mapping construction method based on deep learning
CN108038725A (en) A kind of electric business Customer Satisfaction for Product analysis method based on machine learning
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
JP4038717B2 (en) Text sentence comparison device
CN106202184A (en) A kind of books personalized recommendation method towards libraries of the universities and system
CN111858940B (en) Multi-head attention-based legal case similarity calculation method and system
CN109657230A (en) Merge the name entity recognition method and device of term vector and part of speech vector
CN106776562A (en) A kind of keyword extracting method and extraction system
CN106447066A (en) Big data feature extraction method and device
Zhang et al. Combining sentiment analysis with a fuzzy kano model for product aspect preference recommendation
CN110046250A (en) Three embedded convolutional neural networks model and its more classification methods of text
JP2004110161A (en) Text sentence comparing device
CN108710663A (en) A kind of data matching method and system based on ontology model
CN107506377A (en) This generation system is painted in interaction based on commending system
CN105279264A (en) Semantic relevancy calculation method of document
Zhao et al. Contextual self-organizing map: software for constructing semantic representations
CN106897437B (en) High-order rule multi-classification method and system of knowledge system
CN113779387A (en) Industry recommendation method and system based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20180330)