CN105808729B - Academic big data analysis method based on citation relationships between papers - Google Patents


Info

Publication number
CN105808729B
Authority
CN
China
Prior art keywords
paper
data
citation relationship
dic
citation network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610131343.0A
Other languages
Chinese (zh)
Other versions
CN105808729A (en)
Inventor
谈兆炜
刘长风
周劲光
杜佳俊
骆铮
毛宇宁
沈嘉明
王彪
傅洛伊
王新兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN201610131343.0A
Publication of CN105808729A
Application granted
Publication of CN105808729B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3325: Reformulation based on results of preceding query
    • G06F16/3326: Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an academic big data analysis method based on citation relationships between papers, comprising: Step 1: after analyzing and processing a local paper data set accordingly, constructing a paper citation network in a database; Step 2: creating an analysis algorithm according to the citation relationships in the paper citation network, obtaining through the analysis algorithm the importance of the nodes in the paper citation network and their mutual relationships, and obtaining the importance of each paper relative to a central paper; Step 3: converting the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtaining the development path between specified papers in the paper citation network through an extraction algorithm, and calculating the importance of the path from the paper importance obtained in step 2. The method of the present invention makes it easy to analyze the citation relationships of the papers in a database and to obtain the development path between papers, which improves retrieval precision.

Description

Academic big data analysis method based on citation relationships between papers
Technical field
The present invention relates to the technical fields of big data processing and network analysis, and in particular to an academic big data analysis method based on citation relationships between papers.
Background
With the rapid development of information technology, people have more and more ways of obtaining data, and the total volume of data is growing explosively. Research by International Data Corporation (IDC) shows that the world generated 0.49 ZB of data in 2008, 0.8 ZB in 2009 and 1.2 ZB in 2010, and that in 2011 the figure reached 1.82 ZB, equivalent to more than 200 GB for every person on Earth. By comparison, all the printed material produced by humankind up to 2012 amounts to 200 PB. Research by IBM claims that 90% of all the data obtained over the whole of human civilization was generated in the past two years. By 2020, the volume of data generated worldwide is expected to reach 44 times today's level. As massive data sets have appeared, tools for processing them have emerged as well; Hadoop and Spark, for example, are both software frameworks for distributed processing of massive data.
In scientific research, as countries invest more and more in research activity, the direct output of research, namely papers, is also growing at an astonishing rate every year. Taking China as an example, according to the statistics on Chinese scientific papers announced by the Institute of Scientific and Technical Information of China on 26 September 2014, from 2004 to September 2014 China published a total of 1.3698 million international papers, ranking 2nd in the world; these papers were cited 10.3701 million times in total, ranking 4th in the world. Beyond their text, papers are linked by complex citation relationships that form a very large network, which in turn constitutes "academic big data". To help researchers quickly find the papers they want, some commercial companies, such as Microsoft and Google, have developed their own academic search systems.
In these academic search systems, many network algorithms are applied to rank papers, the best known being the PageRank algorithm and the HITS algorithm (Hypertext-Induced Topic Search). However, most of these algorithms consider only the global importance of a paper when ranking it and do not consider its importance relative to one specific paper. They are therefore unsuitable when the goal is to find the few papers that have influenced a given paper the most. In addition, beyond a ranking, most existing algorithms provide no further information about the development path between two papers. As a result, researchers cannot obtain research resources efficiently or gain a comprehensive grasp of the development trends in a research field.
Against this background, the present invention develops a system based on the paper citation network that can both analyze the importance of one paper relative to another specified paper and present the development path between two papers.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide an academic big data analysis method based on citation relationships between papers.
The academic big data analysis method based on citation relationships between papers provided according to the present invention includes the following steps:
Step 1: after analyzing and processing a local paper data set accordingly, construct a paper citation network in a database;
Step 2: create an analysis algorithm according to the citation relationships in the paper citation network, obtain through the analysis algorithm the importance of the nodes in the paper citation network and their mutual relationships, and obtain the importance of each paper relative to a central paper;
The central paper is the paper that the user looks up by entering a query. (It is assumed here that the user is interested in this paper and in the papers related to it, and wants to know the degree of contribution and the relative importance of other papers with respect to this paper, rather than their global importance.)
Step 3: convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtain the development path between specified papers in the paper citation network, and calculate the importance of the path from the paper importance obtained in step 2.
Preferably, the method further includes step a: carry out distributed processing of the analysis algorithm in step 2, so that the importance of every paper in the paper citation network relative to the central paper is computed on multiple nodes.
Preferably, the method further includes step b: carry out distributed processing of the extraction algorithm in step 3, so that the importance of the development path between specified papers in the paper citation network is computed on multiple nodes.
Preferably, step 1 includes:
Step 1.1: using text processing and analysis techniques, extract the citation information from the local paper data; the citation information records, for any paper in the paper collection, which papers it cites;
Step 1.2: build the paper citation network;
Step 1.3: compare the citation information in the resulting paper citation network and remove duplicate content, store it and build an index with database software, and store the citation relationships between papers in the database in the form of key-value pairs.
Preferably, step 2 includes:
Step 2.1: according to the citation relationships in the paper citation network, calculate the score based on the cross-citation structure;
Step 2.2: traverse the subgraph the user is concerned with by breadth-first search and calculate, for each paper, the ratio of its longest path to its shortest path relative to the central paper, as the score based on citation-level analysis. The formula is as follows:
Score based on citation-level analysis = longest path / shortest path;
Step 2.3: select appropriate weights and combine the score based on the cross-citation structure with the score based on citation-level analysis to give the final importance of each other paper relative to the central paper. The formula is as follows:
Importance = (weight of the cross-citation structure) * (score based on the cross-citation structure) + (weight of citation-level analysis) * (score based on citation-level analysis).
Preferably, step 3 includes:
Step 3.1: convert the one-to-one citation relationships between papers in the database into a mapping set in the citing direction and a mapping set in the cited direction;
Step 3.2: carry out a preliminary analysis of the citation relationships between papers, using the dictionary data structure of the Python programming language;
Step 3.3: extract the information about the paths between two papers.
Preferably, step 3.1 includes:
Initialize two dictionaries, ref_dic and refed_dic, where ref_dic represents the mapping from a central paper node to its multiple citing papers, and refed_dic represents the mapping from a central paper node to its multiple cited papers;
Traverse every row of data in the database and look up the left-hand data of the row among the keys of ref_dic. If the left-hand data is already a key of ref_dic, append the right-hand data of the row to the end of the entry for that key; if it is not, save the left-hand data as a new key with the right-hand data as its entry. In this way the one-to-one citation relationships between papers in the database are converted into the mapping set in the citing direction;
For refed_dic, the right-hand data serves as the key and the left-hand data as the entry. Look up the right-hand data of the row among the keys of refed_dic. If the right-hand data is already a key of refed_dic, append the left-hand data of the row to the end of the entry for that key; if it is not, save the right-hand data as a new key with the left-hand data as its entry. In this way the one-to-one citation relationships between papers in the database are converted into the mapping set in the cited direction.
Compared with the prior art, the present invention has the following beneficial effects:
1. The academic big data analysis method based on citation relationships between papers described in the present invention shows more effectively the contribution that other papers make to the paper the user queries and their relative importance, so that users can more easily start from one paper of interest and find other related papers.
2. The present invention further describes a distributed processing method for academic big data analysis based on citation relationships between papers, which helps to speed up the analysis. When a paper query system is built with the present invention, this distributed processing method can effectively reduce the response time of user queries and improve the user experience.
Description of the drawings
Other features, objects and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 is a flow chart of the distributed path-scoring algorithm;
Fig. 2 is a flow chart of the path search between two papers;
Fig. 3 is a flow chart of the path-labeling algorithm;
Fig. 4 is a diagram of the demonstration system for the path-labeling algorithm.
Detailed description of the embodiments
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to understand the present invention further, but do not limit the invention in any way. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept. These all fall within the scope of protection of the present invention.
The academic big data analysis method based on citation relationships between papers provided according to the present invention includes the following steps:
Step 1: after analyzing and processing a local paper data set accordingly, construct a paper citation network in a database;
Step 2: create an analysis algorithm according to the citation relationships in the paper citation network, obtain through the analysis algorithm the importance of the nodes in the paper citation network and their mutual relationships, and obtain the importance of each paper relative to a central paper;
Step a: carry out distributed processing of the analysis algorithm in step 2, so that the importance of every paper in the paper citation network relative to the central paper is computed on multiple nodes.
Step 3: convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, and obtain the development path between specified papers in the paper citation network through an extraction algorithm;
Step b: carry out distributed processing of the extraction algorithm in step 3, so that the importance of the development path between specified papers in the paper citation network is computed on multiple nodes.
Specifically, step 1 includes:
Step 1.1: using text processing and analysis techniques, extract the citation information from the local paper data; the citation information records, for any paper in the paper collection, which papers it cites;
Step 1.2: build the paper citation network. The construction algorithm is as follows:
Construct an empty vertex set V and an empty edge set E. For each paper u in the database, first add u to the vertex set V as a paper node; then, for each paper v that u cites, construct the directed edge e(u, v), add e(u, v) to the edge set E, and set its weight to 1. Once this operation has been carried out for all the papers in the database, the vertex set and the edge set together make up the paper citation network G(V, E).
Step 1.3: use a computer program to compare the resulting paper citation network data uniformly and remove duplicates, store the data and build an index with database software, and store the citation relationships between papers in the database in the form of key-value pairs.
More specifically, starting from the local paper data set, the text is cleaned with Python and Mathematica tools: meaningless characters such as spaces and newlines are removed, a unique number is assigned to every paper, and the data are organized into a format that is convenient to analyze. Once the references of each paper have been obtained, the number of each referenced paper is found by lookup, the citation relationships between the data are extracted, and database technology is used to store the citation relationships in the database in the form of key-value pairs. For the resulting network information, which contains vertices and edges, the citing paper and the cited paper of each citation are stored in the database as the two components of a key-value pair. The entries in the database are then uniformly deduplicated in an automated, structured way, stored and indexed.
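The construction in step 1.2 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `build_citation_network` and the input format (a dictionary from a paper number to the numbers of the papers it cites) are assumptions introduced here. Storing the edges as dictionary keys also removes duplicate citations, in the spirit of step 1.3.

```python
def build_citation_network(citations):
    """Build G(V, E) from a mapping {paper: [papers it cites]}.

    The input format is an assumption made for this sketch; the source
    only says that the citation information is extracted from the data set.
    """
    vertices = set()   # vertex set V
    edges = {}         # edge set E: (u, v) -> weight
    for u, cited in citations.items():
        vertices.add(u)
        for v in cited:
            # Assumption beyond the source: also register v, so that
            # every edge endpoint is guaranteed to be a vertex.
            vertices.add(v)
            # Directed edge e(u, v) with weight 1; a repeated citation
            # overwrites the same key, which removes duplicates.
            edges[(u, v)] = 1
    return vertices, edges


V, E = build_citation_network({"p1": ["p2", "p3", "p2"], "p2": ["p3"], "p3": []})
```

Here `V` holds three paper nodes and `E` three weighted edges, the duplicate citation from `p1` to `p2` having been collapsed.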
Step 2 includes:
Step 2.1: according to the citation relationships in the paper citation network, calculate the score based on the cross-citation structure. The algorithm is as follows:
Input: the paper citation network G and the central paper v0. Output: the importance, relative to the central paper, of every paper in the paper citation network other than the central paper.
Create an empty processing queue Q and an empty set S of papers that have already been scored.
Add the central paper to the processing queue Q.
Add the central paper to the set S of scored papers.
While the queue is not empty, do the following:
    Dequeue one paper v.
    Let n be the number of papers that v cites.
    For each paper v' cited by v, do the following:
        Divide the score of v into n equal parts and call one part delta_score.
        Add delta_score to the score of v' to give its new score.
        If this is the first time the score of v' has been updated, do the following:
            Add v' to the processing queue Q.
            Add v' to the set S of scored papers.
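The queue-based procedure of step 2.1 might look like this in Python. It is a hedged sketch: the name `propagate_scores` and the dictionary input `cites` (paper to list of cited papers) are hypothetical, and the initial score of 1.0 for the central paper is taken from the description of the distributed variant rather than from the pseudocode itself.

```python
from collections import deque


def propagate_scores(cites, center):
    """Score papers relative to `center` by passing scores along citations.

    `cites` maps a paper to the list of papers it cites (assumed format).
    """
    scores = {center: 1.0}    # assumed initial score of the central paper
    scored = {center}         # set S of papers that have been scored
    queue = deque([center])   # processing queue Q
    while queue:
        v = queue.popleft()
        cited = cites.get(v, [])
        n = len(cited)
        for w in cited:
            delta_score = scores[v] / n            # equal share of v's score
            scores[w] = scores.get(w, 0.0) + delta_score
            if w not in scored:                    # first update of w
                scored.add(w)
                queue.append(w)
    # The output excludes the central paper itself.
    return {p: s for p, s in scores.items() if p != center}


example = {"c": ["a", "b"], "a": ["b"], "b": ["f"]}
result = propagate_scores(example, "c")
```

In this small example paper `b` is cited both by the central paper and by `a`, so it collects two contributions and ends up with a higher score than `a`.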
Step 2.2: traverse the subgraph the user is concerned with by breadth-first search and calculate, for each paper, the ratio of its longest path to its shortest path relative to the central paper, as the score based on citation-level analysis. The formula is as follows:
Score based on citation-level analysis = longest path / shortest path;
Step 2.3: select optimal parameters and combine the score based on the cross-citation structure with the score based on citation-level analysis to give the final importance of each other paper relative to the central paper. The formula is as follows:
Importance = (weight of the cross-citation structure) * (score based on the cross-citation structure) + (weight of citation-level analysis) * (score based on citation-level analysis).
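Steps 2.2 and 2.3 can be illustrated with the sketch below. It rests on assumptions that the source does not state: the subgraph reachable from the central paper is taken to be acyclic (so that a longest path is well defined and can be found by relaxing edges in topological order), and the names `level_scores` and `combined_importance` are hypothetical. Shortest paths come from a breadth-first search, as in the text.

```python
from collections import deque


def level_scores(cites, center):
    """Return {paper: longest_path / shortest_path} measured from `center`.

    Assumes the subgraph reachable from `center` has no citation cycles.
    """
    # Shortest path lengths by breadth-first search.
    shortest = {center: 0}
    queue = deque([center])
    while queue:
        v = queue.popleft()
        for w in cites.get(v, []):
            if w not in shortest:
                shortest[w] = shortest[v] + 1
                queue.append(w)
    # Longest path lengths: process papers in topological order.
    indegree = {p: 0 for p in shortest}
    for v in shortest:
        for w in cites.get(v, []):
            indegree[w] += 1
    longest = {center: 0}
    ready = deque([center])
    while ready:
        v = ready.popleft()
        for w in cites.get(v, []):
            longest[w] = max(longest.get(w, 0), longest[v] + 1)
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    return {p: longest[p] / shortest[p] for p in shortest if p != center}


def combined_importance(structure, level, w_structure, w_level):
    """Weighted sum of the two scores, following the formula in step 2.3."""
    return {p: w_structure * structure.get(p, 0.0) + w_level * level.get(p, 0.0)
            for p in set(structure) | set(level)}


example = {"c": ["a", "b"], "a": ["b"], "b": ["f"]}
levels = level_scores(example, "c")
importance = combined_importance({"a": 0.5, "b": 1.0, "f": 1.0}, levels, 0.5, 0.5)
```

Paper `b` can be reached from the central paper both directly and through `a`, so its ratio is 2.0, which is the pattern the text associates with seminal papers. The weights of 0.5 are placeholders only; the source obtains its weights by optimization against recommended paper sets.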
Step a includes: implement the analysis algorithm of step 2 as a distributed program so that it can run on a distributed system. Using the Spark parallel framework, the data are kept in memory, interaction with the front end is maintained, and the computation is carried out in parallel on multiple computing nodes. This greatly increases the running speed of the program and allows results to be supplied dynamically for display on the front end. The algorithm is as follows:
The implementation of step 2 above uses a queue and processes one paper node per iteration, whereas the distributed implementation processes one layer of papers per iteration. First, the score of the central paper is initialized to 1.0, and the first layer contains only the central paper. In each round, the score of every paper in the previous layer is divided equally among the papers it cites, and a distributed Map method generates the score contribution for each cited paper. Once the score contributions of all the papers in the layer have been calculated, the contributions are added to the existing score of each paper, using a distributed Reduce method. The next layer is then set to the papers that were cited for the first time in this round (excluding papers visited earlier). The loop ends when the next layer is empty or the search has run for too long, and the results are then returned to the front end.
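The layer-by-layer scheme can be sketched in plain Python, with the map and reduce phases written as ordinary loops so that the example runs without a cluster. The name `layered_scores` is hypothetical, and `max_layers` is an assumed stand-in for the "search has run for too long" condition. In Spark the two phases would presumably correspond to operations such as `flatMap` and `reduceByKey`, but the source does not say which calls were used.

```python
from collections import defaultdict


def layered_scores(cites, center, max_layers=50):
    """Layer-synchronous scoring: one citation layer per iteration."""
    scores = {center: 1.0}
    visited = {center}
    layer = [center]
    for _ in range(max_layers):   # assumed cap replacing a time limit
        if not layer:
            break
        # "Map": every paper in the layer emits (cited paper, contribution).
        contributions = []
        for v in layer:
            cited = cites.get(v, [])
            for w in cited:
                contributions.append((w, scores[v] / len(cited)))
        # "Reduce": sum the contributions per cited paper.
        totals = defaultdict(float)
        for w, delta in contributions:
            totals[w] += delta
        # Scores are updated only after the whole layer has been processed.
        for w, delta in totals.items():
            scores[w] = scores.get(w, 0.0) + delta
        # Next layer: papers cited for the first time in this round.
        layer = [w for w in totals if w not in visited]
        visited.update(layer)
    return scores


example = {"c": ["a", "b"], "a": ["b"], "b": ["f"]}
result = layered_scores(example, "c")
```

Because each layer's contributions are computed from the scores as they stood at the start of the round, the per-paper calculations are independent of one another and could be distributed. The results can differ from those of the queue-based version, since that version may pass on a score that has already been increased within the same layer.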
More specifically, the score based on the cross-citation structure and the score based on citation-level analysis are each calculated by analyzing the topology of the network. Finally, by tuning the parameters, the two scores are combined to give the importance of other papers relative to the central paper.
First, the score based on the cross-citation structure is calculated. In general, the reference list of a paper contains both papers that had a major influence on it and some unimportant papers, such as those cited as background. Among the directly cited papers that had a major influence on the central paper, there may be seminal papers published in the field long ago, and there may also be papers published only recently that directly inspired the central paper. These recently published cited papers, because they belong to the same field, are also very likely to cite the same seminal papers. Starting from the central paper and following the structure of the citation network, the score of each paper is divided equally among the papers it cites, and these score increments are used to update the scores of the papers layer by layer. Since the seminal papers of a field are cited by papers at many different levels, they obtain higher scores. The citation path from a high-scoring paper to the central paper shows how the research developed and evolved, and building multiple citation paths shows the full academic map centered on the central paper.
After the score based on the cross-citation structure has been calculated, every paper has a preliminary score relative to the central paper. To reflect the attributes of each paper more fully, the score based on citation-level analysis is used to characterize how seminal a paper is: the higher the score, the more seminal the paper. The longest path from a paper to the central paper is defined as the level of that paper relative to the central paper. A seminal paper in a field is very likely to be cited directly, across several layers, by most of the subsequent papers in the field. These subsequent papers, in turn, cite one another one layer at a time, in the order in which the research developed. Because of this relationship, a seminal paper can be reached from the central paper both by a very long path and by a very short path. The ratio of the two is therefore used as the score that characterizes how seminal a paper is. A subgraph that the user is concerned with is traversed by breadth-first search, and the ratio of the longest path to the shortest path of each paper relative to the central paper is calculated as the score based on citation-level analysis. The recommended paper sets of several fields are used as a training set, and, with the goal of giving high scores to the papers that appear in these recommended sets, the algorithm is optimized to obtain suitable parameters. Using the parameters as weight coefficients, the score based on the cross-citation structure and the score based on citation-level analysis are combined to give the final importance of other papers relative to the central paper.
More specifically, the algorithm is implemented following the design principles of distributed programs. Taking the citation layer as its unit, the program calculates the changes in the scores of the papers in the citation network layer by layer, stores the changes in temporary variables, and updates the scores together once each layer has been calculated. Since the calculations for the individual papers are independent of one another when the score changes of the cited papers are computed, the program assigns these independent calculations to different computing nodes and runs them in parallel, which greatly increases the computation speed. Updating the scores together also avoids database locking during the calculation. Key data such as the paper numbers, reference lists and scores are held in memory, and the program runs on a distributed system with multiple computing nodes using the Spark parallel framework. When the calculation is complete, text information such as the title, publication date and journal or conference of each paper is read from the database, processed by the program into the data format the front end requires, and sent to the front end for display. When the front end needs to change the central paper, the program receives the information from the front end and carries out a new calculation, so that data are supplied to the front end dynamically and the user is presented with an interactive interface and functions.
Further, whereas the queue-based algorithm processes one paper node per iteration, the distributed implementation in step a processes one layer of papers per iteration. First, the score of the central paper is initialized to 1.0, and the first layer contains only the central paper. In each round, the score of every paper in the previous layer is divided equally among the papers it cites, and a distributed Map method generates the score contribution for each cited paper. Once the score contributions of all the papers in the layer have been calculated, the contributions are added to the existing score of each paper, using a distributed Reduce method. The next layer is then set to the papers that were cited for the first time in this round (excluding papers visited earlier). The loop ends when the next layer is empty.
Step 3 includes:
Step 3.1: convert the one-to-one citation relationships between papers in the database into a mapping set in the citing direction and a mapping set in the cited direction;
Step 3.2: carry out a preliminary analysis of the citation relationships between papers, using the dictionary data structure of the Python programming language (i.e. node 1: [corresponding node 1, corresponding node 2, ...], node 2: [corresponding node 1, corresponding node 2, ...], node 3: ...) to store the nodes together with the associated direction and length information;
Step 3.3: extract the information about the paths between two papers. (Note: depth-first search is a graph algorithm which, put briefly, follows each possible branch as deep as it can go until it can go no further, visiting each node only once. Breadth-first traversal is a traversal strategy for connected graphs; its idea is to start from a vertex V0 and radiate outwards, first traversing the wider region around it.)
More specifically, to explore the number of paths that may exist between any two papers and the lengths of those paths, the first step is naturally to extract the information from the database. Since only the relationships between papers need to be examined, each paper is idealized as a single point; if the contents of the papers themselves later need to be considered, a point can also store the location and storage characteristics of the related content. In the database obtained in step 1, what is stored is the relationship between each pair of nodes. Because the storage is pairwise, many nodes are stored repeatedly. Although these records are all necessary for the database, a point with several adjacent nodes appears several times, which runs counter to human intuition about topological structure and makes the data very inconvenient to read. The data are therefore preprocessed so that the relationships between citing papers and cited papers can be obtained effectively. The specific algorithm is as follows:
First, initialize the dictionaries ref_dic = {} and refed_dic = {}. ref_dic represents the mapping from a central paper node to its multiple citing papers, and refed_dic represents the mapping from a central paper node to its multiple cited papers. Then take the paper data out of the database. First determine whether the central paper node already exists among the keys of the dictionary, and handle the two cases separately. If the paper node already exists among the keys, simply append the new citing or cited paper to the end of its dictionary entry. If the paper node has never appeared among the keys, add it as a new key and add the new citing or cited paper as its entry. The two resulting dictionaries contain the relationships from each central point to its adjacent points, converted from the one-to-one structure in the database.
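The preprocessing just described amounts to grouping the pairwise rows by their left-hand and right-hand values. A minimal sketch, assuming the rows arrive as `(left, right)` tuples; the function name `build_mappings` is hypothetical, while `ref_dic` and `refed_dic` are the names used in the text:

```python
def build_mappings(rows):
    """Turn one-to-one (left, right) rows into two one-to-many dictionaries."""
    ref_dic = {}     # keyed by the left-hand value of each row
    refed_dic = {}   # keyed by the right-hand value of each row
    for left, right in rows:
        # setdefault creates the entry when the key is new and otherwise
        # returns the existing list, so both cases append at the end.
        ref_dic.setdefault(left, []).append(right)
        refed_dic.setdefault(right, []).append(left)
    return ref_dic, refed_dic


rows = [("p1", "p2"), ("p1", "p3"), ("p2", "p3")]
ref_dic, refed_dic = build_mappings(rows)
```

After this pass each paper appears once as a key in each dictionary, with all of its neighbours in that direction gathered into one list, which removes the repetition that pairwise storage causes.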
After the dic for obtaining indexing with person who quote and the dic to index with the paper that is cited, any two points will be explored Between number of passes and length.Since paper reference is very special with the topological relation that is cited, most paper all can only It is directed toward the paper node that it is quoted, small part paper has more reference quantity, becomes the central node in structure.It considers If being separated by between two paper nodes, node is too many, and often correlation can be extremely weak, so in order to keep search more effective and rapid, Maximum length scope provided with searching route.Then the algorithm process of breadth First and depth-first has been used respectively, substantially Thought is as follows:
Breadth-first search is taken as the example; the changes needed for depth-first search are noted in parentheses.
First, a dictionary Dir = {} is initialized to record the citation direction between papers, i.e. whether an edge points from the citer to the cited paper or the other way round. Another dictionary p_dic = {} is initialized to record the nodes already traversed.
Then a list p_list is initialized (for breadth-first search it acts as a queue: elements are appended at the tail of p_list and removed from its head; for depth-first search it acts as a stack: elements are added and removed at the tail of p_list). The starting paper node and its related information, such as path length and direction, are added to the processing list p_list.
The following loop runs while p_list is non-empty. The number of iterations is recorded, and the loop also terminates once that number exceeds the required number of steps. The loop body is:
Dequeue a paper v. Store all citing papers that v maps to in ref_dic in the list a_list, and all cited papers that v maps to in refed_dic in the list b_list.
Traverse a_list. If a paper v' is the target paper node, output it, assign v to a temporary node, and search backwards along the path saved in the dictionary Dir until the start node is found; then output the saved nodes and direction information.
If v' is not the target, check whether it has already been traversed. If it has not, set Dir[v'] ← (v, 0) {v is the paper that v' cites; 0 denotes the direction from cited paper to citer}, which stores the mapping from v to v' in the dictionary Dir. Paper v' and its direction are added to the set of traversed papers p_dic. Then v' is enqueued, i.e. added to p_list.
Then traverse b_list. The process is the same as for a_list, except that this step uses Dir[v'] ← (v, 1).
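The search loop above might look roughly like the following sketch. It is an illustration under stated assumptions rather than the patented code: the function names find_path and _backtrack are hypothetical, the traversed set p_dic is simplified to a Python set, and the step limit is applied as a maximum depth per node instead of a count of loop iterations.

```python
from collections import deque


def find_path(ref_dic, refed_dic, start, target, max_steps=6, breadth_first=True):
    """Return [(node, direction_flag), ...] from start to target, or None."""
    if start == target:
        return [(start, None)]
    direction_of = {}               # plays the role of Dir: node -> (predecessor, flag)
    visited = {start}               # plays the role of p_dic
    pending = deque([(start, 0)])   # plays the role of p_list: (paper, depth)
    while pending:
        # Queue behaviour for breadth-first, stack behaviour for depth-first.
        v, depth = pending.popleft() if breadth_first else pending.pop()
        if depth >= max_steps:
            continue  # assumed depth bound; the node is not expanded further
        # Flag 0 marks neighbours found through ref_dic, flag 1 through refed_dic.
        for flag, neighbours in ((0, ref_dic.get(v, [])), (1, refed_dic.get(v, []))):
            for nxt in neighbours:
                if nxt in visited:
                    continue
                visited.add(nxt)
                direction_of[nxt] = (v, flag)
                if nxt == target:
                    return _backtrack(direction_of, start, target)
                pending.append((nxt, depth + 1))
    return None


def _backtrack(direction_of, start, target):
    """Walk the saved predecessors back from target to start."""
    path = []
    node = target
    while node != start:
        prev, flag = direction_of[node]
        path.append((node, flag))
        node = prev
    path.append((start, None))
    path.reverse()
    return path
```

Each entry of the returned path carries the flag of the edge used to reach that node, so the citation direction along the path can be read off directly. The start node has no incoming edge and therefore carries None.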
Research on a given paper data set can then be carried out conveniently. For different data sets, the two indicators of preprocessing time and searchable range will differ, and the number and length of the paths between two paper nodes are obtained under various conditions.
The step b includes: rewriting the extraction algorithm of step 3 as a distributed program so that it can run on a distributed system. The distributed algorithm is as follows:
More specifically, the distributed algorithm for marking the paths between two papers in the academic graph is divided into two parts: forward path marking and backward path marking.
During forward path marking, each iteration of the loop computes one layer of papers. The first layer is initialized to the start paper. In each round, the citation information is used to mark the paths from every paper in the previous layer to the papers it cites. The next layer is then set to the papers cited for the first time in this round (excluding the end paper and any paper visited earlier). The loop ends when the next layer is empty.
During backward path marking, each iteration again computes one layer of papers. The first layer is initialized to the end paper. In each round, the path information computed during forward path marking is used to mark the paths from every paper in the previous layer to its parent papers. The next layer is then set to the parent papers visited for the first time in this round (excluding the start paper and any paper visited earlier). The loop ends when the next layer is empty. At this point, all paths from the start to the end have been marked.
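The two marking passes can be illustrated with the single-process sketch below. It is not the distributed implementation: the per-layer loops are the parts that a distributed system would presumably spread across nodes, and the names mark_paths and cites (a mapping from each paper to the papers it cites) are assumptions made for this example.

```python
from collections import defaultdict


def mark_paths(cites, start, end):
    """Return the set of (citing, cited) edges lying on paths from start to end."""
    # Forward pass: one layer per round, recording every parent of each cited paper.
    parents = defaultdict(set)
    layer, seen = {start}, {start}
    while layer:
        next_layer = set()
        for paper in layer:
            for cited in cites.get(paper, ()):
                parents[cited].add(paper)            # mark the path paper -> cited
                if cited != end and cited not in seen:
                    next_layer.add(cited)            # cited for the first time this round
        seen |= next_layer
        layer = next_layer

    # Backward pass: walk from the end paper up through the recorded parents.
    edges = set()
    layer, seen = {end}, {end}
    while layer:
        next_layer = set()
        for paper in layer:
            for parent in parents.get(paper, ()):
                edges.add((parent, paper))           # this edge lies on a start-to-end path
                if parent != start and parent not in seen:
                    next_layer.add(parent)           # parent visited for the first time
        seen |= next_layer
        layer = next_layer
    return edges
```

The backward pass keeps only edges that actually lead to the end paper, so branches explored in the forward pass that never reach it are dropped.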
Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to the particular embodiments described above; those skilled in the art may make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims (5)

1. An academic big data analysis method based on citation relationships between papers, characterized by comprising the following steps:
Step 1: after analyzing and processing a local paper data set, construct a paper citation network in a database;
Step 2: create an analysis algorithm according to the citation relationships in the paper citation network, use the analysis algorithm to obtain the importance of the nodes in the paper citation network and their mutual relationships, and obtain the importance of each paper relative to a center paper;
The center paper refers to: the paper that the user queries by input;
Step 3: convert the one-to-one citation relationships between papers into a mapping set in the citing direction and a mapping set in the cited direction, obtain the development path between specified papers in the paper citation network, and calculate the importance of the path from the paper importance obtained in step 2;
The step 3 includes:
Step 3.1: convert the one-to-one citation relationships between papers in the database into a mapping set in the citing direction and a mapping set in the cited direction;
Step 3.2: preliminarily analyze the citation relationships between papers, using the dictionary data structure of the Python programming language;
Step 3.3: extract the information on the paths between two papers;
The step 3.1 includes:
Initialize two dictionaries, ref_dic and refed_dic, where ref_dic holds the mapping from a central paper node to its multiple citers, and refed_dic holds the mapping from a central paper node to its multiple cited papers;
Traverse each row of data in the database and look up the left-hand data of the row among the keys of the ref_dic dictionary: if the left-hand data of the row is already a key of ref_dic, append the right-hand data of the row to the tail of the entry for that key; if it is not a key of ref_dic, save the left-hand data as a new key with the right-hand data as its entry, so that the one-to-one citation relationships between papers in the database are converted into the mapping set in the citing direction;
For the refed_dic dictionary, the right-hand data serve as the key and the left-hand data as the entry: look up the right-hand data of the row among the keys of refed_dic; if the right-hand data of the row is already a key of refed_dic, append the left-hand data of the row to the tail of the entry for that key; if it is not a key of refed_dic, save the right-hand data as a new key with the left-hand data as its entry, so that the one-to-one citation relationships between papers in the database are converted into the mapping set in the cited direction.
2. The academic big data analysis method based on citation relationships between papers according to claim 1, characterized by further comprising step a: performing distributed processing of the analysis algorithm in step 2, and calculating, through multiple nodes of the paper citation network, the importance of every paper in the paper citation network relative to the center paper.
3. The academic big data analysis method based on citation relationships between papers according to claim 1, characterized by further comprising step b: performing distributed processing of the extraction algorithm in step 3, and calculating, through multiple nodes of the paper citation network, the importance of the development path between specified papers in the paper citation network.
4. The academic big data analysis method based on citation relationships between papers according to claim 1, characterized in that the step 1 includes:
Step 1.1: using text processing and analysis techniques, extract the citation information from the local paper data, the citation information stating which papers any given paper in the paper collection cites;
Step 1.2: construct the paper citation network;
Step 1.3: compare the citation information in the obtained paper citation network and remove duplicate content, store it with database software and build an index, the citation relationships between papers being stored in the database in the form of key-value pairs.
5. The academic big data analysis method based on citation relationships between papers according to claim 1, characterized in that the step 2 includes:
Step 2.1: according to the citation relationships in the paper citation network, calculate the score based on link structure;
Step 2.2: traverse the subgraph of interest to the user with a breadth-first search algorithm, and calculate for each paper the ratio of its longest path to its shortest path relative to the center paper as the score based on citation-level analysis, the calculation formula being as follows:
Score based on citation-level analysis = longest path / shortest path;
Step 2.3: select corresponding weights and combine the score based on link structure with the score based on citation-level analysis into the final importance of each other paper relative to the center paper, the calculation formula being as follows:
Importance = weight of link structure * score based on link structure + weight of citation-level analysis * score based on citation-level analysis.
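As a small worked illustration of the two formulas in steps 2.2 and 2.3, the sketch below computes the ratio score and the weighted sum. The function names, the default weights of 0.5, and the sample numbers are all assumptions chosen for this example; the claims do not fix any weight values or say how the first score of step 2.1 is obtained.

```python
def step_score(longest_path, shortest_path):
    """Score of step 2.2: longest path length divided by shortest path length."""
    if shortest_path <= 0:
        raise ValueError("shortest_path must be positive")
    return longest_path / shortest_path


def importance(link_score, level_score, link_weight=0.5, level_weight=0.5):
    """Weighted sum of step 2.3; the default weights are illustrative only."""
    return link_weight * link_score + level_weight * level_score


if __name__ == "__main__":
    level = step_score(longest_path=6, shortest_path=2)
    print(importance(link_score=0.4, level_score=level))
```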
CN201610131343.0A 2016-03-08 2016-03-08 Academic big data analysis method based on adduction relationship between paper Active CN105808729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610131343.0A CN105808729B (en) 2016-03-08 2016-03-08 Academic big data analysis method based on adduction relationship between paper

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610131343.0A CN105808729B (en) 2016-03-08 2016-03-08 Academic big data analysis method based on adduction relationship between paper

Publications (2)

Publication Number Publication Date
CN105808729A CN105808729A (en) 2016-07-27
CN105808729B true CN105808729B (en) 2019-08-23

Family

ID=56467913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610131343.0A Active CN105808729B (en) 2016-03-08 2016-03-08 Academic big data analysis method based on adduction relationship between paper

Country Status (1)

Country Link
CN (1) CN105808729B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846019B (en) * 2018-05-08 2019-05-21 北京市科学技术情报研究所 A kind of paper sort method based on gold reference algorithm
CN108874990A (en) * 2018-06-12 2018-11-23 亓富军 A kind of method and system extracted based on power technology journal article unstructured data
CN110119412B (en) * 2019-04-16 2023-01-03 南京昆虫软件有限公司 Method for distinguishing source database of quotation
CN112612785B (en) * 2020-11-20 2023-11-17 北京理工大学 Dynamic monitoring method for key development path of unconventional energy technology
CN113515589A (en) * 2021-01-12 2021-10-19 腾讯科技(深圳)有限公司 Data recommendation method, device, equipment and medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104537063A (en) * 2014-12-29 2015-04-22 北京理工大学 Knowledge venation map construction system and method based on thesis citation network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090043797A1 (en) * 2007-07-27 2009-02-12 Sparkip, Inc. System And Methods For Clustering Large Database of Documents

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
基于PDA模式的电站新机组调试专家***研究;汪尚兵;《中国优秀硕士学位论文全文数据库工程科技Ⅱ辑》;20110815;第37-38页
基于权力指数的引文网络分析方法探讨;纪雪梅等;《图书情报工作》;20091231;第111-114页

Also Published As

Publication number Publication date
CN105808729A (en) 2016-07-27

Similar Documents

Publication Publication Date Title
CN105808729B (en) Academic big data analysis method based on adduction relationship between paper
CN109492077B (en) Knowledge graph-based petrochemical field question-answering method and system
CN104102745B (en) Complex network community method for digging based on Local Minimum side
CN106503148B (en) A kind of table entity link method based on multiple knowledge base
CN104063507B (en) A kind of figure computational methods and system
CN103314371B (en) A kind of method and system of retrieval
CN107220277A (en) Image retrieval algorithm based on cartographical sketching
CN104008106B (en) A kind of method and device obtaining much-talked-about topic
CN112434169A (en) Knowledge graph construction method and system and computer equipment
CN104778256B (en) A kind of the quick of field question answering system consulting can increment clustering method
CN106021457A (en) Keyword-based RDF distributed semantic search method
CN105631018B (en) Article Feature Extraction Method based on topic model
CN102521364B (en) Method for inquiring shortest path between two points on map
CN106547864A (en) A kind of Personalized search based on query expansion
CN108804516A (en) Similar users search device, method and computer readable storage medium
CN102750375A (en) Service and tag recommendation method based on random walk
CN113254630B (en) Domain knowledge map recommendation method for global comprehensive observation results
CN106991614A (en) The parallel overlapping community discovery method propagated under Spark based on label
CN106528648A (en) Distributed keyword approximate search method for RDF in combination with Redis memory database
CN104391969B (en) Determine the method and device of user's query statement syntactic structure
CN103761286B (en) A kind of Service Source search method based on user interest
CN108228787A (en) According to the method and apparatus of multistage classification processing information
CN109299443B (en) News text duplication eliminating method based on minimum vertex coverage
CN117151659B (en) Ecological restoration engineering full life cycle tracing method based on large language model
CN112784049B (en) Text data-oriented online social platform multi-element knowledge acquisition method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant