CN111914092A - Information processing apparatus, method, and medium for author disambiguation

Info

Publication number: CN111914092A
Application number: CN201910384663.0A
Authority: CN
Inventors: 夏迎炬; 郑仲光; 孟遥; 陈炎
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2019-05-09
Filing date: 2019-05-09
Publication date: 2020-11-10
Also published as: JP2020187738A

Abstract

An information processing apparatus, method, and medium for author disambiguation are disclosed. The device comprises: a graph construction unit configured to construct a knowledge graph based on entities extracted from a document repository and attributes thereof, the entities including author entities and associated entities thereof; a traversal unit configured to traverse the constructed knowledge-graph to obtain a sequence of nodes about the author; an alignment unit configured to perform node alignment based on the attribute for the node sequence; and a calculation unit configured to calculate a similarity between the aligned sequences of nodes, wherein the author disambiguation is performed according to the calculated similarity.

Description

Information processing apparatus, method, and medium for author disambiguation

Technical Field

The present disclosure relates to the field of information processing, and in particular, to an information processing apparatus and method for author disambiguation.

Background

This section provides background information related to the present disclosure, which is not necessarily prior art.

Most studies based on bibliometric data, as well as research evaluation, require that specific bibliographic records can be attributed to individual researchers. One practical problem is that this attribution involves a degree of ambiguity; resolving it is known as author disambiguation. The problem manifests itself in two ways: a given individual may be identified as two or more authors, or two or more individuals may be identified as a single author. Given the large number of researchers active in most disciplines, the root cause of the problem is that author names alone do not clearly distinguish one individual from another.

Disclosure of Invention

This section provides a general summary of the disclosure, and is not a comprehensive disclosure of its full scope or all of its features.

The technical solution of the present disclosure performs author disambiguation using knowledge graph similarity, where the similarity is calculated using sequences of nodes extracted from the knowledge graph. By using the relationships between different nodes, the present disclosure provides a more efficient method of author disambiguation.

According to an aspect of the present disclosure, there is provided an information processing apparatus for author disambiguation, including: a graph construction unit configured to construct a knowledge graph based on entities extracted from a document repository and attributes thereof, the entities including author entities and associated entities thereof; a traversal unit configured to traverse the constructed knowledge-graph to obtain a sequence of nodes about the author; an alignment unit configured to perform node alignment based on the attribute for the node sequence; and a calculation unit configured to calculate a similarity between the aligned sequences of nodes, wherein the author disambiguation is performed according to the calculated similarity.

According to another aspect of the present disclosure, there is provided an information processing method for author disambiguation, including: constructing a knowledge graph based on entities extracted from a document library and attributes thereof, the entities including author entities and associated entities; traversing the constructed knowledge graph to obtain a node sequence related to an author; performing node alignment based on the attributes for the sequence of nodes; and calculating a similarity between the aligned sequences of nodes, wherein the author disambiguation is performed according to the calculated similarity.

According to another aspect of the present disclosure, there is provided a program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform an information processing method for author disambiguation according to the present disclosure.

According to another aspect of the present disclosure, a machine-readable storage medium is provided, having embodied thereon a program product according to the present disclosure.

Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.

Drawings

The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure. In the drawings:

FIG. 1 is a block diagram of an information processing apparatus 100 for author disambiguation according to one embodiment of the present disclosure;

FIG. 2 illustrates a portion of a knowledge-graph according to one embodiment of the present disclosure;

FIG. 3 is a flow diagram of an information processing method for author disambiguation according to one embodiment of the present disclosure; and

FIG. 4 is a block diagram of an exemplary structure of a general-purpose personal computer in which the information processing apparatus and method for author disambiguation according to the embodiment of the present disclosure can be implemented.

While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure. It is noted that throughout the several views, corresponding reference numerals indicate corresponding parts.

Detailed Description

Examples of the present disclosure will now be described more fully with reference to the accompanying drawings. The following description is merely exemplary in nature and is not intended to limit the present disclosure, application, or uses.

Example embodiments are provided so that this disclosure will be thorough, and will fully convey the scope to those skilled in the art. Numerous specific details are set forth such as examples of specific components, devices, and methods to provide a thorough understanding of embodiments of the present disclosure. It will be apparent to those skilled in the art that specific details need not be employed, that example embodiments may be embodied in many different forms, and that neither should be construed to limit the scope of the disclosure. In certain example embodiments, well-known processes, well-known structures, and well-known technologies are not described in detail.

According to an embodiment of the present disclosure, there is provided an information processing apparatus for author disambiguation, including: a graph construction unit configured to construct a knowledge graph based on entities extracted from a document repository and attributes thereof, the entities including author entities and associated entities thereof; a traversal unit configured to traverse the constructed knowledge-graph to obtain a sequence of nodes about the author; an alignment unit configured to perform node alignment based on the attribute for the node sequence; and a calculation unit configured to calculate a similarity between the aligned sequences of nodes, wherein the author disambiguation is performed according to the calculated similarity.

As shown in fig. 1, the information processing apparatus for author disambiguation according to the present disclosure may include a graph construction unit 101, a traversal unit 102, an alignment unit 103, and a calculation unit 104.

First, the graph construction unit 101 may be configured to construct a knowledge graph based on entities extracted from a document repository and their attributes, the entities including author entities and their associated entities. The document repository may be any existing document repository or a combination of such repositories. An entity may be, for example, an author, an article, an affiliated institution, a co-author, an email, an address, an article title, an abstract, or keywords. Here, it should be apparent to those skilled in the art that the above entities are merely exemplary, and the present disclosure is not limited thereto.

According to one embodiment of the present disclosure, the graph construction unit 101 may be configured to construct a knowledge graph based on, for example, author entities, affiliated institution entities, and article entities. As shown in FIG. 2, entities such as the author "Chen Xiaoli", the affiliated institution "Shanxi Medical University", and the article "Analysis of factors influencing atrial fibrillation in the Taiyuan community population" are first extracted from the document repository and then connected through the relationships between them, such as affiliation or authorship, thereby constructing a knowledge graph. Here, it should be clear to those skilled in the art that FIG. 2 illustrates only a portion of the knowledge graph for the sake of brevity. The present disclosure is not limited to that shown in FIG. 2.
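The construction step can be illustrated with a minimal Python sketch. It is not the implementation of the graph construction unit 101; the record layout, the field names (`author_id`, `affiliation`, `articles`), and the relation labels (`affiliated_with`, `wrote`) are all assumptions made for illustration, and the values are romanized stand-ins for the FIG. 2 example.

```python
from collections import defaultdict

# Hypothetical records modeled on the FIG. 2 example; the field names and
# relation labels are assumptions made for this sketch.
RECORDS = [
    {"author_id": "author:1", "name": "Chen Xiaoli",
     "affiliation": "Shanxi Medical University",
     "articles": ["Analysis of factors influencing atrial fibrillation "
                  "in the Taiyuan community population"]},
    {"author_id": "author:2", "name": "Chen Xiaoli",
     "affiliation": "Department of Electrocardiographic Information, "
                    "Second Hospital of Shanxi Medical University",
     "articles": ["Relationship between elderly hypertensive patients "
                  "and arterial elasticity",
                  "Current status of atrial fibrillation "
                  "in the Taiyuan community population"]},
]


def build_graph(records):
    """Return (nodes, edges).

    nodes maps a node id to its attributes; edges maps an author node id
    to a list of (relation, target node id) pairs.
    """
    nodes = {}
    edges = defaultdict(list)
    for rec in records:
        author = rec["author_id"]
        nodes[author] = {"type": "author", "value": rec["name"]}

        affiliation = "affiliation:" + rec["affiliation"]
        nodes[affiliation] = {"type": "affiliation", "value": rec["affiliation"]}
        edges[author].append(("affiliated_with", affiliation))

        for title in rec["articles"]:
            article = "article:" + title
            nodes[article] = {"type": "article", "value": title}
            edges[author].append(("wrote", article))
    return nodes, dict(edges)


nodes, edges = build_graph(RECORDS)
print(len(nodes), "nodes")
for relation, target in edges["author:2"]:
    print(relation, "->", nodes[target]["value"])
```

In this sketch, author mentions that share a name are kept as separate author nodes, so that their surrounding nodes can later be compared.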

Next, the traversal unit 102 may be configured to traverse the constructed knowledge graph to obtain a sequence of nodes about the author. For example, as shown in FIG. 2, according to an embodiment of the present disclosure, the traversal unit 102 traverses the knowledge graph using a breadth-first traversal method. Starting from the author node "Chen Xiaoli" (the first "Chen Xiaoli" node from the left), the following sequence of nodes about the author Chen Xiaoli can be obtained:

["Chen Xiaoli"] ["Shanxi Medical University"] ["Analysis of factors influencing atrial fibrillation in the Taiyuan community population"] ["Wang Hongyu"].

Starting from the middle "Chen Xiaoli" node, the following sequence of nodes about the author Chen Xiaoli can be obtained:

["Chen Xiaoli"] ["Department of Electrocardiographic Information, Second Hospital of Shanxi Medical University"] ["Current status of atrial fibrillation in the Taiyuan community population"] ["Wanhongyu", "Zhang Hongyu", "Shaozheng Shi"].

Starting from the rightmost "Chen Xiaoli" node, the following sequence of nodes about the author Chen Xiaoli can be obtained:

["Chen Xiaoli"] ["Traditional Chinese Medicine Institute of Wan'an County, Jiangxi Province"] ["Pulse-activating and heart-nourishing prescription for treating acute coronary syndrome"] ["Zeng Xinhua"].

Here, it should be apparent to those skilled in the art that the above-described node sequences are merely exemplary, and the present disclosure is not limited thereto. Furthermore, it should be clear to those skilled in the art that the breadth-first traversal method described above is also merely exemplary, and those skilled in the art may use any existing traversal method. According to another embodiment of the present disclosure, the traversal unit 102 may traverse the knowledge graph using a depth-first traversal method.
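The breadth-first traversal can be sketched as follows. This is an illustrative Python sketch rather than the traversal unit 102 itself: the node ids, the adjacency layout (co-authors reached through the article node), and the grouping of visited nodes by type are assumptions, and the labels are romanized stand-ins for the first author node of FIG. 2.

```python
from collections import deque

# Toy graph for the leftmost author node; ids and layout are assumptions.
ADJACENCY = {
    "author:1": ["affiliation:1", "article:1"],
    "affiliation:1": [],
    "article:1": ["coauthor:1"],
    "coauthor:1": [],
}
NODE_TYPE = {
    "author:1": "author",
    "affiliation:1": "affiliation",
    "article:1": "article",
    "coauthor:1": "coauthor",
}
LABEL = {
    "author:1": "Chen Xiaoli",
    "affiliation:1": "Shanxi Medical University",
    "article:1": ("Analysis of factors influencing atrial fibrillation "
                  "in the Taiyuan community population"),
    "coauthor:1": "Wang Hongyu",
}


def node_sequence(start, adjacency, node_type, label):
    """Breadth-first traversal that groups visited labels by node type.

    Groups appear in the order in which each node type is first visited.
    """
    type_order = []
    groups = {}
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        kind = node_type[node]
        if kind not in groups:
            groups[kind] = []
            type_order.append(kind)
        groups[kind].append(label[node])
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return [groups[kind] for kind in type_order]


sequence = node_sequence("author:1", ADJACENCY, NODE_TYPE, LABEL)
for group in sequence:
    print(group)
```

Grouping by node type yields one bracketed group per kind of node, which mirrors the bracketed layout of the example sequences.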

Then, since different nodes may have different sets of attributes, the nodes need to be aligned. The alignment unit 103 may be configured to perform node alignment based on the attributes for the sequences of nodes. In some cases, a class of nodes may contain multiple child nodes. For example, in the article class, different authors may have different numbers of articles. As shown in FIG. 2, the first Chen Xiaoli from the left has one article, "Analysis of factors influencing atrial fibrillation in the Taiyuan community population", while the middle Chen Xiaoli has two articles, "Relationship between elderly hypertensive patients and arterial elasticity" and "Current status of atrial fibrillation in the Taiyuan community population". Thus, according to one embodiment of the present disclosure, the node alignment may include aligning child nodes of a node.

According to one embodiment of the present disclosure, aligning child nodes of a node may include sorting the sub-attributes of the child nodes based on similarity, and aligning the child nodes according to the sorted sub-attributes. For example, as shown in FIG. 2, based on the similarity between the sub-attributes of the child nodes (i.e., the similarity between the article "Analysis of factors influencing atrial fibrillation in the Taiyuan community population" of the first Chen Xiaoli from the left and each of the two articles "Relationship between elderly hypertensive patients and arterial elasticity" and "Current status of atrial fibrillation in the Taiyuan community population" of the middle Chen Xiaoli), the ranking of the two article nodes of the middle Chen Xiaoli (relative to the first Chen Xiaoli from the left) can be obtained as follows:

"Current status of atrial fibrillation in the Taiyuan community population" is ranked ahead of "Relationship between elderly hypertensive patients and arterial elasticity".

Here, it should be apparent to those skilled in the art that the above-described alignment of the nodes or the sub-nodes is merely exemplary, and the present disclosure is not limited thereto.
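The child-node alignment described above can be sketched in Python as a similarity-based sort. The disclosure does not specify which similarity measure is used for ranking; the word-level Jaccard measure below, and the function names `jaccard` and `align_children`, are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two titles, a value in [0, 1]."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def align_children(reference_children, children):
    """Sort `children` so that those most similar to any reference child come first."""
    def best_score(child):
        return max((jaccard(child, ref) for ref in reference_children), default=0.0)
    return sorted(children, key=best_score, reverse=True)


# Articles of the first author node (reference) and of the middle author node.
reference = ["Analysis of factors influencing atrial fibrillation "
             "in the Taiyuan community population"]
middle = ["Relationship between elderly hypertensive patients and arterial elasticity",
          "Current status of atrial fibrillation in the Taiyuan community population"]

aligned = align_children(reference, middle)
for title in aligned:
    print(title)
```

With these inputs the article about atrial fibrillation shares several words with the reference title and is ranked first, while the article about hypertension shares none.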

Next, the calculation unit 104 may be configured to calculate a similarity between the aligned sequences of nodes, wherein the author disambiguation is performed according to the calculated similarity. According to one embodiment of the present disclosure, the similarity between nodes may be represented using a value in [0, 1]. For example, as shown in FIG. 2, for the affiliation, the similarity between "Shanxi Medical University" and "Department of Electrocardiographic Information, Second Hospital of Shanxi Medical University" should be higher than the similarity between "Shanxi Medical University" and "Traditional Chinese Medicine Institute of Wan'an County, Jiangxi Province". Here, it should be apparent to those skilled in the art that different alignment methods may be applied to different node attributes, that the use of a value in [0, 1] to represent the similarity between nodes is merely exemplary, and that the present disclosure is not limited thereto.

According to one embodiment of the present disclosure, for example, for the co-author name attribute, a binary decision method may be employed, where 0 may represent different author names and 1 may represent the same author name. It should be clear to those skilled in the art that such a binary decision method is also merely exemplary, and the present disclosure is not limited thereto.
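The two kinds of per-attribute similarity mentioned above, a graded value in [0, 1] for affiliations and a binary value for co-author names, can be sketched as follows. The Dice coefficient over word tokens is an assumption made for illustration, as are the function names; the disclosure does not name a specific measure for affiliations.

```python
import re


def tokens(text):
    """Lowercase word tokens; apostrophes are kept so "Wan'an" stays one token."""
    return set(re.findall(r"[\w']+", text.lower()))


def affiliation_similarity(a, b):
    """Dice coefficient over word tokens, a value in [0, 1]."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    if not tokens_a or not tokens_b:
        return 0.0
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))


def coauthor_similarity(a, b):
    """Binary decision: 1 for the same name, 0 for different names."""
    return 1 if a.strip().lower() == b.strip().lower() else 0


university = "Shanxi Medical University"
department = ("Department of Electrocardiographic Information, "
              "Second Hospital of Shanxi Medical University")
institute = "Traditional Chinese Medicine Institute of Wan'an County, Jiangxi Province"

print(affiliation_similarity(university, department))
print(affiliation_similarity(university, institute))
print(coauthor_similarity("Wang Hongyu", "Wang Hongyu"))
print(coauthor_similarity("Wang Hongyu", "Zeng Xinhua"))
```

Under this measure the university scores higher against its own hospital department than against the unrelated institute, which is the ordering the affiliation example calls for.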

According to one embodiment of the present disclosure, the similarity between aligned node sequences may also be calculated based on semantic analysis. For example, for "keywords" and "abstracts", semantic analysis may be used to compute the similarity between nodes. For example, the keywords "machine learning" and "artificial intelligence" have a low degree of literal similarity, but at the level of high-level semantics the two terms have a higher degree of similarity. Also, as described above, different alignment methods may be applied to different node attributes. The above-described manner of calculating the semantic-based similarity is merely exemplary, and the present disclosure is not limited thereto. Those skilled in the art may use any suitable alignment method known in the art according to actual needs.
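The contrast between literal and semantic similarity can be illustrated with a toy Python sketch. The disclosure does not specify a semantic analysis technique; cosine similarity over vector representations is assumed here, and the three-dimensional vectors in `TOY_VECTORS` are made-up numbers standing in for the output of some language model, not real data.

```python
import math

# Made-up vectors standing in for keyword embeddings; illustrative only.
TOY_VECTORS = {
    "machine learning": (0.9, 0.8, 0.1),
    "artificial intelligence": (0.8, 0.9, 0.2),
    "atrial fibrillation": (0.1, 0.0, 0.9),
}


def surface_similarity(a, b):
    """Share of words the two keywords have in common (literal similarity)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def semantic_similarity(a, b, vectors=TOY_VECTORS):
    """Cosine similarity between the vectors assigned to the two keywords."""
    va, vb = vectors[a], vectors[b]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


print(surface_similarity("machine learning", "artificial intelligence"))
print(semantic_similarity("machine learning", "artificial intelligence"))
print(semantic_similarity("machine learning", "atrial fibrillation"))
```

Because the toy vectors have no negative components, the cosine value stays within [0, 1] and can be combined with the other per-attribute similarities.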

According to an embodiment of the present disclosure, calculating the similarity between the aligned node sequences may include calculating the similarity between the aligned nodes in the two node sequences, respectively, to obtain the similarity for each node in the node sequences; and calculating a similarity between the node sequences using the similarity of each node based on the weight of each node in the node sequences.

According to an embodiment of the present disclosure, the obtained similarity for each node in the node sequences may be normalized, wherein the similarity between the node sequences is calculated using the normalized similarity for each node.

According to one embodiment of the present disclosure, the normalization process may be performed based on the following formula:

where sim_p represents the similarity of the attribute p, and W_p represents the weight.

Here, it should be clear to those skilled in the art that the normalization process by the weighted average method employed in the present disclosure is merely exemplary, and the present disclosure is not limited thereto. The person skilled in the art can of course carry out the normalization process according to the actual need by any other method known in the art.
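The formula itself is not reproduced in this text. Given the symbols defined above and the description of the process as a weighted average, one plausible form is sketched below. It is an assumption and not necessarily the exact expression used in the disclosure; S_1 and S_2 are hypothetical names for the two aligned node sequences.

```latex
\[
\mathrm{sim}(S_1, S_2) \;=\; \frac{\sum_{p} W_p \,\mathrm{sim}_p}{\sum_{p} W_p}
\]
```

If every sim_p lies in [0, 1] and the weights W_p are non-negative with a positive sum, this value also lies in [0, 1].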

Finally, according to one embodiment of the present disclosure, the author disambiguation is performed when a similarity between the sequences of nodes is greater than a predetermined threshold.
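The weighted combination and the threshold test can be sketched in Python as follows. The weights, the per-attribute similarity values, and the threshold of 0.5 are hypothetical numbers chosen for illustration, and treating a similarity above the threshold as "same author" is this sketch's reading of the threshold condition.

```python
def sequence_similarity(node_sims, weights):
    """Weighted average of per-node similarities, both keyed by attribute name."""
    total_weight = sum(weights[p] for p in node_sims)
    if total_weight == 0:
        return 0.0
    return sum(weights[p] * sim for p, sim in node_sims.items()) / total_weight


def same_author(node_sims, weights, threshold=0.5):
    """True when the sequence similarity exceeds the (hypothetical) threshold."""
    return sequence_similarity(node_sims, weights) > threshold


# Hypothetical weights and per-attribute similarities.
WEIGHTS = {"affiliation": 0.3, "article": 0.4, "coauthor": 0.3}
left_vs_middle = {"affiliation": 0.5, "article": 0.6, "coauthor": 1.0}
left_vs_right = {"affiliation": 0.0, "article": 0.0, "coauthor": 0.0}

print(sequence_similarity(left_vs_middle, WEIGHTS), same_author(left_vs_middle, WEIGHTS))
print(sequence_similarity(left_vs_right, WEIGHTS), same_author(left_vs_right, WEIGHTS))
```

Dividing by the sum of the weights keeps the result in [0, 1] even when the weights do not add up to 1.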

The information processing apparatus for author disambiguation of the present disclosure performs author disambiguation using knowledge graph similarity, where the similarity is calculated using sequences of nodes extracted from the knowledge graph. By using the relationships between different nodes, the present disclosure provides a more efficient approach to author disambiguation.

An information processing method for author disambiguation according to an embodiment of the present disclosure will be described below with reference to fig. 3. As shown in fig. 3, an information processing method for author disambiguation according to an embodiment of the present disclosure starts at step S310.

In step S310, a knowledge graph is constructed based on entities extracted from the document corpus and their attributes, including author entities and their associated entities.

Next, in step S320, the constructed knowledge-graph is traversed to obtain a sequence of nodes about the author.

Next, in step S330, for the node sequence, node alignment is performed based on the attribute.

Finally, in step S340, a similarity between the aligned sequences of nodes is calculated, wherein the author disambiguation is performed according to the calculated similarity.

The information processing method for author disambiguation according to an embodiment of the present disclosure may further include the steps of calculating a similarity between aligned nodes in two node sequences, respectively, to obtain a similarity for each node in the node sequences, and calculating a similarity between the node sequences using the similarity for each node based on a weight of each node in the node sequences.

The information processing method for author disambiguation according to one embodiment of the present disclosure may further include a step of normalizing the obtained similarity for each node in the sequence of nodes, wherein the similarity between the sequence of nodes is calculated using the normalized similarity for each node.

In the information processing method for author disambiguation according to one embodiment of the present disclosure, the author disambiguation is performed when the similarity between the node sequences is greater than a predetermined threshold.

In the information processing method for author disambiguation according to one embodiment of the present disclosure, the node alignment includes aligning child nodes of a node.

In the information processing method for author disambiguation according to one embodiment of the present disclosure, aligning the child nodes includes: sorting the sub-attributes of the child nodes based on similarity; and aligning the child nodes according to the sorted sub-attributes.

In the information processing method for author disambiguation according to one embodiment of the present disclosure, the knowledge graph is traversed using a depth-first traversal method.

In the information processing method for author disambiguation according to one embodiment of the present disclosure, the knowledge graph is traversed using a breadth-first traversal method.

In the information processing method for author disambiguation according to one embodiment of the present disclosure, the similarity between the aligned sequences of nodes is calculated based on semantic analysis.

Various specific implementations of the above-described steps of the information processing method for author disambiguation according to the embodiment of the present disclosure have been described in detail previously, and a description thereof will not be repeated.

It is apparent that the respective operational procedures of the information processing method for author disambiguation according to the present disclosure can be implemented in the form of computer-executable programs stored in various machine-readable storage media.

Moreover, the object of the present disclosure can also be achieved as follows: a storage medium storing the above executable program code is supplied directly or indirectly to a system or an apparatus, and a computer or a central processing unit (CPU) in the system or the apparatus reads out and executes the program code. In this case, as long as the system or the apparatus has a function of executing a program, the embodiments of the present disclosure are not limited to a particular kind of program, and the program may be in any form, for example, an object program, a program executed by an interpreter, or a script program provided to an operating system.

Such machine-readable storage media include, but are not limited to: various memories and storage units, semiconductor devices, disk units such as optical, magnetic, and magneto-optical disks, and other media suitable for storing information.

In addition, the computer can also implement the technical solution of the present disclosure by connecting to a corresponding website on the internet, downloading and installing the computer program code according to the present disclosure into the computer and then executing the program.

Fig. 4 is a block diagram of an exemplary structure of a general-purpose personal computer 1300 in which an information processing method for author disambiguation according to an embodiment of the present disclosure may be implemented.

As shown in FIG. 4, the CPU 1301 executes various processes in accordance with a program stored in a read-only memory (ROM) 1302 or a program loaded from a storage section 1308 into a random access memory (RAM) 1303. The RAM 1303 also stores, as necessary, data required when the CPU 1301 executes the various processes. The CPU 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An input/output interface 1305 is also connected to the bus 1304.

The following components are connected to the input/output interface 1305: an input section 1306 (including a keyboard, a mouse, and the like), an output section 1307 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like), a storage section 1308 (including a hard disk and the like), and a communication section 1309 (including a network interface card such as a LAN card, a modem, and the like). The communication section 1309 performs communication processing via a network such as the internet. A drive 1310 may also be connected to the input/output interface 1305 as needed. A removable medium 1311 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 1310 as needed, so that a computer program read out therefrom is installed in the storage section 1308 as needed.

In the case where the above-described series of processes is realized by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 1311.

It should be understood by those skilled in the art that such a storage medium is not limited to the removable medium 1311 shown in fig. 4, in which the program is stored, distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 1311 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read only memory (CD-ROM) and a Digital Versatile Disc (DVD)), a magneto-optical disk (including a Mini Disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1302, a hard disk contained in the storage section 1308, or the like, in which programs are stored and which are distributed to users together with the apparatus containing them.

In the systems and methods of the present disclosure, it is apparent that individual components or steps may be broken down and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure. Also, the steps of executing the series of processes described above may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.

Although the embodiments of the present disclosure have been described in detail with reference to the accompanying drawings, it should be understood that the above-described embodiments are merely illustrative of the present disclosure and do not constitute a limitation of the present disclosure. It will be apparent to those skilled in the art that various modifications and variations can be made in the above-described embodiments without departing from the spirit and scope of the disclosure. Accordingly, the scope of the disclosure is to be defined only by the claims appended hereto, and by their equivalents.

With respect to the embodiments including the above embodiments, the following supplementary notes are also disclosed:

Supplementary note 1. An information processing apparatus for author disambiguation, comprising:

a graph construction unit configured to construct a knowledge graph based on entities extracted from a document repository and attributes thereof, the entities including author entities and associated entities thereof;

a traversal unit configured to traverse the constructed knowledge-graph to obtain a sequence of nodes about the author;

an alignment unit configured to perform node alignment based on the attribute for the node sequence; and

a calculation unit configured to calculate a similarity between the aligned sequences of nodes, wherein the author disambiguation is performed according to the calculated similarity.

Supplementary note 2. The apparatus according to supplementary note 1, wherein the calculation unit is further configured to:

calculate the similarity between aligned nodes in two node sequences, respectively, to obtain a similarity for each node in the node sequences; and

calculate a similarity between the node sequences using the similarity of each node, based on the weight of each node in the node sequences.

Supplementary note 3. The apparatus according to supplementary note 2, further comprising a normalization unit configured to normalize the obtained similarity for each node in the node sequences, wherein the similarity between the node sequences is calculated using the normalized similarity for each node.

Supplementary note 4. The apparatus according to supplementary note 3, wherein the author disambiguation is performed when the similarity between the node sequences is greater than a predetermined threshold.

Supplementary note 5. The apparatus according to supplementary note 1, wherein the alignment unit is further configured to align child nodes of a node.

Supplementary note 6. The apparatus according to supplementary note 5, wherein the alignment unit is further configured to:

sort the sub-attributes of the child nodes based on similarity; and

align the child nodes according to the sorted sub-attributes.

Supplementary note 7. The apparatus according to supplementary note 1, wherein the traversal unit is further configured to traverse the knowledge graph using a depth-first traversal method.

Supplementary note 8. The apparatus according to supplementary note 1, wherein the traversal unit is further configured to traverse the knowledge graph using a breadth-first traversal method.

Supplementary note 9. The apparatus according to supplementary note 1, wherein the calculation unit is further configured to calculate a similarity between the aligned node sequences based on semantic analysis.

Supplementary note 10. An information processing method for author disambiguation, comprising:

constructing a knowledge graph based on entities extracted from a document library and attributes thereof, the entities including author entities and associated entities;

traversing the constructed knowledge graph to obtain a node sequence related to an author;

performing node alignment based on the attributes for the sequence of nodes; and

calculating a similarity between the aligned sequences of nodes, wherein the author disambiguation is performed according to the calculated similarity.

Supplementary note 11. The method according to supplementary note 10, wherein calculating the similarity between the aligned node sequences includes:

Supplementary note 12. The method according to supplementary note 11, further comprising normalizing the obtained similarity for each node in the node sequences, wherein the similarity between the node sequences is calculated using the normalized similarity for each node.

Supplementary note 13. The method according to supplementary note 12, wherein the author disambiguation is performed when the similarity between the node sequences is greater than a predetermined threshold.

Supplementary note 14. The method according to supplementary note 10, wherein the node alignment comprises aligning child nodes of a node.

Supplementary note 15. The method according to supplementary note 14, wherein aligning the child nodes comprises:

sorting the sub-attributes of the child nodes based on similarity; and

aligning the child nodes according to the sorted sub-attributes.

Supplementary note 16. The method according to supplementary note 10, wherein the knowledge graph is traversed using a depth-first traversal method.

Supplementary note 17. The method according to supplementary note 10, wherein the knowledge graph is traversed using a breadth-first traversal method.

Supplementary note 18. The method according to supplementary note 10, wherein the similarity between the aligned sequences of nodes is calculated based on semantic analysis.

Supplementary note 19. A program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform the method according to any one of supplementary notes 10 to 18.

Claims

1. An information processing apparatus for author disambiguation, comprising:

2. The apparatus of claim 1, wherein the calculation unit is further configured to:

3. The apparatus according to claim 2, further comprising a normalization unit configured to normalize the obtained similarity for each node in the node sequences, wherein the similarity between the node sequences is calculated using the normalized similarity for each node.

4. The apparatus of claim 3, wherein the author disambiguation is performed when the similarity between the node sequences is greater than a predetermined threshold.

5. The apparatus of claim 1, wherein the alignment unit is further configured to align child nodes of a node.

6. The apparatus of claim 5, wherein the alignment unit is further configured to:

sort the sub-attributes of the child nodes based on similarity; and

align the child nodes according to the sorted sub-attributes.

7. The apparatus of claim 1, wherein the traversal unit is further configured to traverse the knowledge graph using a depth-first traversal method or a breadth-first traversal method.

8. The apparatus of claim 1, wherein the calculation unit is further configured to calculate a similarity between the aligned sequences of nodes based on semantic analysis.

9. An information processing method for author disambiguation, comprising:

10. A machine-readable storage medium having a program product embodied thereon, the program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform the method of claim 9.