CN113488188A

CN113488188A - Traditional Chinese medicine meridian ancient book knowledge graph construction and syndrome mining system

Info

Publication number: CN113488188A
Application number: CN202110885758.8A
Authority: CN
Inventors: 林树元; 刘畅; 曹灵勇
Original assignee: Zhejiang Chinese Medicine University ZCMU
Current assignee: Zhejiang Chinese Medicine University ZCMU
Priority date: 2021-08-03
Filing date: 2021-08-03
Publication date: 2021-10-08

Abstract

The invention discloses a traditional Chinese medicine meridian ancient book knowledge graph construction and syndrome mining system, which comprises: the knowledge graph building module is used for acquiring concept terms, classifying entities, building a knowledge graph according to the relation between the corresponding entities, and importing the knowledge graph into a graph database to form an entity graph; the knowledge mining module comprises the following steps: the graph mapping unit extracts all symptom nodes in the entity graph as subgraph nodes, counts the number of times of symptom co-occurrence as the weight of edges in the subgraph, and makes the co-occurrence that two symptom nodes are connected in the entity graph through the same middle node, and the type of the middle node is not limited; and the community discovery calculation unit is used for acquiring the symptom subgraph data after the graph mapping, calculating the community distribution rule of the symptoms, and obtaining the community number, the modularity, the node number, and specific nodes contained in each community, namely specific symptoms, wherein each community represents a syndrome, and the node number represents symptoms which may appear under the syndrome.

Description

Traditional Chinese medicine meridian ancient book knowledge graph construction and syndrome mining system

Technical Field

The invention relates to the technical field of data mining, in particular to a traditional Chinese medicine meridian ancient book knowledge graph construction and syndrome mining system.

Background

Diabetes (the TCM disease category xiaoke) manifests clinically as polydipsia, polyphagia, polyuria, thirst, fatigue, emaciation, or sweet urine, and corresponds to diabetes mellitus in modern medicine. Traditional Chinese medicine has definite advantages in improving insulin resistance, protecting islet function, and the like. However, physicians of different eras understood diabetes differently, so its syndromes lack standardization. The meridian-prescription (jingfang) school is the Han-Tang prescription-and-pulse medical system whose theoretical core is Zhang Zhongjing's six-meridian syndrome differentiation. Six-meridian syndrome differentiation reclassifies exterior and interior, deficiency and excess, and cold and heat, and covers differentiation information about the viscera, meridians, and so on. The applicant's preliminary work shows that the diabetes-related content recorded in ancient books follows a six-meridian syndrome distribution. However, this knowledge is scattered across many ancient books, is difficult to integrate manually, and is difficult to reuse. The ancient texts on diabetes therefore need to be systematically organized.

A knowledge graph is a semantic network that formally describes the concepts and relationships of a domain. Its advantages are: integrating knowledge from multiple data sources; providing convenient knowledge search; enabling knowledge discovery through knowledge reasoning; and providing prior knowledge for artificial intelligence systems. Building a meridian-prescription (jingfang) diabetes knowledge graph improves the utilization of knowledge and the efficiency of knowledge workers, and lays a foundation for subsequent research on combining knowledge graphs with deep learning. At present, knowledge graph research in traditional Chinese medicine covers knowledge integration, inheriting the experience of renowned veteran TCM physicians, TCM preparations, TCM scientific research, and so on. In the field of meridian prescriptions, studies have used a graph database to build and query a graph of cassia twig decoction formulas, or to query treatment schemes and prescriptions on the basis of "prescription-syndrome correspondence". However, syndrome mining that integrates multiple ancient books has not been reported.

Syndrome research that starts from theory is limited by personal experience and clinical level, whereas data mining is more objective. TCM syndromes are high-dimensional and high-order, and the symptoms that make up a syndrome are related in nonlinear, complex ways. Methods such as factor analysis can find co-occurrence patterns among symptoms, but they presuppose that the symptoms of different syndromes are relatively independent, which does not accord with clinical practice; the dimensionality reduction involved also causes information loss.

Complex networks can represent nonlinear complex relationships and better match the structure of TCM syndrome-differentiation knowledge. One of their important features is community structure, which reveals closely connected nodes with similar attributes and allows potential categories to be mined on that basis. Community discovery algorithms take into account the contribution of each symptom to the syndrome type without omitting interactions among symptoms. The Louvain algorithm is currently one of the most efficient and widely used community discovery algorithms. Studies have therefore attempted to use community discovery algorithms for syndrome analysis and research on medication patterns.

The knowledge discovery method based on graph calculation better accords with the characteristics of complex and multidimensional traditional Chinese medicine syndromes, can realize syndrome research covering symptom associated information, and provides a new thought and method for the syndrome research of ancient books of the traditional Chinese medicine meridians.

Disclosure of Invention

To remedy the shortcomings of the prior art, and on the basis of research on ancient books and literature related to diabetes, the invention represents knowledge by constructing a knowledge graph of meridian-prescription differentiation and treatment of diabetes, mines knowledge such as the six-meridian syndromes of diabetes with a community discovery algorithm, and explores link prediction as a means of knowledge graph completion. To this end, the invention adopts the following technical scheme:

A traditional Chinese medicine meridian ancient book knowledge graph construction and syndrome mining system includes a knowledge graph construction module, a graph database, and a knowledge mining module; the knowledge mining module includes a graph mapping unit and a community discovery computing unit;

the knowledge graph construction module is used for acquiring concept terms, classifying entities, constructing a knowledge graph corresponding to the relationships among the entities, and importing the knowledge graph into a graph database to form an entity graph;

the graph mapping unit extracts all symptom nodes in the entity graph as subgraph nodes and counts the number of symptom co-occurrences as the weight of the edges in the subgraph, where co-occurrence means that two symptom nodes are connected through the same intermediate node in the entity graph, and the type of the intermediate node is not limited; because the entity graph contains nodes of different types, it must be mapped to a subgraph before the computation is run;

the community discovery calculation unit acquires the symptom subgraph data after the graph mapping, calculates the community distribution rule of the symptoms, and obtains the community number, the modularity, the node number, and specific nodes contained in each community, namely specific symptoms, each community represents a syndrome, the node number represents the symptoms which may appear under the syndrome, and the calculation process is as follows:

1.1, taking each node in the entity graph as an independent community, wherein the number of the communities is the same as that of the nodes;

1.2, for each node i, sequentially trying to allocate the node i to a community where each neighbor node is located, calculating a change value delta Q of the modularity before and after allocation, and recording the neighbor node with the largest change value delta Q of the modularity, if the maximum change value max delta Q of the modularity is larger than 0, allocating the node i to the community where the neighbor node with the largest change value delta Q of the modularity is located, otherwise, keeping the node i unchanged;

1.3, repeating 1.2 until the communities to which all the nodes belong do not change any more;

1.4, compressing the entity graph: all nodes in the same community are compressed into a new node, the weights of the edges between nodes inside the community are converted into the weight of a loop on the new node, and the weights of the edges between communities are converted into the weights of the edges between the new nodes; compressing nodes by community greatly reduces the number of edges and nodes, and the modularity change when node i is assigned to the community of neighbor j depends only on the communities of nodes i and j, so the whole graph need not be traversed, which greatly improves computational efficiency;

1.5, returning to 1.1 with the compressed graph and repeating until the modularity of the whole graph no longer changes.

The algorithm can generate a hierarchical community structure, and the nodes are compressed according to communities in the calculation process, so that the calculation efficiency is high.
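Steps 1.1 to 1.3 can be illustrated with a minimal Python sketch. It is not the system's implementation (the embodiment calls Louvain through a Neo4j plug-in); the function names `local_moving` and `undirected`, the dict-of-dicts graph layout, and the toy graph of two triangles joined by one edge are assumptions made for illustration.

```python
from collections import defaultdict

def undirected(edges):
    """Build a symmetric dict-of-dicts adjacency from (u, v, weight) tuples."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return dict(adj)

def local_moving(adj):
    """One local-moving phase (steps 1.1 to 1.3); returns node -> community label."""
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())
    degree = {n: sum(nbrs.values()) for n, nbrs in adj.items()}
    community = {n: n for n in adj}      # 1.1: every node is its own community
    sigma_tot = dict(degree)             # total degree per community label
    moved = True
    while moved:                         # 1.3: repeat until nothing changes
        moved = False
        for i in adj:                    # 1.2: try each neighbour's community
            old = community[i]
            sigma_tot[old] -= degree[i]  # temporarily isolate node i
            k_in = defaultdict(float)    # weight from i into each community
            for j, w in adj[i].items():
                if j != i:
                    k_in[community[j]] += w

            def gain(c):
                # Modularity gain of placing the isolated node i into c.
                return (2 * k_in.get(c, 0.0) / two_m
                        - 2 * sigma_tot[c] * degree[i] / two_m ** 2)

            best, best_gain = old, gain(old)
            for c in list(k_in):
                if gain(c) > best_gain:  # move only on a strict improvement
                    best, best_gain = c, gain(c)
            sigma_tot[best] += degree[i]
            if best != old:
                community[i] = best
                moved = True
    return community

# Hypothetical toy graph: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (2, 3, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0)]
communities = local_moving(undirected(edges))
```

On this toy graph the phase groups each triangle into its own community; the community labels themselves are arbitrary node identifiers.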

Further, the modularity measures the quality of a division of the network into communities. It is the difference between the number of edges actually linking nodes within communities and the number expected under random connection, and its formula is:

$$Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$$

wherein $m = \frac{1}{2}\sum_{i,j} A_{ij}$ represents the total edge weight of the whole entity graph (the total number of edges when the graph is unweighted); $c_i$ denotes the community to which node i belongs; $\delta(c_i, c_j)$ equals 1 when node i and node j are in the same community and 0 otherwise; $A_{ij}$ represents the weight of the edge between node i and node j; $k_i$ and $k_j$ represent the sums of the weights of all edges connected to node i and node j, respectively; $k_j/2m$ represents the probability that node j is connected to any node of the whole entity graph; in the random case, the expected connection weight between node i and node j is $k_i k_j/2m$; and $A_{ij} - k_i k_j/2m$ is the difference between the actual and expected connection weights of nodes i and j;

the formula for the modularity is simplified as:

$$Q = \frac{1}{2m}\sum_{c}\left[\Sigma_{in} - \frac{(\Sigma_{tot})^2}{2m}\right]$$

where $\Sigma_{in} = \sum_{i,j \in c} A_{ij}$ represents the sum of the weights of the edges within community c (following the pairwise sum above, each internal edge is counted once in each direction), $\Sigma_{tot} = \sum_{i \in c} k_i$ represents the sum of the weights of the edges connected to the nodes within community c, and the outer sum accumulates $\Sigma_{in} - (\Sigma_{tot})^2/2m$ over all communities. The formula shows that the larger the sum of the internal weights of a community and the smaller the weight of its external links, the larger the modularity;

calculating the modularity gain, namely the change value ΔQ of the modularity:

$$\Delta Q = \left[\frac{\Sigma_{in} + 2k_{i,in}}{2m} - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^2\right] - \left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2\right]$$

ΔQ represents the change in modularity when node i is assigned to community c where neighbor node j is located, where $k_{i,in}$ is the sum of the weights of the edges between node i and the nodes within community c; the factor 2 appears because $\Sigma_{in}$ counts each internal edge in both directions. The first bracket represents the modularity of community c after node i is added, and the second bracket represents the sum of the modularity of community c and of node i as an independent community before node i joins.

Further, the value range of the modularity is [ -1/2, 1).
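As a numerical check of the definitions, the following Python sketch computes the modularity once from the pairwise definition and once from the per-community sums Σin and Σtot. The function names, the symmetric dict-of-dicts layout, and the toy graph are assumptions for illustration; Σin is taken here as the sum over ordered node pairs, so each internal edge is counted twice.

```python
from collections import defaultdict

def modularity_pairwise(adj, community):
    """Q from the pairwise definition over all node pairs (i, j)."""
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())
    degree = {n: sum(nbrs.values()) for n, nbrs in adj.items()}
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] == community[j]:      # delta(c_i, c_j) = 1
                q += adj[i].get(j, 0.0) - degree[i] * degree[j] / two_m
    return q / two_m

def modularity_by_community(adj, community):
    """Q from the simplified per-community form using sigma_in and sigma_tot."""
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())
    sigma_in = defaultdict(float)    # internal weight, both directions counted
    sigma_tot = defaultdict(float)   # total degree of the community's nodes
    for i, nbrs in adj.items():
        sigma_tot[community[i]] += sum(nbrs.values())
        for j, w in nbrs.items():
            if community[i] == community[j]:
                sigma_in[community[i]] += w
    return sum(sigma_in[c] - sigma_tot[c] ** 2 / two_m for c in sigma_tot) / two_m

# Hypothetical toy graph: two triangles joined by the edge 2-3 (7 edges, 2m = 14).
adj = {
    0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {0: 1.0, 1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0, 5: 1.0}, 4: {3: 1.0, 5: 1.0}, 5: {3: 1.0, 4: 1.0},
}
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q_pairwise = modularity_pairwise(adj, community)
q_simplified = modularity_by_community(adj, community)
```

For this partition each community has Σin = 6 and Σtot = 7, so both functions give Q = 5/14; placing all six nodes in a single community gives Q = 0.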

Further, the community discovery computing unit calculates the influence of each symptom node within its syndrome community in order to obtain the primary and secondary relations among the symptoms in the community. The co-occurrence information of symptoms extracted from the entity graph serves as the edges between symptoms in the subgraph; the more edges a symptom has, the more likely it is to appear together with other symptoms, and the greater its influence on the syndrome community. Because the Louvain algorithm yields only the community distribution of the symptoms, the syndrome itself must be determined by expert analysis, which requires knowing the primary and secondary relations among the symptoms in each community; this problem can be abstracted as calculating the influence of the symptom nodes within the syndrome community;

the influence is calculated as the PR value of a symptom node:

$$PR(a) = (1 - d) + d\sum_{n}\frac{PR(T_n)}{C(T_n)}$$

wherein PR(a) is the PR value of symptom node a and represents the probability of a being visited; to obtain it, the importance scores carried by all edges pointing to node a are accumulated; $PR(T_n)$ is the PR value of node $T_n$, where $T_n$ is one of the nodes pointing to symptom node a; $C(T_n)$ is the out-degree of node $T_n$, i.e. the number of edges from $T_n$ pointing to other nodes; because the edges between symptom nodes carry weight information, the weights (i.e. the co-occurrence counts) are accumulated when computing $PR(T_n)$ and $C(T_n)$; d is the damping coefficient and represents the probability that, upon reaching a node, the walk continues along an outgoing edge;

the PR values of the other nodes are calculated in the same manner and substituted into the formula to compute the second round of PR(a), and so on until the difference between the results of two adjacent iterations is smaller than a convergence threshold; the PR value of each symptom node within the community to which it belongs is then derived, where a larger PR value means a larger probability that the symptom appears, and the derived symptoms are sorted by PR value to rank their importance.
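The iteration can be sketched in Python as follows. This is a sketch under assumptions rather than the plug-in's implementation: the damping value d = 0.85, the tolerance, the function name, and the example symptoms and co-occurrence weights are all hypothetical, and the weighting (each neighbor passes on its PR in proportion to edge weight over its total outgoing weight) is one common way to read "accumulating weights".

```python
def weighted_pagerank(adj, d=0.85, tol=1e-8, max_iter=200):
    """Iterate PR(a) = (1 - d) + d * sum_t PR(t) * w(t, a) / C(t) to convergence.

    adj[t][a] is the weight of the edge from t to a; C(t) is the total
    outgoing weight of t. d = 0.85 is an assumed default, not from the source.
    """
    out_weight = {n: sum(nbrs.values()) for n, nbrs in adj.items()}
    pr = {n: 1.0 for n in adj}                      # initial PR values
    for _ in range(max_iter):
        new_pr = {}
        for a in adj:
            incoming = sum(pr[t] * adj[t][a] / out_weight[t]
                           for t in adj if a in adj[t])
            new_pr[a] = (1 - d) + d * incoming
        converged = max(abs(new_pr[n] - pr[n]) for n in adj) < tol
        pr = new_pr
        if converged:                               # adjacent rounds agree
            break
    return pr

# Hypothetical symptom community: "thirst" co-occurs with three other symptoms.
adj = {
    "thirst": {"vomiting": 3.0, "polyuria": 1.0, "emaciation": 1.0},
    "vomiting": {"thirst": 3.0},
    "polyuria": {"thirst": 1.0},
    "emaciation": {"thirst": 1.0},
}
pr = weighted_pagerank(adj)
ranking = sorted(pr, key=pr.get, reverse=True)      # most influential first
```

In this toy community the hub symptom ranks first, and among the others the one with the heaviest co-occurrence edge ranks next.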

Further, the knowledge graph building module comprises:

the original text screening and semantic analysis unit screens out the content related to the keywords from the original text, eliminates irrelevant articles, and performs duplication elimination and numbering on the screening result; semantically analyzing original texts related to the keywords, splitting the original texts into minimum morphemes, and establishing an original text concept according to the minimum morphemes;

the concept term standardization unit is used for establishing a standard word bank by referring to national standards in order to ensure the multiplexing capability of the knowledge graph, corresponding the original text concept with the standardized concept, naming the nodes in the knowledge graph by adopting standard words and generating standardized concept terms;

the knowledge graph schema layer definition unit is used for classifying the concept terms according to the entities to which they belong and constructing a knowledge graph schema layer corresponding to the relationships among the entities;

the graph database imports the data of the knowledge graph to form the entity graph: the entities and inter-entity relationships of the knowledge graph schema layer are instantiated in the graph database with the specific entity contents and relationship contents, which are then stored.

Further, the related content comprises: texts from ancient books that contain the keyword or its synonyms, including whole chapters whose title contains the keyword and medicines whose efficacy description contains the keyword; texts describing diseases related to the keyword; and texts in which typical prescriptions for treating the keyword disease appear. Texts that contain the keyword but whose actual meaning is unrelated to it are excluded.

Further, the original text concepts are established by collecting the nouns among the minimum morphemes as candidate concepts and collecting phrases composed of minimum morphemes, with structures such as "subject-predicate-object" or "predicate-object-complement", as candidate triples.

Further, the concept term standardization unit also establishes a synonym table, and words of the original text concepts are labeled as the attributes of the nodes so as to maximally retain the original text semantics.

Furthermore, the knowledge graph building module further comprises an original text labeling unit, which further analyzes the semantics, labels concept terms in the original text as node names, and labels the entity category to which each concept belongs as the node category; one node may be labeled with a group of node categories to accommodate words with multiple meanings; phrases are labeled as triples.

Further, in the knowledge graph schema layer definition unit, the entity categories include: disease names, causes, treatment methods, prescriptions, syndromes, symptoms, physical signs, medicines, effects, and acupuncture points. Syndromes comprise physical signs, symptoms, and pathogenesis, and prescriptions comprise medicines. With the disease name at the center, the disease name is connected to causes, treatment methods, prescriptions, physical signs, and symptoms. Treatment methods comprise internal and external treatment methods, where external treatment methods are connected to symptoms through acupuncture points. Prescriptions are also connected to effects and syndromes.
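One way to make such a schema layer checkable is to store the permitted category pairs and test each labeled triple against them. The Python sketch below is an illustration only: the pair list is one reading of the paragraph above (relation names and directions are not specified there, so pairs are treated as undirected), and `ALLOWED_PAIRS` and `is_allowed` are hypothetical names.

```python
# Category pairs read off the schema description; this pairing is an
# interpretation for illustration, not a definition taken from the patent.
ALLOWED_PAIRS = {
    frozenset(pair) for pair in [
        ("disease name", "cause"),
        ("disease name", "treatment method"),
        ("disease name", "prescription"),
        ("disease name", "physical sign"),
        ("disease name", "symptom"),
        ("syndrome", "physical sign"),
        ("syndrome", "symptom"),
        ("syndrome", "pathogenesis"),
        ("prescription", "medicine"),
        ("prescription", "effect"),
        ("prescription", "syndrome"),
        ("treatment method", "acupuncture point"),
        ("acupuncture point", "symptom"),
    ]
}

def is_allowed(head_type, tail_type):
    """Return True if the schema layer permits an edge between the two categories."""
    return frozenset((head_type, tail_type)) in ALLOWED_PAIRS

ok = is_allowed("prescription", "medicine")
rejected = is_allowed("medicine", "acupuncture point")
```

A check of this kind could flag labeling mistakes before the triples are imported into the graph database.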

The invention has the advantages and beneficial effects that:

the traditional Chinese medicine meridian ancient book knowledge graph construction and syndrome mining system converts ancient book originals related to diabetes in the field of traditional Chinese medicine meridians into graph structure data, expresses the ancient book originals in knowledge graph, and performs syndrome mining by using a community discovery algorithm on the basis of the knowledge graph.

Drawings

FIG. 1 is a technical roadmap for the present invention.

FIG. 2 is a schematic diagram of the knowledge graph schema for meridian-prescription differentiation and treatment of diabetes.

FIG. 3 is an exemplary illustration of the labeling of text in the present invention.

FIG. 4 is an example of knowledge graph data in the present invention.

Fig. 5 is a schematic diagram of the loop structure of the present invention.

FIG. 6 is a community classification diagram of meridian-prescription diabetes symptoms in the present invention.

Fig. 7 is a connection relationship diagram of symptoms A, B, C, and D within a community in the present invention.

FIG. 8 is a (partial) knowledge graph of meridian-prescription diabetes in the present invention.

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.

The system for constructing the meridian-prescription ancient book knowledge graph of diabetes and mining its syndromes covers two parts: construction of a knowledge graph of meridian-prescription differentiation and treatment of diabetes, and knowledge mining based on that graph. For construction, the original texts of meridian-prescription ancient books on differentiating and treating diabetes are screened and semantically analyzed, the concepts and relations in the texts are extracted to form graph-structured data, and the meridian-prescription diabetes knowledge graph is built. For mining, starting from the structural characteristics of meridian-prescription syndrome-differentiation knowledge, the six-meridian syndrome distribution of diabetes is studied with a community analysis algorithm and compared with the syndrome pattern obtained by factor analysis in earlier research. Syndrome research methods based on structured data (such as factor analysis) assume that symptoms are relatively independent and fail to cover the complex associations between them. However, the essence of the six-meridian syndromes is the classification of complexly related symptoms. A community analysis algorithm based on graph-structured data can cover the associations between symptoms and their weight information (expressed as the edges between symptom nodes and the attributes of those edges), and simulates well the classification process from symptoms to syndromes. This embodiment therefore uses community analysis to mine the syndromes of diabetes on the basis of the constructed knowledge graph. As shown in fig. 1, the method specifically comprises the following steps:

step one, building a knowledge graph for treating diabetes by differentiation of meridian prescriptions

1.1 original text screening and semantic parsing

(1) Screening

Screening range: the following editions of ancient texts were searched: Shanghan Lun Jiaozhu, edited by Liu Duzhou (People's Medical Publishing House, July 2013); Jingui Yaolue Jiaozhu, edited by He Ren (People's Medical Publishing House, July 2013); Shennong Bencao Jing Jizhu, edited by Ma Jixing (People's Medical Publishing House, July 2013); Mingyi Bielu, compiled and collated by Shang Zhijun (People's Medical Publishing House, June 1986); Zhubing Yuanhou Lun Jiaozhu, edited by Ding Guangdi (People's Medical Publishing House, July 2013); and Beiji Qianjin Yaofang Jiaozhu, Qianjin Yifang Jiaozhu, and Waitai Miyao Fang Jiaozhu, collated by Gao Wenzhu (Xueyuan Press, January 2016).

Screening conditions: because "diabetes" in the ancient books may denote either a disease name or a symptom, this embodiment adopts lenient inclusion and strict exclusion criteria. An original text is included if:

① it contains the two characters for "diabetes" or a synonym, including whole chapters whose title contains "diabetes" and medicines whose efficacy description contains "diabetes";

② it describes diseases related to diabetes, such as consumptive disease, blood impediment, and carbuncle with pus;

③ a typical prescription for treating diabetes (such as a near-effect diabetes prescription) appears in it.

Exclusion criteria: texts that contain the word "diabetes" but whose actual meaning is unrelated to diabetes as defined in the TCM section of the Guidelines for Diagnosis and Treatment of Common Diseases in TCM (2008 edition). Articles whose semantics cannot be determined are recorded in a batch and then discussed and adjudicated by domain experts.

The screening results were de-duplicated and all the articles and formulas were numbered separately.

(2) Semantic parsing: the original text is interpreted and split into minimum morphemes. Nouns in the original text are collected as candidate concepts, and phrases with structures such as subject-predicate-object or predicate-object-complement are collected as candidate triples.

1.2 concept term normalization

(1) Establishing a standard word bank and a synonym table: in traditional Chinese medicine, a linguistic phenomenon of a polysemous word exists, in order to ensure the reusability of knowledge graphs, the present embodiment refers to national standards such as basic theoretical terms in traditional Chinese medicine (2006), clinical diagnosis and treatment terms in traditional Chinese medicine (1997), classification and code in traditional Chinese medicine (1995), and medical terms in traditional Chinese medicine (2013), etc., establishes a standard word stock, corresponds concepts in the original text to standardized concepts, and establishes a synonym table.

(2) Naming rules: nodes in the map are named by standard words, and the original text is marked with words as attributes of the nodes so as to maximally retain the original text semantics.
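The naming rule can be illustrated with a small Python sketch that names each node by its standard term and keeps the original wording as a node attribute. The synonym entries, the attribute name `original_terms`, and the function name are hypothetical; the actual standard word bank and synonym table are built from the national standards listed above.

```python
def build_nodes(original_terms, synonym_table):
    """Name nodes by standard term and keep the original wording as an attribute."""
    nodes = {}
    for term in original_terms:
        # Terms missing from the synonym table are kept under their own name.
        standard = synonym_table.get(term, term)
        node = nodes.setdefault(standard, {"name": standard, "original_terms": []})
        if term not in node["original_terms"]:
            node["original_terms"].append(term)
    return nodes

# Hypothetical synonym table: original wording -> standard term.
synonym_table = {
    "desire to drink water": "thirst",
    "great thirst": "thirst",
}
nodes = build_nodes(
    ["desire to drink water", "great thirst", "vomiting"], synonym_table
)
```

Here both original phrasings collapse into one node named "thirst", while the node still records which wordings appeared in the source texts.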

1.3 knowledge graph schema layer definition

Standardized concepts are still an unorganized, unstructured vocabulary; the core concepts (entities) and the relationships between concepts must be analyzed according to the theoretical framework of the meridian-prescription (jingfang) school.

(1) Entity extraction: with reference to the meridian-prescription ontology constructed in earlier work, and integrating the semantics of the original text with expert opinion, this embodiment divides the concept entities of meridian-prescription differentiation and treatment of diabetes into 10 categories: disease name, symptom, physical sign, syndrome, prescription, medicine, efficacy, etiology, treatment method, and acupuncture point.

(2) Relationship extraction: the relationships among the entities are sorted out according to the entity categories.

(3) The graph schema layer is defined as shown in fig. 2.

1.4, original text labeling

(1) Manual labeling: to ensure that the original text semantics are captured accurately, this embodiment uses manual labeling, analyzing the semantics item by item and labeling nodes and triples at the same time, as shown in fig. 3; Excel is used to store the labeling information.

(2) Labeling categories: marking the concept in the original text as a node name, marking the entity category to which the concept belongs as a node category, and allowing one node to belong to a plurality of categories to be compatible with the condition of one word with multiple meanings; phrases are labeled as triplets.

For example, consider the article "For stomach regurgitation with vomiting and thirst with desire to drink water, Poria and Alisma Decoction is indicated." Its nodes are: stomach regurgitation (disease name), vomiting (symptom), thirst (symptom), Poria and Alisma Decoction (prescription), Poria (medicine), Alismatis Rhizoma (medicine), Glycyrrhizae Radix (medicine), Cinnamomi Ramulus (medicine), Atractylodis Rhizoma (medicine), and Zingiberis Rhizoma Recens (medicine). The terms in parentheses are the node categories. The prescription Poria and Alisma Decoction is numbered JG 2.7.20. The triples contained in this article are entered in columns I, J, and L of the table, and the head-node category, the relationship attribute, and the tail-node category are labeled in columns H, K, and M, respectively. For example, in the second triple of the article (row 162 of the table), "Poria and Alisma Decoction - treats - stomach regurgitation", "Poria and Alisma Decoction" is a prescription (labeled in cell H162), "stomach regurgitation" is a disease name (labeled in cell M162), and the relationship "treats" has no attribute and is left unlabeled.

The doses of the medicines in a prescription are labeled as attributes of the relationship and become the weights of the edges between the nodes when imported into the graph. For example, in row 165 of the table, the triple "Poria and Alisma Decoction - contains - Poria" has the relationship attribute "half liter"; in the graph, the edge between "Poria and Alisma Decoction" and "Poria" is named "contains" and its attribute is "half liter".
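The labeled rows can be pictured as a small record type. The following Python sketch is illustrative only: the `Triple` class and its field names are hypothetical, and the two records restate the decoction example from rows 162 and 165 (with "treats" and "contains" as English renderings of the relationship names).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Triple:
    """One labeled row: head node, relationship, tail node, plus categories."""
    head: str
    head_type: str                        # node category of the head (column H)
    relation: str
    tail: str
    tail_type: str                        # node category of the tail (column M)
    relation_attr: Optional[str] = None   # relationship attribute (column K)

triples = [
    Triple("Poria and Alisma Decoction", "prescription", "treats",
           "stomach regurgitation", "disease name"),
    Triple("Poria and Alisma Decoction", "prescription", "contains",
           "Poria", "medicine", relation_attr="half liter"),
]

# Doses are carried on the relationship, not on the medicine node itself.
doses = {(t.head, t.tail): t.relation_attr for t in triples if t.relation_attr}
```

Keeping the dose on the relationship lets the same medicine node be shared by several prescriptions, each with its own dose.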

1.5 software implementation

(1) Requirements: the graph database software used in this embodiment must represent multivariate relationships, accommodate both one word with multiple meanings and multiple words with one meaning, support running community discovery algorithms, and provide convenient graph queries.

(2) Advantages of Neo4j: Neo4j is a Java-based open-source non-relational graph database that stores data in a flexible network structure rather than in tables, providing native database-level storage of the graph data model, and is widely used. Its advantages are: 1) its index-free adjacency graph storage structure improves transaction processing and the handling of data relationships; 2) adding and changing data is convenient and fast, which shortens construction time; 3) nodes and relationships can each carry multiple attributes and express rich semantics; 4) Neo4j uses the Cypher query language, which provides an easy-to-understand form of expression; 5) by loading an algorithm package, Neo4j can readily run a variety of community discovery algorithms. For these reasons, this embodiment uses Neo4j to construct the knowledge graph.

(3) Data import: the labeling information, in csv format, is read using the Java language and imported into Neo4j in batches to build the knowledge graph.
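The embodiment performs this import in Java; purely as an illustration, the Python sketch below turns csv rows into parameterized Cypher `MERGE` statements that a Neo4j client could then execute. The column names, the function name, and the sample rows are assumptions. Because Cypher does not allow labels and relationship types to be passed as parameters, they are interpolated in backticks here and must come from trusted labeling data.

```python
import csv
import io

def rows_to_cypher(csv_text):
    """Convert labeled triples in csv form into (query, parameters) pairs."""
    statements = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        query = (
            f"MERGE (h:`{row['head_type']}` {{name: $head}}) "
            f"MERGE (t:`{row['tail_type']}` {{name: $tail}}) "
            f"MERGE (h)-[r:`{row['relation']}`]->(t) "
            "SET r.attribute = $attr"
        )
        params = {
            "head": row["head"],
            "tail": row["tail"],
            # An empty cell becomes None, so no attribute is stored on the edge.
            "attr": row["relation_attr"] or None,
        }
        statements.append((query, params))
    return statements

# Hypothetical csv layout holding the two example triples.
csv_text = (
    "head,head_type,relation,relation_attr,tail,tail_type\n"
    "Poria and Alisma Decoction,prescription,treats,,stomach regurgitation,disease name\n"
    "Poria and Alisma Decoction,prescription,contains,half liter,Poria,medicine\n"
)
statements = rows_to_cypher(csv_text)
```

Using `MERGE` rather than `CREATE` means a node that appears in many rows, such as a prescription, is created once and reused.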

(4) Data checking: the nodes, relationships, and graph structure in the graph are queried with Cypher statements and checked against the labeling information.

For example, the schema layer contains the connection "disease name (entity) → pathogenesis is (relationship) → pathogenesis (entity)", which is represented in the graph database as "diabetes (disease name) → pathogenesis is (relationship) → deficiency fever (pathogenesis)" and "diabetes (disease name) → pathogenesis is (relationship) → liver and lung heat (pathogenesis)". As shown in fig. 4, the nodes "deficiency fever" and "liver and lung heat" are both connected to the node "diabetes". The node category labels and the relationship labels are displayed in the upper left corner, and the label colors can be customized. For example, if the disease name label is dark orange, the corresponding nodes in the graph are also dark orange; the pathogenesis label is dark blue, and so are its nodes; the "pathogenesis is" relationship label is gray, and the corresponding arrows in the graph are gray and display the relationship name "pathogenesis is".

Step two, knowledge mining based on atlas

2.1, graph mapping: because the entity graph contains nodes of different categories, it must first be mapped to a subgraph before computation. Using Cypher statements, all symptom nodes are extracted as the nodes of the subgraph, and the number of symptom co-occurrences is counted as the weight of the edges in the subgraph (co-occurrence is defined as two symptoms being connected through the same node in the entity graph; the category of the intermediate node is not limited).
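The mapping itself is done with Cypher in the embodiment; the following Python sketch only illustrates the counting rule on an in-memory edge list. The function name, the data layout, and the example edges (in particular the direct links between the prescription and the symptoms) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def project_symptom_cooccurrence(edges, node_type):
    """Map an entity graph to a weighted symptom subgraph.

    Two symptoms co-occur once for every intermediate node (of any category)
    that is directly connected to both of them.
    """
    symptom_neighbours = defaultdict(set)   # node -> adjacent symptom nodes
    for u, v in edges:                      # edges are treated as undirected
        if node_type[v] == "symptom":
            symptom_neighbours[u].add(v)
        if node_type[u] == "symptom":
            symptom_neighbours[v].add(u)
    weights = defaultdict(int)
    for symptoms in symptom_neighbours.values():
        for a, b in combinations(sorted(symptoms), 2):
            weights[(a, b)] += 1            # one shared intermediate node
    return dict(weights)

# Hypothetical fragment of an entity graph.
node_type = {
    "stomach regurgitation": "disease name",
    "Poria and Alisma Decoction": "prescription",
    "vomiting": "symptom",
    "thirst": "symptom",
}
edges = [
    ("stomach regurgitation", "vomiting"),
    ("stomach regurgitation", "thirst"),
    ("Poria and Alisma Decoction", "vomiting"),
    ("Poria and Alisma Decoction", "thirst"),
]
weights = project_symptom_cooccurrence(edges, node_type)
```

In this fragment "thirst" and "vomiting" share two intermediate nodes, so the edge between them in the symptom subgraph has weight 2.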

2.2, algorithm calling: the APOC algorithm plug-in package is added, and the Louvain algorithm is called with a Cypher command to obtain the community distribution of the symptoms. The result is imported into Gephi for visualization. The PageRank algorithm is then run to derive the PR value of each symptom within the community to which it belongs.

The Louvain algorithm is an agglomerative community discovery algorithm that aggregates nodes into communities based on how closely they are linked. The algorithm flow is as follows:

1) each node in the graph is regarded as an independent community, and the number of the communities is the same as that of the nodes;

2) for each node i, try in turn to assign it to the community of each of its neighbor nodes, calculate the modularity change ΔQ before and after the assignment, and record the neighbor node with the largest ΔQ; if max ΔQ is larger than 0, assign node i to the community of that neighbor node, otherwise leave node i unchanged;

3) repeating 2) until the community to which all the nodes belong does not change;

4) compressing the graph: all nodes in the same community are compressed into one new node, the weights of the edges between nodes inside the community are converted into the weight of a loop on the new node, and the weights of the edges between communities are converted into the weights of the edges between the new nodes. The loop is a concept in graph theory; here, a node with a loop refers to the "new node" generated when a round of compression merges a community. For example, as shown in fig. 5, the 5 nodes 0, 1, 2, 4, and 5 belong to the same green community in the first round of community division and are collapsed into node 14 in the second round of calculation. This is equivalent to treating these 5 nodes as one new node (represented by node 14) and carrying it into the next round of community partitioning. Compressing the nodes by community greatly reduces the number of edges and nodes, and the modularity change when node i is assigned to the community of neighbor j depends only on the communities of nodes i and j, so the whole graph does not need to be traversed and the computing efficiency is greatly improved.

5) repeating from 1) on the compressed graph until the modularity of the entire graph no longer changes.
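The compression in step 4) can be sketched as follows. This is a minimal illustration, not the plug-in implementation: the function name `compress` and the toy graph (two triangles joined by one edge, all weights 1) are hypothetical, and the loop weight is assumed to be the sum of the internal edge weights with each undirected edge counted once.

```python
from collections import defaultdict


def compress(edge_weights, community_of):
    """Collapse every community into a single new node.

    edge_weights: {(u, v): w} for undirected edges, each listed once.
    community_of: {node: community id}.
    Returns {(cu, cv): w}; a key with cu == cv is the loop of a new node.
    """
    new_weights = defaultdict(float)
    for (u, v), w in edge_weights.items():
        cu, cv = community_of[u], community_of[v]
        key = (min(cu, cv), max(cu, cv))  # undirected: normalize key order
        new_weights[key] += w
    return dict(new_weights)


# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by edge 2-3.
EDGES = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (3, 4): 1, (3, 5): 1, (4, 5): 1, (2, 3): 1}
COMMUNITY = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
compressed = compress(EDGES, COMMUNITY)
print(compressed)
```

The six original nodes and seven edges are reduced to two new nodes, each carrying a loop of weight 3, connected by one edge of weight 1.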

The algorithm can generate a hierarchical community structure, and the nodes are compressed according to communities in the calculation process, so that the calculation efficiency is high, and the algorithm is one of the most efficient and widely applied community discovery algorithms at present.

Modularity is a measure of the quality of a community partition of a network. Its physical meaning is the difference between the number of edges connecting nodes within communities and the number of such edges expected under random connection, and its value range is [-1/2, 1). It is defined as follows:

$$Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$$

where m is the total weight of the edges in the whole graph, $m = \frac{1}{2}\sum_{ij}A_{ij}$, which equals the total number of edges in the graph when every edge has weight 1; $c_i$ is the community to which node i belongs; and $\delta(c_i, c_j)$ is a function whose value is 1 when node i and node j are in the same community and 0 otherwise. $A_{ij}$ is the weight of the edge between node i and node j; $k_j$ is the sum of the weights of all edges connected to node j, and $k_i$ is defined in the same way. $k_j/2m$ is the probability that node j is connected to any given node in the whole graph, so in the random case the expected connection weight of node i and node j is $k_i k_j/2m$, and $A_{ij} - k_i k_j/2m$ is the difference between the actual connection weight and the expected connection weight of nodes i and j.
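The definition can be evaluated directly, as in the following sketch. It assumes an undirected graph without loops, with δ equal to 1 for two nodes in the same community; the function name `modularity` and the toy graph (two triangles joined by one edge) are hypothetical.

```python
def modularity(edge_weights, community_of):
    """Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j).

    edge_weights: {(u, v): w} for undirected edges, each listed once,
    with no loops (assumption of this sketch).
    """
    nodes = sorted(community_of)
    # Symmetric adjacency matrix A stored as a dictionary.
    A = {(i, j): 0.0 for i in nodes for j in nodes}
    for (u, v), w in edge_weights.items():
        A[(u, v)] += w
        A[(v, u)] += w
    # Weighted degree k_i and total 2m = sum of all degrees.
    k = {i: sum(A[(i, j)] for j in nodes) for i in nodes}
    two_m = sum(k.values())
    q = 0.0
    for i in nodes:
        for j in nodes:
            if community_of[i] == community_of[j]:  # delta(c_i, c_j) = 1
                q += A[(i, j)] - k[i] * k[j] / two_m
    return q / two_m


# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by edge 2-3.
EDGES = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (3, 4): 1, (3, 5): 1, (4, 5): 1, (2, 3): 1}
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {n: "a" for n in range(6)}
q_two = modularity(EDGES, two_groups)
q_one = modularity(EDGES, one_group)
print(q_two, q_one)
```

Splitting the graph into its two triangles gives Q = 5/14, whereas placing every node in a single community gives Q = 0, which illustrates why a good partition scores higher.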

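The formula for the modularity can be simplified as:

$$Q = \frac{1}{2m}\sum_{c}\left[\Sigma_{in} - \frac{(\Sigma_{tot})^2}{2m}\right]$$

that is, $[\Sigma_{in} - (\Sigma_{tot})^2/2m]$ is accumulated over all communities c, where $\Sigma_{in} = \sum_{i,j \in c} A_{ij}$ is the sum of the weights inside community c and $\Sigma_{tot} = \sum_{i \in c} k_i$ is the sum of the weights of all edges attached to the nodes of community c. According to the formula, the larger the sum of the internal weights of a community and the smaller the weights of its links to the outside, the greater the modularity.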
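The simplification can be checked by grouping the double sum by community. The derivation below is a sketch that assumes $\Sigma_{in}$ denotes $\sum_{i,j\in c}A_{ij}$ (each internal edge counted in both directions) and $\Sigma_{tot}$ denotes $\sum_{i\in c}k_i$; these readings are assumptions chosen to be consistent with the definition of Q.

```latex
\begin{aligned}
Q &= \frac{1}{2m}\sum_{i,j}\Bigl[A_{ij}-\frac{k_i k_j}{2m}\Bigr]\delta(c_i,c_j) \\
  &= \frac{1}{2m}\sum_{c}\Bigl[\sum_{i,j\in c}A_{ij}
     -\frac{1}{2m}\Bigl(\sum_{i\in c}k_i\Bigr)\Bigl(\sum_{j\in c}k_j\Bigr)\Bigr] \\
  &= \frac{1}{2m}\sum_{c}\Bigl[\Sigma_{in}-\frac{(\Sigma_{tot})^{2}}{2m}\Bigr]
\end{aligned}
```

The second line uses the fact that the δ function keeps only the pairs (i, j) lying in the same community c, so the sum over all pairs splits into one block per community.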

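Step 2) of the Louvain algorithm requires the modularity gain, which is calculated as follows:

$$\Delta Q = \left[\frac{\Sigma_{in} + k_{i,in}}{2m} - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^2\right] - \left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2\right]$$

ΔQ represents the change in modularity when node i is assigned to community c, in which neighbor node j is located. Here $k_{i,in}$ is the sum of the weights of the edges between node i and the nodes within community c (counted in the same way as $\Sigma_{in}$, i.e. $k_{i,in} = \sum_{j \in c}(A_{ij} + A_{ji})$).

The formula is divided into two parts.

The first half,

$$\frac{\Sigma_{in} + k_{i,in}}{2m} - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^2$$

represents the modularity of community c after node i has been added to it (the factor 1/2m has been moved inside the brackets, and only one community is involved, so no accumulation $\frac{1}{2m}\sum_{c}$ appears here).

The second half,

$$\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2$$

represents the modularity of community c and of node i as a separate community before the assignment.

After the formula is expanded, several terms cancel, and the following simplification is obtained:

$$\Delta Q = \frac{1}{2m}\left(k_{i,in} - \frac{\Sigma_{tot}\,k_i}{m}\right)$$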
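The cancellation can be verified numerically. The sketch below assumes the simplified gain has the form ΔQ = (1/2m)(k_i,in − Σtot·k_i/m), and that both Σin and k_i,in count every undirected edge in both directions; these conventions, the function names `gain` and `community_term`, and the toy graph are assumptions made for illustration.

```python
def gain(k_i_in, sigma_tot, k_i, m):
    """Simplified modularity gain: (1/2m) * (k_i_in - sigma_tot * k_i / m)."""
    return (k_i_in - sigma_tot * k_i / m) / (2 * m)


def community_term(sigma_in, sigma_tot, m):
    """Contribution of one community to Q: sigma_in/2m - (sigma_tot/2m)^2."""
    return sigma_in / (2 * m) - (sigma_tot / (2 * m)) ** 2


# Hypothetical toy graph: triangles 0-1-2 and 3-4-5 joined by edge 2-3,
# all weights 1, so m = 7. Node i = 2 is moved into community c = {0, 1}.
m = 7
k_i = 3            # node 2 has edges to 0, 1 and 3
k_i_in = 2 * 2     # edges 2-0 and 2-1, each counted in both directions
sigma_in = 2 * 1   # edge 0-1 counted in both directions
sigma_tot = 2 + 2  # degrees of nodes 0 and 1

# Second half: community c and node i as separate communities.
before = community_term(sigma_in, sigma_tot, m) + community_term(0, k_i, m)
# First half: community c after node i has joined it.
after = community_term(sigma_in + k_i_in, sigma_tot + k_i, m)

delta_q = gain(k_i_in, sigma_tot, k_i, m)
print(delta_q, after - before)
```

Both expressions give the same positive value, so under these conventions the move of node 2 into the community of nodes 0 and 1 would be accepted in step 2).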

In actual operation, the input of the Louvain algorithm is the symptom subgraph data extracted in the previous step: the symptoms are the nodes, and the associations between nodes (whether they co-occur, and how many times) are the edges. The configuration parameters of the algorithm use default values. The outputs of the algorithm are the number of communities, the modularity, the number of nodes, and the specific nodes (symptoms) contained in each community, as shown in table 1; each community represents a syndrome, and the nodes in a community represent the symptoms that may appear under that syndrome. The output is imported into Gephi for visualization, as shown in fig. 6.

The Louvain algorithm yields only the community distribution of the symptoms, whereas professional analysis of a syndrome also requires the primary and secondary relations among the symptoms within a community. The problem can be abstracted as calculating the influence of each symptom node in its syndrome community: the more edges a symptom has, the more likely it is to appear together with other symptoms, and the greater its influence on the syndrome community. The co-occurrence information of symptoms in the entity graph has already been extracted as the edges between symptoms in the subgraph in 2.1, and in this example the influence of each symptom within its community is calculated with the PageRank algorithm.

The PageRank algorithm originates from the analysis of web link structure: web pages on the Internet are regarded as the nodes of a graph and the links between them as edges. The physical meaning of the PageRank (PR) value of a page A is the probability that page A is visited, which is obtained by accumulating the importance scores passed on by all pages that link to page A.

For a page A, its PR value is:

$$PR(A) = (1-d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right)$$

where PR(A) is the PR value of page A; $PR(T_n)$ is the PR value of page $T_n$, which is one of the pages pointing to A; $C(T_n)$ is the out-degree of page $T_n$, i.e. the number of edges from $T_n$ to other pages; and d is the damping coefficient, which represents the probability that a user who has reached a page continues browsing onward from it at any time. It is estimated from the average frequency with which web surfers use browser bookmarks and is usually taken as 0.85.

This formula applies when the edges between nodes are unweighted. In this example the edges between symptom nodes carry weight information, so the weights must be accumulated when calculating $PR(T_n)$ and $C(T_n)$. For example, the connections among symptoms A, B, C and D in a community are shown in fig. 7, where the edge between A and B indicates that the two symptoms co-occur in the entity graph and the number 3 on the edge indicates that they co-occur 3 times.

Assuming that the initial PR value of each node is 1:

PR(A) = (1 - d) + d (PR(B)/C(B) + PR(C)/C(C) + PR(D)/C(D))

= (1 - 0.85) + 0.85 [3/(3+1) + 2/(2+1) + 1/(1+1+1)] = 1.6375

The PR values of B, C and D are calculated in the same way. The new PR(B), PR(C) and PR(D) are then substituted into the formula to start the second round of calculation of PR(A) and of the PR value of every other symptom. The iteration continues until the difference between the results of two adjacent iterations is less than 0.00001, at which point it has converged.
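The weighted iteration can be sketched in Python as follows. The edge list is an assumption: it is one graph that is consistent with the arithmetic of the first round (A-B 3, A-C 2, A-D 1, B-D 1, C-D 1), not a transcription of fig. 7. The function name `weighted_pagerank` is hypothetical, and all nodes are updated simultaneously from the values of the previous round.

```python
def weighted_pagerank(edge_weights, d=0.85, tol=1e-5, max_iter=1000):
    """Iterate PR(n) = (1 - d) + d * sum_t PR(t) * w(t, n) / C(t),
    where C(t) is the total weight of the edges attached to t."""
    # Symmetric neighbor weights for an undirected co-occurrence graph.
    nbrs = {}
    for (u, v), w in edge_weights.items():
        nbrs.setdefault(u, {})[v] = w
        nbrs.setdefault(v, {})[u] = w
    out_weight = {n: sum(ws.values()) for n, ws in nbrs.items()}
    pr = {n: 1.0 for n in nbrs}  # initial PR value of 1 for every node
    history = []
    for _ in range(max_iter):
        new = {
            n: (1 - d) + d * sum(pr[t] * w / out_weight[t] for t, w in nbrs[n].items())
            for n in nbrs
        }
        history.append(new)
        # Stop when two adjacent iterations differ by less than the threshold.
        if max(abs(new[n] - pr[n]) for n in nbrs) < tol:
            return new, history
        pr = new
    return pr, history


# Assumed edge list, consistent with the first-round arithmetic in the text.
EDGES = {("A", "B"): 3, ("A", "C"): 2, ("A", "D"): 1, ("B", "D"): 1, ("C", "D"): 1}
pr, history = weighted_pagerank(EDGES)
print(history[0]["A"])
print(sorted(pr, key=pr.get, reverse=True))
```

The first round reproduces the value of PR(A) computed by hand. Because every node redistributes its whole PR value along its edges, the PR values always sum to the number of nodes in this formulation.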

The PR value and ranking of each symptom node within its community are then derived.

The larger the PR value, the greater the probability that the symptom appears. Symptoms ranked in the top 10% by PR value are regarded as primary symptoms, and the rest as secondary symptoms. The calculation results are shown in Table 1.
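The ranking rule can be sketched as follows. The PR values below are hypothetical and are not the results of Table 1; the function name `split_primary`, rounding the cutoff upward, and keeping at least one primary symptom are all assumptions, since the text does not state how a fractional 10% is handled.

```python
def split_primary(pr_values, percent=10):
    """Rank symptoms by PR value and split them into primary and secondary.

    The cutoff is rounded up and is at least 1 (assumption of this sketch).
    """
    ranked = sorted(pr_values, key=pr_values.get, reverse=True)
    cutoff = max(1, (len(ranked) * percent + 99) // 100)  # integer ceiling
    return ranked[:cutoff], ranked[cutoff:]


# Hypothetical PR values for one community of ten symptoms.
PR_VALUES = {
    "thirst": 2.4, "polydipsia": 1.9, "fever": 1.3, "headache": 1.1,
    "restlessness": 0.9, "palpitation": 0.8, "spasm": 0.6,
    "weakness of limbs": 0.5, "lusterless nails": 0.3, "pale complexion": 0.2,
}
primary, secondary = split_primary(PR_VALUES)
print(primary, secondary)
```

With ten symptoms the top 10% is a single symptom, so only the highest-ranked symptom is marked as primary in this toy example.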

Knowledge graph construction and syndrome mining results:

1. Graph construction

As shown in fig. 8, the embodiment includes 454 ancient book articles, from which 1432 graph nodes were constructed, 168 of them symptom nodes, together with 24 classes of relationships.

2. Community discovery

As shown in fig. 6, this embodiment obtains 4 core communities, which together account for 94.05% of the symptom nodes of the entity graph. Community 1 is the largest community, accounting for 47.02% of all symptom nodes.

3. Analysis of syndromes

The symptoms in each community were ranked by PR value, and the top 10% were taken as the principal symptoms of the community. The symptoms were then analyzed for six-channel syndromes using professional knowledge, as shown in table 1:

TABLE 1 Community Classification of six-channel syndromes of diabetes

The traditional method of syndrome research is factor analysis, and an earlier study compared the community discovery result with the factor analysis result (shown in table 2). The common factors in Table 2 represent the underlying syndromes, the components are the symptoms grouped under each syndrome, and the contribution degree represents how strongly a symptom supports the distinction between common factors. The comparison revealed the following defects of factor analysis for syndrome mining: 1. the contribution degree of a symptom cannot be used to judge whether it is a main symptom, because symptoms that occur in every common factor (such as thirst and polydipsia, which are typical symptoms of diabetes) show a low contribution degree; 2. important symptoms are missing, so the six-channel classification of the common factors is unclear; only a preliminary analysis of pathogenesis elements is possible, and it is difficult to determine the syndrome or to give the results a professionally reasonable explanation, so the help for syndrome mining is limited.

TABLE 2 analysis of essential factors of diabetes pathogenesis

The advantages of syndrome research by community discovery are as follows:

Mining syndrome information is one of the important purposes of knowledge discovery in ancient books. Syndrome research based on theoretical deduction is limited by personal experience and clinical level, whereas research using data mining methods is more objective. Traditional Chinese medicine syndromes are high-dimensional and high-order, and the symptoms that make up a syndrome have nonlinear, complex relationships. Methods such as factor analysis can find co-occurrence rules among symptoms, but they presuppose that the symptoms of different syndromes are relatively independent, which does not accord with clinical practice; they also reduce the dimensionality of the data during the operation, which causes information loss. A complex network can represent nonlinear, complex relationships and better matches the structure of traditional Chinese medicine syndrome differentiation knowledge. One important feature of a complex network is its community structure: a community discovery algorithm can reveal which nodes in the network are closely connected and share similar attributes, and can mine potential categories on that basis. The community discovery algorithm takes into account the contribution of each symptom to the syndrome type without omitting the interactions among symptoms. For this reason, studies have attempted to use community discovery algorithms for syndrome analysis and for research on medication rules.

Community discovery algorithms can be divided in principle into two categories, divisive and agglomerative. Divisive methods partition communities by removing the edges between them, whereas agglomerative methods aggregate nodes into communities according to how closely they are linked. The Louvain algorithm is an agglomerative method that improves the efficiency of community division by optimizing modularity. The algorithm first tries to assign each node in the graph to the community of each of its neighbor nodes, keeps the assignment that yields the largest increase in modularity, and repeats this step until no community changes any more; it then merges each resulting small community into a super node to rebuild the network, converting the edge weights between nodes inside a community into the loop weight of the new node and the edge weights between communities into the edge weights between the new nodes. These steps are iterated until the result is stable. The algorithm can generate a hierarchical community structure, and because the nodes are compressed by community during the calculation, it is computationally efficient and is one of the most efficient and widely applied community discovery algorithms at present.

The Louvain algorithm yields only the community distribution of the symptoms, but professional analysis of a syndrome also requires the primary and secondary relations among the symptoms in a community. In the subgraph used for community discovery, the connections between symptoms are derived from the symptom co-occurrence information in the knowledge graph, so the problem can be abstracted as calculating the influence of the symptom nodes within a community. The PageRank algorithm originates from the analysis of web link structure: web pages on the Internet are regarded as the nodes of a graph and the links between them as edges, and the physical meaning of the PageRank (PR) value of a page A is the probability that page A is visited, obtained by accumulating the importance scores passed on by all pages that link to it. This embodiment compares the influence of symptoms within a community by calculating the PR values of the symptom nodes. The larger the PR value, the higher the probability that the symptom appears under the syndrome and the more likely it is a primary symptom, and vice versa.

The advantages of community discovery based on knowledge graph:

A knowledge graph is a semantic network that can formally describe the concepts and relationships of a domain. Its advantages are: integrating knowledge from multiple data sources; providing convenient knowledge retrieval; enabling knowledge discovery through knowledge reasoning; and providing prior knowledge for artificial intelligence systems. Constructing a knowledge graph can improve the utilization of ancient book knowledge and the efficiency of knowledge workers, and carrying out standardized syndrome research through the knowledge graph lays a foundation for subsequent research that fuses knowledge graphs with deep learning. At present, knowledge graph research in the field of traditional Chinese medicine covers knowledge integration, the inheritance of the experience of renowned veteran traditional Chinese medicine practitioners, traditional Chinese medicine preparations, traditional Chinese medicine scientific research and the like. In the field of prescriptions, existing studies have used a graph database to construct and query a graph of cassia twig decoction prescriptions, or have implemented the querying of treatment schemes and prescriptions based on "prescription-syndrome correspondence"; research centered on knowledge discovery is still scarce.

From a data structure perspective, a knowledge graph is also a complex network. Using a knowledge graph for community discovery is more intuitive than traditional complex network analysis; in particular, once the ancient book knowledge of traditional Chinese medicine has been converted into graph-structured data, the associations among symptoms can be clearly established, and this information is carried into the community discovery process through subgraph extraction.

Analysis of the distribution rule of the six-channel syndromes of diabetes:

Regarding the six-channel syndromes of diabetes, it is held that the onset of diabetes usually starts from excess heat in the middle jiao and that fire should be considered in treatment throughout the course of the disease; phlegm, blood stasis and toxin gradually appear as complications during its development, and obvious deficiency syndromes appear in the later stage because stagnated heat leads to deficiency. This example shows that the 4 communities match the syndrome pattern of diabetes at each stage: community 1 is the typical syndrome of the early stage, with fire-heat as the main manifestation; community 4 is the syndrome of the late stage, with deficiency as the main manifestation; and communities 2 and 3 are syndromes arising during the development of the disease, with different pathogeneses such as fluid retention, water-heat and blood damage.

Community 1 takes Yangming heat as its core pathogenesis. Fire-heat damaging the body fluid causes symptoms such as thirst, polydipsia, fever and headache, as in "thirst belonging to Yangming" (97th item in the Treatise on Exogenous Febrile Disease) and "Yangming disease, fever and sweating, which are heat-excess" (97th item in the Treatise on Exogenous Febrile Disease, dialectical treatment of syndromes of lower pathogenesis); fire-heat disturbing the heart-mind manifests as restlessness and palpitation. Although the main syndrome of this community is fire-heat, it is accompanied by symptoms of fire-heat damaging the blood so that the vessels are not nourished, such as spasm and weakness of the limbs, reversed tension, withered and lusterless nails, and pale complexion.

Community 2 is a combined disease of the two meridians Taiyin and Yangming, and its syndromes fall into two categories: intermingled water and fire, and water-heat damaging the blood. In addition to symptoms caused by fire-heat such as thirst, polydipsia and polyphagia, this community also shows symptoms caused by retained fluid such as frequent urination, dysuria, and turbid or sweet urine, as in the Jin Gui Yao Lue chapter on Xiaoke and dysuria: "dysuria with damp" or "dysuria"; chest oppression, thirst without desire for water and vertigo can be seen in the Linggui Zhugan decoction syndrome: "downward fullness in the heart, upward chest thrust, and dizziness due to qi" (Treatise on Cold-induced Diseases, item 67). The combination of water and heat burns the body fluids and can cause hematuria and pale complexion, such as: "Heat in the lower energizer causes hematuria" (Jin Gui, chapter on the accumulation of wind-cold in the five zang organs).

Community 3 is marked by deficiency-cold with water retention and belongs to the category of Taiyin disease (water syndrome). Shortness of breath, urinary obstruction, edema and heavy pain of the limbs are manifestations of deficiency of yin and spleen yang with internal retention of cold-dampness, as recorded in the Jin Gui Yao Lue chapter on phlegm-fluid retention and cough: "short qi failing to lie down, its form being swollen", "short qi with slight drinking, when it is removed from the small stool"; the failure of the spleen and stomach to warm and transport because of accumulated cold-dampness manifests as abdominal pain, vomiting and indigestion, which are the symptoms of "taiyin disease with abdominal fullness and vomiting, indigestion, self-benefiting and prolonged pain of the abdomen"; excessive cold-dampness may cause syncope of hand and foot and aversion to cold, such as cold in the waist and cold in the clothing, which are the symptoms of kidney retention (Jin Gui Yao Lue, chapter on the accumulation of wind-cold in the five zang organs).

Community 4 is marked by deficiency of both body fluid and blood and belongs to the category of Taiyin disease (blood syndrome). Emaciation, hypodynamia, limb atrophy, amnesia, depression, dry throat, cracked lips and the like are symptoms of deficient body fluid and blood failing to nourish the muscles, head, face and brain orifices; poor appetite, abdominal pain around the umbilicus, oliguria and the like are syndromes of body fluid and blood being insufficient to moisten the intestines and to produce urine. The symptoms are similar to those of Xiaojianzhong decoction, the general formula for consumptive disease: deficiency internal urgency, palpitation, epistaxis, abdominal pain, dream-disturbed sperm, soreness and pain of limbs, feverish sensation in the hands and feet, dry throat and mouth. At this stage the fire-heat is not an excess syndrome but a deficient fire caused by deficiency of body fluid and blood and by yin failing to astringe yang, so the body fluid and blood are tonified by the middle-jiao building method.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A traditional Chinese medicine meridian ancient book knowledge graph construction and syndrome mining system, comprising a knowledge graph construction module, a graph database and a knowledge mining module, characterized in that the knowledge mining module comprises a graph mapping unit and a community discovery calculation unit;

the graph mapping unit extracts all symptom nodes in the entity graph as subgraph nodes, counts the number of times of symptom co-occurrence as the weight of edges in the subgraph, wherein the co-occurrence is that two symptom nodes are connected through the same middle node in the entity graph, and the types of the middle nodes are not limited;

1.2, sequentially distributing the node i to the community where each neighbor node is located, calculating the change value delta Q of the modularity before and after distribution, and recording the neighbor node with the maximum change value delta Q of the modularity, wherein if the maximum change value max delta Q of the modularity is larger than 0, the node i is distributed to the community where the neighbor node with the maximum change value delta Q of the modularity is located, otherwise, the node i is kept unchanged;

1.4, compressing the entity graph, compressing all nodes in the same community into a new node, converting the weight of edges between nodes in the community into the weight of a ring of the new node, and converting the edge weight of a community interval into the weight of edges between the new nodes;

1.5, repeat 1.1 until the modularity of the whole graph is no longer changed.

2. The system of claim 1, wherein the modularity is used to measure a network partition condition of a community, and is a difference between the number of edges connecting nodes in the community and the number of edges under a random condition, and the formula is as follows:

the formula for the modularity is simplified as:

$$Q = \frac{1}{2m}\sum_{c}\left[\Sigma_{in} - \frac{(\Sigma_{tot})^2}{2m}\right]$$

which represents the accumulation of $[\Sigma_{in} - (\Sigma_{tot})^2/2m]$ over all communities; it can be known from the formula that the larger the sum of the internal weights of the community and the smaller the weights of the links between the community and the outside, the larger the modularity;

ΔQ represents the change in modularity when node i is assigned to community c, in which neighbor node j is located, where $k_{i,in}$ is the sum of the weights of the edges between node i and the nodes in community c, the first half of the formula

Representing modularity after adding node i to community c, the second half

3. The system for building ancient book knowledge graph and mining syndromes of traditional Chinese medicine meridian according to claim 2, wherein the value range of the modularity is [ -1/2, 1).

4. The system for traditional Chinese medicine meridian ancient book knowledge graph construction and syndrome mining according to claim 1, characterized in that the community discovery calculation unit calculates influence of symptom nodes in the syndrome community by using co-occurrence information of symptoms in the extracted entity graph as a connecting edge of symptoms in a subgraph, for obtaining primary and secondary relations among symptoms in the community, the more connecting edges of the symptoms, the more likely the symptoms appear together with other symptoms, and therefore the larger the influence on the syndrome community is;

$$PR(a) = (1-d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right)$$

wherein PR(a) is the PR value of symptom node a and represents the probability of a being visited; $PR(T_n)$ is the PR value of node $T_n$, which is one of the nodes pointing to symptom node a; $C(T_n)$ is the out-degree of node $T_n$, i.e. the number of edges from node $T_n$ to other nodes; the weights, namely the numbers of co-occurrences, are accumulated when calculating $PR(T_n)$ and $C(T_n)$; and d is a damping coefficient representing the probability of reaching a certain node and continuing onward from it at any time;

and calculating PR values of other nodes in the same manner, substituting the PR values into a formula, and calculating a second round of PR (a) until the difference value of the results of two adjacent iterations is smaller than a convergence threshold value, so as to derive the PR value of each symptom node in the community to which the symptom node belongs, wherein the larger the PR value is, the larger the probability of the symptom is, and thus the division of the importance of the symptom is performed.

5. The system of claim 1, wherein the knowledge graph building module comprises:

the original text screening and semantic analysis unit screens out the content related to the keywords from the original text, eliminates irrelevant articles and removes the duplication of the screening result; semantically analyzing original texts related to the keywords, splitting the original texts into minimum morphemes, and establishing an original text concept according to the minimum morphemes;

the concept term standardization unit is used for establishing a standard word bank, corresponding the original text concept with the standardized concept, naming the nodes in the knowledge graph by adopting standard words and generating standardized concept terms;

and importing the data of the knowledge graph into a graph database to form an entity graph.

6. The system of claim 5, wherein the relevant content comprises selecting a sentence containing keywords and their synonyms, a sentence describing related diseases of the keywords, a sentence with a typical prescription for treating a keyword disorder, and a sentence containing the keywords but having no relation between the actual meaning and the keywords; the clauses containing the keywords and the synonyms thereof comprise the medicines containing the keywords in the whole chapter content and the efficacy.

7. The system of claim 5, wherein the original concepts are selected by sorting nouns in the smallest morphemes as candidate concepts and sorting phrases formed by the smallest morphemes as candidate triples.

8. The system of claim 5, wherein the concept term standardization unit further establishes a synonym table, and words of the original text concept are labeled as node attributes.

9. The system of claim 5, wherein the knowledge graph building module further comprises a raw text labeling unit for further parsing semantic meaning, and labeling concept terms in raw text as node names, entity types to which concepts belong as node types, and one node can belong to a group of node types, and phrases are labeled as triples.

10. The system of claim 1, wherein the knowledge map model layer definition unit comprises the following entity classes: the disease names, the causes, the treatment methods, the prescriptions, the syndromes, the symptoms, the physical signs, the medicines, the effects and the acupuncture points, the syndromes comprise physical signs, symptoms and pathogenesis, the prescriptions comprise the medicines and are respectively connected with the causes, the treatment methods, the prescriptions, the physical signs and the symptoms by taking the disease names as the center, the treatment methods comprise an internal treatment method and an external treatment method, wherein the external treatment method is connected with the symptoms through the acupuncture points, and the prescriptions are also connected with the effects and the syndromes.