CN112100405A

CN112100405A - Veterinary drug residue knowledge graph construction method based on weighted LDA

Info

Publication number: CN112100405A
Application number: CN202011010727.XA
Authority: CN
Inventors: 郑丽敏; 杨璐; 张恬
Original assignee: China Agricultural University
Current assignee: China Agricultural University
Priority date: 2020-09-23
Filing date: 2020-09-23
Publication date: 2020-12-18
Anticipated expiration: 2040-09-23
Also published as: CN112100405B

Abstract

The invention discloses a veterinary drug residue knowledge graph construction method based on weighted LDA (Latent Dirichlet Allocation). Firstly, constructing a knowledge framework of veterinary drugs, and utilizing a web crawler to combine the knowledge framework to perform deep search and download documents. And aiming at the problems of theme noise and characteristic word bias existing in the LDA theme model, a weighted LDA method is used for theme mining, and veterinary drug related documents are downloaded again. Named entity recognition and relationship extraction is accomplished using a dictionary-based model. And finally, constructing a veterinary drug knowledge map by using a Neo4j map database. The method can be used for constructing the veterinary drug residue knowledge graph, finding out the characteristic rule of the veterinary drug residue and the reason of harm of the veterinary drug residue to the human body, and ensuring the quality safety of the meat, the eggs and the milk, thereby protecting the health and the life safety of people.

Description

Veterinary drug residue knowledge graph construction method based on weighted LDA

Technical Field

The invention relates to the field of natural language processing, in particular to a veterinary drug residual knowledge graph construction method based on weighted LDA.

Background

Food safety issues are receiving increasing attention, and among them, meat, egg and milk food safety issues are more important. The veterinary drug has important functions of preventing and treating animal diseases and promoting animal growth, and the animal product can not be cultured in the process of animal breeding. However, the phenomena of non-standardization, illicit use and abuse of veterinary drugs lead to the overproof veterinary drug residues, thereby causing toxic events. The characteristic rule of veterinary drug residue and the reason of harm of the veterinary drug residue to human bodies are found out by constructing a veterinary drug residue knowledge graph, and the quality safety of meat, eggs and milk products is ensured, so that the health and life safety of people are protected.

The veterinary drug residue data relates to the residue standard of veterinary drugs, the sampling inspection standard exceeding condition of the veterinary drugs, the animal toxicology experimental data of the veterinary drug residue, the harm symptoms to people and the like. These data include structured data and unstructured text data. And extracting and classifying knowledge by using the data to construct a basic veterinary drug residual knowledge framework, and downloading relevant veterinary drug knowledge documents by using the constructed veterinary drug knowledge framework.

And (4) combining the basic framework of veterinary drug residual knowledge to download the literature to obtain the literature related to veterinary drug knowledge. And (4) performing theme mining by adopting LDA to obtain potential information in veterinary drug literature. LDA (late Dirichlet Allocation) is an unsupervised machine learning technique that can be used to identify underlying topic information in large-scale document collections or corpora. The LDA topic mining method considers that all words have the same weight, and actually a large number of high-frequency irrelevant words do not contribute to topic mining, and a weighted LDA topic mining method combining the semantic similarity of the veterinary drug knowledge level and TF-IDF is adopted to download documents.

Through data fusion, data integration, entity identification and relation extraction, the knowledge graph is further constructed, the characteristic rule of veterinary drug residue and the reason of harm of the veterinary drug residue to a human body are found, and the quality safety of meat, eggs and milk is ensured, so that the health and life safety of people are protected.

Disclosure of Invention

The invention aims to provide a veterinary drug residual knowledge graph construction method based on weighted LDA, and in order to solve the technical problems, the invention mainly comprises the following technical contents:

a veterinary drug residual knowledge graph construction method based on weighted LDA comprises the following steps:

(1) constructing a veterinary medicine knowledge framework: knowledge is extracted from veterinary pharmacology, veterinary toxicology books using a hierarchical analysis and rule based approach. Veterinary toxicology-related knowledge was obtained from the Pubchem website using a wrapper-based approach. Utilizing a jieba word segmentation tool to perform word-stop, word segmentation and part-of-speech tagging on the linguistic data to finally form a dictionary and form a hierarchical veterinary drug knowledge frame;

(2) downloading document data: and performing multilayer search on the Web of science by using the dictionary obtained in the last step and combining the name of the veterinary drug, namely traversing each path from the root node to the leaf node, and performing multilayer result search on all vocabularies on each path. Classifying the obtained documents by using a Support Vector Machine (SVM) method, wherein the classification comprises two categories of veterinary drug knowledge correlation and veterinary drug knowledge irrelevance. For veterinary drug knowledge-related documents, performing theme extraction by using a weighted LDA method;

(3) information extraction: dictionary-based named entity recognition and relationship extraction;

(4) constructing a knowledge graph: and (3) importing the entities of the veterinary drug field knowledge and the relationships among the entities into a Neo4j database in a csv format.

The veterinary drug knowledge framework constructed in the step (1) comprises the following contents:

(a) the veterinary drug residue knowledge structure is formulated, and the veterinary drug residue knowledge structure comprises five parts: veterinary drug residues, toxicology, effects on organs and systems, attributes and toxicity;

(b) veterinary drug residue: including cause, impact and harm. The harm can be divided into three parts of harm to human bodies, food and environment;

(c) the veterinary drug attributes are as follows: category, physicochemical properties, pharmacokinetics, action, application, maximum residual limit and adverse reactions;

(d) toxicity of veterinary drugs: classification of toxic effects, common parameters, special risk groups, exposure routes, preventive measures, inhalation mode, animal experiments. The categories of toxic effects include nature, time of occurrence, location and recovery. Commonly used parameters include acute toxicity, mutagenicity, carcinogenicity, teratogenicity, acute toxicity, and the like. The animal experiment object comprises mice, rats, rabbits, dogs and the like;

(e) theory of toxicity of veterinary drugs: including objects, content and methods. The method can be divided into two parts of biological experiment and population investigation; effects on organs and systems: including the eye, skin, liver, kidney, nervous system, blood system, immune system, gastrointestinal tract, endocrine system, and respiratory system;

(f) each portion, if containing table contents, is placed under the corresponding category.

The multi-layer search in the step (2) comprises the following steps:

(a) the selected veterinary drugs are: the standard of the maximum residue limit of veterinary drugs in national food standard for food safety stipulates 2191 residue limits and use requirements of 267 veterinary drugs in livestock and poultry products, aquatic products and bee products;

(b) and (4) data capture of a dynamic webpage (Ajax) after the Selenium and chrome driver are used. The searching range is all documents of a database established by web office to date, and the searching is performed without limitation on periodicals in consideration of small data volume of veterinary toxicology research;

(c) according to the veterinary medicine knowledge framework, from the root node to the leaf node. And combining the keywords to search in a multi-layer result for all nodes on each path.

The SVM text classification in the step (2) comprises the following steps:

(a) the purpose is to divide the obtained literature sets into two categories, wherein the veterinary drug knowledge is related and the veterinary drug knowledge is unrelated;

(b) the TF-IDF method calculates and expresses the importance degree of a certain keyword in the text through a statistical method. TF refers to word frequency and indicates the frequency of occurrence of a given entry in the text, and IDF refers to the inverse text frequency and is a measure of the general importance of a word. Entry t_iIn the text d_jThe TF calculation method in (1):

wherein n is_i，jAs an entry t_iIn the text d_jThe denominator represents the number of occurrences of the text d_jThe sum of the number of occurrences of all words in (a).

The calculation method of the IDF comprises the following steps:

where | D | is the total number of texts, | j: t is t_i∈d_jI is a word containing t_iTo prevent the entry not being in the text, which results in a denominator of 0, the denominator is increased by 1. Finally, the entry t is obtained_iTF-IDF value of (1):

tfidf_i，j＝tf_i，j×idf_i

(c) firstly, feature words of a paper part abstract are extracted by using a TF-IDF algorithm, and a document vector is generated. The vector dimensionality is too high when the full text in the abstract is selected, so that the complexity of calculation is increased, and subsequent classification is not facilitated. Selecting a short text of a conclusion part in a thesis abstract according to the characteristics of a veterinary drug knowledge-related document, and extracting feature words for the short text by using a TF-IDF algorithm to generate a document vector;

(d) and randomly selecting partial data and manually marking. Setting the proportion of the training set to the test set to be 8: 2;

(e) adjusting an SVM penalty parameter C, and evaluating the model by combining the accuracy (a), the precision (P), the recall rate (R) and the F1 value;

(f) verifying the model by the test set;

(g) and for literature data, using a trained model to obtain literature data related to veterinary drug residue topics.

The establishment of the weighted LDA topic model in the step (2) comprises the following steps:

(a) LDA (latentdirichletalogenation) is a 3-layer Bayesian model, which describes the relationship among documents, topics, and vocabularies. The graphical model is shown in fig. 2. Meanings of respective symbols in the drawings: alpha is a hyperparameter of the Dirichlet distribution theta and beta is the Dirichlet distribution

Is a polynomial distribution of "document-subject" [ theta ],

Is the polynomial distribution of "topic-vocabulary", z is the topic assignment of words, w is a word, K is the number of topics, M is the number of documents, N is the number of words of a document;

(b) the process of LDA:

1. randomly assigning a theme number Z to each vocabulary in each document in the corpus;

2. rescanning the corpus, Sampling each word by using a Gibbs Sampling formula, solving the theme of each word, and updating the theme in the corpus;

3. repeating the step 2 until Gibbs Sampling converges;

4. and (4) counting a topic word co-occurrence frequency matrix of the corpus, wherein the matrix is a model of the LDA.

(c) Gibbs Sampling fitting θ,

The process of (2):

1. scanning an article for each word w_nRandomly assigning a theme Z_j；

2. Initialization Z_jAn integer of 1 to K;

3. rescanning each article, performing topic modeling on the corpus by adopting an LDA (Linear discriminant analysis) model, continuously iterating parameter reasoning by using Gibbs Sampling, and simultaneously recording Z_jThe value of (c). The parameter theta,

The calculation formula of (a) is as follows:

wherein the content of the first and second substances,

is the number of words for topic j in article d,

is the number of words for all topics in article d,

is the number of times the word w appears under topic j,

is the total number of words for topic j in article d.

(d) The LDA algorithm does not well combine related semantic information in the topic modeling process, which seriously affects the semantic consistency, interpretability and accuracy of text semantic representation of the topic. Aiming at the vocabulary distribution characteristics of the veterinary drug knowledge, according to the semantic similarity between each word and the seed word of the veterinary drug knowledge, the similarity is calculated by using a hierarchical semantic similarity calculation formula, different weights are given to the vocabulary, and weight information is integrated into a Gibbs sampling process;

p1 and p2 represent two vocabularies, d represents the path distance of p1 and p2 in the veterinary drug knowledge hierarchy, and the larger d is, the smaller the similarity is. The value range of the similarity is [0,1 ]. k is an adjustable parameter, typically set to 20 by default.

(e) LDA is biased to the extraction of high frequency words in the parameter estimation process, and some low frequency characteristic words of the implied subject are submerged. Considering both high-frequency words and low-frequency characteristic words of implicit topics, and considering the TF-IDF method for optimization, the TF-IDF calculates and expresses the importance degree of a certain keyword in a text through a statistical method. And weighting a theme-word matrix generated by the theme model iteration by calculating the TF-IDF value of the word, thereby effectively weakening the influence of the high-frequency noise word.

(f) Weighted LDA step:

1. performing word segmentation and word removal processing on the thesis abstract data set;

2. gibbs sampling is carried out on the corpus, and document-theme distribution and theme-word distribution are generated;

3. calculating similarity, sequencing according to the similarity, reserving the first K/2 topics as candidate topics, and constructing new document-topic distribution and topic-word distribution by combining the candidate topics;

4. and weighting the distribution of the theme-words by using TF-IDF to obtain a weighted probability, and selecting 20 characteristic words with the highest weight according to the distribution condition of the theme-words.

(g) Determination of number of topics in LDA: when the model is trained, the number of themes needs to be set in advance, and parameters are manually adjusted according to the trained result. The number of subjects is 40, the hyper-parameter alpha is 0.25, and the beta is 0.1;

(h) the documents in the corpus are related documents of veterinary drug knowledge obtained through SVM classification in the previous step, and topic mining is carried out on the documents to obtain related topic vocabularies. Then, the subject words are used to search again.

The step (3) of establishing weighting information extraction comprises the following steps:

(a) named entity recognition: performing word segmentation and word stop by using an open-source word segmentation tool, and performing named entity recognition by using a veterinary drug knowledge dictionary;

(b) and (3) extracting the relation: and (3) predefining a relation extraction model among the veterinary drug knowledge entities to extract the relation.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a veterinary drug knowledge dictionary framework of the present invention;

FIG. 2 is a schematic diagram of an LDA algorithm;

FIG. 3 is a flow chart of a weighted LDA document search algorithm.

Detailed Description

To further explain the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description of the embodiments, structures, features and effects thereof according to the present invention will be made with reference to the accompanying drawings and preferred embodiments.

(2) downloading document data: and performing multilayer search on the Web of science by using the dictionary obtained in the last step and combining the name of the veterinary drug, namely traversing each path from the root node to the leaf node, and performing multilayer result search on all vocabularies on each path. Classifying the obtained documents by using a Support Vector Machine (SVM) method, wherein the classification comprises two categories of veterinary drug knowledge correlation and veterinary drug knowledge irrelevance. For veterinary drug knowledge-related documents, a modified LDA method is used for theme extraction;

The veterinary drug knowledge framework construction in the step (1) comprises the following contents:

The multi-layer search in the step (2) comprises the following steps:

The SVM text classification in the step (2) comprises the following steps:

(b) the TF-IDF method calculates and expresses the importance degree of a certain keyword in the text through a statistical method. TF refers to word frequency and indicates the frequency of occurrence of a given entry in the text, and IDF refers to the inverse text frequency and is a measure of the general importance of a word. TF calculation method of entry t _ i in text d _ j:

The calculation method of the IDF comprises the following steps:

tfidf_i，j＝tf_i，j×idf_i

(a) firstly, feature words of a paper part abstract are extracted by using a TF-IDF algorithm, and a document vector is generated. The vector dimensionality is too high when the full text in the abstract is selected, so that the complexity of calculation is increased, and subsequent classification is not facilitated. Selecting a short text of a conclusion part in a thesis abstract according to the characteristics of a veterinary drug knowledge-related document, and extracting feature words for the short text by using a TF-IDF algorithm to generate a document vector;

(b) and randomly selecting partial data and manually marking. Setting the proportion of the training set to the test set to be 8: 2;

(c) adjusting an SVM penalty parameter C, and evaluating the model by combining the accuracy (a), the precision (P), the recall rate (R) and the F1 value;

(d) verifying the model by the test set;

(e) and for literature data, using a trained model to obtain literature data related to veterinary drug residue topics.

Is a polynomial distribution of "document-subject" [ theta ],

(b) the process of LDA:

3. repeating the step 2 until Gibbs Sampling converges;

(c) Gibbs Sampling fitting θ,

The process of (2):

1. scanning an article for each word w_nRandomly assigning a theme Z_j；

2. Initialization Z_jAn integer of 1 to K;

The calculation formula of (a) is as follows:

wherein the content of the first and second substances,

is the number of words for topic j in article d,

is the number of words for all topics in article d,

is the number of times the word w appears under topic j,

is the total number of words for topic j in article d.

(f) Weighted LDA step:

The parts not involved in the present invention are the same as or can be implemented using the prior art.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A veterinary drug residual knowledge graph construction method based on weighted LDA is characterized by comprising the following steps:

(1) constructing a veterinary medicine knowledge framework: extracting knowledge from veterinary pharmacology and veterinary toxicology books by using a method based on hierarchical analysis and rules, obtaining veterinary toxicology related knowledge from a Pubchem website by using a method based on a wrapper, and performing word-stop, word-segmentation and part-of-speech tagging on the corpora by using a jieba word-segmentation tool to finally form a dictionary and form a hierarchical veterinary drug knowledge frame;

(2) downloading document data: performing multilayer search on Web of science by using the dictionary obtained in the last step and combining with the name of the veterinary drug, namely traversing each path from a root node to a leaf node, performing search in multilayer results for all vocabularies on each path, classifying the obtained documents by using a Support Vector Machine (SVM) method, wherein the classification comprises two categories of correlation of veterinary drug knowledge and irrelevance of veterinary drug knowledge, and performing theme extraction on the documents related to the veterinary drug knowledge by using a weighted LDA method;

2. The method of claim 1, wherein the constructing of the veterinary drug knowledge framework in the step (1) comprises the following steps:

(2a) a veterinary drug residue knowledge system structure is established, and the veterinary drug residue knowledge system structure comprises five parts: veterinary drug residues, toxicology, effects on organs and systems, attributes and toxicity;

(2b) veterinary drug residue: the harm includes reasons, influence and harm, and can be divided into three parts of harm to human bodies, food and environment;

(2c) the veterinary drug attributes are as follows: category, physicochemical properties, pharmacokinetics, action, application, maximum residual limit and adverse reactions;

(2d) toxicity of veterinary drugs: the method comprises the following steps of (1) toxic effect classification, common parameters, special risk groups, exposure ways, preventive measures, inhalation modes and animal experiments, wherein the toxic effect classification comprises properties, occurrence time, positions and recovery conditions, the common parameters comprise acute toxicity, mutagenicity, carcinogenicity, teratogenicity, acute toxicity and the like, and the objects of the animal experiments comprise mice, rats, rabbits, dogs and the like;

(2e) theory of toxicity of veterinary drugs: comprises purposes, contents and methods, and the method is divided into two parts of biological experiment and population investigation; effects on organs and systems include eye, skin, liver, kidney, nervous system, blood system, immune system, gastrointestinal tract, endocrine system, and respiratory system;

(2f) each portion, if containing table contents, is placed under the corresponding category.

3. The method for constructing veterinary drug residual knowledge graph based on weighted LDA as claimed in claim 1, wherein the multi-layer search in step (2) comprises the following steps:

(3a) the selected veterinary drugs are: the standard of the maximum residue limit of veterinary drugs in national food standard for food safety stipulates 2191 residue limits and use requirements of 267 veterinary drugs in livestock and poultry products, aquatic products and bee products;

(3b) data capture is carried out by utilizing a dynamic webpage (Ajax) after a Selenium and chrome driver are finished, the searching range is all documents of a database built by the web of science so far, and the searching is carried out without limiting periodicals considering that the data volume of veterinary toxicology research is less;

(3c) and combining the keywords to search in a multi-layer result from the root node to the leaf node according to a veterinary medicine knowledge framework for all nodes on each path.

4. The method of claim 1, wherein the step (2) of establishing a weighted LDA topic model comprises the steps of:

(4a) LDA (late dichhere allocation) is a 3-layer Bayes model, which describes the relationship among documents, subjects and vocabularies, and the graph model is shown in FIG. 2, wherein the meaning of each symbol in the graph is as follows: alpha is a hyperparameter of the Dirichlet distribution theta and beta is the Dirichlet distribution

Is a polynomial distribution of "document-subject" [ theta ],

(4b) the process of LDA:

(1) randomly assigning a theme number Z to each vocabulary in each document in the corpus;

(2) rescanning the corpus, Sampling each word by using a Gibbs Sampling formula, solving the theme of each word, and updating the theme in the corpus;

(3) repeating the step (2) until Gibbs Sampling converges;

(4) counting a topic vocabulary co-occurrence frequency matrix of a corpus, wherein the matrix is a model of LDA;

(4c) gibbs Sampling fitting θ,

The process of (2):

(1) scanning an article for each word w_nRandomly assigning a theme Z_j；

(2) Initialization Z_jAn integer of 1 to K;

(3) rescanning each article, performing topic modeling on the corpus by adopting an LDA (Linear discriminant analysis) model, continuously iterating parameter reasoning by using Gibbs Sampling, and simultaneously recording Z_jThe parameter theta,

The calculation formula of (a) is as follows:

wherein the content of the first and second substances,

is the number of words for topic j in article d,

is the number of words for all topics in article d,

is the number of times the word w appears under topic j,

is the total number of words for topic j in article d;

(4d) the LDA algorithm does not well combine related semantic information in the topic modeling process, which seriously affects the semantic coherence, interpretability and accuracy of text semantic representation of the topic, and aiming at the vocabulary distribution characteristics of veterinary drug knowledge, the similarity is calculated by using a hierarchical semantic similarity calculation formula according to the semantic similarity of each word and a seed word of the veterinary drug knowledge, different weights are given to the vocabulary, and weight information is integrated into the Gibbs sampling process;

p1 and p2 represent two vocabularies, d represents the path distance of p1 and p2 in the veterinary drug knowledge hierarchy, the larger d is, the smaller the similarity is, the value range of the similarity is [0,1], k is an adjustable parameter and is usually set to be 20 by default;

(4e) LDA is biased to the extraction of high-Frequency words in the parameter estimation process, and some low-Frequency characteristic words which imply the theme are submerged, the high-Frequency words are considered, the low-Frequency characteristic words which imply the theme are also considered, the optimization is carried out by using a TF-IDF (Term Frequency-Inverse Document Frequency) method, the TF-IDF calculates and expresses the importance degree of a certain keyword in a text by a statistical method, and a theme-word matrix generated by the theme model iteration is weighted by calculating the TF-IDF value of the word, so that the influence of the high-Frequency noise words is effectively weakened;

(4f) weighted LDA step:

(1) performing word segmentation and word removal processing on the thesis abstract data set;

(2) gibbs sampling is carried out on the corpus, and document-theme distribution and theme-word distribution are generated;

(3) calculating similarity, sequencing according to the similarity, reserving the first K/2 topics as candidate topics, and constructing new document-topic distribution and topic-word distribution by combining the candidate topics;

(4) weighting the topic-word distribution by using TF-IDF to obtain a weighted probability, and selecting 20 characteristic words with the highest weight according to the topic-word distribution condition;

(4g) determination of number of topics in LDA: when the model is trained, the number of subjects needs to be set in advance, parameters are manually adjusted according to the trained result, the hyper-parameter alpha is 0.25, and the hyper-parameter beta is 0.1;

(4h) the documents in the corpus are related documents of veterinary drug knowledge obtained through SVM classification in the previous step, topic mining is carried out on the documents to obtain related topic vocabularies, and then the topic vocabularies are used for searching again.