CN112100405A - Veterinary drug residue knowledge graph construction method based on weighted LDA - Google Patents


Info

Publication number
CN112100405A
Authority
CN
China
Prior art keywords
veterinary drug
knowledge
veterinary
word
lda
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011010727.XA
Other languages: Chinese (zh)
Other versions: CN112100405B (en)
Inventor
郑丽敏 (Zheng Limin)
杨璐 (Yang Lu)
张恬 (Zhang Tian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Agricultural University
Original Assignee
China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Agricultural University
Priority to CN202011010727.XA
Publication of CN112100405A
Application granted
Publication of CN112100405B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367: Ontology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284: Relational databases
    • G06F 16/288: Entity relationship models
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/237: Lexical tools
    • G06F 40/242: Dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Animal Behavior & Ethology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a veterinary drug residue knowledge graph construction method based on weighted LDA (Latent Dirichlet Allocation). First, a knowledge framework of veterinary drugs is constructed, and a web crawler combined with the knowledge framework performs deep search and downloads documents. To address the topic-noise and feature-word-bias problems of the LDA topic model, a weighted LDA method is used for topic mining, and veterinary drug related documents are downloaded again. Named entity recognition and relationship extraction are accomplished with a dictionary-based model. Finally, a veterinary drug knowledge graph is built with the Neo4j graph database. The method can be used to construct the veterinary drug residue knowledge graph, discover the characteristic rules of veterinary drug residues and the reasons they harm the human body, and ensure the quality and safety of meat, eggs and milk, thereby protecting people's health and life safety.

Description

Veterinary drug residue knowledge graph construction method based on weighted LDA
Technical Field
The invention relates to the field of natural language processing, in particular to a veterinary drug residue knowledge graph construction method based on weighted LDA.
Background
Food safety issues are receiving increasing attention, and among them the safety of meat, egg and milk products is especially important. Veterinary drugs play important roles in preventing and treating animal diseases and promoting animal growth, and animal breeding cannot do without them. However, non-standardized, illicit and abusive use of veterinary drugs leads to excessive veterinary drug residues and thereby to toxic events. By constructing a veterinary drug residue knowledge graph, the characteristic rules of veterinary drug residues and the reasons they harm the human body can be discovered, and the quality and safety of meat, egg and milk products can be ensured, thereby protecting people's health and life safety.
Veterinary drug residue data cover the residue standards of veterinary drugs, records of veterinary drug samples exceeding inspection standards, animal toxicology experimental data on veterinary drug residues, symptoms of harm to people, and the like. These data include structured data and unstructured text data. Knowledge is extracted and classified from these data to build a basic veterinary drug residue knowledge framework, and the constructed framework is then used to download relevant veterinary drug knowledge documents.
Documents related to veterinary drug knowledge are obtained by downloading literature in combination with the basic framework of veterinary drug residue knowledge. Topic mining with LDA is then applied to obtain latent information in the veterinary drug literature. LDA (Latent Dirichlet Allocation) is an unsupervised machine learning technique that can identify latent topic information in large-scale document collections or corpora. Standard LDA topic mining treats all words as having the same weight, yet in practice a large number of high-frequency irrelevant words contribute nothing to topic mining; a weighted LDA topic mining method combining hierarchical semantic similarity of veterinary drug knowledge with TF-IDF is therefore adopted for document downloading.
Through data fusion, data integration, entity identification and relation extraction, the knowledge graph is then constructed, the characteristic rules of veterinary drug residues and the reasons they harm the human body are discovered, and the quality and safety of meat, eggs and milk are ensured, thereby protecting people's health and life safety.
Disclosure of Invention
The invention aims to provide a veterinary drug residue knowledge graph construction method based on weighted LDA. To solve the technical problems, the invention mainly comprises the following technical content:
A veterinary drug residue knowledge graph construction method based on weighted LDA comprises the following steps:
(1) constructing a veterinary drug knowledge framework: knowledge is extracted from veterinary pharmacology and veterinary toxicology books using hierarchical analysis and rule-based approaches. Veterinary toxicology-related knowledge is obtained from the PubChem website using a wrapper-based approach. The jieba word segmentation tool is used to perform word segmentation, stop-word removal and part-of-speech tagging on the corpus, finally forming a dictionary and a hierarchical veterinary drug knowledge framework;
(2) downloading document data: a multi-layer search on the Web of Science is performed using the dictionary obtained in the previous step combined with the veterinary drug name, i.e., every path from the root node to a leaf node is traversed, and a search within results is performed for all vocabulary on each path. The obtained documents are classified with a Support Vector Machine (SVM) into two categories: veterinary drug knowledge related and veterinary drug knowledge unrelated. For the veterinary drug knowledge related documents, topic extraction is performed with the weighted LDA method;
(3) information extraction: dictionary-based named entity recognition and relationship extraction;
(4) constructing the knowledge graph: the entities of veterinary drug domain knowledge and the relationships among them are imported into a Neo4j database in csv format.
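The csv import in step (4) can be sketched as follows. The sample triples, file names, labels and relationship names are illustrative assumptions, not taken from the patent; the Python function prepares the node and relationship csv contents, and the Cypher string shows how a Neo4j server could consume them with LOAD CSV:

```python
import csv
import io

# Hypothetical (entity, relation, entity) triples from the extraction step.
triples = [
    ("Chloramphenicol", "HAS_HAZARD", "Aplastic anemia"),
    ("Chloramphenicol", "RESIDUE_IN", "Poultry products"),
]

def triples_to_csv(triples):
    """Serialize extracted triples into the two csv payloads a Neo4j
    LOAD CSV import would read: one for nodes, one for relationships."""
    nodes = sorted({e for s, _, o in triples for e in (s, o)})
    node_buf, rel_buf = io.StringIO(), io.StringIO()
    nw = csv.writer(node_buf)
    nw.writerow(["name"])
    for n in nodes:
        nw.writerow([n])
    rw = csv.writer(rel_buf)
    rw.writerow(["start", "type", "end"])
    for s, r, o in triples:
        rw.writerow([s, r, o])
    return node_buf.getvalue(), rel_buf.getvalue()

# Illustrative Cypher that would consume relationships.csv on a Neo4j
# server (not executed here; file name and label are assumptions):
CYPHER = """
LOAD CSV WITH HEADERS FROM 'file:///relationships.csv' AS row
MERGE (a:Entity {name: row.start})
MERGE (b:Entity {name: row.end})
MERGE (a)-[:RELATION {type: row.type}]->(b)
"""
```

MERGE (rather than CREATE) keeps the import idempotent, so re-running it after adding new documents does not duplicate entities.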
The veterinary drug knowledge framework constructed in the step (1) comprises the following contents:
(a) the veterinary drug residue knowledge structure is formulated, and the veterinary drug residue knowledge structure comprises five parts: veterinary drug residues, toxicology, effects on organs and systems, attributes and toxicity;
(b) veterinary drug residue: including cause, impact and harm. The harm can be divided into three parts of harm to human bodies, food and environment;
(c) the veterinary drug attributes are as follows: category, physicochemical properties, pharmacokinetics, action, application, maximum residual limit and adverse reactions;
(d) toxicity of veterinary drugs: classification of toxic effects, common parameters, special risk groups, exposure routes, preventive measures, inhalation mode, animal experiments. The classification of toxic effects covers nature, time of occurrence, location and recoverability. Commonly used parameters include acute toxicity, mutagenicity, carcinogenicity, teratogenicity, and the like. Animal experiment subjects include mice, rats, rabbits, dogs, and the like;
(e) theory of toxicity of veterinary drugs: including objects, content and methods. The method can be divided into two parts of biological experiment and population investigation; effects on organs and systems: including the eye, skin, liver, kidney, nervous system, blood system, immune system, gastrointestinal tract, endocrine system, and respiratory system;
(f) each portion, if containing table contents, is placed under the corresponding category.
The multi-layer search in step (2) comprises the following steps:
(a) the selected veterinary drugs: the national food safety standard on maximum residue limits of veterinary drugs in foods stipulates 2191 residue limits and use requirements for 267 veterinary drugs in livestock and poultry products, aquatic products and bee products;
(b) data from dynamic (Ajax) web pages are captured using Selenium with ChromeDriver. The search range is all documents in the Web of Science database from its establishment to date; considering the small volume of veterinary toxicology research, the search is not restricted to particular journals;
(c) according to the veterinary drug knowledge framework, each path from the root node to a leaf node is followed, and for all nodes on each path the keywords are combined to search within the previous layer's results.
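A minimal sketch of the multi-layer search strategy (the framework fragment and the AND query format are assumptions for illustration): the knowledge framework is treated as a tree, and each root-to-leaf path is combined with a drug name into progressively deeper queries:

```python
# Hypothetical fragment of the hierarchical veterinary drug knowledge framework.
framework = {
    "toxicity": {
        "common parameters": {"acute toxicity": {}, "carcinogenicity": {}},
        "animal experiments": {"rat": {}, "mouse": {}},
    },
}

def root_to_leaf_paths(tree, prefix=()):
    """Enumerate every path from the root to a leaf of the framework tree."""
    if not tree:
        yield list(prefix)
        return
    for node, sub in tree.items():
        yield from root_to_leaf_paths(sub, prefix + (node,))

def layered_queries(drug, tree):
    """One query per layer of each path: the drug name combined with all
    keywords down to that layer, mirroring the search-within-results idea."""
    queries = []
    for path in root_to_leaf_paths(tree):
        for depth in range(1, len(path) + 1):
            queries.append(" AND ".join([drug] + path[:depth]))
    return queries
```

Duplicate shallow queries across paths would be deduplicated (or cached) in a real crawler; they are kept here to show the layering.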
The SVM text classification in the step (2) comprises the following steps:
(a) the purpose is to divide the obtained document set into two categories: veterinary drug knowledge related and veterinary drug knowledge unrelated;
(b) the TF-IDF method measures the importance of a keyword in a text statistically. TF (term frequency) denotes the frequency of a given term in the text; IDF (inverse document frequency) measures the general importance of a word. The TF of term t_i in text d_j is computed as:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} is the number of occurrences of term t_i in text d_j, and the denominator is the sum of the occurrence counts of all words in text d_j.
The IDF is computed as:

idf_i = log( |D| / ( |{ j : t_i ∈ d_j }| + 1 ) )

where |D| is the total number of texts and |{ j : t_i ∈ d_j }| is the number of texts containing term t_i; 1 is added to the denominator to prevent it from being 0 when the term appears in no text. Finally the TF-IDF value of term t_i is obtained:
tfidf_{i,j} = tf_{i,j} × idf_i
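The three formulas above translate directly into code; a minimal sketch over tokenized texts, including the +1 in the IDF denominator:

```python
import math

def tf(term, doc):
    """tf_{i,j}: occurrences of term in doc over total word count of doc."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """idf_i = log(|D| / (|{j : t_i in d_j}| + 1)), the +1 guarding against
    a zero denominator when the term appears in no text."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (containing + 1))

def tfidf(term, doc, corpus):
    """tfidf_{i,j} = tf_{i,j} * idf_i."""
    return tf(term, doc) * idf(term, corpus)
```

Note a side effect of the +1 smoothing: a term occurring in all but one text gets an IDF of exactly 0 and so never contributes to a document vector.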
(c) feature words are first extracted from part of the paper abstract with the TF-IDF algorithm to generate a document vector. Using the full abstract makes the vector dimensionality too high, which increases computational complexity and hinders subsequent classification. According to the characteristics of veterinary drug knowledge related documents, the short text of the conclusion part of the abstract is selected, and feature words are extracted from it with TF-IDF to generate the document vector;
(d) part of the data is randomly selected and manually labeled, with the ratio of training set to test set set to 8:2;
(e) the SVM penalty parameter C is tuned, and the model is evaluated with accuracy (A), precision (P), recall (R) and the F1 value;
(f) the model is verified on the test set;
(g) the trained model is applied to the document data to obtain the documents related to veterinary drug residue topics.
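As an illustration of the penalty parameter C in step (e), the sketch below trains a minimal linear SVM with the Pegasos sub-gradient method on toy document vectors. This is an assumption-laden stand-in: the patent would use a full SVM implementation, and the mapping lam = 1/(n·C) is the usual Pegasos convention, not taken from the source:

```python
def train_linear_svm(X, y, C=1.0, epochs=2000):
    """Minimal linear SVM (no bias term) trained with the Pegasos
    sub-gradient method; y values are +1 / -1. The SVM penalty
    parameter C maps to the Pegasos regularizer lam = 1 / (n * C)."""
    n, dim = len(X), len(X[0])
    lam = 1.0 / (n * C)
    w, t = [0.0] * dim, 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            score = sum(wj * xj for wj, xj in zip(w, xi))
            if yi * score < 1:     # hinge-loss violation: step toward xi
                w = [(1 - eta * lam) * wj + eta * yi * xj
                     for wj, xj in zip(w, xi)]
            else:                  # no violation: only shrink (regularize)
                w = [(1 - eta * lam) * wj for wj in w]
    return w

def predict(w, x):
    """Classify by the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

In practice the document vectors would be the TF-IDF vectors from step (c), and C would be tuned against accuracy, precision, recall and F1 on the held-out 20%.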
The establishment of the weighted LDA topic model in the step (2) comprises the following steps:
(a) LDA (Latent Dirichlet Allocation) is a three-layer Bayesian model describing the relationships among documents, topics and words; its graphical model is shown in FIG. 2. The symbols denote: α is the hyperparameter of the Dirichlet prior on θ, β is the hyperparameter of the Dirichlet prior on φ, θ is the "document-topic" multinomial distribution, φ is the "topic-word" multinomial distribution, z is the topic assignment of a word, w is a word, K is the number of topics, M is the number of documents, and N is the number of words in a document;
(b) the LDA procedure:
1. randomly assign a topic number z to each word in each document of the corpus;
2. rescan the corpus, sample a new topic for each word with the Gibbs sampling formula, and update its topic in the corpus;
3. repeat step 2 until Gibbs sampling converges;
4. count the topic-word co-occurrence frequency matrix of the corpus; this matrix is the LDA model.
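The four steps above can be sketched as a tiny collapsed Gibbs sampler. The standard collapsed formulation is assumed here; the toy corpus and hyperparameters are illustrative only:

```python
import random

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Tiny collapsed Gibbs sampler for LDA: random initialization,
    repeated rescans resampling each word's topic, then the count
    matrices are returned as the model."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}
    n_dk = [[0] * K for _ in docs]        # doc-topic counts
    n_kw = [[0] * V for _ in range(K)]    # topic-word counts
    n_k = [0] * K                         # topic totals
    z = []                                # step 1: random topic per word
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(K)
            zs.append(k)
            n_dk[d][k] += 1; n_kw[k][wid[w]] += 1; n_k[k] += 1
        z.append(zs)
    for _ in range(iters):                # steps 2-3: rescan and resample
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]               # remove the current assignment
                n_dk[d][k] -= 1; n_kw[k][wid[w]] -= 1; n_k[k] -= 1
                # full conditional p(z = j | rest), up to normalization
                ps = [(n_dk[d][j] + alpha) * (n_kw[j][wid[w]] + beta)
                      / (n_k[j] + V * beta) for j in range(K)]
                r = rng.random() * sum(ps)
                for j, p in enumerate(ps):
                    r -= p
                    if r <= 0:
                        k = j
                        break
                z[d][i] = k
                n_dk[d][k] += 1; n_kw[k][wid[w]] += 1; n_k[k] += 1
    return n_dk, n_kw                     # step 4: co-occurrence counts
```

A fixed iteration budget stands in for the convergence check of step 3; production samplers monitor perplexity or log-likelihood instead.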
(c) the process of fitting θ and φ with Gibbs sampling:
1. scan an article and randomly assign a topic z_j to each word w_n;
2. initialize z_j to an integer from 1 to K;
3. rescan each article, perform topic modeling on the corpus with the LDA model, iterate the parameter inference with Gibbs sampling, and record the values of z_j. The parameters θ and φ are computed as:

θ_{d,j} = (n_{d,j} + α) / (Σ_{j'} n_{d,j'} + K·α)

φ_{j,w} = (n_{j,w} + β) / (Σ_{w'} n_{j,w'} + V·β)

where n_{d,j} is the number of words assigned to topic j in article d, Σ_{j'} n_{d,j'} is the number of words over all topics in article d, n_{j,w} is the number of times word w appears under topic j, Σ_{w'} n_{j,w'} is the total number of words assigned to topic j, and V is the vocabulary size.
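Given the count matrices produced by Gibbs sampling, the two estimation formulas translate directly into code (V is taken from the width of the topic-word matrix):

```python
def estimate_theta_phi(n_dk, n_kw, alpha, beta):
    """Point estimates of theta (document-topic) and phi (topic-word)
    from Gibbs count matrices, matching the two smoothed formulas above."""
    K = len(n_kw)
    V = len(n_kw[0])
    theta = [[(n_dk[d][j] + alpha) / (sum(n_dk[d]) + K * alpha)
              for j in range(K)] for d in range(len(n_dk))]
    phi = [[(n_kw[j][w] + beta) / (sum(n_kw[j]) + V * beta)
            for w in range(V)] for j in range(K)]
    return theta, phi
```

The Dirichlet smoothing terms Kα and Vβ make each row a proper probability distribution even for topics or documents with zero counts.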
(d) the LDA algorithm does not incorporate relevant semantic information during topic modeling, which seriously harms the semantic coherence and interpretability of the topics and the accuracy of the text semantic representation. Given the vocabulary distribution characteristics of veterinary drug knowledge, the semantic similarity between each word and the veterinary drug knowledge seed words is computed with a hierarchical semantic similarity formula, the words are given different weights accordingly, and the weight information is integrated into the Gibbs sampling process. In the similarity formula, p1 and p2 denote two words and d denotes the path distance between p1 and p2 in the veterinary drug knowledge hierarchy; the larger d is, the smaller the similarity. The similarity ranges over [0, 1], and k is an adjustable parameter, set to 20 by default.
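A sketch of the hierarchical similarity, assuming the simple functional form k/(d + k). This form is an assumption, chosen only because it is consistent with the stated properties (it decreases as the path distance d grows, stays within (0, 1], and has an adjustable parameter k defaulting to 20):

```python
def hierarchy_similarity(d, k=20):
    """Hierarchical semantic similarity as a function of the path
    distance d between two words in the knowledge hierarchy.
    ASSUMED form k / (d + k): equals 1 at d = 0, decreases toward 0
    as d grows, and k (default 20) controls how fast it decays."""
    return k / (d + k)
```

Words close to the veterinary drug seed words in the hierarchy thus get weights near 1, and distant words are down-weighted during Gibbs sampling.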
(e) LDA is biased toward high-frequency words during parameter estimation, so some low-frequency feature words of the implied topics are submerged. To account for both the high-frequency words and the low-frequency feature words of implicit topics, the TF-IDF method is used for optimization; TF-IDF measures the importance of a keyword in a text statistically. The topic-word matrix generated by the topic model iterations is weighted by the words' TF-IDF values, effectively weakening the influence of high-frequency noise words.
(f) weighted LDA steps:
1. perform word segmentation and stop-word removal on the paper abstract data set;
2. perform Gibbs sampling on the corpus to generate the document-topic and topic-word distributions;
3. compute the similarities, sort by similarity, retain the first K/2 topics as candidate topics, and construct new document-topic and topic-word distributions from the candidate topics;
4. weight the topic-word distribution with TF-IDF to obtain weighted probabilities, and select the 20 feature words with the highest weights according to the topic-word distribution.
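Step 4's TF-IDF weighting of the topic-word distribution can be sketched as follows. The weights vector stands in for the per-word TF-IDF values described above; top_n is 20 in the patent but smaller values work the same way:

```python
def reweight_topic_words(phi, weights, top_n=20):
    """Multiply each topic-word probability by the word's TF-IDF-style
    weight, renormalize per topic, and keep the indices of the top_n
    highest-weighted words for each topic."""
    topics = []
    for row in phi:
        weighted = [p * w for p, w in zip(row, weights)]
        s = sum(weighted) or 1.0
        probs = [x / s for x in weighted]
        # rank word indices by the reweighted probability, descending
        order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        topics.append(order[:top_n])
    return topics
```

A high-frequency noise word with a near-zero IDF is pushed down the ranking even if its raw topic probability was large.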
(g) determining the number of topics in LDA: the number of topics must be set before the model is trained, and the parameters are tuned manually according to the training results. The number of topics is set to 40, the hyperparameter α to 0.25, and β to 0.1;
(h) the documents in the corpus are the veterinary drug knowledge related documents obtained by the SVM classification in the previous step; topic mining on them yields the related topic vocabulary, and the topic words are then used to search again.
The information extraction in step (3) comprises the following steps:
(a) named entity recognition: word segmentation and stop-word removal are performed with an open-source segmentation tool, and named entity recognition is performed with the veterinary drug knowledge dictionary;
(b) relation extraction: relations among veterinary drug knowledge entities are extracted with a predefined relation extraction model.
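A minimal sketch of the dictionary-based named entity recognition in (a), using longest-match lookup over an already-segmented token sequence. The dictionary entries and entity type labels are illustrative assumptions:

```python
def dictionary_ner(tokens, dictionary):
    """Dictionary-based NER by longest match over segmented tokens.
    Returns (entity text, entity type, start token index) triples."""
    entities, i = [], 0
    while i < len(tokens):
        match = None
        for j in range(len(tokens), i, -1):  # prefer the longest span
            cand = " ".join(tokens[i:j])
            if cand in dictionary:
                match = (cand, dictionary[cand], i)
                break
        if match:
            entities.append(match)
            i += len(match[0].split())       # skip past the matched span
        else:
            i += 1
    return entities
```

For Chinese text the same idea applies after jieba segmentation, with the spans joined without spaces.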
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a veterinary drug knowledge dictionary framework of the present invention;
FIG. 2 is a schematic diagram of an LDA algorithm;
FIG. 3 is a flow chart of a weighted LDA document search algorithm.
Detailed Description
To further explain the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description of the embodiments, structures, features and effects thereof according to the present invention will be made with reference to the accompanying drawings and preferred embodiments.
A veterinary drug residual knowledge graph construction method based on weighted LDA comprises the following steps:
(1) constructing a veterinary medicine knowledge framework: knowledge is extracted from veterinary pharmacology, veterinary toxicology books using a hierarchical analysis and rule based approach. Veterinary toxicology-related knowledge was obtained from the Pubchem website using a wrapper-based approach. Utilizing a jieba word segmentation tool to perform word-stop, word segmentation and part-of-speech tagging on the linguistic data to finally form a dictionary and form a hierarchical veterinary drug knowledge frame;
(2) downloading document data: and performing multilayer search on the Web of science by using the dictionary obtained in the last step and combining the name of the veterinary drug, namely traversing each path from the root node to the leaf node, and performing multilayer result search on all vocabularies on each path. Classifying the obtained documents by using a Support Vector Machine (SVM) method, wherein the classification comprises two categories of veterinary drug knowledge correlation and veterinary drug knowledge irrelevance. For veterinary drug knowledge-related documents, a modified LDA method is used for theme extraction;
(3) information extraction: dictionary-based named entity recognition and relationship extraction;
(4) constructing a knowledge graph: and (3) importing the entities of the veterinary drug field knowledge and the relationships among the entities into a Neo4j database in a csv format.
The veterinary drug knowledge framework construction in the step (1) comprises the following contents:
(a) the veterinary drug residue knowledge structure is formulated, and the veterinary drug residue knowledge structure comprises five parts: veterinary drug residues, toxicology, effects on organs and systems, attributes and toxicity;
(b) veterinary drug residue: including cause, impact and harm. The harm can be divided into three parts of harm to human bodies, food and environment;
(c) the veterinary drug attributes are as follows: category, physicochemical properties, pharmacokinetics, action, application, maximum residual limit and adverse reactions;
(d) toxicity of veterinary drugs: classification of toxic effects, common parameters, special risk groups, exposure routes, preventive measures, inhalation mode, animal experiments. The categories of toxic effects include nature, time of occurrence, location and recovery. Commonly used parameters include acute toxicity, mutagenicity, carcinogenicity, teratogenicity, acute toxicity, and the like. The animal experiment object comprises mice, rats, rabbits, dogs and the like;
(e) theory of toxicity of veterinary drugs: including objects, content and methods. The method can be divided into two parts of biological experiment and population investigation; effects on organs and systems: including the eye, skin, liver, kidney, nervous system, blood system, immune system, gastrointestinal tract, endocrine system, and respiratory system;
(f) each portion, if containing table contents, is placed under the corresponding category.
The multi-layer search in the step (2) comprises the following steps:
(a) the selected veterinary drugs are: the standard of the maximum residue limit of veterinary drugs in national food standard for food safety stipulates 2191 residue limits and use requirements of 267 veterinary drugs in livestock and poultry products, aquatic products and bee products;
(b) and (4) data capture of a dynamic webpage (Ajax) after the Selenium and chrome driver are used. The searching range is all documents of a database established by web office to date, and the searching is performed without limitation on periodicals in consideration of small data volume of veterinary toxicology research;
(c) according to the veterinary medicine knowledge framework, from the root node to the leaf node. And combining the keywords to search in a multi-layer result for all nodes on each path.
The SVM text classification in the step (2) comprises the following steps:
(a) the purpose is to divide the obtained literature sets into two categories, wherein the veterinary drug knowledge is related and the veterinary drug knowledge is unrelated;
(b) the TF-IDF method calculates and expresses the importance degree of a certain keyword in the text through a statistical method. TF refers to word frequency and indicates the frequency of occurrence of a given entry in the text, and IDF refers to the inverse text frequency and is a measure of the general importance of a word. TF calculation method of entry t _ i in text d _ j:
Figure BDA0002697476900000081
wherein n isi,jAs an entry tiIn the text djThe denominator represents the number of occurrences of the text djThe sum of the number of occurrences of all words in (a).
The calculation method of the IDF comprises the following steps:
Figure BDA0002697476900000082
where | D | is the total number of texts, | j: t is ti∈djI is a word containing tiTo prevent the entry not being in the text, which results in a denominator of 0, the denominator is increased by 1. Finally, the entry t is obtainediTF-IDF value of (1):
tfidfi,j=tfi,j×idfi
(a) firstly, feature words of a paper part abstract are extracted by using a TF-IDF algorithm, and a document vector is generated. The vector dimensionality is too high when the full text in the abstract is selected, so that the complexity of calculation is increased, and subsequent classification is not facilitated. Selecting a short text of a conclusion part in a thesis abstract according to the characteristics of a veterinary drug knowledge-related document, and extracting feature words for the short text by using a TF-IDF algorithm to generate a document vector;
(b) and randomly selecting partial data and manually marking. Setting the proportion of the training set to the test set to be 8: 2;
(c) adjusting an SVM penalty parameter C, and evaluating the model by combining the accuracy (a), the precision (P), the recall rate (R) and the F1 value;
(d) verifying the model by the test set;
(e) and for literature data, using a trained model to obtain literature data related to veterinary drug residue topics.
The establishment of the weighted LDA topic model in the step (2) comprises the following steps:
(a) LDA (latentdirichletalogenation) is a 3-layer Bayesian model, which describes the relationship among documents, topics, and vocabularies. The graphical model is shown in fig. 2. Meanings of respective symbols in the drawings: alpha is a hyperparameter of the Dirichlet distribution theta and beta is the Dirichlet distribution
Figure BDA0002697476900000091
Is a polynomial distribution of "document-subject" [ theta ],
Figure BDA0002697476900000092
Is the polynomial distribution of "topic-vocabulary", z is the topic assignment of words, w is a word, K is the number of topics, M is the number of documents, N is the number of words of a document;
(b) the process of LDA:
1. randomly assigning a theme number Z to each vocabulary in each document in the corpus;
2. rescanning the corpus, Sampling each word by using a Gibbs Sampling formula, solving the theme of each word, and updating the theme in the corpus;
3. repeating the step 2 until Gibbs Sampling converges;
4. and (4) counting a topic word co-occurrence frequency matrix of the corpus, wherein the matrix is a model of the LDA.
(c) Gibbs Sampling fitting θ,
Figure BDA0002697476900000109
The process of (2):
1. scanning an article for each word wnRandomly assigning a theme Zj
2. Initialization ZjAn integer of 1 to K;
3. rescanning each article, performing topic modeling on the corpus by adopting an LDA (Linear discriminant analysis) model, continuously iterating parameter reasoning by using Gibbs Sampling, and simultaneously recording ZjThe value of (c). The parameter theta,
Figure BDA0002697476900000101
The calculation formula of (a) is as follows:
Figure BDA0002697476900000102
Figure BDA0002697476900000103
wherein the content of the first and second substances,
Figure BDA0002697476900000104
is the number of words for topic j in article d,
Figure BDA0002697476900000105
is the number of words for all topics in article d,
Figure BDA0002697476900000106
is the number of times the word w appears under topic j,
Figure BDA0002697476900000107
is the total number of words for topic j in article d.
(d) The LDA algorithm does not well combine related semantic information in the topic modeling process, which seriously affects the semantic consistency, interpretability and accuracy of text semantic representation of the topic. Aiming at the vocabulary distribution characteristics of the veterinary drug knowledge, according to the semantic similarity between each word and the seed word of the veterinary drug knowledge, the similarity is calculated by using a hierarchical semantic similarity calculation formula, different weights are given to the vocabulary, and weight information is integrated into a Gibbs sampling process;
Figure BDA0002697476900000108
p1 and p2 represent two vocabularies, d represents the path distance of p1 and p2 in the veterinary drug knowledge hierarchy, and the larger d is, the smaller the similarity is. The value range of the similarity is [0,1 ]. k is an adjustable parameter, typically set to 20 by default.
(e) LDA is biased to the extraction of high frequency words in the parameter estimation process, and some low frequency characteristic words of the implied subject are submerged. Considering both high-frequency words and low-frequency characteristic words of implicit topics, and considering the TF-IDF method for optimization, the TF-IDF calculates and expresses the importance degree of a certain keyword in a text through a statistical method. And weighting a theme-word matrix generated by the theme model iteration by calculating the TF-IDF value of the word, thereby effectively weakening the influence of the high-frequency noise word.
(f) Weighted LDA step:
1. performing word segmentation and stop-word removal on the paper abstract dataset;
2. performing Gibbs sampling on the corpus to generate the document-topic and topic-word distributions;
3. calculating similarities, sorting by similarity, retaining the top K/2 topics as candidate topics, and constructing new document-topic and topic-word distributions from the candidate topics;
4. weighting the topic-word distribution with TF-IDF to obtain weighted probabilities, and selecting the 20 feature words with the highest weights from the topic-word distribution.
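The sampling core of the steps above can be sketched as a small collapsed Gibbs sampler in which each word occurrence contributes its similarity weight, rather than 1, to the count statistics; this is one plausible reading of integrating the weight information into Gibbs sampling (candidate-topic filtering and the TF-IDF re-weighting would follow), and the corpus and weight values are invented:

```python
import random

random.seed(0)

# toy corpus after segmentation / stop-word removal (illustrative)
docs = [["residue", "limit", "milk"],
        ["toxicity", "acute", "dose"],
        ["residue", "toxicity", "limit"]]
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}
K, V, alpha, beta = 2, len(vocab), 0.25, 0.1
# per-word weights; in the patent these come from hierarchical similarity
# to veterinary drug seed words (the values here are assumptions)
weight = {w: 1.0 for w in vocab}
weight["residue"] = weight["toxicity"] = 2.0

# weighted counts: each occurrence contributes its weight, not 1
n_dj = [[0.0] * K for _ in docs]
n_jw = [[0.0] * V for _ in range(K)]
n_j = [0.0] * K
z = [[random.randrange(K) for _ in d] for d in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t, wt = z[d][i], weight[w]
        n_dj[d][t] += wt; n_jw[t][w2i[w]] += wt; n_j[t] += wt

for _ in range(200):                          # collapsed Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t, wi, wt = z[d][i], w2i[w], weight[w]
            n_dj[d][t] -= wt; n_jw[t][wi] -= wt; n_j[t] -= wt
            # conditional topic probabilities from the weighted counts
            p = [(n_dj[d][j] + alpha) * (n_jw[j][wi] + beta) / (n_j[j] + V * beta)
                 for j in range(K)]
            r, acc, t = random.uniform(0, sum(p)), 0.0, K - 1
            for j in range(K):
                acc += p[j]
                if r <= acc:
                    t = j
                    break
            z[d][i] = t
            n_dj[d][t] += wt; n_jw[t][wi] += wt; n_j[t] += wt

# document-topic distribution estimated from the weighted counts
theta = [[(n_dj[d][j] + alpha) / (sum(n_dj[d]) + K * alpha) for j in range(K)]
         for d in range(len(docs))]
print(theta)
```

Because up-weighted seed-like words pull more counting mass, topics tend to form around them, which is the intended effect of the weighting.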
(g) Determination of the number of topics in LDA: the number of topics must be set in advance when training the model, and the parameters are tuned manually according to the training results. The number of topics is 40, the hyperparameter alpha is 0.25, and beta is 0.1;
(h) the documents in the corpus are the veterinary-drug-knowledge-related documents obtained by SVM classification in the previous step; topic mining is performed on these documents to obtain related topic vocabularies, and the topic words are then used to search again.
The information extraction in step (3) comprises the following steps:
(a) named entity recognition: word segmentation and stop-word removal are performed with an open-source segmentation tool, and named entity recognition is performed with the veterinary drug knowledge dictionary;
(b) relation extraction: relations between veterinary drug knowledge entities are extracted with a predefined relation extraction model.
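The dictionary-based recognition of step (a) can be sketched without the external segmenter as a greedy longest-match over the veterinary drug knowledge dictionary; the dictionary entries and sentence below are illustrative (the patent itself uses an open-source segmentation tool such as jieba):

```python
def dictionary_ner(text: str, dictionary: dict) -> list:
    """Greedy longest-match named entity recognition over a term dictionary.

    dictionary maps surface form -> entity type; returns (term, type, start).
    """
    entities, i = [], 0
    max_len = max(map(len, dictionary), default=0)
    while i < len(text):
        match = None
        # try the longest candidate first so "acute toxicity" beats "acute"
        for L in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + L]
            if cand in dictionary:
                match = (cand, dictionary[cand], i)
                break
        if match:
            entities.append(match)
            i += len(match[0])
        else:
            i += 1
    return entities

# illustrative dictionary entries, not the patent's actual lexicon
vet_dict = {"tetracycline": "VeterinaryDrug", "milk": "AnimalProduct",
            "acute toxicity": "ToxicityParameter"}
print(dictionary_ner("tetracycline residue in milk shows acute toxicity", vet_dict))
```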
The parts not involved in the present invention are the same as or can be implemented using the prior art.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (4)

1. A veterinary drug residue knowledge graph construction method based on weighted LDA, characterized by comprising the following steps:
(1) constructing a veterinary drug knowledge framework: knowledge is extracted from veterinary pharmacology and veterinary toxicology books using a method based on hierarchical analysis and rules, veterinary-toxicology-related knowledge is obtained from the PubChem website using a wrapper-based method, and stop-word removal, word segmentation and part-of-speech tagging are performed on the corpora with the jieba segmentation tool, finally forming a dictionary and a hierarchical veterinary drug knowledge framework;
(2) downloading document data: a multilayer search of Web of Science is performed with the dictionary obtained in the previous step combined with the veterinary drug name, i.e., every path from the root node to a leaf node is traversed and, for all vocabulary on each path, the search is performed within the multilayer results; the obtained documents are classified with a Support Vector Machine (SVM) into two categories, related to veterinary drug knowledge and unrelated to veterinary drug knowledge, and topic extraction is performed on the documents related to veterinary drug knowledge with the weighted LDA method;
(3) information extraction: dictionary-based named entity recognition and relationship extraction;
(4) constructing a knowledge graph: the entities of the veterinary drug domain knowledge and the relationships among them are imported into a Neo4j database in csv format.
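For step (4), a sketch of preparing the csv payloads; the column headers follow Neo4j's import conventions and the triples are invented examples, not extracted data:

```python
import csv
import io

# illustrative entity/relation triples produced by step (3)
entities = [("tetracycline", "VeterinaryDrug"), ("milk", "AnimalProduct")]
relations = [("tetracycline", "HAS_RESIDUE_IN", "milk")]

def to_csv(rows, header):
    """Serialize rows with a header line, as the Neo4j loader expects."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

entity_csv = to_csv(entities, ["name", "label"])
relation_csv = to_csv(relations, [":START_ID", ":TYPE", ":END_ID"])
print(entity_csv)
print(relation_csv)
# The files would then be loaded with Neo4j's LOAD CSV or neo4j-admin import.
```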
2. The method of claim 1, wherein the constructing of the veterinary drug knowledge framework in the step (1) comprises the following steps:
(2a) a veterinary drug residue knowledge system structure is established, and the veterinary drug residue knowledge system structure comprises five parts: veterinary drug residues, toxicology, effects on organs and systems, attributes and toxicity;
(2b) veterinary drug residues: comprising causes, influences and harms, the harms being divided into three parts: harm to the human body, to food and to the environment;
(2c) the veterinary drug attributes are as follows: category, physicochemical properties, pharmacokinetics, action, application, maximum residual limit and adverse reactions;
(2d) toxicity of veterinary drugs: comprising toxic effect classification, common parameters, special risk groups, exposure routes, preventive measures, inhalation modes and animal experiments, wherein the toxic effect classification comprises nature, time of occurrence, location and recovery, the common parameters comprise acute toxicity, mutagenicity, carcinogenicity, teratogenicity and the like, and the subjects of the animal experiments comprise mice, rats, rabbits, dogs and the like;
(2e) veterinary drug toxicology: comprising purposes, contents and methods, the methods being divided into two parts, biological experiments and population surveys; the effects on organs and systems include the eye, skin, liver, kidney, nervous system, blood system, immune system, gastrointestinal tract, endocrine system and respiratory system;
(2f) for each part, any table contents it contains are placed under the corresponding category.
3. The method for constructing a veterinary drug residue knowledge graph based on weighted LDA as claimed in claim 1, wherein the multilayer search in step (2) comprises the following steps:
(3a) the selected veterinary drugs are those covered by the national food safety standard on maximum residue limits for veterinary drugs in foods, which stipulates 2191 residue limits and usage requirements for 267 veterinary drugs in livestock and poultry products, aquatic products and bee products;
(3b) data capture of dynamic web pages (Ajax) is performed using Selenium with ChromeDriver; the search scope is all documents in the Web of Science database to date, and, considering that the volume of veterinary toxicology research data is small, the search is not restricted to particular journals;
(3c) following the veterinary drug knowledge framework, for all nodes on each path from the root node to a leaf node, the keywords are combined and the search is performed within the multilayer results.
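The root-to-leaf traversal of (3c) can be sketched over a nested-dictionary stand-in for the knowledge framework; the fragment and drug name below are illustrative:

```python
def root_to_leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf keyword path of a nested-dict hierarchy."""
    for node, child in tree.items():
        path = prefix + (node,)
        if child:                       # internal node: recurse deeper
            yield from root_to_leaf_paths(child, path)
        else:                           # leaf: emit the complete path
            yield path

# illustrative fragment of the veterinary drug knowledge framework
framework = {
    "toxicity": {
        "common parameters": {"acute toxicity": {}, "carcinogenicity": {}},
        "exposure routes": {},
    },
}

drug = "tetracycline"
for path in root_to_leaf_paths(framework):
    # each layer of the search combines the drug name with the path vocabulary
    print(f"{drug} AND " + " AND ".join(path))
```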
4. The method of claim 1, wherein the step (2) of establishing a weighted LDA topic model comprises the steps of:
(4a) LDA (Latent Dirichlet Allocation) is a three-layer Bayesian model describing the relationships among documents, topics and vocabulary; its graphical model is shown in FIG. 2, where the symbols have the following meanings: α is the hyperparameter of the Dirichlet prior on θ, β is the hyperparameter of the Dirichlet prior on φ, θ is the multinomial "document-topic" distribution, φ is the multinomial "topic-vocabulary" distribution, z is the topic assignment of a word, w is a word, K is the number of topics, M is the number of documents, and N is the number of words in a document;
(4b) the process of LDA:
(1) randomly assigning a topic number z to each word in each document of the corpus;
(2) rescanning the corpus, sampling each word with the Gibbs Sampling formula to determine its topic, and updating it in the corpus;
(3) repeating step (2) until Gibbs Sampling converges;
(4) counting the topic-word co-occurrence frequency matrix of the corpus; this matrix is the LDA model;
(4c) the process of fitting θ and φ with Gibbs Sampling:
(1) scanning an article and randomly assigning a topic z_j to each word w_n;
(2) initializing z_j to an integer from 1 to K;
(3) rescanning each article, performing topic modeling on the corpus with the LDA (Latent Dirichlet Allocation) model, continuously iterating the parameter inference with Gibbs Sampling while recording z_j; the parameters θ and φ are calculated as follows:
θ_{d,j} = (n_d^j + α) / (Σ_k n_d^k + K·α)

φ_{j,w} = (n_j^w + β) / (Σ_v n_j^v + V·β)

wherein n_d^j is the number of words for topic j in article d, Σ_k n_d^k is the number of words for all topics in article d, n_j^w is the number of times the word w appears under topic j, and Σ_v n_j^v is the total number of words assigned to topic j;
(4d) the LDA algorithm does not incorporate relevant semantic information well during topic modeling, which seriously affects the semantic coherence, interpretability and accuracy of the text-semantic representation of topics; to address the vocabulary distribution characteristics of veterinary drug knowledge, the semantic similarity between each word and the veterinary drug knowledge seed words is computed with a hierarchical semantic similarity formula, different weights are assigned to the vocabulary, and the weight information is incorporated into the Gibbs sampling process;
sim(p1, p2) = k / (d + k)

wherein p1 and p2 are two words, d is the path distance between p1 and p2 in the veterinary drug knowledge hierarchy, and the larger d is, the smaller the similarity; the similarity takes values in [0, 1], and k is an adjustable parameter, set to 20 by default;
(4e) LDA is biased toward high-frequency words during parameter estimation, so some low-frequency feature words of latent topics are drowned out; to account for both high-frequency words and low-frequency feature words of latent topics, the TF-IDF (Term Frequency-Inverse Document Frequency) method is used for optimization; TF-IDF measures the importance of a keyword within a text by statistical means, and the topic-word matrix generated by the topic model iterations is weighted by the words' TF-IDF values, effectively weakening the influence of high-frequency noise words;
(4f) weighted LDA step:
(1) performing word segmentation and stop-word removal on the paper abstract dataset;
(2) performing Gibbs sampling on the corpus to generate the document-topic and topic-word distributions;
(3) calculating similarities, sorting by similarity, retaining the top K/2 topics as candidate topics, and constructing new document-topic and topic-word distributions from the candidate topics;
(4) weighting the topic-word distribution with TF-IDF to obtain weighted probabilities, and selecting the 20 feature words with the highest weights from the topic-word distribution;
(4g) determination of the number of topics in LDA: the number of topics must be set in advance when training the model, and the parameters are tuned manually according to the training results; the hyperparameter alpha is 0.25 and beta is 0.1;
(4h) the documents in the corpus are the veterinary-drug-knowledge-related documents obtained by SVM classification in the previous step; topic mining is performed on these documents to obtain related topic vocabularies, and the topic words are then used to search again.
CN202011010727.XA 2020-09-23 2020-09-23 Veterinary drug residue knowledge graph construction method based on weighted LDA Active CN112100405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011010727.XA CN112100405B (en) 2020-09-23 2020-09-23 Veterinary drug residue knowledge graph construction method based on weighted LDA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011010727.XA CN112100405B (en) 2020-09-23 2020-09-23 Veterinary drug residue knowledge graph construction method based on weighted LDA

Publications (2)

Publication Number Publication Date
CN112100405A true CN112100405A (en) 2020-12-18
CN112100405B CN112100405B (en) 2024-01-30

Family

ID=73755147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011010727.XA Active CN112100405B (en) 2020-09-23 2020-09-23 Veterinary drug residue knowledge graph construction method based on weighted LDA

Country Status (1)

Country Link
CN (1) CN112100405B (en)


Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823848A (en) * 2014-02-11 2014-05-28 浙江大学 LDA (latent dirichlet allocation) and VSM (vector space model) based similar Chinese herb literature recommendation method
US20160103932A1 (en) * 2014-02-13 2016-04-14 Samsung Electronics Co., Ltd. Dynamically modifying elements of user interface based on knowledge graph
CN105653706A (en) * 2015-12-31 2016-06-08 北京理工大学 Multilayer quotation recommendation method based on literature content mapping knowledge domain
CN105677856A (en) * 2016-01-07 2016-06-15 中国农业大学 Text classification method based on semi-supervised topic model
CN107122444A (en) * 2017-04-24 2017-09-01 北京科技大学 A kind of legal knowledge collection of illustrative plates method for auto constructing
CN109684483A (en) * 2018-12-11 2019-04-26 平安科技(深圳)有限公司 Construction method, device, computer equipment and the storage medium of knowledge mapping
CN110633364A (en) * 2019-09-23 2019-12-31 中国农业大学 Graph database-based food safety knowledge graph construction method and display mode
CN110674274A (en) * 2019-09-23 2020-01-10 中国农业大学 Knowledge graph construction method for food safety regulation question-answering system
CN110941721A (en) * 2019-09-28 2020-03-31 国家计算机网络与信息安全管理中心 Short text topic mining method and system based on variational self-coding topic model
WO2020082560A1 (en) * 2018-10-25 2020-04-30 平安科技(深圳)有限公司 Method, apparatus and device for extracting text keyword, as well as computer readable storage medium
CN111159430A (en) * 2019-12-31 2020-05-15 秒针信息技术有限公司 Live pig breeding prediction method and system based on knowledge graph
CN111209412A (en) * 2020-02-10 2020-05-29 同方知网(北京)技术有限公司 Method for building knowledge graph of periodical literature by cyclic updating iteration
CN111291156A (en) * 2020-01-21 2020-06-16 同方知网(北京)技术有限公司 Question-answer intention identification method based on knowledge graph


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张瀚驰;杨璐;方雄武;郑丽敏;: "基于本体的食品安全新闻爬虫的设计与实现", 农业网络信息, no. 05 *
彭博远;彭冬亮;谷雨;彭俊利;: "基于改进LDA特征抽取的重大事件趋势预测", 杭州电子科技大学学报(自然科学版), no. 02 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113127627A (en) * 2021-04-23 2021-07-16 中国石油大学(华东) Poetry recommendation method based on LDA topic model and poetry knowledge map
CN113127627B (en) * 2021-04-23 2023-01-17 中国石油大学(华东) Poetry recommendation method based on LDA theme model and poetry knowledge map
WO2022246691A1 (en) * 2021-05-26 2022-12-01 深圳晶泰科技有限公司 Construction method and system for small molecule drug crystal form knowledge graph
CN114117082A (en) * 2022-01-28 2022-03-01 北京欧应信息技术有限公司 Method, apparatus, and medium for correcting data to be corrected

Also Published As

Publication number Publication date
CN112100405B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
Liu et al. Probabilistic reasoning via deep learning: Neural association models
CN112100405B (en) Veterinary drug residue knowledge graph construction method based on weighted LDA
Mihaylov et al. SemanticZ at SemEval-2016 Task 3: Ranking relevant answers in community question answering using semantic similarity based on fine-tuned word embeddings
Faris et al. Medical speciality classification system based on binary particle swarms and ensemble of one vs. rest support vector machines
Alanazi A named entity recognition system applied to Arabic text in the medical domain
Sheshikala et al. Natural language processing and machine learning classifier used for detecting the author of the sentence
Barve et al. A novel evolving sentimental bag-of-words approach for feature extraction to detect misinformation
Khodadi et al. Genetic programming-based feature learning for question answering
Ullah et al. A deep neural network-based approach for sentiment analysis of movie reviews
Millington et al. Analysis and classification of word co-occurrence networks from Alzheimer’s patients and controls
Šandor et al. Sarcasm detection in online comments using machine learning
CN110199354B (en) Biological system information retrieval system and method
Qahl An automatic similarity detection engine between sacred texts using text mining and similarity measures
Pant et al. Smokeng: Towards fine-grained classification of tobacco-related social media text
Nagaraj et al. Classification of Tweets Using Natural Language Processing from Twitter API Data
Schwartz et al. Minimally supervised classification to semantic categories using automatically acquired symmetric patterns
Baldha et al. Covid-19 vaccine tweets sentiment analysis and topic modelling for public opinion mining
Kongburan et al. Enhancing predictive power of cluster-boosted regression with text-based indexing
Rachmad et al. Sentiment Analysis of Government Policy Management on the Handling of Covid-19 Using Naive Bayes with Feature Selection
Bhaskoro et al. Extracting important sentences for public health surveillance information from Indonesian medical articles
Jiang et al. Describing and classifying post-mortem content on social media
Zhu et al. Topic judgment helps question similarity prediction in medical faq dialogue systems
Bonnefoy et al. The web as a source of evidence for filtering candidate answers to natural language questions
Sun et al. Comparisons of word representations for convolutional neural network: An exploratory study on tourism Weibo classification
Yelisetti et al. Aspect-based Text Classification for Sentimental Analysis using Attention mechanism with RU-BiLSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Chen Juan

Inventor after: Yang Lu

Inventor after: Zheng Limin

Inventor after: Wang Ran

Inventor after: Zhang Tian

Inventor after: Li Yixuan

Inventor after: Wang Pengjie

Inventor after: Liu Rong

Inventor after: Fang Bing

Inventor after: Liu Siyuan

Inventor after: Qiu Ju

Inventor before: Zheng Limin

Inventor before: Yang Lu

Inventor before: Zhang Tian

GR01 Patent grant