CN113420118A - Data correlation analysis method and device, electronic terminal and storage medium - Google Patents

Data correlation analysis method and device, electronic terminal and storage medium

Info

Publication number
CN113420118A
Authority
CN
China
Prior art keywords
data
target
vector
correlation
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110702825.8A
Other languages
Chinese (zh)
Inventor
罗永贵
刘霄晨
肖劲
尹芳
张晓璐
马晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lianren Healthcare Big Data Technology Co Ltd
Original Assignee
Lianren Healthcare Big Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lianren Healthcare Big Data Technology Co Ltd
Priority to CN202110702825.8A
Publication of CN113420118A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Finance (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Accounting & Taxation (AREA)
  • Computational Linguistics (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a method and a device for analyzing data correlation, an electronic terminal, and a storage medium. The method comprises: acquiring first data and second data related to the first data; encoding the first data and the second data based on a preset data-to-vector model to generate a vector library, wherein the vector library comprises a first vector corresponding to the first data and a second vector corresponding to the second data, and the distance between the first vector and the second vector is smaller than a preset threshold; acquiring first target data and second target data, and determining, according to the vector library, a first target vector corresponding to the first target data and a second target vector corresponding to the second target data; and judging whether the first target data and the second target data are correlated according to the distance between the first target vector and the second target vector. The method can improve the accuracy of judging the correlation between disease data and drug data.

Description

Data correlation analysis method and device, electronic terminal and storage medium
Technical Field
The present invention relates to computer technologies, and in particular, to a method and an apparatus for analyzing data correlation, an electronic terminal, and a storage medium.
Background
When processing medical insurance claims, one major category is claims for pharmaceuticals, that is, claims against the drug costs of the insured. When handling such claims, claim handlers need to confirm whether the drugs claimed are suitable for the diseases covered by the insurance product, and to reject unreasonable medication items. It is therefore important to determine whether a disease and a drug are adapted to each other (i.e., whether the drug can be used to treat the disease).
A common approach in the prior art is to compile statistics on the historical visit information in each hospital database, where each piece of historical visit information may include disease data and the drug data used to treat that disease, and to judge whether a disease and a drug are adapted to each other according to the statistical result.
The drawbacks of the prior art include at least the following: because drugs are numerous and many new drugs come onto the market every year, the historical visit information can hardly cover all drugs for treating a given disease, so the judgment result is inaccurate.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for analyzing data correlation, an electronic terminal, and a storage medium, which can improve the accuracy of determining the correlation between disease data and drug data.
In a first aspect, an embodiment of the present invention provides a method for analyzing data correlation, including:
acquiring first data and second data related to the first data;
encoding the first data and the second data based on a preset data-to-vector model to generate a vector library; the vector library comprises a first vector corresponding to the first data and a second vector corresponding to the second data, and the distance between the first vector and the second vector is smaller than a preset threshold value;
acquiring first target data and second target data, and determining a first target vector corresponding to the first target data and a second target vector corresponding to the second target data according to the vector library;
and judging whether the first target data and the second target data have correlation or not according to the distance between the first target vector and the second target vector.
In a second aspect, an embodiment of the present invention further provides an apparatus for analyzing data correlation, including:
the data acquisition module is used for acquiring first data and second data related to the first data;
the vector library generating module is used for encoding the first data and the second data based on a preset data-to-vector model to generate a vector library; the vector library comprises a first vector corresponding to the first data and a second vector corresponding to the second data, and the distance between the first vector and the second vector is smaller than a preset threshold value;
the target vector determining module is used for acquiring first target data and second target data, and determining a first target vector corresponding to the first target data and a second target vector corresponding to the second target data according to the vector library;
and the correlation judging module is used for judging whether the first target data and the second target data have correlation or not according to the distance between the first target vector and the second target vector.
In a third aspect, an embodiment of the present invention further provides an electronic terminal, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method for analyzing data correlation according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method for analyzing data correlation according to any embodiment of the present invention.
The embodiments of the invention provide a method, a device, an electronic terminal, and a storage medium for analyzing data correlation. The method comprises: acquiring first data and second data related to the first data; encoding the first data and the second data based on a preset data-to-vector model to generate a vector library, wherein the vector library comprises a first vector corresponding to the first data and a second vector corresponding to the second data, and the distance between the first vector and the second vector is smaller than a preset threshold; acquiring first target data and second target data, and determining, according to the vector library, a first target vector corresponding to the first target data and a second target vector corresponding to the second target data; and judging whether the first target data and the second target data are correlated according to the distance between the first target vector and the second target vector.
By learning the correlation between the first data and the second data with the data-to-vector model, the model can encode related first data and second data into a first vector and a second vector that lie a small distance apart, and these vectors form the vector library. A first target vector and a second target vector corresponding to the first target data and the second target data can then be determined based on the vector library, and the correlation between the first target data and the second target data can be inferred from the distance between the two target vectors.
The method can be applied to judging the correlation between disease data and drug data. Even when the drug data is not covered by the historical visit information, a vector corresponding to the drug data can still be derived according to the vector library, and the correlation between the disease data and the drug data can be determined from the distance between their vectors. Compared with the traditional method, this improves the accuracy of judging the correlation between disease data and drug data.
Drawings
Fig. 1 is a schematic flow chart of a method for analyzing data correlation according to a first embodiment of the present invention;
Fig. 2 is a schematic diagram of a knowledge graph in a method for analyzing data correlation according to a second embodiment of the present invention;
Fig. 3 is a schematic flow chart illustrating the determination of the correlation between disease data and drug data in a method for analyzing data correlation according to a second embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an apparatus for analyzing data correlation according to a third embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic terminal according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described through embodiments with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. In the following embodiments, optional features and examples are provided in each embodiment, and various features described in the embodiments may be combined to form a plurality of alternatives, and each numbered embodiment should not be regarded as only one technical solution.
Embodiment One
Fig. 1 is a schematic flow chart of a method for analyzing data correlation according to a first embodiment of the present invention. The method is applicable to determining data correlation, for example the correlation between disease data and drug data. The method may be performed by the data correlation analysis device provided in the embodiments of the present invention; the device may be implemented in software and/or hardware, and may be configured in the electronic terminal provided in the embodiments of the present invention, for example in a server.
Referring to fig. 1, the method for analyzing data correlation provided in this embodiment may include the following steps:
S110, acquiring first data and second data related to the first data.
In this embodiment, the first data may include data that characterizes a first object, for example the name text of the first object; the second data may include data that characterizes a second object, for example the name text of the second object. The first data and the second data may be correlated, and this correlation indicates that the first object is related to the second object. One piece of first data may be related to multiple pieces of second data, and the second data related to different first data may overlap.
For example, assuming that the first object is a disease and the second object is a drug, the first data may be a disease name text and the second data may be a drug name text. A correlation between the disease name text and the drug name text indicates that the disease is related to the drug (also described as the disease and the drug being adapted to each other). A disease name text can be associated with multiple drug name texts, representing the multiple drugs available for treating the disease, and the drug name texts associated with different disease name texts may overlap. Other first data and second data that satisfy the above correlation characteristics may also be used with the analysis method of this embodiment; they are not listed exhaustively here.
The first data and the second data related to the first data can be obtained in the form of data pairs. For example, they may be acquired as one-to-many data pairs of the form "first data - second data, second data". Acquiring the data as data pairs makes it easier to determine which second data is related to each first data.
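The one-to-many data-pair form can be pictured with a minimal sketch, assuming the pairs arrive as Python lists. The names `pairs` and `related` and the sample texts are illustrative and do not come from the original.

```python
from collections import defaultdict

# Hypothetical one-to-many data pairs: [first data, [second data, ...]].
pairs = [
    ["disease 1", ["drug 1", "drug 2"]],
    ["disease 2", ["drug 2", "drug 3"]],
]

# Map each piece of first data to the set of second data related to it.
related = defaultdict(set)
for first, seconds in pairs:
    related[first].update(seconds)

# Second data related to different first data may overlap.
shared = related["disease 1"] & related["disease 2"]
```

Keeping the second data in a set per first data item makes the overlap between two diseases a simple set intersection.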
S120, encoding the first data and the second data based on a preset data-to-vector model to generate a vector library; the vector library comprises a first vector corresponding to the first data and a second vector corresponding to the second data, and the distance between the first vector and the second vector is smaller than a preset threshold value.
In this embodiment, the preset data-to-vector model may be any model that converts data into vectors. The data-to-vector model may include, but is not limited to, pre-trained models (PTMs), such as models based on shallow word embedding (e.g., word2vec and GloVe), models based on pre-trained encoders (e.g., BERT and ELMo), and models based on supervised learning (e.g., CoVe). Other models that can convert data into vectors can also be applied and are not listed exhaustively here.
Because the correlation between the first data and the second data is already known when they are acquired, the preset data-to-vector model can learn and mine the correlation logic between the first data and the second data, and encode correlated first data and second data into a first vector and a second vector that are close to each other.
The distance between the first vector and the second vector may include at least one of the following: Euclidean distance, Manhattan distance, Chebyshev distance, Mahalanobis distance, and cosine distance. The preset threshold may be set in advance based on empirical and experimental values. A distance between the first vector and the second vector that is smaller than the preset threshold indicates that the two vectors are close.
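As a rough illustration of four of the listed distances, the following sketch uses plain Python lists as vectors. The function names are illustrative. Mahalanobis distance is left out because it additionally needs a covariance matrix estimated from data.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def chebyshev(u, v):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Illustrative vectors, not taken from the original.
first_vec = [1.0, 2.0, 3.0]
second_vec = [2.0, 4.0, 3.0]
distances = {
    "euclidean": euclidean(first_vec, second_vec),
    "manhattan": manhattan(first_vec, second_vec),
    "chebyshev": chebyshev(first_vec, second_vec),
    "cosine": cosine_distance(first_vec, second_vec),
}
```

The metrics live on different scales, so a preset threshold chosen for one of them does not carry over to another.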
Once the data-to-vector model has encoded all the first data and second data, all the resulting first vectors and second vectors can be collected to generate the vector library. Optionally, when the first data and the second data are vocabulary texts, the vector library may include a vector for each vocabulary text, and may also include a vector for each word text within each vocabulary text. The vector of a vocabulary text may be determined from the vectors of the word texts that make it up.
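The original does not say how the vector of a vocabulary text is derived from its word-text vectors. One simple possibility, shown here purely as an assumption, is to average them. The tokenization by whitespace, the function names, and the sample library are all hypothetical.

```python
def average_vectors(vectors):
    """Element-wise mean of equal-length vectors (an assumed composition rule)."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def vocabulary_vector(text, word_library):
    """Build a vector for a vocabulary text from the vectors of its word texts.

    Word texts are taken to be whitespace-separated tokens (an assumption);
    tokens missing from the library are skipped.
    """
    parts = [word_library[token] for token in text.split() if token in word_library]
    return average_vectors(parts) if parts else None


# Hypothetical word-level entries of a vector library.
word_library = {"acute": [1.0, 0.0], "bronchitis": [0.0, 1.0]}
vector = vocabulary_vector("acute bronchitis", word_library)
```

Returning `None` when no word text is known lets the caller fall back to encoding the whole text with the model.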
S130, obtaining the first target data and the second target data, and determining a first target vector corresponding to the first target data and a second target vector corresponding to the second target data according to the vector library.
In this embodiment, the first target data and the second target data can be understood as a data pair whose correlation needs to be judged. The first target data and the second target data may be regarded as having the same kinds of characterized objects, formats, and so on as the first data and the second data used to generate the vector library.
Determining, according to the vector library, the first target vector and the second target vector corresponding to the first target data and the second target data may include: judging whether the first data and the second data used to generate the vector library contain the first target data and the second target data; if so, looking up the vectors corresponding to the first target data and the second target data directly in the vector library and using them as the first target vector and the second target vector; if not, encoding the first target data and the second target data with the data-to-vector model that has learned the correlation logic between the first data and the second data, thereby determining the first target vector and the second target vector.
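The lookup-then-encode logic can be sketched as follows, assuming the vector library is a dictionary keyed by text. The function `toy_encode` is a placeholder standing in for the trained data-to-vector model; it is not a real encoder, and all names here are illustrative.

```python
from typing import Callable, Dict, List

Vector = List[float]


def target_vector(text: str,
                  library: Dict[str, Vector],
                  encode: Callable[[str], Vector]) -> Vector:
    """Return the stored vector if the text is in the library, else encode it."""
    if text in library:
        return library[text]
    return encode(text)


def toy_encode(text: str) -> Vector:
    """Placeholder for the trained model; maps text to an arbitrary 2-d vector."""
    return [float(len(text)), 0.0]


# Hypothetical vector library built in S120.
library = {"disease 1": [0.1, 0.2], "drug 1": [0.1, 0.3]}
known = target_vector("disease 1", library, toy_encode)
unseen = target_vector("new drug", library, toy_encode)
```

Passing the encoder as a parameter keeps the lookup logic independent of whichever model produced the library.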
In an optional mode, when the first target data and the second target data are vocabulary texts, whether the first data and the second data of the generated vector library contain word texts in the first target data and the second target data can be judged; if yes, the first target vector and the second target vector can be determined according to the searched vector of the word text.
S140, judging whether the first target data and the second target data have correlation or not according to the distance between the first target vector and the second target vector.
After the distance between the first target vector and the second target vector is determined, when the distance is smaller than the preset threshold, it can be considered that there is a correlation between the first target data and the second target data; when the distance is greater than or equal to the preset threshold, it may be considered that there is no correlation between the first target data and the second target data.
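A minimal sketch of the decision rule in S140, assuming Euclidean distance (via the standard-library `math.dist`, Python 3.8+) and an illustrative threshold value:

```python
import math


def is_correlated(first_target_vec, second_target_vec, threshold):
    """Correlated only when the distance is strictly below the threshold.

    A distance equal to or above the threshold counts as not correlated,
    matching the rule stated in the text.
    """
    return math.dist(first_target_vec, second_target_vec) < threshold


# Illustrative vectors; their Euclidean distance is 5.
near = is_correlated([0.0, 0.0], [3.0, 4.0], threshold=6.0)
far = is_correlated([0.0, 0.0], [3.0, 4.0], threshold=4.0)
```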
In the data correlation analysis method provided by this embodiment, the data-to-vector model learns the correlation between the first data and the second data, so that related first data and second data can be encoded into a first vector and a second vector that lie a small distance apart, and these vectors form the vector library. A first target vector and a second target vector corresponding to the first target data and the second target data can then be determined based on the vector library, and the correlation between the first target data and the second target data can be inferred from the distance between the two target vectors.
The method can be applied to judging the correlation between disease data and drug data. Even when the drug data is not covered by the historical visit information, a vector corresponding to the drug data can still be derived according to the vector library, and the correlation between the disease data and the drug data can be determined from the distance between their vectors. Compared with the traditional method, this improves the accuracy of judging the correlation between disease data and drug data.
Embodiment Two
This embodiment describes in detail the step of generating the vector library on the basis of the above embodiment. The vector library can be generated by at least one of the following encoding methods: encoding according to data sequences, encoding according to a knowledge graph, and encoding according to a data dictionary. When multiple vector libraries are generated, multiple vector pairs corresponding to the first target data and the second target data can be determined from them, and the distances of these vector pairs can be combined to judge the correlation between the first target data and the second target data, which improves the judgment precision.
In an optional implementation of this embodiment, encoding the first data and the second data may include: generating a knowledge graph according to the first data and the second data; repeatedly selecting a start node from the knowledge graph and randomly walking a preset step length from the start node to obtain a data sequence, until the current condition satisfies a preset condition; and encoding according to the data sequences.
The knowledge graph (graph) can be drawn with an open-source or custom-written tool script according to the acquired first data and second data and the known correlation between them. The graph is composed of nodes and connecting lines: each node represents a piece of first data or second data, and each connecting line links two nodes to indicate that they are correlated.
The preset condition may include, but is not limited to: the current number of cycles reaches a preset number; and/or every node has been visited by a walk. The preset step length may be set according to empirical or experimental values. A data sequence may be a sequence in which first data and second data alternate.
Fig. 2 is a schematic diagram of a knowledge graph in an analysis method of data correlation according to a second embodiment of the present invention. Referring to fig. 2, a large circle node in the graph may represent first data, and a small circle node may represent second data; the first data may be a disease name text such as "disease 1" and "disease 2", and the second data may be a medicine name text such as "medicine 1" and "medicine 2", and the like. There may be a line between the large circle node and the small circle node in the graph for representing that the disease name text and the drug name text have a correlation.
For example, the "disease 1" node may be selected from the graph as the start node, and a random walk of the preset step length from that node yields a data sequence, for example [disease 1, drug 3, disease 2, drug 5]. Multiple start nodes can be selected for random walks so as to obtain multiple data sequences.
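The graph construction and random walk can be sketched as follows, under several assumptions: the graph is an undirected adjacency dictionary, the start node is chosen uniformly at random, and the loop stops when either a maximum number of cycles is reached or every node has been visited. The function names, the sample pairs, and the parameter values are illustrative.

```python
import random


def build_graph(pairs):
    """Undirected adjacency sets linking each disease to its drugs."""
    graph = {}
    for disease, drugs in pairs:
        for drug in drugs:
            graph.setdefault(disease, set()).add(drug)
            graph.setdefault(drug, set()).add(disease)
    return graph


def random_walk(graph, start, steps, rng):
    """Walk `steps` edges from `start`, picking a random neighbour each time."""
    walk = [start]
    for _ in range(steps):
        neighbours = sorted(graph[walk[-1]])
        walk.append(rng.choice(neighbours))
    return walk


def generate_sequences(graph, steps, max_cycles, seed=0):
    """Repeat walks until max_cycles is hit or every node has been visited."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    visited, sequences = set(), []
    for _ in range(max_cycles):
        walk = random_walk(graph, rng.choice(nodes), steps, rng)
        sequences.append(walk)
        visited.update(walk)
        if visited == set(nodes):
            break
    return sequences


# Hypothetical disease-drug pairs in the spirit of Fig. 2.
pairs = [("disease 1", ["drug 1", "drug 3"]), ("disease 2", ["drug 3", "drug 5"])]
graph = build_graph(pairs)
sequences = generate_sequences(graph, steps=3, max_cycles=10)
```

Because every connecting line joins a disease node to a drug node, each walk alternates between the two kinds of data, which matches the interleaved sequences described above.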
The preset data-to-vector model may differ depending on the form of the data to be encoded. When the data to be encoded is a set of data sequences, the sequences can be encoded using a shallow word embedding model (such as word2vec or GloVe) to obtain a vector library.
In another optional implementation manner provided in this embodiment, the encoding according to the first data and the second data may also include: generating a knowledge graph according to the first data and the second data; and coding according to the knowledge graph.
A knowledge graph can be generated from the first data and the second data as described above, and encoding can then be performed directly on the graph. When the data to be encoded is a graph, the graph can be encoded using a graph embedding method such as node2vec or LINE to obtain a vector library.
In another optional implementation manner provided by this embodiment, the encoding according to the first data and the second data may further include: removing duplication of the first data and the second data to obtain a data dictionary; and encoding according to the data dictionary.
After the data pairs including the first data and the second data are obtained, data pairs whose contents are completely identical are deduplicated, producing a data dictionary that contains the first data, the second data, and the correlation between them. Deduplication avoids redundancy in the data dictionary and improves the efficiency of subsequent encoding. When the data to be encoded is a data dictionary, a suitable PTM can be used for encoding to obtain a vector library.
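Deduplication into a data dictionary might look like the following sketch, which assumes the raw input is a flat list of (first data, second data) tuples with repeats. The function name and sample records are illustrative.

```python
def build_data_dictionary(raw_pairs):
    """Collapse identical (first, second) pairs, keeping first-seen order."""
    dictionary = {}
    for first, second in raw_pairs:
        seconds = dictionary.setdefault(first, [])
        if second not in seconds:
            seconds.append(second)
    return dictionary


# Hypothetical records extracted from historical visits, with repeats.
raw_pairs = [
    ("disease 1", "drug 1"),
    ("disease 1", "drug 2"),
    ("disease 1", "drug 1"),
    ("disease 2", "drug 2"),
    ("disease 2", "drug 2"),
]
data_dictionary = build_data_dictionary(raw_pairs)
```

The dictionary keeps each first data item once together with its related second data, so the correlation between them is preserved while repeated records are dropped.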
The encoding of the first data and the second data can be realized based on the at least one encoding mode, and at least one vector library is generated.
In an optional implementation, if there are at least two vector libraries, judging whether the first target data and the second target data are correlated according to the distance between the first target vector and the second target vector includes: for each vector library, determining the distance between the first target vector and the second target vector obtained from that library; and judging whether the first target data and the second target data are correlated according to these distances.
Each vector library yields one group consisting of a first target vector and a second target vector for the first target data and the second target data, and a distance can be determined from the two vectors in each group. Whether the first target data and the second target data are correlated is then determined from these distances.
In these alternative implementations, the first target vector and the second target vector corresponding to the first target data and the second target data may be determined by using a vector library obtained based on different encoding modes. The distance is determined by utilizing the plurality of groups of first target vectors and second target vectors, so that the judgment accuracy of the correlation of the first target data and the second target data is improved.
In some further implementations, judging whether the first target data and the second target data are correlated according to the distances includes: determining a first score from each distance; weighting the first scores by their corresponding weight coefficients to obtain a comprehensive score; and judging whether the first target data and the second target data are correlated according to the comprehensive score. The weight coefficients are obtained by grid search (grid parameter tuning) on a set of sample pairs of first sample data and second sample data that are correlated.
To determine a first score from each distance, the distance may be normalized to a preset range, for example 0 to 1, so that degrees of correlation can be compared uniformly and intuitively. The first scores are then weighted to obtain a comprehensive score, so that every vector library contributes to the correlation judgment.
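The original does not specify the normalization. One possible mapping, assumed here only for illustration, is 1 / (1 + distance), which sends a non-negative distance into the range (0, 1] and gives closer vectors a higher score.

```python
def first_score(distance):
    """Map a non-negative distance into (0, 1]; smaller distance, higher score."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


# Illustrative distances, one per vector library.
scores = [first_score(d) for d in (0.0, 1.0, 3.0)]
```

Other choices are equally plausible, for example min-max scaling against the largest distance observed in a sample set.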
The weight coefficients can be determined as follows: for each pair of first sample data and second sample data in a labeled set of correlated sample pairs, a comprehensive score is computed from the generated vector libraries, and a grid search over the weight coefficients of the first scores selects the optimal set of weights. The determined weight coefficients are then used when computing the comprehensive score for the first target data and the second target data.
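A grid search over three weight coefficients could be sketched as follows. The sketch makes several assumptions that are not in the original: the weights are non-negative and sum to 1, the labeled samples include both correlated and uncorrelated pairs, a comprehensive score of at least `cutoff` is read as "correlated", and accuracy is the selection criterion. All names and sample values are hypothetical.

```python
from itertools import product


def composite(scores, weights):
    """Weighted sum of the first scores."""
    return sum(w * s for w, s in zip(weights, scores))


def grid_search_weights(samples, steps=4, cutoff=0.5):
    """Try weight triples on a grid and keep the most accurate one.

    samples: list of (first_scores, is_correlated) with three scores each.
    """
    grid = [i / steps for i in range(steps + 1)]
    best_weights, best_accuracy = None, -1.0
    for alpha, beta in product(grid, repeat=2):
        gamma = 1.0 - alpha - beta
        if gamma < -1e-9:
            continue  # weights would not sum to 1 with gamma >= 0
        weights = (alpha, beta, max(gamma, 0.0))
        correct = sum(
            (composite(scores, weights) >= cutoff) == label
            for scores, label in samples
        )
        accuracy = correct / len(samples)
        if accuracy > best_accuracy:
            best_weights, best_accuracy = weights, accuracy
    return best_weights, best_accuracy


# Hypothetical labeled samples: three first scores plus a correlation label.
samples = [
    ((0.9, 0.2, 0.1), True),
    ((0.8, 0.3, 0.2), True),
    ((0.2, 0.9, 0.1), False),
    ((0.1, 0.8, 0.3), False),
]
best_weights, best_accuracy = grid_search_weights(samples)
```

A finer grid (larger `steps`) explores more weight combinations at the cost of more evaluations.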
In these further implementation manners, the distances of the pairs of vectors can be integrated to judge the correlation between the first target data and the second target data, so that the judgment precision is improved.
Fig. 3 is a schematic flow chart illustrating the determination of the correlation between disease data and drug data in a data correlation analysis method according to a second embodiment of the present invention. Referring to fig. 3, the first data is a disease name text; the second data is a medicine name text; the first target data is a target disease name text; the second target data is a target drug name text.
The process for determining the correlation between disease data and drug data provided in fig. 3 may include:
first, a disease name text in the historical encounter information and a medicine name text corresponding to the disease name text may be extracted from the hospital database. Illustratively, the disease name text and the medicine name text corresponding to the disease name text may be obtained in the format of "[ disease 1, [ medicine 1, medicine 2, …, medicine n ] ]".
Second, the vector library V1 may be generated based on the following steps:
firstly, generating a knowledge graph according to a disease name text and a medicine name text corresponding to the disease name text; then, a starting node can be selected from the knowledge graph spectrum in a circulating mode, the preset step length is randomly walked from the starting node to obtain a data sequence, and the circulation is stopped until the current condition meets the preset condition; finally, based on word2vec or GloVe, encoding can be carried out according to each data sequence to generate a vector library V1.
Meanwhile, the vector library V2 may also be generated based on the following steps:
The vector library V2 can be generated by encoding the knowledge graph itself, based on node2vec or LINE.
Moreover, the vector library V3 may also be generated based on the following steps:
First, the first data and the second data can be deduplicated to obtain a data dictionary; then, encoding is performed according to the data dictionary based on pre-trained models (PTMs) to generate the vector library V3.
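The deduplication step can be sketched as follows. The source does not say how the data dictionary is structured, so the name-to-index mapping, the first-seen ordering, and the function name `build_data_dictionary` are assumptions made purely for illustration.

```python
def build_data_dictionary(first_data, second_data):
    """Deduplicate names, keeping first-seen order, and map each name to an index."""
    dictionary = {}
    for name in list(first_data) + list(second_data):
        if name not in dictionary:
            dictionary[name] = len(dictionary)
    return dictionary

# Hypothetical disease and medicine name texts containing repeats.
dictionary = build_data_dictionary(
    ["disease_a", "disease_b", "disease_a"],
    ["drug_1", "drug_1", "drug_2"],
)
```

Each unique entry of the dictionary would then be passed once to the chosen pre-trained model to obtain its vector.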
After V1, V2, and V3 are generated, the data pair (disease, drug) of the target disease name text and the target drug name text may be looked up in V1, V2, and V3 to obtain the vector pairs (disease_vec1, drug_vec1), (disease_vec2, drug_vec2), and (disease_vec3, drug_vec3) of the first target vector and the second target vector, respectively.
Next, the distance between the two vectors of each of the vector pairs (disease_vec1, drug_vec1), (disease_vec2, drug_vec2), and (disease_vec3, drug_vec3) can be determined, and the corresponding first scores score(disease_vec1, drug_vec1), score(disease_vec2, drug_vec2), and score(disease_vec3, drug_vec3) can be calculated from the distances.
Finally, the weight parameters α, β, and γ are assigned to the first scores, and the three first scores are fused to obtain a comprehensive score Score_match. Specifically:
Score_match = α * score(disease_vec1, drug_vec1) + β * score(disease_vec2, drug_vec2) + γ * score(disease_vec3, drug_vec3).
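The fusion of the three first scores can be sketched as follows. The source does not state how a first score is derived from a distance, so the mapping score = 1 / (1 + distance), the choice of Euclidean distance, the dictionary-based vector libraries, and all names and numeric values below are assumptions for illustration only.

```python
import math

def first_score(u, v):
    """Map a Euclidean distance to a score in (0, 1]; this mapping is an assumption."""
    return 1.0 / (1.0 + math.dist(u, v))

def score_match(disease, drug, libraries, weights):
    """Weighted sum of the per-library first scores (alpha, beta, gamma in the text)."""
    total = 0.0
    for library, weight in zip(libraries, weights):
        total += weight * first_score(library[disease], library[drug])
    return total

# Hypothetical two-dimensional vector libraries V1, V2, and V3.
v1 = {"disease_a": [1.0, 0.0], "drug_1": [1.0, 0.0]}
v2 = {"disease_a": [1.0, 0.0], "drug_1": [0.0, 1.0]}
v3 = {"disease_a": [1.0, 1.0], "drug_1": [1.0, 1.0]}
result = score_match("disease_a", "drug_1", [v1, v2, v3], weights=(0.5, 0.3, 0.2))
```

Here V1 and V3 place the two names at the same point, so they contribute their full weights, while V2 contributes only part of its weight because the vectors are farther apart.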
In this embodiment of the present invention, the steps of generating the vector library are described in detail. The vector library can be generated by at least one of the following encoding methods: encoding according to a data sequence, encoding according to a knowledge graph, and encoding according to a data dictionary. Furthermore, multiple vector pairs corresponding to the first target data and the second target data can be determined based on the multiple generated vector libraries, and the distances of these vector pairs can be combined to judge the correlation between the first target data and the second target data, which improves the precision of the judgment. In addition, this embodiment and the data correlation analysis method provided by the foregoing embodiment belong to the same inventive concept; technical details not described in detail here can be found in the foregoing embodiment, and the same technical effects are achieved.
EXAMPLE III
Fig. 4 is a schematic structural diagram of an analysis apparatus for data correlation according to a third embodiment of the present invention. The present embodiment is applicable to a case where the correlation of data is determined, for example, a case where the correlation between disease data and medicine data is determined.
Referring to fig. 4, the apparatus for analyzing data correlation provided in this embodiment may include:
a data obtaining module 410, configured to obtain first data and second data related to the first data;
a vector library generating module 420, configured to encode according to the first data and the second data based on a preset data-to-vector model to generate a vector library; the vector library comprises a first vector corresponding to the first data and a second vector corresponding to the second data, and the distance between the first vector and the second vector is smaller than a preset threshold value;
a target vector determining module 430, configured to obtain first target data and second target data, and determine a first target vector corresponding to the first target data and a second target vector corresponding to the second target data according to a vector library;
a correlation determination module 440, configured to determine whether there is correlation between the first target data and the second target data according to a distance between the first target vector and the second target vector.
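The roles of the target vector determining module and the correlation determination module can be sketched as follows. This is a hedged illustration rather than the apparatus itself: the class name `CorrelationAnalyzer`, the dictionary-based vector library, the use of Euclidean distance, and the rule of comparing the distance against the same preset threshold are all assumptions.

```python
import math

class CorrelationAnalyzer:
    """Hypothetical stand-in for modules 430 and 440 operating on one vector library."""

    def __init__(self, vector_library, threshold):
        self.vector_library = vector_library  # maps a data item to its vector
        self.threshold = threshold            # preset distance threshold (assumed value)

    def target_vectors(self, first_target, second_target):
        """Look up the first and second target vectors in the vector library."""
        return self.vector_library[first_target], self.vector_library[second_target]

    def is_correlated(self, first_target, second_target):
        """Judge correlation by comparing the vector distance with the threshold."""
        u, v = self.target_vectors(first_target, second_target)
        return math.dist(u, v) < self.threshold

# Hypothetical vector library with made-up coordinates.
library = {
    "disease_a": [0.1, 0.2],
    "drug_1": [0.15, 0.25],
    "drug_2": [0.9, 0.9],
}
analyzer = CorrelationAnalyzer(library, threshold=0.5)
near = analyzer.is_correlated("disease_a", "drug_1")
far = analyzer.is_correlated("disease_a", "drug_2")
```

A lookup of a name that is missing from the library raises `KeyError` in this sketch; how unseen target data should be handled is not addressed by the source.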
In some optional implementations, the vector library generation module may be configured to:
generating a knowledge graph according to the first data and the second data;
repeatedly selecting a starting node from the knowledge graph and randomly walking a preset step length from the starting node to obtain a data sequence, until a preset stop condition is met;
and encoding according to each data sequence.
In some optional implementations, the vector library generation module may also be configured to:
generating a knowledge graph according to the first data and the second data;
and coding according to the knowledge graph.
In some optional implementations, the vector library generation module may be further configured to:
removing duplication of the first data and the second data to obtain a data dictionary;
and encoding according to the data dictionary.
In some optional implementations, if there are at least two vector libraries, the relevance determination module may be configured to:
determining, for each vector library, the distance between the first target vector and the second target vector obtained from that vector library;
and judging whether the first target data and the second target data have correlation or not according to the distances.
In some optional implementations, the relevance determination module may be specifically configured to:
determining a first score according to each distance;
weighting each first score based on the weight coefficient corresponding to each first score to obtain a comprehensive score;
judging whether the first target data and the second target data have correlation or not according to the comprehensive score;
wherein each weight coefficient is obtained by grid-search tuning on a set of sample pairs of first sample data and second sample data having correlation.
In some optional implementations, the distance may include at least one of: Euclidean distance, Manhattan distance, Chebyshev distance, Mahalanobis distance, and cosine distance.
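Four of the listed distances can be sketched in a few lines each using their standard definitions. The function names are illustrative, and the Mahalanobis distance is left out of this sketch because it additionally requires a covariance matrix estimated from data.

```python
import math

def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def chebyshev(u, v):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))

def cosine_distance(u, v):
    """One minus the cosine similarity; undefined for zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

u, v = [1.0, 2.0], [4.0, 6.0]
d_euclidean = euclidean(u, v)
d_manhattan = manhattan(u, v)
d_chebyshev = chebyshev(u, v)
d_cosine = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

The first three measure displacement between the vectors, while the cosine distance depends only on the angle between them, so the choice affects how a distance threshold should be set.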
The data correlation analysis device provided by the embodiment of the invention can execute the data correlation analysis method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. For technical details which are not described in detail, reference may be made to the method for analyzing data correlation provided in any embodiment of the present invention.
EXAMPLE IV
Fig. 5 is a schematic structural diagram of an electronic terminal according to a fourth embodiment of the present invention. Fig. 5 illustrates a block diagram of an exemplary terminal 12 suitable for use in implementing any of the embodiments of the present invention. The terminal 12 shown in fig. 5 is only an example and should not impose any limitation on the functions or the scope of use of the embodiments of the present invention. The terminal 12 is typically a terminal that undertakes the data correlation analysis function.
As shown in fig. 5, the terminal 12 is embodied in the form of a general purpose computing device. The components of the terminal 12 may include, but are not limited to: one or more processors or processing units 16, a memory 28, and a bus 18 that couples the various components (including the memory 28 and the processing unit 16).
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
The terminal 12 typically includes a variety of computer readable media. Such media may be any available media that is accessible by terminal 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The terminal 12 may further include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, and commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a compact disc read-only memory (CD-ROM), a digital video disc read-only memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product 40 having a set of program modules 42 configured to carry out the functions of embodiments of the invention. Program product 40 may be stored, for example, in memory 28; such program modules 42 include, but are not limited to, one or more application programs, other program modules, and program data, and each of these examples, or some combination of them, may include an implementation of a networking environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
The terminal 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a mouse, a camera, and a display), with one or more devices that enable a user to interact with the terminal 12, and/or with any devices (e.g., a network card, a modem, etc.) that enable the terminal 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the terminal 12 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 20. As shown, the network adapter 20 communicates with the other modules of the terminal 12 via the bus 18. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the terminal 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, redundant array of independent disks (RAID) devices, tape drives, and data backup storage devices.
The processor 16 executes various functional applications and data processing by executing programs stored in the memory 28, for example, to implement the data correlation analysis method provided by the above-described embodiment of the present invention, and the method includes:
acquiring first data and second data related to the first data; encoding according to the first data and the second data based on a preset data-to-vector model to generate a vector library; the vector library comprises a first vector corresponding to the first data and a second vector corresponding to the second data, and the distance between the first vector and the second vector is smaller than a preset threshold value; acquiring first target data and second target data, and determining a first target vector corresponding to the first target data and a second target vector corresponding to the second target data according to the vector library; and judging whether the first target data and the second target data have correlation or not according to the distance between the first target vector and the second target vector.
Of course, those skilled in the art can understand that the processor can also implement the technical solution of the analysis method for data correlation provided in any embodiment of the present invention.
EXAMPLE V
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, for example, implements the method for analyzing data correlation provided in the foregoing embodiment of the present invention, where the method includes:
acquiring first data and second data related to the first data; encoding according to the first data and the second data based on a preset data-to-vector model to generate a vector library; the vector library comprises a first vector corresponding to the first data and a second vector corresponding to the second data, and the distance between the first vector and the second vector is smaller than a preset threshold value; acquiring first target data and second target data, and determining a first target vector corresponding to the first target data and a second target vector corresponding to the second target data according to the vector library; and judging whether the first target data and the second target data have correlation or not according to the distance between the first target vector and the second target vector.
Of course, the computer program stored on the computer-readable storage medium provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform the data correlation analysis method provided by any embodiment of the present invention.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments illustrated herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method for analyzing data correlations, comprising:
acquiring first data and second data related to the first data;
encoding according to the first data and the second data based on a preset data-to-vector model to generate a vector library; the vector library comprises a first vector corresponding to the first data and a second vector corresponding to the second data, and the distance between the first vector and the second vector is smaller than a preset threshold value;
acquiring first target data and second target data, and determining a first target vector corresponding to the first target data and a second target vector corresponding to the second target data according to the vector library;
and judging whether the first target data and the second target data have correlation or not according to the distance between the first target vector and the second target vector.
2. The method of claim 1, wherein the encoding according to the first data and the second data comprises:
generating a knowledge graph according to the first data and the second data;
repeatedly selecting a starting node from the knowledge graph and randomly walking a preset step length from the starting node to obtain a data sequence, until a preset stop condition is met;
and coding is carried out according to each data sequence.
3. The method of claim 1, wherein the encoding according to the first data and the second data comprises:
generating a knowledge graph according to the first data and the second data;
and coding according to the knowledge graph.
4. The method of claim 1, wherein the encoding according to the first data and the second data comprises:
removing duplication of the first data and the second data to obtain a data dictionary;
and coding according to the data dictionary.
5. The method of claim 1, wherein if there are at least two vector libraries, the determining whether there is a correlation between the first target data and the second target data according to the distance between the first target vector and the second target vector comprises:
determining, for each vector library, a distance between the first target vector and the second target vector obtained from that vector library;
and judging whether the first target data and the second target data have correlation or not according to the distances.
6. The method of claim 5, wherein determining whether there is a correlation between the first target data and the second target data according to each of the distances comprises:
determining a first score according to each distance;
weighting each first score based on the weight coefficient corresponding to each first score to obtain a comprehensive score;
judging whether the first target data and the second target data have correlation or not according to the comprehensive score;
wherein each weight coefficient is obtained by grid-search tuning on a set of sample pairs of first sample data and second sample data having correlation.
7. The method according to any of claims 1-6, wherein the distance comprises at least one of: Euclidean distance, Manhattan distance, Chebyshev distance, Mahalanobis distance, and cosine distance.
8. An apparatus for analyzing data correlation, comprising:
the data acquisition module is used for acquiring first data and second data related to the first data;
the vector library generating module is used for encoding according to the first data and the second data based on a preset data-to-vector model to generate a vector library; the vector library comprises a first vector corresponding to the first data and a second vector corresponding to the second data, and the distance between the first vector and the second vector is smaller than a preset threshold value;
the target vector determining module is used for acquiring first target data and second target data, and determining a first target vector corresponding to the first target data and a second target vector corresponding to the second target data according to the vector library;
and the correlation judging module is used for judging whether the first target data and the second target data have correlation or not according to the distance between the first target vector and the second target vector.
9. An electronic terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method for analyzing data correlations according to any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method for analyzing a data correlation according to any one of claims 1 to 7.
CN202110702825.8A 2021-06-24 2021-06-24 Data correlation analysis method and device, electronic terminal and storage medium Pending CN113420118A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110702825.8A CN113420118A (en) 2021-06-24 2021-06-24 Data correlation analysis method and device, electronic terminal and storage medium

Publications (1)

Publication Number Publication Date
CN113420118A true CN113420118A (en) 2021-09-21

Family

ID=77716593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110702825.8A Pending CN113420118A (en) 2021-06-24 2021-06-24 Data correlation analysis method and device, electronic terminal and storage medium

Country Status (1)

Country Link
CN (1) CN113420118A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination