CN115952304A

CN115952304A - Method, device and equipment for searching variant documents and storage medium

Info

Publication number: CN115952304A
Application number: CN202310232304.XA
Authority: CN
Inventors: 蔡娇; 许青青; 盛磊; 陈梅; 任子云; 余蕾; 方云倩; 张学杰; 徐昕; 苗翠翠; 王建峰
Original assignee: Suzhou Chaoyun Life Intelligence Industry Research Institute Co ltd
Current assignee: Suzhou Chaoyun Life Intelligence Industry Research Institute Co ltd
Priority date: 2023-03-13
Filing date: 2023-03-13
Publication date: 2023-04-11
Anticipated expiration: 2043-03-13
Also published as: CN115952304B

Abstract

The invention discloses a method, an apparatus, a device and a storage medium for searching variant documents. The method comprises the following steps: constructing at least one retrieval combination set based on at least one retrieval entity in received retrieval data; for each retrieval combination in the retrieval combination set, determining, based on a pre-constructed document knowledge graph, at least one reference variant document corresponding to the retrieval combination and a retrieval weight value corresponding to each reference variant document; sorting the reference variant documents based on the retrieval weight values to obtain a reference sorting result corresponding to the retrieval combination; and determining a document retrieval result corresponding to the retrieval data based on each reference sorting result. The document knowledge graph comprises association relations between at least one preset variant document and at least one variation type, and preset weight values respectively corresponding to the association relations; each variation type represents a combination form of at least one preset entity. The embodiment of the invention improves the efficiency of document retrieval.

Description

Method, device and equipment for searching variant documents and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for searching a variant document.

Background

With the completion of the Human Genome Project in 2003, research on human genetic information made a qualitative leap. Genetic testing has become a hot spot in clinical diagnosis and scientific research, more and more new documents about human genetic variation are indexed in the PubMed database, and thousands of documents describe the possible pathogenicity of various variant sites.

Researchers rely on consulting a large number of variation databases when interpreting variant sites. Traditional variant search engines, however, use the co-occurrence of multiple search terms in a document as the search criterion and present a large number of related documents to researchers all at once, and the researchers must locate by themselves the passages in which the search terms appear in order to judge whether those passages can serve as a reference. Therefore, the search results of traditional variant search engines have poor reference value, and document search is inefficient.

Disclosure of Invention

The embodiment of the invention provides a method, an apparatus, a device and a storage medium for searching variant documents, which are used for solving the problem that the search results given by traditional variant search engines have poor reference value, and for improving the efficiency of document search.

According to an embodiment of the present invention, there is provided a method for searching a variant document, including:

in response to receiving retrieval data, constructing a retrieval combination set based on at least one retrieval entity in the retrieval data;

for each retrieval combination in the retrieval combination set, determining, based on a pre-constructed document knowledge graph, at least one reference variant document corresponding to the retrieval combination and retrieval weight values respectively corresponding to the reference variant documents;

sorting the reference variant documents based on the retrieval weight values to obtain a reference sorting result corresponding to the retrieval combination;

determining a document retrieval result corresponding to the retrieval data based on the reference sorting result corresponding to each retrieval combination;

wherein the document knowledge graph comprises association relations between at least one preset variant document and at least one variation type and preset weight values respectively corresponding to the association relations, and each variation type represents a combination form of at least one preset entity.

According to another embodiment of the present invention, there is provided a variant document retrieval apparatus, including:

a retrieval combination set building module, configured to, in response to receiving retrieval data, construct a retrieval combination set based on at least one retrieval entity in the retrieval data;

a reference variant document determining module, configured to determine, for each search combination in the search combination set, at least one reference variant document corresponding to the search combination and a search weight value corresponding to each reference variant document based on a pre-constructed document knowledge graph;

a reference ranking result determining module, configured to rank, based on each search weight value, each reference variant document to obtain a reference ranking result corresponding to the search combination;

a document retrieval result determining module, configured to determine a document retrieval result corresponding to the retrieval data based on the reference sorting results respectively corresponding to the retrieval combinations.

According to another embodiment of the present invention, there is provided an electronic apparatus including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the method for searching variant documents according to any of the embodiments of the invention.

According to another embodiment of the present invention, a computer-readable storage medium is provided, and the computer-readable storage medium stores computer instructions for causing a processor to implement a method for retrieving a variant document according to any embodiment of the present invention when the computer instructions are executed.

According to the technical solution of the embodiments of the present invention, the problem that the search results given by traditional variant search engines have poor reference value is solved, the variant documents that better match the user's search requirements are ranked first, and the efficiency of document search is improved.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart of a variant document retrieval method according to an embodiment of the present invention;

FIG. 2 is a schematic illustration of a document knowledge-graph provided in accordance with an embodiment of the present invention;

FIG. 3 is a flowchart of another variant document retrieval method according to an embodiment of the present invention;

FIG. 4 is a flowchart of a preset entity identification and alignment method according to an embodiment of the present invention;

FIG. 5 is a flowchart of another variant document retrieval method according to an embodiment of the present invention;

FIG. 6 is a flowchart of a method for determining a search combination set according to an embodiment of the present invention;

FIG. 7 is a flowchart of a method for determining a search variation set according to an embodiment of the present invention;

FIG. 8 is a flowchart of another variant document retrieval method according to an embodiment of the present invention;

FIG. 9 is a flow chart of another method for determining a search combination set according to an embodiment of the present invention;

FIG. 10 is a flowchart illustrating an embodiment of a variant document retrieval method according to an embodiment of the present invention;

FIG. 11 is a schematic structural diagram of a variant document searching apparatus according to an embodiment of the present invention;

FIG. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.

It should be noted that the terms "first", "second", "initial", "target", "reference", "preset", "search", and the like in the description and claims of the present invention and the above drawings are used for distinguishing similar objects, and are not necessarily used for describing a specific order or sequence. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Fig. 1 is a flowchart of a variant document retrieval method according to an embodiment of the present invention, which is applicable to retrieving variant documents in a database, and the method can be executed by a variant document retrieval apparatus, which can be implemented in hardware and/or software, and the variant document retrieval apparatus can be configured in a terminal device. As shown in fig. 1, the method includes:

S110, in response to receiving the retrieval data, constructing a retrieval combination set based on at least one retrieval entity in the retrieval data.

In this embodiment, each search entity in the search data includes at least a search gene, and on this basis, each search entity in the search data may further include a search amino acid, a search nucleotide, a search transcript, a search disease, and the like.

Specifically, the search combination set includes at least one search combination, and the search combination includes at least one search entity.

In an alternative embodiment, the search data includes at least a search gene, and the constructing of the search combination set based on at least one search entity in the search data includes: when the search data further includes a search disease, adding the search gene and the search disease as a search combination to a search combination set; when the search data further includes search amino acids and search nucleotides, the search genes, the search amino acids, and the search nucleotides are added as search combinations to the search combination set.

For example, assuming that the search data includes gene 1, disease 1, amino acid 1, and nucleotide 1, the search combination set includes [gene 1, disease 1] and [gene 1, amino acid 1, nucleotide 1].
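The construction rule described above can be sketched in a few lines. This is a minimal illustration only: the function name `build_search_combinations` and its parameters are hypothetical, and the behavior when only a search gene is supplied (an empty set here) is an assumption, since the text does not state it.

```python
from typing import List, Optional, Tuple


def build_search_combinations(
    gene: str,
    disease: Optional[str] = None,
    amino_acid: Optional[str] = None,
    nucleotide: Optional[str] = None,
) -> List[Tuple[str, ...]]:
    """Build the search combination set from the entities in the search data."""
    combinations: List[Tuple[str, ...]] = []
    # A search disease, when present, is combined with the search gene.
    if disease is not None:
        combinations.append((gene, disease))
    # A search amino acid and a search nucleotide, when both are present,
    # are combined with the search gene.
    if amino_acid is not None and nucleotide is not None:
        combinations.append((gene, amino_acid, nucleotide))
    return combinations


combos = build_search_combinations(
    "gene 1", "disease 1", "amino acid 1", "nucleotide 1"
)
```

With the entities of the example above, `combos` holds the gene and disease combination followed by the gene, amino acid and nucleotide combination.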

S120, for each retrieval combination in the retrieval combination set, determining, based on a pre-constructed document knowledge graph, at least one reference variant document corresponding to the retrieval combination and a retrieval weight value corresponding to each reference variant document.

In this embodiment, the literature knowledge graph includes association relationships respectively corresponding to at least one preset variant literature and at least one variant type and preset weight values respectively corresponding to the association relationships, and the variant type represents a combination form of at least one preset entity.

In an alternative embodiment, the variation types in the document knowledge graph include GD variations and/or GPN variations, wherein a GD variation characterizes a combination of a preset gene and a preset disease, and a GPN variation characterizes a combination of a preset gene, a preset amino acid and a preset nucleotide. For example, the variation name of a GD variation or a GPN variation may be composed of the entity names of the corresponding preset entities, e.g., the variation name of a GPN variation may be "GJB2:p.Lys224Gln:c.670A>C".

FIG. 2 is a schematic diagram of a document knowledge graph according to an embodiment of the present invention. Specifically, the preset variant documents in the document knowledge graph shown in FIG. 2 include variant document 1, variant document 2, and variant document 3. "G" represents a gene (Gene), "D" represents a disease (Disease), "P" represents an amino acid (Protein), "N" represents a nucleotide (Nucleotide), the number following each letter identifies an instance of the corresponding preset entity, "Gx + Dy" represents a GD variation, "Gx + Py + Nz" represents a GPN variation, each dotted arrow represents an association relationship, and the number on a dotted arrow represents the preset weight value corresponding to that association relationship. It should be noted that FIG. 2 uses "Gx + Dy" and "Gx + Py + Nz" only to distinguish and illustrate the variation types; in a real document knowledge graph, "Gx + Dy" and "Gx + Py + Nz" may be replaced by real variation names.

In an optional embodiment, determining, based on a pre-constructed document knowledge graph, at least one reference variant document corresponding to the search combination and a search weight value corresponding to each reference variant document includes: determining a search variation type based on the search combination; and determining at least one reference variant document and a search weight value corresponding to each reference variant document based on the document knowledge graph and the search variation type; wherein the search variation type is a search GD variation or a search GPN variation.

In one embodiment, determining a search variation type based on the search combination comprises: when the search combination set contains a search combination consisting of a search gene and a search disease, determining a search GD variation based on that search combination; and when the search combination set contains a search combination consisting of a search gene, a search amino acid, and a search nucleotide, determining a search GPN variation based on that search combination.

For example, when the search combination is [gene 1, disease 1], the search GD variation is G1+D1, and when the search combination is [gene 1, amino acid 1, nucleotide 1], the search GPN variation is G1+P1+N1.

In an optional embodiment, determining at least one reference variant document and a search weight value corresponding to each reference variant document based on the document knowledge graph and the search variation type includes: taking each preset variant document that has an association relationship with the search variation type in the document knowledge graph as a reference variant document, and taking the preset weight value corresponding to that association relationship as the search weight value corresponding to the reference variant document.

Taking fig. 2 as an example, assuming that the search variation type is G1+ D1, each reference variation document includes document 1 and document 2, and the search weight values of document 1 and document 2 are 0.51 and 0.9, respectively. Assuming that the search variation type is G1+ P1+ N1, each reference variation document includes documents 2 and 3, and the search weight values of documents 2 and 3 are 0.46 and 0.5, respectively.
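The lookup just described can be illustrated with a small in-memory mapping. This is a sketch under assumptions: the dictionary `DOCUMENT_GRAPH` and the function `lookup_reference_documents` are hypothetical names, and a real document knowledge graph would normally live in a graph store rather than a Python dict. The weights are the ones quoted from FIG. 2 in the example above.

```python
# Hypothetical in-memory stand-in for the document knowledge graph:
# variation type -> {preset variant document: preset weight value}.
DOCUMENT_GRAPH = {
    "G1+D1": {"document 1": 0.51, "document 2": 0.9},
    "G1+P1+N1": {"document 2": 0.46, "document 3": 0.5},
}


def lookup_reference_documents(search_variation_type, graph=DOCUMENT_GRAPH):
    """Return {reference variant document: search weight value}."""
    # Documents associated with the search variation type become reference
    # documents; the preset weight of each association becomes the search weight.
    return dict(graph.get(search_variation_type, {}))


gd_weights = lookup_reference_documents("G1+D1")
gpn_weights = lookup_reference_documents("G1+P1+N1")
```

A variation type that has no association in the graph simply yields an empty mapping in this sketch.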

S130, sorting the reference variant documents based on the retrieval weight values to obtain a reference sorting result corresponding to the retrieval combination.

Specifically, for each search combination, the reference variant documents are sorted based on the search weight values corresponding to the reference variant documents corresponding to the search combination, so as to obtain a reference sorting result corresponding to the search combination.

Continuing the above example, the reference ranking result corresponding to [gene 1, disease 1] is [document 2, document 1], and the reference ranking result corresponding to [gene 1, amino acid 1, nucleotide 1] is [document 3, document 2].
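The sorting step can be sketched as follows, assuming descending order of search weight, which is what the example above implies. The function name `rank_reference_documents` is hypothetical.

```python
def rank_reference_documents(search_weights):
    """Sort reference variant documents by search weight, highest first."""
    return sorted(search_weights, key=search_weights.get, reverse=True)


gd_ranking = rank_reference_documents({"document 1": 0.51, "document 2": 0.9})
gpn_ranking = rank_reference_documents({"document 2": 0.46, "document 3": 0.5})
```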

S140, determining a document retrieval result corresponding to the retrieval data based on the reference sorting result corresponding to each retrieval combination.

In an alternative embodiment, determining the document search result corresponding to the search data based on the reference sorting results respectively corresponding to the search combinations comprises: when the search combination set contains one search combination, taking the reference sorting result corresponding to that search combination as the document search result corresponding to the search data; and when the search combination set contains at least two search combinations, obtaining a target sorting result from each reference sorting result based on a preset sorting number, ordering the target sorting results based on the priorities respectively corresponding to the search combinations, and taking the ordered results as the document search result corresponding to the search data.

The preset sorting quantity is, for example, 100, and is not limited herein, and the user can customize the preset sorting quantity according to actual requirements. For example, when the sorting order of the reference sorting results is descending sorting, the top 100 reference variant documents in the reference sorting results are taken as the target sorting results.

In an alternative embodiment, the search combination of search genes and search diseases has a lower priority than the search combination of search genes, search amino acids and search nucleotides.

For example, assume that reference ranking result 1 is determined based on a search combination composed of a search gene and a search disease, that reference ranking result 2 is determined based on a search combination composed of a search gene, a search amino acid, and a search nucleotide, that reference ranking result 1 is [document 1, document 2], and that reference ranking result 2 is [document 3, document 4, document 5]. The document search result is then [document 3, document 4, document 5, document 1, document 2].
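The priority-based merge can be sketched as below. The function name `merge_by_priority` is hypothetical, and the caller is assumed to pass the rankings already ordered from highest to lowest priority (the gene, amino acid and nucleotide combination before the gene and disease combination).

```python
def merge_by_priority(rankings_by_priority, preset_sorting_number=100):
    """Merge reference rankings into one document search result.

    rankings_by_priority: rankings ordered from highest to lowest priority.
    """
    # A single search combination: its reference ranking is the result.
    if len(rankings_by_priority) == 1:
        return list(rankings_by_priority[0])
    merged = []
    for ranking in rankings_by_priority:
        # Keep only the top-N entries of each ranking (the target result).
        merged.extend(ranking[:preset_sorting_number])
    return merged


result = merge_by_priority(
    [
        ["document 3", "document 4", "document 5"],  # gene + amino acid + nucleotide
        ["document 1", "document 2"],                # gene + disease
    ]
)
```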

In an optional embodiment, the search combination set contains at least two search combinations, and accordingly, before determining the document search result corresponding to the search data based on the reference ranking results respectively corresponding to the search combinations, the method further comprises: when at least one repeated variant document exists among the reference ranking results, deleting, for each repeated variant document, the repeated variant document from the reference ranking results corresponding to the lower-priority search combinations, to obtain screened reference ranking results.

Specifically, a repeated variant document is a reference variant document that is present in at least two reference ranking results, that is, a reference variant document that corresponds to at least two search combinations. Deleting the repeated variant document from the reference ranking results corresponding to the lower-priority search combinations means that the repeated variant document is retained in the reference ranking result with the highest priority and deleted from all other reference ranking results.

For example, assuming that reference ranking result 1 is determined based on a search combination composed of a search gene and a search disease, reference ranking result 2 is determined based on a search combination composed of a search gene, a search amino acid, and a search nucleotide, reference ranking result 1 is [document 1, document 2, document 3], and reference ranking result 2 is [document 3, document 4, document 5], document 3 is deleted from reference ranking result 1, and accordingly, the document search result is [document 3, document 4, document 5, document 1, document 2].

The advantage of this arrangement is that the document retrieval result can be prevented from having a plurality of identical variant documents, and the accuracy of the document retrieval result can be improved.
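A minimal sketch of this screening step is given below. The function name `drop_lower_priority_duplicates` is hypothetical, and the input is again assumed to be ordered from highest to lowest priority.

```python
def drop_lower_priority_duplicates(rankings_by_priority):
    """Keep each repeated document only in its highest-priority ranking."""
    seen = set()
    screened = []
    for ranking in rankings_by_priority:  # highest priority first
        # Drop documents already kept by a higher-priority ranking.
        kept = [doc for doc in ranking if doc not in seen]
        seen.update(kept)
        screened.append(kept)
    return screened


screened = drop_lower_priority_duplicates(
    [
        ["document 3", "document 4", "document 5"],  # higher priority
        ["document 1", "document 2", "document 3"],  # lower priority
    ]
)
```

Relative order inside each ranking is preserved, so the screened rankings can be concatenated by priority afterwards.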

In another alternative embodiment, determining the document search result corresponding to the search data based on the reference ranking results respectively corresponding to the search combinations comprises: obtaining a preset sorting number of target variant documents from each reference ranking result, and obtaining the matching segment corresponding to the search combination in each target variant document; for each target variant document, inputting the matching segment corresponding to the target variant document and the search combination into a pre-trained evidence classification model to obtain the matching probability corresponding to the target variant document; and sorting the target variant documents in descending order of matching probability to obtain the document search result corresponding to the search combination.

The preset sorting quantity is, for example, 100, and is not limited here, and the user can customize the setting according to actual needs. Specifically, the matching segment is used for characterizing the document segment in which each retrieval entity in the retrieval combination is recorded.

In an alternative embodiment, the evidence classification model is built on the BioLinkBERT pre-trained model. The BioLinkBERT model takes the last-layer output at the first token position (the [CLS] position) as the vector representation of the matching segment, and this vector representation is input into a fully connected layer for classification to obtain the matching probability. The activation function used by the BioLinkBERT model is, for example, a sigmoid function.
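The classification head and the subsequent re-ranking can be illustrated without running an encoder. The sketch below is an assumption-laden toy: the CLS vectors, the weights and the bias are made-up numbers standing in for BioLinkBERT outputs and trained parameters, and the names `matching_probability` and `rerank_by_evidence` are hypothetical.

```python
import math


def matching_probability(cls_vector, weights, bias):
    """Fully connected layer followed by a sigmoid, applied to a CLS vector."""
    logit = sum(w * x for w, x in zip(weights, cls_vector)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def rerank_by_evidence(cls_vectors, weights, bias):
    """Sort target variant documents in descending order of matching probability."""
    scores = {
        doc: matching_probability(vec, weights, bias)
        for doc, vec in cls_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy stand-ins for the encoder output of each matching segment.
toy_cls_vectors = {"document 1": [0.2, -0.1], "document 2": [1.5, 0.3]}
reranked = rerank_by_evidence(toy_cls_vectors, weights=[1.0, 0.5], bias=0.0)
```

In a real system the vectors would come from the encoder and the layer parameters from training; only the shape of the computation is shown here.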

The method has the advantages that according to the business requirements of genetic interpretation, the importance of the supporting evidence on the variant literature retrieval is considered, the retrieved variant literatures are accurately sorted, and the efficiency of literature retrieval is further improved.

On the basis of the foregoing embodiment, optionally, the method further includes: outputting the document search result together with the matching segments respectively corresponding to the target variant documents in the document search result. This has the advantage of helping researchers perform genetic interpretation work more efficiently.

According to the technical solution of this embodiment, a document knowledge graph is constructed in advance, the document knowledge graph comprising association relationships between at least one preset variant document and at least one variation type and preset weight values respectively corresponding to the association relationships. A retrieval combination set is constructed based on at least one retrieval entity in the received retrieval data. For each retrieval combination in the retrieval combination set, at least one reference variant document corresponding to the retrieval combination and a retrieval weight value corresponding to each reference variant document are determined based on the pre-constructed document knowledge graph, and the reference variant documents are sorted based on the retrieval weight values to obtain a reference sorting result corresponding to the retrieval combination. A document retrieval result corresponding to the retrieval data is then determined based on each reference sorting result.

FIG. 3 is a flowchart of another variant document retrieval method according to an embodiment of the present invention; this embodiment further details the method for constructing the document knowledge graph in the above embodiment. As shown in FIG. 3, the method includes:

S210, for each preset variant document in a preset variant document set, obtaining at least one variation type that has an association relationship with the preset variant document.

Specifically, the preset variant document set includes a plurality of preset variant documents.

In an optional embodiment, obtaining the at least one variation type associated with the preset variant document comprises: acquiring at least two preset entities corresponding to the preset variant document, and constructing at least one preset entity pair based on the preset entities; for each preset entity pair, inputting the preset entity pair and the entity fragment corresponding to the preset entity pair in the preset variant document into a pre-trained relation extraction model to obtain an output entity relationship; and determining at least one variation type associated with the preset variant document based on the entity relationships respectively corresponding to the preset entity pairs. The preset entity pairs include at least one entity pair consisting of a preset gene and a preset disease and entity pairs each consisting of two of a preset gene, a preset amino acid, and a preset nucleotide.

In an optional embodiment, acquiring at least two preset entities corresponding to the preset variant document includes: acquiring the at least two preset entities by using a variant entity recognition tool. The variant entity recognition tool may be, for example, the tmVar entity recognition tool; the variant entity recognition tool is not limited herein.

In another alternative embodiment, acquiring at least two preset entities corresponding to the preset variant document includes: inputting the preset variant document into a pre-trained entity recognition model to obtain at least two output preset entities.

Illustratively, the tmVar entity recognition tool is first used to label entity data in 10000 variant documents; the reference entity set preliminarily labeled by the tmVar entity recognition tool is then imported into the doccano text labeling tool, and professionals manually review the entities to obtain standard entity sets respectively corresponding to the 10000 variant documents. The standard entity sets are used for training the entity recognition model.

In one embodiment, the model structure of the target entity recognition model is the BERT + Efficient GlobalPointer model.

This has the advantage that the BERT + Efficient GlobalPointer model judges the head and the tail of an entity as a whole and therefore takes a global view, and that, compared with other model architectures, it greatly reduces the number of model parameters and lowers the risk of overfitting.

Specifically, a preset entity pair is an entity pair formed by two of the at least two preset entities in the preset variant document. In this embodiment, the preset entity pairs include at least one entity pair consisting of a preset gene and a preset disease and entity pairs each consisting of two of the preset gene, the preset amino acid and the preset nucleotide.

Wherein, the entity types of the preset entities include four types: gene, disease, amino acid and nucleotide. For example, assuming that the preset entities include gene 1, disease 1, disease 2, amino acid 1, amino acid 2, and nucleotide 1, when the variation type includes GD variation, the preset entity pairs include gene 1 & disease 1 and gene 1 & disease 2, and when the variation type includes GPN variation, the preset entity pairs include gene 1 & amino acid 1, gene 1 & nucleotide 1, gene 1 & amino acid 2, amino acid 1 & nucleotide 1, and amino acid 2 & nucleotide 1.
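The pair construction in this example can be sketched with `itertools.product`. The function name `build_entity_pairs` and the grouping of entities into four lists by type are assumptions made for illustration.

```python
from itertools import product


def build_entity_pairs(genes, diseases, amino_acids, nucleotides):
    """Return (GD candidate pairs, GPN candidate pairs) for one document."""
    # GD candidates: every gene paired with every disease.
    gd_pairs = list(product(genes, diseases))
    # GPN candidates: every two-entity pairing across gene, amino acid
    # and nucleotide.
    gpn_pairs = (
        list(product(genes, amino_acids))
        + list(product(genes, nucleotides))
        + list(product(amino_acids, nucleotides))
    )
    return gd_pairs, gpn_pairs


gd_pairs, gpn_pairs = build_entity_pairs(
    genes=["gene 1"],
    diseases=["disease 1", "disease 2"],
    amino_acids=["amino acid 1", "amino acid 2"],
    nucleotides=["nucleotide 1"],
)
```

For the entities of the example, this yields the two gene and disease pairs and the five pairs listed for the GPN case.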

In this embodiment, the variation type includes GD variation and/or GPN variation, wherein GD variation represents a combination of a predetermined gene and a predetermined disease, and GPN variation represents a combination of a predetermined gene, a predetermined amino acid and a predetermined nucleotide.

Specifically, the entity fragment is used for characterizing a document fragment in which two preset entities in a preset entity pair are recorded.

In an optional embodiment, the relation extraction model uses the PURE model, a pipeline approach. The encoder in the PURE model may be a BERT model, which encodes the preset entity pair and the entity fragment and inputs the encoding vector into a linear transformation layer; the linear transformation layer outputs the entity relationship of the preset entity pair based on the encoding vector. The linear transformation layer is, for example, LayerNorm + dropout + classifier.

Specifically, the entity relationship may represent whether a relationship exists between two preset entities in the preset entity pair, and certainly, the entity relationship may also represent a relationship type between two preset entities in the preset entity pair. In this embodiment, for example, when the preset entity pair is an entity pair consisting of a preset gene and a preset disease, the entity relationship may be "related disease", when the preset entity pair is a preset gene and a preset amino acid, the entity relationship may be "amino acid variation", when the preset entity pair is a preset gene and a preset nucleotide, the entity relationship may be "nucleotide variation", and when the preset entity pair is a preset amino acid and a preset nucleotide, the entity relationship may be "nucleotide variation".

In an optional embodiment, determining at least one variant type associated with the preset variant document based on the entity relationship respectively corresponding to each preset entity pair includes: when the preset entity pair is an entity pair consisting of a preset gene and a preset disease, constructing GD mutation associated with a preset mutation document based on the preset entity pair under the condition that the entity relationship of the preset entity pair is an existing relationship; and when each preset entity pair comprises an entity pair formed by two preset genes, preset amino acids and preset nucleotides, constructing GPN variation associated with preset variation documents based on the preset genes, the preset amino acids and the preset nucleotides under the condition that entity relations respectively corresponding to the entity pairs formed by two preset genes, the preset amino acids and the preset nucleotides are all existing relations.

For example, when the preset entity pair of document 1 is gene 1 & disease 1, assuming that the entity relationship output by the target relationship extraction model is 1 or "related disease", the variation types associated with the preset variant document include the GD variation "G1+D1". When the preset entity pairs of document 1 include gene 1 & amino acid 1, gene 1 & nucleotide 1, amino acid 1 & nucleotide 1, gene 1 & amino acid 2, and amino acid 2 & nucleotide 1, assuming that the entity relationships output by the target relationship extraction model are 1, 0, and 1, respectively, the variation types associated with the preset variant document include the GPN variation "G1+P1+N1".
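The construction rule above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `build_variant_types`, the string entity labels, and the representation of the model output as a set of related pairs are all assumptions made for the example.

```python
from itertools import product


def build_variant_types(genes, diseases, amino_acids, nucleotides, related):
    """Build GD and GPN variations from pairwise relation results.

    `related` is a set of frozenset({a, b}) pairs for which the relation
    extraction model reported an existing relationship (hypothetical format).
    """
    def has(a, b):
        return frozenset((a, b)) in related

    # GD variation: the gene-disease pair must be related.
    gd = [(g, d) for g, d in product(genes, diseases) if has(g, d)]
    # GPN variation: all three pairwise relations must exist.
    gpn = [
        (g, p, n)
        for g, p, n in product(genes, amino_acids, nucleotides)
        if has(g, p) and has(g, n) and has(p, n)
    ]
    return gd, gpn


# Toy relation results: P2 is related to G1 but not to N1.
related_pairs = {
    frozenset(("G1", "D1")),
    frozenset(("G1", "P1")),
    frozenset(("G1", "N1")),
    frozenset(("P1", "N1")),
    frozenset(("G1", "P2")),
}
gd_variations, gpn_variations = build_variant_types(
    ["G1"], ["D1"], ["P1", "P2"], ["N1"], related_pairs
)
```

With these toy inputs, a GPN variation is only produced for the triple whose three pairwise relations all exist, so amino acid 2 does not yield a variation.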

S220, determining, for each variant type, a preset weight value of the association relationship between the preset variant document and the variant type according to the occurrence position of the variant type in the preset variant document.

Wherein, for example, the preset weight value $w_{i,j}$ of the association relationship between the $i$-th variant type and the $j$-th preset variant document satisfies a formula that combines a position weight with a term frequency and an inverse document frequency. The formula itself is not legible in this text; a form consistent with the definitions below (with reconstructed symbols) is:

$$w_{i,j} = p_{i,j} \cdot tf_{i,j} \cdot idf_i, \qquad tf_{i,j} = \frac{n_{i,j}}{N_j}, \qquad idf_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$

Wherein $p_{i,j}$ indicates the weight value corresponding to the position at which the preset entities corresponding to the $i$-th variant type appear in the $j$-th preset variant document. In an exemplary case, the weight value of the title position and the keyword position is 3, the weight value of the abstract position is 2, and the weight value of the chart position and the text position is 1. Specifically, when the preset entities appear at a plurality of positions of the preset variant document, the position with the largest weight value is selected as the occurrence position of the variant type.

Wherein $tf_{i,j}$ represents the frequency with which the preset entities corresponding to the $i$-th variant type occur in the $j$-th preset variant document, $n_{i,j}$ indicates the number of occurrences of the $i$-th variant type in the $j$-th preset variant document, $N_j$ indicates the total number of variant types corresponding to the $j$-th preset variant document, $idf_i$ indicates the inverse document frequency corresponding to the $i$-th variant type, $|D|$ represents the number of preset variant documents in the preset variant document set, and $|\{j : t_i \in d_j\}|$ indicates the number of preset variant documents that include the $i$-th variant type.

Wherein, specifically, the higher the frequency with which the $i$-th variant type occurs in one preset variant document and the lower the frequency with which it occurs in the other preset variant documents, the more important the $i$-th variant type is to the $j$-th preset variant document, and correspondingly, the larger the preset weight value of the association relationship between the $i$-th variant type and the $j$-th preset variant document.
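As a rough illustration of the weighting described above, the following Python sketch combines the exemplary position weights (3, 2, 1) with a term frequency and an inverse document frequency. The multiplicative combination, the natural logarithm, and all function and parameter names are assumptions for the example; the text does not confirm the exact form.

```python
import math

# Exemplary position weights taken from the description.
POSITION_WEIGHT = {"title": 3, "keyword": 3, "abstract": 2, "chart": 1, "text": 1}


def preset_weight(positions, occurrences, types_in_doc, total_docs, docs_with_type):
    """Assumed form: position weight * term frequency * inverse document frequency."""
    # When the entities appear at several positions, the largest weight is used.
    position = max(POSITION_WEIGHT[p] for p in positions)
    tf = occurrences / types_in_doc
    idf = math.log(total_docs / docs_with_type)
    return position * tf * idf


# A variant type seen in the title and the abstract, 2 occurrences out of 4,
# present in 10 of 100 documents (toy numbers).
example_weight = preset_weight(["abstract", "title"], 2, 4, 100, 10)
```

Under this assumed form, a variant type that is frequent in one document but rare across the document set receives a larger weight, which matches the qualitative behaviour described in the text.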

S230, constructing a document knowledge graph based on the association relationship between each preset variant document and at least one variant type and the preset weight value corresponding to each association relationship.

The document knowledge graph is built with the Neo4j tool. Neo4j is stable and easy to use, supports data storage, retrieval and processing, and can hold hundreds of millions of nodes, relationships and attributes, so it is highly practical.

Specifically, each preset variant document is added to the document knowledge graph. For each variant type corresponding to the preset variant document, it is judged whether the variant type already exists in the document knowledge graph; if not, the variant type is first added to the document knowledge graph. In either case, an association relationship between the preset variant document and the variant type is then built in the document knowledge graph, and the preset weight value corresponding to the association relationship is added.
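The insertion logic above can be sketched with a small in-memory structure. This is only an illustrative stand-in for the Neo4j graph, not Neo4j API code; the class name `DocumentKnowledgeGraph` and its methods are hypothetical.

```python
class DocumentKnowledgeGraph:
    """In-memory stand-in: document nodes, variant type nodes, weighted edges."""

    def __init__(self):
        self.documents = set()
        self.variant_types = set()
        self.edges = {}  # (document, variant_type) -> preset weight value

    def add_document(self, document, weighted_types):
        """Add a document and link it to each of its variant types."""
        self.documents.add(document)
        for variant_type, weight in weighted_types.items():
            # Create the variant type node only if it does not exist yet.
            if variant_type not in self.variant_types:
                self.variant_types.add(variant_type)
            self.edges[(document, variant_type)] = weight

    def remove_document(self, document):
        """Delete a document, its edges, and variant types left without edges."""
        self.documents.discard(document)
        self.edges = {k: w for k, w in self.edges.items() if k[0] != document}
        self.variant_types &= {variant_type for _, variant_type in self.edges}


graph = DocumentKnowledgeGraph()
graph.add_document("document 1", {"G1+D1": 0.51})
graph.add_document("document 2", {"G1+D1": 0.9})
```

Both documents share the single variant type node "G1+D1", while each keeps its own weighted association relationship.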

On the basis of the foregoing embodiment, optionally, the method further includes: and in response to detecting an updating instruction of the preset variant literature set, updating the literature knowledge graph based on the updating data corresponding to the preset variant literature set. For example, the update operation includes, but is not limited to, building a new document node, building a new mutation type node, building a new association relationship, deleting an old document node, deleting an old mutation type node, deleting an old association relationship, and the like.

This has the advantage that the document knowledge graph is kept up to date, which in turn ensures the timeliness and accuracy of the retrieval results.

S240, in response to receiving the retrieval data, constructing a retrieval combination set based on at least one retrieval entity in the retrieval data.

S250, for each retrieval combination in the retrieval combination set, determining at least one reference variant document corresponding to the retrieval combination and a retrieval weight value corresponding to each reference variant document based on a pre-constructed document knowledge graph.

S260, sorting the reference variant documents based on the retrieval weight values to obtain a reference sorting result corresponding to the retrieval combination.

S270, determining the document retrieval result corresponding to the retrieval data based on the reference sorting result corresponding to each retrieval combination.

According to the technical scheme, at least one variation type associated with the preset variation documents is obtained for each preset variation document in the preset variation document set, a preset weight value of an association relation between the preset variation document and the variation type is determined for each variation type based on the occurrence position of the variation type in the preset variation document, and a document knowledge graph is constructed based on the association relation between each preset variation document and at least one variation type and the preset weight value corresponding to each association relation.

On the basis of the foregoing embodiment, optionally, before constructing at least one preset entity pair based on each preset entity, the method further includes: adopting a preset alignment strategy, and respectively executing alignment operation on each preset entity based on each standard entity in the standard database to obtain at least one aligned preset entity; the preset alignment strategy comprises at least one of a preset alignment algorithm, a first mapping list and a vector similarity algorithm, and the first mapping list represents at least one preset entity corresponding to each standard entity in the standard database.

Exemplarily, the standard databases include, but are not limited to, HGNC, OMIM, ClinVar and other specialized databases.

In an optional embodiment, the performing, by using a preset alignment policy, an alignment operation on each preset entity based on each standard entity in the standard database to obtain at least one aligned preset entity includes: aiming at each preset entity, adopting a preset alignment algorithm, and respectively performing alignment operation on each preset entity based on each standard entity in a standard database to obtain at least one aligned preset entity; and/or querying the first mapping list based on the preset entity to obtain the aligned preset entity; the first mapping list represents at least one preset entity corresponding to each standard entity in the standard database; and/or respectively inputting the preset entity and each standard entity in the standard database into a pre-trained target semantic extraction model to obtain a preset vector corresponding to the output preset entity and a standard vector corresponding to each standard entity; and determining the aligned preset entity based on the preset vector and each standard vector.

In one embodiment, the preset alignment algorithm may be a regular expression matching algorithm, which is not limited herein.

In another embodiment, specifically, the first mapping list includes a plurality of standard entities and at least one preset entity corresponding to each standard entity. For example, when the standard entity is the amino acid p.Glu615Gly, the preset entities corresponding to the standard entity include, but are not limited to, E615G, Glu615Gly, p.E615G, p.Glu615G, and the like. When the standard entity is the gene GJB2, the preset entities corresponding to the standard entity include, but are not limited to, gap junction protein beta 2, DFNB1, CX26, and the like.

In another embodiment, illustratively, a cosine distance algorithm is adopted to obtain vector similarities corresponding to the preset vector and each standard vector, and a standard entity corresponding to the standard vector with the highest vector similarity is taken as a target standard entity corresponding to the preset entity, or a standard entity corresponding to the standard vector with the vector similarity exceeding a similarity threshold is taken as a target standard entity corresponding to the preset entity.

In an optional embodiment, the model architecture of the target semantic extraction model is a BioLinkBERT pre-training model. The BioLinkBERT pre-training model is trained on PubMed documents with citation links. Compared with a general-domain pre-trained language model, a pre-trained language model oriented to the medical domain can better capture the semantic features of medical entities.

On the basis of the foregoing embodiment, optionally, the method further includes: performing fine-tuning training on the target semantic extraction model by using the contrastive learning method SimCSE to obtain the fine-tuned target semantic extraction model. This has the advantage that preset entities and standard entities with similar semantics lie closer together in the vector space, while those with different semantics lie farther apart, so that the accuracy of aligning the preset entities with the standard database is further improved.

In another embodiment, specifically, the alignment strategies are applied in sequence. First, it is determined whether a standard entity in the standard database is matched by the preset alignment algorithm. If not, it is determined whether a standard entity aligned with the preset entity can be found in the first mapping list. If not, the preset entity and each standard entity in the standard database are respectively input into the pre-trained target semantic extraction model to obtain the preset vector corresponding to the preset entity and the standard vector corresponding to each standard entity, and the target standard entity aligned with the preset entity is determined based on the preset vector and each standard vector.

Fig. 4 is a flowchart illustrating an embodiment of a method for identifying and aligning preset entities according to an embodiment of the present invention. Specifically, each preset variant document in the preset variant document set is input into a pre-trained target entity recognition model to obtain an output preset entity set. For each preset entity in the preset entity set, a regular expression is used to determine whether a standard entity aligned with the preset entity is matched; if yes, the matched standard entity is used as the aligned preset entity. If not, it is determined whether a standard entity aligned with the preset entity can be found in the first mapping list; if yes, the standard entity found is used as the aligned preset entity. If not, the preset entity and each standard entity in the standard database are respectively input into the pre-trained target semantic extraction model, the vector similarity between the output preset vector and each standard vector is calculated, and the standard entity corresponding to the standard vector whose vector similarity exceeds a similarity threshold is used as the target standard entity corresponding to the preset entity.
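The three-stage alignment flow described above (regular expression, then first mapping list, then vector similarity) might be sketched as follows. The function `align_entity`, the example pattern, the toy two-dimensional vectors standing in for the semantic extraction model, the alias "connexin 26", and the threshold of 0.8 are all assumptions for illustration.

```python
import math
import re


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_entity(entity, standards, patterns, mapping, embed, threshold=0.8):
    """Return the aligned standard entity, or None if no stage succeeds."""
    # Stage 1: regular expression matching.
    for standard, pattern in patterns.items():
        if re.fullmatch(pattern, entity):
            return standard
    # Stage 2: lookup in the first mapping list.
    for standard, aliases in mapping.items():
        if entity in aliases:
            return standard
    # Stage 3: vector similarity against every standard entity.
    vector = embed(entity)
    best = max(standards, key=lambda s: cosine(vector, embed(s)), default=None)
    if best is not None and cosine(vector, embed(best)) > threshold:
        return best
    return None


# Toy data; the vectors are hand-written stand-ins for model embeddings.
STANDARDS = ["GJB2", "p.Glu615Gly"]
PATTERNS = {"p.Glu615Gly": r"(p\.)?Glu615Gly"}
MAPPING = {"GJB2": {"gap junction protein beta 2", "DFNB1", "CX26"}}
TOY_VECTORS = {
    "GJB2": (1.0, 0.0),
    "p.Glu615Gly": (0.0, 1.0),
    "connexin 26": (0.9, 0.1),
    "unrelated term": (0.7, 0.7),
}


def toy_embed(text):
    return TOY_VECTORS[text]
```

Ordering the stages from cheapest to most expensive means the embedding model is only consulted for entities that neither the pattern nor the mapping list can resolve.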

Although the same entity expresses the same semantics in different variant documents, it may appear in multiple forms of expression. Performing the alignment operation on the preset entities therefore improves the normalization of the subsequently constructed document knowledge graph, and further improves the comprehensiveness and accuracy of subsequent retrieval results.

Fig. 5 is a flowchart of another variant document searching method according to an embodiment of the present invention, and the embodiment further details the determination method of the reference variant document and the search weight value in the above embodiment. As shown in fig. 5, the method includes:

S310, in response to receiving the retrieval data, constructing a retrieval combination set based on at least one retrieval entity in the retrieval data.

On the basis of the foregoing embodiment, optionally, before constructing the search combination set based on at least one search entity in the search data, the method further includes: and aiming at each retrieval entity in the retrieval data, performing alignment operation on the retrieval entities to obtain the aligned retrieval entities.

In this embodiment, the search data at least includes a search gene, and accordingly, the constructing of the search combination set based on at least one search entity in the search data includes: adding a retrieval gene in the retrieval data as a first retrieval combination into a retrieval combination set; in the case where the search data further includes a search amino acid, adding the search gene and the search amino acid as a second search combination to the search combination set; in the case where the search data further includes a search nucleotide, the search gene and the search nucleotide are added as a third search combination to the search combination set.

Fig. 6 is a flowchart of a method for determining a retrieval combination set according to an embodiment of the present invention. Specifically, the acquired retrieval data includes at least a retrieval gene, and the retrieval gene is used as the first retrieval combination. It is then judged whether the retrieval data further includes a retrieval amino acid; if yes, the retrieval gene and the retrieval amino acid are used as the second retrieval combination. In either case, it is next judged whether the retrieval data further includes a retrieval nucleotide; if yes, the retrieval gene and the retrieval nucleotide are used as the third retrieval combination, and if not, the method ends.

For example, assuming that the search data includes gene 1, disease 1, amino acid 1, and nucleotide 2, the search combination set includes [gene 1], [gene 1 amino acid 1], and [gene 1 nucleotide 2].
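The construction of the first, second and third retrieval combinations can be sketched as a short Python function. The function name and the (label, entities) tuple representation are assumptions chosen for the example.

```python
def build_combinations(gene, amino_acid=None, nucleotide=None):
    """Build the retrieval combination set from the retrieval entities."""
    # The retrieval gene alone is always the first combination.
    combinations = [("first", (gene,))]
    if amino_acid is not None:
        combinations.append(("second", (gene, amino_acid)))
    if nucleotide is not None:
        combinations.append(("third", (gene, nucleotide)))
    return combinations


example_combinations = build_combinations("gene 1", "amino acid 1", "nucleotide 2")
```

A retrieval disease, if present in the retrieval data, is simply not passed to this function, since the combinations in this embodiment are built from the gene, amino acid and nucleotide only.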

S320, for each retrieval combination in the retrieval combination set, determining a search variation set based on the retrieval combination.

In this embodiment, the search variation set includes at least one search variation type, and the search variation set is a search GD variation set or a search GPN variation set. Specifically, the search mutation type is a search GD mutation when the search mutation set is a search GD mutation set, and the search mutation type is a search GPN mutation when the search mutation set is a search GPN mutation set. Wherein the GD variation characterizes a combination of a predetermined gene and a predetermined disease, and the GPN variation characterizes a combination of a predetermined gene, a predetermined amino acid and a predetermined nucleotide.

In an alternative embodiment, determining the search variation set based on the search combination comprises: when the search combination set contains a first search combination, judging whether at least one query disease that has a relationship with the search gene in the first search combination exists in the document knowledge graph; if yes, determining a search GD variation set based on the search gene and each query disease; if not, determining a search GPN variation set based on the search gene; wherein the preset entities corresponding to each first GPN variation in the search GPN variation set include the search gene.

Specifically, the document knowledge graph is queried based on the search gene to obtain at least one query disease. Taking fig. 2 as an example, assuming that the search gene is gene 1, the query diseases include disease 1, disease 2, and disease 3, and accordingly, the search GD variation set includes G1+D1, G1+D2, and G1+D3. If the search gene is gene 4, no query disease that has a relationship with gene 4 exists in the document knowledge graph, and accordingly, the search GPN variation set includes G4+P2+N3.

In another alternative embodiment, determining a search variation set based on the search combination comprises: when the search combination set contains a second search combination, determining a search GPN variation set based on the second search combination, wherein the preset entities corresponding to each second GPN variation in the search GPN variation set include the search gene and the search amino acid; and when the search combination set contains a third search combination, determining a search GPN variation set based on the third search combination, wherein the preset entities corresponding to each third GPN variation in the search GPN variation set include the search gene and the search nucleotide.

Taking fig. 2 as an example, if the second search combination includes gene 1 and amino acid 1, the search GPN variation set includes G1+P1+N1 and G1+P1+N2, and if the third search combination includes gene 1 and nucleotide 3, the search GPN variation set includes G1+P2+N3.

Fig. 7 is a flowchart of a method for determining a search variation set according to an embodiment of the present invention. Specifically, each search combination is judged in turn. When the search combination set contains the first search combination ([ Gm ]), the document knowledge graph is queried for a query disease Dx corresponding to Gm; if one exists, the search GD variation Gm+Dx determined based on the search gene Gm and the query disease Dx is added to the search GD variation set; if not, the search GPN variation Gm+Px+Ny determined based on the search gene Gm is added to the search GPN variation set.

When the search combination set includes the second search combination ([ Gm Pn ]), the search GPN variation Gm + Pn + Nx determined based on the search gene Gm and the search amino acid Pn is added to the search GPN variation set, and when the search combination set includes the third search combination ([ Gm Nn ]), the search GPN variation Gm + Px + Nn determined based on the search gene Gm and the search nucleotide Nn is added to the search GPN variation set.

Wherein "m" and "n" represent specific entities in the retrieval data, and "x" and "y" represent preset entities queried based on the literature knowledge graph, and the entities are not limited.
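The determination of the search variation set for the first, second and third combinations can be sketched as below. The tuple representation of variations and the function name are assumptions; the toy variations mirror the G/D/P/N examples given in the text, since fig. 2 itself is not reproduced here.

```python
def search_variation_set(kind, combination, gd_variations, gpn_variations):
    """Return the GD or GPN variations in the graph that match a combination."""
    gene = combination[0]
    if kind == "first":
        # Prefer GD variations; fall back to GPN variations containing the gene.
        gd = [v for v in gd_variations if v[0] == gene]
        if gd:
            return gd
        return [v for v in gpn_variations if v[0] == gene]
    if kind == "second":
        amino_acid = combination[1]
        return [v for v in gpn_variations if v[0] == gene and v[1] == amino_acid]
    if kind == "third":
        nucleotide = combination[1]
        return [v for v in gpn_variations if v[0] == gene and v[2] == nucleotide]
    raise ValueError(f"unsupported combination kind: {kind}")


# Toy graph content taken from the examples in the text.
GD = [("G1", "D1"), ("G1", "D2"), ("G1", "D3")]
GPN = [("G1", "P1", "N1"), ("G1", "P1", "N2"), ("G1", "P2", "N3"), ("G4", "P2", "N3")]
```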

S330, determining at least one reference variant literature and retrieval weight values corresponding to the reference variant literatures on the basis of the literature knowledge map and the retrieval variant set.

In an optional embodiment, determining at least one reference variant document and a retrieval weight value corresponding to each reference variant document based on the document knowledge graph and the retrieval variant set includes: aiming at each retrieval variation type in the retrieval variation set, taking the incidence relation corresponding to the retrieval variation type in the document knowledge graph as a matching incidence relation, and taking a preset variation document corresponding to each matching incidence relation as a reference variation document; and acquiring at least one matching incidence relation corresponding to each reference variation document according to each reference variation document, and determining a retrieval weight value corresponding to each reference variation document based on a preset weight value corresponding to each matching incidence relation in the document knowledge graph.

In one embodiment, when the number of matching associations corresponding to the reference variant document is one, a preset weight value corresponding to the matching associations is used as a retrieval weight value corresponding to the reference variant document.

In another embodiment, when the number of the matching association relations corresponding to the reference variant documents is at least two, the statistical value of the preset weight value corresponding to each matching association relation is used as the retrieval weight value corresponding to the reference variant documents. The statistical value is, for example, a maximum value, a minimum value, a median value, or a mean value, and the like, and the statistical value is not limited herein.

Taking fig. 2 as an example, assuming that the first search combination is gene 1 and disease 1, the search GD variation set includes G1+D1, the reference variant documents include document 1 and document 2, and the search weight values of document 1 and document 2 are 0.51 and 0.9, respectively. Assuming that the second search combination is gene 1 and amino acid 1, the search GPN variation set includes G1+P1+N1 and G1+P1+N2, the reference variant documents include document 1, document 2, and document 3, and the search weight values corresponding to document 1, document 2, and document 3 are 0.7, 0.46, and 0.5, respectively.

Assuming that the search variation set includes G1+P2+N3 and G4+P2+N3, and the statistical value is the mean value, the reference variant documents include document 3. Since the preset weight values corresponding to the two matching association relationships of document 3 are 0.8 and 0.91, respectively, the search weight value corresponding to document 3 is their mean, 0.855.
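The aggregation of preset weight values into a retrieval weight value, followed by sorting, can be sketched as follows. The edge dictionary format, the function names, and the choice of Python's `statistics` module are assumptions for the example; the weights are the toy values quoted in the text.

```python
from statistics import mean, median

STATISTICS = {"max": max, "min": min, "median": median, "mean": mean}


def retrieval_weights(search_variations, edges, statistic="mean"):
    """Aggregate the preset weights of all matching associations per document."""
    matched = {}
    for (document, variant_type), weight in edges.items():
        if variant_type in search_variations:
            matched.setdefault(document, []).append(weight)
    aggregate = STATISTICS[statistic]
    # A single matching association yields its own weight under every statistic.
    return {document: aggregate(weights) for document, weights in matched.items()}


def rank(weights):
    """Sort documents by retrieval weight, largest first."""
    return sorted(weights, key=weights.get, reverse=True)


EDGES = {
    ("document 1", "G1+D1"): 0.51,
    ("document 2", "G1+D1"): 0.9,
    ("document 3", "G1+P2+N3"): 0.8,
    ("document 3", "G4+P2+N3"): 0.91,
}
gpn_weights = retrieval_weights({"G1+P2+N3", "G4+P2+N3"}, EDGES)
gd_ranking = rank(retrieval_weights({"G1+D1"}, EDGES))
```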

S340, sorting the reference variant documents based on the retrieval weight values to obtain a reference sorting result corresponding to the retrieval combination.

S350, determining a document retrieval result corresponding to the retrieval data based on the reference sorting result corresponding to each retrieval combination.

In an optional embodiment, the number of combinations of the retrieval combinations in the retrieval combination set is at least two, and accordingly, before determining the document retrieval result corresponding to the retrieval data based on the reference ranking result corresponding to each retrieval combination, the method further comprises: under the condition that at least one repeated variant document exists in each reference sorting result, deleting the repeated variant document from the reference sorting results corresponding to the retrieval combination with lower priority aiming at each repeated variant document to obtain the screened reference sorting result; or deleting the repeated variant documents with smaller retrieval weight values from the corresponding reference sorting results to obtain the screened reference sorting results.

In an alternative embodiment, the priorities of the first retrieval combination, the third retrieval combination and the second retrieval combination increase in that order.

For example, if the results of the reference ranking after screening corresponding to the first search combination, the second search combination, and the third search combination are [ document 1 document 2], [ document 3 document 4 document 5], and [ document 6], respectively, the result of the document search is [ document 3 document 4 document 5 document 6 document 1 document 2].
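The screening and merging of reference sorting results by priority can be sketched as below. This illustrates only the first screening option (removing a repeated document from the lower-priority result); the function name and the dictionary layout are assumptions.

```python
# Priority order from highest to lowest, as stated in this embodiment.
PRIORITY = ["second", "third", "first"]


def merge_results(ranked_by_combination, priority=PRIORITY):
    """Concatenate rankings by priority, keeping each document's first occurrence."""
    seen = set()
    merged = []
    for kind in priority:
        for document in ranked_by_combination.get(kind, []):
            # A repeated document survives only in the higher-priority ranking.
            if document not in seen:
                seen.add(document)
                merged.append(document)
    return merged


merged_example = merge_results({
    "first": ["document 1", "document 2", "document 3"],
    "second": ["document 3", "document 4", "document 5"],
    "third": ["document 6"],
})
```

Here document 3 appears under both the first and the second combination and is kept only in the higher-priority second ranking.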

In another alternative embodiment, determining the document search result corresponding to the search data based on the reference ranking result respectively corresponding to each search combination comprises: respectively acquiring target variant documents with preset sorting quantity from each reference sorting result, and acquiring matched segments corresponding to the retrieval combination in each target variant document; aiming at each target variant document, inputting the matching segments corresponding to the target variant document and the retrieval combination into an evidence classification model trained in advance to obtain the matching probability corresponding to the target variant document; and performing descending sorting on each target variant document based on each matching probability to obtain a document retrieval result corresponding to the retrieval combination.

The technical solution of this embodiment is to determine a search variance set based on a search combination by using a search gene in search data as a first search combination, using the search gene and the search amino acid as a second search combination when the search data further includes the search amino acid, and using the search gene and the search nucleotide as a third search combination when the search data further includes the search nucleotide, and determining a search weight value corresponding to each of at least one reference variance document and each of the reference variance documents based on a document knowledge map and the search variance set.

Fig. 8 is a flowchart of another variant document retrieval method according to an embodiment of the present invention, which further details the reference sorting result in the above embodiment. As shown in fig. 8, the method includes:

S410, in response to receiving the retrieval data, constructing a retrieval combination set based on at least one retrieval entity in the retrieval data, and sequentially obtaining each retrieval combination in the retrieval combination set.

In this embodiment, the search data includes at least a search gene and a search transcript, and accordingly, constructing a search combination set based on at least one search entity in the search data further includes: in the case where the search data further includes a search amino acid, adding the search gene, the search amino acid, and the search transcript as a fourth search combination to the search combination set; in the case where the search data does not include a search amino acid but includes a search nucleotide, adding the search gene, the search nucleotide, and the search transcript as a fifth search combination to the search combination set.

Fig. 9 is a flowchart of another method for determining a search combination set according to an embodiment of the present invention, specifically, the obtained search data at least includes a search gene, the search gene is used as a first search combination, whether the search data further includes a search amino acid is determined, and if yes, the search gene and the search amino acid are used as a second search combination. On the one hand, whether the retrieval data further includes the retrieval transcript is continuously judged, and if the retrieval transcript is further included, the retrieval gene, the retrieval amino acid and the retrieval transcript are used as a fourth retrieval combination. On the other hand, it is continuously judged whether the search data further includes a search nucleotide, and if the search nucleotide is further included, the search gene and the search nucleotide are regarded as a third search combination, and if the search nucleotide is not included, the process is ended.

And if the retrieval data does not comprise retrieval amino acids, continuing to judge whether the retrieval data further comprises retrieval nucleotides, if not, ending, if yes, using the retrieval gene and the retrieval nucleotides as a third retrieval combination, and continuing to judge whether the retrieval data further comprises retrieval transcripts, and if the retrieval transcripts are further included, using the retrieval gene, the retrieval nucleotides and the retrieval transcripts as a fifth retrieval combination. If the search transcript is not included, then it ends.

For example, assuming that the search data includes gene 1, amino acid 1, nucleotide 2, and transcript 1, the search combination set includes [gene 1], [gene 1 amino acid 1], [gene 1 amino acid 1 transcript 1], and [gene 1 nucleotide 2]. If the search data includes gene 1, nucleotide 2 and transcript 1, the search combination set includes [gene 1], [gene 1 nucleotide 2] and [gene 1 nucleotide 2 transcript 1].
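Following the flow of fig. 9 as described above, the extended combination set with a search transcript might be built as in this sketch. The function name and the (label, entities) representation are assumptions, and the branch structure is one reading of the described flow.

```python
def build_combinations_with_transcript(gene, amino_acid=None, nucleotide=None,
                                       transcript=None):
    """Build the first to fifth search combinations following the fig. 9 flow."""
    combinations = [("first", (gene,))]
    if amino_acid is not None:
        combinations.append(("second", (gene, amino_acid)))
        if transcript is not None:
            combinations.append(("fourth", (gene, amino_acid, transcript)))
        if nucleotide is not None:
            combinations.append(("third", (gene, nucleotide)))
    elif nucleotide is not None:
        combinations.append(("third", (gene, nucleotide)))
        # The fifth combination is only built when no amino acid is given.
        if transcript is not None:
            combinations.append(("fifth", (gene, nucleotide, transcript)))
    return combinations


with_amino_acid = build_combinations_with_transcript(
    "gene 1", "amino acid 1", "nucleotide 2", "transcript 1"
)
without_amino_acid = build_combinations_with_transcript(
    "gene 1", nucleotide="nucleotide 2", transcript="transcript 1"
)
```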

S420, judging whether the retrieval combination is a fourth retrieval combination or a fifth retrieval combination; if so, executing S430, and if not, executing S450.

S430, determining a search standard variation set based on the search combination.

In this embodiment, each search standard variation in the search standard variation set represents a combination of a preset gene, a preset amino acid, a preset transcript (Sequence) and a preset nucleotide. In the standard database map, the variant name of a standard variation is composed of a preset gene, a preset amino acid, a preset transcript and a preset nucleotide, and each standard variation has a specific number. For example, a standard variation has the variant name "NM_004004.6(GJB2):c.670A>C (p.Lys224Gln)" and the number 44765.

Specifically, when the search combination is the fourth search combination, each preset entity corresponding to each search standard variation in the search standard variation set comprises a search gene, a search amino acid and a search transcript, and when the search combination is the fifth search combination, each preset entity corresponding to each search standard variation in the search standard variation set comprises a search gene, a search nucleotide and a search transcript.

In an optional embodiment, after determining the search criterion variation set based on the search combination, the method further comprises: when the retrieval combination is a fifth retrieval combination, acquiring a target HGVS variation corresponding to the retrieval nucleotide and the retrieval transcript in the fifth retrieval combination under the condition that the retrieval standard variation set is an empty set; based on the second mapping list, adding a standard variation corresponding to the target HGVS variation as a search standard variation into a search standard variation set; the second mapping list represents a mapping relation between at least one preset HGVS variation and a standard variation, and the preset HGVS variation represents a combination form of a preset nucleotide and a preset transcript.

Specifically, the standard database provides a second mapping list, and when the search standard variation is not queried through the fifth search combination, the target HGVS variation can be uniquely located through the search nucleotide and the search transcript in the fifth search combination, and the search standard variation corresponding to the fifth search combination is queried through the second mapping list.

This has the advantage of solving the problem that no search standard variation can be found by querying with the fifth search combination, thereby further ensuring the detection rate of the subsequent standard database search engine.
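The fallback through the second mapping list can be sketched as follows. The record fields, the "transcript:nucleotide" key format for the HGVS variation, and the function name are assumptions for illustration; the example identifier is the one quoted in the text.

```python
def search_standard_variations(gene, nucleotide, transcript,
                               standard_variations, hgvs_to_standard):
    """Query standard variations for a fifth combination, with HGVS fallback."""
    found = [
        record["number"]
        for record in standard_variations
        if record["gene"] == gene
        and record["nucleotide"] == nucleotide
        and record["transcript"] == transcript
    ]
    if found:
        return found
    # Empty result: locate the HGVS variation from the transcript and nucleotide,
    # then look it up in the second mapping list.
    hgvs = f"{transcript}:{nucleotide}"
    number = hgvs_to_standard.get(hgvs)
    return [number] if number is not None else []


SECOND_MAPPING = {"NM_004004.6:c.670A>C": 44765}
fallback_hit = search_standard_variations(
    "GJB2", "c.670A>C", "NM_004004.6", [], SECOND_MAPPING
)
```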

S440, determining a reference sorting result corresponding to the retrieval combination based on the retrieval standard variation set by adopting a standard database search engine, and executing S470.

Specifically, the standard database search engine is configured to perform a retrieval operation on a standard database map provided by a standard database, and output a reference ranking result. Illustratively, the standard database search engine may be a search engine provided by the ClinVar database.

S450, determining at least one reference variant document corresponding to the retrieval combination and a retrieval weight value corresponding to each reference variant document based on a document knowledge graph constructed in advance.

S460, sorting the reference variant documents based on the retrieval weight values to obtain a reference sorting result corresponding to the retrieval combination.

S470, determining the document retrieval result corresponding to the retrieval data based on the reference sorting result corresponding to each retrieval combination.
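Steps S450 and S460 can be illustrated by a minimal sketch. It assumes that the retrieval weight values are already available as a dictionary from document identifier to weight, that a larger weight ranks higher, and that ties are broken by document identifier; the tie-breaking rule and the name `rank_reference_documents` are assumptions.

```python
from typing import Dict, List


def rank_reference_documents(retrieval_weights: Dict[str, float]) -> List[str]:
    """Sort document IDs by retrieval weight, highest first (assumed order).

    Ties are broken by document ID so that the result is deterministic.
    """
    return sorted(retrieval_weights, key=lambda doc: (-retrieval_weights[doc], doc))


weights = {"doc_a": 1.0, "doc_b": 3.0, "doc_c": 2.0}  # hypothetical values
ranking = rank_reference_documents(weights)
```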

In an alternative embodiment, the fourth retrieval combination and the fifth retrieval combination have the highest priority, and the priorities of the second retrieval combination, the third retrieval combination and the first retrieval combination are respectively reduced in sequence.

Fig. 10 is a flowchart illustrating an embodiment of a variant document retrieval method according to an embodiment of the present invention. Specifically, entity identification is performed on each preset variant document in a preset variant document set to obtain preset entities, and relation extraction is performed on the preset entity pairs constructed from the preset entities to obtain the preset entity pairs that have a relationship. At least one variation type associated with the preset variant document is determined based on the entity relationships respectively corresponding to the preset entity pairs. For each variation type, a preset weight value of the association relationship between the preset variant document and the variation type is determined based on the occurrence position of the variation type in the preset variant document. A document knowledge graph is then constructed based on the association relationship between each preset variant document and at least one variation type and the preset weight value corresponding to each association relationship.
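The graph construction described above can be illustrated by a minimal sketch. The position names and weight values in `POSITION_WEIGHTS` are hypothetical, since the actual values are not stated here, and keeping the largest weight when a variation type occurs at several positions is likewise an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical position weights; the actual values are not given in the text.
POSITION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}


def build_document_knowledge_graph(
    occurrences: Iterable[Tuple[str, str, str]],
) -> Dict[str, Dict[str, float]]:
    """Map each variation type to {document ID: preset weight value}.

    Each occurrence is (document ID, variation type, position in document).
    """
    graph: Dict[str, Dict[str, float]] = defaultdict(dict)
    for doc_id, variation_type, position in occurrences:
        weight = POSITION_WEIGHTS[position]
        # Assumption: when a variation type occurs at several positions,
        # the association relationship keeps the largest weight.
        previous = graph[variation_type].get(doc_id, 0.0)
        graph[variation_type][doc_id] = max(previous, weight)
    return dict(graph)


graph = build_document_knowledge_graph([
    ("doc_1", "WDR35|p.Glu645Gly|c.1844A>G", "body"),
    ("doc_1", "WDR35|p.Glu645Gly|c.1844A>G", "title"),
    ("doc_2", "WDR35|p.Glu645Gly|c.1844A>G", "abstract"),
])
```

Indexing the graph by variation type means that a retrieval step can look up all associated documents and their weights with a single dictionary access.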

A retrieval combination set is determined based on the retrieval data (NM_020779.4(WDR35):c.1844A>G (p.Glu645Gly)), and the standard database map and/or the pre-constructed literature knowledge graph is searched based on each retrieval combination in the retrieval combination set to obtain the reference ranking result corresponding to each retrieval combination. In Fig. 10, the reference ranking results comprise ranking result 1 and ranking result 2. The top two variant documents in each of ranking results 1 and 2, together with their matching segments, are input into an evidence classification model to obtain the matching probability corresponding to each target variant document, and the 4 documents are sorted in descending order of matching probability to obtain the document retrieval results corresponding to the retrieval combinations.
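The merging and re-ranking in the Fig. 10 example can be illustrated by a minimal sketch. The evidence classification model is replaced by a stand-in callable that returns fixed, made-up probabilities, and scoring each document only once when it appears in several rankings is an assumption; the name `merge_and_rerank` is hypothetical.

```python
from typing import Callable, List, Sequence


def merge_and_rerank(
    reference_rankings: Sequence[Sequence[str]],
    top_k: int,
    match_probability: Callable[[str], float],
) -> List[str]:
    """Take the top_k documents of each ranking and sort them by match probability."""
    targets: List[str] = []
    for ranking in reference_rankings:
        for doc in ranking[:top_k]:
            if doc not in targets:  # assumption: a document is scored only once
                targets.append(doc)
    return sorted(targets, key=match_probability, reverse=True)


# Stand-in for the evidence classification model: fixed, made-up probabilities.
probabilities = {"d1": 0.2, "d2": 0.9, "d3": 0.6, "d4": 0.4, "d5": 0.99}
result = merge_and_rerank(
    [["d1", "d2", "d5"], ["d3", "d4"]], 2, probabilities.__getitem__
)
```

With a preset sorting quantity of 2, document `d5` is never scored, although its stand-in probability is the highest, because it is third in its reference ranking.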

According to the technical scheme of the embodiment, when the retrieval combination is the fourth retrieval combination or the fifth retrieval combination, the retrieval standard variation set is determined based on the retrieval combination, the standard database search engine is used to determine the reference sorting result corresponding to the retrieval combination based on the retrieval standard variation set, and the document retrieval result corresponding to the retrieval combination is determined based on each reference sorting result. This addresses the problem that the number of variant documents in the document retrieval result is too small and helps ensure that the document retrieval result matches the user's retrieval requirement. Because the variant documents retrieved from the standard database map are added to the document retrieval result, the comprehensiveness and the efficiency of document retrieval are further improved.

Fig. 11 is a schematic structural diagram of a variant document searching apparatus according to an embodiment of the present invention. As shown in fig. 11, the apparatus includes: a search combination set construction module 510, a reference variant document determination module 520, a reference ranking result determination module 530, and a document search result determination module 540.

Wherein, the retrieval combination set constructing module 510 is configured to, in response to receiving the retrieval data, construct a retrieval combination set based on at least one retrieval entity in the retrieval data;

a reference variant document determining module 520, configured to determine, for each search combination in the search combination set, at least one reference variant document corresponding to the search combination and a search weight value corresponding to each reference variant document based on a pre-constructed document knowledge graph;

a reference ranking result determining module 530, configured to rank, based on each search weight value, each reference variant document to obtain a reference ranking result corresponding to the search combination;

a document search result determining module 540, configured to determine a document search result corresponding to the search data based on the reference ranking result corresponding to each search combination;

the literature knowledge graph comprises incidence relations and preset weight values, wherein the incidence relations correspond to at least one preset variant literature and at least one variant type, the preset weight values correspond to the incidence relations, and the variant type represents a combination form of at least one preset entity.

According to the technical scheme of the embodiment, a document knowledge graph comprising an association relationship corresponding to at least one preset variant document and at least one variation type and a preset weight value corresponding to each association relationship is constructed in advance. A retrieval combination set is constructed based on at least one retrieval entity in the received retrieval data. For each retrieval combination in the retrieval combination set, at least one reference variant document corresponding to the retrieval combination and a retrieval weight value corresponding to each reference variant document are determined based on the pre-constructed document knowledge graph, and the reference variant documents are ranked based on the retrieval weight values to obtain a reference ranking result corresponding to the retrieval combination. A document retrieval result corresponding to the retrieval data is then determined based on each reference ranking result.

On the basis of the foregoing embodiment, optionally, the apparatus further includes:

the literature knowledge graph building module is used for acquiring at least one variation type associated with a preset variation literature for each preset variation literature in a preset variation literature set;

for each variation type, determining a preset weight value of an incidence relation between the preset variation literature and the variation type based on the occurrence position of the variation type in the preset variation literature;

and constructing a document knowledge graph based on the association relationship between each preset variation document and at least one variation type and the preset weight value corresponding to each association relationship.

On the basis of the above embodiment, optionally, the literature knowledge graph building module includes:

a preset entity pair construction unit, configured to acquire at least two preset entities corresponding to the preset variant document and construct at least one preset entity pair based on the preset entities;

an entity relationship determining unit, configured to, for each preset entity pair, input the preset entity pair and the entity fragment corresponding to the preset entity pair in the preset variant document into a pre-trained relationship extraction model to obtain an output entity relationship;

and the variation type determining unit is used for determining at least one variation type associated with the preset variation document based on the entity relationship respectively corresponding to each preset entity pair.

Wherein the preset entity pairs comprise at least one entity pair consisting of a preset gene and a preset disease, and at least one entity pair consisting of two of a preset gene, a preset amino acid and a preset nucleotide.

On the basis of the foregoing embodiment, optionally, the mutation type includes GD mutation and/or GPN mutation, and accordingly, the mutation type determining unit is specifically configured to:

when the preset entity pair is an entity pair consisting of a preset gene and a preset disease, constructing GD variation associated with a preset variation document on the basis of the preset entity pair under the condition that the entity relationship of the preset entity pair is a presence relationship;

and when the preset entity pairs comprise entity pairs each consisting of two of a preset gene, a preset amino acid and a preset nucleotide, constructing a GPN variation associated with the preset variant document based on the preset gene, the preset amino acid and the preset nucleotide under the condition that the entity relationships respectively corresponding to those entity pairs are all presence relationships.
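The two construction rules above can be illustrated by a minimal sketch. The relationship extraction model is replaced by a stand-in predicate `has_relation`, and representing a variation as a tuple tagged with "GD" or "GPN" is an assumption made only for illustration.

```python
from itertools import combinations
from typing import Callable, List, Optional, Tuple


def determine_variation_types(
    gene: str,
    disease: Optional[str],
    amino_acid: Optional[str],
    nucleotide: Optional[str],
    has_relation: Callable[[str, str], bool],
) -> List[Tuple[str, ...]]:
    """Build GD and GPN variations from pairwise entity relationships."""
    variations: List[Tuple[str, ...]] = []
    # GD variation: the gene-disease pair must have a presence relationship.
    if disease is not None and has_relation(gene, disease):
        variations.append(("GD", gene, disease))
    # GPN variation: every pair among gene, amino acid and nucleotide must be related.
    if amino_acid is not None and nucleotide is not None:
        gpn = (gene, amino_acid, nucleotide)
        if all(has_relation(a, b) for a, b in combinations(gpn, 2)):
            variations.append(("GPN",) + gpn)
    return variations


# Stand-in for the relationship extraction model: a fixed set of related pairs.
related = {
    frozenset(("WDR35", "p.Glu645Gly")),
    frozenset(("WDR35", "c.1844A>G")),
    frozenset(("p.Glu645Gly", "c.1844A>G")),
}
found = determine_variation_types(
    "WDR35", "some disease", "p.Glu645Gly", "c.1844A>G",
    lambda a, b: frozenset((a, b)) in related,
)
```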

On the basis of the above embodiment, optionally, the apparatus further includes:

a preset entity alignment module, configured to, before at least one preset entity pair is constructed based on the preset entities, perform an alignment operation on each preset entity based on the standard entities in a standard database by using a preset alignment strategy, to obtain at least one aligned preset entity; wherein the preset alignment strategy comprises at least one of a preset alignment algorithm, a first mapping list and a vector similarity algorithm, and the first mapping list represents at least one preset entity corresponding to each standard entity in the standard database.
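One possible alignment strategy can be illustrated by a minimal sketch. It assumes three stages tried in order: exact matching as a stand-in for the preset alignment algorithm, a lookup in the first mapping list, and cosine similarity over character-bigram count vectors as a stand-in for the vector similarity algorithm. The stage order, the threshold of 0.5, the alias `IFT121` and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def bigram_vector(text: str) -> Counter:
    """Count character bigrams of the lower-cased text."""
    lowered = text.lower()
    return Counter(lowered[i:i + 2] for i in range(len(lowered) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def align(
    entity: str,
    standard_entities: List[str],
    first_mapping_list: Dict[str, List[str]],
    threshold: float = 0.5,  # hypothetical cut-off
) -> Optional[str]:
    """Return the standard entity for a preset entity, or None if none fits."""
    if entity in standard_entities:  # stage 1: exact match
        return entity
    for standard, aliases in first_mapping_list.items():  # stage 2: mapping list
        if entity in aliases:
            return standard
    vector = bigram_vector(entity)  # stage 3: vector similarity
    best, best_score = None, 0.0
    for standard in standard_entities:
        score = cosine(vector, bigram_vector(standard))
        if score > best_score:
            best, best_score = standard, score
    return best if best_score >= threshold else None


standards = ["WDR35", "BRCA1"]
aligned = align("wdr-35", standards, {"WDR35": ["IFT121"]})
```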

On the basis of the above embodiment, optionally, the search data at least includes a search gene, and accordingly, the search combination set constructing module 510 is specifically configured to:

adding a retrieval gene in the retrieval data as a first retrieval combination into a retrieval combination set;

in the case where the search data further includes a search amino acid, adding the search gene and the search amino acid as a second search combination to the search combination set;

in the case where the search data further includes a search nucleotide, the search gene and the search nucleotide are added as a third search combination to the search combination set.
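The construction of the first three search combinations can be illustrated by a minimal sketch. It assumes that each search combination is represented as a tuple of search entities; the name `build_search_combination_set` is hypothetical, and the fourth and fifth search combinations, which involve a search transcript, are left out.

```python
from typing import List, Optional, Tuple


def build_search_combination_set(
    search_gene: str,
    search_amino_acid: Optional[str] = None,
    search_nucleotide: Optional[str] = None,
) -> List[Tuple[str, ...]]:
    """Build the first, second and third search combinations from the search data."""
    # First search combination: the search gene alone is always added.
    combination_set: List[Tuple[str, ...]] = [(search_gene,)]
    if search_amino_acid is not None:
        # Second search combination: search gene plus search amino acid.
        combination_set.append((search_gene, search_amino_acid))
    if search_nucleotide is not None:
        # Third search combination: search gene plus search nucleotide.
        combination_set.append((search_gene, search_nucleotide))
    return combination_set


combos = build_search_combination_set("WDR35", "p.Glu645Gly", "c.1844A>G")
```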

Based on the above embodiments, optionally, each variation type in the literature knowledge graph includes GD variation and/or GPN variation, and accordingly, the reference variation document determination module 520 includes:

a search variation set determination unit configured to determine a search variation set based on the search combination; wherein, the retrieval variation set comprises at least one retrieval variation type, and the retrieval variation set is a retrieval GD variation set or a retrieval GPN variation set;

and the retrieval weight value determining unit is used for determining at least one reference variant document and a retrieval weight value corresponding to each reference variant document based on the literature knowledge map and the retrieval variant set.

On the basis of the foregoing embodiment, optionally, the search variation set determination unit includes:

a search GD variation set determination subunit configured to, when a first search combination is included in the search combination set, determine whether at least one query disease having a relationship with a search gene in the first search combination exists in the literature knowledge map;

if yes, determining a search GD variation set based on the search gene and the query diseases;

if not, determining a search GPN variation set based on the search gene; wherein the preset entities corresponding to each first GPN variation in the search GPN variation set include the search gene.
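The branch for the first search combination can be illustrated by a minimal sketch. It assumes that the literature knowledge graph exposes a gene-to-disease index and a list of GPN variations stored as (gene, amino acid, nucleotide) tuples; both structures, the sample disease name and the function name are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def search_variation_set_for_gene(
    search_gene: str,
    gene_to_diseases: Dict[str, Set[str]],
    gpn_variations: List[Tuple[str, str, str]],
) -> Tuple[str, List[Tuple[str, ...]]]:
    """Return ("GD", set) if the gene has related diseases, else ("GPN", set)."""
    query_diseases = sorted(gene_to_diseases.get(search_gene, set()))
    if query_diseases:
        # At least one query disease exists: build the search GD variation set.
        return "GD", [(search_gene, disease) for disease in query_diseases]
    # Otherwise fall back to every GPN variation that contains the search gene.
    return "GPN", [v for v in gpn_variations if v[0] == search_gene]


gpn = [("WDR35", "p.Glu645Gly", "c.1844A>G"), ("BRCA1", "p.X", "c.Y")]
kind_gd, set_gd = search_variation_set_for_gene("WDR35", {"WDR35": {"disease A"}}, gpn)
kind_gpn, set_gpn = search_variation_set_for_gene("WDR35", {}, gpn)
```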

On the basis of the foregoing embodiment, optionally, the search variation set determining unit includes:

a search GPN variation set determining subunit, configured to determine, when the search combination set includes the second search combination, a search GPN variation set based on the second search combination; wherein the preset entities corresponding to each second GPN variation in the search GPN variation set include the search gene and the search amino acid;

when the search combination set includes the third search combination, determining a search GPN variation set based on the third search combination; wherein the preset entities corresponding to each third GPN variation in the search GPN variation set include the search gene and the search nucleotide.

On the basis of the above embodiment, optionally, the search combination set includes a fourth search combination consisting of a search gene, a search amino acid and a search transcript, or the search combination set includes a fifth search combination consisting of a search gene, a search nucleotide and a search transcript, and accordingly, the apparatus further includes:

the search criterion variant set determining module is used for determining a search criterion variant set based on the search combination when the search combination is a fourth search combination or a fifth search combination;

determining a reference sorting result corresponding to the retrieval combination based on the retrieval standard variation set by adopting a standard database search engine;

wherein the search standard variation in the search standard variation set represents a combination of a preset gene, a preset amino acid, a preset transcript and a preset nucleotide;

a search criterion mutation adding module for acquiring a target HGVS mutation corresponding to a search nucleotide and a search transcript in a fifth search combination in a case where the search criterion mutation set is an empty set when the search combination is a fifth search combination after determining the search criterion mutation set based on the search combination;

based on the second mapping list, adding a standard variation corresponding to the target HGVS variation as a search standard variation into a search standard variation set;

the second mapping list represents a mapping relation between at least one preset HGVS variation and a standard variation, and the preset HGVS variation represents a combination form of a preset nucleotide and a preset transcript.

On the basis of the foregoing embodiment, optionally, the document retrieval result determining module 540 is specifically configured to:

respectively acquiring target variant documents with preset sorting quantity from each reference sorting result, and acquiring matched segments corresponding to the retrieval combination in each target variant document;

aiming at each target variant document, inputting the matching segments corresponding to the target variant document and the retrieval combination into an evidence classification model trained in advance to obtain the matching probability corresponding to the target variant document;

and sequencing each target variant document in a descending order based on each matching probability to obtain a document retrieval result corresponding to the retrieval combination.

The retrieval device for the variant documents provided by the embodiment of the invention can execute the retrieval method for the variant documents provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.

Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.

As shown in fig. 12, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 may also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.

A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the search method for variant documents provided by the above-described embodiments.

In some embodiments, the method for retrieving a variant document provided in the above embodiments may be implemented as a computer program, which is tangibly embodied in a computer-readable storage medium, such as the storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the method for retrieving a variant document described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the retrieval method of the variant document by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Computer programs for implementing the methods of the present invention can be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.

In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.

The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.

It should be understood that the various forms of flows shown above may be used with steps reordered, added or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.

The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for searching a variant document, comprising:

for each retrieval combination in the retrieval combination set, determining at least one reference variant document corresponding to the retrieval combination and a retrieval weight value corresponding to each reference variant document based on a pre-constructed document knowledge graph;

2. The method of claim 1, further comprising:

for each preset variant document in a preset variant document set, acquiring at least one variant type associated with the preset variant document;

and constructing a literature knowledge graph based on the association relationship between each preset variant literature and at least one variant type and the preset weight value corresponding to each association relationship.

3. The method of claim 2, wherein the obtaining at least one mutation type associated with the predetermined mutation document comprises:

acquiring at least two preset entities corresponding to the preset variant documents, and constructing at least one preset entity pair based on each preset entity;

for each preset entity pair, inputting the preset entity pair and an entity fragment corresponding to the preset entity pair in the preset variation literature into a relation extraction model trained in advance to obtain an output entity relation;

determining at least one mutation type associated with the preset mutation document based on the entity relationship respectively corresponding to each preset entity pair;

wherein each preset entity pair comprises at least one entity pair consisting of a preset gene and a preset disease and at least one entity pair consisting of two of the preset gene, the preset amino acid and the preset nucleotide.

4. The method as claimed in claim 3, wherein the mutation types include GD mutation and/or GPN mutation, and the determining at least one mutation type associated with the predetermined mutation document based on the entity relationship corresponding to each of the predetermined entity pairs comprises:

when the preset entity pair is an entity pair consisting of a preset gene and a preset disease, constructing a GD variation associated with the preset variation document based on the preset entity pair under the condition that the entity relationship of the preset entity pair is a presence relationship;

and when the preset entity pairs comprise entity pairs each consisting of two of a preset gene, a preset amino acid and a preset nucleotide, constructing a GPN variation associated with the preset variation document based on the preset gene, the preset amino acid and the preset nucleotide under the condition that the entity relationships respectively corresponding to those entity pairs are all presence relationships.

5. The method of claim 3, wherein prior to constructing at least one preset entity pair based on each of the preset entities, the method further comprises:

adopting a preset alignment strategy, and respectively executing alignment operation on each preset entity based on each standard entity in a standard database to obtain at least one aligned preset entity; the preset alignment strategy comprises at least one of a preset alignment algorithm, a first mapping list and a vector similarity algorithm, wherein the first mapping list represents at least one preset entity corresponding to each standard entity in the standard database.

6. The method of claim 1, wherein the search data comprises at least search genes, and wherein constructing the search combination set based on at least one search entity in the search data comprises:

in a case where the search data further includes a search amino acid, adding the search gene and the search amino acid as a second search combination to a search combination set;

in the case where the search data further includes a search nucleotide, the search gene and the search nucleotide are added as a third search combination to a search combination set.

7. The method as claimed in claim 6, wherein each mutation type in the document knowledge graph comprises GD mutation and/or GPN mutation, and the determining at least one reference mutation document corresponding to the search combination and the search weight value corresponding to each of the reference mutation documents based on the pre-constructed document knowledge graph comprises:

determining a search variation set based on the search combination; wherein the search variation set comprises at least one search variation type, and the search variation set is a search GD variation set or a search GPN variation set;

and determining at least one reference variant document and a retrieval weight value corresponding to each reference variant document based on the document knowledge graph and the retrieval variant set.

8. The method of claim 7, wherein determining a search variation set based on the search combination comprises:

when the retrieval combination set contains a first retrieval combination, judging whether at least one inquiry disease which has a relationship with retrieval genes in the first retrieval combination exists in the literature knowledge graph or not;

if yes, determining a search GD variation set based on the search gene and the query diseases;

if not, determining a search GPN variation set based on the search gene; wherein the preset entities corresponding to each first GPN variation in the search GPN variation set comprise the search gene.

9. The method of claim 7, wherein determining a search variation set based on the search combination comprises:

when a second retrieval combination is contained in the retrieval combination set, determining a search GPN variation set based on the second retrieval combination; wherein the preset entities corresponding to each second GPN variation in the search GPN variation set comprise a search gene and a search amino acid;

when a third retrieval combination is contained in the retrieval combination set, determining a search GPN variation set based on the third retrieval combination; wherein the preset entities corresponding to each third GPN variation in the search GPN variation set comprise a search gene and a search nucleotide.

10. The method of claim 1, wherein the set of search combinations comprises a fourth search combination consisting of a search gene, a search amino acid, and a search transcript, or the set of search combinations comprises a fifth search combination consisting of a search gene, a search nucleotide, and a search transcript, and the method further comprises:

when the retrieval combination is a fourth retrieval combination or a fifth retrieval combination, determining a retrieval standard variation set based on the retrieval combination;

11. The method of claim 10, wherein after determining the retrieval standard variation set based on the retrieval combination, the method further comprises:

when the retrieval combination is the fifth retrieval combination and the retrieval standard variation set is empty, acquiring a target HGVS variation corresponding to the search nucleotide and the search transcript in the fifth retrieval combination;

based on a second mapping list, adding the standard variation corresponding to the target HGVS variation to the retrieval standard variation set as a retrieval standard variation;

wherein the second mapping list represents the mapping between each of at least one preset HGVS variation and its corresponding standard variation, each preset HGVS variation representing a combination of a preset nucleotide and a preset transcript.
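The fallback in claim 11 can be sketched as follows, under stated assumptions. The second mapping list is modeled as a dictionary keyed by an HGVS string of the form `transcript:nucleotide`; the function name `fill_from_hgvs` and the identifier `VAR-0001` are hypothetical and not taken from the claims.

```python
from typing import Dict, Set


def fill_from_hgvs(
    standard_set: Set[str],
    nucleotide: str,
    transcript: str,
    second_mapping: Dict[str, str],
) -> Set[str]:
    """Fill an empty retrieval standard variation set via the HGVS mapping."""
    if standard_set:
        # The set already has members, so no fallback is needed.
        return set(standard_set)
    # Assumed key format: "<transcript>:<nucleotide change>".
    target_hgvs = f"{transcript}:{nucleotide}"
    standard = second_mapping.get(target_hgvs)
    return {standard} if standard is not None else set()


# Made-up mapping entry.
second_mapping = {"NM_007294.3:c.181T>G": "VAR-0001"}
filled = fill_from_hgvs(set(), "c.181T>G", "NM_007294.3", second_mapping)
```

If the HGVS string is missing from the mapping, this sketch leaves the set empty; how the claimed method handles that case is not stated here.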

12. The method according to any one of claims 1-11, wherein determining the document retrieval result corresponding to the retrieval combination based on each of the reference sorting results comprises:

for each target variant document, inputting the target variant document and a matching segment corresponding to the retrieval combination into a pre-trained evidence classification model to obtain a matching probability corresponding to the target variant document;

and sorting the target variant documents in descending order of matching probability to obtain the document retrieval result corresponding to the retrieval combination.
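The re-ranking step in claim 12 can be illustrated with the minimal sketch below. The evidence classification model is replaced by a hypothetical stand-in, `toy_classifier`, that scores token overlap; it is not the trained model the claim refers to, and the function names and sample texts are assumptions.

```python
from typing import Callable, List, Tuple


def rank_by_match_probability(
    documents: List[str],
    matching_segment: str,
    classify: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score each document against the segment and sort in descending order."""
    scored = [(doc, classify(doc, matching_segment)) for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_classifier(document: str, segment: str) -> float:
    """Stand-in scorer: fraction of segment tokens that occur in the document."""
    tokens = segment.lower().split()
    if not tokens:
        return 0.0
    text = document.lower()
    return sum(token in text for token in tokens) / len(tokens)


# Made-up documents and matching segment.
docs = [
    "TP53 overview",
    "BRCA1 review",
    "BRCA1 c.181T>G was found in patients",
]
ranked = rank_by_match_probability(docs, "BRCA1 c.181T>G", toy_classifier)
```

Passing the classifier as a callable keeps the sorting logic independent of whichever model actually produces the matching probability.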

13. A device for searching variant documents, comprising:

a reference variant document determining module, configured to determine, for each retrieval combination in the retrieval combination set and based on a pre-constructed document knowledge graph, at least one reference variant document corresponding to the retrieval combination and a retrieval weight value corresponding to each reference variant document;

a reference sorting result determining module, configured to sort the reference variant documents based on the search weight values to obtain a reference sorting result corresponding to the search combination;

a document retrieval result determining module, configured to determine a document retrieval result corresponding to the retrieval data based on the reference sorting result corresponding to each retrieval combination;

14. An electronic device, characterized in that the electronic device comprises:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of retrieving a variant document of any of claims 1-12.

15. A computer-readable storage medium storing computer instructions which, when executed, cause a processor to perform the method of retrieving a variant document according to any one of claims 1 to 12.