CN113076432A - Document knowledge context generation method, device and storage medium - Google Patents

Document knowledge context generation method, device and storage medium Download PDF

Info

Publication number
CN113076432A
CN113076432A (application CN202110480081.XA)
Authority
CN
China
Prior art keywords
entity
document
acquiring
standard
documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110480081.XA
Other languages
Chinese (zh)
Other versions
CN113076432B (en
Inventor
Lin Gui (林桂)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110480081.XA priority Critical patent/CN113076432B/en
Publication of CN113076432A publication Critical patent/CN113076432A/en
Application granted granted Critical
Publication of CN113076432B publication Critical patent/CN113076432B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/381Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using identifiers, e.g. barcodes, RFIDs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to artificial intelligence and discloses a document knowledge context generation method, which comprises the following steps: performing label classification on documents to be detected and acquiring a category label set corresponding to the documents to be detected; acquiring query information and, based on the query information, acquiring a target document range corresponding to it within the documents to be detected; performing entity extraction on the target documents within the target document range to obtain all standard entity designations in the target documents; acquiring the category labels and the standard entity designation set corresponding to the target documents based on the standard entity designations and the category label set; and forming a document knowledge context corresponding to the query information based on the category labels and the standard entity designation set. The invention can organize the knowledge context of the relevant documents and can then recommend corresponding content to the user for navigation according to that knowledge context and the user's expectations.

Description

Document knowledge context generation method, device and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a document knowledge context generation method and apparatus, an electronic device, and a computer-readable storage medium.
Background
At present, self-service scientific research information service platforms developed by university teams provide literature-based information mining and analysis services for researchers. Using such a service, researchers can gain a deep and comprehensive understanding of the state of research on a topic of interest, extract research data on experts and research institutions in a specific field, and keep track of the subject's latest developments and funding hotspots. For example, AMiner, independently developed by Tsinghua University, uses data mining and social network analysis and mining technologies to provide researchers with semantic information extraction, topic discovery, trend analysis and other functions, offering comprehensive domain knowledge, targeted research topics and collaborator information.
However, most existing scientific research information service platforms support only Chinese literature analysis and interpretation, include too few PubMed documents, generally focus on the computer science field, and do not mine literature research hotspots in depth. Beyond the functional gaps of varying degrees in existing domestic academic mining and scholar search products, a more obvious and common problem is insufficient verticality: they do not specifically target medical-field documents, so their mining and research in the medical field unavoidably lacks expertise.
Disclosure of Invention
The invention provides a document knowledge context generation method and apparatus, an electronic device, and a computer-readable storage medium, with the main aim of providing a reliable scheme for generating knowledge contexts of professional documents in medicine and similar fields.
In order to achieve the above object, the present invention provides a document knowledge context generation method, including:
classifying labels of documents to be detected, and acquiring a class label set corresponding to the documents to be detected;
acquiring query information, and acquiring a target document range corresponding to the query information in the document to be detected based on the query information and the category label set;
performing entity extraction on the target documents in the target document range to obtain all standard entity designations in the target documents;
acquiring a class label and standard entity designation set corresponding to the target document based on the standard entity designations and the class label set;
forming a document knowledge context corresponding to the query information based on the category labels and the set of standard entity designations.
Optionally, the step of obtaining all standard entity designations in the target document includes:
acquiring all entity designations corresponding to the target document based on a pre-trained entity recognition model;
and linking the entity names to a standard map based on an entity linking technology, and acquiring standard entity names corresponding to the entity names.
Optionally, the step of obtaining a standard entity name corresponding to the entity name includes:
acquiring a synonymous information item corresponding to the entity designation based on the entity designation, and determining a designation item set based on the entity designation and the synonymous information item;
searching a candidate entity item set corresponding to the nominal item set in a preset knowledge base on the basis of the nominal item set;
respectively extracting dimension reduction characteristics of the nominal item set and the candidate entity item set;
similarity calculation is carried out on the dimensionality reduction features of the nominal item set and the candidate entity item set, and all entities in the candidate entity item set are ranked according to scores obtained by the similarity calculation;
determining a set of entities corresponding to the entity designations based on the results of the ranking, the entities in the set of entities being the standard entity designations.
Optionally, the separately extracting the dimension reduction features of the nominal item set and the candidate entity item set includes:
acquiring Word2Vec values of all entities in the nominal item set and the candidate entity item set;
based on the Word2Vec value, obtaining a TF-IDF value of the entity corresponding to the Word2Vec value;
multiplying the TF-IDF value as a weight by the word vector of the entity to obtain the dimension reduction characteristics of the reference item set and the candidate entity item set.
Optionally, the step of classifying the tags of the document to be detected and acquiring the class tag set corresponding to the document to be detected includes:
acquiring document data with classification labels as a training data set;
training an MLG-Bert model based on the training data until the MLG-Bert model converges to a preset range to form a document classification model;
and acquiring a category label set corresponding to the document to be detected based on the document classification model.
Optionally, the formula for multiplying the TF-IDF value as a weight by the word vector of the entity is expressed as:
doc_emb = Σ_i TF-IDF(word_i) · Word2Vec(word_i)
wherein doc_emb represents the dimension-reduction feature of the nominal item set/candidate entity item set, word_i represents the i-th entity in the nominal item set/candidate entity item set, TF-IDF(word_i) represents the TF-IDF value of the i-th entity, and Word2Vec(word_i) represents the Word2Vec word vector of the i-th entity.
In order to solve the above problem, the present invention also provides a document knowledge context generating apparatus, comprising:
the category label set acquisition unit is used for classifying the labels of the documents to be detected and acquiring a category label set corresponding to the documents to be detected;
the target document range acquiring unit is used for acquiring query information and acquiring a target document range corresponding to the query information in the document to be detected based on the query information and the category label set;
a standard entity designation acquiring unit, configured to perform entity extraction on the target documents within the target document range to acquire all standard entity designations in the target documents;
a category label and standard entity designation set acquisition unit configured to acquire a category label and a standard entity designation set corresponding to the target document based on the standard entity designation and the category label set;
a document knowledge context forming unit for forming a document knowledge context corresponding to the query information based on the category label and the standard entity designation set.
Optionally, the step of obtaining all standard entity designations in the target document includes:
acquiring all entity designations corresponding to the target document based on a pre-trained entity recognition model;
and linking the entity names to a standard map based on an entity linking technology, and acquiring standard entity names corresponding to the entity names.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the document context generation method described above.
In order to solve the above problem, the present invention further provides a computer-readable storage medium having at least one instruction stored therein, where the at least one instruction is executed by a processor in an electronic device to implement the document knowledge context generation method described above.
The method comprises: performing label classification on the documents to be detected and acquiring the corresponding category label set; acquiring, based on the query information, the target document range corresponding to the query information within the documents to be detected; performing entity extraction on the target documents to obtain all standard entity designations in them; acquiring the category labels and the standard entity designation set corresponding to the target documents based on the standard entity designations and the category label set; and forming the document knowledge context corresponding to the query information based on the category labels and the standard entity designation set. Through artificial intelligence and natural language processing technologies, massive medical and other documents are mined and understood to provide a scientific research knowledge context service for researchers. Based on underlying algorithmic techniques such as named entity recognition and extraction, multi-label document classification and entity recommendation, the category labels and entity designation sets of documents are acquired, and the expected knowledge navigation context is provided to users accordingly, covering both documents and entities and showing a progression from the general to the specific, which makes it easier for users to gain a systematic, overall understanding of the research field.
Drawings
FIG. 1 is a flow chart of a document knowledge context generation method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a document knowledge context generation apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the internal structure of an electronic device implementing a document knowledge context generation method according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a document knowledge context generation method. Referring to fig. 1, a flowchart of a document knowledge context generation method according to an embodiment of the present invention is shown. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In the embodiment, the document knowledge context generation method includes: classifying labels of documents to be detected, and acquiring a class label set corresponding to the documents to be detected; acquiring query information, and acquiring a target document range corresponding to the query information in the document to be detected based on the query information and the category label set; performing entity extraction on the target documents in the target document range to obtain all standard entity designations in the target documents; acquiring a class label and a standard entity name set corresponding to the target document based on the standard entity name and the class label set; forming a document knowledge context corresponding to the query information based on the category labels and the set of standard entity designations.
Specifically, the steps of the above-described document context generation method are described in detail below.
S110: and classifying labels of the documents to be detected, and acquiring a class label set corresponding to the documents to be detected.
The method comprises the following steps of performing label classification on a document to be detected, and acquiring a class label set corresponding to the document to be detected, wherein the steps comprise:
s111: acquiring document data with classification labels as a training data set;
s112: training an MLG-Bert model based on the training data until the MLG-Bert model converges to a preset range to form a document classification model;
s113: and acquiring a category label set corresponding to the document to be detected based on the document classification model.
Specifically, the documents may be pre-labeled by category based on the Medical Subject Headings (MeSH, a tool widely used in medical information retrieval) supported by PubMed and classified into three first-level labels: basic, diagnosis and treatment; more than 20 second-level labels are constructed on top of the first-level classification. In this way, more than 10 million items of data with classification labels can be obtained as a training set, and a Bert + GCN model architecture is used to predict the unlabeled data, thereby constructing document category labels for the full PubMed corpus. A document to be detected carries no classification label; its category labels are obtained through the document classification model. The category labels can comprise first-level, second-level, third-level and further labels, each level in turn containing labels of certain types such as basic, diagnosis and treatment; all category labels of the document to be detected can be obtained through detection, and its category label set is formed from them. Steps S111 to S113 above mainly constitute the training process of the document classification model: the model's input is document data with classification labels (automatically or manually annotated), its output is the predicted category labels of the corresponding documents, and the training result is judged by comparing the predicted labels with the originally annotated ones until the accuracy meets the requirement.
Note that training with the MLG-Bert model is not the only way to obtain the category label set of the documents to be detected; other models can also be trained to obtain the desired category labels.
As an example, the category label set includes the label classification results of all documents to be detected, i.e. the category labels corresponding to each document; the category labels include at least first-level, second-level and third-level labels. The first-level labels at least comprise basic, diagnosis and treatment; the second-level labels at least comprise drug therapy, surgical therapy, interventional therapy, general therapy, other therapy and prognosis; the level and number of category labels are not limited in this application.
S120: acquiring query information, and acquiring a target document range corresponding to the query information in the documents to be detected based on the query information and the category label set.
specifically, the query information may be related to documents that the user needs to search for, and may be various information such as a summary, a title, or other keywords. The method comprises the steps that labels of documents to be detected are classified, and corresponding class labels are obtained, so that in the process of determining the range of the target document, the class labels can be judged and screened according to specific query information input by a user in the documents to be detected, so as to obtain the range of the corresponding target document, the range of the target document can comprise a preset number of target documents, and the target documents can be specifically set according to requirements.
In addition, in the process of screening the category labels (corresponding to the documents to be detected) by querying the information, the category labels can be screened in various ways such as presetting a certain judgment rule or performing similarity calculation, and the invention is not particularly limited.
S130: medical named entity extraction is carried out on the target documents within the range of the target documents to obtain all standard entity designations in the target documents.
In this step, the process of obtaining all the standard entity designations in the target document further comprises the steps of:
s131: acquiring all entity designations corresponding to the documents to be detected based on a pre-trained entity recognition model;
s132: and linking the entity names to a standard map based on an entity linking technology, and acquiring standard entity names corresponding to the entity names.
As a specific example, the Ping An medical knowledge graph is a product that links multi-dimensional databases in the medical field through knowledge graph technology and provides users with a large amount of professional medical knowledge. It integrates 1 million core medical terms, 10 million medical terms and 16 million medical relations, aggregates all-round knowledge data in the medical ecosystem, covers core medical concepts such as diseases, drugs, examinations, operations, genes and departments, and provides personalized solutions based on accurate medical knowledge for every role in the clinical pathway.
In addition, the entity recognition model can be a medical named entity recognition model using a deep learning model based on BioBERT. BioBERT is a language pre-training model for the biomedical domain that covers tens of millions of items of biomedical and general-domain literature. In the invention, the model encodes the basic semantic information of the text through BioBERT, learns task-specific features through a bidirectional LSTM layer, and finally optimizes the entity sequence through a CRF layer. With this model, the entity designations in the text of the documents to be detected can be obtained, and the entity designations can then be linked to standard graph concepts through an entity linking technique based on text similarity.
For example, the text acquired from a target document (also referred to as the document to be detected, likewise below) may contain a non-standard form such as "heart transplant"; since it is not a standard term and the standard designation on the standard graph is "heart transplantation", "heart transplant" can be linked to "heart transplantation" through entity designation linking.
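The linking of a non-standard surface form to a standard-graph concept can be sketched as a surface-string similarity match. The following is a minimal illustration only: the `link_to_standard` helper, the concept list, and the 0.6 threshold are invented for this sketch, whereas the invention's actual linking uses a trained text-similarity pipeline and the knowledge graph described here.

```python
from difflib import SequenceMatcher

def link_to_standard(mention, standard_concepts, threshold=0.6):
    """Link a free-text entity designation to the closest concept on the
    standard graph by surface-string similarity; return None when no
    concept clears the threshold. Names and threshold are illustrative."""
    best, best_score = None, 0.0
    for concept in standard_concepts:
        score = SequenceMatcher(None, mention.lower(), concept.lower()).ratio()
        if score > best_score:
            best, best_score = concept, score
    return best if best_score >= threshold else None

# Toy standard-graph concepts: the non-standard mention "heart transplant"
# links to the standard designation "heart transplantation".
concepts = ["heart transplantation", "liver transplantation", "coronary angiography"]
print(link_to_standard("heart transplant", concepts))
```

A production system would replace the string ratio with the similarity over the dimension-reduction features, but the recall-then-link flow is the same.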
Specifically, in the document knowledge context generation method of the invention, the query information may be the title and abstract of an article, and the MLG-Bert model is used to predict the documents to be detected. For example, the article's Title and Abstract may be input into the MLG-Bert model, where Title denotes the title and Abstract denotes the abstract, and the overall vector representation is generated by the model's BioBERT. BioBERT is pre-trained on a biomedical text corpus; compared with a Bert model pre-trained on a general corpus, this reduces the negative influence of word-distribution shift. Adding a CNN (convolutional neural network) layer after the model's BioBERT allows features to be better extracted and combined, and the category labels of the document are finally output point by point. A GCN layer is then added as an embedded network layer for the labels, which improves the nonlinear capability of the model through embedded-value input of node features.
In particular, the model may be trained using binary cross-entropy as the loss function. Finally, the model is used to predict articles without MeSH terms, so that they too are placed into the classification system for subsequent document analysis.
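The binary cross-entropy loss for the multi-label case can be written in a few lines of NumPy. This is a minimal sketch of the loss alone: the real model computes it over BioBERT-based predictions, and the label and probability values below are made up for illustration.

```python
import numpy as np

def multilabel_bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy averaged over the label vector, as used to
    train the multi-label document classifier (a NumPy sketch, not the
    MLG-Bert training code itself)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))

# One document, three category labels (e.g. basic / diagnosis / treatment).
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8])
loss = multilabel_bce(y_true, y_pred)
```

Each label is treated as an independent binary decision, which is what makes the loss suitable for documents carrying several category labels at once.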
Specifically, the step of obtaining a standard entity name corresponding to the entity name includes:
s1321: acquiring a synonymous information item corresponding to the entity designation based on the entity designation, and determining a designation item set based on the entity designation and the synonymous information item;
s1322: based on the nominal item set, searching a candidate entity item set (entity recall) corresponding to the nominal item set in a preset knowledge base;
s1323: respectively extracting dimension reduction characteristics of the nominal item set and the candidate entity item set;
the process of extracting the dimensionality reduction features of the nominal item set and the candidate entity item set is a process of respectively carrying out dimensionality reduction on the nominal item set and the candidate entity item set, so that the corresponding dimensionality reduction features can be obtained, the subsequent similarity calculation can be facilitated, and the calculation process is simplified.
In addition, the dimension-reduction processing of the nominal item set and the candidate entity item set may also use other dimension-reduction methods, such as filtering, random forest, principal component analysis and backward feature elimination; the invention is not limited in this respect.
S1324: similarity calculation is carried out on the dimensionality reduction features of the nominal item set and the candidate entity item set, and all entities in the candidate entity item set are ranked according to scores obtained by the similarity calculation;
in this step, the calculation formula of the similarity is as follows:
Figure BDA0003048855710000081
and the x and the y respectively represent entity vector representations of different word vectors, the entity vector representations comprise surface features and deep features, all entities are ranked according to the similarity scores, and the higher the score is, the earlier the ranking of the corresponding entities is.
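The scoring-and-ranking step can be sketched as follows. The candidate names and vectors here are made up for illustration; real entity vectors would come from the dimension-reduction features described above.

```python
import numpy as np

def cosine_similarity(x, y):
    # sim(x, y) = (x . y) / (|x| * |y|)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def rank_candidates(mention_vec, candidates):
    """Score each candidate entity vector against the designation vector
    and return (name, score) pairs in descending score order."""
    scored = [(name, cosine_similarity(mention_vec, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)

# Toy vectors: the first candidate points almost the same way as the mention.
mention = np.array([1.0, 1.0, 0.0])
candidates = {
    "heart transplantation": np.array([1.0, 0.9, 0.1]),
    "kidney transplantation": np.array([0.1, 1.0, 0.9]),
}
ranking = rank_candidates(mention, candidates)  # best match ranked first
```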
S1325: determining a set of entities corresponding to the entity designations based on the results of the ranking, the entities in the set of entities being the standard entity designations.
In this step, the entity set may be formed by taking the top preset number of entities by similarity score; the entities in this set are the standard entity designations. The preset number may be set as required, and in the invention it may be set to 5.
Specifically, the Entity Linking (EL) mainly refers to a process of correctly pointing the identified Entity objects (e.g., name of person, name of place, name of organization, etc.) in the free text to the target Entity in the knowledge base without ambiguity. In popular terms, entity linking mainly refers to predicting a knowledge base id corresponding to a certain entity of an input query under the condition that the knowledge base exists. The method mainly comprises two parts of entity recall and entity sequencing.
The two most important steps in the above entity linking technique are entity recall and entity ranking. Entity recall, i.e. generation of the candidate entity set, recalls from the knowledge base as many entities related to the designation as possible, according to the designation items already present in the text of the document to be detected; this step requires a high recall rate. Specifically, word vectors can be trained on the text and the cosine similarity between the word vector of a term and the word vectors in the text calculated; for example, a threshold of about 0.56 may be set, and terms above the threshold counted as synonyms of the term.
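The recall step can be sketched directly from that description, using the ~0.56 cosine threshold given above. The vocabulary and vectors below are invented for illustration; real word vectors would be trained on the document text.

```python
import numpy as np

def recall_synonyms(term_vec, vocab_vecs, threshold=0.56):
    """Entity recall: keep every vocabulary word whose word-vector cosine
    similarity with the designation exceeds the threshold as a candidate
    synonym. The 0.56 threshold follows the example value in the text."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return [w for w, v in vocab_vecs.items() if cos(term_vec, v) > threshold]

term = np.array([1.0, 0.2])
vocab = {"cardiac": np.array([0.9, 0.3]),  # close to the term vector
         "renal": np.array([0.1, 1.0])}    # nearly orthogonal to it
print(recall_synonyms(term, vocab))
```

A permissive threshold is deliberate here: recall should over-generate candidates, since the ranking stage filters them afterwards.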
Entity ordering is mainly to rank a candidate entity item set by using different kinds of evidences to obtain the most probable entities, for example, in the process of extracting dimension reduction features of the nominal item set and the candidate entity item set: respectively acquiring Word2Vec values of all entities in the nominal item set and the candidate entity item set; acquiring a TF-IDF value of the entity corresponding to the Word2Vec value based on the Word2Vec value; multiplying the TF-IDF value as a weight by the word vector of the entity to obtain dimension reduction characteristics of the entity in the reference item set and the candidate entity item set.
Wherein Word2Vec represents a Word vector corresponding to an entity, TF-IDF value represents a Word frequency-inverse document frequency, and the formula of multiplying the TF-IDF value as a weight by the Word vector of the entity is represented as:
doc_emb = Σ_i TF-IDF(word_i) · Word2Vec(word_i)
wherein doc_emb represents the dimension-reduction feature of the nominal item set/candidate entity item set, word_i represents the i-th entity in the nominal item set/candidate entity item set, TF-IDF(word_i) represents the TF-IDF value of the i-th entity, and Word2Vec(word_i) represents the Word2Vec word vector of the i-th entity.
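The weighted-sum formula above can be sketched directly in NumPy. This is a minimal example: the 3-dimensional vectors and TF-IDF weights are made up, standing in for trained Word2Vec and TF-IDF models.

```python
import numpy as np

def doc_embedding(words, tfidf, word2vec):
    """doc_emb = sum_i TF-IDF(word_i) * Word2Vec(word_i): each entity's
    TF-IDF value weights its word vector, and the weighted vectors are
    summed into a single dimension-reduced feature."""
    dim = len(next(iter(word2vec.values())))
    emb = np.zeros(dim)
    for w in words:
        emb += tfidf.get(w, 0.0) * word2vec.get(w, np.zeros(dim))
    return emb

# Toy 3-dimensional vocabulary with invented TF-IDF weights.
w2v = {"heart": np.array([1.0, 0.0, 0.0]),
       "transplant": np.array([0.0, 1.0, 0.0])}
weights = {"heart": 0.5, "transplant": 2.0}
emb = doc_embedding(["heart", "transplant"], weights, w2v)  # [0.5, 2.0, 0.0]
```

The same function serves both the nominal item set and each candidate entity item set, after which their embeddings are compared with the cosine similarity above.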
S140: acquiring a class label and standard entity designation set corresponding to the target document based on the standard entity designations and the class label set;
S150: forming a document knowledge context corresponding to the query information based on the category labels and the set of standard entity designations.
In the above steps S140 and S150, after the entity designations of the target documents in the target document range are determined, the corresponding standard entity designations can be determined based on all entity designations of the target documents. Further, according to the category labels of the target documents corresponding to the user's query information, all category labels and all standard entity designations corresponding to those documents can be obtained from the category label set formed from the documents to be detected, thereby forming the category label (set) and standard entity designation set corresponding to the target documents.
Furthermore, the category labels can be organized hierarchically together with the standard entity designation set. For example, the category labels may include first-level, second-level, and third-level labels, with standard entity designations classified under each label level: four entity designations under a first-level label, several entity designations under a second-level label, and so on, until all category labels and standard entity designations have been classified, forming the document knowledge context corresponding to the query information.
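The hierarchical organization described above can be sketched, for illustration only (the data layout and function name are assumptions, not part of the disclosure), as nesting entity designations under their label path:

```python
def build_context(label_paths):
    """label_paths: list of (label_hierarchy_tuple, entity_designation).
    Nest each standard entity designation under its first-/second-/
    third-level labels to form the document knowledge context tree."""
    tree = {}
    for labels, entity in label_paths:
        node = tree
        for label in labels:          # walk/create one level per label
            node = node.setdefault(label, {})
        node.setdefault("_entities", []).append(entity)
    return tree
```

The tree can then be rendered directly as the navigable knowledge context presented to the user.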
It should be noted that the above classification of the category labels may be performed according to a preset rule or a conventional label classification manner; the division manner is not particularly limited here.
Therefore, the document knowledge context generation method provided by the invention can classify documents, extract entities from the classified documents, and finally recommend, within each document category, the entities that best match the user's expectation, for the user to navigate and form a knowledge context. The method relies on underlying algorithm technologies such as named entity recognition and extraction, multi-label document classification, and entity recommendation to provide users with the expected knowledge context navigation. It covers the document-entity relationship, presents a progression from the general to the specific, and makes it easier for users to gain a systematic overview of the field to be researched. Through artificial intelligence and natural language processing technologies, a large number of medical documents can be mined and understood, providing a scientific research knowledge context service for researchers.
The invention also provides a document knowledge context generation device corresponding to the above document knowledge context generation method.
Fig. 2 shows a functional block diagram of the document knowledge context generation device of the present invention.
As shown in fig. 2, the document knowledge context generating apparatus 200 according to the present invention may be installed in an electronic device. According to the implemented functions, the document knowledge context generating device may include: a category label set acquisition unit 210, a target document range acquisition unit 220, a standard entity designation acquisition unit 230, a category label and standard entity designation set acquisition unit 240, and a document knowledge context formation unit 250. A unit of the present invention, which may also be referred to as a module, refers to a series of computer program segments that are stored in a memory of the electronic device, can be executed by a processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions regarding the respective modules/units are as follows:
a category label set obtaining unit 210, configured to classify labels of documents to be detected, and obtain a category label set corresponding to the documents to be detected;
a target document range obtaining unit 220, configured to obtain query information, and obtain a target document range corresponding to the query information in the document to be detected based on the query information and the category label set;
the standard entity designation acquiring unit 230 is configured to perform entity extraction on the target documents within the range of the target documents to acquire all standard entity designations in the target documents.
In this unit, the process of obtaining all the standard entity designations in the document to be detected further comprises the following steps:
s131: acquiring all entity designations corresponding to the documents to be detected based on a pre-trained entity recognition model;
s132: and linking the entity names to a standard map based on an entity linking technology, and acquiring standard entity names corresponding to the entity names.
As a specific example, the Ping An medical knowledge graph is a product that links multi-dimensional databases in the medical field through knowledge graph technology and provides users with a large amount of professional medical knowledge. It integrates one million core medical terms, ten million medical terms, and sixteen million medical relations, realizes the aggregation of all-round knowledge data in the medical ecosystem, covers core medical concepts such as diseases, medicines, examinations, operations, genes, and departments, and provides personalized solutions based on accurate medical knowledge for all roles in clinical pathways.
Further, the medical named entity recognition model may employ a BioBERT-based deep learning model. BioBERT is a language pre-training model for the biomedical field, trained on tens of millions of pieces of biomedical and general-domain literature. In the invention, the model encodes the basic semantic information of a text (the text of a document to be detected, the same below) through BioBERT, then learns task-specific features through a bidirectional LSTM layer, and finally optimizes the entity sequence through a CRF layer. With this model, the entity designations in the texts of the documents to be detected can be obtained, and the entity designations can then be linked to standard atlas concepts through an entity linking technique based on text similarity.
For example, suppose the text in the document to be detected contains "heart transplant". Since "heart transplant" is not a standard term, and the standard designation on the standard atlas is "heart transplantation", "heart transplant" can be linked to "heart transplantation" through entity designation linking.
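As an illustrative stand-in only for the text-similarity linking described above (the real system links against the standard atlas; the function name, candidate list, and use of surface-string similarity are assumptions, not part of the disclosure), a mention can be mapped to its closest standard designation like this:

```python
from difflib import SequenceMatcher

def link_to_standard(mention, standard_terms):
    """Link a raw entity mention to the standard designation with the
    highest surface-string similarity among the atlas terms."""
    return max(standard_terms,
               key=lambda term: SequenceMatcher(None, mention, term).ratio())
```

In practice the similarity would be computed over vector representations rather than raw strings, but the selection logic is the same: keep the highest-scoring standard designation.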
Specifically, in the document knowledge context generation method of the present invention, the query information may be the Title and Abstract of an article, and the MLG-Bert model is used to predict the documents to be detected. For example, the Title and Abstract of an article may be input to the MLG-Bert model, where Title represents the title and Abstract represents the abstract; the overall vector representation is generated by the model's BioBERT, which is pre-trained on a corpus of biomedical text and, compared with a BERT model pre-trained on a general corpus, reduces the negative influence caused by word-distribution shift. A CNN (convolutional neural network) layer added after the model's BioBERT better extracts and combines features, and the category labels of the documents are finally output. A GCN layer is then added as a label-embedding network layer, which improves the nonlinear capability of the model through embedded-value input of node features.
In particular, the model may be trained using binary cross entropy as the loss. Finally, the model is used to predict articles without MeSH terms, so that these articles are also brought into the classification system for subsequent document analysis.
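For reference, binary cross entropy, the multi-label loss named above, can be computed as follows (this is the standard definition, not code from the patent; the function name and clipping constant are assumptions):

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross entropy over the multi-label outputs.
    Probabilities are clipped away from 0 and 1 to keep log() finite."""
    p = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))
```

Each label is treated as an independent binary decision, which is why this loss fits multi-label document classification.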
Specifically, the step of obtaining a standard entity name corresponding to the entity name includes:
s1321: acquiring a synonymous information item corresponding to the entity designation based on the entity designation, and determining a designation item set based on the entity designation and the synonymous information item;
s1322: based on the nominal item set, searching a candidate entity item set (entity recall) corresponding to the nominal item set in a preset knowledge base;
s1323: respectively extracting dimension reduction characteristics of the nominal item set and the candidate entity item set;
Performing dimensionality reduction on the nominal item set and the candidate entity item set respectively to obtain the corresponding dimension-reduction features facilitates the subsequent similarity calculation and simplifies the calculation process.
S1324: similarity calculation is carried out on the dimensionality reduction features of the nominal item set and the candidate entity item set, and all entities in the candidate entity item set are ranked according to scores obtained by the similarity calculation;
in this step, the calculation formula of the similarity is as follows:
sim(x, y) = (x · y) / (‖x‖ ‖y‖)
wherein x and y respectively represent the entity vector representations of different word vectors, the entity vector representations including surface features and deep features. All entities are ranked according to their similarity scores; the higher the score, the earlier the corresponding entity is ranked.
S1325: determining a set of entities corresponding to the entity designations based on the results of the ranking, the entities in the set of entities being the standard entity designations.
In this step, the entity set may be formed by taking a preset number of entities with the highest similarity scores; the entities in the entity set are referred to as the standard entity designations. The preset number may be set as required; in the present invention it may be set to 5.
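Steps S1324 and S1325 can be sketched together, for illustration only (the toy candidate vectors and function name are assumptions, not part of the disclosure), as cosine-similarity scoring followed by top-k selection:

```python
import numpy as np

def rank_candidates(mention_vec, candidates, top_k=5):
    """Score each candidate entity's dimension-reduction feature against
    the mention vector by cosine similarity and keep the top_k
    highest-scoring entities as the standard entity designations
    (top_k = 5 in the text above)."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    scored = sorted(candidates.items(),
                    key=lambda kv: cos(mention_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:top_k]]
```

Sorting before truncation gives both the full ranking (for evidence inspection) and the final entity set in one pass.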
The two most important steps in the above entity linking technique are entity recall and entity ranking. Entity recall, namely the generation of the candidate entity set, recalls from the knowledge base as many entities as possible that are related to the designation items already present in the text of the document to be detected, and this process requires a high recall rate. Specifically, word vectors may be trained on the text, and the cosine similarity between the word vector of a term and the other word vectors in the text may be calculated; for example, a threshold of about 0.56 may be set, and any term whose similarity exceeds the threshold is treated as a synonym of that term.
Entity ranking mainly ranks the candidate entity item set using different kinds of evidence to obtain the most probable entities. For example, in the process of extracting the dimension-reduction features of the nominal item set and the candidate entity item set: the Word2Vec values of all entities in the nominal item set and the candidate entity item set are acquired; based on the Word2Vec value, the TF-IDF value of the entity corresponding to the Word2Vec value is acquired; and the TF-IDF value, taken as a weight, is multiplied by the word vector of the entity to obtain the dimension-reduction features of the entities in the nominal item set and the candidate entity item set.
Here, Word2Vec represents the word vector corresponding to an entity, the TF-IDF value represents the term frequency-inverse document frequency, and the formula multiplying the TF-IDF value, as a weight, by the word vector of the entity is expressed as:
doc_emb = Σ_i TF-IDF(word_i) · Word2Vec(word_i)
wherein doc_emb represents the dimension-reduction feature of the nominal item set / candidate entity item set, word_i represents the ith entity in the nominal item set / candidate entity item set, TF-IDF represents the TF-IDF value of the ith entity, and Word2Vec represents the Word2Vec word vector of the ith entity.
A category label and standard entity designation set obtaining unit 240, configured to obtain a category label and a standard entity designation set corresponding to the target document based on the standard entity designation and the category label set;
a document knowledge context forming unit 250 for forming a document knowledge context corresponding to the query information based on the category label and the set of standard entity designations.
In the above-mentioned units 240 and 250, after the entity designations of the target documents in the target document range are determined, the corresponding standard entity designations can be determined based on all entity designations of the target documents. Further, according to the category labels of the target documents corresponding to the user's query information, all category labels and all standard entity designations corresponding to those documents can be obtained from the category label set formed from the documents to be detected, thereby forming the category label (set) and standard entity designation set corresponding to the target documents.
Furthermore, the category labels can be organized hierarchically together with the standard entity designation set. For example, the category labels may include first-level, second-level, and third-level labels, with standard entity designations classified under each label level: four entity designations under a first-level label, several entity designations under a second-level label, and so on, until all category labels and standard entity designations have been classified, forming the document knowledge context corresponding to the query information.
It should be noted that the above classification of the category labels may be performed according to a preset rule or a conventional label classification manner; the division manner is not particularly limited here.
It should be noted that, for the embodiment of the document knowledge context generation apparatus, reference may be made to the description in the embodiment of the document knowledge context generation method, which is not repeated here.
Fig. 3 is a schematic structural diagram of an electronic device implementing the document context of knowledge generation method according to the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a document knowledge context generating program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as a code of a document knowledge context generating program, but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (e.g., document knowledge context generation programs, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The document knowledge context generation program 12 stored by the memory 11 in the electronic device 1 is a combination of instructions that, when executed in the processor 10, may implement:
classifying labels of documents to be detected, and acquiring a class label set corresponding to the documents to be detected;
acquiring a target document range corresponding to the query information in the documents to be detected based on query information pre-acquired from the user and the category label set; meanwhile,
performing entity extraction on the target documents in the target document range to obtain all standard entity designations in the target documents;
acquiring a class label and a standard entity name set corresponding to the target document based on the standard entity name and the class label set;
forming a document knowledge context corresponding to the query information based on the category labels and the set of standard entity designations.
Optionally, the step of obtaining all standard entity designations in the target document includes:
acquiring all entity designations corresponding to the target document based on a pre-trained entity recognition model;
and linking the entity names to a standard map based on an entity linking technology, and acquiring standard entity names corresponding to the entity names.
Optionally, the step of obtaining a standard entity name corresponding to the entity name includes:
acquiring a synonymous information item corresponding to the entity designation based on the entity designation, and determining a designation item set based on the entity designation and the synonymous information item;
searching a candidate entity item set corresponding to the nominal item set in a preset knowledge base on the basis of the nominal item set;
respectively extracting dimension reduction characteristics of the nominal item set and the candidate entity item set;
similarity calculation is carried out on the dimensionality reduction features of the nominal item set and the candidate entity item set, and all entities in the candidate entity item set are ranked according to scores obtained by the similarity calculation;
determining a set of entities corresponding to the entity designations based on the results of the ranking, the entities in the set of entities being the standard entity designations.
Optionally, the separately extracting the dimension reduction features of the nominal item set and the candidate entity item set includes:
acquiring Word2Vec values of all entities in the nominal item set and the candidate entity item set;
based on the Word2Vec value, obtaining a TF-IDF value of the entity corresponding to the Word2Vec value;
multiplying the TF-IDF value as a weight by the word vector of the entity to obtain the dimension reduction characteristics of the nominal item set and the candidate entity item set.
Optionally, the step of classifying the tags of the document to be detected and acquiring the class tag set corresponding to the document to be detected includes:
acquiring document data with classification labels as a training data set;
training an MLG-Bert model based on the training data until the MLG-Bert model converges to a preset range to form a document classification model;
and acquiring a category label set corresponding to the document to be detected based on the document classification model.
Optionally, the formula for multiplying the TF-IDF value as a weight by the word vector of the entity is represented as:
doc_emb = Σ_i TF-IDF(word_i) · Word2Vec(word_i)
wherein word_i represents an entity, TF-IDF represents the entity's TF-IDF value, and Word2Vec represents the entity's Word2Vec value.
Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The computer-readable storage medium has stored therein at least one instruction that is executed by a processor in an electronic device to implement the document knowledge context generation method described above.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A document context of knowledge generation method, the method comprising:
classifying labels of documents to be detected, and acquiring a class label set corresponding to the documents to be detected;
acquiring query information, and acquiring a target document range corresponding to the query information in the document to be detected based on the query information and the category label set;
performing entity extraction on the target documents in the target document range to obtain all standard entity designations in the target documents;
acquiring a class label and a standard entity name set corresponding to the target document based on the standard entity name and the class label set;
forming a document knowledge context corresponding to the query information based on the category labels and the set of standard entity designations.
2. The document context of knowledge generation method of claim 1, wherein the step of obtaining all standard entity designations in the target document comprises:
acquiring all entity designations corresponding to the target document based on a pre-trained entity recognition model;
and linking the entity names to a standard map based on an entity linking technology, and acquiring standard entity names corresponding to the entity names.
3. The document context of knowledge generation method of claim 2, wherein the step of obtaining a standard entity designation corresponding to the entity designation comprises:
acquiring a synonymous information item corresponding to the entity designation based on the entity designation, and determining a designation item set based on the entity designation and the synonymous information item;
searching a candidate entity item set corresponding to the nominal item set in a preset knowledge base on the basis of the nominal item set;
respectively extracting dimension reduction characteristics of the nominal item set and the candidate entity item set;
similarity calculation is carried out on the dimensionality reduction features of the nominal item set and the candidate entity item set, and all entities in the candidate entity item set are ranked according to scores obtained by the similarity calculation;
determining a set of entities corresponding to the entity designations based on the results of the ranking, the entities in the set of entities being the standard entity designations.
4. The document context of knowledge generation method of claim 3, wherein the extracting dimension-reducing features of the set of named items and the set of candidate entity items, respectively, comprises:
acquiring Word2Vec values of all entities in the nominal item set and the candidate entity item set;
based on the Word2Vec value, obtaining a TF-IDF value of the entity corresponding to the Word2Vec value;
multiplying the TF-IDF value as a weight by the word vector of the entity to obtain the dimension reduction characteristics of the nominal item set and the candidate entity item set.
5. The document context of knowledge generation method according to claim 1, wherein the step of performing label classification on the document to be detected and obtaining a class label set corresponding to the document to be detected comprises:
acquiring document data with classification labels as a training data set;
training an MLG-Bert model based on the training data until the MLG-Bert model converges to a preset range to form a document classification model;
and acquiring a category label set corresponding to the document to be detected based on the document classification model.
6. The document knowledge context generating method according to claim 4, wherein the formula for multiplying the TF-IDF values as weights by the word vectors of the entities is expressed as:
doc_emb = Σ_i TF-IDF(word_i) · Word2Vec(word_i)
wherein doc_emb represents the dimension-reduction feature of the nominal item set / candidate entity item set, word_i represents the ith entity in the nominal item set / candidate entity item set, TF-IDF represents the TF-IDF value of the ith entity, and Word2Vec represents the Word2Vec word vector of the ith entity.
7. An apparatus for document context of knowledge generation, the apparatus comprising:
the system comprises a category label set acquisition unit, a classification unit and a classification unit, wherein the category label set acquisition unit is used for classifying labels of documents to be detected and acquiring a category label set corresponding to the documents to be detected;
the target document range acquiring unit is used for acquiring query information and acquiring a target document range corresponding to the query information in the document to be detected based on the query information and the category label set;
a standard entity designation acquiring unit, configured to perform entity extraction on the target documents within the target document range to acquire all standard entity designations in the target documents;
a category label and standard entity designation set acquisition unit configured to acquire a category label and a standard entity designation set corresponding to the target document based on the standard entity designation and the category label set;
a document knowledge context forming unit for forming a document knowledge context corresponding to the query information based on the category label and the standard entity designation set.
8. The document knowledge context generation apparatus according to claim 7, wherein the step of acquiring all standard entity designations in the target documents comprises:
acquiring all entity designations corresponding to the target documents based on a pre-trained entity recognition model;
and linking the entity designations to a standard map based on an entity linking technique, so as to acquire the standard entity designations corresponding to the entity designations.
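The linking step in claim 8 normalizes raw mentions to canonical names. As a minimal sketch, the hypothetical `ALIAS_MAP` below stands in for the standard map, and a dictionary lookup stands in for the entity linking model; real systems would add candidate generation and disambiguation:

```python
# hypothetical alias table standing in for the standard map
ALIAS_MAP = {
    "aml": "acute myeloid leukemia",
    "acute myelogenous leukemia": "acute myeloid leukemia",
}

def link_to_standard(mentions):
    # Map raw entity designations to standard designations;
    # unknown mentions fall back to the raw form unchanged.
    return [ALIAS_MAP.get(m.strip().lower(), m) for m in mentions]

linked = link_to_standard(["AML", "BRCA1"])
```

The fallback matters in practice: a mention absent from the standard map should survive as-is rather than be dropped, so downstream context construction still sees it.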
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the document knowledge context generation method according to any one of claims 1 to 6.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the document knowledge context generation method according to any one of claims 1 to 6.
CN202110480081.XA 2021-04-30 2021-04-30 Literature knowledge context generation method, device and storage medium Active CN113076432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110480081.XA CN113076432B (en) 2021-04-30 2021-04-30 Literature knowledge context generation method, device and storage medium


Publications (2)

Publication Number Publication Date
CN113076432A true CN113076432A (en) 2021-07-06
CN113076432B CN113076432B (en) 2024-05-03

Family

ID=76616126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110480081.XA Active CN113076432B (en) 2021-04-30 2021-04-30 Literature knowledge context generation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN113076432B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156286A (en) * 2016-06-24 2016-11-23 广东工业大学 Type extraction system and method towards technical literature knowledge entity
US20180173814A1 (en) * 2016-12-16 2018-06-21 Palantir Technologies Inc. Data item aggregate probability analysis system
CN109241278A (en) * 2018-07-18 2019-01-18 绍兴诺雷智信息科技有限公司 Scientific research knowledge management method and system
CN110457491A (en) * 2019-08-19 2019-11-15 中国农业大学 A kind of knowledge mapping reconstructing method and device based on free state node
CN111382276A (en) * 2018-12-29 2020-07-07 中国科学院信息工程研究所 Event development venation map generation method
CN111428036A (en) * 2020-03-23 2020-07-17 浙江大学 Entity relationship mining method based on biomedical literature


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792117A (en) * 2021-08-30 2021-12-14 北京百度网讯科技有限公司 Method and device for determining data update context, electronic equipment and storage medium
CN113792117B (en) * 2021-08-30 2024-02-20 北京百度网讯科技有限公司 Method and device for determining data update context, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113076432B (en) 2024-05-03

Similar Documents

Publication Publication Date Title
US11657231B2 (en) Capturing rich response relationships with small-data neural networks
Arguello Casteleiro et al. Deep learning meets ontologies: experiments to anchor the cardiovascular disease ontology in the biomedical literature
CN108038725A (en) A kind of electric business Customer Satisfaction for Product analysis method based on machine learning
Ling et al. Integrating extra knowledge into word embedding models for biomedical NLP tasks
US20150032747A1 (en) Method for systematic mass normalization of titles
CN104516902A (en) Semantic information acquisition method and corresponding keyword extension method and search method
JP2009093649A (en) Recommendation for term specifying ontology space
KR20190038243A (en) System and method for retrieving documents using context
CN106708929B (en) Video program searching method and device
CN113378970B (en) Sentence similarity detection method and device, electronic equipment and storage medium
Bales et al. Bibliometric visualization and analysis software: State of the art, workflows, and best practices
CN112906377A (en) Question answering method and device based on entity limitation, electronic equipment and storage medium
He et al. Biological entity recognition with conditional random fields
CN111368555B (en) Data identification method and device, storage medium and electronic equipment
CN115713078A (en) Knowledge graph construction method and device, storage medium and electronic equipment
CN106570196B (en) Video program searching method and device
de Ves et al. Modeling user preferences in content-based image retrieval: A novel attempt to bridge the semantic gap
CN113065355B (en) Professional encyclopedia named entity identification method, system and electronic equipment
CN113076432B (en) Literature knowledge context generation method, device and storage medium
Ding et al. Leveraging text and knowledge bases for triple scoring: an ensemble approach-the Bokchoy triple scorer at WSDM Cup 2017
Rahaman Discovering new trends & connections: current applications of biomedical text mining
CN112560427A (en) Problem expansion method, device, electronic equipment and medium
Zhang et al. Enhancing clinical decision support systems with public knowledge bases
Yeganova et al. Retro: concept-based clustering of biomedical topical sets
CN113590845B (en) Knowledge graph-based document retrieval method and device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant