CN110674306B - Knowledge graph construction method and device and electronic equipment

Knowledge graph construction method and device and electronic equipment

Info

Publication number: CN110674306B
Authority: CN (China)
Prior art keywords: word, words, sequence, sentence, preset
Legal status: Active
Application number: CN201810620223.6A
Other languages: Chinese (zh)
Other versions: CN110674306A (en)
Inventors: 郑萌, 耿璐, 李岚
Current Assignee: Hitachi Ltd
Original Assignee: Hitachi Ltd
Application filed by Hitachi Ltd
Priority application: CN201810620223.6A
Publication of application: CN110674306A
Publication of grant: CN110674306B


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a knowledge graph construction method and device and an electronic device, belonging to the technical field of artificial intelligence. The knowledge graph construction method comprises the following steps: performing word segmentation and syntactic dependency analysis on each sentence in a text to be processed to obtain a word segmentation result and a word sequence library; screening out frequent sequences whose length is greater than a preset first threshold from the word sequence library; merging the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and updating the word segmentation result; building near-synonym groups from the updated word segmentation result and updating the word sequence library accordingly; and calculating the variant confidence between words in a word sequence and judging the upper and lower concepts (hypernym-hyponym relations) between words from the calculation result, where the variant confidence represents the correlation between words or word sequences within a word sequence. The method can accurately and effectively extract concepts and upper-lower relations from domain text that contains no explicit definitions.

Description

Knowledge graph construction method and device and electronic equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a knowledge graph construction method, a knowledge graph construction device and electronic equipment.
Background
Knowledge graph construction is an important component of natural language processing and machine learning. Most current knowledge graph construction methods extract text from the Internet, discover concepts in the text, and judge the upper-lower (hypernym-hyponym) relations between them. Existing methods often rely on preset sentence patterns when extracting such relations, for example "deep learning is one of the machine learning methods" or "Word is Microsoft's office software specialized for word processing". Such sentence patterns appear in large numbers in corpora such as specifications and encyclopedias. In real life, however, there are also many scenarios with no text, such as a description, that explicitly defines entity concepts. For a relatively complex device, for example, the manual typically does not give the user a detailed definition of each part or state that part A is a component of part B. In addition, a large amount of domain text, such as customer service records and maintenance records, is usually written concisely on the assumption that the reader already has strong domain knowledge, so the entity concepts involved are never defined. In these cases, existing knowledge graph construction methods cannot accurately and effectively extract concepts and upper-lower relations from such non-definitional domain text, i.e., text that never explicitly defines the concepts it mentions.
Disclosure of Invention
The technical problem the invention aims to solve is to provide a knowledge graph construction method and device and an electronic device that can accurately and effectively extract concepts and upper-lower relations from non-definitional domain text.
To solve the above technical problem, the embodiments of the invention provide the following technical solutions:
in one aspect, a method for constructing a knowledge graph is provided, including:
performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library;
screening out frequent sequences whose length is greater than a preset first threshold from the word sequence library, and calculating the frequency and lift of each frequent sequence, where the frequency represents the probability that the frequent sequence appears in the word sequence library and the lift represents the correlation between the words in the frequent sequence;
merging the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and updating the word segmentation result according to the newly added words;
building near-synonym groups from the updated word segmentation result, and, according to the groups, replacing each word in the word sequence library with the most frequent word of its near-synonym group;
and obtaining, from the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculating the variant confidence between the words of each such word sequence, and judging the upper and lower concepts between words from the calculation result, where the variant confidence represents the correlation between the words or sub-sequences within a word sequence.
Further, the performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library includes:
performing word segmentation on each sentence in the text to be processed to obtain a word segmentation result;
and performing syntactic dependency analysis on each sentence in the text to be processed based on the word segmentation result, and correcting the word segmentation result according to the analysis result to obtain at least one group of word sequences for each sentence, thereby obtaining a word sequence library comprising the word sequences of all sentences.
Further, the correcting the word segmentation result according to the syntactic dependency analysis result to obtain at least one group of word sequences for each sentence includes:
when the head word (center word) of the sentence is a noun, determining the head word, recursively finding all attributive (centering-relation) modifiers of the head word, and generating a word sequence comprising the head word and all of its attributive modifiers;
when the head word of the sentence is a verb or an adjective, judging whether the sentence has a subject-predicate structure; when it does, determining the subject noun of the subject-predicate structure, recursively finding all attributive modifiers of the subject noun, and generating a word sequence comprising the subject noun and all of its attributive modifiers; when the sentence has no subject-predicate structure but has a verb-object structure, determining the object noun of the verb-object structure, recursively finding all attributive modifiers of the object noun, and generating a word sequence comprising the object noun and all of its attributive modifiers;
when the head word of the sentence is neither a noun, a verb nor an adjective, determining all attributive relations in the sentence, selecting the noun with the most modifiers, recursively finding all attributive modifiers of that noun, and generating a word sequence comprising the noun and all of its attributive modifiers.
Further, when a frequent sequence includes a word A and a word B, its lift is lift(A, B) = P(B|A) / P(B), where P(B) is the proportion of 2-tuples containing B among all 2-tuples, and P(B|A) is the proportion of 2-tuples containing A in which B appears; here a 2-tuple is a word sequence of length 2 in the word sequence library.
Further, the building of near-synonym groups from the updated word segmentation result includes the following steps:
generating word vectors from the updated word segmentation result;
and calculating the cosine similarity between every pair of words based on the generated word vectors, and establishing all near-synonym groups based on a preset similarity threshold s.
Further, the establishing all near-synonym groups includes:
sorting all words in the updated word segmentation result by word frequency;
establishing the near-synonym group of each word in turn, from high word frequency to low;
wherein establishing the near-synonym group of a word includes:
calculating the similarity between the word and every other word;
building a set from the words whose similarity to the word is greater than the threshold s, and ordering the set by that similarity;
and judging, in descending order of similarity, whether the similarity between each word in the set and the other words in the set is greater than the threshold s, and if so, adding the judged word to the near-synonym group of the word.
Further, the judging the upper and lower concepts between words from the calculation result includes:
calculating the variant confidence between the rightmost word of the word sequence and the word or sub-sequence to its left, and, if the variant confidence is lower than a preset fifth threshold, judging the word or sub-sequence on the left to be the upper concept of the rightmost word.
Further, when the word sequence includes a word or sub-sequence M and a word or sub-sequence N, the variant confidence P(N|M) is the proportion, among all multi-word sequences containing M on the left, of those that contain N on the right.
Further, the preset first threshold is not smaller than 2, and the preset fourth threshold is 2 or 3.
The embodiment of the invention also provides a knowledge graph construction device, which comprises:
an analysis module, used for performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library;
a first processing module, used for screening out frequent sequences whose length is greater than a preset first threshold from the word sequence library and calculating the frequency and lift of each frequent sequence, where the frequency represents the probability that the frequent sequence appears in the word sequence library and the lift represents the correlation between the words in the frequent sequence;
a first updating module, used for merging the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words and updating the word segmentation result according to the newly added words;
a second updating module, used for building near-synonym groups from the updated word segmentation result and, according to the groups, replacing each word in the word sequence library with the most frequent word of its near-synonym group;
and a second processing module, used for obtaining, from the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculating the variant confidence between the words of each such word sequence, and judging the upper and lower concepts between words from the calculation result, where the variant confidence represents the correlation between the words or sub-sequences within a word sequence.
The embodiment of the invention also provides an electronic device for constructing a knowledge graph, comprising:
a processor; and
a memory in which computer program instructions are stored,
wherein the computer program instructions, when executed by the processor, cause the processor to perform the steps in the knowledge graph construction method as described above.
The embodiment of the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the knowledge graph construction method described above.
The embodiment of the invention has the following beneficial effects:
according to the technical scheme, the word sequence library is obtained by correcting the word segmentation result by utilizing the analysis result of the syntactic dependency relationship, the word segmentation result is updated according to the correlation among the words in the frequent sequences meeting the conditions in the word sequence library, the hyponym combination is established according to the updated word segmentation result, the words in the word sequence library are replaced by the words with the highest frequency in the same hyponym combination according to the hyponym combination, the confidence level of the variants among the words in the word sequence library meeting the requirements is calculated, and the upper and lower concepts among the words are judged according to the calculation result.
Drawings
FIG. 1 is a schematic flow chart of a knowledge graph construction method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a knowledge graph construction device according to an embodiment of the present invention;
FIG. 3 is a block diagram of an electronic device for constructing a knowledge graph in accordance with an embodiment of the present invention;
FIG. 4 is a flow chart of a knowledge graph construction method according to an embodiment of the present invention;
FIG. 5 is a flow chart illustrating a syntactic dependency analysis according to an embodiment of the present invention;
FIG. 6 is a flow chart of establishing near-synonym groups according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions, and the advantages of the embodiments of the present invention more apparent, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
The embodiments of the invention provide a knowledge graph construction method and device and an electronic device, which can accurately and effectively extract concepts and upper-lower relations from non-definitional domain text.
Example 1
The embodiment of the invention provides a knowledge graph construction method which, as shown in FIG. 1, comprises the following steps:
Step 101: performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library;
In this step, each sentence may first be segmented; syntactic dependency analysis may then be performed on each sentence based on the segmentation result, and the segmentation result may be corrected according to the analysis result, so as to obtain a word sequence library comprising the word sequences of all sentences.
Step 102: screening out frequent sequences whose length is greater than a preset first threshold from the word sequence library, and calculating the frequency and lift of each frequent sequence, where the frequency represents the probability that the frequent sequence appears in the word sequence library and the lift represents the correlation between the words in the frequent sequence;
A very short word sequence is of no analytical value, so frequent sequences whose length is greater than a preset first threshold are screened out. The preset first threshold is not smaller than 2; specifically, it may be 2, i.e., all frequent sequences of length greater than 2 are screened out.
When a frequent sequence includes a word A and a word B, the lift of the frequent sequence (A, B) is lift(A, B) = P(B|A) / P(B), where P(B) is the proportion of 2-tuples containing B among all 2-tuples, and P(B|A) is the proportion of 2-tuples containing A in which B appears; here a 2-tuple is a word sequence of length 2 in the word sequence library. The greater the lift, the stronger the correlation between word A and word B.
The frequency is either the number of times the frequent sequence occurs or its percentage of the total count of all frequent sequences. A sketch of these statistics, together with the merging rule of step 103 below, is given next.
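The following minimal Python sketch is an illustrative aid, not the patent's implementation: it computes the 2-tuple statistics of step 102 and applies the merging rule of step 103 below. The threshold values, the tuple representation, and the spelling of the merged token are all assumptions.

```python
from collections import Counter

def tuple_lift(pairs, a, b):
    """lift(A, B) = P(B|A) / P(B) over the library of 2-tuples.

    P(B) is the share of 2-tuples containing B; P(B|A) is the share of
    2-tuples containing A in which B also appears, as defined above.
    """
    total = len(pairs)
    with_b = sum(1 for t in pairs if b in t)
    with_a = [t for t in pairs if a in t]
    if total == 0 or not with_a or with_b == 0:
        return 0.0
    p_b = with_b / total
    p_b_given_a = sum(1 for t in with_a if b in t) / len(with_a)
    return p_b_given_a / p_b

def merge_new_words(pairs, lift_thresh=3.0, freq_thresh=5):
    """Merge 2-tuples passing both the lift (second) and frequency (sixth)
    thresholds into newly added words; returns {(A, B): merged_token}."""
    merged = {}
    for (a, b), n in Counter(pairs).items():
        if n > freq_thresh and tuple_lift(pairs, a, b) > lift_thresh:
            merged[(a, b)] = a + b  # Chinese tokens would simply concatenate
    return merged

# Hypothetical data: "DNS" is almost always followed by "server".
pairs = [("DNS", "server")] * 8 + [("display", "interface")] * 20
print(merge_new_words(pairs))  # {('DNS', 'server'): 'DNSserver'}
```

In this toy data, "server" almost only occurs after "DNS", so P(server|DNS) far exceeds P(server) and the pair is merged; "interface" is common everywhere, so its lift stays near 1 and it is left alone.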
Step 103: merging the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and updating the word segmentation result according to the newly added words;
Step 104: building near-synonym groups from the updated word segmentation result, and, according to the groups, replacing each word in the word sequence library with the most frequent word of its near-synonym group;
Specifically, word vectors can be generated from the updated word segmentation result, the cosine similarity between every pair of words can be calculated from the word vectors, and all near-synonym groups can be found based on a similarity threshold s.
When all the near-synonym groups are established, all words in the updated word segmentation result can be sorted by word frequency, and the near-synonym group of each word is then established in turn, from high word frequency to low. When establishing the near-synonym group of a word: the similarity between the word and every other word is calculated; a set is built from the words whose similarity to the word is greater than the threshold s and is ordered by that similarity; then, in descending order of similarity, it is judged whether the similarity between each word in the set and the other words in the set is greater than the threshold s, and if so, the judged word is added to the near-synonym group of the word. A sketch of this grouping follows.
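A minimal sketch of the grouping, assuming the word vectors come from any embedding model trained on the updated segmentation (word2vec, for example); the mutual-similarity admission rule is one reasonable reading of the text above, and the value s = 0.8 is an assumption.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_synonym_groups(vectors, word_freq, s=0.8):
    """vectors: word -> embedding vector; word_freq: word -> corpus count."""
    grouped = set()          # words already assigned to a group
    groups = []
    for word in sorted(word_freq, key=word_freq.get, reverse=True):
        if word in grouped or word not in vectors:
            continue
        # candidate near-synonyms: similarity to `word` above s, best first
        cands = sorted(((w, cosine(vectors[word], vectors[w]))
                        for w in vectors if w != word and w not in grouped),
                       key=lambda c: c[1], reverse=True)
        group = [word]
        for w, sim in cands:
            if sim <= s:
                break        # the list is sorted, so no later candidate qualifies
            # admit w only if it also clears s against every current member
            if all(cosine(vectors[w], vectors[m]) > s for m in group):
                group.append(w)
        if len(group) > 1:
            groups.append(group)
            grouped.update(group)
    return groups
```

Each word in the word sequence library is then replaced by the most frequent member of its group, which normalizes abbreviations and spelling variants before the analysis of step 105.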
Step 105: obtaining, from the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculating the variant confidence between the words of each such word sequence, and judging the upper and lower concepts between words from the calculation result, where the variant confidence represents the correlation between the words or sub-sequences within a word sequence.
In this embodiment, only the higher-frequency 2-tuples and 3-tuples in the word sequence library need to be analyzed; word sequences of other lengths are of little analytical value, i.e., the preset fourth threshold is 2 or 3.
Specifically, when judging the upper and lower concepts between words from the calculation result, the variant confidence between the rightmost word of the word sequence and the word or sub-sequence to its left is calculated, and if the variant confidence is lower than a preset fifth threshold, the word or sub-sequence on the left is judged to be the upper concept of the rightmost word.
When the word sequence includes a word or sub-sequence M and a word or sub-sequence N, the variant confidence P(N|M) is the proportion, among all multi-word sequences containing M on the left, of those that contain N on the right. A sketch of this judgment follows.
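A minimal sketch of the upper-concept judgment under the reading above; the threshold values and the list-of-sequences representation are assumptions.

```python
from collections import Counter

def variant_confidence(sequences, m, n):
    """P(N|M): among multi-word sequences whose left part equals M, the share
    whose next word is N. `m` is a tuple of words, `n` a single word."""
    k = len(m)
    left_matches = [s for s in sequences if len(s) > k and s[:k] == m]
    if not left_matches:
        return 0.0
    return sum(1 for s in left_matches if s[k] == n) / len(left_matches)

def judge_upper_concepts(sequences, freq_thresh=3, conf_thresh=0.5, length=2):
    """For each frequent sequence of the preset (fourth-threshold) length, if
    the rightmost word's variant confidence given the left part falls below
    the fifth threshold, the left part is judged an upper concept of it."""
    seqs = [tuple(s) for s in sequences]
    relations = []
    for seq, count in Counter(s for s in seqs if len(s) == length).items():
        if count <= freq_thresh:
            continue
        left, right = seq[:-1], seq[-1]
        if variant_confidence(seqs, left, right) < conf_thresh:
            relations.append((left, right))  # left = upper concept of right
    return relations

# Hypothetical data: many different words follow "DNS server", so each gets a
# low variant confidence and "DNS server" is judged their upper concept.
seqs = ([["DNS server", "power"]] * 4 + [["DNS server", "fan"]] * 4
        + [["DNS server", "disk"]] * 4)
print(judge_upper_concepts(seqs))
```

The intuition: a low confidence means many different words can follow the left part, so the left part names a general concept with many subordinate parts rather than a fixed phrase.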
In this embodiment, each sentence of the text to be processed is not only segmented but also subjected to syntactic dependency analysis, so the analysis result can be used to correct the word segmentation result and obtain a word sequence library. The word segmentation result is then updated according to the correlation between the words of the qualifying frequent sequences in the library; near-synonym groups are built from the updated word segmentation result, and each word in the library is replaced with the most frequent word of its group; finally, the variant confidence between the words of the qualifying word sequences is calculated, and the upper and lower concepts between words are judged from the result. In this way, concepts and upper-lower relations can be extracted accurately and effectively from non-definitional domain text.
In a specific example, the performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain the word segmentation result and the word sequence library includes:
performing word segmentation on each sentence in the text to be processed to obtain a word segmentation result;
and performing syntactic dependency analysis on each sentence in the text to be processed based on the word segmentation result, and correcting the word segmentation result according to the analysis result to obtain at least one group of word sequences for each sentence, thereby obtaining a word sequence library comprising the word sequences of all sentences.
In a specific example, correcting the word segmentation result according to the syntactic dependency analysis result to obtain at least one word sequence for each sentence includes:
when the head word of the sentence is a noun, determining the head word, recursively finding all attributive modifiers of the head word, and generating a word sequence comprising the head word and all of its attributive modifiers;
when the head word of the sentence is a verb or an adjective, judging whether the sentence has a subject-predicate structure; when it does, determining the subject noun, recursively finding all attributive modifiers of the subject noun, and generating a word sequence comprising the subject noun and all of its attributive modifiers; when the sentence has no subject-predicate structure but has a verb-object structure, determining the object noun, recursively finding all attributive modifiers of the object noun, and generating a word sequence comprising the object noun and all of its attributive modifiers;
when the head word of the sentence is neither a noun, a verb nor an adjective, determining all attributive relations in the sentence, selecting the noun with the most modifiers, recursively finding all attributive modifiers of that noun, and generating a word sequence comprising the noun and all of its attributive modifiers. A sketch of these three cases is given below.
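The three cases can be sketched as follows, assuming the dependency parse is available as a token list whose labels follow the LTP convention (ATT for the attributive, i.e. centering, relation; SBV for subject-predicate; VOB for verb-object); the Token layout and tag set are assumptions, not the patent's interface.

```python
from dataclasses import dataclass

@dataclass
class Token:
    word: str
    pos: str   # "n" = noun, "v" = verb, "a" = adjective (assumed tag set)
    head: int  # index of the syntactic head; -1 marks the sentence root
    rel: str   # dependency label: "ATT", "SBV", "VOB", ... (LTP-style)

def att_modifiers(tokens, i):
    """Recursively collect the indices of all attributive (ATT) modifiers of token i."""
    out = []
    for j, t in enumerate(tokens):
        if t.head == i and t.rel == "ATT":
            out.extend(att_modifiers(tokens, j))
            out.append(j)
    return out

def noun_sequence(tokens, i):
    """Word sequence = the noun plus its recursive attributive modifiers, in sentence order."""
    idxs = sorted(att_modifiers(tokens, i) + [i])
    return [tokens[k].word for k in idxs]

def sentence_sequences(tokens):
    root = next(j for j, t in enumerate(tokens) if t.head == -1)
    if tokens[root].pos == "n":                       # case 1: nominal head word
        return [noun_sequence(tokens, root)]
    if tokens[root].pos in ("v", "a"):                # case 2: verb/adjective head word
        subj = [j for j, t in enumerate(tokens) if t.head == root and t.rel == "SBV"]
        if subj:                                      # subject-predicate structure
            return [noun_sequence(tokens, j) for j in subj]
        obj = [j for j, t in enumerate(tokens) if t.head == root and t.rel == "VOB"]
        if obj:                                       # verb-object structure
            return [noun_sequence(tokens, j) for j in obj]
        return []
    # case 3: otherwise pick the noun with the most modifiers (ties: the last one)
    nouns = [j for j, t in enumerate(tokens) if t.pos == "n"]
    if not nouns:
        return []
    best = max(nouns, key=lambda j: (len(att_modifiers(tokens, j)), j))
    return [noun_sequence(tokens, best)]

# "display interface failure": the head word "failure" is a noun (case 1)
sent = [Token("display", "n", 2, "ATT"), Token("interface", "n", 2, "ATT"),
        Token("failure", "n", -1, "HED")]
print(sentence_sequences(sent))  # [['display', 'interface', 'failure']]
```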
Example 2
The embodiment of the invention also provides a knowledge graph construction device which, as shown in FIG. 2, comprises:
an analysis module 21, used for performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library;
a first processing module 22, used for screening out frequent sequences whose length is greater than a preset first threshold from the word sequence library and calculating the frequency and lift of each frequent sequence, where the frequency represents the probability that the frequent sequence appears in the word sequence library and the lift represents the correlation between the words in the frequent sequence;
a first updating module 23, used for merging the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words and updating the word segmentation result according to the newly added words;
a second updating module 24, used for building near-synonym groups from the updated word segmentation result and, according to the groups, replacing each word in the word sequence library with the most frequent word of its near-synonym group;
and a second processing module 25, used for obtaining, from the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculating the variant confidence between the words of each such word sequence, and judging the upper and lower concepts between words from the calculation result, where the variant confidence represents the correlation between the words or sub-sequences within a word sequence.
In this embodiment, each sentence of the text to be processed is not only segmented but also subjected to syntactic dependency analysis, so the analysis result can be used to correct the word segmentation result and obtain a word sequence library. The word segmentation result is then updated according to the correlation between the words of the qualifying frequent sequences in the library; near-synonym groups are built from the updated word segmentation result, and each word in the library is replaced with the most frequent word of its group; finally, the variant confidence between the words of the qualifying word sequences is calculated, and the upper and lower concepts between words are judged from the result. In this way, concepts and upper-lower relations can be extracted accurately and effectively from non-definitional domain text.
Further, the analysis module 21 includes:
a word segmentation unit, used for segmenting each sentence in the text to be processed to obtain a word segmentation result;
and a syntactic dependency analysis unit, used for performing syntactic dependency analysis on each sentence in the text to be processed based on the word segmentation result and correcting the word segmentation result according to the analysis result to obtain at least one group of word sequences for each sentence, thereby obtaining a word sequence library comprising the word sequences of all sentences.
Further, the syntactic dependency analysis unit is specifically configured to: when the head word of the sentence is a noun, determine the head word, recursively find all attributive modifiers of the head word, and generate a word sequence comprising the head word and all of its attributive modifiers; when the head word of the sentence is a verb or an adjective, judge whether the sentence has a subject-predicate structure, and when it does, determine the subject noun, recursively find all attributive modifiers of the subject noun, and generate a word sequence comprising the subject noun and all of its attributive modifiers; when the sentence has no subject-predicate structure but has a verb-object structure, determine the object noun, recursively find all attributive modifiers of the object noun, and generate a word sequence comprising the object noun and all of its attributive modifiers; and when the head word of the sentence is neither a noun, a verb nor an adjective, determine all attributive relations in the sentence, select the noun with the most modifiers, recursively find all attributive modifiers of that noun, and generate a word sequence comprising the noun and all of its attributive modifiers.
Further, when a frequent sequence includes a word A and a word B, its lift is lift(A, B) = P(B|A) / P(B), where P(B) is the proportion of 2-tuples containing B among all 2-tuples, and P(B|A) is the proportion of 2-tuples containing A in which B appears; here a 2-tuple is a word sequence of length 2 in the word sequence library.
Further, the second updating module 24 includes:
a word vector generation unit, used for generating word vectors from the updated word segmentation result;
and a near-synonym group generation unit, used for calculating the cosine similarity between every pair of words based on the generated word vectors and establishing all near-synonym groups based on a preset similarity threshold s.
Further, the near-synonym group generation unit is specifically configured to sort all words in the updated word segmentation result by word frequency and to establish the near-synonym group of each word in turn, from high word frequency to low.
When establishing the near-synonym group of a word, the near-synonym group generation unit is specifically configured to calculate the similarity between the word and every other word; build a set from the words whose similarity to the word is greater than the threshold s, ordered by that similarity; and judge, in descending order of similarity, whether the similarity between each word in the set and the other words in the set is greater than the threshold s, adding the judged word to the near-synonym group of the word if so.
Further, the second processing module 25 is specifically configured to calculate the variant confidence between the rightmost word of the word sequence and the word or sub-sequence to its left, and, if the variant confidence is lower than a preset fifth threshold, to judge the word or sub-sequence on the left to be the upper concept of the rightmost word.
Further, when the word sequence includes a word or sub-sequence M and a word or sub-sequence N, the variant confidence P(N|M) is the proportion, among all multi-word sequences containing M on the left, of those that contain N on the right.
Further, the preset first threshold is not smaller than 2, and the preset fourth threshold is 2 or 3.
Example 3
The embodiment of the invention also provides an electronic device 30 for constructing a knowledge graph which, as shown in FIG. 3, includes:
a processor 32; and
a memory 34 in which computer program instructions are stored,
wherein the computer program instructions, when executed by the processor, cause the processor 32 to perform the steps of:
performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library;
screening out frequent sequences whose length is greater than a preset first threshold from the word sequence library, and calculating the frequency and lift of each frequent sequence, where the frequency represents the probability that the frequent sequence appears in the word sequence library and the lift represents the correlation between the words in the frequent sequence;
merging the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and updating the word segmentation result according to the newly added words;
building near-synonym groups from the updated word segmentation result, and, according to the groups, replacing each word in the word sequence library with the most frequent word of its near-synonym group;
and obtaining, from the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculating the variant confidence between the words of each such word sequence, and judging the upper and lower concepts between words from the calculation result, where the variant confidence represents the correlation between the words or sub-sequences within a word sequence.
In this embodiment, each sentence of the text to be processed is not only segmented but also subjected to syntactic dependency analysis, so the analysis result can be used to correct the word segmentation result and obtain a word sequence library. The word segmentation result is then updated according to the correlation between the words of the qualifying frequent sequences in the library; near-synonym groups are built from the updated word segmentation result, and each word in the library is replaced with the most frequent word of its group; finally, the variant confidence between the words of the qualifying word sequences is calculated, and the upper and lower concepts between words are judged from the result. In this way, concepts and upper-lower relations can be extracted accurately and effectively from non-definitional domain text.
Further, as shown in FIG. 3, the electronic device 30 for constructing a knowledge graph further includes a network interface 31, an input device 33, a hard disk 35, and a display device 36.
The interfaces and devices described above may be interconnected by a bus architecture, which may comprise any number of interconnected buses and bridges connecting together one or more central processing units (CPUs), represented by the processor 32, and various circuits of one or more memories, represented by the memory 34. The bus architecture may also connect various other circuits, such as peripheral devices, voltage regulators, and power management circuits, and it enables communication between all of these connected components. In addition to a data bus, the bus architecture includes a power bus, a control bus, and a status signal bus, all of which are well known in the art and therefore not described in detail herein.
The network interface 31 may be connected to a network (e.g., the Internet or a local area network) to obtain relevant data from the network, for example the input text to be processed, such as non-definitional domain text, which may be stored in the hard disk 35.
The input device 33 may receive various instructions entered by an operator and send them to the processor 32 for execution. The input device 33 may comprise a keyboard or a pointing device (e.g., a mouse, a trackball, a touch pad, or a touch screen).
The display device 36 may display results from the execution of instructions by the processor 32.
The memory 34 is used to store the programs and data necessary for the operation of the operating system, as well as data such as intermediate results produced during the calculations of the processor 32.
It will be appreciated that the memory 34 in embodiments of the invention may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory, among others. The volatile memory may be random access memory (RAM), which acts as an external cache. The memory 34 of the apparatus and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory 34 stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof: an operating system 341 and application programs 342.
The operating system 341 includes various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks. The application programs 342 include various applications, such as a browser, for implementing various application services. A program implementing the method of the embodiment of the invention may be included in the application programs 342.
When calling and executing the application programs and data stored in the memory 34, the processor 32 may specifically: perform word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library; screen out frequent sequences whose length is greater than a preset first threshold from the word sequence library, and calculate the frequency and lift of each frequent sequence, where the frequency represents the probability that the frequent sequence appears in the word sequence library and the lift represents the correlation between the words in the frequent sequence; merge the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and update the word segmentation result according to the newly added words; build near-synonym groups from the updated word segmentation result and, according to the groups, replace each word in the word sequence library with the most frequent word of its near-synonym group; and obtain, from the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculate the variant confidence between the words of each such word sequence, and judge the upper and lower concepts between words from the calculation result, where the variant confidence represents the correlation between the words or sub-sequences within a word sequence.
The method disclosed in the above embodiment of the invention may be applied to, or implemented by, the processor 32. The processor 32 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be performed by integrated logic circuits in hardware in the processor 32 or by instructions in the form of software. The processor 32 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the invention. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the invention may be embodied directly as being executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or register. The storage medium is located in the memory 34, and the processor 32 reads the information in the memory 34 and completes the steps of the above method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Specifically, the processor 32 performs word segmentation on each sentence in the text to be processed to obtain a word segmentation result; performs syntactic dependency analysis on each sentence based on the word segmentation result; corrects the word segmentation result according to the analysis result to obtain at least one group of word sequences for each sentence; and obtains a word sequence library comprising the word sequences of all sentences.
Specifically, when the head word of the sentence is a noun, the processor 32 determines the head word, recursively finds all attributive modifiers of the head word, and generates a word sequence comprising the head word and all of its attributive modifiers; when the head word of the sentence is a verb or an adjective, it judges whether the sentence has a subject-predicate structure, and when it does, determines the subject noun, recursively finds all attributive modifiers of the subject noun, and generates a word sequence comprising the subject noun and all of its attributive modifiers; when the sentence has no subject-predicate structure but has a verb-object structure, it determines the object noun, recursively finds all attributive modifiers of the object noun, and generates a word sequence comprising the object noun and all of its attributive modifiers; and when the head word of the sentence is neither a noun, a verb nor an adjective, it determines all attributive relations in the sentence, selects the noun with the most modifiers, recursively finds all attributive modifiers of that noun, and generates a word sequence comprising the noun and all of its attributive modifiers.
Further, when a frequent sequence includes a word A and a word B, its lift is lift(A, B) = P(B|A) / P(B), where P(B) is the proportion of 2-tuples containing B among all 2-tuples, and P(B|A) is the proportion of 2-tuples containing A in which B appears; here a 2-tuple is a word sequence of length 2 in the word sequence library.
Specifically, the processor 32 generates word vectors from the updated word segmentation result, calculates the cosine similarity between every pair of words based on the generated word vectors, and establishes all near-synonym groups based on a preset similarity threshold s.
Specifically, the processor 32 sorts all words in the updated word segmentation result by word frequency and establishes the near-synonym group of each word in turn, from high word frequency to low.
Specifically, the processor 32 calculates the similarity between the word and every other word; builds a set from the words whose similarity to the word is greater than the threshold s, ordered by that similarity; and judges, in descending order of similarity, whether the similarity between each word in the set and the other words in the set is greater than the threshold s, adding the judged word to the near-synonym group of the word if so.
Specifically, the processor 32 calculates the variant confidence between the rightmost word of the word sequence and the word or sub-sequence to its left, and, if the variant confidence is lower than a preset fifth threshold, judges the word or sub-sequence on the left to be the upper concept of the rightmost word.
Further, when the word sequence includes a word or sub-sequence M and a word or sub-sequence N, the variant confidence P(N|M) is the proportion, among all multi-word sequences containing M on the left, of those that contain N on the right.
Further, the preset first threshold is not smaller than 2, and the preset fourth threshold is 2 or 3.
Example 4
The embodiment of the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library;
screening out frequent sequences whose length is greater than a preset first threshold from the word sequence library, and calculating the frequency and lift of each frequent sequence, where the frequency represents the probability that the frequent sequence appears in the word sequence library and the lift represents the correlation between the words in the frequent sequence;
merging the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and updating the word segmentation result according to the newly added words;
building near-synonym groups from the updated word segmentation result, and, according to the groups, replacing each word in the word sequence library with the most frequent word of its near-synonym group;
and obtaining, from the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculating the variant confidence between the words of each such word sequence, and judging the upper and lower concepts between words from the calculation result, where the variant confidence represents the correlation between the words or sub-sequences within a word sequence.
In this embodiment, each sentence of the text to be processed is not only segmented but also subjected to syntactic dependency analysis, so the analysis result can be used to correct the word segmentation result and obtain a word sequence library. The word segmentation result is then updated according to the correlation between the words of the qualifying frequent sequences in the library; near-synonym groups are built from the updated word segmentation result, and each word in the library is replaced with the most frequent word of its group; finally, the variant confidence between the words of the qualifying word sequences is calculated, and the upper and lower concepts between words are judged from the result. In this way, concepts and upper-lower relations can be extracted accurately and effectively from non-definitional domain text.
Further, the computer program, when executed by a processor, further causes the processor to perform the steps of:
performing word segmentation on each sentence in the text to be processed to obtain a word segmentation result;
and performing syntactic dependency analysis on each sentence in the text to be processed based on the word segmentation result, and correcting the word segmentation result according to the analysis result to obtain at least one group of word sequences for each sentence, thereby obtaining a word sequence library comprising the word sequences of all sentences.
Further, the computer program, when executed by a processor, further causes the processor to perform the steps of:
when the head word of the sentence is a noun, determining the head word, recursively finding all attributive modifiers of the head word, and generating a word sequence comprising the head word and all of its attributive modifiers;
when the head word of the sentence is a verb or an adjective, judging whether the sentence has a subject-predicate structure; when it does, determining the subject noun, recursively finding all attributive modifiers of the subject noun, and generating a word sequence comprising the subject noun and all of its attributive modifiers; when the sentence has no subject-predicate structure but has a verb-object structure, determining the object noun, recursively finding all attributive modifiers of the object noun, and generating a word sequence comprising the object noun and all of its attributive modifiers;
when the head word of the sentence is neither a noun, a verb nor an adjective, determining all attributive relations in the sentence, selecting the noun with the most modifiers, recursively finding all attributive modifiers of that noun, and generating a word sequence comprising the noun and all of its attributive modifiers.
Further, when a frequent sequence includes a word A and a word B, its lift is lift(A, B) = P(B|A) / P(B), where P(B) is the proportion of 2-tuples containing B among all 2-tuples, and P(B|A) is the proportion of 2-tuples containing A in which B appears; here a 2-tuple is a word sequence of length 2 in the word sequence library.
Further, the computer program, when executed by a processor, further causes the processor to perform the steps of:
generating word vectors from the updated word segmentation result;
and calculating the cosine similarity between every pair of words based on the generated word vectors, and establishing all near-synonym groups based on a preset similarity threshold s.
Further, the computer program, when executed by a processor, further causes the processor to perform the steps of:
sorting all words in the updated word segmentation result by word frequency;
establishing the near-synonym group of each word in turn, from high word frequency to low;
wherein establishing the near-synonym group of a word includes:
calculating the similarity between the word and every other word;
building a set from the words whose similarity to the word is greater than the threshold s, and ordering the set by that similarity;
and judging, in descending order of similarity, whether the similarity between each word in the set and the other words in the set is greater than the threshold s, and if so, adding the judged word to the near-synonym group of the word.
Further, the computer program, when executed by a processor, further causes the processor to perform the steps of:
calculating the variant confidence between the rightmost word of the word sequence and the word or sub-sequence to its left, and, if the variant confidence is lower than a preset fifth threshold, judging the word or sub-sequence on the left to be the upper concept of the rightmost word.
Further, when the word sequence includes a word or sub-sequence M and a word or sub-sequence N, the variant confidence P(N|M) is the proportion, among all multi-word sequences containing M on the left, of those that contain N on the right.
Further, the preset first threshold is not smaller than 2, and the preset fourth threshold is 2 or 3.
Example 5
The method for constructing a knowledge graph of the present invention is further described below with reference to specific embodiments, as shown in fig. 4, where the method for constructing a knowledge graph of the present embodiment includes the following steps:
step 401: receiving text to be processed;
the text to be processed may be a non-definitive domain text, and of course, the text to be processed is not limited to the non-definitive domain text, but may be a non-definitive domain text.
Step 402: performing word segmentation on each sentence in the text to be processed to obtain a word segmentation result;
specifically, each sentence in the text to be processed can be segmented by using the existing segmentation method, and the segmentation method belongs to the prior art and is not described herein.
Step 403: analyzing syntactic dependency relation of each sentence based on the word segmentation result to obtain a word sequence;
specifically, as shown in fig. 5, the syntactic dependency analysis for each sentence specifically includes the steps of:
step 501: judging whether the center word is a noun, if so, turning to step 502, and if not, turning to step 503;
Step 502: outputting a recursive center word and a centering modifier;
when the sentence center word is a noun, the center word is selected, and all the centering relation modifier words of the center word are found recursively. In a specific example, the word segmentation result is "display interface failure", the center word is "failure", the centering modifier is "display" and "interface", and the finally output word sequence includes "display interface failure".
Step 503: judging whether the center word is a verb or adjective, if so, turning to step 504, and if not, turning to step 506;
step 504: judging whether the sentence has a main-predicate structure, if so, turning to step 505, and if not, turning to step 508;
step 505: outputting recursive main language nouns and centering modifier words;
when the sentence center word is a verb or adjective, selecting a main noun in the main predicate structure, and recursively finding out centering relation modifier words of all the main nouns. In a specific example, the word segmentation result is "DNS server host power is damaged", the central word therein is "damaged", the main structure is "power", the centering modifier is "DNS", "server" and "host", and the finally output word sequence includes "DNS server host power".
Step 506: searching all centering structures, finding the most-modified noun, and turning to step 507;
step 507: outputting the noun and its recursively found centering modifiers;
All centering relations in the sentence are selected, the most-modified target word is chosen, and the centering modifiers of the target word are found recursively; if several target words are modified an equal number of times, the word appearing last in the sentence is selected. In a specific example, the word segmentation result is "DNS server host lower part water inflow": the centering modifiers are "DNS", "server" and "host", the intermediate word is "lower part", and the center word is "water inflow", so the finally output word sequence includes "DNS server host".
Step 508: judging whether a verb-object structure exists; if so, turning to step 509;
step 509: outputting the object noun and its recursively found centering modifiers.
When the center word of the sentence is a verb or adjective and a verb-object structure exists, the object noun in the verb-object structure is selected, and the centering modifiers of the object noun are found recursively. In a specific example, the word segmentation result is "change DNS server host power": the center word is "change", the object is "power", and the centering modifiers are "DNS", "server" and "host", so the finally output word sequence includes "DNS server host power".
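To make the flow of fig. 5 concrete, the following is a minimal Python sketch of the word-sequence extraction in steps 501-509. The Token structure and the relation labels ATT (centering/attributive relation), SBV (subject), VOB (object) and HED (root) follow an LTP-style dependency convention and are illustrative assumptions, not part of the patent itself.

```python
# Minimal sketch of the word-sequence extraction of fig. 5 (steps 501-509).
# Assumes an LTP-style dependency parse; the Token fields and the label names
# (ATT/SBV/VOB/HED) are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Token:
    idx: int    # 1-based position in the sentence
    word: str
    pos: str    # "n" noun, "v" verb, "a" adjective, ...
    head: int   # index of the governing token, 0 for the sentence root
    rel: str    # dependency relation to the head

def att_modifiers(tokens, target):
    """Recursively collect all centering-relation (ATT) modifiers of target."""
    mods = []
    for t in tokens:
        if t.head == target.idx and t.rel == "ATT":
            mods += att_modifiers(tokens, t) + [t]
    return mods

def extract_word_sequence(tokens):
    root = next(t for t in tokens if t.head == 0)            # the center word
    if root.pos == "n":                                      # steps 501-502
        target = root
    elif root.pos in ("v", "a"):                             # step 503
        subj = [t for t in tokens if t.head == root.idx and t.rel == "SBV"]
        obj = [t for t in tokens if t.head == root.idx and t.rel == "VOB"]
        if subj:                                             # steps 504-505
            target = subj[0]
        elif obj:                                            # steps 508-509
            target = obj[0]
        else:
            return []
    else:                                                    # steps 506-507
        nouns = [t for t in tokens if t.pos == "n"]
        counts = {t.idx: sum(1 for x in tokens
                             if x.head == t.idx and x.rel == "ATT")
                  for t in nouns}
        if not nouns or not any(counts.values()):
            return []
        # most-modified noun; ties go to the word appearing later
        target = max(nouns, key=lambda t: (counts[t.idx], t.idx))
    return [t.word for t in att_modifiers(tokens, target)] + [target.word]

# "DNS server host power is damaged" -> ['DNS', 'server', 'host', 'power']
sent = [Token(1, "DNS", "n", 2, "ATT"), Token(2, "server", "n", 3, "ATT"),
        Token(3, "host", "n", 4, "ATT"), Token(4, "power", "n", 5, "SBV"),
        Token(5, "damaged", "v", 0, "HED")]
print(extract_word_sequence(sent))
```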
Step 404: calculating the frequency and the lifting degree of the frequent sequences with a length greater than 2 in the word sequence library;
After step 403, one or more word sequences have been obtained from each sentence, and together they form a word sequence library. All frequent sequences with a length greater than 2 are found in the word sequence library, and their frequency and lifting degree are calculated. The frequency is the probability that the frequent sequence appears in the word sequence library, and the lifting degree indicates the correlation between the words in the frequent sequence: for a word sequence (A, B), the lifting degree is lift(A, B) = P(B|A)/P(B), where P(B|A) is the proportion of B appearing in all 2-tuples containing A, P(B) is the proportion of 2-tuples containing B among all 2-tuples, and the 2-tuples are the word sequences of length 2 in the word sequence library. For example, if A is "DNS" and B is "server", the lifting degree can be understood as the ratio of the probability that "server" appears given that "DNS" has appeared to the probability that "server" appears in a random 2-tuple; the higher the lifting degree, the stronger the correlation between the two. Note that, unlike the conventional lifting degree calculation, this embodiment counts occurrences only within the 2-tuples: occurrences of B outside the 2-tuples are not added to the frequency counts. The reason is that abbreviations are common in real text, for example "DNS server" may be written simply as "DNS", or only "server" may be written in a later sentence after "DNS server" has appeared in an earlier one; where this phenomenon is widespread it would significantly distort the calculated lifting degree.
The frequency may be expressed either as the number of times the frequent sequence occurs or as its percentage of the total number of all frequent sequences.
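As a sketch of this modified lifting degree calculation, the Python fragment below computes both P(B|A) and P(B) over the 2-tuples only, as described above; the toy 2-tuple data and the function name are illustrative.

```python
# Minimal sketch of the modified lifting degree of step 404: both P(B|A) and
# P(B) are computed over 2-tuples only. The data is illustrative.
bigrams = [("DNS", "server"), ("DNS", "server"), ("web", "server"),
           ("DNS", "record"), ("display", "interface")]

def lift(a, b):
    with_a = [t for t in bigrams if a in t]
    p_b_given_a = sum(1 for t in with_a if b in t) / len(with_a)  # P(B|A)
    p_b = sum(1 for t in bigrams if b in t) / len(bigrams)        # P(B)
    return p_b_given_a / p_b

print(lift("DNS", "server"))  # ~1.11: greater than 1, mildly positive correlation
```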
Step 405: screening out the word sequences whose lifting degree is greater than a preset threshold and whose frequency is greater than another preset threshold, forming new words from the words in these word sequences, and updating the word segmentation result;
For example, if the lifting degree of the word sequence "DNS server" is greater than the preset threshold, "DNS server" is used as a new word in the word segmentation result, and the word segmentation result is updated accordingly.
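A minimal sketch of this update step: once a frequent sequence passes both thresholds, its consecutive words are merged into a single new token throughout the segmentation result. The function and token names are illustrative.

```python
# Minimal sketch of step 405: merging a qualifying frequent sequence into a
# single new word in the segmentation result. Names are illustrative.
def merge_new_word(tokens, new_word_seq, joiner=" "):
    merged, i, n = [], 0, len(new_word_seq)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == tuple(new_word_seq):
            merged.append(joiner.join(new_word_seq))  # e.g. "DNS server"
            i += n
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_new_word(["change", "DNS", "server", "host", "power"],
                     ("DNS", "server")))
# ['change', 'DNS server', 'host', 'power']
```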
Step 406: generating word vectors from the updated word segmentation result, calculating the cosine similarity between words based on the generated word vectors, and establishing all near-meaning word combinations based on a preset similarity threshold s;
Specifically, based on the existing word2vec method, word vectors are generated from the updated word segmentation result, the cosine similarity between every two words is calculated from these word vectors, and all near-meaning word combinations are found based on the similarity threshold s. The specific flow, shown in fig. 6, includes the following steps:
step 601: initializing the near-meaning word combinations as an empty set;
step 602: sorting all the words in the word sequence library (say N words) by word frequency, from high to low;
Step 603: judging whether word W_i is already included in a near-meaning word combination; if so, going to step 612, and if not, going to step 604 (the initial value of i may be 1);
step 604: calculating the similarity between word W_i and the other words;
step 605: extracting the words whose similarity is greater than the threshold s, and ranking them by similarity from high to low as (W_i1, W_i2, …, W_iK);
Step 606: initializing the candidate set of W_i as an empty set;
step 607: judging whether the similarity between word W_ik and every word in the candidate set is greater than the threshold s (the initial value of k may be 1); if so, going to step 608, and if not, going to step 609;
step 608: adding word W_ik to the candidate set;
step 609: adding 1 to the value of k;
step 610: judging whether k is greater than K; if so, going to step 611, and if not, going to step 607;
step 611: forming a near-meaning word combination from word W_i together with its candidate set, and adding it to the near-meaning word combinations;
step 612: adding 1 to the value of i;
step 613: judging whether the value of i is greater than N; if so, the process ends, and if not, going to step 604.
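The flow of steps 601-613 amounts to a greedy, frequency-ordered grouping of mutually similar words. Below is a minimal Python sketch of it; the word vectors, the threshold value and all names are illustrative assumptions, and `words` is assumed to be pre-sorted by frequency from high to low.

```python
# Minimal sketch of the near-meaning-word grouping of fig. 6 (steps 601-613).
# Vectors, names and the threshold are illustrative assumptions.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_combinations(words, vectors, s):
    sim = lambda a, b: cosine(vectors[a], vectors[b])
    combinations, grouped = [], set()                # step 601
    for w in words:                                  # steps 603, 612, 613
        if w in grouped:
            continue
        # words whose similarity to w exceeds s, ranked high to low (604-605)
        cands = sorted((x for x in words if x != w and sim(w, x) > s),
                       key=lambda x: sim(w, x), reverse=True)
        kept = []                                    # candidate set (step 606)
        for c in cands:                              # steps 607-610
            if all(sim(c, k) > s for k in kept):
                kept.append(c)
        combo = [w] + kept                           # step 611
        combinations.append(combo)
        grouped.update(combo)
    return combinations

# toy demo: two obviously close vectors and one distant one (illustrative)
vecs = {"failure": np.array([1.0, 0.1]), "fault": np.array([0.9, 0.2]),
        "power": np.array([0.0, 1.0])}
print(build_combinations(["failure", "fault", "power"], vecs, s=0.9))
# [['failure', 'fault'], ['power']]
```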
Step 407: according to the near-meaning word combinations, replacing each word in the word sequence library with the highest-frequency word in its near-meaning word combination;
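A minimal sketch of this replacement, assuming the near-meaning word combinations from step 406 and a word sequence library represented as lists of words; all names and data are illustrative.

```python
# Minimal sketch of step 407: every word is replaced by the highest-frequency
# member of its near-meaning word combination. Names are illustrative.
from collections import Counter

def canonicalize(sequences, combinations):
    freq = Counter(w for seq in sequences for w in seq)
    canon = {}
    for combo in combinations:
        rep = max(combo, key=lambda w: freq[w])   # most frequent member
        for w in combo:
            canon[w] = rep
    return [[canon.get(w, w) for w in seq] for seq in sequences]

seqs = [["fault", "log"], ["failure", "log"], ["failure", "report"]]
print(canonicalize(seqs, [["failure", "fault"]]))
# [['failure', 'log'], ['failure', 'log'], ['failure', 'report']]
```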
Step 408: calculating the variant confidence among the words of the 2-tuples and 3-tuples in the updated word sequence library, and judging the upper and lower concepts among the words according to the calculation result.
Practical verification shows that analyzing word sequences of other lengths adds little value, so in this embodiment the variant confidence is calculated only for word sequences of length 2 and 3; the variant confidence represents the correlation between the words or word sequences within a word sequence.
Specifically, the variant confidence P(N|M) of the rightmost word N and the left word sequence (or word) M may be calculated, where P(N|M) is the proportion, among all multi-word sequences containing M on the left, of those in which N appears on the right. Because this calculation differs from the conventional confidence calculation, the result is called a variant confidence. If the variant confidence is below a threshold b, the left word sequence (or word) is judged to be the upper concept of the right word. For example, for the frequent sequence (DNS server, host, power), first the proportion r1 of "host" appearing in all multi-word sequences in which "DNS server" appears is calculated; if r1 is smaller than b, "host" is regarded as a lower concept of "DNS server". Then the proportion r2 of "power" appearing in all multi-word sequences containing "DNS server, host" is calculated; if r2 is smaller than b, "DNS server host" is regarded as the upper concept of "power".
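The variant confidence calculation above can be sketched as follows; the toy word sequence library and the threshold b are illustrative.

```python
# Minimal sketch of the variant confidence of step 408: P(N|M) is the share of
# multi-word sequences containing M on the left in which N appears directly to
# its right; below the threshold b, M is judged an upper concept of N.
def variant_confidence(sequences, left, right):
    n = len(left)
    def spans(s):
        return [i for i in range(len(s) - n + 1) if tuple(s[i:i + n]) == tuple(left)]
    with_m = [s for s in sequences if spans(s)]
    if not with_m:
        return 0.0
    hits = sum(1 for s in with_m
               if any(i + n < len(s) and s[i + n] == right for i in spans(s)))
    return hits / len(with_m)

seqs = [["DNS server", "host", "power"], ["DNS server", "host"],
        ["DNS server", "record"], ["DNS server", "host", "fan"]]
b = 0.8
print(variant_confidence(seqs, ("DNS server",), "host"))          # 0.75 < b
print(variant_confidence(seqs, ("DNS server", "host"), "power"))  # ~0.33 < b
```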
In this embodiment, each sentence in the text to be processed is not only segmented but also subjected to syntactic dependency analysis, so the word segmentation result can be corrected with the dependency analysis result to obtain a word sequence library. The word segmentation result is then updated according to the correlation between the words in the qualifying frequent sequences of the word sequence library; near-meaning word combinations are established from the updated word segmentation result; each word in the word sequence library is replaced with the highest-frequency word in its near-meaning word combination; the variant confidence among the words of the qualifying word sequences in the word sequence library is calculated; and the upper and lower concepts among the words are judged according to the calculation result.
The foregoing is a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations shall also be regarded as falling within the scope of the present invention.

Claims (10)

1. A knowledge graph construction method, characterized by comprising the following steps:
analyzing the word segmentation and syntactic dependency relationships of each sentence in a text to be processed to obtain a word segmentation result and a word sequence library;
screening out frequent sequences with a length greater than a preset first threshold from the word sequence library, and calculating the frequency and the lifting degree of each frequent sequence, wherein the frequency represents the probability of occurrence of the frequent sequence in the word sequence library, and the lifting degree represents the correlation among the words in the frequent sequence;
combining the words included in the frequent sequences whose lifting degree is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and updating the word segmentation result according to the newly added words;
establishing near-meaning word combinations according to the updated word segmentation result, and, according to the near-meaning word combinations, replacing each word in the word sequence library with the highest-frequency word in its near-meaning word combination;
obtaining, in the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculating the variant confidence among the words in these word sequences, and judging the upper and lower concepts among the words according to the calculation result, wherein the variant confidence represents the correlation among the words or word sequences within a word sequence,
wherein judging the upper and lower concepts among the words according to the calculation result comprises:
calculating the variant confidence between the rightmost word in a word sequence and the word sequence or word on its left, and judging the left word sequence or word to be the upper concept of the rightmost word if the variant confidence is lower than a preset fifth threshold.
2. The knowledge graph construction method according to claim 1, wherein analyzing the word segmentation and syntactic dependency relationships of each sentence in the text to be processed to obtain the word segmentation result and the word sequence library comprises:
performing word segmentation on each sentence in the text to be processed to obtain a word segmentation result;
and analyzing the syntactic dependency relation of each sentence in the text to be processed based on the word segmentation result, correcting the word segmentation result according to the syntactic dependency relation analysis result to obtain at least one group of word sequences corresponding to each sentence, and obtaining a word sequence library comprising word sequences of all sentences.
3. The knowledge graph construction method according to claim 2, wherein correcting the word segmentation result according to the syntactic dependency analysis result to obtain at least one word sequence corresponding to each sentence comprises:
when the center word of a sentence is a noun, determining the center word, recursively finding all the centering relation modifiers of the center word, and generating a word sequence comprising the center word and all of its centering relation modifiers;
when the center word of a sentence is a verb or an adjective, judging whether the sentence has a subject-predicate structure; when the sentence has the subject-predicate structure, determining the subject noun in the subject-predicate structure, recursively finding all the centering relation modifiers of the subject noun, and generating a word sequence comprising the subject noun and all of its centering relation modifiers; when the sentence has no subject-predicate structure but has a verb-object structure, determining the object noun in the verb-object structure, recursively finding all the centering relation modifiers of the object noun, and generating a word sequence comprising the object noun and all of its centering relation modifiers;
when the center word of a sentence is neither a noun, a verb nor an adjective, determining all the centering relations in the sentence, selecting the most-modified noun, recursively finding all the centering relation modifiers of that noun, and generating a word sequence comprising the noun and all of its centering relation modifiers.
4. The knowledge graph construction method according to claim 1, wherein, when the frequent sequence includes a word A and a word B, the lifting degree lift(A, B) = P(B|A)/P(B), where P(B) is the proportion of 2-tuples containing B among all 2-tuples, P(B|A) is the proportion of B appearing in all 2-tuples containing A, and the 2-tuples are the word sequences of length 2 in the word sequence library.
5. The knowledge graph construction method according to claim 1, wherein establishing near-meaning word combinations according to the updated word segmentation result comprises:
generating word vectors from the updated word segmentation result;
calculating the cosine similarity between every two words based on the generated word vectors, and establishing all the near-meaning word combinations based on a preset similarity threshold s.
6. The knowledge graph construction method according to claim 5, wherein establishing all the near-meaning word combinations comprises:
sorting all the words in the updated word segmentation result by word frequency;
establishing the near-meaning word combination of each word in turn, from high word frequency to low;
wherein establishing the near-meaning word combination of a word comprises:
calculating the similarity between the word and the other words;
establishing a set from the at least one word whose similarity to the word is greater than the threshold s, and ordering the words in the set by their similarity to the word;
judging, in descending order of similarity, whether the similarity between each word in the set and every other word in the set is greater than the threshold s, and if so, adding the judged word to the near-meaning word combination of the word.
7. The knowledge graph construction method according to claim 1, wherein,
when the word sequence includes a word or word sequence M and a word or word sequence N, the variant confidence P(N|M) is the proportion, among all multi-word sequences containing M on the left, of those in which N appears on the right.
8. The knowledge graph construction method according to claim 1, wherein the preset first threshold is not less than 2, and the preset fourth threshold is 2 or 3.
9. A knowledge graph construction device, characterized by comprising:
an analysis module, configured to analyze the word segmentation and syntactic dependency relationships of each sentence in a text to be processed to obtain a word segmentation result and a word sequence library;
a first processing module, configured to screen out frequent sequences with a length greater than a preset first threshold from the word sequence library, and to calculate the frequency and the lifting degree of each frequent sequence, wherein the frequency represents the probability of occurrence of the frequent sequence in the word sequence library, and the lifting degree represents the correlation among the words in the frequent sequence;
a first updating module, configured to combine the words included in the frequent sequences whose lifting degree is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and to update the word segmentation result according to the newly added words;
a second updating module, configured to establish near-meaning word combinations according to the updated word segmentation result, and to replace each word in the word sequence library with the highest-frequency word in its near-meaning word combination;
a second processing module, configured to obtain, in the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, to calculate the variant confidence among the words in these word sequences, and to judge the upper and lower concepts among the words according to the calculation result, wherein the variant confidence represents the correlation among the words or word sequences within a word sequence,
wherein the second processing module is configured to calculate the variant confidence between the rightmost word in a word sequence and the word sequence or word on its left, and to judge the left word sequence or word to be the upper concept of the rightmost word if the variant confidence is lower than a preset fifth threshold.
10. An electronic device for constructing a knowledge graph, comprising:
a processor; and
a memory in which computer program instructions are stored,
wherein the computer program instructions, when executed by the processor, cause the processor to perform the steps of the knowledge graph construction method of any one of claims 1-8.
CN201810620223.6A 2018-06-15 2018-06-15 Knowledge graph construction method and device and electronic equipment Active CN110674306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810620223.6A CN110674306B (en) 2018-06-15 2018-06-15 Knowledge graph construction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810620223.6A CN110674306B (en) 2018-06-15 2018-06-15 Knowledge graph construction method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110674306A CN110674306A (en) 2020-01-10
CN110674306B true CN110674306B (en) 2023-06-20

Family

ID=69065270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810620223.6A Active CN110674306B (en) 2018-06-15 2018-06-15 Knowledge graph construction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110674306B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325033B (en) * 2020-03-20 2023-07-11 中国建设银行股份有限公司 Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN111353303B (en) * 2020-05-25 2020-08-25 腾讯科技(深圳)有限公司 Word vector construction method and device, electronic equipment and storage medium
CN112241734A (en) * 2020-10-15 2021-01-19 首域科技(杭州)有限公司 Method and system for diagnosing equipment fault through knowledge graph and Bayesian network
CN112802569B (en) * 2021-02-05 2023-08-08 北京嘉和海森健康科技有限公司 Semantic information acquisition method, device, equipment and readable storage medium
CN115221872B (en) * 2021-07-30 2023-06-02 苏州七星天专利运营管理有限责任公司 Vocabulary expansion method and system based on near-sense expansion
CN113901800A (en) * 2021-08-31 2022-01-07 北京影谱科技股份有限公司 Method and system for extracting scene map from Chinese text
CN116467405A (en) * 2022-01-12 2023-07-21 腾讯科技(深圳)有限公司 Text processing method, device, equipment and computer readable storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07160723A (en) * 1993-11-16 1995-06-23 Roehm Properties Bv Output device of retrieval word
US8402030B1 (en) * 2011-11-21 2013-03-19 Raytheon Company Textual document analysis using word cloud comparison
CN106569993A (en) * 2015-10-10 2017-04-19 ***通信集团公司 Method and device for mining hypernym-hyponym relation between domain-specific terms
CN108038096A (en) * 2017-11-10 2018-05-15 平安科技(深圳)有限公司 Knowledge database document rapid retrieval method, application server, and computer-readable storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Mixing semantic networks and conceptual vectors: application to hyperonymy; V. Prince et al.; IEEE Transactions on Systems, Man, and Cybernetics, Part C; Vol. 36, No. 2; pp. 152-160 *
Modeling and extracting hyponymy relationships on Chinese electric power field content; Dong-ru Ruan et al.; 2016 8th International Conference on Modelling, Identification and Control (ICMIC); pp. 439-443 *
Acquisition and organization of hypernym-hyponym relations of domain entities combining word vectors and Bootstrapping (in Chinese); Ma Xiaojun et al.; Computer Science (《计算机科学》); Vol. 45, No. 01; pp. 67-72 *
Research on automatic acquisition of hypernym-hyponym relations of domain entities (in Chinese); Cheng Yunru; China Master's Theses Full-text Database, Information Science and Technology; Vol. 2017, No. 02; p. I138-4456 *

Also Published As

Publication number Publication date
CN110674306A (en) 2020-01-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant