CN110674306B - Knowledge graph construction method and device and electronic equipment

Knowledge graph construction method and device and electronic equipment

Info

Publication number: CN110674306B
Authority: CN (China)
Prior art keywords: word, words, sequence, sentence, preset
Legal status: Active
Application number: CN201810620223.6A
Other languages: Chinese (zh)
Other versions: CN110674306A (en)
Inventors: 郑萌, 耿璐, 李岚
Current Assignee: Hitachi Ltd
Original Assignee: Hitachi Ltd
Application filed by Hitachi Ltd
Priority application: CN201810620223.6A
Publication of application: CN110674306A
Publication of grant: CN110674306B


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a knowledge graph construction method and device and an electronic device, belonging to the technical field of artificial intelligence. The knowledge graph construction method comprises the following steps: performing word segmentation and syntactic dependency analysis on each sentence in a text to be processed to obtain a word segmentation result and a word sequence library; screening out frequent sequences whose length is greater than a preset first threshold from the word sequence library; merging the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and updating the word segmentation result; building near-synonym groups from the updated word segmentation result and updating the word sequence library accordingly; and calculating the variant confidence between words in a word sequence and judging the upper and lower concepts (hypernym-hyponym relations) between words from the calculation result, where the variant confidence represents the correlation between words or word sequences within a word sequence. The method can accurately and effectively extract concepts and upper-lower relations from domain text that contains no explicit definitions.

Description

Knowledge graph construction method and device and electronic equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a knowledge graph construction method, a knowledge graph construction device and electronic equipment.
Background
Knowledge graph construction is an important component of natural language processing and machine learning. Most current knowledge graph construction methods extract text from the Internet, discover concepts in the text, and judge the upper-lower (hypernym-hyponym) relations between them. Existing methods often rely on preset sentence patterns when extracting such relations, for example "deep learning is one of the machine learning methods" or "Word is Microsoft's office software specialized for word processing". Such sentence patterns appear in large numbers in corpora such as specifications and encyclopedias. In real life, however, there are also many scenarios with no text, such as a description, that explicitly defines entity concepts. For a relatively complex device, for example, the manual typically does not give the user a detailed definition of each part or state that part A is a component of part B. In addition, a large amount of domain text, such as customer service records and maintenance records, is usually written concisely on the assumption that the reader already has strong domain knowledge, so the entity concepts involved are never defined. In these cases, existing knowledge graph construction methods cannot accurately and effectively extract concepts and upper-lower relations from such non-definitional domain text, i.e., text that never explicitly defines the concepts it mentions.
Disclosure of Invention
The technical problem the invention aims to solve is to provide a knowledge graph construction method and device and an electronic device that can accurately and effectively extract concepts and upper-lower relations from non-definitional domain text.
To solve the above technical problem, the embodiments of the invention provide the following technical solutions:
in one aspect, a method for constructing a knowledge graph is provided, including:
performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library;
screening out frequent sequences whose length is greater than a preset first threshold from the word sequence library, and calculating the frequency and lift of each frequent sequence, where the frequency represents the probability that the frequent sequence appears in the word sequence library and the lift represents the correlation between the words in the frequent sequence;
merging the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and updating the word segmentation result according to the newly added words;
building near-synonym groups from the updated word segmentation result, and, according to the groups, replacing each word in the word sequence library with the most frequent word of its near-synonym group;
and obtaining, from the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculating the variant confidence between the words of each such word sequence, and judging the upper and lower concepts between words from the calculation result, where the variant confidence represents the correlation between the words or sub-sequences within a word sequence.
Further, the performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library includes:
performing word segmentation on each sentence in the text to be processed to obtain a word segmentation result;
and performing syntactic dependency analysis on each sentence in the text to be processed based on the word segmentation result, and correcting the word segmentation result according to the analysis result to obtain at least one group of word sequences for each sentence, thereby obtaining a word sequence library comprising the word sequences of all sentences.
Further, the correcting the word segmentation result according to the syntactic dependency analysis result to obtain at least one group of word sequences for each sentence includes:
when the head word (center word) of the sentence is a noun, determining the head word, recursively finding all attributive (centering-relation) modifiers of the head word, and generating a word sequence comprising the head word and all of its attributive modifiers;
when the head word of the sentence is a verb or an adjective, judging whether the sentence has a subject-predicate structure; when it does, determining the subject noun of the subject-predicate structure, recursively finding all attributive modifiers of the subject noun, and generating a word sequence comprising the subject noun and all of its attributive modifiers; when the sentence has no subject-predicate structure but has a verb-object structure, determining the object noun of the verb-object structure, recursively finding all attributive modifiers of the object noun, and generating a word sequence comprising the object noun and all of its attributive modifiers;
when the head word of the sentence is neither a noun, a verb nor an adjective, determining all attributive relations in the sentence, selecting the noun with the most modifiers, recursively finding all attributive modifiers of that noun, and generating a word sequence comprising the noun and all of its attributive modifiers.
Further, when a frequent sequence includes a word A and a word B, its lift is lift(A, B) = P(B|A) / P(B), where P(B) is the proportion of 2-tuples containing B among all 2-tuples, and P(B|A) is the proportion of 2-tuples containing A in which B appears; here a 2-tuple is a word sequence of length 2 in the word sequence library.
Further, the building of near-synonym groups from the updated word segmentation result includes the following steps:
generating word vectors from the updated word segmentation result;
and calculating the cosine similarity between every pair of words based on the generated word vectors, and establishing all near-synonym groups based on a preset similarity threshold s.
Further, the establishing all near-synonym groups includes:
sorting all words in the updated word segmentation result by word frequency;
establishing the near-synonym group of each word in turn, from high word frequency to low;
wherein establishing the near-synonym group of a word includes:
calculating the similarity between the word and every other word;
building a set from the words whose similarity to the word is greater than the threshold s, and ordering the set by that similarity;
and judging, in descending order of similarity, whether the similarity between each word in the set and the other words in the set is greater than the threshold s, and if so, adding the judged word to the near-synonym group of the word.
Further, the judging the upper and lower concepts between words from the calculation result includes:
calculating the variant confidence between the rightmost word of the word sequence and the word or sub-sequence to its left, and, if the variant confidence is lower than a preset fifth threshold, judging the word or sub-sequence on the left to be the upper concept of the rightmost word.
Further, when the word sequence includes a word or sub-sequence M and a word or sub-sequence N, the variant confidence P(N|M) is the proportion, among all multi-word sequences containing M on the left, of those that contain N on the right.
Further, the preset first threshold is not smaller than 2, and the preset fourth threshold is 2 or 3.
The embodiment of the invention also provides a knowledge graph construction device, which comprises:
an analysis module, used for performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library;
a first processing module, used for screening out frequent sequences whose length is greater than a preset first threshold from the word sequence library and calculating the frequency and lift of each frequent sequence, where the frequency represents the probability that the frequent sequence appears in the word sequence library and the lift represents the correlation between the words in the frequent sequence;
a first updating module, used for merging the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words and updating the word segmentation result according to the newly added words;
a second updating module, used for building near-synonym groups from the updated word segmentation result and, according to the groups, replacing each word in the word sequence library with the most frequent word of its near-synonym group;
and a second processing module, used for obtaining, from the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculating the variant confidence between the words of each such word sequence, and judging the upper and lower concepts between words from the calculation result, where the variant confidence represents the correlation between the words or sub-sequences within a word sequence.
The embodiment of the invention also provides an electronic device for constructing a knowledge graph, comprising:
a processor; and
a memory in which computer program instructions are stored,
wherein the computer program instructions, when executed by the processor, cause the processor to perform the steps in the knowledge graph construction method as described above.
The embodiment of the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the knowledge graph construction method described above.
The embodiment of the invention has the following beneficial effects:
according to the technical scheme, the word sequence library is obtained by correcting the word segmentation result by utilizing the analysis result of the syntactic dependency relationship, the word segmentation result is updated according to the correlation among the words in the frequent sequences meeting the conditions in the word sequence library, the hyponym combination is established according to the updated word segmentation result, the words in the word sequence library are replaced by the words with the highest frequency in the same hyponym combination according to the hyponym combination, the confidence level of the variants among the words in the word sequence library meeting the requirements is calculated, and the upper and lower concepts among the words are judged according to the calculation result.
Drawings
FIG. 1 is a schematic flow chart of a knowledge graph construction method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a knowledge graph construction device according to an embodiment of the present invention;
FIG. 3 is a block diagram of an electronic device for constructing a knowledge graph in accordance with an embodiment of the present invention;
FIG. 4 is a flow chart of a knowledge graph construction method according to an embodiment of the present invention;
FIG. 5 is a flow chart illustrating a syntactic dependency analysis according to an embodiment of the present invention;
FIG. 6 is a flow chart of establishing near-synonym groups according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions, and the advantages of the embodiments of the present invention more apparent, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
The embodiments of the invention provide a knowledge graph construction method and device and an electronic device, which can accurately and effectively extract concepts and upper-lower relations from non-definitional domain text.
Example 1
The embodiment of the invention provides a knowledge graph construction method which, as shown in FIG. 1, comprises the following steps:
Step 101: performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library;
In this step, each sentence may first be segmented; syntactic dependency analysis may then be performed on each sentence based on the segmentation result, and the segmentation result may be corrected according to the analysis result, so as to obtain a word sequence library comprising the word sequences of all sentences.
Step 102: screening out frequent sequences whose length is greater than a preset first threshold from the word sequence library, and calculating the frequency and lift of each frequent sequence, where the frequency represents the probability that the frequent sequence appears in the word sequence library and the lift represents the correlation between the words in the frequent sequence;
A very short word sequence is of no analytical value, so frequent sequences whose length is greater than a preset first threshold are screened out. The preset first threshold is not smaller than 2; specifically, it may be 2, i.e., all frequent sequences of length greater than 2 are screened out.
When a frequent sequence includes a word A and a word B, the lift of the frequent sequence (A, B) is lift(A, B) = P(B|A) / P(B), where P(B) is the proportion of 2-tuples containing B among all 2-tuples, and P(B|A) is the proportion of 2-tuples containing A in which B appears; here a 2-tuple is a word sequence of length 2 in the word sequence library. The greater the lift, the stronger the correlation between word A and word B.
The frequency is either the number of times the frequent sequence occurs or its percentage of the total count of all frequent sequences. A sketch of these statistics, together with the merging rule of step 103 below, is given next.
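The following minimal Python sketch is an illustrative aid, not the patent's implementation: it computes the 2-tuple statistics of step 102 and applies the merging rule of step 103 below. The threshold values, the tuple representation, and the spelling of the merged token are all assumptions.

```python
from collections import Counter

def tuple_lift(pairs, a, b):
    """lift(A, B) = P(B|A) / P(B) over the library of 2-tuples.

    P(B) is the share of 2-tuples containing B; P(B|A) is the share of
    2-tuples containing A in which B also appears, as defined above.
    """
    total = len(pairs)
    with_b = sum(1 for t in pairs if b in t)
    with_a = [t for t in pairs if a in t]
    if total == 0 or not with_a or with_b == 0:
        return 0.0
    p_b = with_b / total
    p_b_given_a = sum(1 for t in with_a if b in t) / len(with_a)
    return p_b_given_a / p_b

def merge_new_words(pairs, lift_thresh=3.0, freq_thresh=5):
    """Merge 2-tuples passing both the lift (second) and frequency (sixth)
    thresholds into newly added words; returns {(A, B): merged_token}."""
    merged = {}
    for (a, b), n in Counter(pairs).items():
        if n > freq_thresh and tuple_lift(pairs, a, b) > lift_thresh:
            merged[(a, b)] = a + b  # Chinese tokens would simply concatenate
    return merged

# Hypothetical data: "DNS" is almost always followed by "server".
pairs = [("DNS", "server")] * 8 + [("display", "interface")] * 20
print(merge_new_words(pairs))  # {('DNS', 'server'): 'DNSserver'}
```

In this toy data, "server" almost only occurs after "DNS", so P(server|DNS) far exceeds P(server) and the pair is merged; "interface" is common everywhere, so its lift stays near 1 and it is left alone.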
Step 103: merging the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and updating the word segmentation result according to the newly added words;
Step 104: building near-synonym groups from the updated word segmentation result, and, according to the groups, replacing each word in the word sequence library with the most frequent word of its near-synonym group;
Specifically, word vectors can be generated from the updated word segmentation result, the cosine similarity between every pair of words can be calculated from the word vectors, and all near-synonym groups can be found based on a similarity threshold s.
When all the near-synonym groups are established, all words in the updated word segmentation result can be sorted by word frequency, and the near-synonym group of each word is then established in turn, from high word frequency to low. When establishing the near-synonym group of a word: the similarity between the word and every other word is calculated; a set is built from the words whose similarity to the word is greater than the threshold s and is ordered by that similarity; then, in descending order of similarity, it is judged whether the similarity between each word in the set and the other words in the set is greater than the threshold s, and if so, the judged word is added to the near-synonym group of the word. A sketch of this grouping follows.
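A minimal sketch of the grouping, assuming the word vectors come from any embedding model trained on the updated segmentation (word2vec, for example); the mutual-similarity admission rule is one reasonable reading of the text above, and the value s = 0.8 is an assumption.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_synonym_groups(vectors, word_freq, s=0.8):
    """vectors: word -> embedding vector; word_freq: word -> corpus count."""
    grouped = set()          # words already assigned to a group
    groups = []
    for word in sorted(word_freq, key=word_freq.get, reverse=True):
        if word in grouped or word not in vectors:
            continue
        # candidate near-synonyms: similarity to `word` above s, best first
        cands = sorted(((w, cosine(vectors[word], vectors[w]))
                        for w in vectors if w != word and w not in grouped),
                       key=lambda c: c[1], reverse=True)
        group = [word]
        for w, sim in cands:
            if sim <= s:
                break        # the list is sorted, so no later candidate qualifies
            # admit w only if it also clears s against every current member
            if all(cosine(vectors[w], vectors[m]) > s for m in group):
                group.append(w)
        if len(group) > 1:
            groups.append(group)
            grouped.update(group)
    return groups
```

Each word in the word sequence library is then replaced by the most frequent member of its group, which normalizes abbreviations and spelling variants before the analysis of step 105.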
Step 105: obtaining, from the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculating the variant confidence between the words of each such word sequence, and judging the upper and lower concepts between words from the calculation result, where the variant confidence represents the correlation between the words or sub-sequences within a word sequence.
In this embodiment, only the higher-frequency 2-tuples and 3-tuples in the word sequence library need to be analyzed; word sequences of other lengths are of little analytical value, i.e., the preset fourth threshold is 2 or 3.
Specifically, when judging the upper and lower concepts between words from the calculation result, the variant confidence between the rightmost word of the word sequence and the word or sub-sequence to its left is calculated, and if the variant confidence is lower than a preset fifth threshold, the word or sub-sequence on the left is judged to be the upper concept of the rightmost word.
When the word sequence includes a word or sub-sequence M and a word or sub-sequence N, the variant confidence P(N|M) is the proportion, among all multi-word sequences containing M on the left, of those that contain N on the right. A sketch of this judgment follows.
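A minimal sketch of the upper-concept judgment under the reading above; the threshold values and the list-of-sequences representation are assumptions.

```python
from collections import Counter

def variant_confidence(sequences, m, n):
    """P(N|M): among multi-word sequences whose left part equals M, the share
    whose next word is N. `m` is a tuple of words, `n` a single word."""
    k = len(m)
    left_matches = [s for s in sequences if len(s) > k and s[:k] == m]
    if not left_matches:
        return 0.0
    return sum(1 for s in left_matches if s[k] == n) / len(left_matches)

def judge_upper_concepts(sequences, freq_thresh=3, conf_thresh=0.5, length=2):
    """For each frequent sequence of the preset (fourth-threshold) length, if
    the rightmost word's variant confidence given the left part falls below
    the fifth threshold, the left part is judged an upper concept of it."""
    seqs = [tuple(s) for s in sequences]
    relations = []
    for seq, count in Counter(s for s in seqs if len(s) == length).items():
        if count <= freq_thresh:
            continue
        left, right = seq[:-1], seq[-1]
        if variant_confidence(seqs, left, right) < conf_thresh:
            relations.append((left, right))  # left = upper concept of right
    return relations

# Hypothetical data: many different words follow "DNS server", so each gets a
# low variant confidence and "DNS server" is judged their upper concept.
seqs = ([["DNS server", "power"]] * 4 + [["DNS server", "fan"]] * 4
        + [["DNS server", "disk"]] * 4)
print(judge_upper_concepts(seqs))
```

The intuition: a low confidence means many different words can follow the left part, so the left part names a general concept with many subordinate parts rather than a fixed phrase.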
In this embodiment, each sentence of the text to be processed is not only segmented but also subjected to syntactic dependency analysis, so the analysis result can be used to correct the word segmentation result and obtain a word sequence library. The word segmentation result is then updated according to the correlation between the words of the qualifying frequent sequences in the library; near-synonym groups are built from the updated word segmentation result, and each word in the library is replaced with the most frequent word of its group; finally, the variant confidence between the words of the qualifying word sequences is calculated, and the upper and lower concepts between words are judged from the result. In this way, concepts and upper-lower relations can be extracted accurately and effectively from non-definitional domain text.
In a specific example, the performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain the word segmentation result and the word sequence library includes:
performing word segmentation on each sentence in the text to be processed to obtain a word segmentation result;
and performing syntactic dependency analysis on each sentence in the text to be processed based on the word segmentation result, and correcting the word segmentation result according to the analysis result to obtain at least one group of word sequences for each sentence, thereby obtaining a word sequence library comprising the word sequences of all sentences.
In a specific example, correcting the word segmentation result according to the syntactic dependency analysis result to obtain at least one word sequence for each sentence includes:
when the head word of the sentence is a noun, determining the head word, recursively finding all attributive modifiers of the head word, and generating a word sequence comprising the head word and all of its attributive modifiers;
when the head word of the sentence is a verb or an adjective, judging whether the sentence has a subject-predicate structure; when it does, determining the subject noun, recursively finding all attributive modifiers of the subject noun, and generating a word sequence comprising the subject noun and all of its attributive modifiers; when the sentence has no subject-predicate structure but has a verb-object structure, determining the object noun, recursively finding all attributive modifiers of the object noun, and generating a word sequence comprising the object noun and all of its attributive modifiers;
when the head word of the sentence is neither a noun, a verb nor an adjective, determining all attributive relations in the sentence, selecting the noun with the most modifiers, recursively finding all attributive modifiers of that noun, and generating a word sequence comprising the noun and all of its attributive modifiers. A sketch of these three cases is given below.
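The three cases can be sketched as follows, assuming the dependency parse is available as a token list whose labels follow the LTP convention (ATT for the attributive, i.e. centering, relation; SBV for subject-predicate; VOB for verb-object); the Token layout and tag set are assumptions, not the patent's interface.

```python
from dataclasses import dataclass

@dataclass
class Token:
    word: str
    pos: str   # "n" = noun, "v" = verb, "a" = adjective (assumed tag set)
    head: int  # index of the syntactic head; -1 marks the sentence root
    rel: str   # dependency label: "ATT", "SBV", "VOB", ... (LTP-style)

def att_modifiers(tokens, i):
    """Recursively collect the indices of all attributive (ATT) modifiers of token i."""
    out = []
    for j, t in enumerate(tokens):
        if t.head == i and t.rel == "ATT":
            out.extend(att_modifiers(tokens, j))
            out.append(j)
    return out

def noun_sequence(tokens, i):
    """Word sequence = the noun plus its recursive attributive modifiers, in sentence order."""
    idxs = sorted(att_modifiers(tokens, i) + [i])
    return [tokens[k].word for k in idxs]

def sentence_sequences(tokens):
    root = next(j for j, t in enumerate(tokens) if t.head == -1)
    if tokens[root].pos == "n":                       # case 1: nominal head word
        return [noun_sequence(tokens, root)]
    if tokens[root].pos in ("v", "a"):                # case 2: verb/adjective head word
        subj = [j for j, t in enumerate(tokens) if t.head == root and t.rel == "SBV"]
        if subj:                                      # subject-predicate structure
            return [noun_sequence(tokens, j) for j in subj]
        obj = [j for j, t in enumerate(tokens) if t.head == root and t.rel == "VOB"]
        if obj:                                       # verb-object structure
            return [noun_sequence(tokens, j) for j in obj]
        return []
    # case 3: otherwise pick the noun with the most modifiers (ties: the last one)
    nouns = [j for j, t in enumerate(tokens) if t.pos == "n"]
    if not nouns:
        return []
    best = max(nouns, key=lambda j: (len(att_modifiers(tokens, j)), j))
    return [noun_sequence(tokens, best)]

# "display interface failure": the head word "failure" is a noun (case 1)
sent = [Token("display", "n", 2, "ATT"), Token("interface", "n", 2, "ATT"),
        Token("failure", "n", -1, "HED")]
print(sentence_sequences(sent))  # [['display', 'interface', 'failure']]
```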
Example 2
The embodiment of the invention also provides a knowledge graph construction device which, as shown in FIG. 2, comprises:
an analysis module 21, used for performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library;
a first processing module 22, used for screening out frequent sequences whose length is greater than a preset first threshold from the word sequence library and calculating the frequency and lift of each frequent sequence, where the frequency represents the probability that the frequent sequence appears in the word sequence library and the lift represents the correlation between the words in the frequent sequence;
a first updating module 23, used for merging the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words and updating the word segmentation result according to the newly added words;
a second updating module 24, used for building near-synonym groups from the updated word segmentation result and, according to the groups, replacing each word in the word sequence library with the most frequent word of its near-synonym group;
and a second processing module 25, used for obtaining, from the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculating the variant confidence between the words of each such word sequence, and judging the upper and lower concepts between words from the calculation result, where the variant confidence represents the correlation between the words or sub-sequences within a word sequence.
In this embodiment, each sentence of the text to be processed is not only segmented but also subjected to syntactic dependency analysis, so the analysis result can be used to correct the word segmentation result and obtain a word sequence library. The word segmentation result is then updated according to the correlation between the words of the qualifying frequent sequences in the library; near-synonym groups are built from the updated word segmentation result, and each word in the library is replaced with the most frequent word of its group; finally, the variant confidence between the words of the qualifying word sequences is calculated, and the upper and lower concepts between words are judged from the result. In this way, concepts and upper-lower relations can be extracted accurately and effectively from non-definitional domain text.
Further, the analysis module 21 includes:
a word segmentation unit, used for segmenting each sentence in the text to be processed to obtain a word segmentation result;
and a syntactic dependency analysis unit, used for performing syntactic dependency analysis on each sentence in the text to be processed based on the word segmentation result and correcting the word segmentation result according to the analysis result to obtain at least one group of word sequences for each sentence, thereby obtaining a word sequence library comprising the word sequences of all sentences.
Further, the syntactic dependency analysis unit is specifically configured to: when the head word of the sentence is a noun, determine the head word, recursively find all attributive modifiers of the head word, and generate a word sequence comprising the head word and all of its attributive modifiers; when the head word of the sentence is a verb or an adjective, judge whether the sentence has a subject-predicate structure, and when it does, determine the subject noun, recursively find all attributive modifiers of the subject noun, and generate a word sequence comprising the subject noun and all of its attributive modifiers; when the sentence has no subject-predicate structure but has a verb-object structure, determine the object noun, recursively find all attributive modifiers of the object noun, and generate a word sequence comprising the object noun and all of its attributive modifiers; and when the head word of the sentence is neither a noun, a verb nor an adjective, determine all attributive relations in the sentence, select the noun with the most modifiers, recursively find all attributive modifiers of that noun, and generate a word sequence comprising the noun and all of its attributive modifiers.
Further, when a frequent sequence includes a word A and a word B, its lift is lift(A, B) = P(B|A) / P(B), where P(B) is the proportion of 2-tuples containing B among all 2-tuples, and P(B|A) is the proportion of 2-tuples containing A in which B appears; here a 2-tuple is a word sequence of length 2 in the word sequence library.
Further, the second updating module 24 includes:
a word vector generation unit, used for generating word vectors from the updated word segmentation result;
and a near-synonym group generation unit, used for calculating the cosine similarity between every pair of words based on the generated word vectors and establishing all near-synonym groups based on a preset similarity threshold s.
Further, the near-synonym group generation unit is specifically configured to sort all words in the updated word segmentation result by word frequency and to establish the near-synonym group of each word in turn, from high word frequency to low.
When establishing the near-synonym group of a word, the near-synonym group generation unit is specifically configured to calculate the similarity between the word and every other word; build a set from the words whose similarity to the word is greater than the threshold s, ordered by that similarity; and judge, in descending order of similarity, whether the similarity between each word in the set and the other words in the set is greater than the threshold s, adding the judged word to the near-synonym group of the word if so.
Further, the second processing module 25 is specifically configured to calculate the variant confidence between the rightmost word of the word sequence and the word or sub-sequence to its left, and, if the variant confidence is lower than a preset fifth threshold, to judge the word or sub-sequence on the left to be the upper concept of the rightmost word.
Further, when the word sequence includes a word or sub-sequence M and a word or sub-sequence N, the variant confidence P(N|M) is the proportion, among all multi-word sequences containing M on the left, of those that contain N on the right.
Further, the preset first threshold is not smaller than 2, and the preset fourth threshold is 2 or 3.
Example 3
The embodiment of the invention also provides an electronic device 30 for constructing a knowledge graph which, as shown in FIG. 3, includes:
a processor 32; and
a memory 34 in which computer program instructions are stored,
wherein the computer program instructions, when executed by the processor, cause the processor 32 to perform the steps of:
performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library;
screening out frequent sequences whose length is greater than a preset first threshold from the word sequence library, and calculating the frequency and lift of each frequent sequence, where the frequency represents the probability that the frequent sequence appears in the word sequence library and the lift represents the correlation between the words in the frequent sequence;
merging the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and updating the word segmentation result according to the newly added words;
building near-synonym groups from the updated word segmentation result, and, according to the groups, replacing each word in the word sequence library with the most frequent word of its near-synonym group;
and obtaining, from the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculating the variant confidence between the words of each such word sequence, and judging the upper and lower concepts between words from the calculation result, where the variant confidence represents the correlation between the words or sub-sequences within a word sequence.
In this embodiment, each sentence of the text to be processed is not only segmented but also subjected to syntactic dependency analysis, so the analysis result can be used to correct the word segmentation result and obtain a word sequence library. The word segmentation result is then updated according to the correlation between the words of the qualifying frequent sequences in the library; near-synonym groups are built from the updated word segmentation result, and each word in the library is replaced with the most frequent word of its group; finally, the variant confidence between the words of the qualifying word sequences is calculated, and the upper and lower concepts between words are judged from the result. In this way, concepts and upper-lower relations can be extracted accurately and effectively from non-definitional domain text.
Further, as shown in FIG. 3, the electronic device 30 for constructing a knowledge graph further includes a network interface 31, an input device 33, a hard disk 35, and a display device 36.
The interfaces and devices described above may be interconnected by a bus architecture, which may comprise any number of interconnected buses and bridges connecting together one or more central processing units (CPUs), represented by the processor 32, and various circuits of one or more memories, represented by the memory 34. The bus architecture may also connect various other circuits, such as peripheral devices, voltage regulators, and power management circuits, and it enables communication between all of these connected components. In addition to a data bus, the bus architecture includes a power bus, a control bus, and a status signal bus, all of which are well known in the art and therefore not described in detail herein.
The network interface 31 may be connected to a network (e.g., the Internet or a local area network) to obtain relevant data from the network, for example the input text to be processed, such as non-definitional domain text, which may be stored in the hard disk 35.
The input device 33 may receive various instructions entered by an operator and send them to the processor 32 for execution. The input device 33 may comprise a keyboard or a pointing device (e.g., a mouse, a trackball, a touch pad, or a touch screen).
The display device 36 may display results from the execution of instructions by the processor 32.
The memory 34 is used to store the programs and data necessary for the operation of the operating system, as well as data such as intermediate results produced during the calculations of the processor 32.
It will be appreciated that the memory 34 in embodiments of the invention may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory, among others. The volatile memory may be random access memory (RAM), which acts as an external cache. The memory 34 of the apparatus and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory 34 stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof: an operating system 341 and application programs 342.
The operating system 341 includes various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks. The application programs 342 include various applications, such as a browser, for implementing various application services. A program implementing the method of the embodiment of the invention may be included in the application programs 342.
When calling and executing the application programs and data stored in the memory 34, the processor 32 may specifically: perform word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library; screen out frequent sequences whose length is greater than a preset first threshold from the word sequence library, and calculate the frequency and lift of each frequent sequence, where the frequency represents the probability that the frequent sequence appears in the word sequence library and the lift represents the correlation between the words in the frequent sequence; merge the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and update the word segmentation result according to the newly added words; build near-synonym groups from the updated word segmentation result and, according to the groups, replace each word in the word sequence library with the most frequent word of its near-synonym group; and obtain, from the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculate the variant confidence between the words of each such word sequence, and judge the upper and lower concepts between words from the calculation result, where the variant confidence represents the correlation between the words or sub-sequences within a word sequence.
The method disclosed in the above embodiment of the invention may be applied to, or implemented by, the processor 32. The processor 32 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be performed by integrated logic circuits in hardware in the processor 32 or by instructions in the form of software. The processor 32 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the invention. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the invention may be embodied directly as being executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or register. The storage medium is located in the memory 34, and the processor 32 reads the information in the memory 34 and completes the steps of the above method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Specifically, the processor 32 performs word segmentation on each sentence in the text to be processed to obtain a word segmentation result; performs syntactic dependency analysis on each sentence based on the word segmentation result; corrects the word segmentation result according to the analysis result to obtain at least one group of word sequences for each sentence; and obtains a word sequence library comprising the word sequences of all sentences.
Specifically, when the head word of the sentence is a noun, the processor 32 determines the head word, recursively finds all attributive modifiers of the head word, and generates a word sequence comprising the head word and all of its attributive modifiers; when the head word of the sentence is a verb or an adjective, it judges whether the sentence has a subject-predicate structure, and when it does, determines the subject noun, recursively finds all attributive modifiers of the subject noun, and generates a word sequence comprising the subject noun and all of its attributive modifiers; when the sentence has no subject-predicate structure but has a verb-object structure, it determines the object noun, recursively finds all attributive modifiers of the object noun, and generates a word sequence comprising the object noun and all of its attributive modifiers; and when the head word of the sentence is neither a noun, a verb nor an adjective, it determines all attributive relations in the sentence, selects the noun with the most modifiers, recursively finds all attributive modifiers of that noun, and generates a word sequence comprising the noun and all of its attributive modifiers.
Further, when a frequent sequence includes a word A and a word B, its lift is lift(A, B) = P(B|A) / P(B), where P(B) is the proportion of 2-tuples containing B among all 2-tuples, and P(B|A) is the proportion of 2-tuples containing A in which B appears; here a 2-tuple is a word sequence of length 2 in the word sequence library.
Specifically, the processor 32 generates word vectors from the updated word segmentation result, calculates the cosine similarity between every pair of words based on the generated word vectors, and establishes all near-synonym groups based on a preset similarity threshold s.
Specifically, the processor 32 sorts all words in the updated word segmentation result by word frequency and establishes the near-synonym group of each word in turn, from high word frequency to low.
Specifically, the processor 32 calculates the similarity between the word and every other word; builds a set from the words whose similarity to the word is greater than the threshold s, ordered by that similarity; and judges, in descending order of similarity, whether the similarity between each word in the set and the other words in the set is greater than the threshold s, adding the judged word to the near-synonym group of the word if so.
Specifically, the processor 32 calculates the variant confidence between the rightmost word of the word sequence and the word or sub-sequence to its left, and, if the variant confidence is lower than a preset fifth threshold, judges the word or sub-sequence on the left to be the upper concept of the rightmost word.
Further, when the word sequence includes a word or sub-sequence M and a word or sub-sequence N, the variant confidence P(N|M) is the proportion, among all multi-word sequences containing M on the left, of those that contain N on the right.
Further, the preset first threshold is not smaller than 2, and the preset fourth threshold is 2 or 3.
Example 4
The embodiment of the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
performing word segmentation and syntactic dependency analysis on each sentence in the text to be processed to obtain a word segmentation result and a word sequence library;
screening out frequent sequences whose length is greater than a preset first threshold from the word sequence library, and calculating the frequency and lift of each frequent sequence, where the frequency represents the probability that the frequent sequence appears in the word sequence library and the lift represents the correlation between the words in the frequent sequence;
merging the words of frequent sequences whose lift is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and updating the word segmentation result according to the newly added words;
building near-synonym groups from the updated word segmentation result, and, according to the groups, replacing each word in the word sequence library with the most frequent word of its near-synonym group;
and obtaining, from the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculating the variant confidence between the words of each such word sequence, and judging the upper and lower concepts between words from the calculation result, where the variant confidence represents the correlation between the words or sub-sequences within a word sequence.
In this embodiment, each sentence of the text to be processed is not only segmented but also subjected to syntactic dependency analysis, so the analysis result can be used to correct the word segmentation result and obtain a word sequence library. The word segmentation result is then updated according to the correlation between the words of the qualifying frequent sequences in the library; near-synonym groups are built from the updated word segmentation result, and each word in the library is replaced with the most frequent word of its group; finally, the variant confidence between the words of the qualifying word sequences is calculated, and the upper and lower concepts between words are judged from the result. In this way, concepts and upper-lower relations can be extracted accurately and effectively from non-definitional domain text.
Further, the computer program, when executed by a processor, further causes the processor to perform the steps of:
performing word segmentation on each sentence in the text to be processed to obtain a word segmentation result;
and performing syntactic dependency analysis on each sentence in the text to be processed based on the word segmentation result, and correcting the word segmentation result according to the analysis result to obtain at least one group of word sequences for each sentence, thereby obtaining a word sequence library comprising the word sequences of all sentences.
Further, the computer program, when executed by a processor, further causes the processor to perform the steps of:
when the head word of the sentence is a noun, determining the head word, recursively finding all attributive modifiers of the head word, and generating a word sequence comprising the head word and all of its attributive modifiers;
when the head word of the sentence is a verb or an adjective, judging whether the sentence has a subject-predicate structure; when it does, determining the subject noun, recursively finding all attributive modifiers of the subject noun, and generating a word sequence comprising the subject noun and all of its attributive modifiers; when the sentence has no subject-predicate structure but has a verb-object structure, determining the object noun, recursively finding all attributive modifiers of the object noun, and generating a word sequence comprising the object noun and all of its attributive modifiers;
when the head word of the sentence is neither a noun, a verb nor an adjective, determining all attributive relations in the sentence, selecting the noun with the most modifiers, recursively finding all attributive modifiers of that noun, and generating a word sequence comprising the noun and all of its attributive modifiers.
Further, when a frequent sequence includes a word A and a word B, its lift is lift(A, B) = P(B|A) / P(B), where P(B) is the proportion of 2-tuples containing B among all 2-tuples, and P(B|A) is the proportion of 2-tuples containing A in which B appears; here a 2-tuple is a word sequence of length 2 in the word sequence library.
Further, the computer program, when executed by a processor, further causes the processor to perform the steps of:
generating word vectors from the updated word segmentation result;
and calculating the cosine similarity between every pair of words based on the generated word vectors, and establishing all near-synonym groups based on a preset similarity threshold s.
Further, the computer program, when executed by a processor, further causes the processor to perform the steps of:
sorting all words in the updated word segmentation result by word frequency;
establishing the near-synonym group of each word in turn, from high word frequency to low;
wherein establishing the near-synonym group of a word includes:
calculating the similarity between the word and every other word;
building a set from the words whose similarity to the word is greater than the threshold s, and ordering the set by that similarity;
and judging, in descending order of similarity, whether the similarity between each word in the set and the other words in the set is greater than the threshold s, and if so, adding the judged word to the near-synonym group of the word.
Further, the computer program, when executed by a processor, further causes the processor to perform the steps of:
calculating the variant confidence between the rightmost word of the word sequence and the word or sub-sequence to its left, and, if the variant confidence is lower than a preset fifth threshold, judging the word or sub-sequence on the left to be the upper concept of the rightmost word.
Further, when the word sequence includes a word or sub-sequence M and a word or sub-sequence N, the variant confidence P(N|M) is the proportion, among all multi-word sequences containing M on the left, of those that contain N on the right.
Further, the preset first threshold is not smaller than 2, and the preset fourth threshold is 2 or 3.
Example 5
The method for constructing a knowledge graph of the present invention is further described below with reference to specific embodiments, as shown in fig. 4, where the method for constructing a knowledge graph of the present embodiment includes the following steps:
step 401: receiving text to be processed;
the text to be processed may be a non-definitive domain text, and of course, the text to be processed is not limited to the non-definitive domain text, but may be a non-definitive domain text.
Step 402: performing word segmentation on each sentence in the text to be processed to obtain a word segmentation result;
specifically, each sentence in the text to be processed can be segmented by using the existing segmentation method, and the segmentation method belongs to the prior art and is not described herein.
Step 403: analyzing syntactic dependency relation of each sentence based on the word segmentation result to obtain a word sequence;
specifically, as shown in fig. 5, the syntactic dependency analysis for each sentence specifically includes the steps of:
step 501: judging whether the center word is a noun, if so, turning to step 502, and if not, turning to step 503;
Step 502: outputting a recursive center word and a centering modifier;
when the sentence center word is a noun, the center word is selected, and all the centering relation modifier words of the center word are found recursively. In a specific example, the word segmentation result is "display interface failure", the center word is "failure", the centering modifier is "display" and "interface", and the finally output word sequence includes "display interface failure".
Step 503: judging whether the center word is a verb or adjective, if so, turning to step 504, and if not, turning to step 506;
step 504: judging whether the sentence has a main-predicate structure, if so, turning to step 505, and if not, turning to step 508;
step 505: outputting recursive main language nouns and centering modifier words;
when the sentence center word is a verb or adjective, selecting a main noun in the main predicate structure, and recursively finding out centering relation modifier words of all the main nouns. In a specific example, the word segmentation result is "DNS server host power is damaged", the central word therein is "damaged", the main structure is "power", the centering modifier is "DNS", "server" and "host", and the finally output word sequence includes "DNS server host power".
Step 506: searching all centering structures, finding the most-modified noun, and turning to step 507;
step 507: outputting the noun and its recursively found centering modifiers;
All centering relations in the sentence are selected, the most-modified target word is chosen, and the centering modifiers of the target word are found recursively; if several target words are modified an equal number of times, the word appearing last in the sentence is selected. In a specific example, the word segmentation result is "DNS server host lower part water inflow": the centering modifiers are "DNS", "server" and "host", the intermediate word is "lower part", and the center word is "water inflow", so the finally output word sequence includes "DNS server host".
Step 508: judging whether a verb-object structure exists; if so, turning to step 509;
step 509: outputting the object noun and its recursively found centering modifiers.
When the center word of the sentence is a verb or adjective and a verb-object structure exists, the object noun in the verb-object structure is selected, and the centering modifiers of the object noun are found recursively. In a specific example, the word segmentation result is "change DNS server host power": the center word is "change", the object is "power", and the centering modifiers are "DNS", "server" and "host", so the finally output word sequence includes "DNS server host power".
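To make the flow of fig. 5 concrete, the following is a minimal Python sketch of the word-sequence extraction in steps 501-509. The Token structure and the relation labels ATT (centering/attributive relation), SBV (subject), VOB (object) and HED (root) follow an LTP-style dependency convention and are illustrative assumptions, not part of the patent itself.

```python
# Minimal sketch of the word-sequence extraction of fig. 5 (steps 501-509).
# Assumes an LTP-style dependency parse; the Token fields and the label names
# (ATT/SBV/VOB/HED) are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Token:
    idx: int    # 1-based position in the sentence
    word: str
    pos: str    # "n" noun, "v" verb, "a" adjective, ...
    head: int   # index of the governing token, 0 for the sentence root
    rel: str    # dependency relation to the head

def att_modifiers(tokens, target):
    """Recursively collect all centering-relation (ATT) modifiers of target."""
    mods = []
    for t in tokens:
        if t.head == target.idx and t.rel == "ATT":
            mods += att_modifiers(tokens, t) + [t]
    return mods

def extract_word_sequence(tokens):
    root = next(t for t in tokens if t.head == 0)            # the center word
    if root.pos == "n":                                      # steps 501-502
        target = root
    elif root.pos in ("v", "a"):                             # step 503
        subj = [t for t in tokens if t.head == root.idx and t.rel == "SBV"]
        obj = [t for t in tokens if t.head == root.idx and t.rel == "VOB"]
        if subj:                                             # steps 504-505
            target = subj[0]
        elif obj:                                            # steps 508-509
            target = obj[0]
        else:
            return []
    else:                                                    # steps 506-507
        nouns = [t for t in tokens if t.pos == "n"]
        counts = {t.idx: sum(1 for x in tokens
                             if x.head == t.idx and x.rel == "ATT")
                  for t in nouns}
        if not nouns or not any(counts.values()):
            return []
        # most-modified noun; ties go to the word appearing later
        target = max(nouns, key=lambda t: (counts[t.idx], t.idx))
    return [t.word for t in att_modifiers(tokens, target)] + [target.word]

# "DNS server host power is damaged" -> ['DNS', 'server', 'host', 'power']
sent = [Token(1, "DNS", "n", 2, "ATT"), Token(2, "server", "n", 3, "ATT"),
        Token(3, "host", "n", 4, "ATT"), Token(4, "power", "n", 5, "SBV"),
        Token(5, "damaged", "v", 0, "HED")]
print(extract_word_sequence(sent))
```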
Step 404: calculating the frequency and the lifting degree of the frequent sequences with a length greater than 2 in the word sequence library;
After step 403, one or more word sequences have been obtained from each sentence, and together they form a word sequence library. All frequent sequences with a length greater than 2 are found in the word sequence library, and their frequency and lifting degree are calculated. The frequency is the probability that the frequent sequence appears in the word sequence library, and the lifting degree indicates the correlation between the words in the frequent sequence: for a word sequence (A, B), the lifting degree is lift(A, B) = P(B|A)/P(B), where P(B|A) is the proportion of B appearing in all 2-tuples containing A, P(B) is the proportion of 2-tuples containing B among all 2-tuples, and the 2-tuples are the word sequences of length 2 in the word sequence library. For example, if A is "DNS" and B is "server", the lifting degree can be understood as the ratio of the probability that "server" appears given that "DNS" has appeared to the probability that "server" appears in a random 2-tuple; the higher the lifting degree, the stronger the correlation between the two. Note that, unlike the conventional lifting degree calculation, this embodiment counts occurrences only within the 2-tuples: occurrences of B outside the 2-tuples are not added to the frequency counts. The reason is that abbreviations are common in real text, for example "DNS server" may be written simply as "DNS", or only "server" may be written in a later sentence after "DNS server" has appeared in an earlier one; where this phenomenon is widespread it would significantly distort the calculated lifting degree.
The frequency may be expressed either as the number of times the frequent sequence occurs or as its percentage of the total number of all frequent sequences.
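As a sketch of this modified lifting degree calculation, the Python fragment below computes both P(B|A) and P(B) over the 2-tuples only, as described above; the toy 2-tuple data and the function name are illustrative.

```python
# Minimal sketch of the modified lifting degree of step 404: both P(B|A) and
# P(B) are computed over 2-tuples only. The data is illustrative.
bigrams = [("DNS", "server"), ("DNS", "server"), ("web", "server"),
           ("DNS", "record"), ("display", "interface")]

def lift(a, b):
    with_a = [t for t in bigrams if a in t]
    p_b_given_a = sum(1 for t in with_a if b in t) / len(with_a)  # P(B|A)
    p_b = sum(1 for t in bigrams if b in t) / len(bigrams)        # P(B)
    return p_b_given_a / p_b

print(lift("DNS", "server"))  # ~1.11: greater than 1, mildly positive correlation
```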
Step 405: screening out the word sequences whose lifting degree is greater than a preset threshold and whose frequency is greater than another preset threshold, forming new words from the words in these word sequences, and updating the word segmentation result;
For example, if the lifting degree of the word sequence "DNS server" is greater than the preset threshold, "DNS server" is used as a new word in the word segmentation result, and the word segmentation result is updated accordingly.
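A minimal sketch of this update step: once a frequent sequence passes both thresholds, its consecutive words are merged into a single new token throughout the segmentation result. The function and token names are illustrative.

```python
# Minimal sketch of step 405: merging a qualifying frequent sequence into a
# single new word in the segmentation result. Names are illustrative.
def merge_new_word(tokens, new_word_seq, joiner=" "):
    merged, i, n = [], 0, len(new_word_seq)
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == tuple(new_word_seq):
            merged.append(joiner.join(new_word_seq))  # e.g. "DNS server"
            i += n
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_new_word(["change", "DNS", "server", "host", "power"],
                     ("DNS", "server")))
# ['change', 'DNS server', 'host', 'power']
```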
Step 406: generating word vectors from the updated word segmentation result, calculating the cosine similarity between words based on the generated word vectors, and establishing all near-meaning word combinations based on a preset similarity threshold s;
Specifically, based on the existing word2vec method, word vectors are generated from the updated word segmentation result, the cosine similarity between every two words is calculated from these word vectors, and all near-meaning word combinations are found based on the similarity threshold s. The specific flow, shown in fig. 6, includes the following steps:
step 601: initializing the near-meaning word combinations as an empty set;
step 602: sorting all the words in the word sequence library (say N words) by word frequency, from high to low;
Step 603: judging whether word W_i is already included in a near-meaning word combination; if so, going to step 612, and if not, going to step 604 (the initial value of i may be 1);
step 604: calculating the similarity between word W_i and the other words;
step 605: extracting the words whose similarity is greater than the threshold s, and ranking them by similarity from high to low as (W_i1, W_i2, …, W_iK);
Step 606: initializing the candidate set of W_i as an empty set;
step 607: judging whether the similarity between word W_ik and every word in the candidate set is greater than the threshold s (the initial value of k may be 1); if so, going to step 608, and if not, going to step 609;
step 608: adding word W_ik to the candidate set;
step 609: adding 1 to the value of k;
step 610: judging whether k is greater than K; if so, going to step 611, and if not, going to step 607;
step 611: forming a near-meaning word combination from word W_i together with its candidate set, and adding it to the near-meaning word combinations;
step 612: adding 1 to the value of i;
step 613: judging whether the value of i is greater than N; if so, the process ends, and if not, going to step 604.
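The flow of steps 601-613 amounts to a greedy, frequency-ordered grouping of mutually similar words. Below is a minimal Python sketch of it; the word vectors, the threshold value and all names are illustrative assumptions, and `words` is assumed to be pre-sorted by frequency from high to low.

```python
# Minimal sketch of the near-meaning-word grouping of fig. 6 (steps 601-613).
# Vectors, names and the threshold are illustrative assumptions.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_combinations(words, vectors, s):
    sim = lambda a, b: cosine(vectors[a], vectors[b])
    combinations, grouped = [], set()                # step 601
    for w in words:                                  # steps 603, 612, 613
        if w in grouped:
            continue
        # words whose similarity to w exceeds s, ranked high to low (604-605)
        cands = sorted((x for x in words if x != w and sim(w, x) > s),
                       key=lambda x: sim(w, x), reverse=True)
        kept = []                                    # candidate set (step 606)
        for c in cands:                              # steps 607-610
            if all(sim(c, k) > s for k in kept):
                kept.append(c)
        combo = [w] + kept                           # step 611
        combinations.append(combo)
        grouped.update(combo)
    return combinations

# toy demo: two obviously close vectors and one distant one (illustrative)
vecs = {"failure": np.array([1.0, 0.1]), "fault": np.array([0.9, 0.2]),
        "power": np.array([0.0, 1.0])}
print(build_combinations(["failure", "fault", "power"], vecs, s=0.9))
# [['failure', 'fault'], ['power']]
```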
Step 407: according to the near-meaning word combinations, replacing each word in the word sequence library with the highest-frequency word in its near-meaning word combination;
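A minimal sketch of this replacement, assuming the near-meaning word combinations from step 406 and a word sequence library represented as lists of words; all names and data are illustrative.

```python
# Minimal sketch of step 407: every word is replaced by the highest-frequency
# member of its near-meaning word combination. Names are illustrative.
from collections import Counter

def canonicalize(sequences, combinations):
    freq = Counter(w for seq in sequences for w in seq)
    canon = {}
    for combo in combinations:
        rep = max(combo, key=lambda w: freq[w])   # most frequent member
        for w in combo:
            canon[w] = rep
    return [[canon.get(w, w) for w in seq] for seq in sequences]

seqs = [["fault", "log"], ["failure", "log"], ["failure", "report"]]
print(canonicalize(seqs, [["failure", "fault"]]))
# [['failure', 'log'], ['failure', 'log'], ['failure', 'report']]
```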
Step 408: calculating the variant confidence among the words of the 2-tuples and 3-tuples in the updated word sequence library, and judging the upper and lower concepts among the words according to the calculation result.
Practical verification shows that analyzing word sequences of other lengths adds little value, so in this embodiment the variant confidence is calculated only for word sequences of length 2 and 3; the variant confidence represents the correlation between the words or word sequences within a word sequence.
Specifically, the variant confidence P(N|M) of the rightmost word N and the left word sequence (or word) M may be calculated, where P(N|M) is the proportion, among all multi-word sequences containing M on the left, of those in which N appears on the right. Because this calculation differs from the conventional confidence calculation, the result is called a variant confidence. If the variant confidence is below a threshold b, the left word sequence (or word) is judged to be the upper concept of the right word. For example, for the frequent sequence (DNS server, host, power), first the proportion r1 of "host" appearing in all multi-word sequences in which "DNS server" appears is calculated; if r1 is smaller than b, "host" is regarded as a lower concept of "DNS server". Then the proportion r2 of "power" appearing in all multi-word sequences containing "DNS server, host" is calculated; if r2 is smaller than b, "DNS server host" is regarded as the upper concept of "power".
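The variant confidence calculation above can be sketched as follows; the toy word sequence library and the threshold b are illustrative.

```python
# Minimal sketch of the variant confidence of step 408: P(N|M) is the share of
# multi-word sequences containing M on the left in which N appears directly to
# its right; below the threshold b, M is judged an upper concept of N.
def variant_confidence(sequences, left, right):
    n = len(left)
    def spans(s):
        return [i for i in range(len(s) - n + 1) if tuple(s[i:i + n]) == tuple(left)]
    with_m = [s for s in sequences if spans(s)]
    if not with_m:
        return 0.0
    hits = sum(1 for s in with_m
               if any(i + n < len(s) and s[i + n] == right for i in spans(s)))
    return hits / len(with_m)

seqs = [["DNS server", "host", "power"], ["DNS server", "host"],
        ["DNS server", "record"], ["DNS server", "host", "fan"]]
b = 0.8
print(variant_confidence(seqs, ("DNS server",), "host"))          # 0.75 < b
print(variant_confidence(seqs, ("DNS server", "host"), "power"))  # ~0.33 < b
```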
In this embodiment, each sentence in the text to be processed is not only segmented but also subjected to syntactic dependency analysis, so the word segmentation result can be corrected with the dependency analysis result to obtain a word sequence library. The word segmentation result is then updated according to the correlation between the words in the qualifying frequent sequences of the word sequence library; near-meaning word combinations are established from the updated word segmentation result; each word in the word sequence library is replaced with the highest-frequency word in its near-meaning word combination; the variant confidence among the words of the qualifying word sequences in the word sequence library is calculated; and the upper and lower concepts among the words are judged according to the calculation result.
The foregoing is a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations shall also be regarded as falling within the scope of the present invention.

Claims (10)

1. A knowledge graph construction method, characterized by comprising the following steps:
analyzing the word segmentation and syntactic dependency relationships of each sentence in a text to be processed to obtain a word segmentation result and a word sequence library;
screening out frequent sequences with a length greater than a preset first threshold from the word sequence library, and calculating the frequency and the lifting degree of each frequent sequence, wherein the frequency represents the probability of occurrence of the frequent sequence in the word sequence library, and the lifting degree represents the correlation among the words in the frequent sequence;
combining the words included in the frequent sequences whose lifting degree is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and updating the word segmentation result according to the newly added words;
establishing near-meaning word combinations according to the updated word segmentation result, and, according to the near-meaning word combinations, replacing each word in the word sequence library with the highest-frequency word in its near-meaning word combination;
obtaining, in the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, calculating the variant confidence among the words in these word sequences, and judging the upper and lower concepts among the words according to the calculation result, wherein the variant confidence represents the correlation among the words or word sequences within a word sequence,
wherein judging the upper and lower concepts among the words according to the calculation result comprises:
calculating the variant confidence between the rightmost word in a word sequence and the word sequence or word on its left, and judging the left word sequence or word to be the upper concept of the rightmost word if the variant confidence is lower than a preset fifth threshold.
2. The knowledge graph construction method according to claim 1, wherein analyzing the word segmentation and syntactic dependency relationships of each sentence in the text to be processed to obtain the word segmentation result and the word sequence library comprises:
performing word segmentation on each sentence in the text to be processed to obtain a word segmentation result;
and analyzing the syntactic dependency relation of each sentence in the text to be processed based on the word segmentation result, correcting the word segmentation result according to the syntactic dependency relation analysis result to obtain at least one group of word sequences corresponding to each sentence, and obtaining a word sequence library comprising word sequences of all sentences.
3. The knowledge graph construction method according to claim 2, wherein correcting the word segmentation result according to the syntactic dependency analysis result to obtain at least one word sequence corresponding to each sentence comprises:
when the center word of a sentence is a noun, determining the center word, recursively finding all the centering relation modifiers of the center word, and generating a word sequence comprising the center word and all of its centering relation modifiers;
when the center word of a sentence is a verb or an adjective, judging whether the sentence has a subject-predicate structure; when the sentence has the subject-predicate structure, determining the subject noun in the subject-predicate structure, recursively finding all the centering relation modifiers of the subject noun, and generating a word sequence comprising the subject noun and all of its centering relation modifiers; when the sentence has no subject-predicate structure but has a verb-object structure, determining the object noun in the verb-object structure, recursively finding all the centering relation modifiers of the object noun, and generating a word sequence comprising the object noun and all of its centering relation modifiers;
when the center word of a sentence is neither a noun, a verb nor an adjective, determining all the centering relations in the sentence, selecting the most-modified noun, recursively finding all the centering relation modifiers of that noun, and generating a word sequence comprising the noun and all of its centering relation modifiers.
4. The knowledge graph construction method according to claim 1, wherein, when the frequent sequence includes a word A and a word B, the lifting degree lift(A, B) = P(B|A)/P(B), where P(B) is the proportion of 2-tuples containing B among all 2-tuples, P(B|A) is the proportion of B appearing in all 2-tuples containing A, and the 2-tuples are the word sequences of length 2 in the word sequence library.
5. The knowledge graph construction method according to claim 1, wherein establishing near-meaning word combinations according to the updated word segmentation result comprises:
generating word vectors from the updated word segmentation result;
calculating the cosine similarity between every two words based on the generated word vectors, and establishing all the near-meaning word combinations based on a preset similarity threshold s.
6. The knowledge graph construction method according to claim 5, wherein establishing all the near-meaning word combinations comprises:
sorting all the words in the updated word segmentation result by word frequency;
establishing the near-meaning word combination of each word in turn, from high word frequency to low;
wherein establishing the near-meaning word combination of a word comprises:
calculating the similarity between the word and the other words;
establishing a set from the at least one word whose similarity to the word is greater than the threshold s, and ordering the words in the set by their similarity to the word;
judging, in descending order of similarity, whether the similarity between each word in the set and every other word in the set is greater than the threshold s, and if so, adding the judged word to the near-meaning word combination of the word.
7. The knowledge graph construction method according to claim 1, wherein,
when the word sequence includes a word or word sequence M and a word or word sequence N, the variant confidence P(N|M) is the proportion, among all multi-word sequences containing M on the left, of those in which N appears on the right.
8. The knowledge graph construction method according to claim 1, wherein the preset first threshold is not less than 2, and the preset fourth threshold is 2 or 3.
9. A knowledge graph construction device, characterized by comprising:
an analysis module, configured to analyze the word segmentation and syntactic dependency relationships of each sentence in a text to be processed to obtain a word segmentation result and a word sequence library;
a first processing module, configured to screen out frequent sequences with a length greater than a preset first threshold from the word sequence library, and to calculate the frequency and the lifting degree of each frequent sequence, wherein the frequency represents the probability of occurrence of the frequent sequence in the word sequence library, and the lifting degree represents the correlation among the words in the frequent sequence;
a first updating module, configured to combine the words included in the frequent sequences whose lifting degree is greater than a preset second threshold and whose frequency is greater than a preset sixth threshold into newly added words, and to update the word segmentation result according to the newly added words;
a second updating module, configured to establish near-meaning word combinations according to the updated word segmentation result, and to replace each word in the word sequence library with the highest-frequency word in its near-meaning word combination;
a second processing module, configured to obtain, in the updated word sequence library, the word sequences whose frequency is higher than a preset third threshold and whose length equals a preset fourth threshold, to calculate the variant confidence among the words in these word sequences, and to judge the upper and lower concepts among the words according to the calculation result, wherein the variant confidence represents the correlation among the words or word sequences within a word sequence,
wherein the second processing module is configured to calculate the variant confidence between the rightmost word in a word sequence and the word sequence or word on its left, and to judge the left word sequence or word to be the upper concept of the rightmost word if the variant confidence is lower than a preset fifth threshold.
10. An electronic device for constructing a knowledge graph, comprising:
a processor; and
a memory in which computer program instructions are stored,
wherein the computer program instructions, when executed by the processor, cause the processor to perform the steps of the knowledge graph construction method of any one of claims 1-8.
CN201810620223.6A 2018-06-15 2018-06-15 Knowledge graph construction method and device and electronic equipment Active CN110674306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810620223.6A CN110674306B (en) 2018-06-15 2018-06-15 Knowledge graph construction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810620223.6A CN110674306B (en) 2018-06-15 2018-06-15 Knowledge graph construction method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110674306A CN110674306A (en) 2020-01-10
CN110674306B true CN110674306B (en) 2023-06-20

Family

ID=69065270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810620223.6A Active CN110674306B (en) 2018-06-15 2018-06-15 Knowledge graph construction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110674306B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325033B (en) * 2020-03-20 2023-07-11 中国建设银行股份有限公司 Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN111353303B (en) * 2020-05-25 2020-08-25 腾讯科技(深圳)有限公司 Word vector construction method and device, electronic equipment and storage medium
CN112241734A (en) * 2020-10-15 2021-01-19 首域科技(杭州)有限公司 Method and system for diagnosing equipment fault through knowledge graph and Bayesian network
CN112802569B (en) * 2021-02-05 2023-08-08 北京嘉和海森健康科技有限公司 Semantic information acquisition method, device, equipment and readable storage medium
CN115221872B (en) * 2021-07-30 2023-06-02 苏州七星天专利运营管理有限责任公司 Vocabulary expansion method and system based on near-sense expansion
CN113901800A (en) * 2021-08-31 2022-01-07 北京影谱科技股份有限公司 Method and system for extracting scene map from Chinese text
CN116467405A (en) * 2022-01-12 2023-07-21 腾讯科技(深圳)有限公司 Text processing method, device, equipment and computer readable storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07160723A (en) * 1993-11-16 1995-06-23 Roehm Properties Bv Output device of retrieval word
US8402030B1 (en) * 2011-11-21 2013-03-19 Raytheon Company Textual document analysis using word cloud comparison
CN106569993A (en) * 2015-10-10 2017-04-19 ***通信集团公司 Method and device for mining hypernym-hyponym relation between domain-specific terms
CN108038096A (en) * 2017-11-10 2018-05-15 平安科技(深圳)有限公司 Knowledge database document rapid retrieval method, application server, and computer-readable storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Mixing semantic networks and conceptual vectors: application to hyperonymy; V. Prince et al.; IEEE Transactions on Systems, Man, and Cybernetics, Part C; Vol. 36, No. 2; pp. 152-160 *
Modeling and extracting hyponymy relationships on Chinese electric power field content; Dong-ru Ruan et al.; 2016 8th International Conference on Modelling, Identification and Control (ICMIC); pp. 439-443 *
Acquisition and organization of hypernym-hyponym relations of domain entities combining word vectors and Bootstrapping (in Chinese); Ma Xiaojun et al.; Computer Science (《计算机科学》); Vol. 45, No. 01; pp. 67-72 *
Research on automatic acquisition of hypernym-hyponym relations of domain entities (in Chinese); Cheng Yunru; China Master's Theses Full-text Database, Information Science and Technology; Vol. 2017, No. 02; p. I138-4456 *

Also Published As

Publication number Publication date
CN110674306A (en) 2020-01-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant