CN111723587A - Chinese-Thai entity alignment method oriented to cross-language knowledge graph - Google Patents


Info

Publication number
CN111723587A
CN111723587A (application CN202010578711.2A)
Authority
CN
China
Prior art keywords
chinese
entity
thai
model
bilingual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010578711.2A
Other languages
Chinese (zh)
Inventor
黄永忠
吴辉文
庄浩宇
徐鑫宇
张晨昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tianyun Xin'an Technology Co ltd
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202010578711.2A priority Critical patent/CN111723587A/en
Publication of CN111723587A publication Critical patent/CN111723587A/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06F16/3344: Query execution using natural language analysis
    • G06F16/367: Creation of semantic tools; ontology
    • G06F40/295: Named entity recognition
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/048: Neural networks; activation functions
    • G06N3/08: Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a Chinese-Thai entity alignment method oriented to cross-language knowledge graphs, characterized by comprising the following steps: 1) acquiring a bilingual data set; 2) constructing and training a machine translation model; 3) extracting entities; 4) translating and matching the entities. The method aligns bilingual entities more effectively and accurately, addressing the low entity-alignment accuracy encountered in existing cross-language knowledge graph construction.

Description

Chinese-Thai entity alignment method oriented to cross-language knowledge graph
Technical Field
The invention relates to the field of artificial intelligence, belongs to cross-language knowledge graph technology, and particularly relates to a Chinese-Thai entity alignment method oriented to cross-language knowledge graphs.
Background
With the continuous development of artificial intelligence, knowledge plays an increasingly important role across its various fields. In recent years, the construction of cross-language knowledge graphs has become a hot research area. Although bilingually aligned sentences are increasingly abundant on the internet, the accuracy of multi-language entity alignment remains unsatisfactory, and this low degree of entity alignment limits the construction of cross-language knowledge graphs.
Generally, the entity alignment method commonly used at present first performs entity recognition and then finds identical or similar entities across languages with corresponding techniques, thereby aligning entities in multiple languages. In aligned bilingual sentences, every entity in one sentence has a corresponding entity in the aligned sentence. If existing translation software such as Google Translate, Youdao Translate, or Baidu Translate is used directly, translation accuracy is high only for the small portion of well-known entities such as famous person and place names; for the large portion of less-known person, place, and organization names, such general-purpose software struggles to translate the entities accurately, so mistranslations occur easily and the alignment effect is poor.
To improve the accuracy of entity alignment in bilingual sentences for less-known person, place, and organization names, a feasible approach is to train a machine translation model on the existing bilingual sentences, extract the entities from sentences in one language with a suitable entity extraction method, and finally translate the extracted entities with the trained model so as to match the aligned entities in the other language, achieving bilingual entity alignment. Because every entity word to be aligned in the bilingual sentences is contained in the training data of the translation model, translation of such less-known entities is more accurate, which improves the entity alignment effect.
Disclosure of Invention
The invention aims to provide, for the cross-language knowledge graph construction process, a Chinese-Thai entity alignment method oriented to cross-language knowledge graphs, addressing the prior-art problem that less-known entities in bilingual sentences are aligned with low accuracy. The method aligns bilingual entities more effectively and accurately, solving the low entity-alignment accuracy encountered in existing cross-language knowledge graph construction.
The technical scheme for realizing the purpose of the invention is as follows:
A Chinese-Thai entity alignment method oriented to cross-language knowledge graphs comprises the following steps:
1) bilingual dataset acquisition: acquire Chinese-Thai aligned bilingual data from multi-language knowledge bases such as Wikidata and YAGO or from major Chinese-Thai bilingual websites; the data sets are aligned Chinese-Thai sentence pairs, so that every entity in a Chinese sentence has an aligned counterpart in the corresponding Thai sentence;
2) constructing and training a machine translation model: machine translation (MT) is the process of using a computer to convert one natural language (the source language) into another (the target language), taking a source-language sentence as input and outputting the corresponding target-language sentence. The bilingual data set acquired in step 1) is used to train the constructed machine translation model, yielding a trained Chinese-Thai translation model that later translates the entities extracted in step 3) for use in step 4). The process is as follows:
1-2) data preprocessing: preprocess the Chinese-Thai bilingual data set obtained in step 1) and convert it into the standard data format for training a machine translation model, splitting it into a Chinese sentence file Ch.txt and a Thai sentence file Th.txt, where the sentences in Ch.txt correspond line by line to the sentences in Th.txt;
2-2) word segmentation: the Chinese data set is segmented with the jieba word segmentation tool and the Thai data set with the cutkum tool, with words separated by spaces;
3-2) constructing a Transformer translation model: the Transformer model adopts the Encoder-Decoder architecture typical of Seq2Seq models, but unlike earlier Seq2Seq models, the Transformer encoder and decoder do not use recurrent neural network structures. The main structures of the encoder and decoder are as follows:
1-3-2) encoder: the encoding side of the Transformer model is a stack of identical layers, each consisting of two sub-layers: Multi-Head Attention and a fully-connected Feed-Forward network. Multi-head attention implements self-attention in the model; compared with an ordinary attention mechanism, it applies several linear transformations to the input in parallel, computes attention for each projection separately, concatenates all the results, and applies a final linear transformation to produce the output. The attention uses dot products; to avoid entering the saturation region of softmax when dot products become large, the scores are scaled after the dot product. The fully-connected feed-forward network performs the same computation at every position in the sequence (position-wise) and uses a structure of two linear transformations with a ReLU activation in between;
2-3-2) decoder: the decoder is similar in structure to the encoder, but each decoder layer adds one more multi-head attention sub-layer that attends over the encoder output;
3-3-2) construction of the Transformer translation model: build it with the Baidu PaddlePaddle, PyTorch, or TensorFlow framework;
4-3-2) after the model is constructed, load the segmented data from step 2-2) into the Transformer translation model for training to obtain a trained Transformer translation model, i.e. the Chinese-Thai translation model Ch-Th-Translation.model;
3) entity extraction: extract entities from the Chinese sentences with a currently open-source Chinese entity extraction tool such as Stanford NLP, or with a common Chinese named entity recognition model such as BiLSTM+CRF or CRF++;
4) entity translation and matching: entity translation combines currently common translation software with the Transformer translation model; the specific process is as follows:
1-4) first translate the Chinese entity NER-A extracted in step 3) with currently common translation software such as Google Translate, Youdao Translate, or Baidu Translate to obtain the translated entity NER1-A, then match it against the corresponding Thai sentence; if the match succeeds, proceed to align the next entity, and if it fails, go to step 2-4);
2-4) translate the entity NER-A that failed to match in step 1-4) with the Ch-Th-Translation.model trained in step 4-3-2) to obtain the translated entity NER2-A and match it against the corresponding Thai sentence; if the match succeeds, the entity NER-A in the Chinese sentence and its corresponding entity NER-B in the Thai sentence are obtained;
3-4) finally, store the aligned pair "NER-A:NER-B", completing entity alignment in the Chinese-Thai bilingual sentences.
Compared with the prior art, the method overcomes the low translation accuracy and poor alignment of existing translation software on less-known entities, improves the quality of multi-language entity alignment, and reduces the difficulty of constructing cross-language knowledge graphs.
Drawings
FIG. 1 is a schematic diagram of the network structure of the Transformer translation model in the embodiment;
FIG. 2 is a schematic diagram of the multi-head attention structure in the embodiment;
FIG. 3 is a schematic diagram of the Chinese-Thai bilingual entity alignment process in the embodiment;
FIG. 4 is an example of the jieba word segmentation key code in the embodiment;
FIG. 5 is an example of the data after jieba word segmentation in the embodiment;
FIG. 6 is an example of the cutkum word segmentation key code in the embodiment;
FIG. 7 is an example of the data after cutkum word segmentation in the embodiment;
FIG. 8 is an example of the Stanford NLP entity extraction key code in the embodiment.
Detailed Description
The invention will be further elucidated with reference to the drawings and examples, without however being limited thereto.
Example:
This example takes a Chinese-Thai bilingual dataset as input, Python as the development language, and Pycharm as the development environment.
referring to fig. 3, a cross-language knowledge graph-oriented Chinese Thai entity alignment method includes the following steps:
1) bilingual dataset acquisition: acquire Chinese-Thai aligned bilingual data from multi-language knowledge bases such as Wikidata and YAGO or from major Chinese-Thai bilingual websites; the data sets are aligned Chinese-Thai sentence pairs, so that every entity in a Chinese sentence has an aligned counterpart in the corresponding Thai sentence. In this example, as shown in Table A, each Chinese entity in Chinese sentence 1-A has an aligned Thai entity in Thai sentence 1-B;
Table A: example of aligned Chinese-Thai sentence data
2) constructing and training a machine translation model: construct a Transformer translation model and train it on the Chinese-Thai bilingual data set obtained in step 1) to obtain a trained Chinese-Thai translation model, which later translates the entities extracted in step 3) for use in step 4). The process is as follows:
1-2) data preprocessing: preprocess the Chinese-Thai bilingual data set obtained in step 1) and convert it into the standard data format for training a machine translation model, splitting it into a Chinese sentence file Ch.txt and a Thai sentence file Th.txt, where the sentences in Ch.txt correspond line by line to the sentences in Th.txt;
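The preprocessing step above can be sketched in a few lines of Python (a minimal illustrative sketch, not taken from the patent; the function name and the in-memory pair format are assumptions):

```python
# Minimal sketch of step 1-2): write aligned (Chinese, Thai) sentence pairs
# into two line-aligned files, Ch.txt and Th.txt. The pair format and the
# helper name are illustrative assumptions, not part of the patent.

def split_parallel_corpus(pairs, ch_path="Ch.txt", th_path="Th.txt"):
    """Write (chinese, thai) sentence pairs into two line-aligned files."""
    with open(ch_path, "w", encoding="utf-8") as ch_f, \
         open(th_path, "w", encoding="utf-8") as th_f:
        for ch_sent, th_sent in pairs:
            # one sentence per line; line i of Ch.txt aligns with line i of Th.txt
            ch_f.write(ch_sent.strip() + "\n")
            th_f.write(th_sent.strip() + "\n")
```

The line-by-line correspondence produced here is what the translation-model training step relies on.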
2-2) word segmentation: the Chinese data set Ch.txt is segmented with the jieba word segmentation tool (FIG. 4 shows an example of the jieba key code) and the segmented data is saved to the file Ch_Seq.txt with words separated by spaces, as shown in FIG. 5; the sentences of the Thai data set Th.txt are segmented with the cutkum tool (FIG. 6 shows an example of the cutkum key code) and the segmented data is saved to the file Th_Seq.txt, likewise with words separated by spaces, as shown in FIG. 7;
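Step 2-2) can be sketched with a pluggable tokenizer (the key code itself appears only as figures in the original; this sketch and the assumption that jieba would be plugged in as, e.g., `lambda s: list(jieba.cut(s))` and cutkum analogously for Thai are illustrative):

```python
# Sketch of step 2-2): read sentences, tokenize each, and write the tokens
# space-separated (e.g. Ch.txt -> Ch_Seq.txt). `tokenize` is any callable
# mapping a sentence to a list of words; jieba (Chinese) or cutkum (Thai)
# would be plugged in here in the embodiment.

def segment_file(in_path, out_path, tokenize):
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(" ".join(tokenize(line.strip())) + "\n")
```

Keeping the tokenizer as a parameter lets the same routine serve both languages.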
3-2) constructing a Transformer translation model: the Transformer model adopts the Encoder-Decoder architecture typical of Seq2Seq models, but unlike earlier Seq2Seq models, the Transformer encoder and decoder do not use recurrent neural network structures. The overall network structure is shown in FIG. 1, and the main structures of the encoder and decoder are as follows:
1-3-2) encoder: the encoding side of the Transformer model is a stack of identical layers, each consisting of two sub-layers: Multi-Head Attention and a fully-connected Feed-Forward network. Multi-head attention implements self-attention in the model; compared with an ordinary attention mechanism, it applies several linear transformations to the input in parallel, computes attention for each projection separately, concatenates all the results, and applies a final linear transformation to produce the output, as shown in FIG. 2. The attention uses dot products; to avoid entering the saturation region of softmax when dot products become large, the scores are scaled after the dot product. The fully-connected feed-forward network performs the same computation at every position in the sequence (position-wise) and uses a structure of two linear transformations with a ReLU activation in between;
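The scaled dot-product attention described above can be illustrated with a framework-free sketch (pure Python, single head; in the full model this computation runs over several projected heads in parallel and the results are concatenated):

```python
# Illustrative sketch of scaled dot-product attention:
# scores = Q·Kᵀ / sqrt(d_k), softmax over each row, weighted sum of V.
import math

def softmax(row):
    m = max(row)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    d_k = len(K[0])
    # scaling by 1/sqrt(d_k) keeps softmax out of its saturation region
    # when dot products grow with the dimension
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d_k) for kj in K]
              for qi in Q]
    weights = [softmax(row) for row in scores]
    # output_i = sum_j weights[i][j] * V[j]
    return [[sum(w * v[t] for w, v in zip(wi, V)) for t in range(len(V[0]))]
            for wi in weights]
```

With a query strongly aligned to one key, the output concentrates on that key's value, which is the behavior the scaling is meant to preserve.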
2-3-2) decoder: the decoder is similar in structure to the encoder, but each decoder layer adds one more multi-head attention sub-layer that attends over the encoder output;
3-3-2) construction and training of the Transformer model: a Transformer model is built with the Baidu PaddlePaddle framework; this example downloads the implementation from:
https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/machine_translation/transformer
4-3-2) after the model is constructed, load the segmented data from step 2-2) into the Transformer model for training to obtain a trained Transformer translation model, i.e. the Chinese-Thai translation model Ch-Th-Translation.model;
3) entity extraction: in this embodiment, Stanford NLP is used to extract entities from the Chinese sentences; the process is as follows:
1-3) first download the Stanford CoreNLP package from
http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
and decompress it; then download the Chinese model jar file from
http://nlp.stanford.edu/software/stanford-chinese-corenlp-2016-10-31-models.jar
and place it under the root directory;
2-3) a key code example of Stanford NLP entity extraction is shown in FIG. 8; the Stanford NLP tool is used to extract the entities of sentence 1-A in the Chinese file Ch.txt, yielding the Chinese entity NER-A;
4) entity translation and matching: entity translation combines currently common translation software with the Transformer translation model; the specific process is as follows:
1-4) first translate the Chinese entity NER-A extracted in step 2-3) with the currently common Google Translate software to obtain the translated entity NER1-A, then match it against the corresponding Thai sentence 1-B; if the match succeeds, proceed to align the next entity, and if it fails, go to step 2-4);
2-4) translate the entity NER-A that failed to match in step 1-4) with the Chinese-Thai translation model Ch-Th-Translation.model trained in step 4-3-2) to obtain the translated entity NER2-A and match it against the corresponding Thai sentence 1-B; if the match succeeds, the entity NER-A in sentence 1-A and its corresponding entity NER-B in sentence 1-B are obtained; if it fails, proceed to align the next entity;
3-4) finally, store the aligned pair "NER-A:NER-B", completing entity alignment in the Chinese-Thai bilingual sentences.
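The two-stage translate-and-match procedure of steps 1-4) to 3-4) can be sketched as follows (the two translator callables are hypothetical stand-ins for the general-purpose translation software and the trained Ch-Th-Translation.model, and substring matching is a simplification of the matching step):

```python
# Sketch of steps 1-4) to 3-4): translate each extracted Chinese entity with a
# general-purpose translator first, and fall back to the trained Chinese-Thai
# model when the result is not found in the aligned Thai sentence.
# `general_translate` and `custom_translate` are stand-in callables
# (str -> str or None), not real APIs.

def align_entities(entities, thai_sentence, general_translate, custom_translate):
    """Return {chinese_entity: thai_entity} for entities matched in thai_sentence."""
    aligned = {}
    for ner_a in entities:
        ner1_a = general_translate(ner_a)       # step 1-4): common software first
        if ner1_a and ner1_a in thai_sentence:
            aligned[ner_a] = ner1_a
            continue
        ner2_a = custom_translate(ner_a)        # step 2-4): trained model fallback
        if ner2_a and ner2_a in thai_sentence:
            aligned[ner_a] = ner2_a             # step 3-4): store the NER-A:NER-B pair
    return aligned
```

The fallback ordering reflects the patent's premise: general-purpose software handles well-known entities, while the trained model covers less-known ones seen in the bilingual corpus.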

Claims (1)

1. A Chinese-Thai entity alignment method oriented to cross-language knowledge graphs, characterized by comprising the following steps:
1) bilingual dataset acquisition: acquire a Chinese-Thai aligned bilingual data set from multi-language knowledge bases such as Wikidata and YAGO or from major Chinese-Thai bilingual websites; the data set consists of aligned Chinese-Thai sentence pairs, so that every entity in a Chinese sentence has an aligned counterpart in the corresponding Thai sentence;
2) constructing and training a machine translation model: construct a Transformer translation model and train it on the bilingual data set obtained in step 1) to obtain a trained Chinese-Thai translation model; the process is as follows:
1-2) data preprocessing: preprocess the Chinese-Thai bilingual data set obtained in step 1) and convert it into the standard data format for training a machine translation model, splitting it into a Chinese sentence file Ch.txt and a Thai sentence file Th.txt, where the sentences in Ch.txt correspond line by line to the sentences in Th.txt;
2-2) word segmentation: the Chinese data set is segmented with the jieba word segmentation tool and the Thai data set with the cutkum tool, with words separated by spaces;
3-2) constructing a Transformer translation model: the Transformer model adopts the Encoder-Decoder architecture typical of Seq2Seq models, but unlike earlier Seq2Seq models, the Transformer encoder and decoder do not use recurrent neural network structures; the main structures of the encoder and decoder are as follows:
1-3-2) encoder: the encoding side of the Transformer model is a stack of identical layers, each consisting of two sub-layers: Multi-Head Attention and a fully-connected Feed-Forward network; multi-head attention implements self-attention in the model by applying several linear transformations to the input in parallel, computing attention for each projection separately, concatenating all the results, and applying a final linear transformation to produce the output; the attention uses dot products (Dot-Product), with scaling applied after the dot product; the fully-connected feed-forward network performs the same computation at every position in the sequence (position-wise) and uses a structure of two linear transformations with a ReLU activation in between;
2-3-2) decoder: the decoder is similar in structure to the encoder, but each decoder layer adds one more multi-head attention sub-layer that attends over the encoder output;
3-3-2) construction of the Transformer translation model: build it with the Baidu PaddlePaddle, PyTorch, or TensorFlow framework;
4-3-2) after the model is constructed, load the segmented data from step 2-2) into the Transformer translation model for training to obtain a trained translation model, i.e. the Chinese-Thai translation model;
3) entity extraction: extract entities from the Chinese sentences with a currently open-source Chinese entity extraction tool or with a common Chinese named entity recognition model;
4) entity translation and matching: entity translation combines currently common translation software with the Transformer translation model; the specific process is as follows:
1-4) first translate the Chinese entity NER-A extracted in step 3) with currently common translation software to obtain the translated entity NER1-A, then match it against the corresponding Thai sentence; if the match succeeds, proceed to align the next entity, and if it fails, go to step 2-4);
2-4) translate the entity NER-A that failed to match in step 1-4) with the Chinese-Thai translation model trained in step 4-3-2) to obtain the translated entity NER2-A and match it against the corresponding Thai sentence; if the match succeeds, the entity NER-A in the Chinese sentence and its corresponding entity NER-B in the Thai sentence are obtained;
3-4) finally, store the aligned pair "NER-A:NER-B", completing entity alignment in the Chinese-Thai bilingual sentences.
CN202010578711.2A 2020-06-23 2020-06-23 Chinese-Thai entity alignment method oriented to cross-language knowledge graph Pending CN111723587A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010578711.2A CN111723587A (en) 2020-06-23 2020-06-23 Chinese-Thai entity alignment method oriented to cross-language knowledge graph


Publications (1)

Publication Number Publication Date
CN111723587A true CN111723587A (en) 2020-09-29

Family

ID=72568256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010578711.2A Pending CN111723587A (en) 2020-06-23 2020-06-23 Chinese-Thai entity alignment method oriented to cross-language knowledge graph

Country Status (1)

Country Link
CN (1) CN111723587A (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101676898A (en) * 2008-09-17 2010-03-24 中国科学院自动化研究所 Method and device for translating Chinese organization name into English with the aid of network knowledge
CN103853710A (en) * 2013-11-21 2014-06-11 北京理工大学 Coordinated training-based dual-language named entity identification method
CN106682670A (en) * 2016-12-19 2017-05-17 Tcl集团股份有限公司 Method and system for identifying station caption
CN107633079A (en) * 2017-09-25 2018-01-26 重庆邮电大学 A kind of vehicle device natural language human-machine interactions algorithm based on database and neutral net
CN109190131A (en) * 2018-09-18 2019-01-11 北京工业大学 A kind of English word and its capital and small letter unified prediction based on neural machine translation
CN109670178A (en) * 2018-12-20 2019-04-23 龙马智芯(珠海横琴)科技有限公司 Sentence-level bilingual alignment method and device, computer readable storage medium
CN110083826A (en) * 2019-03-21 2019-08-02 昆明理工大学 A kind of old man's bilingual alignment method based on Transformer model
CN111159426A (en) * 2019-12-30 2020-05-15 武汉理工大学 Industrial map fusion method based on graph convolution neural network
CN111259652A (en) * 2020-02-10 2020-06-09 腾讯科技(深圳)有限公司 Bilingual corpus sentence alignment method and device, readable storage medium and computer equipment

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
BLACK_SHUANG: "Illustrated Transformer model (Multi-Head Attention)", https://blog.csdn.net/black_shuang/article/details/95384597, 10 July 2019 (2019-07-10), pages 1-3 *
SHIZE KANG et al.: "Iterative Cross-Lingual Entity Alignment Based on TransC", IEICE Trans. Inf. & Syst., 30 May 2020 (2020-05-30), pages 1002-1005 *
ZEQUN SUN et al.: "Cross-lingual Entity Alignment via Joint Attribute-Preserving Embedding", arXiv:1708.05045v2, 26 September 2017 (2017-09-26), page 14 *
LIU Qingfeng et al.: "A domain-personalized machine translation method fusing external dictionary knowledge in conference scenarios", Journal of Chinese Information Processing, vol. 33, no. 10, 15 October 2019 (2019-10-15), pages 31-37 *
WU Huiwen: "Research on Thai word segmentation and entity extraction techniques", China Master's Theses Full-text Database (Information Science and Technology), no. 2, 15 February 2022 (2022-02-15), pages 138-1336 *
KANG Shize et al.: "A ... based on entity descriptions and knowledge-vector similarity" (title truncated in the record), Acta Electronica Sinica, vol. 47, no. 9, 15 September 2019 (2019-09-15), pages 1841-1847 *
ZHANG Jinpeng et al.: "Chinese-Thai bilingual person-name alignment fusing distributional features of person-name knowledge", http://kns.cnki.net/kcms/detail/11.2127.TP.20190305.1453.002.html, 6 March 2019 (2019-03-06), pages 1-11 *
HU Hongsi et al.: "Sentence alignment of bilingual comparable corpora based on ***", Journal of Chinese Information Processing, vol. 30, no. 1, 15 January 2016 (2016-01-15), pages 198-203 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417159A (en) * 2020-11-02 2021-02-26 武汉大学 Cross-language entity alignment method of context alignment enhanced graph attention network
CN112674734A (en) * 2020-12-29 2021-04-20 电子科技大学 Pulse signal noise detection method based on supervision Seq2Seq model
CN113220975A (en) * 2021-05-20 2021-08-06 北京欧拉认知智能科技有限公司 Atlas-based search analysis method and system
CN115455981A (en) * 2022-11-11 2022-12-09 合肥智能语音创新发展有限公司 Semantic understanding method, device, equipment and storage medium for multi-language sentences
CN115455981B (en) * 2022-11-11 2024-03-19 合肥智能语音创新发展有限公司 Semantic understanding method, device and equipment for multilingual sentences and storage medium

Similar Documents

Publication Publication Date Title
CN111723587A (en) Chinese-Thai entity alignment method oriented to cross-language knowledge graph
CN110334361B (en) Neural machine translation method for Chinese language
CN109684648B (en) Multi-feature fusion automatic translation method for ancient and modern Chinese
CN107632981B (en) Neural machine translation method introducing source language chunk information coding
JP2022028887A (en) Method, apparatus, electronic device and storage medium for correcting text errors
Schulz et al. Multi-modular domain-tailored OCR post-correction
CN110765791B (en) Automatic post-editing method and device for machine translation
CN109635288A (en) A kind of resume abstracting method based on deep neural network
CN112818712B (en) Machine translation method and device based on translation memory library
CN110110334B (en) Remote consultation record text error correction method based on natural language processing
CN110837736B (en) Named entity recognition method of Chinese medical record based on word structure
CN111680169A (en) Electric power scientific and technological achievement data extraction method based on BERT model technology
CN117010500A (en) Visual knowledge reasoning question-answering method based on multi-source heterogeneous knowledge joint enhancement
Hsu et al. Prompt-learning for cross-lingual relation extraction
CN115510864A (en) Chinese crop disease and pest named entity recognition method fused with domain dictionary
Serrano et al. Interactive handwriting recognition with limited user effort
CN111444720A (en) Named entity recognition method for English text
CN115860015B (en) Translation memory-based transcription text translation method and computer equipment
Shi et al. Adding Visual Information to Improve Multimodal Machine Translation for Low‐Resource Language
CN115392255A (en) Few-sample machine reading understanding method for bridge detection text
CN112380882B (en) Mongolian Chinese neural machine translation method with error correction function
Naranpanawa et al. Analyzing subword techniques to improve english to sinhala neural machine translation
CN114139610A (en) Traditional Chinese medicine clinical literature data structuring method and device based on deep learning
CN114139561A (en) Multi-field neural machine translation performance improving method
Romero et al. A historical document handwriting transcription end-to-end system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
  Effective date of registration: 20220429
  Address after: 100193 room 316, floor 3, building 4, yard 8, Dongbeiwang West Road, Haidian District, Beijing
  Applicant after: Beijing Tianyun Xin'an Technology Co., Ltd.
  Address before: 541004 1 Jinji Road, Qixing District, Guilin, the Guangxi Zhuang Autonomous Region
  Applicant before: Guilin University of Electronic Technology
WD01 Invention patent application deemed withdrawn after publication
  Application publication date: 20200929