CN116579343A - Named entity recognition method for Chinese cultural tourism text - Google Patents

Named entity recognition method for Chinese cultural tourism text

Info

Publication number
CN116579343A
Authority
CN
China
Prior art keywords
representation
chinese
chinese text
cnn
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310560194.XA
Other languages
Chinese (zh)
Other versions
CN116579343B (en)
Inventor
秦智
杜自豪
刘恩洋
张仕斌
昌燕
胡贵强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology filed Critical Chengdu University of Information Technology
Priority to CN202310560194.XA
Publication of CN116579343A
Application granted
Publication of CN116579343B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/048: Activation functions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a named entity recognition method for Chinese cultural tourism text, which comprises the following steps: S1, acquiring Chinese cultural tourism text data, and inputting the data into a character embedding layer to obtain a character vector representation; S2, inputting the character vector representation into a bidirectional long short-term memory (BiLSTM) network layer to obtain a context representation; S3, inputting the context representation into a CNN layer to obtain a multi-scale local context feature fusion representation; S4, inputting the multi-scale local context feature fusion representation into a CRF layer, which performs sequence labeling to complete named entity recognition for Chinese cultural tourism text. Considering that named entity recognition for Chinese cultural tourism text has received little research attention, the invention constructs a network specifically for such text data: the second CNN module in the CNN layer learns a multi-scale local context feature fusion representation, strengthening the correlation between semantics and improving the feature representation in a way that benefits recognition in Chinese.

Description

Named entity recognition method for Chinese cultural tourism text
Technical Field
The invention belongs to the technical field of information extraction, and particularly relates to a named entity recognition method for Chinese cultural tourism text.
Background
Named Entity Recognition (NER) is a basic information extraction task that supports many downstream tasks in Natural Language Processing (NLP), such as information extraction, social media analysis, search engines, machine translation, and knowledge graphs. The goal of NER is to extract predefined entities from sentences and identify their correct types, such as person, place, or organization. Early named entity recognition methods fell into two categories, rule-based and statistics-based. As deep learning has grown more powerful, NER research has made tremendous progress and covers many fields, such as the medical, financial, and news domains. However, studies on named entity recognition for the cultural tourism domain remain very rare, and the topic has received little focused attention.
Because languages differ, there have also been many studies of NER methods for specific languages, such as English, Arabic, and Indian languages, with most researchers focusing mainly on English NER. Chinese, however, is an important international language with its own characteristics compared with English; research on Chinese NER lags far behind English NER, and much of the existing Chinese NER research is not conducted according to the characteristics of Chinese.
Disclosure of Invention
Aiming at the above defects in the prior art, the named entity recognition method for Chinese cultural tourism text provided by the invention solves the problem that current named entity recognition research pays little attention to Chinese cultural tourism text.
In order to achieve the aim of the invention, the invention adopts the following technical scheme: a named entity recognition method for Chinese cultural tourism text comprises the following steps:
S1, acquiring Chinese cultural tourism text data, and inputting the data into a character embedding layer to obtain a character vector representation;
S2, inputting the character vector representation into a bidirectional long short-term memory (BiLSTM) network layer to obtain a context representation;
S3, inputting the context representation into a CNN layer to obtain a multi-scale local context feature fusion representation;
S4, inputting the multi-scale local context feature fusion representation into a CRF layer, and performing sequence labeling through the CRF layer to complete named entity recognition for Chinese cultural tourism text.
Further: in step S1, the character embedding layer comprises a ChineseBERT module and a first CNN module in parallel;
the step S1 comprises the following sub-steps:
S11, acquiring Chinese cultural tourism text data;
S12, inputting the Chinese cultural tourism text data into the ChineseBERT module to obtain a character embedding vector representation of each character in the text data;
S13, inputting the Chinese cultural tourism text data into the first CNN module to obtain a radical-level embedded representation;
S14, splicing the character embedding vector representation and the radical-level embedded representation to obtain the character vector representation.
Further: the step S12 is specifically as follows:
inputting the Chinese cultural tourism text data into the ChineseBERT module, which encodes the input text to obtain feature vectors and generates the character embedding vector representation of each character from those feature vectors;
wherein each feature vector combines a token embedding, a position embedding, and a segment embedding.
Further: in S13, the radical-level embedded representation M_2 is obtained by the expression:
M_2 = A_1(b_1 + C_1(x))
where x is the radical-level feature of the Chinese characters, C_1(·) is the first CNN module, A_1 is the first activation function, and b_1 is the bias of the first CNN module.
Further: in S14, the character vector representation Z_concat is obtained by the expression:
Z_concat = M_1 + M_2
where M_1 is the character embedding vector representation and "+" denotes the splicing (concatenation) operation.
The beneficial effects of the above further scheme are: the character vector representation obtained by splicing the character embedding vector representation with the radical-level embedded representation carries more semantic features, so the model can better identify Chinese meanings in the text.
Further: in step S2, the bidirectional long short-term memory network layer comprises first to twelfth LSTM units, wherein the first to sixth LSTM units process the input character vector representation in the forward direction, and the seventh to twelfth LSTM units process it in the reverse direction;
the method for obtaining the context representation comprises the following steps:
splicing the output results of the first to twelfth LSTM units to obtain the context representation. Further:
further: in S2, the expression for obtaining the context representation H is specifically:
H={h 1 ,...,h ti ,...,h D }
in the formula, h ti Splicing the output results of the first to twelfth LSTM units, wherein ti is the sequence number of the splice, and ti=1, …, D and D are the dimensions represented by the character vectors;
the first to twelfth LSTM units each comprise an input gate i t Output door o t And forget door f t The expression is specifically as follows:
i t =σ(W xi x t +W hi h t-1 +W ci c t-1 +b i )
f t =σ(W xf x t +W hf h t-1 +W cf c t-1 +b f )
c t =f t ⊙c t-1 +i t ⊙tanh(W xc x t +W hc h t-1 +b c )
o t =σ(W xo x t +W ho h t-1 +W co c t +b o )
h t =o t ⊙tanh(c t )
wherein sigma (& gt) is a sigmoid function per element, tan h (& gt) is a hyperbolic tangent function, and wt. is a multiplication function per element xi 、W hi 、W ci 、W xf 、W hf 、W cf 、W xc 、W hc 、W xo 、W ho And W is co All are weight parameters, b i 、b f 、b c And b o Are all bias parameters, c t Is a memory cell, h t To output the result.
Further: in step S3, the CNN layer is provided with a second CNN module to obtain the multi-scale local context feature fusion representation M_3 by the expression:
M_3 = A_2(b_2 + C_2(H))
where H is the context representation, C_2(·) is the second CNN module, A_2 is the second activation function, and b_2 is the bias of the second CNN module.
The beneficial effects of the above further scheme are: inputting the context representation into the second CNN module enhances the correlation between semantics and generates the multi-scale local context feature fusion representation.
The beneficial effects of the invention are as follows: the named entity recognition method for Chinese cultural tourism text addresses the lack of attention to this domain in named entity recognition research and constructs a network specifically for Chinese cultural tourism text data. The first CNN module in the character embedding layer learns a Chinese radical-level embedded representation, yielding a character vector representation that benefits recognition in Chinese; the second CNN module in the CNN layer learns a multi-scale local context feature fusion representation, strengthening the correlation between semantics and further improving the feature representation.
Drawings
FIG. 1 is a flow chart of the named entity recognition method for Chinese cultural tourism text according to the present invention.
Fig. 2 is a schematic diagram of the overall network structure of the present invention.
Fig. 3 is a schematic structural diagram of the ChineseBERT module of the present invention.
Fig. 4 is a schematic structural diagram of a first CNN module according to the present invention.
Fig. 5 is a schematic structural diagram of a second CNN module according to the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; to those skilled in the art, all inventions that make use of the inventive concept fall within the protection scope defined by the appended claims.
As shown in fig. 1, in one embodiment of the present invention, a named entity recognition method for Chinese cultural tourism text includes the following steps:
S1, acquiring Chinese cultural tourism text data, and inputting the data into a character embedding layer to obtain a character vector representation;
S2, inputting the character vector representation into a bidirectional long short-term memory (BiLSTM) network layer to obtain a context representation;
S3, inputting the context representation into a CNN layer to obtain a multi-scale local context feature fusion representation;
S4, inputting the multi-scale local context feature fusion representation into a CRF layer, and performing sequence labeling through the CRF layer to complete named entity recognition for Chinese cultural tourism text.
In this embodiment, the invention provides a named entity recognition method for Chinese cultural tourism text that, targeting the characteristics of Chinese characters in this application domain, fuses radical-level features with multi-scale local context features; the specific structure of the network is shown in fig. 2.
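Under illustrative assumptions, the four-layer data flow of fig. 2 can be sketched as a chain of shape transformations. All dimensions, the random weights, and the linear/argmax stand-ins for the BiLSTM and CRF layers below are hypothetical conveniences of this sketch, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_char, d_ctx, n_tags = 5, 8, 6, 4       # toy sizes: 5 characters, 4 tag types

# S1: character embedding layer -> character vector representation, shape (T, d_char)
chars = rng.normal(size=(T, d_char))

# S2: BiLSTM layer (stand-in: forward and backward projections, spliced) -> (T, 2*d_ctx)
Wf = rng.normal(size=(d_char, d_ctx))
Wb = rng.normal(size=(d_char, d_ctx))
H = np.concatenate([np.tanh(chars @ Wf), np.tanh(chars @ Wb)], axis=-1)

# S3: CNN layer (stand-in: width-1 convolution plus ReLU) -> fused representation
Wc = rng.normal(size=(2 * d_ctx, d_ctx))
M3 = np.maximum(0.0, H @ Wc)

# S4: CRF layer (stand-in: per-character argmax over tag scores) -> label sequence
We = rng.normal(size=(d_ctx, n_tags))
tags = (M3 @ We).argmax(axis=-1)
```

Each stage only changes the per-character feature dimension; the sequence length T is preserved end to end, which is what allows the final CRF stage to emit one label per character.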
In step S1, the character embedding layer comprises a ChineseBERT module and a first CNN module in parallel;
the step S1 comprises the following sub-steps:
S11, acquiring Chinese cultural tourism text data;
S12, inputting the Chinese cultural tourism text data into the ChineseBERT module to obtain a character embedding vector representation of each character in the text data;
S13, inputting the Chinese cultural tourism text data into the first CNN module to obtain a radical-level embedded representation;
S14, splicing the character embedding vector representation and the radical-level embedded representation to obtain the character vector representation.
In this embodiment, the structure of the ChineseBERT module is shown in fig. 3; it is a model pre-trained on a Chinese corpus and is specifically suited to processing Chinese text data.
The step S12 is specifically as follows:
inputting the Chinese cultural tourism text data into the ChineseBERT module, which encodes the input text to obtain feature vectors and generates the character embedding vector representation of each character from those feature vectors;
wherein each feature vector combines a token embedding, a position embedding, and a segment embedding.
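As a toy illustration of this encoding step, the feature vector at each position can be formed by summing a token embedding, a position embedding, and a segment embedding, as in BERT-style encoders. The table sizes and random values below are assumptions of this sketch, not the patent's:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, max_len, n_seg, d = 100, 16, 2, 8   # toy sizes

tok_table = rng.normal(size=(vocab, d))    # token embedding table
pos_table = rng.normal(size=(max_len, d))  # position embedding table
seg_table = rng.normal(size=(n_seg, d))    # segment embedding table

def embed(token_ids, segment_ids):
    # Feature vector per position: token + position + segment embedding.
    pos_ids = np.arange(len(token_ids))
    return tok_table[token_ids] + pos_table[pos_ids] + seg_table[segment_ids]

x = embed(np.array([5, 7, 9]), np.array([0, 0, 0]))   # 3 characters -> (3, 8)
```

The sum keeps the feature dimension fixed at d, so downstream layers see one d-dimensional vector per character regardless of how many embedding types are combined.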
In S13, the radical-level embedded representation M_2 is obtained by the expression:
M_2 = A_1(b_1 + C_1(x))
where x is the radical-level feature of the Chinese characters, C_1(·) is the first CNN module, A_1 is the first activation function, and b_1 is the bias of the first CNN module.
In this embodiment, the first CNN module uses a CNN to compute the radical-level representation (Radical Representation) of the input Chinese text data; its structure is shown in fig. 4.
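A minimal sketch of the expression M_2 = A_1(b_1 + C_1(x)): a valid 1-D convolution over per-character radical features, followed by a bias and an activation. ReLU is assumed for A_1, and all shapes are illustrative rather than the patent's:

```python
import numpy as np

def conv1d(x, w):
    # Valid 1-D convolution C1: x is (T, d_in), w is (k, d_in, d_out).
    k = w.shape[0]
    return np.stack([np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0] - k + 1)])

def radical_embedding(x, w, b):
    # M2 = A1(b1 + C1(x)), with ReLU assumed for the activation A1.
    return np.maximum(0.0, b + conv1d(x, w))

x = np.ones((4, 3))            # 4 characters, 3 radical-level features each (toy)
w = np.ones((2, 3, 5)) / 6.0   # kernel width 2, 5 output channels (toy)
M2 = radical_embedding(x, w, np.zeros(5))
```

With uniform inputs and kernel weights that sum to one per channel, every output activation equals 1.0, which makes the valid-convolution output length T - k + 1 = 3 easy to verify.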
In S14, the character vector representation Z_concat is obtained by the expression:
Z_concat = M_1 + M_2
where M_1 is the character embedding vector representation and "+" denotes the splicing (concatenation) operation.
The character vector representation obtained by splicing the character embedding vector representation with the radical-level embedded representation carries more semantic features, so the model can better identify Chinese meanings in the text.
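The splicing itself is just per-character concatenation; in the toy example below, M1 and M2 stand in for the character embedding and the radical-level embedding, and their dimensions are illustrative:

```python
import numpy as np

# M1: character embedding vectors (4 characters, 8-dim, toy values)
# M2: radical-level embeddings (4 characters, 5-dim, toy values)
M1 = np.ones((4, 8))
M2 = np.zeros((4, 5))
Z_concat = np.concatenate([M1, M2], axis=-1)   # one 13-dim vector per character
```

Concatenation (rather than element-wise addition) preserves both sources of information side by side, so no dimension of either embedding is lost before the BiLSTM layer.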
In step S2, the bidirectional long short-term memory network layer comprises first to twelfth LSTM units, wherein the first to sixth LSTM units process the input character vector representation in the forward direction, and the seventh to twelfth LSTM units process it in the reverse direction;
the method for obtaining the context representation comprises the following steps:
and splicing according to the output results of the first to twelfth LSTM units to obtain the context representation.
In this embodiment, the context representation obtained by the bidirectional long short-term memory network layer enriches the semantic representation from both the forward and the reverse direction, so the semantics within a passage can be better identified.
In S2, the context representation H is obtained by the expression:
H = {h_1, ..., h_ti, ..., h_D}
where h_ti is the splice of the output results of the first to twelfth LSTM units, ti is the index of the splice with ti = 1, ..., D, and D is the dimension of the character vector representation;
the first to twelfth LSTM units each comprise an input gate i_t, an output gate o_t, and a forget gate f_t, with the expressions:
i_t = σ(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t-1} + b_c)
o_t = σ(W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o)
h_t = o_t ⊙ tanh(c_t)
where σ(·) is the element-wise sigmoid function, tanh(·) is the hyperbolic tangent function, ⊙ denotes element-wise multiplication, W_xi, W_hi, W_ci, W_xf, W_hf, W_cf, W_xc, W_hc, W_xo, W_ho, and W_co are weight parameters, b_i, b_f, b_c, and b_o are bias parameters, c_t is the memory cell, and h_t is the output result.
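The gate equations above can be checked numerically with a single step of this peephole-style LSTM cell (the W_ci, W_cf, and W_co terms let the gates see the memory cell). The dictionary-of-weights layout and the all-zero parameter demo are conveniences of this sketch, not the patent's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    # Input, forget, and output gates, matching the equations in the text.
    i = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] @ c_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] @ c_prev + p["bf"])
    c = f * c_prev + i * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    o = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] @ c + p["bo"])
    h = o * np.tanh(c)
    return h, c

d = 3
zeros = np.zeros((d, d))
p = {k: zeros for k in
     ["Wxi", "Whi", "Wci", "Wxf", "Whf", "Wcf", "Wxc", "Whc", "Wxo", "Who", "Wco"]}
p.update({k: np.zeros(d) for k in ["bi", "bf", "bc", "bo"]})
h, c = lstm_step(np.ones(d), np.zeros(d), np.zeros(d), p)
```

With all parameters zero, every gate evaluates to sigmoid(0) = 0.5 and the candidate tanh(0) = 0, so the new memory cell and output are exactly zero, a quick sanity check on the recurrence.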
In step S3, the CNN layer is provided with a second CNN module to obtain the multi-scale local context feature fusion representation M_3 by the expression:
M_3 = A_2(b_2 + C_2(H))
where H is the context representation, C_2(·) is the second CNN module, A_2 is the second activation function, and b_2 is the bias of the second CNN module.
In this embodiment, the structure of the second CNN module is shown in fig. 5; inputting the context representation into the second CNN module enhances the correlation between semantics and generates the multi-scale local context feature fusion representation.
The multi-scale local context feature fusion representation is then input into the CRF layer, which completes the sequence labeling task and thereby the named entity recognition for Chinese cultural tourism text.
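The CRF layer's decoding step can be illustrated with Viterbi search over per-character tag scores plus tag-transition scores; the scoring matrices below are toy values, not learned CRF parameters:

```python
import numpy as np

def viterbi(emissions, transitions):
    # Best tag sequence for per-character scores (T, K) and transition scores (K, K).
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score ending in tag j at step t, coming from tag i.
        cand = score[:, None] + transitions + emissions[t]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for t in range(T - 1, 0, -1):      # follow back-pointers to recover the path
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]

emissions = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 0.0]])
path = viterbi(emissions, np.zeros((2, 2)))
```

With zero transition scores, the decoder reduces to a per-character argmax; non-zero transitions are what let a trained CRF forbid invalid label sequences (for example an inside tag without a preceding begin tag in BIO-style schemes).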
The beneficial effects of the invention are as follows: the named entity recognition method for Chinese cultural tourism text addresses the lack of attention to this domain in named entity recognition research and constructs a network specifically for Chinese cultural tourism text data. The first CNN module in the character embedding layer learns a Chinese radical-level embedded representation, yielding a character vector representation that benefits recognition in Chinese; the second CNN module in the CNN layer learns a multi-scale local context feature fusion representation, strengthening the correlation between semantics and further improving the feature representation.
In the description of the present invention, it should be understood that the terms "center," "thickness," "upper," "lower," "horizontal," "top," "bottom," "inner," "outer," "radial," and the like indicate or are based on the orientation or positional relationship shown in the drawings, merely to facilitate description of the present invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be configured and operated in a particular orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be interpreted as indicating or implying a relative importance or number of technical features indicated. Thus, a feature defined as "first," "second," "third," or the like, may explicitly or implicitly include one or more such feature.

Claims (8)

1. A named entity recognition method for Chinese cultural tourism text, characterized by comprising the following steps:
S1, acquiring Chinese cultural tourism text data, and inputting the data into a character embedding layer to obtain a character vector representation;
S2, inputting the character vector representation into a bidirectional long short-term memory (BiLSTM) network layer to obtain a context representation;
S3, inputting the context representation into a CNN layer to obtain a multi-scale local context feature fusion representation;
S4, inputting the multi-scale local context feature fusion representation into a CRF layer, and performing sequence labeling through the CRF layer to complete named entity recognition for Chinese cultural tourism text.
2. The named entity recognition method for Chinese cultural tourism text according to claim 1, wherein in S1, the character embedding layer comprises a ChineseBERT module and a first CNN module in parallel;
the step S1 comprises the following sub-steps:
s11, acquiring Chinese text travel text data;
s12, inputting the Chinese text data into a Chinese text module to obtain word embedded vector representation of each word in the Chinese text data;
s13, inputting Chinese text travel text data to a first CNN module to obtain a part first-level embedded representation;
and S14, splicing the character embedding vector representation and the radical level embedding representation to obtain a character vector representation.
3. The named entity recognition method for Chinese cultural tourism text according to claim 2, wherein S12 is specifically:
inputting the Chinese cultural tourism text data into the ChineseBERT module, which encodes the input text to obtain feature vectors and generates the character embedding vector representation of each character from those feature vectors;
wherein each feature vector combines a token embedding, a position embedding, and a segment embedding.
4. The named entity recognition method for Chinese cultural tourism text according to claim 2, wherein in S13, the radical-level embedded representation M_2 is obtained by the expression:
M_2 = A_1(b_1 + C_1(x))
where x is the radical-level feature of the Chinese characters, C_1(·) is the first CNN module, A_1 is the first activation function, and b_1 is the bias of the first CNN module.
5. The named entity recognition method for Chinese cultural tourism text according to claim 4, wherein in S14, the character vector representation Z_concat is obtained by the expression:
Z_concat = M_1 + M_2
where M_1 is the character embedding vector representation and "+" denotes the splicing (concatenation) operation.
6. The named entity recognition method for Chinese cultural tourism text according to claim 1, wherein in S2, the bidirectional long short-term memory network layer comprises first to twelfth LSTM units, the first to sixth LSTM units processing the input character vector representation in the forward direction and the seventh to twelfth LSTM units processing it in the reverse direction;
the method for obtaining the context representation comprises the following steps:
splicing the output results of the first to twelfth LSTM units to obtain the context representation.
7. The named entity recognition method for Chinese cultural tourism text according to claim 6, wherein in S2, the context representation H is obtained by the expression:
H = {h_1, ..., h_ti, ..., h_D}
where h_ti is the splice of the output results of the first to twelfth LSTM units, ti is the index of the splice with ti = 1, ..., D, and D is the dimension of the character vector representation;
the first to twelfth LSTM units each comprise an input gate i_t, an output gate o_t, and a forget gate f_t, with the expressions:
i_t = σ(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)
f_t = σ(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t-1} + b_c)
o_t = σ(W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o)
h_t = o_t ⊙ tanh(c_t)
where σ(·) is the element-wise sigmoid function, tanh(·) is the hyperbolic tangent function, ⊙ denotes element-wise multiplication, W_xi, W_hi, W_ci, W_xf, W_hf, W_cf, W_xc, W_hc, W_xo, W_ho, and W_co are weight parameters, b_i, b_f, b_c, and b_o are bias parameters, c_t is the memory cell, and h_t is the output result.
8. The named entity recognition method for Chinese cultural tourism text according to claim 1, wherein in S3, the CNN layer is provided with a second CNN module to obtain the multi-scale local context feature fusion representation M_3 by the expression:
M_3 = A_2(b_2 + C_2(H))
where H is the context representation, C_2(·) is the second CNN module, A_2 is the second activation function, and b_2 is the bias of the second CNN module.
CN202310560194.XA 2023-05-17 2023-05-17 Named entity recognition method for Chinese cultural tourism text Active CN116579343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310560194.XA CN116579343B (en) 2023-05-17 2023-05-17 Named entity recognition method for Chinese cultural tourism text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310560194.XA CN116579343B (en) 2023-05-17 2023-05-17 Named entity recognition method for Chinese cultural tourism text

Publications (2)

Publication Number Publication Date
CN116579343A 2023-08-11
CN116579343B 2024-06-04

Family

ID=87543867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310560194.XA Active CN116579343B (en) 2023-05-17 2023-05-17 Named entity recognition method for Chinese cultural tourism text

Country Status (1)

Country Link
CN (1) CN116579343B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021114745A1 (en) * 2019-12-13 2021-06-17 华南理工大学 Named entity recognition method employing affix perception for use in social media
US20210216862A1 (en) * 2020-01-15 2021-07-15 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for semantic analysis of multimedia data using attention-based fusion network
CN113408289A (en) * 2021-06-29 2021-09-17 广东工业大学 Multi-feature fusion supply chain management entity knowledge extraction method and system
CN114118099A (en) * 2021-11-10 2022-03-01 浙江工业大学 Chinese automatic question-answering method based on radical characteristics and multi-layer attention mechanism
CN114781380A (en) * 2022-03-21 2022-07-22 哈尔滨工程大学 Chinese named entity recognition method, equipment and medium fusing multi-granularity information
CN115455955A (en) * 2022-10-18 2022-12-09 昆明理工大学 Chinese named entity recognition method based on local and global character representation enhancement
CN115600597A (en) * 2022-10-18 2023-01-13 淮阴工学院(Cn) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN115688782A (en) * 2022-10-26 2023-02-03 成都理工大学 Named entity recognition method based on global pointer and countermeasure training


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHANG Yan et al.: "Negative sample recommendation method combining path semantics and feature extraction", 小型微型计算机***

Also Published As

Publication number Publication date
CN116579343B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
CN110765775B (en) Self-adaptive method for named entity recognition field fusing semantics and label differences
Yousfi et al. Contribution of recurrent connectionist language models in improving LSTM-based Arabic text recognition in videos
Nguyen et al. Recurrent neural network-based models for recognizing requisite and effectuation parts in legal texts
CN111160031A (en) Social media named entity identification method based on affix perception
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN112541356B (en) Method and system for recognizing biomedical named entities
Alsaaran et al. Classical Arabic named entity recognition using variant deep neural network architectures and BERT
Mehmood et al. A precisely xtreme-multi channel hybrid approach for roman urdu sentiment analysis
CN109960728A (en) A kind of open field conferencing information name entity recognition method and system
CN111950283B (en) Chinese word segmentation and named entity recognition system for large-scale medical text mining
Alsaaran et al. Arabic named entity recognition: A BERT-BGRU approach
CN114298035A (en) Text recognition desensitization method and system thereof
Chang et al. Application of Word Embeddings in Biomedical Named Entity Recognition Tasks.
Sait et al. Deep Learning with Natural Language Processing Enabled Sentimental Analysis on Sarcasm Classification.
Guo et al. Implicit discourse relation recognition via a BiLSTM-CNN architecture with dynamic chunk-based max pooling
CN117010387A (en) Roberta-BiLSTM-CRF voice dialogue text naming entity recognition system integrating attention mechanism
Han et al. MAF‐CNER: A Chinese Named Entity Recognition Model Based on Multifeature Adaptive Fusion
Bach et al. Question analysis towards a Vietnamese question answering system in the education domain
Han et al. An attention-based neural framework for uncertainty identification on social media texts
Shu et al. Investigating lstm with k-max pooling for text classification
Seeha et al. ThaiLMCut: Unsupervised pretraining for Thai word segmentation
Zhang et al. Sjtu-nlp at semeval-2018 task 9: Neural hypernym discovery with term embeddings
He et al. A Chinese named entity recognition model of maintenance records for power primary equipment based on progressive multitype feature fusion
CN112507717A (en) Medical field entity classification method fusing entity keyword features
CN117131932A (en) Semi-automatic construction method and system for domain knowledge graph ontology based on topic model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant