CN113051913A

CN113051913A - Tibetan word segmentation information processing method, system, storage medium, terminal and application

Info

Publication number: CN113051913A
Application number: CN202110380044.1A
Authority: CN
Inventors: 刘清民; 程国艮
Original assignee: Global Tone Communication Technology Co ltd
Current assignee: Global Tone Communication Technology Co ltd
Priority date: 2021-04-09
Filing date: 2021-04-09
Publication date: 2021-06-29

Abstract

The invention belongs to the technical field of information processing and discloses a Tibetan word segmentation information processing method, system, storage medium, terminal and application. The Tibetan word segmentation information processing system comprises: a word vector preprocessing module; a model structure building module; a word vector training module; and a word vector training stop judging module. For Tibetan, the method adopts an artificial neural network and deep learning solution: word boundaries are predicted by learning Tibetan word vectors and applying a convolutional neural network (CNN) model together with a conditional random field (CRF); the network is trained iteratively by matching the character sequence of each sentence to the sequence of manually labeled word boundaries, and the resulting weights are the final parameters.

Description

Tibetan word segmentation information processing method, system, storage medium, terminal and application

Technical Field

The invention belongs to the technical field of information processing, and particularly relates to a Tibetan word segmentation information processing method, system, storage medium, terminal and application.

Background

Tibetan language

Tibetan is the language used by the Tibetan people. It belongs to the Tibetic branch of the Tibeto-Burman group of the Sino-Tibetan language family, and is spoken mainly by Tibetans in China and by part of the population of Nepal, Bhutan, India and Pakistan. The Tibetan script is an abugida, a phonographic script in which vowels are marked as attachments to consonants, and there are two accounts of its origin. Scholars generally hold that it was created in the 7th century AD, during the Tubo period, when King Songtsen Gampo sent the Tibetan linguist Thonmi Sambhota to northern India to study Sanskrit; after returning, he introduced a script modeled on Sanskrit letters. The Yungdrung Bon tradition instead teaches that the Tibetan script evolved from the Zhangzhung script.

English belongs to the West Germanic branch of the Germanic group of the Indo-European language family. It evolved from the languages spoken by the Germanic tribes (the Angles, Saxons and Jutes) who migrated from continental Europe to the island of Great Britain in ancient times, and it spread around the world through British colonial activity.

Tibetan differs from English in that Tibetan words are usually written together without word boundary markers, whereas English words are written separately, with boundary markers between them. For Tibetan, word segmentation is therefore one of the first tasks in building natural language processing applications such as topic classification, sentiment analysis, document similarity and machine translation.

Text without word boundary markers is difficult for a computer to process. The prior art addresses this with artificial neural networks and deep learning. The convolutional neural network (CNN) is a special kind of neural network and is currently one of the most successful models in NLP. Word boundaries are predicted by learning Tibetan word vectors (word2vec) and applying a CNN model together with a conditional random field (CRF); the network is trained iteratively by matching the character sequence of each sentence to the sequence of manually labeled word boundaries, and the resulting weights are the final parameters. Because few open corpora are available and corpora are expensive to produce, experiments have so far been run only under a limited set of parameters, with other parameters left for later experiments. The prior art therefore has two shortcomings: (1) the amount of training corpus needs to be increased; (2) the parameter selection leaves room for optimization.

Through the above analysis, the problems and defects of the prior art are as follows:

(1) In the prior art, text without word boundary markers is processed with artificial neural networks and deep learning, but few open corpora exist and corpora are expensive to produce.

(2) In the prior art, experiments on such text have been run only under a limited set of parameters, so the parameter selection leaves room for optimization, and other parameters remain to be tested.

The difficulty in solving the above problems and defects is that manually labeling a segmentation corpus is too costly, and that selecting parameters requires many experiments to determine a better configuration.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a Tibetan word segmentation information processing method, a Tibetan word segmentation information processing system, a storage medium, a terminal and application.

The Tibetan word segmentation information processing method learns from a segmentation corpus by means of word vectors, a convolutional neural network and a conditional random field, generates Tibetan word boundary rules, and thereby segments Tibetan text. First, representations of Tibetan words are learned with word2vec. Then, using the existing segmentation corpus and the learned word vectors, a convolutional neural network and a conditional random field learn the probability that a word boundary falls at each position in the Tibetan text, and the text is segmented at positions where that probability is high.

Further, the Tibetan word segmentation information processing method predicts word boundaries by learning Tibetan word vectors with word2vec and applying a Convolutional Neural Network (CNN) model and a Conditional Random Field (CRF).
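As an illustration of what learning word vectors from text involves, the sketch below generates the (center, context) training pairs used by the skip-gram variant of word2vec. The patent does not state which word2vec variant or window size is used, so the skip-gram choice, the function name `skipgram_pairs` and the `window` parameter are assumptions made for this example only.

```python
def skipgram_pairs(units, window=2):
    """Return (center, context) pairs for one tokenized sentence.

    `units` is a list of tokens; `window` is a hypothetical context radius.
    """
    pairs = []
    for i, center in enumerate(units):
        lo = max(0, i - window)
        hi = min(len(units), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                pairs.append((center, units[j]))
    return pairs


if __name__ == "__main__":
    print(skipgram_pairs(["a", "b", "c"], window=1))
```

A word2vec model is then trained so that each center token predicts its context tokens; the learned input weights serve as the word vectors.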

Further, the Tibetan word segmentation information processing method trains the network iteratively by matching the character sequence of each sentence to the sequence of manually labeled word boundaries, and the resulting weights are the final parameters.
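The pairing of an input sequence with manually labeled word boundaries can be sketched as follows. This is a minimal example under stated assumptions, not the patent's own data format: it assumes the labeled corpus separates words with spaces, that the input units are syllables delimited by the tsheg mark (U+0F0B), and that boundaries are encoded with the common B/M/E/S tagging scheme. The romanized syllables in the demo are toy placeholders.

```python
TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)


def label_sentence(segmented):
    """Convert a space-separated segmented sentence into (units, tags).

    Tags: B = first unit of a word, M = middle, E = last, S = single-unit word.
    """
    units, tags = [], []
    for word in segmented.split():
        parts = [p for p in word.split(TSHEG) if p]  # drop empty trailing piece
        if not parts:
            continue
        if len(parts) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(parts) - 2) + ["E"])
        units.extend(parts)
    return units, tags


if __name__ == "__main__":
    # Toy input: a two-syllable word followed by a one-syllable word.
    print(label_sentence("bod" + TSHEG + "skad" + TSHEG + " yin"))
```

During training, the unit sequence is the model input and the tag sequence is the target that the network learns to reproduce.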

Further, the Tibetan word segmentation information processing method specifically comprises the following steps:

first, preprocessing the labeled segmentation corpus and learning Tibetan word vectors with word2vec, which yields the deep learning representation of each word and a dictionary of all segmented words, to which a dedicated placeholder for unknown words is added;

second, building a CNN model and computing the loss with a CRF;

third, training the built model on the labeled Tibetan text and the trained word vectors;

and fourth, stopping training once a given accuracy is reached on the development set, thereby obtaining the word segmentation rules.
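The dictionary with a dedicated unknown-word placeholder described in the first step can be sketched as below. The placeholder name `<UNK>`, the helper names and the `min_count` parameter are hypothetical and chosen only for illustration; the patent does not specify them.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical name for the unknown-word placeholder


def build_vocab(sentences, min_count=1):
    """Map each token to an integer id, reserving id 0 for unknown tokens."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {UNK: 0}
    # Most frequent tokens first; ties broken alphabetically for determinism.
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace tokens by ids; tokens missing from the dictionary map to UNK."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


if __name__ == "__main__":
    vocab = build_vocab([["a", "b"], ["a", "c"]])
    print(vocab, encode(["a", "z", "c"], vocab))
```

Reserving one id for unseen tokens lets the trained model accept words at segmentation time that never appeared in the labeled corpus.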

Further, the model structure is composed of a convolutional neural network plus a conditional random field.
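To make the role of the conditional random field concrete, the following sketch computes a linear-chain CRF loss (negative log-likelihood) from per-position tag scores, as a CNN might emit them, and a tag-transition matrix. It is a standard textbook formulation written in plain Python; the patent does not give its exact loss, and the names `emissions`, `transitions` and `crf_nll` are assumptions.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood of the gold tag path under a linear-chain CRF.

    emissions[t][k]   : score of tag k at position t (e.g. CNN output)
    transitions[i][j] : score of moving from tag i to tag j
    tags              : gold tag indices, one per position
    """
    n_tags = len(transitions)
    # Score of the gold path.
    gold = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        gold += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    # Log partition function via the forward algorithm.
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            + emissions[t][j]
            for j in range(n_tags)
        ]
    return logsumexp(alpha) - gold


if __name__ == "__main__":
    # With all scores zero, every one of the 2**3 paths is equally likely.
    print(crf_nll([[0.0, 0.0]] * 3, [[0.0, 0.0], [0.0, 0.0]], [0, 1, 0]))
```

Minimizing this quantity over the training corpus raises the probability of the manually labeled boundary sequence relative to all other tag sequences.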

It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps: learning from the segmentation corpus through word vectors, a convolutional neural network and a conditional random field to generate Tibetan word boundary rules, and finally segmenting the Tibetan text.

It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps: learning from the segmentation corpus through word vectors, a convolutional neural network and a conditional random field to generate Tibetan word boundary rules, and finally segmenting the Tibetan text.

The invention also aims to provide an information data processing terminal, which is used for realizing the Tibetan word segmentation information processing method.

Another object of the present invention is to provide a Tibetan segmentation information processing system for implementing the Tibetan segmentation information processing method, the Tibetan segmentation information processing system comprising:

the word vector preprocessing module, used for learning Tibetan word vectors with word2vec from the segmented Tibetan text, and for storing the word vectors and the dictionary;

the model structure building module, used for building the model structure, which consists of a convolutional neural network and a conditional random field;

the word vector training module, used for training the model on the labeled Tibetan text and the trained word vectors;

and the word vector training stop judging module, used for stopping training once the development set reaches a given accuracy.

The invention also aims to provide a computer information processing terminal which is used for realizing the Tibetan word segmentation information processing method.

Combining all of the above technical solutions, the advantages and positive effects of the invention are as follows: the method learns from the segmentation corpus through word vectors, a convolutional neural network and a conditional random field to generate Tibetan word boundary rules, and thereby segments Tibetan text. The Tibetan word segmentation tool reaches an accuracy of 90 on the test set and can help achieve better translation quality in machine translation.

For Tibetan, the method adopts an artificial neural network and deep learning solution: word boundaries are predicted by learning Tibetan word vectors (word2vec) and applying a Convolutional Neural Network (CNN) model together with a Conditional Random Field (CRF); the network is trained iteratively by matching the character sequence of each sentence to the sequence of manually labeled word boundaries, and the resulting weights are the final parameters.

Drawings

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a Tibetan word segmentation information processing method according to an embodiment of the present invention.

Fig. 2 is a schematic structural diagram of a Tibetan word segmentation information processing system according to an embodiment of the present invention.

In Fig. 2: 1. word vector preprocessing module; 2. model structure building module; 3. word vector training module; 4. word vector training stop judging module.

Fig. 3 is a flowchart of an implementation of the method for processing Tibetan word segmentation information according to the embodiment of the present invention.

Fig. 4 is a graph of the results of the effects provided by the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Aiming at the problems in the prior art, the invention provides a Tibetan word segmentation information processing method, a Tibetan word segmentation information processing system, a storage medium, a terminal and application, and the invention is described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the method for processing the Tibetan word segmentation information provided by the present invention comprises the following steps:

S101: learning Tibetan word vectors with word2vec from the segmented Tibetan text, and storing the word vectors and the dictionary;

S102: building the model structure, which consists of a Convolutional Neural Network (CNN) and a Conditional Random Field (CRF);

S103: training the model on the labeled Tibetan text and the trained word vectors;

S104: stopping training once the development set reaches a given accuracy.
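The stopping rule of S104 can be sketched as a training loop that checks development-set accuracy after every epoch. The patent only speaks of "a certain accuracy", so the `target` value, the `max_epochs` cap and the callable names below are assumptions made for this example; the stand-in callables in the demo replace a real model.

```python
def train_until(train_one_epoch, dev_accuracy, target=0.90, max_epochs=50):
    """Train epoch by epoch and stop once dev accuracy reaches `target`.

    train_one_epoch : callable that runs one pass over the training data
    dev_accuracy    : callable that returns accuracy on the development set
    Returns the list of dev accuracies, one per completed epoch.
    """
    history = []
    for _ in range(max_epochs):
        train_one_epoch()
        acc = dev_accuracy()
        history.append(acc)
        if acc >= target:
            break  # development set reached the required accuracy
    return history


if __name__ == "__main__":
    # Stand-in callables: a fake accuracy curve instead of a real model.
    fake_curve = iter([0.60, 0.80, 0.91, 0.95])
    print(train_until(lambda: None, lambda: next(fake_curve)))
```

The `max_epochs` cap is a safeguard added here so the loop still terminates if the target accuracy is never reached.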

The Tibetan word segmentation information processing method provided by the invention specifically comprises the following steps:

first, preprocessing the labeled segmentation corpus and learning Tibetan word vectors with word2vec, that is, the deep learning representation of each word, and adding to the dictionary of all segmented words a dedicated unknown-word placeholder (used to represent words that have not been encountered before);

second, building a CNN model and computing the loss with a CRF;
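How a convolutional layer can turn a sequence of word vectors into per-position tag scores for the CRF is sketched below in plain Python. The patent does not disclose the number of layers, the kernel width or the activation function, so this single zero-padded 1D convolution without a nonlinearity, and all of its names, are simplifying assumptions.

```python
def conv1d_emissions(embeddings, kernels, biases):
    """Apply one 'same'-padded 1D convolution over a sequence of vectors.

    embeddings : T vectors of dimension D (one per input unit)
    kernels    : K filters, each a list of `width` rows of D weights
                 (odd width assumed); one filter per output tag
    biases     : K bias terms
    Returns a T x K table of scores, usable as CRF emission scores.
    """
    width = len(kernels[0])
    pad = width // 2
    zero = [0.0] * len(embeddings[0])
    padded = [zero] * pad + list(embeddings) + [zero] * pad
    scores = []
    for t in range(len(embeddings)):
        window = padded[t:t + width]  # the unit at t and its neighbours
        row = []
        for kernel, bias in zip(kernels, biases):
            s = bias
            for k_row, vec in zip(kernel, window):
                s += sum(w * x for w, x in zip(k_row, vec))
            row.append(s)
        scores.append(row)
    return scores


if __name__ == "__main__":
    emb = [[1.0], [2.0], [3.0]]      # three units, 1-dimensional vectors
    kern = [[[1.0], [1.0], [1.0]]]   # one filter of width 3
    print(conv1d_emissions(emb, kern, [0.0]))
```

Each output score depends on a unit and its immediate neighbours, which is what lets the network use local context when deciding whether a word boundary falls at that position.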

Those skilled in the art may also implement the Tibetan word segmentation information processing method provided by the present invention using other steps; the method shown in Fig. 1 is only one specific embodiment.

As shown in fig. 2, the Tibetan word segmentation information processing system provided by the present invention includes:

the word vector preprocessing module 1, used for learning Tibetan word vectors with word2vec from the segmented Tibetan text, and for storing the word vectors and the dictionary;

the model structure building module 2, used for building the model structure, which consists of a convolutional neural network and a conditional random field;

the word vector training module 3, used for training the model on the labeled Tibetan text and the trained word vectors;

and the word vector training stop judging module 4, used for stopping training once the development set reaches a given accuracy.

The technical solution of the present invention is further described below with reference to the accompanying drawings.

The invention segments Tibetan text using word vectors, a Convolutional Neural Network (CNN) and a Conditional Random Field (CRF). It mainly addresses the word segmentation needed to preprocess Tibetan text before training a neural machine translation system.

As shown in fig. 3, the method for processing the Tibetan word segmentation information provided by the present invention comprises the following steps:

First, learning Tibetan word vectors with word2vec from the segmented Tibetan text, and storing the word vectors and the dictionary.

Second, building the model structure, which consists of a Convolutional Neural Network (CNN) and a Conditional Random Field (CRF).

Third, training the model on the labeled Tibetan text and the trained word vectors.

Fourth, stopping training once the development set reaches a given accuracy.
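The accuracy checked in the fourth step could be measured in several ways, and the patent does not define which one it uses. The sketch below assumes the simplest option, the share of positions whose predicted boundary tag matches the manual label; the function name and the tag symbols are illustrative only.

```python
def label_accuracy(gold_seqs, pred_seqs):
    """Fraction of positions where the predicted tag equals the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sequences must have equal length")
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return correct / total if total else 0.0


if __name__ == "__main__":
    gold = [["B", "E", "S"], ["S"]]
    pred = [["B", "E", "B"], ["S"]]
    print(label_accuracy(gold, pred))
```

Word-level precision, recall and F1 are common alternatives for segmentation tasks and would give different numbers from this per-position measure.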

Word vectors help the deep learning method learn the relationships between words and therefore better estimate where one word can be split from the next; the model obtained by CNN-CRF training also segments text quickly.
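Once a CNN-CRF model is trained, segmenting a sentence amounts to finding the highest-scoring tag sequence and cutting the text at the predicted boundaries. The sketch below uses Viterbi decoding, the usual decoder for a linear-chain CRF; the patent does not name its decoding procedure, and the B/M/E/S tag set, the tsheg-joined output and all function names are assumptions.

```python
TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)
TAGS = ["B", "M", "E", "S"]  # assumed tag set: begin, middle, end, single


def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence for one sentence."""
    n = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
        backpointers.append(ptr)
    path = [max(range(n), key=lambda i: score[i])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]


def join_words(units, tag_ids):
    """Group units into words, closing a word after every E or S tag."""
    words, current = [], []
    for unit, k in zip(units, tag_ids):
        current.append(unit)
        if TAGS[k] in ("E", "S"):
            words.append(TSHEG.join(current))
            current = []
    if current:  # tolerate a sequence that ends inside a word
        words.append(TSHEG.join(current))
    return words


if __name__ == "__main__":
    emissions = [[2, 0, 0, 0], [0, 0, 2, 0], [0, 0, 0, 2]]
    transitions = [[0] * 4 for _ in range(4)]
    path = viterbi(emissions, transitions)
    print(path, join_words(["bod", "skad", "yin"], path))
```

Decoding costs time proportional to the sentence length times the square of the number of tags, which is consistent with the fast segmentation described above.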

It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portion may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer-executable instructions and/or embodied in processor control code, such code being provided, for example, on a carrier medium such as a disk, CD- or DVD-ROM, on programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, by semiconductors such as logic chips and transistors, or by programmable hardware devices such as field programmable gate arrays and programmable logic devices; by software executed by various types of processors; or by a combination of hardware circuits and software, e.g., firmware.

The above description is only for the purpose of illustrating specific embodiments of the present invention and is not to be construed as limiting the scope of the invention, which is intended to cover all modifications, equivalents and improvements that are within the spirit and scope of the invention as defined by the appended claims.

Claims

1. A Tibetan word segmentation information processing method, characterized in that the method learns representations of Tibetan words with word2vec; using the existing segmentation corpus and the learned word vectors, a convolutional neural network and a conditional random field learn the probability that a word boundary falls at each position in the Tibetan text; and the text is segmented at positions where that probability is high.

2. The Tibetan word segmentation information processing method of claim 1, wherein the method predicts word boundaries by learning Tibetan word vectors with word2vec and applying a Convolutional Neural Network (CNN) model and a Conditional Random Field (CRF).

3. The Tibetan word segmentation information processing method of claim 2, wherein the method trains the network iteratively by matching the character sequence of each sentence to the sequence of manually labeled word boundaries, the resulting weights being the final parameters.

4. The Tibetan word segmentation information processing method of claim 1, wherein the Tibetan word segmentation information processing method specifically comprises:

secondly, building a CNN model, and calculating loss by using CRF;

5. The Tibetan word segmentation information processing method as claimed in claim 4, wherein the structure of the built model is composed of a convolutional neural network and a conditional random field.

6. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the following steps: learning from the segmentation corpus through word vectors, a convolutional neural network and a conditional random field to generate Tibetan word boundary rules, and finally segmenting the Tibetan text.

7. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps: learning from the segmentation corpus through word vectors, a convolutional neural network and a conditional random field to generate Tibetan word boundary rules, and finally segmenting the Tibetan text.

8. An information data processing terminal, characterized in that the information data processing terminal is used for realizing the Tibetan word segmentation information processing method of any one of claims 1 to 5.

9. A Tibetan word segmentation information processing system for implementing the Tibetan word segmentation information processing method of any one of claims 1 to 5, the Tibetan word segmentation information processing system comprising:

10. A computer information processing terminal, characterized in that the computer information processing terminal is used for realizing the Tibetan word segmentation information processing method of any one of claims 1 to 5.