CN115050034A - Full-component recognition algorithm for modern Tibetan syllable characters - Google Patents

Full-component recognition algorithm for modern Tibetan syllable characters

Info

Publication number
CN115050034A
Authority
CN
China
Prior art keywords
character
characters
component
syllable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210561495.XA
Other languages
Chinese (zh)
Inventor
拉巴顿珠
孙亚东
欧珠
珠杰
尼玛
格桑多吉
顿珠次仁
赵栋材
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tibet University
Original Assignee
Tibet University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tibet University
Priority to CN202210561495.XA
Publication of CN115050034A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a full-component recognition algorithm for modern Tibetan syllable characters. If a vowel, a lower-added character, or an upper-added character exists among the components of a modern Tibetan syllable character, the syllable character is judged to be a vertically superimposed character and undergoes full-component recognition as such; if no vowel, lower-added character, or upper-added character exists among the components, the syllable character is judged not to be a vertically superimposed character and undergoes full-component recognition as a horizontally combined character. The method analyzes the complex structure of Tibetan syllable characters, which belong to a two-dimensional alphabetic script, and effectively improves their full-component recognition. In a full-component recognition test on the 18689 syllable characters in the latest available statistics, the recognition rate reached 100%, which provides a basic guarantee for the overall development of Tibetan information processing.

Description

Full-component recognition algorithm for modern Tibetan syllable characters
Technical Field
The invention relates to the technical field of information processing, in particular to a full-component recognition algorithm for modern Tibetan syllable characters.
Background
Modern Tibetan is an alphabetic script in everyday use among its speakers. It is a planar (two-dimensional) script whose syllables are spelled in two directions, by vertical superimposition and by horizontal combination. Syllable characters that conform to modern Tibetan grammar and character-formation rules are called modern Tibetan characters. According to the latest statistics, 18689 Tibetan syllable characters conform to the modern character-formation rules. A modern Tibetan character is composed of up to 7 components: the front-added character, upper-added character, base character, lower-added character, vowel, post-added character, and second post-added character. The base character is the core component and the only indispensable one; the other components vary from character to character. Unicode does not assign separate codes to a letter according to the position it occupies in a syllable: a given consonant letter may serve as base character, front-added character, post-added character, and so on, yet its code is the same wherever it appears and whatever role it plays. This makes the component analysis of Tibetan syllable characters difficult.
Disclosure of Invention
The invention aims to provide a full-component recognition algorithm for modern Tibetan syllable characters that takes the base character as the core and determines the other components of a syllable character from its character length.
The technical scheme for realizing the purpose of the invention is as follows:
a full-component recognition algorithm for modern Tibetan syllable characters, wherein if a vowel, a lower-added character, or an upper-added character exists among the components of a modern Tibetan syllable character, the syllable character is judged to be a vertically superimposed character and undergoes full-component recognition as such; if no vowel, lower-added character, or upper-added character exists among the components, the syllable character is judged not to be a vertically superimposed character and undergoes full-component recognition as a horizontally combined character;
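The vertical/horizontal decision described above can be sketched at the code-point level. This is a minimal sketch, not the patented implementation: it assumes the syllable is a Python string of Unicode Tibetan code points, and it assumes that an upper-added or lower-added character can be detected by the presence of a subjoined consonant (U+0F90 to U+0FBC), since Unicode encodes the lower letters of a stack in subjoined form. The names `VOWEL_SIGNS`, `is_subjoined`, and `is_vertically_superimposed` are hypothetical.

```python
# Assumption: the four modern vowel signs i, u, e, o.
VOWEL_SIGNS = frozenset("\u0f72\u0f74\u0f7a\u0f7c")


def is_subjoined(ch: str) -> bool:
    """True for a Tibetan subjoined consonant (U+0F90..U+0FBC)."""
    return 0x0F90 <= ord(ch) <= 0x0FBC


def is_vertically_superimposed(syllable: str) -> bool:
    """Vertical if any vowel sign or stacked (subjoined) letter is present.

    Assumption: a subjoined letter implies an upper-added or
    lower-added character somewhere in the syllable.
    """
    return any(ch in VOWEL_SIGNS or is_subjoined(ch) for ch in syllable)


# Usage: a stacked syllable versus a purely horizontal one.
stacked = "\u0f56\u0f66\u0f92\u0fb2\u0f74\u0f56\u0f66"
flat = "\u0f56\u0f40\u0f42"
print(is_vertically_superimposed(stacked), is_vertically_superimposed(flat))
```

A syllable for which the function returns `False` would be passed to the horizontal, length-based branch.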
full-component recognition of a vertically superimposed character comprises the following steps:
step 1: locating the base character of the modern Tibetan syllable character;
step 2: executing the backward algorithm to identify components:
2.1 reading the character string that follows the base character (the post string);
2.2 identifying whether the character after the base character is any one of the following:
Figure BDA0003656754590000021
if yes, the components are: the character after the base character is the vowel, and there is no lower-added character;
let S be the distance from the vowel to the last character of the post string:
when S is 0, the components are: there is no post-added character and no second post-added character;
when S is 1, the components are: the character after the vowel is the post-added character, and there is no second post-added character;
when S is 2, the components are: the character after the vowel is the post-added character, and the character after that is the second post-added character;
if not, continuing;
2.3 identifying whether the character after the base character is any one of the following:
Figure BDA0003656754590000022
if yes, the components are: the character after the base character is the lower-added character;
let S be the distance from the lower-added character to the last character of the post string:
when S is 0, the components are: there is no vowel, no post-added character, and no second post-added character;
when S is 1, further identifying whether the character after the lower-added character is any one of the following:
Figure BDA0003656754590000023
if so, the components are: the character after the lower-added character is the vowel, and there is no post-added character and no second post-added character;
if not, the components are: the character after the lower-added character is the post-added character, and there is no vowel and no second post-added character;
when S is 2, further identifying whether the character after the lower-added character is any one of the following:
Figure BDA0003656754590000024
if so, the components are: the character after the lower-added character is the vowel, the character after that is the post-added character, and there is no second post-added character; if not, the components are: the character after the lower-added character is the post-added character, the character after that is the second post-added character, and there is no vowel;
when S is 3, the components are: the character after the lower-added character is the vowel, the next character is the post-added character, and the last character is the second post-added character;
if not, continuing;
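Steps 2.1 to 2.3 can be condensed into one function. The sketch below is illustrative only and rests on several assumptions: the base character index is supplied by the caller, the vowel set is the four modern vowel signs, and the lower-added set is the subjoined ya, ra, la, wa. The S-cases are collapsed into "optional lower-added character, optional vowel, then up to two trailing letters", which produces the same outcomes as the case tables. The final `else` branch (a trailing letter directly after the base character) is not spelled out in steps 2.2 and 2.3 and is purely an assumption. `backward_pass` is a hypothetical name.

```python
from typing import Dict, Optional

VOWEL_SIGNS = frozenset("\u0f72\u0f74\u0f7a\u0f7c")   # i, u, e, o (assumed set)
LOWER_ADDED = frozenset("\u0fb1\u0fb2\u0fb3\u0fad")   # subjoined ya, ra, la, wa (assumed set)


def backward_pass(syllable: str, base_index: int) -> Dict[str, Optional[str]]:
    """Identify the components that follow the base character."""
    parts: Dict[str, Optional[str]] = {
        "lower": None, "vowel": None, "post": None, "second_post": None,
    }
    post_string = syllable[base_index + 1:]            # step 2.1
    if not post_string:
        return parts
    first, rest = post_string[0], post_string[1:]
    if first in VOWEL_SIGNS:                           # step 2.2
        parts["vowel"] = first
    elif first in LOWER_ADDED:                         # step 2.3
        parts["lower"] = first
        if rest and rest[0] in VOWEL_SIGNS:
            parts["vowel"], rest = rest[0], rest[1:]
    else:
        # Assumption: neither vowel nor lower-added character follows,
        # so the whole post string consists of trailing letters.
        rest = post_string
    if len(rest) > 2:
        raise ValueError("more trailing characters than the S-cases allow")
    if len(rest) >= 1:
        parts["post"] = rest[0]
    if len(rest) == 2:
        parts["second_post"] = rest[1]
    return parts


# Usage: base character at index 2, followed by lower-added, vowel, two trailing letters.
print(backward_pass("\u0f56\u0f66\u0f92\u0fb2\u0f74\u0f56\u0f66", 2))
```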
step 3: executing the forward algorithm to identify components:
3.1 reading the character string that precedes the base character (the pre string);
3.2 let S be the distance from the base character to the first character of the pre string:
when S is 0, the components are: there is no upper-added character and no front-added character;
when S is 1, further identifying whether the character before the base character is any one of the following:
Figure BDA0003656754590000033
if so, the components are as follows: the previous character of the basic character is an add character, and no add character exists;
if not its components: the previous character of the basic character is a plus character, and no plus character exists;
when S is 2, the components are: the character before the base character is the upper-added character, and the character before that is the front-added character;
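The forward pass of step 3 can be sketched in the same style. The set tested when S is 1 is an image in the source and is not recoverable here, so the sketch below makes a loudly stated assumption: it tests the character before the base character against the three upper-added letters ra, la, sa, and treats any other single preceding letter as the front-added character. The name `forward_pass` and the constant `UPPER_ADDED` are hypothetical.

```python
from typing import Dict, Optional

# Assumption: the S == 1 test uses the three upper-added letters ra, la, sa.
UPPER_ADDED = frozenset("\u0f62\u0f63\u0f66")


def forward_pass(syllable: str, base_index: int) -> Dict[str, Optional[str]]:
    """Identify the components that precede the base character."""
    parts: Dict[str, Optional[str]] = {"front": None, "upper": None}
    pre_string = syllable[:base_index]                 # step 3.1
    s = len(pre_string)                                # step 3.2
    if s == 0:
        return parts
    if s == 1:
        # Assumed reading of the S == 1 branch (see the lead-in).
        role = "upper" if pre_string[0] in UPPER_ADDED else "front"
        parts[role] = pre_string[0]
    elif s == 2:
        # Writing order: front-added character, then upper-added character.
        parts["front"], parts["upper"] = pre_string[0], pre_string[1]
    else:
        raise ValueError("at most two characters may precede the base character")
    return parts


# Usage: two characters before the base character at index 2.
print(forward_pass("\u0f56\u0f66\u0f92\u0fb2\u0f74\u0f56\u0f66", 2))
```

Combining this result with the backward pass gives all seven component slots for a vertically superimposed character.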
full-component recognition of a horizontally combined character comprises the following steps:
reading the horizontal character string of the modern Tibetan syllable character, and letting its length be i:
when i is 1, the component is: the base character;
when i is 2, the components are: the first character is the base character, and the second character is the post-added character;
when i is 3, further identifying:
if the first character of the horizontal string is not any one of the following:
Figure BDA0003656754590000031
the components are: the first character is the base character, the second character is the post-added character, and the third character is the second post-added character;
if the horizontal string is any one of the following:
Figure BDA0003656754590000032
the components are: the first character is the base character, the second character is the post-added character, and the third character is the second post-added character; otherwise, the components are: the first character is the front-added character, the second character is the base character, and the third character is the post-added character;
if the horizontal string contains a double consonant, the components are: the first character is the base character, the second character is the post-added character, and the third character is the second post-added character;
when i is 4, the components are: the first character is the front-added character, the second character is the base character, the third character is the post-added character, and the fourth character is the second post-added character.
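The length-based rules for horizontally combined characters translate almost directly into code. The following is a minimal sketch under stated assumptions: the five front-added letters are taken to be ga, da, ba, ma, 'a; "double consonant" is interpreted as the first two letters being identical; and the list of special three-letter syllables, which is an image in the source, is passed in by the caller through the hypothetical parameter `first_is_base`. The function name `split_horizontal` is also hypothetical.

```python
from typing import Dict, FrozenSet

# Assumption: the five front-added letters ga, da, ba, ma, 'a.
FRONT_ADDED = frozenset("\u0f42\u0f51\u0f56\u0f58\u0f60")


def split_horizontal(s: str, first_is_base: FrozenSet[str] = frozenset()) -> Dict[str, str]:
    """Assign component roles to a horizontal string of length 1 to 4."""
    i = len(s)
    if i == 1:
        return {"base": s[0]}
    if i == 2:
        return {"base": s[0], "post": s[1]}
    if i == 3:
        base_first = (
            s[0] not in FRONT_ADDED      # first letter cannot be a front-added character
            or s[0] == s[1]              # assumed meaning of "double consonant"
            or s in first_is_base        # special syllables supplied by the caller
        )
        if base_first:
            return {"base": s[0], "post": s[1], "second_post": s[2]}
        return {"front": s[0], "base": s[1], "post": s[2]}
    if i == 4:
        return {"front": s[0], "base": s[1], "post": s[2], "second_post": s[3]}
    raise ValueError("a horizontally combined character has 1 to 4 letters")


# Usage: a three-letter string whose first letter is a front-added letter.
print(split_horizontal("\u0f56\u0f40\u0f42"))
```

Keeping the special-syllable list as a parameter avoids hard-coding entries that cannot be verified from this text.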
A further technical scheme also comprises step 2.4:
2.4 identifying whether the character after the base character is any one of the following:
Figure BDA0003656754590000041
if yes, the components are: the character after the base character is the lower-added character;
further identifying whether the character after the lower-added character is
Figure BDA0003656754590000042
if so, the components are: the character after the lower-added character is the second lower-added character; if not, the components are: there is no second lower-added character;
if not, continuing.
The invention also provides methods of using the full-component recognition algorithm for modern Tibetan syllable characters.
One method of use is as follows: the components of each modern Tibetan syllable character in a Tibetan text are identified with the full-component recognition algorithm and written into a component knowledge base for the text;
the following are counted in the component knowledge base:
the number of occurrences of each base character
Figure BDA0003656754590000043
Figure BDA0003656754590000044
the number of occurrences of each vowel
Figure BDA0003656754590000045
the number of occurrences of each front-added character
Figure BDA0003656754590000046
the number of occurrences of each post-added character
Figure BDA0003656754590000047
the number of occurrences of each second post-added character
Figure BDA0003656754590000048
the number of occurrences of each upper-added character
Figure BDA0003656754590000049
the number of occurrences of each lower-added character
Figure BDA00036567545900000410
and the frequency of Tibetan characters in the Tibetan text is computed from the number of occurrences of each base character, vowel, front-added character, post-added character, upper-added character, and lower-added character.
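The per-role counting described in this method of use can be sketched with the standard library. The sketch assumes, hypothetically, that the component knowledge base is a list of dictionaries mapping a role name to the character found in that role (or `None` when the component is absent); the source does not specify a storage format. `count_components` is a hypothetical name.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Mapping, Optional


def count_components(
    knowledge_base: Iterable[Mapping[str, Optional[str]]],
) -> Dict[str, Counter]:
    """Count how often each character occurs in each component role."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for components in knowledge_base:
        for role, ch in components.items():
            if ch is not None:            # absent components are not counted
                counts[role][ch] += 1
    return counts


# Usage with two hypothetical records.
kb = [
    {"front": "\u0f56", "base": "\u0f40", "vowel": None, "post": "\u0f42"},
    {"front": None, "base": "\u0f40", "vowel": "\u0f74", "post": None},
]
stats = count_components(kb)
print(stats["base"].most_common(1))
```

Relative frequencies follow by dividing each count by the total for its role.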
The other method of use is as follows: the components of a modern Tibetan syllable character are identified with the full-component recognition algorithm; if the components include a lower-added character, the lower-added character is first converted into the corresponding consonant letter, that is,
Figure BDA0003656754590000051
are converted respectively into
Figure BDA0003656754590000052
If the member has additional characters, converting the additional characters into consonant characters
Figure BDA0003656754590000053
Is converted into
Figure BDA0003656754590000054
If there is no vowel in the component, the Latin character "a" is used as the vowel; converting the Tibetan character into the Latin character according to the sequence of the front plus character → the top plus character → the base character → the bottom plus character → the second top plus character → the vowel → the second top plus character;
the Tibetan language-Latin character conversion table is as follows:
Figure BDA0003656754590000055
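The conversion can be illustrated with a small sketch. Everything specific in it is an assumption: the conversion table in the source is an image, so the few Latin values below are hypothetical Wylie-style placeholders rather than the patent's table, and the output order used here (front-added, upper-added, base, lower-added, vowel, post-added, second post-added) is an assumed reading of the sequence above. What is taken from the text is the default vowel "a" and the conversion of stacked forms to consonant letters before lookup; the fixed offset of 0x50 between subjoined and nominal letters is a property of the Unicode Tibetan block.

```python
from typing import Mapping, Optional

# Hypothetical excerpt of a Tibetan-to-Latin table (Wylie-style values).
LATIN = {
    "\u0f40": "k", "\u0f42": "g", "\u0f56": "b", "\u0f62": "r", "\u0f66": "s",
    "\u0f72": "i", "\u0f74": "u", "\u0f7a": "e", "\u0f7c": "o",
}
# Assumed output order of the component roles.
ORDER = ("front", "upper", "base", "lower", "vowel", "post", "second_post")


def to_consonant(ch: str) -> str:
    """Map a subjoined letter (U+0F90..U+0FB8) to its nominal consonant."""
    cp = ord(ch)
    return chr(cp - 0x50) if 0x0F90 <= cp <= 0x0FB8 else ch


def to_latin(parts: Mapping[str, Optional[str]]) -> str:
    """Transliterate identified components; a missing vowel becomes "a"."""
    out = []
    for role in ORDER:
        ch = parts.get(role)
        if ch is None:
            if role == "vowel":
                out.append("a")          # default vowel stated in the text
            continue
        out.append(LATIN[to_consonant(ch)])
    return "".join(out)


# Usage: components of a stacked syllable, then of a horizontal one.
print(to_latin({"front": "\u0f56", "upper": "\u0f66", "base": "\u0f92",
                "lower": "\u0fb2", "vowel": "\u0f74", "post": "\u0f56",
                "second_post": "\u0f66"}))
print(to_latin({"front": "\u0f56", "base": "\u0f40", "post": "\u0f42"}))
```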
The beneficial effects of the invention are as follows:
1. Taking the base character of Tibetan as the core, the base character of a syllable character is located according to the two-dimensional structural features of Tibetan syllable characters; with the base character as the boundary, the full components of a vertically superimposed character are identified by forward and backward derivation, and the full components of a horizontally combined character are identified by computing the length of the whole character string.
2. The method analyzes the complex structure of Tibetan syllable characters, which belong to a two-dimensional alphabetic script, and effectively improves their full-component recognition. In a full-component recognition test on the 18689 syllable characters in the latest available statistics, the recognition rate reached 100%, which provides a basic guarantee for the overall development of Tibetan information processing.
Drawings
FIG. 1 is a diagram of the syllable structure of modern Tibetan.
FIG. 2a and FIG. 2b are schematic diagrams of the writing order of the components of modern Tibetan syllable characters.
FIG. 3 is a flow chart of base character location for a vertically superimposed character.
FIG. 4 is a flow chart of base character location for a horizontally combined character.
FIG. 5 is a flow chart of the backward calculation for a vertically superimposed character.
FIG. 6 is a flow chart of the forward calculation for a vertically superimposed character.
FIG. 7 is a flow chart of full-component recognition of a horizontally combined character.
FIG. 8 is a flow chart of the identification of the second lower-added character in a vertically superimposed character.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
I. Structure of modern Tibetan syllable characters and their spelling grammar
1.1 Component elements of modern Tibetan syllable characters
The processing of Tibetan characters is a prerequisite for the development of Tibetan information processing. Its most basic object of study is the component elements of Tibetan characters, the smallest units that make up Tibetan syllable characters. These smallest units comprise:
30 consonant letters
Figure BDA0003656754590000061
4 vowels
Figure BDA0003656754590000062
5 front-added characters
Figure BDA0003656754590000063
10 post-added characters
Figure BDA0003656754590000064
Figure BDA0003656754590000065
2 second post-added characters
Figure BDA0003656754590000066
3 upper-added characters
Figure BDA0003656754590000067
4 lower-added characters
Figure BDA0003656754590000068
for a total of 58 component elements.
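The inventory above can be written down as data and the total checked. The figures listing the actual letters are images and are not reproduced in this text, so the letters below are an assumption: they are the inventories given in traditional Tibetan grammar, which agree with the counts stated above. The lower-added letters are given in their Unicode subjoined forms. The name `INVENTORY` is hypothetical.

```python
# Assumed inventories (traditional Tibetan grammar), keyed by component role.
INVENTORY = {
    "consonant letters": (
        "\u0f40\u0f41\u0f42\u0f44\u0f45\u0f46\u0f47\u0f49"
        "\u0f4f\u0f50\u0f51\u0f53\u0f54\u0f55\u0f56\u0f58"
        "\u0f59\u0f5a\u0f5b\u0f5d\u0f5e\u0f5f\u0f60\u0f61"
        "\u0f62\u0f63\u0f64\u0f66\u0f67\u0f68"
    ),
    "vowels": "\u0f72\u0f74\u0f7a\u0f7c",
    "front-added characters": "\u0f42\u0f51\u0f56\u0f58\u0f60",
    "post-added characters": "\u0f42\u0f44\u0f51\u0f53\u0f56\u0f58\u0f60\u0f62\u0f63\u0f66",
    "second post-added characters": "\u0f51\u0f66",
    "upper-added characters": "\u0f62\u0f63\u0f66",
    "lower-added characters": "\u0fb1\u0fb2\u0fb3\u0fad",
}

# Sum of the per-role counts: 30 + 4 + 5 + 10 + 2 + 3 + 4.
TOTAL = sum(len(letters) for letters in INVENTORY.values())

for role, letters in INVENTORY.items():
    print(f"{len(letters):2d} {role}: {' '.join(letters)}")
print("total component elements:", TOTAL)
```

Because the same consonant code appears in several roles, the total counts role slots, not distinct code points, which is the ambiguity the recognition algorithm has to resolve.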
All component elements of modern Tibetan syllable characters derive from the 30 consonant letters and 4 vowel letters, but in Unicode the 4 lower-added characters (
Figure BDA0003656754590000069
and
Figure BDA00036567545900000610
) and the base characters in unreduced form (
Figure BDA00036567545900000611
Figure BDA00036567545900000612
… and
Figure BDA00036567545900000613
…) have codes different from those of the reduced forms.
1.2 General structure of modern Tibetan syllable characters
Decomposing Tibetan characters into components is the basis and prerequisite of Tibetan information processing and essential groundwork for its healthy and rapid development. The base character occupies the central place in a single Tibetan syllable, and its position can be used to determine the other components. A Tibetan syllable character is composed, in writing order, of 1 to 7 characters:
front-added character
Figure BDA0003656754590000071
upper-added character
Figure BDA0003656754590000072
base character
Figure BDA0003656754590000073
lower-added character
Figure BDA0003656754590000074
vowel
Figure BDA0003656754590000075
post-added character
Figure BDA0003656754590000076
second post-added character
Figure BDA0003656754590000077
The front-added character, base character, post-added character, and second post-added character are spelled horizontally, while the upper-added character, base character, lower-added character, and vowel are spelled by vertical superimposition. The variety and complexity of these components is the key difficulty of Tibetan character information processing. Once Tibetan single syllables can be decomposed algorithmically by computer, the result is of great practical value for Tibetan sorting, character statistics, automatic proofreading systems, and the like. Modern Tibetan syllable characters are combined in a fixed writing order, as shown in FIG. 1.
1.2.1 Vertical superimposition structure
Vertical superimposition stacks the upper-added character, lower-added character, and vowel on the base character, in the pattern upper-added character + base character + lower-added character + vowel; whether a front-added character, post-added character, or second post-added character is also present depends on the character. For example:
Figure BDA0003656754590000078
(upper-added character + base character + lower-added character),
Figure BDA0003656754590000079
(plus + radical + vowel),
Figure BDA00036567545900000710
(base character + lower-added character + vowel),
Figure BDA00036567545900000711
(upper-added character + base character + lower-added character + vowel),
Figure BDA00036567545900000712
(plus + base + minus + vowel + plus + minus).
1.2.2 Horizontal combination structure
Horizontal combination takes the base character as the core and adds the front-added character, post-added character, and second post-added character before and after it as the character requires. There are generally 5 structures: (1) a syllable formed by a single consonant, e.g.
Figure BDA00036567545900000713
(2) base character + post-added character, e.g.
Figure BDA00036567545900000714
(3) front-added character + base character + post-added character, e.g.
Figure BDA00036567545900000715
(4) base character + post-added character + second post-added character, e.g.
Figure BDA00036567545900000716
(5) front-added character + base character + post-added character + second post-added character, e.g.
Figure BDA00036567545900000717
1.3 Spelling rules of modern Tibetan syllable characters
Tibetan is an alphabetic script whose character structure has a consonant letter, the "base character", as its core; the other letters attach to the base character, combining horizontally before and after it and superimposing vertically above and below it, to form a complete syllable character. In general, a Tibetan character has at least one consonant letter (the base character) and at most 6 consonant letters, with vowel signs added above or below the consonant letters. The core consonant letter is called the "base character", and the other letters are named after the position at which they attach to it. Tibetan is spelled syllable by syllable; syllables are separated from one another by a separator, and the text is closed with the terminating mark
Figure BDA0003656754590000081
.
In summary, a Tibetan syllable character is composed of 7 components at different positions. According to traditional Tibetan grammar, syllables are separated by the syllable point "·", so the character string between two syllable points is taken as one complete Tibetan syllable character. Analysis of real modern Tibetan text, however, shows that a syllable character does not have only this simple spelling structure: when several syllable characters form a phrase or sentence, two syllables may be contracted into one syllable character, producing a different spelling structure. For example, in the character string
Figure BDA0003656754590000082
two base characters and two vowels appear at different positions within one syllable character, and in more complicated forms such as
Figure BDA0003656754590000083
3 base characters and 3 vowels appear in one string. The analysis shows that the maximum length of a syllable character in real modern Tibetan text is not necessarily limited to 6 consonant letters and 1 vowel; that is, Tibetan syllable characters can be generalized into two structural forms. FIG. 2a and FIG. 2b show the basic writing rules of the two forms; the numbers in the figures give the writing order of each component within one Tibetan syllable character.
(1) General Tibetan writing order:
upper-added character → base character → lower-added character → vowel → post-added character (second post-added character); that is, the Unicode encoding order, the writing order, and the phoneme parsing order are the same.
(2) Writing order of contracted (two-syllable) syllable characters:
upper-added character → base character → lower-added character → vowel → second base character → second vowel (that is, the case of contracted words, such as:
Figure BDA0003656754590000084
).
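Since section 1.3 treats the string between two syllable points as one syllable character, a text can be cut into candidate syllable strings before any component analysis. The sketch below is a simple illustration, assuming the syllable point is U+0F0B (TIBETAN MARK INTERSYLLABIC TSHEG) and the closing mark is U+0F0D (TIBETAN MARK SHAD); it does not split the contracted forms described in (2), which need separate handling. `split_syllables` is a hypothetical name.

```python
from typing import List

TSHEG = "\u0f0b"   # syllable point (assumed separator)
SHAD = "\u0f0d"    # closing mark (assumed terminator)


def split_syllables(text: str) -> List[str]:
    """Split Tibetan text into the strings between syllable points."""
    normalized = text.replace(SHAD, TSHEG)      # treat the closing mark as a boundary
    pieces = (piece.strip() for piece in normalized.split(TSHEG))
    return [piece for piece in pieces if piece]  # drop empty fragments


# Usage: four syllables separated by syllable points and closed by a shad.
sample = TSHEG.join([
    "\u0f56\u0f40\u0fb2",
    "\u0f64\u0f72\u0f66",
    "\u0f56\u0f51\u0f7a",
    "\u0f63\u0f7a\u0f42\u0f66",
]) + SHAD
print(split_syllables(sample))
```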
II. Collocation rules of the components of modern Tibetan syllable characters
2.1 Addition rules for the front-added character
In Table 2-1, the third front-added character
Figure BDA0003656754590000094
can be placed directly before the 10 base characters in the first row; it cannot be placed directly before the 6 base characters in the second row, but it can precede them when those 6 base characters are in superimposed form.
Table 2-1 Addition rules for the front-added character
Figure BDA0003656754590000091
2.2 Addition rules for the upper-added character
The addition rules for the upper-added character of Tibetan syllable characters are shown in Table 2-2 below.
Table 2-2 Addition table for the upper-added character
Figure BDA0003656754590000092
2.3 Addition rules for the lower-added character
The rules are shown in Table 2-3. The 5th character in the table,
Figure BDA0003656754590000093
often appears in many corpora in the position of the post-added or second post-added character, and is therefore also included in the addition rules summarized here.
Table 2-3 Addition table for the lower-added character
Figure BDA0003656754590000101
2.4 Addition rules for the second lower-added character
The addition rules for the second lower-added character of Tibetan syllable characters are shown in Table 2-4 below; the second lower-added character appears only after a lower-added character and never directly after the base character.
Table 2-4 Addition table for the second lower-added character
Figure BDA0003656754590000102
III. Base character location algorithm for modern Tibetan syllable characters
3.1 Design of the base character location algorithm
The design idea is to determine the position of the base character from the features of the two different structures of Tibetan syllable characters, vertical superimposition and horizontal combination; that is, a full-component recognition method for Tibetan syllable characters based on "base character location" is proposed. The specific implementation steps are as follows:
(1) first, define the collocation rules of modern Tibetan syllable characters according to traditional Tibetan grammar;
Exclusion method: to simplify the recognition target, the character string read in is preprocessed, and strings that are not Tibetan characters or that do not have the structure of a modern Tibetan syllable character are excluded.
(2) read the Tibetan syllable character to be processed and judge whether the codes of the character string begin with 0F, which determines whether the string consists of Tibetan characters;
(3) if the codes begin with 0F, proceed to the next step: judge whether the characters are modern Tibetan characters by searching, among the codes 0F00-0FDA of the 211 Tibetan characters encoded in Unicode 6.2, for the codes that belong to modern Tibetan characters; codes that do not belong to modern Tibetan characters, together with the digit codes (0F20-0F29, 0F00-0F3F, 0F80-0F8F, and 0FC0-0FDA), are filtered out, which excludes character strings that do not have the structure of a modern Tibetan syllable character;
(4) further judge whether the string is a modern Tibetan syllable character by applying (1), that is, by checking against the defined collocation rules whether it conforms to the formation rules of modern Tibetan syllables;
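Steps (2) and (3) amount to a code-point range filter, which can be sketched as follows. The ranges are the ones quoted in step (3); the digit range 0F20-0F29 lies inside 0F00-0F3F and therefore needs no separate entry. This is only the pre-filter: the collocation check of step (4) is not implemented here. The names `EXCLUDED_RANGES`, `is_candidate_letter`, and `is_candidate_syllable` are hypothetical.

```python
# Ranges filtered out in step (3); 0F20-0F29 is covered by 0F00-0F3F.
EXCLUDED_RANGES = ((0x0F00, 0x0F3F), (0x0F80, 0x0F8F), (0x0FC0, 0x0FDA))


def is_candidate_letter(ch: str) -> bool:
    """True if the code point passes the range filter of steps (2)-(3)."""
    cp = ord(ch)
    if not 0x0F00 <= cp <= 0x0FDA:        # step (2): code must begin with 0F
        return False
    return not any(low <= cp <= high for low, high in EXCLUDED_RANGES)


def is_candidate_syllable(syllable: str) -> bool:
    """True if every character of a non-empty string passes the filter."""
    return bool(syllable) and all(is_candidate_letter(ch) for ch in syllable)


# Usage: a Tibetan syllable, a Latin string, and a Tibetan digit.
print(is_candidate_syllable("\u0f56\u0f40\u0fb2"),
      is_candidate_syllable("abc"),
      is_candidate_syllable("\u0f20"))
```

Strings that pass would then go on to the collocation-rule check and the base character location steps.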
The base character location method for vertically superimposed characters is as follows:
(5) if the string conforms to the syllable structure of modern Tibetan, judge whether the syllable character contains a vowel, an upper-added character, or a lower-added character, and locate the base character from its distance to the superimposed characters; the base character location flow for vertically superimposed characters is shown in FIG. 3.
The base character location method for horizontally combined characters is as follows:
(6) if the syllable contains no superimposed character, determine the length of the horizontal string of the syllable; 4 cases of horizontal string length (syllable length) occur, and the base character location flow for horizontally combined characters is shown in FIG. 4.
When i = 1, the base character is the character itself; for example, the base character of the syllable character
Figure BDA0003656754590000111
is
Figure BDA0003656754590000112
(itself);
when i = 2, the base character is the first character; for example, the base character of the syllable character
Figure BDA0003656754590000113
is
Figure BDA0003656754590000114
When i = 3, 2 different structures occur, namely "base character + post-added character + second post-added character" or "front-added character + base character + post-added character", and the base character is located as follows:
A. Judge whether the first character of the syllable character is one of the 5 front-added characters; if not, the base character is the first character; for example, the base character of the syllable character
Figure BDA0003656754590000121
is
Figure BDA0003656754590000122
B. If a double consonant appears, the base character is the first character; for example, the base character of the syllable character
Figure BDA0003656754590000123
is
Figure BDA0003656754590000124
C. If the first character is one of the 5 front-added characters and the syllable character is among the following 11 syllable characters, the base character is the first character; otherwise the base character is the second character. For example, the base character of the syllable character
Figure BDA0003656754590000125
is
Figure BDA0003656754590000126
while the base character of the syllable character
Figure BDA0003656754590000127
(not among the 11 syllable characters in the following table) is
Figure BDA0003656754590000128
Figure BDA0003656754590000129
When i = 4, the base character is the second character; for example, the base character of the syllable character
Figure BDA00036567545900001210
is
Figure BDA00036567545900001211
The base character location method for special syllable characters is as follows:
(6) component rules and statistical analysis of modern Tibetan syllables show that, among Tibetan syllables formed by horizontally combining three characters (i = 3), 9 syllables in total satisfy both the spelling rule "front-added character + base character + post-added character" and the spelling rule "base character + post-added character + second post-added character", so an ambiguity arises; based on the traditional Tibetan grammar of word formation, in combination with various modern Tibetan dictionaries, occurrence frequencies in statistical corpora, and the judgment of expert scholars, the base characters of these 9 special syllables are as shown in the following table:
Figure BDA00036567545900001212
In addition, the above 9 syllable characters merely conform to the Tibetan character-formation rules; their occurrence frequency in current Tibetan texts and in many modern Tibetan dictionaries is very low, so their processing can be omitted.
(7) Besides the general structure of the modern Tibetan, the Tibetan which does not conform to the spelling rule of the modern Tibetan but often appears in the modern Tibetan can be regarded as the modern Tibetan nowadays, and the positioning method of the basic character is as follows:
A. and then the base character of the lower added character is positioned. Such as
Figure BDA00036567545900001213
And judging the basic character of the syllable by the distance between the following character and the basic character, wherein the basic character is the first character, such as: syllable character
Figure BDA00036567545900001214
Is a radical
Figure BDA00036567545900001215
B. The base word location of the merged word (string of concatenated compact words). Frequently occurring in Tibetan texts are merged words in abbreviated form, e.g.
Figure BDA0003656754590000131
When the base character is positioned, such a string cannot be processed directly by the base character positioning method for longitudinally superimposed characters or for horizontally combined characters. First, the contracted word is identified and the character string is separated into two or more character strings; the base character positioning methods for longitudinally superimposed characters and horizontally combined characters are then applied to each string. Such a character string is not a single Tibetan syllable character but a combination of two or more syllable characters (it can be regarded as a Tibetan contracted word or word), so two or more base characters are identified for it. For example, the character string
Figure BDA0003656754590000132
has the base characters
Figure BDA0003656754590000133
and
Figure BDA0003656754590000134
C. Base character positioning for Tibetan characters of Sanskrit origin, such as
Figure BDA0003656754590000135
This character string follows the character-formation pattern of Sanskrit-derived Tibetan characters. Sanskrit-derived characters have no concept of upper-added and lower-added characters, so the character string
Figure BDA0003656754590000136
is taken as a whole as the base character.
4 Full-component recognition algorithm for modern Tibetan syllable characters
4.1 Full-component recognition algorithm design
The preceding chapter introduced the method for positioning the base character of Tibetan syllable characters and gave specific positioning algorithms for this core component for the 2 structures of Tibetan syllable characters, "longitudinal superposition" and "horizontal combination". On the basis of base character positioning, this section proposes a recognition algorithm for the other components of Tibetan syllable characters, which combine dynamically. The specific recognition steps and algorithms are as follows:
Full-component identification of longitudinally superimposed characters:
In the full-component identification of a longitudinally superimposed character, the base character is taken as the boundary, and the positions of the other components are determined by a backward calculation and a forward calculation. If a component is null (represented herein by
Figure BDA0003656754590000137
), the component is not present.
(1) Performing a backward calculation:
First, judge whether the character following the base character is any one of the vowels
Figure BDA0003656754590000138
If so, a vowel is present and no lower-added character is present, and the first step is executed.
First step: let the distance from the vowel to the last character be S = x (0 ≤ x ≤ 2). When S = 0, the post-added character and the re-post-added character are both null; when S = 1, the post-added character is present and the re-post-added character is null; when S = 2, both the post-added character and the re-post-added character are present.
If not (no vowel is present), the second step is performed.
Second step: judge whether the character following the base character is any one of the lower-added characters
Figure BDA0003656754590000141
If so, a lower-added character is present. Let the distance from the lower-added character to the last character be S = x (0 ≤ x ≤ 3). When S = 0, the vowel, the post-added character and the re-post-added character are all null; when S = 3, the vowel, the post-added character and the re-post-added character are all present. When S = 1 or S = 2, it must further be judged whether the character following the lower-added character is any one of
Figure BDA0003656754590000142
When S = 1 (2 structures are possible: "… + lower-added character + vowel" or "… + lower-added character + post-added character") and the character is any one of
Figure BDA0003656754590000143
a vowel is present, and the post-added character and the re-post-added character are null; when S = 1 and the character is not any one of
Figure BDA0003656754590000144
the vowel and the re-post-added character are null, and the post-added character is present. When S = 2 (2 structures are possible: "… + lower-added character + vowel + post-added character" or "… + lower-added character + post-added character + re-post-added character") and the first character after the lower-added character is any one of
Figure BDA0003656754590000145
Figure BDA0003656754590000146
a vowel and a post-added character are present, and the re-post-added character is null; when S = 2 and the first character after the lower-added character is not any one of
Figure BDA0003656754590000147
the vowel is null, and the post-added character and the re-post-added character are present. The backward calculation process ends.
The flow of the backward calculation, with the base character as the boundary, is shown in Fig. 5.
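The backward calculation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation. The vowel and lower-added sets are the standard Tibetan Unicode vowel signs and subjoined wa, ya, ra and la, assumed here because the patent's own sets are given only in figures. The name `backward_components` is hypothetical. The handling of a tail that starts with neither a vowel nor a lower-added character (treated as post-added and re-post-added characters) is also an assumption, since the text does not state it. Consuming the tail from left to right yields the same results as the case analysis on S.

```python
# Assumed sets; the patent gives its sets only in figures.
VOWELS = {"\u0F72", "\u0F74", "\u0F7A", "\u0F7C"}        # vowel signs i, u, e, o
LOWER_ADDED = {"\u0FAD", "\u0FB1", "\u0FB2", "\u0FB3"}    # subjoined wa, ya, ra, la


def backward_components(chars, base_index):
    """Identify the components that follow the base character.

    chars is the syllable as a sequence of code points and base_index is the
    position of the already located base character. None means "null".
    """
    result = {"lower_added": None, "vowel": None,
              "post_added": None, "re_post_added": None}
    tail = list(chars[base_index + 1:])

    if tail and tail[0] in VOWELS:
        # Vowel directly after the base character: no lower-added character.
        result["vowel"] = tail.pop(0)
    elif tail and tail[0] in LOWER_ADDED:
        result["lower_added"] = tail.pop(0)
        # S = 1..3: the next character may be a vowel.
        if tail and tail[0] in VOWELS:
            result["vowel"] = tail.pop(0)

    # What remains can only be a post-added and a re-post-added character.
    if len(tail) > 2:
        raise ValueError("too many characters after the base character")
    if tail:
        result["post_added"] = tail[0]
    if len(tail) == 2:
        result["re_post_added"] = tail[1]
    return result
```

For a seven-character syllable whose base character is at index 2 and is followed by a subjoined ra, the vowel sign u and two consonants, the function reports a lower-added character, a vowel, a post-added character and a re-post-added character, which is the S = 3 case.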
(2) Performing a forward calculation:
First, let the distance from the base character to the first character be S = x (0 ≤ x ≤ 2). When S = 0, the pre-added character and the upper-added character are both null. When S = 1 (2 structures are possible: "pre-added character + base character + …" or "upper-added character + base character + …") and the character preceding the base character is any one of the upper-added characters
Figure BDA0003656754590000148
the upper-added character is present and the pre-added character is null; when S = 1 and the character preceding the base character is not any one of
Figure BDA0003656754590000149
the pre-added character is present and the upper-added character is null. When S = 2, both the upper-added character and the pre-added character are present.
The flow of the forward calculation, with the base character as the boundary, is shown in Fig. 6. Structural analysis of Tibetan syllable characters shows that, in the component identification of longitudinally superimposed characters, the only components that can precede the base character are the pre-added character and the upper-added character.
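The forward calculation is a three-way case analysis on S and can be sketched as follows. This is an illustrative sketch only. The upper-added set (ra, la, sa) is the standard set from Tibetan grammar and is assumed here, since the patent shows its set only in a figure. The function name `forward_components` is hypothetical.

```python
# Assumed set: the standard upper-added letters ra, la, sa.
UPPER_ADDED = {"\u0F62", "\u0F63", "\u0F66"}


def forward_components(chars, base_index):
    """Identify the components that precede the base character.

    S is the distance from the base character to the first character,
    i.e. the number of characters before the base. None means "null".
    """
    head = list(chars[:base_index])
    s = len(head)
    result = {"pre_added": None, "upper_added": None}

    if s == 0:
        return result
    if s == 1:
        # One leading character: upper-added if it is in the set, else pre-added.
        if head[0] in UPPER_ADDED:
            result["upper_added"] = head[0]
        else:
            result["pre_added"] = head[0]
        return result
    if s == 2:
        # Two leading characters: pre-added first, then upper-added.
        result["pre_added"], result["upper_added"] = head
        return result
    raise ValueError("at most two characters may precede the base character")
```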
Full-component identification of horizontal combined characters:
Performing the above "full-component identification of longitudinally superimposed characters" separates out all syllable characters that are not longitudinally superimposed. The full-component identification of a horizontally combined character is determined entirely by the length of the character string. Let i denote the length of the whole syllable character, including the base character; the horizontal character string has 4 composition structures:
(3) when i is 1, the character is a base character (one of the 30 consonant letters) and all other components are null;
(4) when i is 2, the first character is a base character and the second character is a post-added character. The syllable character has the structure "base character + post-added character";
(5) when i is 3, there are 2 structures, "base character + post-added character + re-post-added character" or "pre-added character + base character + post-added character", and the components are identified as follows:
1) when i is 3, judge whether the first character of the syllable character is one of the 5 pre-added characters
Figure BDA0003656754590000151
If not, the first character is a base character, the last 2 characters are respectively a post-added character and a re-post-added character, and the syllable character has the structure "base character + post-added character + re-post-added character";
2) when i is 3, judge whether double consonants appear; if so, the first character is a base character, the last 2 characters are respectively a post-added character and a re-post-added character, and the syllable character has the structure "base character + post-added character + re-post-added character";
3) when i is 3, if the first character is one of the 5 pre-added characters and the syllable character is one of the following 11 syllable characters, the first character is a base character and the last 2 characters are respectively a post-added character and a re-post-added character. Otherwise, the first character is a pre-added character, and the last 2 characters are respectively a base character and a post-added character;
Figure BDA0003656754590000152
4) when i is 4, the 4 characters are respectively a pre-added character, a base character, a post-added character and a re-post-added character, and the syllable character has the structure "pre-added character + base character + post-added character + re-post-added character".
The full-component identification of a horizontally combined character therefore requires only the length i of the horizontal character string; the recognition flow is shown in Fig. 7.
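The length-based rules for horizontally combined characters can be sketched as follows. This is a minimal sketch with several assumptions. The pre-added set is the standard five letters (ga, da, ba, ma, 'a). "Double consonants" is interpreted as the first two characters being identical, which the source does not define. The list of exceptional syllables is shown only in a figure, so it is passed in as the hypothetical parameter `base_first_exceptions`, empty by default.

```python
# Assumed set: the five standard pre-added letters ga, da, ba, ma, 'a.
PRE_ADDED = {"\u0F42", "\u0F51", "\u0F56", "\u0F58", "\u0F60"}


def horizontal_components(syllable, base_first_exceptions=frozenset()):
    """Assign a role to each character of a horizontally combined syllable."""
    i = len(syllable)
    if i == 1:
        roles = ["base"]
    elif i == 2:
        roles = ["base", "post_added"]
    elif i == 3:
        base_first = (
            syllable[0] not in PRE_ADDED          # rule 1: not a pre-added letter
            or syllable[0] == syllable[1]         # rule 2: assumed reading of "double consonants"
            or syllable in base_first_exceptions  # rule 3: listed exceptional syllables
        )
        if base_first:
            roles = ["base", "post_added", "re_post_added"]
        else:
            roles = ["pre_added", "base", "post_added"]
    elif i == 4:
        roles = ["pre_added", "base", "post_added", "re_post_added"]
    else:
        raise ValueError("a horizontally combined syllable has 1 to 4 characters")
    return dict(zip(roles, syllable))
```

Keeping the exceptional syllables as data rather than code means the list can be filled from the patent's table without changing the function.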
Full component identification of special syllable characters:
(6) Among horizontally combined characters with i = 3, 9 syllable characters satisfy both the spelling structure "pre-added character + base character + post-added character" and the spelling structure "base character + post-added character + re-post-added character", so ambiguity easily arises. The specific reason is described in Chapter 3. The 9 special syllable characters are shown in the following table: the 3 characters of the syllable characters numbered 1-3 are respectively a base character, a post-added character and a re-post-added character, and the 3 characters of the syllable characters numbered 4-9 are respectively a pre-added character, a base character and a post-added character.
Figure BDA0003656754590000161
A knowledge base is built from the full-component recognition results of the above 9 special syllable characters. During full-component recognition, whenever one of these character strings is read, its components are output directly. The knowledge base is shown in Table 4-1.
TABLE 4-1 knowledge base of component recognition for special syllable words
Figure BDA0003656754590000162
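The "look up first, otherwise apply the rules" behaviour described for the special syllable characters can be sketched as below. This is an illustration only. The real entries are in Table 4-1, which is not reproduced in this text, so the single key here is a Latin placeholder and not real data. The names `recognise`, `demo_kb` and `demo_rules` are hypothetical.

```python
def recognise(syllable, knowledge_base, rule_based):
    """Return stored components for a special syllable, else fall back to the rules."""
    if syllable in knowledge_base:
        return knowledge_base[syllable]
    return rule_based(syllable)


# Placeholder entry only; the real keys would be the 9 special syllable characters.
demo_kb = {"XYZ": {"base": "X", "post_added": "Y", "re_post_added": "Z"}}


def demo_rules(syllable):
    # Stand-in for the length-based rules: first character treated as pre-added.
    return {"pre_added": syllable[0], "base": syllable[1], "post_added": syllable[2]}
```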
(7) Syllable characters with a re-lower-added character. Syllable characters containing a re-lower-added character occur in Tibetan, although the 7 components of the Tibetan syllable character described in Tibetan grammar do not include the re-lower-added character. The recognition algorithm therefore processes them as special syllable characters. In Tibetan, the re-lower-added character occurs only as the character
Figure BDA0003656754590000163
and it appears in superimposed form. When the full components of such a syllable character are identified, the base character is first located by the base character positioning method for longitudinally superimposed characters; then, with the base character as the boundary, it is judged whether the first following character is the lower-added character
Figure BDA0003656754590000171
or
Figure BDA0003656754590000172
If so, it is judged whether the second following character is
Figure BDA0003656754590000173
That is, when the two characters following the base character are respectively
Figure BDA0003656754590000174
or
Figure BDA0003656754590000175
the syllable character is determined to contain a re-lower-added character, and the recognition result is output. The spelling structure of the syllable character is "base character + lower-added character + re-lower-added character", and the recognition flow is shown in Fig. 8.
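The check for a re-lower-added character can be sketched as follows. The specific characters are shown only in figures, so this sketch assumes the usual Tibetan pattern in which a subjoined ya or ra is followed by a subjoined wa. The set contents and the name `find_re_lower_added` are assumptions for illustration.

```python
# Assumed characters: subjoined ya or ra, followed by subjoined wa.
FIRST_LOWER = {"\u0FB1", "\u0FB2"}
RE_LOWER = "\u0FAD"


def find_re_lower_added(chars, base_index):
    """Return (lower_added, re_lower_added) if both follow the base, else None."""
    tail = chars[base_index + 1:base_index + 3]
    if len(tail) == 2 and tail[0] in FIRST_LOWER and tail[1] == RE_LOWER:
        return tail[0], tail[1]
    return None
```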
(8) Merged words (character strings formed by contracting several words). The particularity of this type of word is described in Chapter 3, on the base character recognition algorithm. First, a contracted word such as
Figure BDA0003656754590000176
is identified and separated; finally, the corresponding full-component identification process for "longitudinally superimposed characters" or "horizontally combined characters" is executed on each separated character string according to its structure.
4.2 Results and analysis
The invention analyzes the structure of modern Tibetan characters in depth through Tibetan grammar and character-formation rules. The analysis shows that the position of the base character among the components of a modern Tibetan syllable character follows definite rules. The algorithm is designed and implemented on the basis of these rules, and a knowledge base is established for the character strings that do not conform to the modern Tibetan character-formation rules, which receive special processing. The algorithm was tested on the 18689 modern Tibetan characters collected so far, and the experimental results show that the component identification accuracy for 18000 syllable characters reaches 100 percent.
The component identification results are shown in the following table 4-2.
TABLE 4-2 modern Tibetan language full component identification results (partial examples)
Figure BDA0003656754590000177
Figure BDA0003656754590000181
The method can not only analyze Tibetan syllable characters effectively, but can also be widely applied in Tibetan information processing fields such as Tibetan dictionary (including phrase) sorting, character statistics, resource construction, spelling checking and Tibetan-Latin transcription.
For example, the application method for character statistics may be:
identify the components of each modern Tibetan syllable character in a Tibetan text using the full-component recognition algorithm, and write the components into a component knowledge base of the Tibetan text;
count, in the component knowledge base:
for each of the base characters
Figure BDA0003656754590000191
Figure BDA0003656754590000192
the number of occurrences;
for each of the vowels
Figure BDA0003656754590000193
the number of occurrences;
for each of the pre-added characters
Figure BDA0003656754590000194
the number of occurrences;
for each of the post-added characters
Figure BDA0003656754590000195
the number of occurrences;
for each of the re-post-added characters
Figure BDA0003656754590000196
the number of occurrences;
for each of the upper-added characters
Figure BDA0003656754590000197
the number of occurrences;
for each of the lower-added characters
Figure BDA0003656754590000198
the number of occurrences;
and count the frequency of the Tibetan characters in the Tibetan text from the numbers of occurrences of each base character, vowel, pre-added character, post-added character, upper-added character and lower-added character.
If the recognized component knowledge base contains re-lower-added characters, the re-lower-added character
Figure BDA0003656754590000199
should also be counted for its number of occurrences.
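The counting step above can be sketched with Python's `collections.Counter`. The sketch assumes that the component knowledge base is a list of dictionaries mapping a component role to a character, with `None` for a null component. That representation and the names `count_components` and `character_frequency` are assumptions, not the patent's data format.

```python
from collections import Counter, defaultdict


def count_components(component_kb):
    """Count, for every component role, how often each character fills it."""
    counts = defaultdict(Counter)
    for components in component_kb:
        for role, char in components.items():
            if char is not None:      # null components are not counted
                counts[role][char] += 1
    return counts


def character_frequency(counts):
    """Sum the per-role counts into an overall character frequency."""
    total = Counter()
    for role_counts in counts.values():
        total.update(role_counts)
    return total
```

Counting per role first and summing afterwards keeps both views available: how often a character occurs as, say, a base character, and how often it occurs overall.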
As another example, the application method for Tibetan-Latin transcription may be:
identify the components of a modern Tibetan syllable character using the full-component recognition algorithm; if the components include a lower-added character, convert the lower-added character into the corresponding consonant character, i.e.
Figure BDA00036567545900001910
are converted respectively into
Figure BDA00036567545900001911
If the components also include a re-lower-added character, convert the re-lower-added character into a consonant character, i.e.
Figure BDA00036567545900001912
is converted into
Figure BDA00036567545900001913
If the components include no vowel, the Latin character "a" is used as the vowel. The Tibetan characters are converted into Latin characters in the order pre-added character → upper-added character → base character → lower-added character → re-lower-added character → vowel → post-added character → re-post-added character. The Tibetan-Latin character conversion table is:
Figure BDA0003656754590000201
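The transcription procedure can be sketched as follows. This is a hedged illustration. The conversion table in the patent is a figure that is not reproduced here, so `LATIN` is a small subset of standard Wylie values chosen for the example. Converting a subjoined letter to its full-form consonant by subtracting 0x50 from the code point relies on the layout of the Unicode Tibetan block. Placing the post-added and re-post-added characters after the vowel is an assumed reading of the stated order. The names `to_nominal` and `transcribe` are hypothetical.

```python
# Assumed component order, with post-added and re-post-added after the vowel.
ORDER = ["pre_added", "upper_added", "base", "lower_added",
         "re_lower_added", "vowel", "post_added", "re_post_added"]

# Illustrative subset of Wylie values; not the patent's conversion table.
LATIN = {
    "\u0F40": "k", "\u0F42": "g", "\u0F51": "d", "\u0F56": "b",
    "\u0F62": "r", "\u0F66": "s", "\u0F60": "'",
    "\u0F72": "i", "\u0F74": "u", "\u0F7A": "e", "\u0F7C": "o",
}


def to_nominal(char):
    """Map a subjoined letter (U+0F90..U+0FB9) to its full-form consonant."""
    if 0x0F90 <= ord(char) <= 0x0FB9:
        return chr(ord(char) - 0x50)
    return char


def transcribe(components, latin=LATIN):
    """Transcribe one syllable from its identified components."""
    out = []
    for role in ORDER:
        char = components.get(role)
        if role == "vowel":
            # A syllable without a written vowel takes the Latin "a".
            out.append(latin[char] if char else "a")
        elif char:
            out.append(latin[to_nominal(char)])
    return "".join(out)
```

With the components of the earlier seven-character example, `transcribe` yields `bsgrubs`; a lone base character ka yields `ka` through the inherent vowel.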

Claims (5)

1. A full-component recognition algorithm for modern Tibetan syllable characters, characterized in that a modern Tibetan syllable character is first subjected to full-component recognition as a longitudinally superimposed character: if a vowel, a lower-added character or an upper-added character exists among its components, the modern Tibetan syllable character is judged to be a longitudinally superimposed character; if no vowel, lower-added character or upper-added character exists among its components, the modern Tibetan syllable character is judged not to be a longitudinally superimposed character and is subjected to full-component recognition as a horizontally combined character;
full-component recognition as a longitudinally superimposed character comprises the following steps:
step 1: positioning the base character of the modern Tibetan syllable character;
step 2: executing a backward algorithm to identify components:
2.1 reading the post string, i.e. the character string following the base character;
2.2 identifying whether the character following the base character is any one of
Figure FDA0003656754580000011
if yes, the components are: the character following the base character is a vowel, and no lower-added character exists;
let the distance from the vowel to the last character of the post string be S:
when S is 0, the components are: no post-added character and no re-post-added character exist;
when S is 1, the components are: the character following the vowel is a post-added character, and no re-post-added character exists;
when S is 2, the components are: the character following the vowel is a post-added character, and the next character is a re-post-added character;
if not, continuing;
2.3 identifying whether the character following the base character is any one of
Figure FDA0003656754580000012
if yes, the components are: the character following the base character is a lower-added character;
let the distance from the lower-added character to the last character of the post string be S:
when S is 0, the components are: no vowel, post-added character or re-post-added character exists;
when S is 1, further identifying whether the character following the lower-added character is any one of
Figure FDA0003656754580000013
if so, the components are: the character following the lower-added character is a vowel, and no post-added character or re-post-added character exists; if not, the components are: the character following the lower-added character is a post-added character, and no vowel or re-post-added character exists;
when S is 2, further identifying whether the character following the lower-added character is any one of
Figure FDA0003656754580000021
if so, the components are: the character following the lower-added character is a vowel, the next character is a post-added character, and no re-post-added character exists; if not, the components are: the character following the lower-added character is a post-added character, the next character is a re-post-added character, and no vowel exists;
when S is 3, the components are: the character following the lower-added character is a vowel, the next character is a post-added character, and the last character is a re-post-added character;
if not, continuing;
step 3: executing a forward algorithm to identify components:
3.1 reading the front string, i.e. the character string preceding the base character;
3.2 let the distance from the base character to the first character of the front string be S:
when S is 0, the components are: no pre-added character and no upper-added character exist;
when S is 1, further identifying whether the character preceding the base character is any one of
Figure FDA0003656754580000022
if so, the components are: the character preceding the base character is an upper-added character, and no pre-added character exists;
if not, the components are: the character preceding the base character is a pre-added character, and no upper-added character exists;
when S is 2, the components are: the character preceding the base character is an upper-added character, and the character before that is a pre-added character;
full-component recognition as a horizontally combined character comprises the following steps:
reading the horizontal character string of the modern Tibetan syllable character, and letting the length of the horizontal character string be i:
when i is 1, the component is: a base character;
when i is 2, the components are: the first character is a base character, and the second character is a post-added character;
when i is 3, further identifying:
if the first character of the horizontal character string is not any one of
Figure FDA0003656754580000023
the components are: the first character is a base character, the second character is a post-added character, and the third character is a re-post-added character;
if the horizontal character string is any one of
Figure FDA0003656754580000031
Figure FDA0003656754580000032
the components are: the first character is a base character, the second character is a post-added character, and the third character is a re-post-added character; otherwise, the components are: the first character is a pre-added character, the second character is a base character, and the third character is a post-added character;
if the horizontal character string has double consonants, the components are: the first character is a base character, the second character is a post-added character, and the third character is a re-post-added character;
when i is 4, the components are: the first character is a pre-added character, the second character is a base character, the third character is a post-added character, and the fourth character is a re-post-added character.
2. The full-component recognition algorithm for modern Tibetan syllable characters of claim 1, further comprising step 2.4:
2.4 identifying whether the character following the base character is any one of
Figure FDA0003656754580000033
if yes, the components are: the character following the base character is a lower-added character;
further identifying whether the character following the lower-added character is
Figure FDA0003656754580000034
if so, the components are: the character following the lower-added character is a re-lower-added character; if not, the components are: no re-lower-added character exists;
if not, continuing.
3. The method as claimed in claim 1, wherein the full-component recognition algorithm is used to identify the components of each modern Tibetan syllable character in a Tibetan text and to write the components into a component knowledge base of the Tibetan text;
counting, in the component knowledge base:
for each of the base characters
Figure FDA0003656754580000035
Figure FDA0003656754580000036
the number of occurrences;
for each of the vowels
Figure FDA0003656754580000037
the number of occurrences;
for each of the pre-added characters
Figure FDA0003656754580000041
the number of occurrences;
for each of the post-added characters
Figure FDA0003656754580000042
the number of occurrences;
for each of the re-post-added characters
Figure FDA0003656754580000043
the number of occurrences;
for each of the upper-added characters
Figure FDA0003656754580000044
the number of occurrences;
for each of the lower-added characters
Figure FDA0003656754580000045
the number of occurrences;
and counting the frequency of the Tibetan characters in the Tibetan text from the numbers of occurrences of each base character, vowel, pre-added character, post-added character, upper-added character and lower-added character.
4. The method as claimed in claim 2, wherein the full-component recognition algorithm is used to identify the components of each modern Tibetan syllable character in a Tibetan text and to write the components into a component knowledge base of the Tibetan text;
counting, in the component knowledge base:
for each of the base characters
Figure FDA0003656754580000046
Figure FDA0003656754580000047
the number of occurrences;
for each of the vowels
Figure FDA0003656754580000048
the number of occurrences;
for each of the pre-added characters
Figure FDA0003656754580000049
the number of occurrences;
for each of the post-added characters
Figure FDA00036567545800000410
the number of occurrences;
for each of the re-post-added characters
Figure FDA00036567545800000411
the number of occurrences;
for each of the upper-added characters
Figure FDA00036567545800000412
the number of occurrences;
for each of the lower-added characters
Figure FDA00036567545800000413
the number of occurrences;
for the re-lower-added character
Figure FDA00036567545800000414
the number of occurrences;
and counting the frequency of the Tibetan characters in the Tibetan text from the numbers of occurrences of each base character, vowel, pre-added character, post-added character, upper-added character, lower-added character and re-lower-added character.
5. The method as claimed in claim 2, wherein the full-component recognition algorithm is used to identify the components of a modern Tibetan syllable character; if the components include a lower-added character, the lower-added character is converted into the corresponding consonant character, i.e.
Figure FDA0003656754580000051
are converted respectively into
Figure FDA0003656754580000052
if the components include a re-lower-added character, the re-lower-added character is converted into a consonant character, i.e.
Figure FDA0003656754580000053
is converted into
Figure FDA0003656754580000054
if the components include no vowel, the Latin character "a" is used as the vowel; the Tibetan characters are converted into Latin characters in the order pre-added character → upper-added character → base character → lower-added character → re-lower-added character → vowel → post-added character → re-post-added character;
the Tibetan-Latin character conversion table is as follows:
Figure FDA0003656754580000055
CN202210561495.XA 2022-05-23 2022-05-23 Full-component recognition algorithm for modern Tibetan syllable characters Pending CN115050034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210561495.XA CN115050034A (en) 2022-05-23 2022-05-23 Full-component recognition algorithm for modern Tibetan syllable characters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210561495.XA CN115050034A (en) 2022-05-23 2022-05-23 Full-component recognition algorithm for modern Tibetan syllable characters

Publications (1)

Publication Number Publication Date
CN115050034A (en) 2022-09-13

Family

ID=83158957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210561495.XA Pending CN115050034A (en) 2022-05-23 2022-05-23 Full-component recognition algorithm for modern Tibetan syllable characters

Country Status (1)

Country Link
CN (1) CN115050034A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination