CN115050034A - Full-component recognition algorithm for modern Tibetan syllable characters - Google Patents
Full-component recognition algorithm for modern Tibetan syllable characters
- Publication number
- CN115050034A (application CN202210561495.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/24—Character recognition characterised by the processing or recognition method
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a full-component recognition algorithm for modern Tibetan syllable characters. If a vowel, a lower-added character or an upper-added character exists among the components of a modern Tibetan syllable character, the syllable character is judged to be a longitudinally superimposed character and full-component recognition is performed on it as such; if no vowel, lower-added character or upper-added character exists, the syllable character is judged not to be a longitudinally superimposed character and full-component recognition is performed on it as a horizontally combined character. The method analyzes the complex structure of Tibetan syllable characters, which belong to a two-dimensional alphabetic script, and effectively improves their full-component recognition. A full-component recognition test on the 18689 syllable characters of the latest available statistics gave a recognition rate of 100%, which provides a basic guarantee for the overall development of Tibetan information processing.
Description
Technical Field
The invention relates to the technical field of information processing, in particular to a full-component recognition algorithm for modern Tibetan syllable characters.
Background
Modern Tibetan script is an alphabetic writing system in everyday use by its community of users. It is a planar script formed by spelling in two directions, longitudinal superposition and horizontal combination. Syllable characters that conform to the grammar and spelling rules of modern Tibetan are called modern Tibetan characters. According to the latest statistics, 18689 Tibetan syllable characters conform to the modern Tibetan character-formation rules. A modern Tibetan character is composed of up to 7 components: prefix, upper-added character, base character, lower-added character, vowel, suffix and second suffix. The base character is the core component of a Tibetan character and is indispensable; the other components vary from character to character. Unicode does not assign separate codes to components at different positions of a Tibetan syllable character. In other words, a given consonant letter may serve as a base character, a prefix, a suffix and so on, yet its code is the same whatever position it occupies and whatever role it plays, which makes Tibetan syllable characters difficult to analyze.
Disclosure of Invention
The invention aims to provide a full-component recognition algorithm for modern Tibetan syllable characters that takes the base character as the core and determines the other components of a syllable character from the length of its character string.
The technical scheme for realizing the purpose of the invention is as follows:
In the full-component recognition algorithm for modern Tibetan syllable characters, if a vowel, a lower-added character or an upper-added character exists among the components of a modern Tibetan syllable character, the syllable character is judged to be a longitudinally superimposed character and full-component recognition is performed on it as such; if no vowel, lower-added character or upper-added character exists, the syllable character is judged not to be a longitudinally superimposed character and full-component recognition is performed on it as a horizontally combined character;
full-component recognition of a longitudinally superimposed character comprises the following steps:
step 1: locating the base character of the modern Tibetan syllable character;
step 2: executing the backward algorithm to identify components:
2.1 reading the character string after the base character;
2.2 identifying whether the character following the base character is one of the vowels;
if yes, the components are: the character following the base character is a vowel, and no lower-added character exists;
let the distance from the vowel to the last character of the following string be S:
when S is 0, the components are: no suffix and no second suffix exist;
when S is 1, the components are: the character following the vowel is a suffix, and no second suffix exists;
when S is 2, the components are: the character following the vowel is a suffix, and the character after that is a second suffix;
if not, continuing;
2.3 identifying whether the character following the base character is one of the lower-added characters;
if yes, the components are: the character following the base character is a lower-added character;
let the distance from the lower-added character to the last character of the following string be S:
when S is 0, the components are: no vowel, suffix or second suffix exists;
when S is 1, further identifying whether the character following the lower-added character is one of the vowels:
if so, the components are: the character following the lower-added character is a vowel, and no suffix or second suffix exists;
if not, the components are: the character following the lower-added character is a suffix, and no vowel or second suffix exists;
when S is 2, further identifying whether the character following the lower-added character is one of the vowels:
if so, the components are: the character following the lower-added character is a vowel, the character after that is a suffix, and no second suffix exists; if not, the components are: the character following the lower-added character is a suffix, the character after that is a second suffix, and no vowel exists;
when S is 3, the components are: the character following the lower-added character is a vowel, the next character is a suffix, and the last character is a second suffix;
if not, continuing;
step 3: executing the forward algorithm to identify components:
3.1 reading the character string before the base character;
3.2 letting the distance from the base character to the first character of the preceding string be S:
when S is 0, the components are: no prefix and no upper-added character exist;
when S is 1, further identifying whether the character preceding the base character is one of the upper-added characters:
if so, the components are: the character preceding the base character is an upper-added character, and no prefix exists;
if not, the components are: the character preceding the base character is a prefix, and no upper-added character exists;
when S is 2, the components are: the character preceding the base character is an upper-added character, and the character before that is a prefix;
full-component recognition of a horizontally combined character comprises the following steps:
reading the horizontal character string of the modern Tibetan syllable character and letting its length be i:
when i is 1, the component is: a base character;
when i is 2, the components are: the first character is a base character, and the second character is a suffix;
when i is 3, further identifying:
if the first character of the horizontal string is not one of the prefixes, the components are: the first character is a base character, the second character is a suffix, and the third character is a second suffix;
if the first character is one of the prefixes and the horizontal string is one of the listed special syllable characters, the components are: the first character is a base character, the second character is a suffix, and the third character is a second suffix; otherwise, the components are: the first character is a prefix, the second character is a base character, and the third character is a suffix;
if the horizontal string has double consonants, the components are: the first character is a base character, the second character is a suffix, and the third character is a second suffix;
when i is 4, the components are: the first character is a prefix, the second character is a base character, the third character is a suffix, and the fourth character is a second suffix.
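The length-based rules for horizontally combined characters can be sketched as follows. This is a minimal illustration, not the patented implementation. The function name, the role labels and the `special_syllables` parameter are hypothetical. The prefix set is the five prefix letters of traditional Tibetan grammar (ga, da, ba, ma, 'a), and "double consonants" is assumed here to mean that the first two letters are identical, since the text does not define the term.

```python
# Assumption: the five prefix letters of traditional Tibetan grammar
# (ga, da, ba, ma, 'a), written as Unicode escapes.
PREFIXES = frozenset("\u0f42\u0f51\u0f56\u0f58\u0f60")


def recognize_horizontal(syllable, special_syllables=frozenset()):
    """Return one role label per character of a horizontally combined syllable.

    special_syllables stands in for the table of ambiguous three-letter
    syllables whose base character is the first letter; the caller must
    supply it (hypothetical parameter).
    """
    i = len(syllable)
    if i == 1:
        return ["base"]
    if i == 2:
        return ["base", "suffix"]
    if i == 3:
        base_first = (
            syllable[0] not in PREFIXES         # first letter cannot be a prefix
            or syllable[0] == syllable[1]       # assumed meaning of "double consonants"
            or syllable in special_syllables    # listed exception
        )
        if base_first:
            return ["base", "suffix", "second_suffix"]
        return ["prefix", "base", "suffix"]
    if i == 4:
        return ["prefix", "base", "suffix", "second_suffix"]
    raise ValueError("a horizontally combined syllable has 1 to 4 characters")
```

For instance, a three-letter string that starts with a prefix letter and is not in the exception set is labelled prefix, base character, suffix, while the same string passed with itself in `special_syllables` is labelled base character, suffix, second suffix.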
A further technical scheme also comprises step 2.4:
if the character following the base character is a lower-added character, the components are: the character following the base character is a lower-added character;
further identifying whether the character following the lower-added character is the second lower-added character: if so, the components are: the character following the lower-added character is a second lower-added character; if not, the components are: no second lower-added character exists;
if not, continuing.
The invention also provides methods of using the full-component recognition algorithm for modern Tibetan syllable characters.
One method of use is as follows: the components of each modern Tibetan syllable character in a Tibetan text are identified by the full-component recognition algorithm and written into a component knowledge base of the Tibetan text;
statistics are then computed over the component knowledge base:
the frequency with which each character occurs as base character, vowel, prefix, suffix, upper-added character and lower-added character in the Tibetan text is counted.
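The counting step can be sketched as follows, assuming that the component knowledge base is a list of records, one per syllable character, that map a role name to the character filling that role (or `None` when the role is empty). The record layout and the function name are hypothetical; the text does not specify how the knowledge base is stored.

```python
from collections import Counter


def count_components(knowledge_base):
    """Count how often each character fills each role.

    knowledge_base: iterable of dicts such as
    {"base": "\u0f56", "vowel": "\u0f7c", "suffix": "\u0f51"} (assumed layout).
    Returns a Counter keyed by (role, character).
    """
    counts = Counter()
    for record in knowledge_base:
        for role, char in record.items():
            if char is not None:  # empty roles are not counted
                counts[(role, char)] += 1
    return counts
```

Keying the counter by the pair (role, character) keeps the same letter separate when it occurs as a base character and as a suffix, which is the distinction the component knowledge base is meant to preserve.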
Another method of use is as follows: the components of a modern Tibetan syllable character are identified by the full-component recognition algorithm; if the components include a lower-added character, it is first converted into the corresponding consonant letter; if the components include an upper-added character, it is converted into the corresponding consonant letter; if the components include no vowel, the Latin letter "a" is used as the vowel; the Tibetan character is then converted into Latin letters in the order prefix → upper-added character → base character → lower-added character → second lower-added character → vowel → suffix → second suffix;
the Tibetan language-Latin character conversion table is as follows:
The beneficial effects of the invention are as follows:
1. Taking the base character of Tibetan as the core, the base character of a syllable character is located according to the two-dimensional structural characteristics of Tibetan syllable characters; with the base character as the boundary, the full components of a longitudinally superimposed character are identified by forward and backward derivation, and the full components of a horizontally combined character are identified from the length of the whole character string.
2. The method analyzes the complex structure of Tibetan syllable characters, which belong to a two-dimensional alphabetic script, and effectively improves their full-component recognition. A full-component recognition test on the 18689 syllable characters of the latest available statistics gave a recognition rate of 100%, which provides a basic guarantee for the overall development of Tibetan information processing.
Drawings
FIG. 1 is a diagram of the syllable structure of the modern Tibetan language.
Fig. 2a and 2b are schematic diagrams of the writing sequence of each component of modern Tibetan syllable characters.
FIG. 3 is a flow chart of base character positioning for a longitudinally superimposed character.
FIG. 4 is a flow chart of base character positioning for a horizontally combined character.
FIG. 5 is a flow chart of the backward calculation for a longitudinally superimposed character.
FIG. 6 is a flow chart of the forward calculation for a longitudinally superimposed character.
FIG. 7 is a flow chart of full-component recognition of a horizontally combined character.
FIG. 8 is a flow chart of the identification of the second lower-added character in a longitudinally superimposed character.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
1. Structure of modern Tibetan syllable characters and their spelling grammar
1.1 Structural elements of modern Tibetan syllable characters
The processing of Tibetan characters is a precondition for the development of Tibetan information processing. Its most basic object of study to date is the component elements of Tibetan characters, which are the minimum units forming Tibetan syllable characters. These minimum units comprise 30 consonant letters, 4 vowels, 5 prefixes, 10 suffixes, 2 second suffixes, 3 upper-added characters and 4 lower-added characters, for a total of 58 component elements.
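The inventory can be written down as sets and the total checked, as in the sketch below. The letter sets are not taken from this text; they are the standard sets of traditional Tibetan grammar, given as Unicode code points, and the lower-added characters are listed in their subjoined forms. The dictionary name and keys are hypothetical.

```python
def chars(*code_points):
    """Build a frozenset of characters from Unicode code points."""
    return frozenset(chr(cp) for cp in code_points)


# Assumption: standard letter sets of traditional Tibetan grammar.
INVENTORY = {
    "consonants": chars(
        0x0F40, 0x0F41, 0x0F42, 0x0F44, 0x0F45, 0x0F46, 0x0F47, 0x0F49,
        0x0F4F, 0x0F50, 0x0F51, 0x0F53, 0x0F54, 0x0F55, 0x0F56, 0x0F58,
        0x0F59, 0x0F5A, 0x0F5B, 0x0F5D, 0x0F5E, 0x0F5F, 0x0F60, 0x0F61,
        0x0F62, 0x0F63, 0x0F64, 0x0F66, 0x0F67, 0x0F68,
    ),
    "vowels": chars(0x0F72, 0x0F74, 0x0F7A, 0x0F7C),            # i, u, e, o
    "prefixes": chars(0x0F42, 0x0F51, 0x0F56, 0x0F58, 0x0F60),  # ga da ba ma 'a
    "suffixes": chars(0x0F42, 0x0F44, 0x0F51, 0x0F53, 0x0F56,
                      0x0F58, 0x0F60, 0x0F62, 0x0F63, 0x0F66),
    "second_suffixes": chars(0x0F51, 0x0F66),                   # da, sa
    "upper_added": chars(0x0F62, 0x0F63, 0x0F66),               # ra, la, sa
    "lower_added": chars(0x0FB1, 0x0FB2, 0x0FB3, 0x0FAD),       # subjoined ya ra la wa
}

TOTAL = sum(len(members) for members in INVENTORY.values())  # 30+4+5+10+2+3+4
```

The total counts positional roles, not distinct letters: the same letter is counted once for each role it can fill, which is how 30 consonants and 4 vowels yield 58 component elements.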
All the structural elements of modern Tibetan syllable characters are generated from the 30 consonant letters and 4 vowel letters. In Unicode, however, the 4 lower-added characters, and base characters written in reduced form, have codes different from those of the corresponding unreduced letters.
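The difference between the two codes can be illustrated with the layout of the Unicode Tibetan block, in which the subjoined (reduced) letters U+0F90 to U+0FB9 run parallel to the full letters U+0F40 to U+0F69 at a fixed offset of 0x50. The sketch below relies on that layout; the function name is hypothetical.

```python
SUBJOINED_OFFSET = 0x50  # U+0F40..U+0F69 map onto U+0F90..U+0FB9


def reduced_form(letter):
    """Return the subjoined (reduced) code point for a full-form consonant."""
    cp = ord(letter)
    if not 0x0F40 <= cp <= 0x0F69:
        raise ValueError("not a full-form Tibetan consonant")
    return chr(cp + SUBJOINED_OFFSET)


def full_form(letter):
    """Return the full-form consonant for a subjoined (reduced) code point."""
    cp = ord(letter)
    if not 0x0F90 <= cp <= 0x0FB9:
        raise ValueError("not a subjoined Tibetan consonant")
    return chr(cp - SUBJOINED_OFFSET)
```

For example, the letter ya is U+0F61 in full form and U+0FB1 as a lower-added character, so a component analysis has to treat the two code points as the same underlying letter.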
1.2 General structure of modern Tibetan syllable characters
Decomposition of Tibetan characters into components is the foundation and precondition of Tibetan information processing and is essential groundwork for its healthy and rapid development. The base character occupies a central place in a single Tibetan syllable, and its position can be used to determine the other components. A Tibetan syllable character is an alphabetic unit of 1 to 7 characters written in the order prefix, upper-added character, base character, lower-added character, vowel, suffix, second suffix; the prefix, base character, suffix and second suffix are spelled horizontally, while the upper-added character, base character, lower-added character and vowel are spelled by longitudinal superposition. The components of Tibetan characters are complex and varied, which is the key difficulty of Tibetan character information processing. Once an algorithm has decomposed Tibetan characters into single-syllable components, the result is of great significance and practical value for Tibetan character sorting, character statistics, automatic proofreading systems and similar applications. Modern Tibetan syllable characters are combined in a fixed writing order, as shown in FIG. 1.
1.2.1 Longitudinal superposition structure
Longitudinal superposition stacks the upper-added character, the lower-added character and the vowel on the base character, in the pattern upper-added character + base character + lower-added character + vowel; whether a prefix, suffix or second suffix is also present depends on the character. Example structures: upper-added character + base character + lower-added character; added character + base character + vowel; base character + lower-added character + vowel; upper-added character + base character + lower-added character + vowel; prefix + base character + lower-added character + vowel + suffix + second suffix.
1.2.2 Horizontal combination structure
Horizontal combination takes the base character as the core and adds a prefix before it and a suffix and second suffix after it, depending on the character. There are generally 5 structures: (1) a syllable formed by a single consonant; (2) base character + suffix; (3) prefix + base character + suffix; (4) base character + suffix + second suffix; (5) prefix + base character + suffix + second suffix.
1.3 Spelling rules of modern Tibetan syllable characters
The Tibetan character is a spelling structure with a consonant letter, the base character, as its core. The other letters attach to the base character, combining horizontally before and after it and superimposing vertically above and below it, to form a complete syllable character. In general, a Tibetan character contains at least one consonant letter, the base character, and at most 6 consonant letters, with vowel signs added above or below the consonant letters. The core consonant letter is called the base character, and the other letters are named according to the position at which they are added to it. Tibetan is spelled syllable by syllable; syllables are separated from one another by a separator, and a terminating mark is placed at the end.
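Separating a text into syllable characters at the separator can be sketched as follows. The sketch assumes that the separator is the Unicode tsheg (U+0F0B) and that the terminating mark is the shad (U+0F0D); the text does not name the code points, and the function name is hypothetical.

```python
TSHEG = "\u0f0b"  # assumed syllable separator
SHAD = "\u0f0d"   # assumed terminating mark


def split_syllables(text):
    """Split Tibetan text into syllable strings at tsheg and shad marks."""
    normalized = text.replace(SHAD, TSHEG)
    pieces = (piece.strip() for piece in normalized.split(TSHEG))
    return [piece for piece in pieces if piece]
```

Each returned string is the run of characters between two separators, which is the unit on which base character positioning and component recognition operate.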
In summary, a Tibetan syllable character is composed of up to 7 components at different positions. According to the traditional grammatical description of Tibetan, syllables are separated by the syllable point "·", so the string of characters between two syllable points is treated as one complete Tibetan syllable character. Analysis of real modern Tibetan text shows, however, that Tibetan syllable characters do not follow this simple spelling structure alone: when several syllable characters form a phrase or sentence, two syllables may be abbreviated into one syllable character, which produces a second spelling structure. In some character strings, two base characters and two vowels appear at different positions within one syllable character, and in more complicated structures 3 base characters and 3 vowels appear in one string. The analysis therefore shows that the maximum length of a syllable character in actual modern Tibetan text is not necessarily limited to 6 consonant letters and 1 vowel; Tibetan syllable characters can be generalized into two structural forms. FIG. 2a and FIG. 2b show the basic writing rules of the two structural forms, where the numbers in the drawings indicate the writing order of the components that form one Tibetan syllable character.
(1) The general Tibetan writing order:
upper-added character → base character → lower-added character → vowel → suffix (second suffix); that is, the Unicode ordering, the writing order and the phoneme parsing order are the same.
(2) Writing order of contracted syllable characters containing two syllables:
upper-added character → base character → lower-added character → vowel → following base character → following vowel (that is, the case of contracted words).
2. Collocation rules of the components of modern Tibetan syllable characters
2.1 Addition rules for prefixes
The third prefix in Table 2-1 may be placed directly before the 10 base characters in the first row but not directly before the 6 base characters in the second row; it may precede those 6 base characters only when they carry superimposed characters.
Table 2-1 Addition rules for prefixes
2.2 Add rules for Add words
The rules for adding additional words to the Tibetan syllable words are shown in the following table 2-2.
Table 2-2 addition table for adding words
2.3 Add rules for Add-Down words
The rules are shown in Table 2-3. The 5th character in the table often appears in many corpora in the position of a suffix or second suffix, and is therefore summarized here together with the suffix addition rules.
Tables 2-3 addition tables for lower-case words
2.4 Addition rules for second suffixes
The rules for adding the second suffix of Tibetan syllable characters are shown in Table 2-4; a second suffix appears only after a suffix and never directly after the base character.
Table 2-4 Addition table for second suffixes
3. Base character positioning algorithm for modern Tibetan syllable characters
3.1 Design of the base character positioning algorithm
The algorithm locates the base character from the characteristics of the two structures of Tibetan syllable characters, longitudinal superposition and horizontal combination; that is, a full-component recognition method for Tibetan syllable characters based on base character positioning is proposed. The specific steps are as follows:
(1) First, the collocation rules of modern Tibetan syllable characters are defined according to traditional Tibetan grammar.
Exclusion method: to simplify the objects to be recognized, the character string read in is preprocessed, and strings that contain non-Tibetan characters or do not have the structure of a modern Tibetan syllable character are excluded.
(2) The Tibetan syllable character to be processed is read, and it is judged whether the codes of the character string begin with 0F; this determines whether the characters are Tibetan.
(3) If the codes begin with 0F, the next step is performed directly: judging whether the characters are modern Tibetan characters. Codes conforming to modern Tibetan are sought among the codes 0F00-0FDA of the 211 Tibetan characters encoded in Unicode 6.2. On analysis of these codes, the symbol and digit codes that do not belong to modern Tibetan characters (0F20-0F29, 0F00-0F3F, 0F80-0F8F and 0FC0-0FDA) are filtered out uniformly, which excludes character strings that do not have the structure of a modern Tibetan syllable character.
(4) It is further judged whether the string is a modern Tibetan syllable character by applying (1), that is, by checking against the defined collocation rules whether it conforms to the formation rules of modern Tibetan syllables.
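The exclusion steps (2) and (3) amount to a code-point range filter, sketched below with the ranges quoted in the text. The function and constant names are hypothetical, and the collocation-rule check of step (4) is not included because the rule tables are not reproduced here.

```python
TIBETAN_FIRST, TIBETAN_LAST = 0x0F00, 0x0FDA

# Ranges the text filters out as not belonging to modern Tibetan characters.
# The digit range 0F20-0F29 quoted in the text lies inside 0F00-0F3F.
EXCLUDED_RANGES = ((0x0F00, 0x0F3F), (0x0F80, 0x0F8F), (0x0FC0, 0x0FDA))


def is_candidate_syllable(string):
    """Return True if every character passes the Unicode range filter."""
    if not string:
        return False
    for char in string:
        cp = ord(char)
        if not TIBETAN_FIRST <= cp <= TIBETAN_LAST:
            return False  # not a Tibetan code (does not begin with 0F)
        if any(low <= cp <= high for low, high in EXCLUDED_RANGES):
            return False  # symbol, digit or other excluded code
    return True
```

A string that passes this filter consists only of code points in 0F40-0F7F and 0F90-0FBF, which is where the consonant letters, vowel signs and subjoined letters are encoded; it still has to be checked against the collocation rules.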
Base character positioning for longitudinally superimposed characters:
(5) If the string conforms to the syllable structure of modern Tibetan, it is judged whether a vowel, an upper-added character or a lower-added character exists in the syllable character, and the base character is located from its distance to the superimposed characters. The base character positioning flow for longitudinally superimposed characters is shown in FIG. 3.
Base character positioning for horizontally combined characters:
(6) If the syllable contains no superimposed character, the horizontal string length of the syllable is determined. The horizontal string length (syllable length) takes 4 values, and the base character positioning flow for horizontally combined characters is shown in FIG. 4.
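The choice between the two positioning flows can be sketched as a single test for superimposed characters. The sketch assumes standard Unicode encoding, in which lower-added characters and base characters written under an upper-added character are stored as subjoined letters (U+0F90 to U+0FBC), so that one range test covers both cases; the vowel set is the four vowel signs of traditional grammar. The function name is hypothetical.

```python
# Assumption: the four vowel signs i, u, e, o.
VOWELS = frozenset("\u0f72\u0f74\u0f7a\u0f7c")


def is_longitudinally_superimposed(syllable):
    """Return True if the syllable contains a vowel sign or a subjoined letter.

    A subjoined letter is either a lower-added character or a base character
    written beneath an upper-added character (assumed encoding).
    """
    return any(
        char in VOWELS or 0x0F90 <= ord(char) <= 0x0FBC
        for char in syllable
    )
```

Syllables for which this returns False contain only full-form consonant letters and are handled by the length-based flow for horizontally combined characters.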
When i is 3, 2 different structures can occur, "base character + suffix + second suffix" or "prefix + base character + suffix", and the base character is located as follows:
A. Judge whether the first character of the syllable character is one of the 5 prefixes; if not, the base character is the first character.
B. If double consonants appear, the base character is the first character.
C. If the first character is one of the 5 prefixes and the syllable character is one of the 11 syllable characters in the following table, the base character is the first character; otherwise, the base character is the second character.
Base character positioning for special syllable characters:
(6) Analysis of the component rules and statistics of modern Tibetan syllables shows that, among Tibetan syllables formed by horizontally combining three characters (i is 3), 9 syllables in total satisfy both the spelling rule "prefix + base character + suffix" and the spelling rule "base character + suffix + second suffix", so ambiguity arises. Their base characters were determined from the character-formation rules of traditional Tibetan grammar, combined with various modern Tibetan dictionaries, frequencies of occurrence in a statistical corpus, and the knowledge of expert scholars. The base characters of these 9 special syllables are shown in the following table:
In addition, the above 9 syllable characters merely conform to the Tibetan character-formation rules; their frequency of occurrence in current Tibetan texts and in many modern Tibetan dictionaries is very low, so they may be left unprocessed.
(7) Besides the general structures of modern Tibetan, some characters that do not conform to the spelling rules of modern Tibetan but often appear in modern Tibetan text can now be regarded as modern Tibetan. Their base characters are located as follows:
A. Base character positioning with a second lower-added character. The base character of the syllable is determined from the distance between the lower-added characters and the base character; the base character is the first character.
B. Base character positioning of merged characters (character strings of contracted words). Merged characters in abbreviated form occur frequently in Tibetan texts. Their base characters cannot be located directly by the base character positioning method for longitudinally superimposed characters or for horizontally combined characters. The contracted word is first identified and the character string is separated into two or more character strings; the base character positioning methods for longitudinally superimposed and horizontally combined characters are then applied to each string to determine its base character. Such a character string is not a single Tibetan syllable character but a combination of two or more single-syllable characters (it can be regarded as a Tibetan contracted word), so two or more base characters are identified in it.
C. Base character positioning of Tibetan characters of Sanskrit origin. Such character strings follow the character-formation pattern of Sanskrit-derived Tibetan characters. Sanskrit-derived characters have no concept of upper-added and lower-added characters, so the whole stacked character is taken as the base character.
4. Full-component recognition algorithm for modern Tibetan syllable characters
4.1 Design of the full-component recognition algorithm
The preceding section introduced base character positioning for Tibetan syllable characters and gave specific algorithms for locating this core component in the 2 structures of Tibetan syllable characters, longitudinal superposition and horizontal combination. This section builds on base character positioning and proposes a recognition algorithm for the other components of a Tibetan syllable character, which combine dynamically. The specific recognition steps and algorithms are as follows:
Full-component recognition of longitudinally superimposed characters:
In the full-component recognition of a longitudinally superimposed character, the base character is taken as the boundary and the positions of the other components are determined by a backward calculation and a forward calculation. A component whose position is empty is not present.
(1) Performing the backward calculation:
First, it is judged whether the character following the base character is one of the vowels. If yes, a vowel exists and no lower-added character exists, and the first step is executed.
The first step: let the distance from the vowel to the last character be S = x (0 ≤ x ≤ 2). When S = 0, the suffix and second suffix are both empty; when S = 1, the suffix exists and the second suffix is empty; when S = 2, both the suffix and the second suffix exist.
If not (no vowel follows the base character), the second step is performed.
The second step: judge whether the character following the base character is one of the lower-added characters. If yes, a lower-added character exists; let the distance from the lower-added character to the last character be S = x (0 ≤ x ≤ 3). When S = 0, the vowel, suffix and second suffix are all empty; when S = 3, the vowel, suffix and second suffix all exist. When S = 1 or S = 2, it must further be judged whether the character following the lower-added character is one of the vowels. When S = 1, 2 structures can occur, "… + lower-added character + vowel" or "… + lower-added character + suffix": if the character is a vowel, the vowel exists and the suffix and second suffix are empty; if it is not a vowel, the vowel and second suffix are empty and the suffix exists. When S = 2, 2 structures can occur, "… + lower-added character + vowel + suffix" or "… + lower-added character + suffix + second suffix": if the first character after the lower-added character is a vowel, the vowel and suffix exist and the second suffix is empty; if it is not a vowel, the vowel is empty and the suffix and second suffix exist. The backward calculation then ends.
The flow of the backward calculation, with the base character as the boundary, is shown in FIG. 5.
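The backward calculation can be condensed into the sketch below, which consumes an optional lower-added character, then an optional vowel, and assigns whatever remains to the suffix and second suffix. This yields the same assignments as the case analysis on S, but it is a restatement under assumptions and not the flow of FIG. 5: the base character index is taken as given, the vowel and lower-added sets are the standard sets of traditional grammar in their Unicode forms, and the function name and role labels are hypothetical.

```python
# Assumptions: vowel signs i, u, e, o and subjoined ya, ra, la, wa.
VOWELS = frozenset("\u0f72\u0f74\u0f7a\u0f7c")
LOWER_ADDED = frozenset("\u0fb1\u0fb2\u0fb3\u0fad")


def backward(syllable, base_index):
    """Identify the components that follow the base character."""
    parts = {"lower_added": None, "vowel": None,
             "suffix": None, "second_suffix": None}
    rest = syllable[base_index + 1:]
    if rest and rest[0] in LOWER_ADDED:
        parts["lower_added"] = rest[0]
        rest = rest[1:]
    if rest and rest[0] in VOWELS:
        parts["vowel"] = rest[0]
        rest = rest[1:]
    if len(rest) > 2:
        raise ValueError("too many characters after the base character")
    if rest:
        parts["suffix"] = rest[0]
    if len(rest) == 2:
        parts["second_suffix"] = rest[1]
    return parts
```

With a lower-added character present, the number of characters left after it corresponds to S in the text: an S of 3 can only be vowel, suffix and second suffix, and an S of 1 or 2 is resolved by whether the first of those characters is a vowel.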
(2) Performing a forward calculation:
firstly, judging the distance between the base character and the most front character as S ═ x (x is more than or equal to 0 and less than or equal to 2), wherein when S ═ 0, the base character and the most front character both represent null; when S is equal to 1 (2 structures appear: "add-before + base + …" or "add-on + base …"), the previous character of the base is added with the wordWhen S is equal to 1, and the character preceding the base character is not in the upper character, the upper character is emptyWhen any one of the two is selected, the word is shown to exist, and the upper added word is empty; when S is 2, it means that both the add word and the pre-add word exist simultaneously.
The flow of the forward calculation, taking the base character as the boundary, is shown in FIG. 6. Structural analysis of Tibetan syllable characters shows that, in the component identification of longitudinally superimposed characters, the only components that can precede the base character are the pre-added character and the upper-added character.
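The forward calculation described above can be sketched in the same way. This is an illustration under stated assumptions: the function name `forward`, the dictionary keys, and the `UPPER_ADDED` set are not from the source (its character list was lost in extraction); the set below holds the three upper-added letters of standard Tibetan orthography.

```python
# Assumed set of upper-added characters: ra, la, sa (full-form letters).
UPPER_ADDED = {"\u0f62", "\u0f63", "\u0f66"}


def forward(pre):
    """Label the characters that precede the base character.

    `pre` is the string before the base character. None stands for a
    null component.
    """
    s = len(pre)  # distance S from the base character to the first character
    if s > 2:
        raise ValueError("S must satisfy 0 <= S <= 2")
    parts = {"pre_added": None, "upper_added": None}
    if s == 1:
        # A single preceding character is either upper-added or pre-added.
        role = "upper_added" if pre[0] in UPPER_ADDED else "pre_added"
        parts[role] = pre[0]
    elif s == 2:
        # Both exist: the pre-added character comes first.
        parts["pre_added"], parts["upper_added"] = pre[0], pre[1]
    return parts
```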
Full-component identification of horizontal combined characters:
Performing the "full-component recognition of longitudinally superimposed characters" described above separates out every syllable character that is not longitudinally superimposed. The full-component recognition of a horizontally combined character depends entirely on the length of the whole character string. Let i be the length of the horizontal string (the length of the whole syllable character, including the base character); there are 4 composition structures:
(3) when i = 1, the character is a base character (one of the 30 consonant letters) and all other components are null;
(4) when i = 2, the first character is a base character and the second character is a post-added character. The syllable character structure is "base character + post-added character";
(5) when i = 3, there are 2 structures, "base character + post-added character + re-post-added character" or "pre-added character + base character + post-added character", and the components are identified as follows:
1) when i = 3, judge whether the first character of the syllable character is one of the 5 pre-added characters. If not, the first character is a base character and the last 2 characters are a post-added character and a re-post-added character respectively; the syllable character structure is "base character + post-added character + re-post-added character";
2) when i = 3, judge whether double consonants appear. If so, the first character is a base character and the last 2 characters are a post-added character and a re-post-added character respectively; the syllable character structure is "base character + post-added character + re-post-added character";
3) when i = 3, if the first character is one of the 5 pre-added characters and the syllable character is one of the following 11 syllable characters, the first character is a base character and the last 2 characters are a post-added character and a re-post-added character respectively. Otherwise, the first character is a pre-added character and the last 2 characters are a base character and a post-added character respectively;
4) when i = 4, the 4 characters are a pre-added character, a base character, a post-added character and a re-post-added character respectively; the syllable character structure is "pre-added character + base character + post-added character + re-post-added character".
The full-component recognition of a horizontally combined character therefore needs only the length i of the horizontal character string; the recognition flow is shown in FIG. 7.
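The length-based rules for horizontally combined characters can be sketched as below. Several things here are assumptions, not statements from the source: the function name `horizontal`, the `PRE_ADDED` set (the five prefix letters of standard Tibetan orthography, since the source's list was lost), the parameter `base_first_exceptions` (a hypothetical stand-in for the special syllable list, which is also missing), and the reading of "double consonants" as the first two characters being identical.

```python
# Assumed set of the 5 pre-added characters: ga, da, ba, ma, 'a.
PRE_ADDED = {"\u0f42", "\u0f51", "\u0f56", "\u0f58", "\u0f60"}

BASE_FIRST = ("base", "post_added", "re_post_added")
PRE_FIRST = ("pre_added", "base", "post_added")


def horizontal(syllable, base_first_exceptions=frozenset()):
    """Label the characters of a horizontally combined syllable by length.

    `base_first_exceptions` is a hypothetical parameter holding the
    3-character syllables that start with a pre-added letter but are
    nevertheless read as base + post-added + re-post-added.
    """
    i = len(syllable)
    if i == 1:
        labels = ("base",)
    elif i == 2:
        labels = ("base", "post_added")
    elif i == 3:
        base_first = (
            syllable[0] not in PRE_ADDED
            or syllable[0] == syllable[1]  # assumed meaning of "double consonants"
            or syllable in base_first_exceptions
        )
        labels = BASE_FIRST if base_first else PRE_FIRST
    elif i == 4:
        labels = ("pre_added", "base", "post_added", "re_post_added")
    else:
        raise ValueError("a horizontal string has length 1 to 4")
    return dict(zip(labels, syllable))
```

Passing the exception list as an argument keeps the length logic separate from the knowledge base of special syllables, so the list can be filled in from the patent's table without touching the function.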
Full component identification of special syllable characters:
(6) Among horizontally combined characters with i = 3, 9 syllable characters satisfy both the spelling structure "pre-added character + base character + post-added character" and the spelling structure "base character + post-added character + re-post-added character", so ambiguity arises easily. The specific reason is described in chapter three. The 9 special syllable characters are shown in the following table: the 3 characters of the syllable characters numbered 1-3 are a base character, a post-added character and a re-post-added character respectively, and the 3 characters of the syllable characters numbered 4-9 are a pre-added character, a base character and a post-added character respectively.
A knowledge base is built from the full-component recognition results of these 9 special syllable characters. During full-component recognition, whenever one of these strings is read, its components are output directly. The knowledge base is shown in Table 4-1.
TABLE 4-1 knowledge base of component recognition for special syllable words
(7) Syllable characters containing a re-lower-added character. Tibetan has syllable characters that take a re-lower-added character (a second lower-added character), although the 7 components of a syllable character described in Tibetan grammar do not include it. The recognition algorithm therefore treats them as special syllable characters. In Tibetan only certain characters take a re-lower-added character, and it always appears in stacked form. When all components of such a syllable character are identified, the base character is first located by the base character positioning method for longitudinally superimposed characters. Then, taking the base character as the boundary, judge whether the first character after it is one of the two lower-added characters concerned; if so, judge whether the second character is the re-lower-added character. That is, when the two characters after the base character form one of the two qualifying pairs, the syllable character is determined to contain a re-lower-added character and the recognition result is output. The spelling structure of such a syllable character is "base character + lower-added character + re-lower-added character", and the recognition flow is shown in FIG. 8.
(8) Compact characters (strings in which characters are merged). The particular nature of this type of character is described in chapter three, on the base character recognition algorithm. The compact character is first separated; then, according to the structure of each separated character, the corresponding full-component recognition process for longitudinally superimposed characters or horizontally combined characters is executed.
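The re-lower-added check in item (7) amounts to testing the two characters that follow the base character. The sketch below is illustrative only: the function name `split_re_lower` and both default character sets are assumptions, because the characters named in the source were lost in extraction. The defaults use subjoined ya or ra followed by subjoined wa, which is the usual second-subscript pattern in Tibetan.

```python
# Assumed defaults: subjoined ya / ra as the first lower-added character,
# subjoined wa as the re-lower-added character.
FIRST_LOWER = frozenset({"\u0fb1", "\u0fb2"})
RE_LOWER = "\u0fad"


def split_re_lower(post, first_lower=FIRST_LOWER, re_lower=RE_LOWER):
    """Detect "lower-added + re-lower-added" right after the base character.

    `post` is the string after the base character. Returns the tuple
    (lower_added, re_lower_added, remainder), or None when the pattern
    is absent.
    """
    if len(post) >= 2 and post[0] in first_lower and post[1] == re_lower:
        return post[0], post[1], post[2:]
    return None
```

A `None` result means the syllable carries no re-lower-added character and can go through the ordinary backward calculation.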
4.2 Results and analysis
The invention analyzes the structure of modern Tibetan characters in depth through Tibetan grammar and word-formation rules. The analysis shows that the position of the base character among the components of a modern Tibetan syllable character follows definite rules. The algorithm is designed and implemented on the basis of these rules, and a knowledge base is established for the character strings that do not conform to modern Tibetan word-formation rules so that they can be handled specially. The algorithm was tested on the 18689 modern Tibetan syllable characters currently available from the latest statistics, and the experimental results show that the component identification accuracy over these 18689 syllable characters reaches 100%.
The component identification results are shown in Table 4-2 below.
TABLE 4-2 modern Tibetan language full component identification results (partial examples)
The method not only analyzes Tibetan syllable characters effectively, but can also be applied widely in Tibetan information processing, including Tibetan dictionary (and phrase) sorting, character statistics, resource construction, spelling checking and Tibetan-Latin transcription.
For example, the method can be applied to character statistics as follows:
the components of each modern Tibetan syllable character in a Tibetan text are identified with the full-component recognition algorithm and written into a component knowledge base of the Tibetan text;
then, over the component knowledge base:
the frequency of the Tibetan characters in the Tibetan text is counted according to the number of times each character appears as a base character, a vowel, a pre-added character, a post-added character, an upper-added character and a lower-added character.
If the component knowledge base contains re-lower-added characters, the number of occurrences of the re-lower-added character should also be counted.
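The counting step can be sketched with `collections.Counter`. The record layout (one dict per syllable, mapping a component role to its character, with None for a null component), the role names, and the function name `count_components` are assumptions made for this illustration; the source does not specify how the component knowledge base is stored.

```python
from collections import Counter

# Hypothetical role names for the components of a syllable character.
ROLES = ("pre_added", "upper_added", "base", "lower_added",
         "re_lower_added", "vowel", "post_added", "re_post_added")


def count_components(knowledge_base):
    """Count how often each character occurs in each component role.

    `knowledge_base` is an iterable of dicts, one per syllable character.
    Returns (per_role, total): a Counter per role, and one Counter with
    the overall frequency of each character across all roles.
    """
    per_role = {role: Counter() for role in ROLES}
    for components in knowledge_base:
        for role, char in components.items():
            if char is not None:  # skip null components
                per_role[role][char] += 1
    total = sum(per_role.values(), Counter())
    return per_role, total
```

Keeping a separate counter per role preserves the distinction the text draws between, say, a letter used as a base character and the same letter used as a pre-added character, while `total` gives the plain character frequency.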
As another example, the method can be applied to Tibetan-Latin transcription as follows:
the components of a modern Tibetan syllable character are identified with the full-component recognition algorithm; if the components include a lower-added character, the lower-added character is converted into its corresponding consonant character; if the components also include a re-lower-added character, the re-lower-added character is likewise converted into its corresponding consonant character; if the components include no vowel, the Latin character "a" is used as the vowel; the Tibetan characters are converted into Latin characters in the order pre-added character → upper-added character → base character → lower-added character → re-lower-added character → vowel → post-added character → re-post-added character; the Tibetan-Latin character conversion table is:
Claims (5)
1. A full-component recognition algorithm for modern Tibetan syllable characters, characterized in that a modern Tibetan syllable character is first subjected to full-component recognition as a longitudinally superimposed character: if a vowel, a lower-added character or an upper-added character exists among its components, the modern Tibetan syllable character is judged to be a longitudinally superimposed character; if no vowel, lower-added character or upper-added character exists among its components, the modern Tibetan syllable character is judged not to be a longitudinally superimposed character and is subjected to full-component recognition as a horizontally combined character;
the full-component recognition of a longitudinally superimposed character comprises the following steps:
step 1: locating the base character of the modern Tibetan syllable character;
step 2: executing the backward algorithm to identify components:
2.1 reading the character string after the base character (the post string);
2.2 identifying whether the character after the base character is any one of the vowels:
if so, its components are: the character after the base character is the vowel, and no lower-added character exists;
let the distance from the vowel to the last character of the post string be S:
when S = 0, its components are: no post-added character and no re-post-added character exist;
when S = 1, its components are: the character after the vowel is the post-added character, and no re-post-added character exists;
when S = 2, its components are: the character after the vowel is the post-added character, and the character after that is the re-post-added character;
if not, continuing;
2.3 identifying whether the character after the base character is any one of the lower-added characters:
if so, its components are: the character after the base character is the lower-added character;
let the distance from the lower-added character to the last character of the post string be S:
when S = 0, its components are: no vowel, post-added character or re-post-added character exists;
when S = 1, further identifying whether the character after the lower-added character is any one of the vowels: if so, its components are: the character after the lower-added character is the vowel, and no post-added character or re-post-added character exists; if not, its components are: the character after the lower-added character is the post-added character, and no vowel or re-post-added character exists;
when S = 2, further identifying whether the character after the lower-added character is any one of the vowels:
if so, its components are: the character after the lower-added character is the vowel, the character after that is the post-added character, and no re-post-added character exists; if not, its components are: the character after the lower-added character is the post-added character, the character after that is the re-post-added character, and no vowel exists;
when S = 3, its components are: the character after the lower-added character is the vowel, the next character is the post-added character, and the last character is the re-post-added character;
if not, continuing;
step 3: executing the forward algorithm to identify components:
3.1 reading the character string before the base character (the front string);
3.2 letting the distance from the base character to the first character of the front string be S:
when S = 0, its components are: no pre-added character and no upper-added character exist;
when S = 1, further identifying whether the character before the base character is any one of the upper-added characters:
if so, its components are: the character before the base character is the upper-added character, and no pre-added character exists;
if not, its components are: the character before the base character is the pre-added character, and no upper-added character exists;
when S = 2, its components are: the character before the base character is the upper-added character, and the character before that is the pre-added character;
the full-component recognition of a horizontally combined character comprises the following steps:
reading the horizontal character string of the modern Tibetan syllable character, and letting the length of the horizontal character string be i:
when i = 1, its component is: a base character;
when i = 2, its components are: the first character is the base character, and the second character is the post-added character;
when i = 3, further identifying:
if the first character of the horizontal character string is not any one of the pre-added characters, its components are: the first character is the base character, the second character is the post-added character, and the third character is the re-post-added character;
if the horizontal character string is any one of the specified syllable characters, its components are: the first character is the base character, the second character is the post-added character, and the third character is the re-post-added character; otherwise, its components are: the first character is the pre-added character, the second character is the base character, and the third character is the post-added character;
if the horizontal character string has double consonants, its components are: the first character is the base character, the second character is the post-added character, and the third character is the re-post-added character;
when i = 4, its components are: the first character is the pre-added character, the second character is the base character, the third character is the post-added character, and the fourth character is the re-post-added character.
2. The full-component recognition algorithm for modern Tibetan syllable characters of claim 1, further comprising step 2.4: identifying whether the character after the base character is a lower-added character:
if so, its components are: the character after the base character is the lower-added character;
further identifying whether the character after the lower-added character is the re-lower-added character: if so, its components are: the character after the lower-added character is the re-lower-added character; if not, its components are: no re-lower-added character exists;
if not, continuing.
3. The method as claimed in claim 1, wherein the full-component recognition algorithm is used to identify the components of each modern Tibetan syllable character in a Tibetan text and write them into a component knowledge base of the Tibetan text;
then, over the component knowledge base:
the frequency of the Tibetan characters in the Tibetan text is counted according to the number of times each character appears as a base character, a vowel, a pre-added character, a post-added character, an upper-added character and a lower-added character.
4. The method as claimed in claim 2, wherein the full-component recognition algorithm is used to identify the components of each modern Tibetan syllable character in a Tibetan text and write them into a component knowledge base of the Tibetan text;
then, over the component knowledge base:
the frequency of the Tibetan characters in the Tibetan text is counted according to the number of times each character appears as a base character, a vowel, a pre-added character, a post-added character, an upper-added character, a lower-added character and a re-lower-added character.
5. The method as claimed in claim 2, wherein the full-component recognition algorithm is used to identify the components of a modern Tibetan syllable character; if the components include a lower-added character, the lower-added character is converted into its corresponding consonant character; if the components also include a re-lower-added character, the re-lower-added character is likewise converted into its corresponding consonant character; if the components include no vowel, the Latin character "a" is used as the vowel; the Tibetan characters are converted into Latin characters in the order pre-added character → upper-added character → base character → lower-added character → re-lower-added character → vowel → post-added character → re-post-added character;
the Tibetan-Latin character conversion table is as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210561495.XA CN115050034A (en) | 2022-05-23 | 2022-05-23 | Full-component recognition algorithm for modern Tibetan syllable characters |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115050034A (en) | 2022-09-13 |
Family
ID=83158957
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210561495.XA Pending CN115050034A (en) | 2022-05-23 | 2022-05-23 | Full-component recognition algorithm for modern Tibetan syllable characters |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115050034A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||