CN1883959A - Compression method for words and phonetic alphabet in English electronic dictionary data - Google Patents
- Publication number
- CN1883959A (application number CN 200510043866)
- Authority
- CN
- China
- Prior art keywords
- string
- word
- alphabetic
- dictionary
- alphabetic string
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Machine Translation (AREA)
Abstract
A method for compressing words and phonetic symbols in an English electronic dictionary. A letter string table A is compiled according to how often letter strings occur in the words of the dictionary, such that each letter string in table A appears in at least one word, and a position code is determined for each letter string. For each letter string a in table A, all of its possible pronunciations in the dictionary are found and collected into a phonetic symbol string set aB. The occurrences of each of these phonetic symbol strings are counted to obtain its probability, and the strings are ranked into a probability sequence, most probable first. The position code of each letter string a, together with the position in that sequence of the phonetic symbol string for the letter string's actual pronunciation, is stored in the English electronic dictionary. The method achieves highly efficient phonetic symbol compression, with a compression ratio of 15%-18%.
Description
(1) Technical field
The present invention relates to a compression method for dictionary data, and in particular to a method for compressing the words and phonetic symbols in English electronic dictionary data.
(2) Background
English is the common language of international communication. Its vocabulary is very large, and its pronunciation is especially hard to master: the same letter string is pronounced differently in different words, which makes it particularly difficult for native Chinese speakers. To address this, electronic dictionaries were invented: English words are entered and stored in the device, and the user retrieves a word by entering a prefix, a root, or a letter string. To fit more information into the limited memory of an electronic dictionary, a method for compressing the words in electronic dictionary data was devised: the letters and letter strings that occur in the dictionary are tabulated and coded according to their number of occurrences, giving 256 letter strings in total, which greatly increases the amount of information that fits in the limited memory. This method has two problems, however: the words are compressed only slightly, and, more importantly, no coding or compression is applied to the phonetic symbols of the words, so the large amount of space the phonetic symbols occupy remains unaddressed.
(3) Summary of the invention
To overcome the deficiencies of the prior art, the invention provides a method for compressing words and phonetic symbols in English electronic dictionary data that is soundly designed, easy to use, achieves a high compression ratio, and codes and compresses both the words and the phonetic symbols.
An English dictionary has two characteristics: first, many letter strings recur across the different words of a dictionary; second, the phonetic symbols of a word are strongly correlated with the word itself.
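The first characteristic can be illustrated with a small counting sketch. The word list, the `max_len` cutoff, and the choice to count each letter string once per word are assumptions made for illustration; the patent does not state how occurrences are counted.

```python
from collections import Counter

def count_letter_strings(words, max_len=4):
    """Count, for each substring of length 1..max_len, how many words contain it."""
    counts = Counter()
    for word in words:
        seen = set()  # count each letter string at most once per word (assumption)
        for start in range(len(word)):
            for length in range(1, max_len + 1):
                if start + length <= len(word):
                    seen.add(word[start:start + length])
        counts.update(seen)
    return counts

# Hypothetical toy word list, not taken from the patent.
words = ["nation", "station", "ration", "cat"]
counts = count_letter_strings(words)
print(counts["tion"], counts["at"], counts["cat"])
```

Letter strings with high counts, such as "tion" here, are natural candidates for a shared table.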
Based on these characteristics, the following compression method is adopted:
(1) According to the number of times letter strings occur in the words of the English dictionary, compile a letter string table A such that every word in the dictionary can be formed by concatenating one or more letter strings from table A and every letter string in table A appears in at least one word of the dictionary; the table contains 213 letter strings in total;
(2) For each letter string a in table A, find the set aB of phonetic symbol strings for all of its possible pronunciations in the dictionary. If a word in the dictionary is d = a1 a2 ... an, then in each aiB a phonetic symbol string bi (i = 1, ..., n) can be found such that the phonetic transcription of d is b1 b2 ... bn. Conversely, for any phonetic symbol string b in the set aB of any letter string a in table A, a word d = ... a ... can be found in the dictionary in which a is pronounced b;
(3) Treat the letter strings in table A as the letters of the dictionary, count the probability with which each letter string occurs in the words of the dictionary, and Huffman-code the letter strings according to these probabilities, thereby determining a position code for each letter string;
(4) For each letter string a in table A, count how many times the phonetic symbol string of each of its possible pronunciations occurs in the phonetic transcriptions of the words, derive from these counts the probability of each phonetic symbol string, and arrange the phonetic symbol strings of a into a probability sequence ordered by that probability, most probable first;
(5) The phonetic transcription of a word is then determined by the position codes of the word's letter strings together with, for each letter string, the position in its probability sequence of the phonetic symbol string for its actual pronunciation;
(6) The probability of each position code is the sum, over all letter strings, of the probabilities with which the phonetic symbol string at that position occurs in the phonetic transcriptions of the dictionary;
(7) For each letter string a, store in the English electronic dictionary both its position code and the position in the probability sequence of the phonetic symbol string for its actual pronunciation.
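Steps (1), (2), (4), (5) and (7) can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the aligned toy dictionary `ALIGNED`, the function names, and the use of a plain table index in place of the Huffman position code of step (3) are all assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical toy dictionary: each word is already split into
# (letter string, phonetic symbol string) pairs. Not data from the patent.
ALIGNED = {
    "cat":  [("c", "k"), ("a", "æ"), ("t", "t")],
    "cot":  [("c", "k"), ("o", "ɒ"), ("t", "t")],
    "can":  [("c", "k"), ("a", "æ"), ("n", "n")],
    "city": [("c", "s"), ("i", "ɪ"), ("t", "t"), ("y", "i")],
    "it":   [("i", "ɪ"), ("t", "t")],
    "ice":  [("i", "aɪ"), ("c", "s"), ("e", "")],
}

def build_tables(aligned):
    """Return (letter string table, ranked pronunciations per letter string)."""
    letter_counts = Counter()
    pron_counts = defaultdict(Counter)
    for pairs in aligned.values():
        for letters, phon in pairs:
            letter_counts[letters] += 1
            pron_counts[letters][phon] += 1
    # Table A: most frequent letter strings first (ties broken alphabetically).
    table = sorted(letter_counts, key=lambda s: (-letter_counts[s], s))
    # Probability sequence: most frequent pronunciation first.
    ranks = {
        letters: sorted(c, key=lambda p: (-c[p], p))
        for letters, c in pron_counts.items()
    }
    return table, ranks

def encode(pairs, table, ranks):
    """Encode a word as (table index, pronunciation rank) pairs."""
    return [(table.index(letters), ranks[letters].index(phon))
            for letters, phon in pairs]

table, ranks = build_tables(ALIGNED)
code = encode(ALIGNED["city"], table, ranks)
print(table)
print(code)
```

In this toy data "c" is pronounced "k" three times and "s" twice, so "k" takes rank 0 and "s" rank 1; the transcription of "city" is then stored as a short list of small integers rather than as phonetic symbols.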
In practice, each letter string has fewer than 4 phonetic symbol strings on average, so the method compresses phonetic symbols very efficiently: the compression ratio is 15%-18%, whereas Huffman-coding the phonetic symbols directly gives a compression ratio of 60%.
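The reason a small number of pronunciations per letter string helps can be sketched by applying step (6) and then Huffman-coding the positions. The per-position counts in `RANK_COUNTS` are hypothetical and are not intended to reproduce the 15%-18% figure; the function names are likewise assumptions.

```python
import heapq
from itertools import count

# Hypothetical occurrence counts of each letter string's pronunciations,
# listed by position in its probability sequence (position 0 = most frequent).
RANK_COUNTS = {
    "c": [30, 10],
    "a": [25, 8, 5, 2],
    "t": [15, 2, 2, 1],
}

def position_totals(rank_counts):
    """Step (6): sum the counts at each position over all letter strings."""
    totals = {}
    for counts_by_rank in rank_counts.values():
        for position, n in enumerate(counts_by_rank):
            totals[position] = totals.get(position, 0) + n
    return totals

def huffman_code_lengths(weights):
    """Return {symbol: Huffman code length in bits} for the given weights."""
    if len(weights) == 1:
        return {next(iter(weights)): 1}
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), [sym]) for sym, w in weights.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(weights, 0)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:  # every merged symbol moves one level deeper
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), syms1 + syms2))
    return lengths

totals = position_totals(RANK_COUNTS)
lengths = huffman_code_lengths(totals)
avg_bits = sum(totals[p] * lengths[p] for p in totals) / sum(totals.values())
print(totals, lengths, avg_bits)
```

Because most occurrences fall on position 0, the position of a pronunciation costs only slightly more than one bit on average in this example.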
In use, for a given English word, first determine which letter strings make up the word, then enter the position codes of those letter strings into the electronic dictionary; the dictionary displays the phonetic symbol strings for the word's actual pronunciation, thereby determining how the word is pronounced.
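Lookup is the reverse of encoding. The sketch below assumes the stored entry for a word is a list of (letter string index, pronunciation position) pairs; `LETTER_TABLE`, `PRON_RANKS` and `decode` are hypothetical names and toy data, not part of the patent.

```python
# Hypothetical letter string table and probability sequences.
LETTER_TABLE = ["c", "t", "i", "a", "y"]
PRON_RANKS = {
    "c": ["k", "s"],
    "t": ["t"],
    "i": ["ɪ", "aɪ"],
    "a": ["æ"],
    "y": ["i"],
}

def decode(entry):
    """Rebuild (spelling, phonetic transcription) from stored code pairs."""
    spelling, phonetics = [], []
    for table_index, position in entry:
        letters = LETTER_TABLE[table_index]
        spelling.append(letters)
        phonetics.append(PRON_RANKS[letters][position])
    return "".join(spelling), "".join(phonetics)

word, transcription = decode([(0, 1), (2, 0), (1, 0), (4, 0)])
print(word, transcription)
```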
The table of word letter strings and phonetic symbol strings is given below.
[Table: word letter strings and phonetic symbol strings]
(4) Specific embodiment
Embodiment: the invention compresses the words in an English dictionary using the following method:
(1) According to the number of times letter strings occur in the words of the English dictionary, compile a letter string table A such that every word in the dictionary can be formed by concatenating one or more letter strings from table A and every letter string in table A appears in at least one word of the dictionary; the table contains 213 letter strings in total;
(2) For each letter string a in table A, find the set aB of phonetic symbol strings for all of its possible pronunciations in the dictionary. If a word in the dictionary is d = a1 a2 ... an, then in each aiB a phonetic symbol string bi (i = 1, ..., n) can be found such that the phonetic transcription of d is b1 b2 ... bn. Conversely, for any phonetic symbol string b in the set aB of any letter string a in table A, a word d = ... a ... can be found in the dictionary in which a is pronounced b;
(3) Treat the letter strings in table A as the letters of the dictionary, count the probability with which each letter string occurs in the words of the dictionary, and Huffman-code the letter strings according to these probabilities, thereby determining a position code for each letter string;
(4) For each letter string a in table A, count how many times the phonetic symbol string of each of its possible pronunciations occurs in the phonetic transcriptions of the words, derive from these counts the probability of each phonetic symbol string, and arrange the phonetic symbol strings of a into a probability sequence ordered by that probability, most probable first;
(5) The phonetic transcription of a word is then determined by the position codes of the word's letter strings together with, for each letter string, the position in its probability sequence of the phonetic symbol string for its actual pronunciation;
(6) The probability of each position code is the sum, over all letter strings, of the probabilities with which the phonetic symbol string at that position occurs in the phonetic transcriptions of the dictionary;
(7) For each letter string a, store in the English electronic dictionary both its position code and the position in the probability sequence of the phonetic symbol string for its actual pronunciation.
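Step (1) requires every word to be expressible as a concatenation of table entries, but the text does not say how a word is split when several splits are possible. One plausible choice, shown here purely as an assumption, is a dynamic-programming split into the fewest table entries; `TABLE` is a hypothetical fragment, not the patent's 213-entry table.

```python
# Hypothetical fragment of a letter string table.
TABLE = {"s", "t", "a", "i", "o", "n", "st", "tion", "ation"}

def segment(word, table):
    """Split word into the fewest table entries; return None if impossible."""
    max_len = max(map(len, table))
    best = [None] * (len(word) + 1)  # best[i] = fewest-piece split of word[:i]
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if best[start] is not None and piece in table:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(word)]

pieces = segment("station", TABLE)
print(pieces)
```

Fewer pieces means fewer position codes to store per word, which is why this sketch minimizes the piece count; other criteria, such as minimizing total code length, would be equally consistent with the text.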
The method compresses phonetic symbols very efficiently: the compression ratio is 15%-18%, whereas Huffman-coding the phonetic symbols directly gives a compression ratio of 60%.
In use, for a given English word, first determine which letter strings make up the word, then enter the position codes of those letter strings into the electronic dictionary; the dictionary displays the phonetic symbol strings for the word's actual pronunciation, thereby determining how the word is pronounced.
Claims (1)
1. A method for compressing words and phonetic symbols in English electronic dictionary data, characterized in that the following compression method is adopted:
(1) According to the number of times letter strings occur in the words of the English dictionary, compile a letter string table A such that every word in the dictionary can be formed by concatenating one or more letter strings from table A and every letter string in table A appears in at least one word of the dictionary; the table contains 213 letter strings in total;
(2) For each letter string a in table A, find the set aB of phonetic symbol strings for all of its possible pronunciations in the dictionary. If a word in the dictionary is d = a1 a2 ... an, then in each aiB a phonetic symbol string bi (i = 1, ..., n) can be found such that the phonetic transcription of d is b1 b2 ... bn. Conversely, for any phonetic symbol string b in the set aB of any letter string a in table A, a word d = ... a ... can be found in the dictionary in which a is pronounced b;
(3) Treat the letter strings in table A as the letters of the dictionary, count the probability with which each letter string occurs in the words of the dictionary, and Huffman-code the letter strings according to these probabilities, thereby determining a position code for each letter string;
(4) For each letter string a in table A, count how many times the phonetic symbol string of each of its possible pronunciations occurs in the phonetic transcriptions of the words, derive from these counts the probability of each phonetic symbol string, and arrange the phonetic symbol strings of a into a probability sequence ordered by that probability, most probable first;
(5) The phonetic transcription of a word is then determined by the position codes of the word's letter strings together with, for each letter string, the position in its probability sequence of the phonetic symbol string for its actual pronunciation;
(6) The probability of each position code is the sum, over all letter strings, of the probabilities with which the phonetic symbol string at that position occurs in the phonetic transcriptions of the dictionary;
(7) For each letter string a, store in the English electronic dictionary both its position code and the position in the probability sequence of the phonetic symbol string for its actual pronunciation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200510043866 CN1883959A (en) | 2005-06-21 | 2005-06-21 | Compression method for words and phonetic alphabet in English electronic dictionary data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 200510043866 CN1883959A (en) | 2005-06-21 | 2005-06-21 | Compression method for words and phonetic alphabet in English electronic dictionary data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN1883959A (en) | 2006-12-27 |
Family
ID=37582305
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 200510043866 Pending CN1883959A (en) | 2005-06-21 | 2005-06-21 | Compression method for words and phonetic alphabet in English electronic dictionary data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1883959A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102033859A (en) * | 2009-09-28 | 2011-04-27 | 佳能株式会社 | Method and system for compressing dictionary and processing words, text-to-speed system and electronic equipment |
CN109002454A (en) * | 2018-04-28 | 2018-12-14 | 陈逸天 | A kind of method and electronic equipment for combining subregion into syllables of determining target word |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102033859A (en) * | 2009-09-28 | 2011-04-27 | 佳能株式会社 | Method and system for compressing dictionary and processing words, text-to-speed system and electronic equipment |
CN102033859B (en) * | 2009-09-28 | 2013-04-10 | 佳能株式会社 | Method and system for compressing dictionary and processing words, text-to-speed system and electronic equipment |
CN109002454A (en) * | 2018-04-28 | 2018-12-14 | 陈逸天 | A kind of method and electronic equipment for combining subregion into syllables of determining target word |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101694601A (en) | Zero-memory Chinese character coding input method | |
WO2010043117A1 (en) | Digital encoding method and application thereof | |
CN1883959A (en) | Compression method for words and phonetic alphabet in English electronic dictionary data | |
CN101739142B (en) | Five-stroke input system and method | |
CN1169041C (en) | Pronunciation and shape phonetic transcription Chinese character input method | |
CN101046707A (en) | Input method for Chinese character of first pronunciation | |
CN1700156A (en) | Method for linking phrases in Chinese character input method | |
CN1164985C (en) | Chinese-character 'Sound-shape code' input method for computer | |
CN1063555C (en) | Three-D, three-codes method for inputting Chinese words and characters combined | |
CN1885242A (en) | Chinese character input method capable of reducing candidate characters: stroke coding and phonetic initial letter | |
CN1106146A (en) | Computer input method by computer Chinese-character phonology-tone coding and its keyboard | |
CN1048561C (en) | Chinese character input method for computer | |
CN1203391C (en) | Left and right pictophonetic and digital computer input method for Chinese character and its keyboard | |
CN1122913C (en) | Normal encoding input method for Chinese data processing in computer | |
CN1419179A (en) | Chinese characters input method according to stroke sequence and keyboard thereof | |
CN1178121C (en) | Double Chinese character stroke order-radical input system | |
CN1036359C (en) | Chinese characters Fanqie encoding input method for computer | |
CN1503112A (en) | Chinese character coding, searching and input method | |
CN1199888A (en) | Dictionary code as one Chinese character input method | |
CN1405660A (en) | Chinese character input method | |
CN1248014A (en) | Computer Chinese input method of component first and last code and its keyboard | |
CN1388430A (en) | Modern Chinese pronunciation input method | |
CN86103506A (en) | " a key diadic " keyboard and China and foreign countries' characters rapid input method | |
CN113761835A (en) | Method for converting Chinese pinyin into Braille ASCII codes | |
CN1828496A (en) | Chinese character stroke input method in network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |