CN1883959A - Compression method for words and phonetic alphabet in English electronic dictionary data - Google Patents

Info

Publication number
CN1883959A
Authority
CN
China
Prior art keywords
string
word
alphabetic
dictionary
alphabetic string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200510043866
Other languages
Chinese (zh)
Inventor
容毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Machine Translation (AREA)

Abstract

A method of compressing words and phonetic symbols in an English electronic dictionary comprises: arranging a letter-string table A according to the number of times letter strings occur in the words of the English dictionary, such that each letter string in table A occurs in at least one word of the dictionary, and determining a position code for each letter string; finding, for each letter string a in table A, all of its possible pronunciations in the dictionary and establishing a phonetic string set aB; counting the occurrences of the phonetic strings of all possible pronunciations of each letter string a to obtain their probabilities of occurrence, and arranging them in a probability sequence with the more probable first; and storing in the English electronic dictionary, for each letter string a, its position code and the position in the sequence of the phonetic string of its actual pronunciation. The method compresses phonetic symbols very efficiently, with a compression ratio of 15%-18%.

Description

Compression method for words and phonetic symbols in English electronic dictionary data
(1) Technical field
The invention relates to a compression method for dictionary data, and in particular to a compression method for words and phonetic symbols in English electronic dictionary data.
(2) Background art
English is the common language of international communication. Its vocabulary is very large and its pronunciation is particularly hard to master: the same letter string is pronounced differently in different words, which makes it especially difficult for native Chinese speakers. To address this, electronic dictionaries were invented: English words are entered and stored in the electronic dictionary, and in use a person can enter a prefix, a root, or a letter string to retrieve the word being looked up. To increase the amount of information that fits in the limited memory of an electronic dictionary, a compression method for the words in electronic dictionary data was invented: the letters and letter strings that occur in the dictionary are tabulated and coded according to their number of occurrences, 256 letter strings in total, which greatly increases the amount of information the limited memory can hold. This compression method has the following problems, however: the words are compressed only slightly, and in particular no coding or compression is applied to the phonetic symbols of the words, so the problem that phonetic symbols take up a large share of the electronic dictionary's space remains unsolved.
(3) Summary of the invention
To overcome the deficiencies of the prior art, the invention provides a soundly designed, easy-to-use compression method with a high compression ratio that encodes and compresses both the words and the phonetic symbols in English electronic dictionary data.
An English dictionary has two characteristics: first, many letter strings recur across the different words of a dictionary; second, a word and its phonetic transcription are strongly correlated.
Based on these characteristics, the following compression method is adopted:
(1) Arrange a letter-string table A according to the number of times letter strings occur in the words of the English dictionary, such that every word in the dictionary can be formed by concatenating one or more letter strings from table A and every letter string in table A occurs in at least one word of the dictionary; table A contains 213 letter strings in total.
(2) For each letter string a in table A, find the set aB of phonetic strings for all of its possible pronunciations in the dictionary, such that: if a word in the dictionary is d = a1 a2 ... an, then each set aiB contains a phonetic string bi (i = 1, ..., n) such that the phonetic transcription of d is b1 b2 ... bn; and, conversely, for any phonetic string b in the set aB of any letter string a in table A, the dictionary contains a word d = ... a ... in which a is pronounced b.
(3) Treat the letter strings in table A in turn as the letters of the dictionary's alphabet, compute the probability with which each letter string occurs in the words of the dictionary, and apply Huffman coding to the letter strings according to these probabilities, which determines a position code for each letter string.
(4) For each letter string a in table A, count the occurrences in the word transcriptions of the phonetic strings for all of its possible pronunciations, derive the probability with which each of these phonetic strings occurs, and arrange the phonetic strings of a in a probability sequence ordered by that probability, with the most probable first.
(5) The phonetic transcription of a word is then determined by the position codes of the word's letter strings together with, for each of these letter strings, the position in its probability sequence of the phonetic string for its actual pronunciation.
(6) The probability of each position code is the sum, over all letter strings, of the probabilities with which the phonetic string at that position occurs in the phonetic transcriptions of the dictionary.
(7) For each letter string a, store its position code and the position in the probability sequence of the phonetic string for its actual pronunciation separately in the English electronic dictionary.
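Steps (2), (4), (5) and (7) can be illustrated with a minimal Python sketch. The toy lexicon, its hand-made segmentation, and the names LEXICON, build_rank_tables and encode_phonetics are assumptions made for illustration; they are not taken from the patent, whose actual table of 213 letter strings is not reproduced here.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-segmented toy lexicon: each word is a list of
# (letter string a_i, phonetic string b_i) pairs, as in step (2).
LEXICON = {
    "cat":  [("c", "k"), ("a", "æ"), ("t", "t")],
    "cab":  [("c", "k"), ("a", "æ"), ("b", "b")],
    "city": [("c", "s"), ("i", "ɪ"), ("t", "t"), ("y", "i")],
    "baby": [("b", "b"), ("a", "eɪ"), ("b", "b"), ("y", "i")],
}


def build_rank_tables(lexicon):
    """Step (4): for every letter string, list its phonetic strings
    in order of decreasing occurrence count (most probable first)."""
    counts = defaultdict(Counter)
    for pairs in lexicon.values():
        for letters, phonetic in pairs:
            counts[letters][phonetic] += 1
    tables = {}
    for letters, counter in counts.items():
        # Ties are broken alphabetically so the order is deterministic.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        tables[letters] = [phonetic for phonetic, _ in ranked]
    return tables


def encode_phonetics(pairs, tables):
    """Steps (5) and (7): replace each phonetic string by its position
    in the probability sequence of the corresponding letter string."""
    return [tables[letters].index(phonetic) for letters, phonetic in pairs]


TABLES = build_rank_tables(LEXICON)
print(encode_phonetics(LEXICON["city"], TABLES))  # [1, 0, 0, 0]
```

In this toy data the letter string "c" is pronounced "k" twice and "s" once, so "s" sits at position 1; most positions are 0, which is what makes the position codes cheap to store.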
In fact, each letter string has fewer than 4 phonetic strings on average, so the method compresses phonetic symbols very efficiently: the compression ratio is 15%-18%, whereas applying Huffman coding directly to the phonetic symbols gives a compression ratio of 60%.
In use, for a given English word, first determine which letter strings make up the word, then enter the position codes of these letter strings into the electronic dictionary; the dictionary displays the phonetic strings for the actual pronunciation of the word, which determines the word's actual pronunciation.
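The lookup described above can be sketched as the inverse of the encoding: given a word's letter strings and the stored positions, the phonetic strings are read back from each letter string's probability sequence. The table contents and the name decode_phonetics below are hypothetical and serve only as an illustration.

```python
# Hypothetical probability sequences (most probable pronunciation first).
RANK_TABLES = {
    "c": ["k", "s"],
    "a": ["æ", "eɪ"],
    "t": ["t"],
    "i": ["ɪ"],
    "y": ["i"],
}


def decode_phonetics(letter_strings, positions, tables):
    """Rebuild a word's phonetic transcription from its letter strings
    and the stored position of each actual pronunciation."""
    if len(letter_strings) != len(positions):
        raise ValueError("one position is needed per letter string")
    return "".join(
        tables[letters][pos] for letters, pos in zip(letter_strings, positions)
    )


print(decode_phonetics(["c", "i", "t", "y"], [1, 0, 0, 0], RANK_TABLES))  # sɪti
```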
The table of word letter strings and phonetic strings is given below.
Table of word letter strings and phonetic strings
[Figures A20051004386600041 and A20051004386600051: table images, not reproduced]
(4) Specific embodiment
Embodiment: the invention compresses the words in an English dictionary by the following method:
(1) Arrange a letter-string table A according to the number of times letter strings occur in the words of the English dictionary, such that every word in the dictionary can be formed by concatenating one or more letter strings from table A and every letter string in table A occurs in at least one word of the dictionary; table A contains 213 letter strings in total.
(2) For each letter string a in table A, find the set aB of phonetic strings for all of its possible pronunciations in the dictionary, such that: if a word in the dictionary is d = a1 a2 ... an, then each set aiB contains a phonetic string bi (i = 1, ..., n) such that the phonetic transcription of d is b1 b2 ... bn; and, conversely, for any phonetic string b in the set aB of any letter string a in table A, the dictionary contains a word d = ... a ... in which a is pronounced b.
(3) Treat the letter strings in table A in turn as the letters of the dictionary's alphabet, compute the probability with which each letter string occurs in the words of the dictionary, and apply Huffman coding to the letter strings according to these probabilities, which determines a position code for each letter string.
(4) For each letter string a in table A, count the occurrences in the word transcriptions of the phonetic strings for all of its possible pronunciations, derive the probability with which each of these phonetic strings occurs, and arrange the phonetic strings of a in a probability sequence ordered by that probability, with the most probable first.
(5) The phonetic transcription of a word is then determined by the position codes of the word's letter strings together with, for each of these letter strings, the position in its probability sequence of the phonetic string for its actual pronunciation.
(6) The probability of each position code is the sum, over all letter strings, of the probabilities with which the phonetic string at that position occurs in the phonetic transcriptions of the dictionary.
(7) For each letter string a, store its position code and the position in the probability sequence of the phonetic string for its actual pronunciation separately in the English electronic dictionary.
The method compresses phonetic symbols very efficiently: the compression ratio is 15%-18%, whereas applying Huffman coding directly to the phonetic symbols gives a compression ratio of 60%.
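One way to see why the position codes are cheap is to pool their probabilities across all letter strings, as in step (6), and Huffman-code the positions. The sketch below does this for made-up counts; the counts, the names position_probabilities and huffman_code_lengths, and the resulting code lengths are illustrative assumptions and do not reproduce the compression ratios stated in the patent.

```python
import heapq
from collections import Counter
from itertools import count


def position_probabilities(ranked_counts):
    """Step (6): pool, over all letter strings, the occurrence counts of
    the phonetic string found at each position of the probability sequence."""
    totals = Counter()
    for counts in ranked_counts.values():
        for position, n in enumerate(counts):
            totals[position] += n
    grand_total = sum(totals.values())
    return {position: n / grand_total for position, n in totals.items()}


def huffman_code_lengths(probs):
    """Return the Huffman code length in bits of every symbol."""
    if len(probs) == 1:
        return {symbol: 1 for symbol in probs}
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), [symbol]) for symbol, p in probs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        for symbol in group1 + group2:
            lengths[symbol] += 1  # merged symbols move one level deeper
        heapq.heappush(heap, (p1 + p2, next(tiebreak), group1 + group2))
    return lengths


# Hypothetical counts per letter string, already ordered most frequent first.
TOY_COUNTS = {"c": [6, 2], "a": [5, 2, 1], "t": [4]}
probs = position_probabilities(TOY_COUNTS)
print(huffman_code_lengths(probs))  # {0: 1, 1: 2, 2: 2}
```

Because position 0 dominates in this toy data, it receives a one-bit code; how closely real dictionary data follows this pattern depends on the actual table A.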
In use, for a given English word, first determine which letter strings make up the word, then enter the position codes of these letter strings into the electronic dictionary; the dictionary displays the phonetic strings for the actual pronunciation of the word, which determines the word's actual pronunciation.
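The text does not say how a word is split into letter strings. A minimal sketch, assuming a greedy longest-match strategy and a small hypothetical subset of table A (the names TOY_TABLE_A and segment are likewise assumptions), might look like this:

```python
# Hypothetical subset of letter-string table A; the real table has 213 entries.
TOY_TABLE_A = {"a", "c", "e", "g", "h", "i", "n", "o", "t", "th", "ing", "tion"}


def segment(word, table):
    """Split a word into letter strings from the table, always taking the
    longest table entry that matches at the current position (greedy)."""
    max_len = max(len(entry) for entry in table)
    pieces = []
    i = 0
    while i < len(word):
        for n in range(min(max_len, len(word) - i), 0, -1):
            candidate = word[i:i + n]
            if candidate in table:
                pieces.append(candidate)
                i += n
                break
        else:
            # No table entry matches here; greedy matching gives up.
            raise ValueError(f"cannot segment {word!r} at position {i}")
    return pieces


print(segment("action", TOY_TABLE_A))  # ['a', 'c', 'tion']
```

Greedy matching is only one possible choice: it can fail on a word for which some other split exists, and the split it picks must agree with the one used when the dictionary was compressed.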

Claims (1)

1. A compression method for words and phonetic symbols in English electronic dictionary data, characterized in that the following compression method is adopted:
(1) Arrange a letter-string table A according to the number of times letter strings occur in the words of the English dictionary, such that every word in the dictionary can be formed by concatenating one or more letter strings from table A and every letter string in table A occurs in at least one word of the dictionary; table A contains 213 letter strings in total.
(2) For each letter string a in table A, find the set aB of phonetic strings for all of its possible pronunciations in the dictionary, such that: if a word in the dictionary is d = a1 a2 ... an, then each set aiB contains a phonetic string bi (i = 1, ..., n) such that the phonetic transcription of d is b1 b2 ... bn; and, conversely, for any phonetic string b in the set aB of any letter string a in table A, the dictionary contains a word d = ... a ... in which a is pronounced b.
(3) Treat the letter strings in table A in turn as the letters of the dictionary's alphabet, compute the probability with which each letter string occurs in the words of the dictionary, and apply Huffman coding to the letter strings according to these probabilities, which determines a position code for each letter string.
(4) For each letter string a in table A, count the occurrences in the word transcriptions of the phonetic strings for all of its possible pronunciations, derive the probability with which each of these phonetic strings occurs, and arrange the phonetic strings of a in a probability sequence ordered by that probability, with the most probable first.
(5) The phonetic transcription of a word is then determined by the position codes of the word's letter strings together with, for each of these letter strings, the position in its probability sequence of the phonetic string for its actual pronunciation.
(6) The probability of each position code is the sum, over all letter strings, of the probabilities with which the phonetic string at that position occurs in the phonetic transcriptions of the dictionary.
(7) For each letter string a, store its position code and the position in the probability sequence of the phonetic string for its actual pronunciation separately in the English electronic dictionary.
CN 200510043866 2005-06-21 2005-06-21 Compression method for words and phonetic alphabet in English electronic dictionary data Pending CN1883959A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200510043866 CN1883959A (en) 2005-06-21 2005-06-21 Compression method for words and phonetic alphabet in English electronic dictionary data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200510043866 CN1883959A (en) 2005-06-21 2005-06-21 Compression method for words and phonetic alphabet in English electronic dictionary data

Publications (1)

Publication Number Publication Date
CN1883959A true CN1883959A (en) 2006-12-27

Family

ID=37582305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510043866 Pending CN1883959A (en) 2005-06-21 2005-06-21 Compression method for words and phonetic alphabet in English electronic dictionary data

Country Status (1)

Country Link
CN (1) CN1883959A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033859A (en) * 2009-09-28 2011-04-27 佳能株式会社 Method and system for compressing dictionary and processing words, text-to-speed system and electronic equipment
CN102033859B (en) * 2009-09-28 2013-04-10 佳能株式会社 Method and system for compressing dictionary and processing words, text-to-speed system and electronic equipment
CN109002454A (en) * 2018-04-28 2018-12-14 陈逸天 A kind of method and electronic equipment for combining subregion into syllables of determining target word

Similar Documents

Publication Publication Date Title
CN101694601A (en) Zero-memory Chinese character coding input method
WO2010043117A1 (en) Digital encoding method and application thereof
CN1883959A (en) Compression method for words and phonetic alphabet in English electronic dictionary data
CN101739142B (en) Five-stroke input system and method
CN1169041C (en) Pronunciation and shape phonetic transcription Chinese character input method
CN101046707A (en) Input method for Chinese character of first pronunciation
CN1700156A (en) Method for linking phrases in Chinese character input method
CN1164985C (en) Chinese-character 'Sound-shape code' input method for computer
CN1063555C (en) Three-D, three-codes method for inputting Chinese words and characters combined
CN1885242A (en) Chinese character input method capable of reducing candidate characters: stroke coding and phonetic initial letter
CN1106146A (en) Computer input method by computer Chinese-character phonology-tone coding and its keyboard
CN1048561C (en) Chinese character input method for computer
CN1203391C (en) Left and right pictophonetic and digital computer input method for Chinese character and its keyboard
CN1122913C (en) Normal encoding input method for Chinese data processing in computer
CN1419179A (en) Chinese characters input method according to stroke sequence and keyboard thereof
CN1178121C (en) Double Chinese character stroke order-radical input system
CN1036359C (en) Chinese characters Fanqie encoding input method for computer
CN1503112A (en) Chinese character coding, searching and input method
CN1199888A (en) Dictionary code as one Chinese character input method
CN1405660A (en) Chinese character input method
CN1248014A (en) Computer Chinese input method of component first and last code and its keyboard
CN1388430A (en) Modern Chinese pronunciation input method
CN86103506A (en) " a key diadic " keyboard and China and foreign countries' characters rapid input method
CN113761835A (en) Method for converting Chinese pinyin into Braille ASCII codes
CN1828496A (en) Chinese character stroke input method in network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication