CN1883959A

CN1883959A - Compression method for words and phonetic alphabet in English electronic dictionary data

Info

Publication number: CN1883959A
Application number: CN 200510043866
Authority: CN
Inventors: 容毅
Original assignee: Individual
Current assignee: Individual
Priority date: 2005-06-21
Filing date: 2005-06-21
Publication date: 2006-12-27

Abstract

A method for compressing words and phonetic symbols in an English electronic dictionary comprises: arranging a letter string table A according to the number of times letter strings appear in the words of the English dictionary, such that each letter string in table A appears in at least one word of the dictionary, and determining a position code for each letter string; finding all possible pronunciations in the dictionary of each letter string a in table A and establishing a phonetic symbol string set aB; counting the occurrences of the phonetic symbol strings of all possible pronunciations of each letter string a to obtain their probabilities of occurrence, and arranging them in a probability sequence with the larger probabilities first; and storing in the English electronic dictionary the position code of each letter string a together with the position, in the probability sequence, of the phonetic symbol string of its actual pronunciation. The invention achieves highly efficient phonetic symbol compression, with a compression ratio of 15%-18%.

Description

Compression method for words and phonetic symbols in English electronic dictionary data

(1) Technical field

The present invention relates to a method for compressing dictionary data, and in particular to a method for compressing the words and phonetic symbols in English electronic dictionary data.

(2) Background art

English is the language in common use in international communication. Its vocabulary is very large, and its pronunciation is especially hard to master: the same letter string is pronounced differently in different words, which makes it particularly difficult for native speakers of Chinese. To address this problem, electronic dictionaries were invented: English words are entered and stored in the electronic dictionary, and in use a person can enter a prefix, a root or a letter string to retrieve the word being looked up. To increase the amount of information that fits in the limited memory of an electronic dictionary, a method for compressing the words in electronic dictionary data was invented, in which the letters and letter strings occurring in the dictionary are tabulated and coded according to their number of occurrences, giving 256 letter strings in total; this greatly increases the amount of information held in the limited memory of the electronic dictionary. This compression method has the following problems, however: the words are compressed only slightly; and, in particular, no coding or compression is applied to the phonetic symbols of the words, so the problem that phonetic symbols occupy a large share of the electronic dictionary's space remains unsolved.

(3) Summary of the invention

To overcome the deficiencies of the prior art, the invention provides a compression method for the words and phonetic symbols in English electronic dictionary data that is soundly designed, easy to use and achieves a high compression ratio, and that codes and compresses both the words and the phonetic symbols.

An English dictionary has two characteristics: first, many letter strings recur in different words of the dictionary; second, a word and its phonetic symbol are strongly correlated.

Based on these characteristics of the English dictionary, the following compression method is adopted:

(1) According to the number of times letter strings occur in the words of the English dictionary, arrange a letter string table A such that every word in the dictionary can be formed by concatenating one or more letter strings from table A and every letter string in table A appears in at least one word of the dictionary; table A contains 213 letter strings in total;

(2) For each letter string a in letter string table A, find the set aB of phonetic symbol strings of all its possible pronunciations in the dictionary, such that: if a word in the dictionary is d = a1 a2 ... an, then a phonetic symbol string bi can be found in each aiB, i = 1, ..., n, such that the phonetic symbol of d is b1 b2 ... bn; and for any phonetic symbol string b in the set aB of any letter string a in table A, a word d = ...a... can be found in the dictionary in which a is pronounced b;

(3) The letter strings in table A are in turn treated as the letters (basic symbols) of the dictionary, the probability with which each letter string occurs in the words of the dictionary is computed, and each letter string is Huffman-coded according to its probability, which determines a position code for each letter string;

(4) For each letter string a in table A, count the occurrences in the words' phonetic symbols of the phonetic symbol strings of all its possible pronunciations, obtain the probability with which each of these phonetic symbol strings occurs in the words' phonetic symbols, and arrange the phonetic symbol strings of all possible pronunciations of a into a probability sequence in order of that probability, largest first;

(5) The phonetic symbol of a word can then be determined from the position codes of the word's letter strings and the position, in the probability sequence, of the phonetic symbol string of the actual pronunciation of each of those letter strings;

(6) The probability of each position code is the sum, over all letter strings, of the probabilities with which the phonetic symbol string at that position occurs in the phonetic symbols of the dictionary;

(7) The position code of each letter string a and the position in the probability sequence of the phonetic symbol string of its actual pronunciation are stored separately in the English electronic dictionary.
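Steps (2), (4) and (5) can be illustrated with a minimal Python sketch. It assumes a toy dictionary whose words are already segmented into letter strings and aligned with their pronunciations; the words, the ASCII stand-ins for phonetic symbols, and the names `ALIGNED_WORDS`, `build_probability_sequences` and `encode_pronunciation` are all hypothetical and are not taken from the patent.

```python
from collections import Counter, defaultdict

# Hypothetical toy dictionary: each word is pre-segmented into letter strings,
# each paired with the phonetic string it takes in that word. The ASCII
# transcriptions are illustrative stand-ins, not real dictionary data.
ALIGNED_WORDS = {
    "cat":  [("c", "k"), ("a", "ae"), ("t", "t")],
    "city": [("c", "s"), ("i", "i"), ("t", "t"), ("y", "i")],
    "cot":  [("c", "k"), ("o", "o"), ("t", "t")],
    "tall": [("t", "t"), ("a", "o:"), ("ll", "l")],
    "hat":  [("h", "h"), ("a", "ae"), ("t", "t")],
}

def build_probability_sequences(aligned_words):
    """For each letter string, list its pronunciations, most frequent first."""
    counts = defaultdict(Counter)
    for segments in aligned_words.values():
        for letters, phones in segments:
            counts[letters][phones] += 1
    # Sort by descending count; ties are broken alphabetically so the
    # result is deterministic (an assumption of this sketch).
    return {
        letters: [p for p, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))]
        for letters, c in counts.items()
    }

def encode_pronunciation(segments, sequences):
    """Replace each actual pronunciation by its position in the sequence."""
    return [sequences[letters].index(phones) for letters, phones in segments]

SEQUENCES = build_probability_sequences(ALIGNED_WORDS)
CITY_RANKS = encode_pronunciation(ALIGNED_WORDS["city"], SEQUENCES)
```

In this toy data "c" is pronounced "k" twice and "s" once, so its sequence is `["k", "s"]` and the word "city" is stored as the positions `[1, 0, 0, 0]` rather than as a full transcription.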

In practice, the average number of phonetic symbol strings per letter string is less than 4, so the method of the invention compresses phonetic symbols very efficiently: the compression ratio is 15%-18%, whereas applying Huffman coding directly to the phonetic symbols gives a compression ratio of 60%.
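The effect of a skewed position distribution can be sketched in Python as follows. The position counts are invented for illustration and do not reproduce the 15%-18% figure; `huffman_code_lengths` is a hypothetical helper implementing ordinary Huffman coding, which the text names but does not specify in detail.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code_lengths(frequencies):
    """Return {symbol: code length in bits} for a standard Huffman code."""
    if len(frequencies) == 1:
        return {symbol: 1 for symbol in frequencies}
    tie = count()  # unique tie-breaker so heapq never compares the dicts
    heap = [(freq, next(tie), {symbol: 0}) for symbol, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# Hypothetical counts of how often each position in the probability
# sequences is the actual pronunciation (position 0 = most probable).
POSITION_COUNTS = Counter({0: 80, 1: 12, 2: 6, 3: 2})

LENGTHS = huffman_code_lengths(POSITION_COUNTS)
TOTAL = sum(POSITION_COUNTS.values())
AVERAGE_BITS = sum(POSITION_COUNTS[p] * LENGTHS[p] for p in POSITION_COUNTS) / TOTAL
```

With these made-up counts, position 0 receives a 1-bit code and the average cost is 1.28 bits per letter string, which illustrates why coding positions can be cheaper than coding the phonetic symbols themselves.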

In use, for a given English word, first determine the letter strings of which the word is composed, then enter the position codes of those letter strings into the electronic dictionary; the dictionary displays the phonetic symbol strings of the word's actual pronunciation, from which the actual pronunciation of the word is determined.
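The lookup can be sketched in Python under simplifying assumptions: plain integer indexes stand in for the Huffman position codes, and the tables `LETTER_STRINGS` and `PROBABILITY_SEQUENCES`, like the function `decode_entry`, are hypothetical excerpts invented for this example.

```python
# Hypothetical excerpt of letter string table A; the list index stands in
# for the position code (the patent uses Huffman codes instead).
LETTER_STRINGS = ["c", "a", "t", "i", "y"]

# Hypothetical probability sequences: pronunciations of each letter string,
# most probable first. ASCII transcriptions are illustrative only.
PROBABILITY_SEQUENCES = {
    "c": ["k", "s"],
    "a": ["ae", "o:"],
    "t": ["t"],
    "i": ["i", "ai"],
    "y": ["i"],
}

def decode_entry(letter_codes, pronunciation_ranks):
    """Rebuild a word and its phonetic transcription from the stored codes."""
    if len(letter_codes) != len(pronunciation_ranks):
        raise ValueError("one pronunciation rank is needed per letter string")
    word_parts, phone_parts = [], []
    for code, rank in zip(letter_codes, pronunciation_ranks):
        letters = LETTER_STRINGS[code]
        word_parts.append(letters)
        phone_parts.append(PROBABILITY_SEQUENCES[letters][rank])
    return "".join(word_parts), "".join(phone_parts)

WORD, PHONETIC = decode_entry([0, 3, 2, 4], [1, 0, 0, 0])
```

Here the codes `[0, 3, 2, 4]` select "c", "i", "t", "y", and the ranks `[1, 0, 0, 0]` select the second pronunciation of "c" and the first pronunciation of each other string.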

The table of word letter strings and phonetic symbol strings is given below.

Table of word letter strings and phonetic symbol strings

(4) Specific embodiment

Embodiment: the invention compresses the words in an English dictionary using the following compression method:

(2) For each letter string a in letter string table A, find the set aB of phonetic symbol strings of all its possible pronunciations in the dictionary, such that: if a word in the dictionary is d = a1 a2 ... an, then a phonetic symbol string bi can be found in each aiB, i = 1, ..., n, such that the phonetic symbol of d is b1 b2 ... bn; and for any phonetic symbol string b in the set aB of any letter string a in table A, a word d = ...a... can be found in the dictionary in which a is pronounced b;
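The two conditions placed on the sets aB can be checked mechanically. The following Python sketch assumes a toy dictionary with pre-aligned segmentations; the function `check_pronunciation_sets` and all data are hypothetical and serve only to make the conditions concrete.

```python
def check_pronunciation_sets(aligned_words, phonetics, pronunciation_sets):
    """Return True when the sets aB satisfy both conditions of step (2)."""
    witnessed = set()
    for word, segments in aligned_words.items():
        # Condition 1: the letter strings spell the word, and their
        # pronunciations, each taken from the matching set aB, spell
        # the word's phonetic symbol.
        if "".join(a for a, _ in segments) != word:
            return False
        if "".join(b for _, b in segments) != phonetics[word]:
            return False
        for a, b in segments:
            if b not in pronunciation_sets.get(a, set()):
                return False
            witnessed.add((a, b))
    # Condition 2: every b in every aB is the pronunciation of a in some word.
    return all(
        (a, b) in witnessed
        for a, bs in pronunciation_sets.items()
        for b in bs
    )

# Hypothetical toy data with ASCII stand-ins for phonetic symbols.
ALIGNED = {
    "cat":  [("c", "k"), ("a", "ae"), ("t", "t")],
    "city": [("c", "s"), ("i", "i"), ("t", "t"), ("y", "i")],
}
PHONETICS = {"cat": "kaet", "city": "siti"}
SETS = {"c": {"k", "s"}, "a": {"ae"}, "t": {"t"}, "i": {"i"}, "y": {"i"}}

IS_VALID = check_pronunciation_sets(ALIGNED, PHONETICS, SETS)

# Adding a pronunciation that no word uses violates the second condition.
SETS_WITH_UNUSED = {**SETS, "a": {"ae", "ei"}}
IS_VALID_WITH_UNUSED = check_pronunciation_sets(ALIGNED, PHONETICS, SETS_WITH_UNUSED)
```

The first check succeeds, while the second fails because no word in the toy dictionary pronounces "a" as "ei".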

The method of the invention compresses phonetic symbols very efficiently: the compression ratio is 15%-18%, whereas applying Huffman coding directly to the phonetic symbols gives a compression ratio of 60%.

Claims

1. A compression method for words and phonetic symbols in English electronic dictionary data, characterized in that the following compression method is adopted:

(2) For each letter string a in letter string table A, find the set aB of phonetic symbol strings of all its possible pronunciations in the dictionary, such that: if a word in the dictionary is d = a1 a2 ... an, then a phonetic symbol string bi can be found in each aiB, i = 1, ..., n, such that the phonetic symbol of d is b1 b2 ... bn; and for any phonetic symbol string b in the set aB of any letter string a in table A, a word d = ...a... can be found in the dictionary in which a is pronounced b;