CN102033859B - Method and system for compressing dictionary and processing words, text-to-speech system and electronic equipment - Google Patents

Method and system for compressing dictionary and processing words, text-to-speech system and electronic equipment

Info

Publication number
CN102033859B
Authority
CN
China
Prior art keywords
word
mapping table
tone mapping
pronunciation
chinese character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 200910176368
Other languages
Chinese (zh)
Other versions
CN102033859A (en)
Inventor
亓超
金浩
康恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc
Priority to CN 200910176368
Publication of CN102033859A
Application granted
Publication of CN102033859B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention provides a method for compressing electronic data in a dictionary by using a computer. The method comprises the following steps of: inputting the dictionary to be compressed, wherein words and pronunciations are stored in the dictionary to be compressed; for each Chinese character, determining the pronunciation which occurs most frequently as the default pronunciation, and forming a first character pronunciation mapping table based on the default pronunciation; for each multi-pronunciation Chinese character, determining the other pronunciations except the default pronunciation as the non-default pronunciations, and forming a second character pronunciation mapping table based on the non-default pronunciations; and according to information of an index or a position of a combination of each Chinese character and the pronunciation thereof in the first character pronunciation mapping table or the second character pronunciation mapping table, compressing the combination into a 16-bit code so as to form the compressed dictionary which comprises the information in a 16-bit coding form, wherein the 16-bit code comprises the information of the index of the Chinese character in the first character pronunciation mapping table or the second character pronunciation mapping table and the information of the index of the pronunciation in the first character pronunciation mapping table or the second character pronunciation mapping table.

Description

Dictionary compression and word processing method and system, text-to-speech system, and electronic equipment
Technical field
The present invention relates generally to a method and system for compressing an electronic dictionary, a word processing method and system, a text-to-speech system, and electronic equipment.
Background art
Electronic equipment that uses electronic dictionaries is widely used in many fields. An electronic dictionary usually adopts a data structure in which each word is stored in association with information about that word. For a Chinese electronic dictionary (hereinafter also referred to as a "dictionary"), the information about a Chinese word may include definition information and pronunciation information for the word.
In most Chinese electronic dictionaries, pronunciation information is an important part of the word information. Usually, Chinese pronunciation information can be stored directly in the electronic dictionary. However, if the electronic dictionary is installed on a device with limited storage, such as a personal digital assistant (PDA), it is desirable to minimize the memory that the electronic dictionary occupies.
Usually, each Chinese word in an electronic dictionary has its own pronunciation information. For example, the pronunciation of the word "國家" is "guo2jia1", where digits such as "2" and "1" denote tones in the Chinese pronunciation system. Clearly, the pronunciation string usually needs more bytes than the word itself.
Chinese patent CN02159546.1 discloses a method for storing the pronunciation of a Chinese character in 2 bytes (16 bits). The storage required for pronunciation information is thereby reduced to 2*N bytes, where N is the number of Chinese characters contained in the electronic dictionary. Taking as an example the Big5 character set, which encodes one Chinese character in 2 bytes, the combination of a Chinese character and its pronunciation occupies 4 bytes (32 bits).
However, for an electronic dictionary containing many thousands of words, each a combination of Chinese characters, a considerable amount of storage is still needed for the pronunciation information. For example, suppose the electronic dictionary contains about 70000 words made up of more than 140000 Chinese characters; more than 273 KB are then needed to hold the pronunciation information.
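The figure quoted above can be checked with a short calculation. The following is only a sketch: the constant and function names are illustrative, and 1 KB is assumed to mean 1024 bytes.

```python
# Back-of-the-envelope check of the storage figure quoted above.
# Assumption: 1 KB is taken as 1024 bytes.
CHARACTERS = 140_000          # "more than 140000 Chinese characters"
BYTES_PER_PRONUNCIATION = 2   # 2 bytes per character, as in CN02159546.1


def pronunciation_storage(characters: int, bytes_each: int = BYTES_PER_PRONUNCIATION) -> int:
    """Bytes needed to store one fixed-size pronunciation per character."""
    return characters * bytes_each


total_bytes = pronunciation_storage(CHARACTERS)
total_kb = total_bytes / 1024
print(f"{total_bytes} bytes = {total_kb:.1f} KB")  # 280000 bytes = 273.4 KB
```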
In fact, most Chinese characters have only one pronunciation. For example, the pronunciation of the Chinese character "風" is "feng1". Some Chinese characters have multiple pronunciations and are called "polyphonic Chinese characters". For example, the pronunciations of the Chinese character "差" include "cha4", "cha1", "chai1", and "ci1". In some cases, the tone of a Chinese character within a word becomes the neutral tone. For example, the pronunciation of the Chinese character "熱" is "re4" and that of "鬧" is "nao4", but the pronunciation of the word "熱鬧" is "re4nao5" (the digit "5" denotes the neutral tone). That is, although a Chinese character has only one default pronunciation as its usual pronunciation, the tone of that pronunciation may become neutral in some cases.
Chinese patent application CN200310114889.8 discloses a method that stores pronunciation information in fewer bytes. The method comprises the following steps: distinguishing polyphonic Chinese characters from other Chinese characters; storing a default pronunciation table for all Chinese characters; storing a non-default pronunciation table for polyphonic Chinese characters; generating, for each word that contains polyphonic Chinese characters, auxiliary information that indicates the non-default pronunciations of those characters; and storing the word in the electronic dictionary in association with the auxiliary information. According to that method, only polyphonic Chinese characters that take a non-default pronunciation need to be stored with auxiliary information, and because the default and non-default pronunciation tables are stored, default pronunciations no longer need to be stored for each character. The method can reduce the storage occupied by pronunciations to less than 1 byte per character on average.
However, with the method disclosed in Chinese patent application CN200310114889.8, the combination of a Chinese character and its pronunciation still needs more than 2 bytes on average. Because of the demand for lower-cost electronic equipment, an electronic dictionary is needed in which Chinese characters and their pronunciations occupy less storage than before.
Summary of the invention
The present inventors have noticed the following inherent regularities or facts about the Chinese language. Taking the Big5 character set as an example, Big5 contains at most 13060 traditional Chinese characters with 1295 valid pronunciations; only 943 of these characters have multiple pronunciations, and the largest number of pronunciations of any one character is 6. On the other hand, Big5 encodes one Chinese character in 2 bytes (16 bits), yet a large number of Big5 code values do not represent any Chinese character.
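These counts are what make a 16-bit code plausible. The sketch below checks that they fit the field widths given later in the description of the decompressing device (a table flag plus 14 bits and 1 bit, or a table flag plus 12 bits and 3 bits). The helper `bits_needed` and the layout dictionaries are illustrative names, not part of the patent.

```python
def bits_needed(count: int) -> int:
    """Smallest number of bits that can index `count` distinct items."""
    return max(1, (count - 1).bit_length())


BIG5_CHARACTERS = 13060        # characters with a default pronunciation
POLYPHONIC_CHARACTERS = 943    # characters with more than one pronunciation
MAX_PRONUNCIATIONS = 6         # most pronunciations of any one character
MAX_NON_DEFAULT = MAX_PRONUNCIATIONS - 1

# Field widths as stated in the description of the decompressing device.
FIRST_TABLE_LAYOUT = {"table flag": 1, "character index": 14, "neutral tone flag": 1}
SECOND_TABLE_LAYOUT = {"table flag": 1, "character index": 12, "pronunciation position": 3}

# Every count fits in the field reserved for it, and both layouts total 16 bits.
assert bits_needed(BIG5_CHARACTERS) <= FIRST_TABLE_LAYOUT["character index"]
assert bits_needed(POLYPHONIC_CHARACTERS) <= SECOND_TABLE_LAYOUT["character index"]
assert bits_needed(MAX_NON_DEFAULT) <= SECOND_TABLE_LAYOUT["pronunciation position"]
assert sum(FIRST_TABLE_LAYOUT.values()) == 16
assert sum(SECOND_TABLE_LAYOUT.values()) == 16

print(bits_needed(BIG5_CHARACTERS), bits_needed(POLYPHONIC_CHARACTERS), bits_needed(MAX_NON_DEFAULT))
# prints: 14 10 3
```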
In view of the technical problems in the prior art described above and these regularities or facts about the Chinese language, a new method and system are provided that store a Chinese character or Chinese word together with its pronunciation in fewer bytes.
Other common character sets for Chinese characters, such as GB2312, have similar properties. Therefore, although the principle of the present invention is described with the Big5 character set in some embodiments or examples, the method of representing Chinese characters is not limited to Big5, and any other method of representing Chinese characters may be used.
According to one aspect of the present invention, there is provided a method of compressing electronic data in a dictionary by using a computer, comprising: an input step of inputting a dictionary to be compressed, in which words and their pronunciations are stored as electronic data; a first character pronunciation mapping table forming step of, for each Chinese character in the dictionary to be compressed, determining one pronunciation as the default pronunciation and forming a first character pronunciation mapping table based on the default pronunciations; a second character pronunciation mapping table forming step of, for each polyphonic Chinese character in the dictionary to be compressed, determining the remaining pronunciations other than the default pronunciation as non-default pronunciations and forming a second character pronunciation mapping table based on the non-default pronunciations; and a compression step of compressing the combination of each Chinese character in each word of the dictionary to be compressed and its pronunciation into a 16-bit code, according to information about the index or position of that combination in the first or second character pronunciation mapping table, so as to form a compressed dictionary that contains the information in 16-bit coded form, wherein the 16-bit code comprises information about the index or position of the Chinese character in the first or second character pronunciation mapping table and information about the index or position of the pronunciation corresponding to that Chinese character in the first or second character pronunciation mapping table.
According to another aspect of the present invention, there is provided a system for compressing electronic data in a dictionary by using a computer, comprising: an input device that inputs a dictionary to be compressed, in which words and their pronunciations are stored as electronic data; a first character pronunciation mapping table forming device that, for each Chinese character in the dictionary to be compressed, determines one pronunciation as the default pronunciation and forms a first character pronunciation mapping table based on the default pronunciations; a second character pronunciation mapping table forming device that, for each polyphonic Chinese character in the dictionary to be compressed, determines the remaining pronunciations other than the default pronunciation as non-default pronunciations and forms a second character pronunciation mapping table based on the non-default pronunciations; and a compression device that compresses the combination of each Chinese character in each word of the dictionary to be compressed and its pronunciation into a 16-bit code, according to information about the index or position of that combination in the first or second character pronunciation mapping table, so as to form a compressed dictionary that contains the information in 16-bit coded form, wherein the 16-bit code comprises information about the index or position of the Chinese character in the first or second character pronunciation mapping table and information about the index or position of the pronunciation corresponding to that Chinese character in the first or second character pronunciation mapping table.
According to a further aspect of the present invention, there is provided a word processing method for electronic equipment, the electronic equipment comprising a compressed dictionary in which the combination of each Chinese character in each word and its pronunciation has been compressed into a 16-bit code by the method according to the present invention, the word processing method comprising: a character pronunciation mapping table obtaining step of obtaining the first character pronunciation mapping table and the second character pronunciation mapping table; and a decompression step of using the first or second character pronunciation mapping table to decompress the 16-bit codes that correspond to the combinations of Chinese characters in a word of the compressed dictionary and their pronunciations.
According to a further aspect of the present invention, there is provided a word processing system for electronic equipment, the electronic equipment comprising a compressed dictionary in which the combination of each Chinese character in each word and its pronunciation has been compressed into a 16-bit code by the method according to the present invention, the word processing system comprising: a character pronunciation mapping table obtaining device that obtains the first character pronunciation mapping table and the second character pronunciation mapping table; and a decompressing device that uses the first or second character pronunciation mapping table to decompress the 16-bit codes that correspond to the combinations of Chinese characters in a word of the compressed dictionary and their pronunciations.
According to a further aspect of the present invention, there is provided electronic equipment comprising: a compressed dictionary in which the combination of each Chinese character in each word and its pronunciation has been compressed into a 16-bit code by the method according to the present invention; and the word processing system according to the present invention.
According to a further aspect of the present invention, there is provided a text-to-speech system that converts text into speech by using a compressed dictionary, the text-to-speech system comprising: a compressed dictionary in which the combination of each Chinese character in each word and its pronunciation has been compressed into a 16-bit code by the method according to the present invention; a text input device for inputting text; a text processing device for segmenting the text into words according to the compressed dictionary and annotating the words with their pronunciations; and a speech generation device for generating speech based on the result of the text processing device.
According to a further aspect of the present invention, there is provided electronic equipment comprising: the text-to-speech system according to the present invention; at least one of a screen, a keyboard, and a mouse for interfacing with the text input device; and at least one of a loudspeaker, an earphone, and a display for interfacing with the speech generation device.
With the above method and system according to the present invention, an electronic dictionary can be realized in which Chinese characters and their pronunciations occupy less storage than before. Specifically, for example, the Big5 character set uses 2 bytes (16 bits) to encode one Chinese character. According to the present invention, the combination of a Chinese character and its pronunciation occupies only 2 bytes (16 bits), which equals the storage occupied by the Chinese character itself under Big5. That is, beyond the 16 bits conventionally used for the Chinese character itself, the pronunciation information occupies no extra or additional storage.
In addition, exceeding 2 bytes (16 bits) by even 1 bit would make storage inconvenient to implement and would reduce efficiency. Therefore, making the combination of a Chinese character and its pronunciation exactly 16 bits is very useful for reducing the storage used.
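The convenience of an exact 16-bit width can be illustrated by packing codes into bytes. This is a minimal sketch: the function names, the example code values, and the big-endian byte order are assumptions made for illustration only.

```python
import struct


def pack_codes(codes):
    """Pack a sequence of 16-bit codes into bytes (big-endian is an assumption)."""
    return struct.pack(f">{len(codes)}H", *codes)


def unpack_codes(data):
    """Inverse of pack_codes: split the bytes back into 16-bit codes."""
    return list(struct.unpack(f">{len(data) // 2}H", data))


# Hypothetical codes for a two-character word.
codes = [0x0000, 0x0003]
packed = pack_codes(codes)

print(len(packed))            # 4 bytes: exactly 2 bytes per character
print(unpack_codes(packed))   # [0, 3]
```

Because every code is exactly two bytes, the n-th character of a word sits at byte offset 2*n, so no bit-level shifting across byte boundaries is needed when reading the compressed dictionary.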
Other features and advantages of the present invention will become clear from the following description with reference to the accompanying drawings.
Description of drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
Fig. 1 is a block diagram showing the hardware configuration of a computer system 1000 in which embodiments of the present invention can be implemented.
Fig. 2 is a block diagram illustrating a system 2000 according to the present invention for compressing electronic data in a dictionary by using a computer.
Fig. 3 is a block diagram illustrating a word processing system 3000 according to the present invention for use in electronic equipment that includes a compressed dictionary.
Fig. 4 is a block diagram illustrating a text-to-speech system 4000 according to the present invention.
Fig. 5 is a flowchart of a method according to the present invention for compressing electronic data in a dictionary by using a computer.
Fig. 6 is a flowchart showing a preferred process according to the present invention for forming the first character pronunciation mapping table and the second character pronunciation mapping table.
Fig. 7 is a flowchart of an exemplary compression process according to the present invention for implementing step S540 of Fig. 5.
Fig. 8 is a flowchart illustrating a method of processing a word by using the compressed dictionary according to the present invention.
Fig. 9 is a flowchart illustrating a method of processing a word input by a user by using the compressed dictionary according to the present invention.
Fig. 10 is a flowchart illustrating a process performed by the text-to-speech system 4000 by using the compressed dictionary according to the present invention.
Embodiments
Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Note that similar reference numerals and letters refer to similar items in the figures, so once an item has been defined in one figure, it need not be discussed again for subsequent figures.
In this specification, "dictionary" refers to a Chinese electronic dictionary, and "character" refers to a Chinese character.
In this specification, the default pronunciation of a Chinese character may be any one of the pronunciations of that character. For example, the default pronunciation may be the pronunciation of the character that is statistically used most frequently. For example, it may be the pronunciation found a priori to be used most frequently in daily life, or it may be the pronunciation that occurs most often among all pronunciations corresponding to that character in the dictionary. Typically, a Chinese character has only one default pronunciation, and in some cases the tone of the default pronunciation may become the neutral tone.
In addition, the non-default pronunciations of a polyphonic Chinese character are its pronunciations other than the default pronunciation. A polyphonic Chinese character may have one or more non-default pronunciations.
In this specification, "the combination of a word and its pronunciation" has the same meaning as "the combinations of the Chinese characters in the word and their pronunciations". That is, the combination of a word and its pronunciation means a plurality of combinations, each consisting of one Chinese character in the word and its pronunciation.
In this specification, a compressed word refers to a word in which all Chinese characters and their pronunciations have been compressed, and a compressed character refers to a Chinese character that, together with its pronunciation, has been compressed into a 16-bit code according to the present invention.
Fig. 1 is a block diagram showing the hardware configuration of a computer system 1000 in which embodiments of the present invention can be implemented.
As shown in Fig. 1, the computer system comprises a computer 1110. The computer 1110 comprises a processing unit 1120, a system memory 1130, a fixed non-volatile memory interface 1140, a removable non-volatile memory interface 1150, a user input interface 1160, a network interface 1170, a video interface 1190, and an output peripheral interface 1195, which are connected via a system bus 1121.
The system memory 1130 comprises a ROM (read-only memory) 1131 and a RAM (random access memory) 1132. A BIOS (basic input/output system) 1133 resides in the ROM 1131. An operating system 1134, application programs 1135, other program modules 1136, and some program data 1137 reside in the RAM 1132.
A fixed non-volatile memory 1141, such as a hard disk, is connected to the fixed non-volatile memory interface 1140. The fixed non-volatile memory 1141 can store, for example, an operating system 1144, application programs 1145, other program modules 1146, and some program data 1147.
Removable non-volatile memories, such as a floppy disk drive 1151 and a CD-ROM drive 1155, are connected to the removable non-volatile memory interface 1150. For example, a floppy disk can be inserted into the floppy disk drive 1151, and a compact disc (CD) can be inserted into the CD-ROM drive 1155.
Input devices such as a mouse 1161 and a keyboard 1162 are connected to the user input interface 1160.
The computer 1110 can be connected to a remote computer 1180 through the network interface 1170. For example, the network interface 1170 can be connected to the remote computer 1180 via a local area network 1171. Alternatively, the network interface 1170 can be connected to a modem (modulator-demodulator) 1172, and the modem 1172 is connected to the remote computer 1180 via a wide area network 1173.
The remote computer 1180 may comprise a memory 1181, such as a hard disk, which can store remote application programs 1185.
The video interface 1190 is connected to a monitor 1191.
The output peripheral interface 1195 is connected to a printer 1196 and a loudspeaker 1197.
The computer system shown in Fig. 1 is merely illustrative and is in no way intended to limit the invention, its application, or its uses.
The computer system shown in Fig. 1 can be incorporated in any embodiment, either as a stand-alone computer or as a processing system in electronic equipment; one or more unnecessary components can be removed from it, and one or more additional components can be added to it. For example, when the computer system 1000 is used in an electronic dictionary device, an electronic learning machine, a personal digital assistant, a mobile phone, a video camera, or a multi-function peripheral, it may omit, for example, the floppy disk drive 1151 and the CD-ROM drive 1155.
Fig. 2 is a block diagram illustrating a system 2000 for compressing electronic data in a dictionary by using a computer. As shown in Fig. 2, the system 2000 comprises: an input device 2100 that inputs a dictionary to be compressed, in which words and their pronunciations are stored as electronic data; a first character pronunciation mapping table forming device 2200 that, for each Chinese character in the dictionary to be compressed, determines one pronunciation as the default pronunciation and forms a first character pronunciation mapping table based on the default pronunciations; a second character pronunciation mapping table forming device 2300 that, for each polyphonic Chinese character in the dictionary to be compressed, determines the remaining pronunciations other than the default pronunciation as non-default pronunciations and forms a second character pronunciation mapping table based on the non-default pronunciations; and a compression device 2400 that compresses the combination of each Chinese character in each word of the dictionary to be compressed and its pronunciation into a 16-bit code, according to information about the index or position of that combination in the first or second character pronunciation mapping table, so as to form a compressed dictionary that contains the information in 16-bit coded form, wherein the 16-bit code comprises information about the index or position of the Chinese character in the first or second character pronunciation mapping table and information about the index or position of the pronunciation corresponding to that Chinese character in the first or second character pronunciation mapping table.
Fig. 3 is a block diagram illustrating a word processing system 3000 for use in electronic equipment that includes a compressed dictionary. As shown in Fig. 3, the word processing system 3000 comprises: a character pronunciation mapping table obtaining device 3300 that obtains the first character pronunciation mapping table and the second character pronunciation mapping table; and a decompressing device 3400 that uses the first or second character pronunciation mapping table to decompress the 16-bit codes that correspond to the combinations of Chinese characters in a word of the compressed dictionary and their pronunciations.
The word processing system 3000 may further comprise: a word input device 3100 through which a user inputs a word; a search device 3200 that uses the first and second character pronunciation mapping tables to search the compressed dictionary for the compressed form of the input word, the compressed form consisting of a plurality of 16-bit codes, each of which represents the combination of one Chinese character and its pronunciation, wherein the decompressing device 3400 uses the first or second character pronunciation mapping table to decompress each 16-bit code in the compressed form of the word found in the compressed dictionary into the combination of the Chinese character and its pronunciation; and an output device 3500 that outputs, as text and/or speech, the pronunciation of the word obtained by the decompressing device 3400.
The decompressing device 3400 may further comprise: a first determining device 3410 that determines, according to a predetermined bit in the 16-bit code, whether the combination is located in the first character pronunciation mapping table or the second character pronunciation mapping table; and a second determining device 3430. When the first determining device 3410 determines that the combination is located in the first character pronunciation mapping table, the second determining device 3430 determines information about the index or position of the Chinese character in the first character pronunciation mapping table according to 14 predetermined bits among the remaining bits, and determines whether the pronunciation is in the neutral tone according to the one remaining bit. Alternatively, when the first determining device 3410 determines that the combination is located in the second character pronunciation mapping table, the second determining device 3430 determines information about the index or position of the Chinese character in the second character pronunciation mapping table according to 12 predetermined bits among the remaining bits, and determines information about the position of the pronunciation of the Chinese character in the second character pronunciation mapping table according to the remaining 3 bits.
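The bit fields handled by the two determining devices can be sketched as follows. The text only says that the bits are "predetermined", so the exact positions used here (flag in the top bit, index in the middle, and the neutral tone flag or pronunciation position in the low bits) are an assumption for illustration, as are the names `Decoded` and `decode`.

```python
from typing import NamedTuple


class Decoded(NamedTuple):
    table: int           # 1 = first (default) table, 2 = second (non-default) table
    char_index: int      # index of the character in that table
    neutral_tone: bool   # meaningful only when table == 1
    pron_position: int   # meaningful only when table == 2


def decode(code: int) -> Decoded:
    """Split a 16-bit code into its fields (assumed bit positions)."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("code must fit in 16 bits")
    if code >> 15 == 0:
        # First table: 1 flag bit, 14 index bits, 1 neutral tone bit.
        return Decoded(1, (code >> 1) & 0x3FFF, bool(code & 1), 0)
    # Second table: 1 flag bit, 12 index bits, 3 pronunciation position bits.
    return Decoded(2, (code >> 3) & 0x0FFF, False, code & 0b111)


print(decode(0x0003))  # Decoded(table=1, char_index=1, neutral_tone=True, pron_position=0)
print(decode(0x8001))  # Decoded(table=2, char_index=0, neutral_tone=False, pron_position=1)
```

The decoded indices would then be looked up in the first or second character pronunciation mapping table to recover the character and its pronunciation.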
Fig. 4 is a block diagram illustrating a text-to-speech system 4000 that converts text into speech by using a compressed dictionary. The text-to-speech system 4000 comprises: a compressed dictionary in which the combination of each Chinese character in each word and its pronunciation has been compressed into a 16-bit code by the method according to the present invention; a text input device 4100 for inputting text; a text processing device 4300 for segmenting the text into words according to the compressed dictionary and annotating the words with their pronunciations; and a speech generation device 4500 for generating speech based on the result of the text processing device 4300.
The text processing device 4300 may preferably comprise: a text segmentation device 4310 for segmenting the text into words; and the word processing system 3000.
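The segmentation and annotation roles of the text processing device can be sketched as below. The text does not say which segmentation algorithm is used; greedy forward maximum matching is substituted here purely for illustration, and a plain Python dict stands in for the decompressed dictionary. The names `segment`, `annotate`, and `lexicon` are hypothetical.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching (an assumed strategy, not from the source)."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                words.append(piece)
                i += n
                break
    return words


def annotate(text, lexicon):
    """Pair each segmented word with its pronunciation (None if unknown)."""
    return [(word, lexicon.get(word)) for word in segment(text, lexicon)]


# Hypothetical stand-in for the dictionary after decompression.
lexicon = {"國家": "guo2jia1", "熱鬧": "re4nao5"}

print(annotate("國家熱鬧", lexicon))
# [('國家', 'guo2jia1'), ('熱鬧', 're4nao5')]
```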
The devices above are exemplary and/or preferred modules for implementing the processes described below. The modules for implementing each step are not described exhaustively below. However, wherever there is a step that performs a certain process, there can be a corresponding functional module or device for implementing that process. Technical solutions defined by all combinations of the steps described below and the devices corresponding to those steps are included in the disclosure of this specification, as long as the technical solutions they constitute are complete and applicable.
In addition, the above systems, each made up of the respective devices, can be incorporated as functional modules into a hardware device such as electronic equipment. Besides these functional modules, the electronic equipment can of course have other hardware or software components.
The method of compressing a Chinese electronic dictionary according to the present invention is described with reference to Figs. 5 to 7.
Fig. 5 is a flowchart of a method according to the present invention for compressing electronic data in a dictionary by using a computer.
In step S510, a dictionary to be compressed is input. Words and their pronunciations are stored in the dictionary to be compressed as electronic data.
In step S520, the first character pronunciation mapping table is formed (also called the default pronunciation table; described in more detail below).
In step S530, the second character pronunciation mapping table is formed (also called the non-default pronunciation table; described in more detail below).
In step S540, based on the first or second character pronunciation mapping table, each combination of a Chinese character of a word in the dictionary to be compressed and its pronunciation is compressed into a 16-bit code.
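Step S540 can be sketched as follows, assuming the two tables are held as Python dicts and assuming one particular bit layout (table flag in the top bit, then the character index, then either a neutral tone flag or a pronunciation position). The layout, the table shapes, and the rule that a trailing "5" on the default syllable signals the neutral tone are illustrative assumptions, not details confirmed by the text.

```python
TABLE_FLAG = 0x8000  # assumed: top bit set means "second table"


def encode(char, pron, first, second):
    """Compress one (character, pronunciation) combination into a 16-bit code.

    first:  {char: (index, default pronunciation)}        index assumed < 2**14
    second: {char: (index, [non-default pronunciations])} index assumed < 2**12
    """
    index, default = first[char]
    if pron == default:
        return index << 1                      # first table, neutral flag 0
    if pron.endswith("5") and pron[:-1] == default[:-1]:
        return (index << 1) | 1                # default syllable in the neutral tone
    index2, others = second[char]
    return TABLE_FLAG | (index2 << 3) | others.index(pron)


# Hypothetical miniature tables.
first = {"熱": (0, "re4"), "鬧": (1, "nao4"), "差": (2, "cha4")}
second = {"差": (0, ["cha1", "chai1", "ci1"])}

word = [("熱", "re4"), ("鬧", "nao5")]
print([encode(c, p, first, second) for c, p in word])   # [0, 3]
print(hex(encode("差", "chai1", first, second)))         # 0x8001
```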
Fig. 6 is a flowchart showing a preferred process for forming the first character-pronunciation mapping table (step S520) and the second character-pronunciation mapping table (step S530). Note that although steps S520 and S530 are two separate steps in the flowchart of Fig. 5, they can be performed either serially or in parallel. The flowchart of Fig. 6 gives an example in which steps S520 and S530 are performed in parallel. The left branch of Fig. 6 (the flow made up of steps S610, S620, S630, S640, S660 and S670) corresponds to step S520, and the right branch of Fig. 6 (the flow made up of steps S610, S620, S630, S650, S660 and S670) corresponds to step S530.
Initially, the first and the second character-pronunciation mapping tables are, for example, empty.
In step S610, a Chinese character in the dictionary is obtained, and in step S620, one pronunciation of this Chinese character is considered.
In step S630, it is determined whether the pronunciation considered in step S620 is the default pronunciation of the Chinese character obtained in step S610. The default pronunciation can be any one pronunciation of the character. For example, the first pronunciation processed for the character can be taken as the default pronunciation. The criterion for deciding whether a pronunciation is the default pronunciation of the character can also be based, for example, on statistical information obtained in advance. As mentioned above, the default pronunciation of a Chinese character can be the pronunciation that is statistically used most frequently for that character. For example, the default pronunciation can be the pronunciation known a priori to be used most frequently in daily life, or the pronunciation that occurs most often among all pronunciations of the character in the dictionary. Typically, a Chinese character has only one default pronunciation, and in some cases the tone of this default pronunciation can change to the neutral tone.
If it is determined in step S630 that the pronunciation considered in step S620 is the default pronunciation of the character ("Yes" in step S630), the process advances to step S640. In step S640, the first character-pronunciation mapping table is updated by adding to it the information on the combination of this character and its default pronunciation, and the process advances to step S660.
Otherwise, if it is determined in step S630 that the pronunciation considered in step S620 is not the default pronunciation of the character, that is, it is a non-default pronunciation ("No" in step S630), the process advances to step S650. In step S650, the second character-pronunciation mapping table is updated by adding to it the information on the combination of this character and this non-default pronunciation, and the process advances to step S660.
In step S660, it is determined whether another pronunciation of the current character remains to be considered. If so ("Yes" in step S660), the process returns to step S620 to consider the next unconsidered pronunciation of the current character. If no pronunciation of the current character remains ("No" in step S660), the process advances to step S670 to determine whether any unconsidered Chinese character remains in the dictionary. If an unconsidered character remains ("Yes" in step S670), the process returns to step S610 to obtain the next Chinese character. If no unconsidered character remains ("No" in step S670), the formation of the first and second character-pronunciation mapping tables is complete, and the two tables can be used for other processing.
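The flow of Fig. 6 can be sketched in Python as follows. This is only an illustration under stated assumptions: the input format (pairs of a word and a per-character pinyin list), the function name `build_tables`, and the use of plain dictionaries are hypothetical, and the default pronunciation is chosen as the one occurring most often in the dictionary, which is just one of the criteria the description permits.

```python
from collections import Counter

def build_tables(entries):
    """Build the default and non-default pronunciation tables.

    entries: iterable of (word, [pinyin per character]) pairs (assumed format).
    Returns (table1, table2):
      table1 maps each character to its default pronunciation,
      table2 maps each polyphonic character to its non-default pronunciations.
    """
    counts = {}  # character -> Counter of its pronunciations in the dictionary
    for word, prons in entries:
        for ch, pron in zip(word, prons):
            counts.setdefault(ch, Counter())[pron] += 1

    table1, table2 = {}, {}
    for ch, counter in counts.items():
        # Assumption: the most frequent pronunciation is taken as the default.
        default = counter.most_common(1)[0][0]
        table1[ch] = default
        others = [p for p in counter if p != default]
        if others:  # only polyphonic characters enter the second table
            table2[ch] = others
    return table1, table2

# Tiny hypothetical dictionary, used only for illustration.
entries = [
    ("倒影", ["dao4", "ying3"]),
    ("倒下", ["dao3", "xia4"]),
    ("打倒", ["da3", "dao3"]),
    ("电影", ["dian4", "ying3"]),
]
table1, table2 = build_tables(entries)
print(table1["倒"], table2["倒"])
```

In this toy data "dao3" occurs twice and "dao4" once for "倒", so "dao3" becomes the default and "dao4" is registered in the second table, while "影" appears only in the first table.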
Table 1 shows an illustrative example of the first character-pronunciation mapping table (the default pronunciation table). Note that Table 1 is an example using the Big5 character set; other character sets can also be used.
Table 1: First character-pronunciation mapping table
Table 2 shows an illustrative example of the second character-pronunciation mapping table (the non-default pronunciation table). Note that Table 2 is an example using the Big5 character set; other character sets can also be used.
Table 2: Second character-pronunciation mapping table
T2_INDEX (12 bits) | Chinese character (16 bits) | Pronunciation 1 (16 bits) | Pronunciation 2 (16 bits) | Pronunciation 3 (16 bits) | Pronunciation 4 (16 bits) | Pronunciation 5 (16 bits)
107 | 倒 ("fall") | dao4 | N/A | N/A | N/A | N/A
308 | 和 ("with") | he4 | huo4 | han4 | huo5 | hu2
As can be seen from Table 1, the first character-pronunciation mapping table can contain the following information. For a Chinese character, the first table can contain: the index of the character in the first table (T1_INDEX, 14 bits), the character itself (16 bits, for example encoded according to a common character set such as Big5 or GB2312), and the default pronunciation of the character (16 bits). Because each Chinese character has only one default pronunciation, one entry of the first table corresponds to the combination of a character and its default pronunciation. Although not shown in Table 1, the default pronunciation can be rendered with its regular tone or with the neutral tone. The 16-bit code can carry information indicating whether the default pronunciation takes the neutral tone. Although this information may not appear in the first table itself, it can be regarded as an index of the pronunciation in the first table.
As can be seen from Table 2, the second character-pronunciation mapping table can contain the following information. For a Chinese character, the second table can contain: the index of the character in the second table (T2_INDEX, 12 bits), the character itself (16 bits, for example encoded according to a common character set such as Big5 or GB2312), and one or more non-default pronunciations of the character (16 bits each). Because each character in the second table has one or more non-default pronunciations, the entry for a character holds all of its non-default pronunciations, each with its own column index, and each non-default pronunciation in the entry can be regarded as corresponding to the combination of the character and that non-default pronunciation.
In this way, every combination of a Chinese character and one of its pronunciations is registered in either the first or the second character-pronunciation mapping table.
Note that the first and second tables shown in Table 1 and Table 2 are for illustration only and are not intended to limit the form of the first and second character-pronunciation mapping tables. For example, the Chinese characters and/or pronunciations in the two tables can take forms other than those shown in Table 1 and Table 2.
The indexes of the Chinese characters in the first and second tables can be designed arbitrarily, as long as each Chinese character corresponds to exactly one index within a table. Preferably, the index starts at 0 and increments by 1, so that fewer bits are needed to represent each index in binary. Note that the index of a character in the first table can differ from the index of the same character in the second table. Because the second table contains far fewer characters (only polyphonic characters have non-default pronunciations and are therefore registered in it), the index of a character in the second table takes fewer bits than an index in the first table.
Returning to Fig. 5, in step S540, each combination of a Chinese character and its pronunciation in each word of the dictionary to be compressed is compressed into a 16-bit code based on the index or position of this combination in the first or second character-pronunciation mapping table, so as to form a compressed dictionary containing information in the form of 16-bit codes.
Fig. 7 is a flowchart of an exemplary compression process for implementing step S540.
In step S710, a word in the dictionary to be compressed is obtained. In step S720, a combination of a Chinese character in this word and its pronunciation is obtained. In step S730, it is determined whether this pronunciation is the default pronunciation. If it is ("Yes" in step S730), the process advances to step S740, where the combination is compressed into a 16-bit code according to the first character-pronunciation mapping table. If it is not, that is, it is a non-default pronunciation ("No" in step S730), the process advances to step S750, where the combination is compressed into a 16-bit code according to the second character-pronunciation mapping table. After the combination has been compressed in step S740 or S750, the process advances to step S760 to determine whether any uncompressed Chinese character remains in the word. If so ("Yes" in step S760), the process returns to step S720 to obtain the next Chinese character in the word and its pronunciation. If not ("No" in step S760), the process advances to step S770 to determine whether any uncompressed word remains in the dictionary. If so ("Yes" in step S770), the process returns to step S710 to obtain the next word in the dictionary. If not ("No" in step S770), the compression of the dictionary is complete and the compressed dictionary is obtained.
Specifically, the 16-bit code contains information on the index or position of the Chinese character in the first or second character-pronunciation mapping table, and information on the index or position, in the same table, of the pronunciation corresponding to this character.
For example, the 16-bit code can carry the following information. One bit of the 16-bit code indicates whether the combination of the Chinese character and its pronunciation is in the first or in the second table. If the combination is in the first table, 14 of the remaining bits represent the index or position of the character in the first table (for example, the T1_INDEX shown in Table 1), and the remaining 1 bit is a neutral-tone flag indicating whether the pronunciation takes the neutral tone. Although not shown in Table 1, the neutral-tone bit in effect serves as the index of the default pronunciation in the first table. Otherwise, if the combination is in the second table, 12 of the remaining bits represent the index or position of the character in the second table (for example, the T2_INDEX shown in Table 2), and the remaining 3 bits represent the position of the corresponding pronunciation in the second table (for example, which column of the second table holds this non-default pronunciation).
For example, in the 16-bit code "0011101101100000" representing one combination of a Chinese character and its pronunciation, the first bit "0" indicates that the pronunciation is the default pronunciation, the next 14 bits "01110110110000" give the index (T1_INDEX) of the character in the first table (according to Table 1, T1_INDEX "01110110110000" denotes the Chinese character "影"), and the last bit "0" indicates that the pronunciation does not take the neutral tone.
As another example, in the 16-bit code "1000001101011000" representing another combination of a Chinese character and its pronunciation, the first bit "1" indicates that the pronunciation is a non-default pronunciation, the next 12 bits "000001101011" give the index (T2_INDEX) of the character in the second table (according to Table 2, T2_INDEX "000001101011" denotes the Chinese character "倒"), and the last 3 bits "000" give the position of the pronunciation in the second table, that is, the pronunciation "dao4".
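The bit layout described above can be illustrated with a short Python sketch. The function names `pack_default` and `pack_non_default` are hypothetical; the layout (1 flag bit, then either a 14-bit index plus a 1-bit neutral-tone flag, or a 12-bit index plus a 3-bit column position) follows the description, and the index values are the ones used in the two examples.

```python
def pack_default(t1_index, neutral_tone=False):
    """Flag bit 0 | 14-bit index in the first table | 1-bit neutral-tone flag."""
    if not 0 <= t1_index < (1 << 14):
        raise ValueError("T1_INDEX must fit in 14 bits")
    return (t1_index << 1) | int(neutral_tone)

def pack_non_default(t2_index, column):
    """Flag bit 1 | 12-bit index in the second table | 3-bit pronunciation column."""
    if not 0 <= t2_index < (1 << 12):
        raise ValueError("T2_INDEX must fit in 12 bits")
    if not 0 <= column < (1 << 3):
        raise ValueError("column must fit in 3 bits")
    return (1 << 15) | (t2_index << 3) | column

# Index values taken from the two examples in the text.
code_ying = pack_default(0b01110110110000)       # default pronunciation, regular tone
code_dao = pack_non_default(0b000001101011, 0)   # first non-default pronunciation

print(format(code_ying, "016b"))
print(format(code_dao, "016b"))
```

Both printed strings reproduce the 16-bit codes given in the examples.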
Taking the Big5 character set as an illustrative example, the first character-pronunciation mapping table can cover all Chinese characters defined in Big5. In this way, each index in the first table can have a one-to-one correspondence, or mapping, with a Chinese character of the Big5 character set. Many mapping methods are applicable. As an example, the following formula is a simple way of converting the Big5 code of a Chinese character into its index.
If L < 128, then T1_INDEX = (H - 0xA4) * 0xA0 + L - 0x40;
otherwise, T1_INDEX = (H - 0xA4) * 0xA0 + L - 0x60,
where H is the first byte of the Big5 code of the character, L is the second byte of the Big5 code of the character, and T1_INDEX is the index of the character in the first table. The conversion from this index back to the Big5 code can be done similarly, as follows:
H = T1_INDEX / 160 + 164;
If T1_INDEX % 160 ≤ 64, then L = (T1_INDEX % 160) + 64;
otherwise, L = (T1_INDEX % 160) + 96,
where T1_INDEX / 160 denotes the integer part of the quotient obtained by dividing T1_INDEX by 160, and T1_INDEX % 160 denotes the remainder obtained by dividing T1_INDEX by 160.
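The two conversions can be written out in Python as a sketch. The function names are hypothetical; the arithmetic follows the formulas above. The round-trip check assumes the usual Big5 second-byte ranges 0x40-0x7E and 0xA1-0xFE, which is an assumption not stated in the text.

```python
def big5_to_index(h, l):
    """Map the two bytes (h, l) of a Big5 code to T1_INDEX."""
    if l < 128:
        return (h - 0xA4) * 0xA0 + l - 0x40
    return (h - 0xA4) * 0xA0 + l - 0x60

def index_to_big5(t1_index):
    """Inverse mapping from T1_INDEX back to the two Big5 bytes."""
    h = t1_index // 160 + 164          # integer part of the quotient
    r = t1_index % 160                 # remainder
    l = r + 64 if r <= 64 else r + 96
    return h, l

# Round trip for one character, using Python's built-in "big5" codec.
h, l = "影".encode("big5")
idx = big5_to_index(h, l)
print(idx, index_to_big5(idx) == (h, l))
```

Because the low second-byte range yields remainders 0 to 62 and the high range yields 65 to 158, the threshold of 64 in the inverse formula separates the two cases without ambiguity.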
Note that since the Big5 character set contains 16239 Chinese characters, it is sufficient for the first character-pronunciation mapping table to have 16239 entries corresponding to these 16239 characters. In other words, the above formulas in no way limit the specific mapping between the Big5 code and the index (T1_INDEX). Preferably, the index starts at 0 and increments by 1. That is, when the Big5 character set is used, for example, the indexes in the first table run from 0 to 16238 (which can be represented with 14 bits).
For the indexes in the second character-pronunciation mapping table, a similar correspondence or mapping can be set up between the polyphonic Chinese characters and the index (T2_INDEX). Because only 943 Chinese characters are polyphonic in the case of the Big5 character set, the index of a character in the second table takes fewer bits (namely 12 bits) than the index of a character in the first table.
Note that although the Big5 character set is used above as an example, the dictionary compression method according to the present invention can also be applied to Chinese characters represented in other Chinese character sets.
Now, for the method for the compression dictionary according to the present invention, will a concrete example be described.
In the process of compression dictionary, word and its pronunciation " inverted image (dao4ying3) " are transfused to, and compressed.
At first, Chinese character " falls " and is acquired with its pronunciation, and be compressed into 16 bits of encoded 1000001101011000 (0x86B0), wherein, first bit " 1 " expression pronunciation " dao4 " is the non-acquiescence pronunciation that Chinese character " falls ", 12 bits " 000001101011 " expression Chinese character then " falls " index (T2INDEX) in the second word tone mapping table, last 3 bits " 000 " represent this position of pronunciation in the second word tone mapping table, that is, pronunciation " dao4 " is first non-acquiescence pronunciation that this Chinese character " falls ".Then, Chinese character " shadow " is acquired with its pronunciation, and be compressed into 16 bits of encoded 0011101101100000 (0x3B60), wherein first bit " 0 " expression pronunciation " ying3 " is the acquiescence pronunciation of this Chinese character " shadow ", the index (T1INDEX) of 14 bits " 01110110110000 " expression Chinese character " shadow " in the first word tone mapping table then, last bit " 0 " represents that this pronunciation is not gently to read.Then, two 16 bits of encoded that the word of compression-namely is corresponding with these two Chinese characters and its pronunciation-be stored in the dictionary with compressed format, described two 16 bits of encoded represent Chinese word " inverted image " and its pronunciation " dao4ying3 ".
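The worked example can be reproduced with a small Python sketch of the per-word compression of Fig. 7. The table contents here are hypothetical stand-ins: the indexes 107 and 01110110110000 come from the example, while the first-table index of "倒" (taken from the search example later in the description) and its default pronunciation "dao3" are assumptions made only so the lookup has something to compare against. The neutral-tone flag is left at 0 for simplicity.

```python
# Hypothetical miniature tables: character -> (index, pronunciation data).
T1 = {
    "倒": (0b00101011111001, "dao3"),   # default pronunciation assumed to be dao3
    "影": (0b01110110110000, "ying3"),
}
T2 = {
    "倒": (0b000001101011, ["dao4"]),   # non-default pronunciations by column
}

def compress_word(chars, prons, t1=T1, t2=T2):
    """Compress each (character, pronunciation) pair into one 16-bit code."""
    codes = []
    for ch, pron in zip(chars, prons):
        t1_index, default = t1[ch]
        if pron == default:
            # Flag bit 0, 14-bit index, neutral-tone bit left at 0 in this sketch.
            codes.append(t1_index << 1)
        else:
            t2_index, others = t2[ch]
            column = others.index(pron)  # position of the non-default pronunciation
            codes.append((1 << 15) | (t2_index << 3) | column)
    return codes

codes = compress_word("倒影", ["dao4", "ying3"])
print([format(c, "016b") for c in codes])
print([hex(c) for c in codes])
```

Packing the binary string 1000001101011000 gives 0x8358 in hexadecimal, and 0011101101100000 gives 0x3B60.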
The method of compressing a Chinese electronic dictionary according to the present invention has been described above. The steps of the method can be carried out on a computer by a computer program.
A method of processing words using the compressed dictionary is now described with reference to Fig. 8 and Fig. 9.
Fig. 8 is a flowchart illustrating a method of processing words using the compressed dictionary. The compressed dictionary is a dictionary compressed by one of the methods of the present invention described above. That is, each combination of a Chinese character and its pronunciation in each word of the compressed dictionary has been compressed into a 16-bit code by one of those methods.
In step S810, the first and second character-pronunciation mapping tables are obtained. For example, the two tables can be loaded from a server, a storage device, or the like. The two tables can be of any of the types described above.
In step S820, the 16-bit code corresponding to a combination of a Chinese character and its pronunciation in a word of the compressed dictionary is decompressed using the first or the second character-pronunciation mapping table.
The 16-bit code representing the combination of a Chinese character in the word and its pronunciation is thus decompressed, and the combination can be presented.
Fig. 9 is a flowchart illustrating a method of processing a word input by a user using the compressed dictionary. The compressed dictionary is a dictionary compressed by one of the methods of the present invention described above. That is, each combination of a Chinese character and its pronunciation in each word of the compressed dictionary has been compressed into a 16-bit code by one of those methods. In the method of Fig. 9, the word input by the user is processed, and the combination of the word and its pronunciation (that is, each combination of a Chinese character of the word and its pronunciation) is output.
In step S910, the user inputs a word. A word can consist of several Chinese characters.
In step S920, the first and second character-pronunciation mapping tables are obtained. For example, the two tables can be loaded from a server, a storage device, or the like. The two tables can be of any of the types described above.
In step S930, the compressed form of the input word is searched for in the compressed dictionary by means of the first and second character-pronunciation mapping tables. The compressed form of the input word is the word as compressed by the method described above (see, for example, the description of step S540). That is, the compressed form of the input word consists of several 16-bit codes, each representing the combination of one Chinese character of the word and its pronunciation. By performing step S930, the pronunciation of the input word can be determined.
The following is an exemplary way of implementing step S930, in which the compressed form of the input word is searched for in the compressed dictionary by means of the first and second character-pronunciation mapping tables.
First, each Chinese character of the word is looked up in both the first and the second table. The first table (the default pronunciation table) contains entries for all Chinese characters, so for each character of the word there is a 16-bit code containing its index (T1_INDEX) in the first table. However, some characters (for example, characters that are not polyphonic) do not appear in the second table (the non-default pronunciation table), so not every character of the word necessarily has a 16-bit code containing an index (T2_INDEX) in the second table.
Then a matching process is performed to determine which compressed word (made up of several 16-bit codes) in the compressed dictionary matches one of the combinations of candidate 16-bit codes for these characters. For example, suppose a word consists of two Chinese characters, where the first character has a candidate 16-bit code (c1_1) containing its T1_INDEX (T1_INDEX_1) and a candidate 16-bit code (c2_1) containing its T2_INDEX (T2_INDEX_1), and the second character has a candidate 16-bit code (c1_2) containing its T1_INDEX (T1_INDEX_2) and a candidate 16-bit code (c2_2) containing its T2_INDEX (T2_INDEX_2). The compressed dictionary is then searched for a compressed word containing one of (c1_1, c1_2), (c2_1, c1_2), (c1_1, c2_2) and (c2_1, c2_2). The first bit of a candidate code containing a T1_INDEX (for example c1_1 or c1_2) can be, for example, "0", indicating that the code is based on the first table. Similarly, the first bit of a candidate code containing a T2_INDEX (for example c2_1 or c2_2) can be, for example, "1", indicating that the code is based on the second table. The 14-bit index (T1_INDEX) and the 12-bit index (T2_INDEX) are uniquely determined for a given Chinese character. The other, undetermined bits can take any possible value and are ignored in the matching process.
After the compressed form of the word, consisting of several 16-bit codes, has been found in the compressed dictionary in step S930, each of these 16-bit codes is extracted in step S940 and decompressed into the corresponding combination of a Chinese character and its pronunciation, so that the combination of the word and its pronunciation is obtained using the first and/or second character-pronunciation mapping table.
In step S950, the pronunciation of the word, or the combination of the word and its pronunciation, is output. The pronunciation of the word can be output as text and/or speech.
According to one embodiment of the present invention, the decompression step S820 or S940 can comprise the following steps:
- a first determining step of determining, from one predetermined bit of the 16-bit code (for example the first bit), whether the combination of the Chinese character and its pronunciation is in the first or in the second character-pronunciation mapping table; and
- a second determining step in which, if the first determining step finds that the combination is in the first table, the index or position of the character in the first table is determined from 14 predetermined bits of the remaining bits and whether the pronunciation takes the neutral tone is determined from the remaining 1 bit, or, if the first determining step finds that the combination is in the second table, the index or position of the character in the second table is determined from 12 predetermined bits of the remaining bits and the position of the pronunciation of the character in the second table is determined from the remaining 3 bits.
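The two determining steps can be sketched in Python as follows. The function name `decode` and the index-keyed table layout are hypothetical; the bit positions follow the description (first bit as table flag, then 14 + 1 bits or 12 + 3 bits). The table contents are the entries used in the examples.

```python
# Hypothetical index-keyed tables for decompression.
TABLE1 = {0b01110110110000: ("影", "ying3")}    # T1_INDEX -> (character, default pronunciation)
TABLE2 = {0b000001101011: ("倒", ["dao4"])}     # T2_INDEX -> (character, non-default pronunciations)

def decode(code, table1=TABLE1, table2=TABLE2):
    """Return (character, pronunciation, neutral_tone) for one 16-bit code."""
    # First determining step: the leading bit selects the table.
    if (code >> 15) & 1 == 0:
        # Second determining step, first table: 14-bit index, 1-bit neutral-tone flag.
        t1_index = (code >> 1) & 0x3FFF
        ch, pron = table1[t1_index]
        return ch, pron, bool(code & 1)
    # Second determining step, second table: 12-bit index, 3-bit column position.
    t2_index = (code >> 3) & 0xFFF
    column = code & 0b111
    ch, prons = table2[t2_index]
    return ch, prons[column], False

print(decode(0b0011101101100000))
print(decode(0b1000001101011000))
```

The neutral-tone flag is returned as a separate boolean here; how a neutral tone is rendered in the output pronunciation is left open in this sketch.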
A concrete example of the method of processing an input word using the compressed dictionary according to the present invention is now described.
The user inputs the word "倒影" ("inverted image") and expects its pronunciation as output.
First, the 16-bit code (C1_1) for the combination of the Chinese character "倒" and its default pronunciation is obtained as 000101011111001X, where the first bit "0" indicates that the pronunciation is in the first character-pronunciation mapping table, the 2nd to 15th bits "00101011111001" give the index (T1_INDEX_1) of the character in the first table, and the last bit "X" can take any possible value. In addition, the 16-bit code (C2_1) for the combination of the character "倒" and a non-default pronunciation is obtained as 1000001101011XXX, where the first bit "1" indicates that the pronunciation is in the second character-pronunciation mapping table, the 2nd to 13th bits "000001101011" give the index (T2_INDEX_1) of the character in the second table, and the last 3 bits "XXX" can take any possible value. Further, the 16-bit code (C1_2) for the combination of the Chinese character "影" and its default pronunciation is obtained as 001110110110000X, where the first bit "0" indicates that the pronunciation is in the first table, the 2nd to 15th bits "01110110110000" give the index (T1_INDEX_2) of the character in the first table, and the last bit "X" can take any possible value. Because the character "影" is not polyphonic and has no non-default pronunciation, there is no 16-bit code (C2_2) containing an index (T2_INDEX_2) of this character in the second table.
Then the compressed dictionary is searched for a compressed word containing (C1_1, C1_2) or (C2_1, C1_2).
As a result, the compressed word containing (C2_1, C1_2), corresponding to "倒影 dao4ying3", is found in the compressed dictionary.
The compressed word found is then decompressed using the first and/or second character-pronunciation mapping table, and the pronunciation "dao4ying3" of the word is obtained and output.
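The search with undetermined ("X") bits can be sketched in Python using a value and a mask per candidate code. The function names, the index dictionaries, and the list-of-tuples layout of the compressed dictionary are hypothetical; the index values for "倒" and "影" are taken from the example, and the index for "下" is made up for illustration.

```python
T1_INDEX_OF = {"倒": 0b00101011111001, "影": 0b01110110110000, "下": 100}
T2_INDEX_OF = {"倒": 0b000001101011}   # only polyphonic characters appear here

def candidates(ch, t1=T1_INDEX_OF, t2=T2_INDEX_OF):
    """Return (value, mask) pairs; bits outside the mask are the 'X' bits."""
    cands = [(t1[ch] << 1, 0xFFFE)]                    # ignore the neutral-tone bit
    if ch in t2:
        cands.append(((1 << 15) | (t2[ch] << 3), 0xFFF8))  # ignore the 3 column bits
    return cands

def find_word(word, compressed_dict):
    """Return the first compressed entry whose codes match the word's candidates."""
    per_char = [candidates(ch) for ch in word]
    for entry in compressed_dict:
        if len(entry) != len(word):
            continue
        if all(any((code & mask) == value for value, mask in cands)
               for code, cands in zip(entry, per_char)):
            return entry
    return None

# Hypothetical compressed dictionary: one tuple of 16-bit codes per word.
compressed_dict = [
    (0x15F2, 0x00C8),   # "倒" with its default pronunciation, followed by "下"
    (0x8358, 0x3B60),   # "倒" (non-default dao4) followed by "影"
]
match = find_word("倒影", compressed_dict)
print([format(c, "016b") for c in match])
```

A linear scan is used only to keep the sketch short; the description does not prescribe how the compressed dictionary is organised for lookup.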
The method of processing words using the compressed dictionary has been described above with reference to Fig. 8 and Fig. 9. The steps of the method can be carried out on a computer by a computer program.
The above method of compressing a Chinese electronic dictionary can be implemented as a system by a computer program (for example, the system 2000 described above), and the method of processing words using the compressed dictionary can likewise be implemented as a system by a computer program (for example, the system 3000 described above). In this case, each step corresponds to a functional module (device), and each system can be included in another system operated by a computer program, or can be installed in an electronic device as a functional module.
For example, the text-to-speech system 4000 for converting input text into speech can include the system 3000 described above.
Fig. 10 is a flowchart of the process performed by the text-to-speech system 4000 using a compressed dictionary produced by any of the compression methods described above. Note that each step of the flowchart in Fig. 10 corresponds to a device of the system 4000 in Fig. 4.
In step S1010, text is input, for example by a user. Step S1010 corresponds to the text input device 4100 in Fig. 4.
In step S1020, the text is segmented into words according to the compressed dictionary. The segmentation of text into words can be carried out by any prior-art technique. Step S1020 corresponds to the text word-segmentation device 4310 in Fig. 4.
The process then advances to step S910, and each word is processed together with its pronunciation according to the flowchart of Fig. 9. Through this processing, the pronunciation of each word can be determined and transcribed phonetically. Then, in step S950, the result is output as the pronunciation of each word in the text, and the process advances to step S1030 of the flowchart of Fig. 10. This process according to the flowchart of Fig. 9 corresponds to the word processing system 3000 in Fig. 3.
The combination of step S1020 and the process according to the flowchart of Fig. 9 corresponds to the text processing device 4300 in Fig. 4.
In step S1030, speech is generated based on the result. Step S1030 corresponds to the speech generation device 4500 in Fig. 4.
With the above method performed in the text-to-speech system 4000, text-to-speech conversion can be achieved using the word processing system 3000 of the present invention.
Further, one or more of the above systems can be installed in an electronic device such as an electronic dictionary device, an electronic learning machine, a personal digital assistant, a mobile phone, a video camera, or a multi-function peripheral.
For example, the electronic device can include a compressed dictionary according to the present invention together with the word processing system 3000 or the text-to-speech system 4000 described above. Preferably, the electronic device has an interface for at least one of a screen, a keyboard and a mouse as its input device, and an interface for at least one of a loudspeaker, an earphone and a display device as its output device.
According to the present invention, the storage required for an electronic dictionary to hold pronunciation information is greatly reduced. In particular, the present invention reduces the size of the pronunciation information for each word.
In a basic electronic dictionary used for an NLP (natural language processing) module, there are about 70000 words in total, the longest of which consists of four Chinese characters; 33525 of these words contain no Chinese character with multiple pronunciations, and 28014 words contain exactly one Chinese character with multiple pronunciations.
With the method proposed in Chinese patent application CN200310114889.8, the additional storage space needed to store the pronunciation information is greater than 27 kbytes and less than 43 kbytes. With the method according to the present invention, by contrast, the additional storage space is reduced to 0, because no additional bits are needed to store the pronunciations.
The method and system of the present invention can be implemented in many ways. For example, they can be implemented by software, hardware, firmware, or any combination thereof. The order of the method steps described above is merely illustrative, and the method steps of the present invention are not limited to the order specifically described above unless otherwise expressly stated. In addition, in some embodiments, the present invention can also be implemented as a program recorded in a recording medium, comprising machine-readable instructions for realizing the method according to the present invention. Thus, the present invention also covers a recording medium storing a program for realizing the method according to the present invention.
Although some specific embodiments of the present invention have been shown in detail by way of example, those skilled in the art will appreciate that the above examples are intended to be illustrative only and do not limit the scope of the invention. Those skilled in the art will also appreciate that the above embodiments can be modified without departing from the scope and spirit of the present invention. The scope of the present invention is defined by the appended claims.

Claims (25)

1. A method of compressing electronic data in a dictionary using a computer, comprising:
an input step of inputting a dictionary to be compressed, words and their pronunciations being stored in the dictionary to be compressed in the form of electronic data;
a first word tone mapping table forming step of, for each Chinese character in the dictionary to be compressed, determining one pronunciation as the default pronunciation, and forming a first word tone mapping table based on the default pronunciations;
a second word tone mapping table forming step of, for each polyphonic Chinese character in the dictionary to be compressed, determining the remaining pronunciations other than the default pronunciation as non-default pronunciations, and forming a second word tone mapping table based on the non-default pronunciations; and
a compression step of compressing the combination of each Chinese character in each word in the dictionary to be compressed and its pronunciation into a 16-bit code, according to information about the index or position of the combination in the first word tone mapping table or the second word tone mapping table, so as to form a compressed dictionary comprising the information in the form of 16-bit codes,
wherein the 16-bit code comprises information about the index or position of the Chinese character in the first word tone mapping table or the second word tone mapping table, and information about the index or position of the pronunciation corresponding to the Chinese character in the first word tone mapping table or the second word tone mapping table.
2. the method for claim 1, wherein
This Chinese character on statistics the pronunciation of frequent use be confirmed as the acquiescence pronunciation.
3. method as claimed in claim 2, wherein
In the middle of all pronunciations corresponding with described Chinese character, maximum pronunciations is confirmed as the acquiescence pronunciation for this Chinese character occurrence number in dictionary to be compressed.
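As an illustration of the table-forming steps in claims 1 to 3, the following is a minimal sketch rather than the patented implementation. It assumes the dictionary to be compressed is given as (word, list of per-character pinyin strings) pairs; the function name `build_tables`, the pinyin notation, the toy entries, and the choice to sort characters by code point are all hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(dictionary):
    """Form the two mapping tables from (word, pronunciations) pairs.

    Assumption: each word comes with one pronunciation per character.
    """
    # Count how often each pronunciation occurs for each character.
    counts = defaultdict(Counter)
    for word, prons in dictionary:
        for ch, pron in zip(word, prons):
            counts[ch][pron] += 1

    first_table = []   # index -> (character, default pronunciation)
    second_table = []  # index -> (polyphonic character, non-default pronunciations)
    for ch in sorted(counts):  # table order is an assumption of this sketch
        ranked = [pron for pron, _ in counts[ch].most_common()]
        first_table.append((ch, ranked[0]))         # most frequent = default
        if len(ranked) > 1:                         # polyphonic character
            second_table.append((ch, ranked[1:]))   # remaining pronunciations
    return first_table, second_table

# Hypothetical toy dictionary, not data from the patent.
toy = [
    ("银行", ["yin2", "hang2"]),
    ("行人", ["xing2", "ren2"]),
    ("行走", ["xing2", "zou3"]),
    ("人行道", ["ren2", "xing2", "dao4"]),
]
first_table, second_table = build_tables(toy)
```

In this toy data, 行 occurs three times as "xing2" and once as "hang2", so "xing2" becomes its default pronunciation and "hang2" goes into the second table.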
4. the method for claim 1, wherein
A bit in described 16 bits of encoded represents that described combination is arranged in the first word tone mapping table or the second word tone mapping table;
Be arranged in this combination in the situation of the first word tone mapping table, 14 bits in the remaining bits represent about this Chinese character in the first word tone mapping table index or the information of position, remaining bit is as gently reading sign;
Be arranged in this combination in the situation of the second word tone mapping table, 12 bits in the remaining bits represent about this Chinese character in the second word tone mapping table index or the information of position, and remaining 3 bits represent the information about the position of the pronunciation corresponding with this Chinese character in the second word tone mapping table.
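The bit budget of claim 4 (1 + 14 + 1 bits, or 1 + 12 + 3 bits) can be sketched as follows. The claim does not say which bit positions are used, so the layout here is purely an assumption for illustration: the most significant bit selects the table, the index follows, and the flag or pronunciation position sits in the lowest bits. The function names are hypothetical.

```python
def encode_default(char_index, light=False):
    """Pack a first-table entry: 1 selector bit (0), 14-bit index, 1 flag bit."""
    if not 0 <= char_index < (1 << 14):
        raise ValueError("first-table index must fit in 14 bits")
    # Assumed layout: bit 15 = 0, bits 14..1 = index, bit 0 = light-reading flag.
    return (char_index << 1) | int(light)

def encode_non_default(char_index, pron_pos):
    """Pack a second-table entry: 1 selector bit (1), 12-bit index, 3-bit position."""
    if not 0 <= char_index < (1 << 12):
        raise ValueError("second-table index must fit in 12 bits")
    if not 0 <= pron_pos < (1 << 3):
        raise ValueError("pronunciation position must fit in 3 bits")
    # Assumed layout: bit 15 = 1, bits 14..3 = index, bits 2..0 = position.
    return (1 << 15) | (char_index << 3) | pron_pos
```

Under this assumed layout, the first table can address up to 16384 characters and the second table up to 4096 polyphonic characters with up to 8 non-default pronunciations each; every combination still fits in exactly 16 bits.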
5. A system for compressing electronic data in a dictionary using a computer, comprising:
an input device for inputting a dictionary to be compressed, words and their pronunciations being stored in the dictionary to be compressed in the form of electronic data;
a first word tone mapping table forming device for, for each Chinese character in the dictionary to be compressed, determining one pronunciation as the default pronunciation, and forming a first word tone mapping table based on the default pronunciations;
a second word tone mapping table forming device for, for each polyphonic Chinese character in the dictionary to be compressed, determining the remaining pronunciations other than the default pronunciation as non-default pronunciations, and forming a second word tone mapping table based on the non-default pronunciations; and
a compression device for compressing the combination of each Chinese character in each word in the dictionary to be compressed and its pronunciation into a 16-bit code, according to information about the index or position of the combination in the first word tone mapping table or the second word tone mapping table, so as to form a compressed dictionary comprising the information in the form of 16-bit codes,
wherein the 16-bit code comprises information about the index or position of the Chinese character in the first word tone mapping table or the second word tone mapping table, and information about the index or position of the pronunciation corresponding to the Chinese character in the first word tone mapping table or the second word tone mapping table.
6. The system of claim 5, wherein
the pronunciation of the Chinese character that is statistically most frequently used is determined as the default pronunciation.
7. The system of claim 6, wherein
among all the pronunciations corresponding to said Chinese character, the pronunciation that occurs the most times for the Chinese character in the dictionary to be compressed is determined as the default pronunciation.
8. The system of claim 5, wherein
one bit of the 16-bit code indicates whether the combination is located in the first word tone mapping table or the second word tone mapping table;
in the case where the combination is located in the first word tone mapping table, 14 of the remaining bits represent information about the index or position of the Chinese character in the first word tone mapping table, and the remaining one bit serves as a light-reading (neutral-tone) flag;
in the case where the combination is located in the second word tone mapping table, 12 of the remaining bits represent information about the index or position of the Chinese character in the second word tone mapping table, and the remaining 3 bits represent information about the position of the pronunciation corresponding to the Chinese character in the second word tone mapping table.
9. A word processing method for electronic equipment, the electronic equipment comprising a compressed dictionary, the combination of each Chinese character in each word in the compressed dictionary and its pronunciation having been compressed into a 16-bit code using the method of any one of claims 1-4, the word processing method comprising:
a word tone mapping table obtaining step of obtaining the first word tone mapping table and the second word tone mapping table; and
a decompression step of decompressing, using the first word tone mapping table or the second word tone mapping table, the 16-bit code corresponding to the combination of a Chinese character in a word in the compressed dictionary and its pronunciation.
10. The word processing method of claim 9, further comprising:
a word input step in which a word is input by a user;
a search step of searching the compressed dictionary for the compressed form of the input word using the first word tone mapping table and the second word tone mapping table, the compressed form of the input word consisting of a plurality of 16-bit codes, each 16-bit code representing the combination of one Chinese character and its pronunciation, wherein in the decompression step, each 16-bit code in the compressed form of said word found in the compressed dictionary is decompressed, using the first word tone mapping table or the second word tone mapping table, into the combination of the Chinese character and its pronunciation; and
an output step of outputting, by text and/or voice, the pronunciation of the word obtained in the decompression step.
11. The word processing method of claim 9 or 10, wherein the decompression step comprises:
a first determining step of determining, according to a predetermined bit in the 16-bit code, whether the combination is located in the first word tone mapping table or the second word tone mapping table; and
a second determining step of, in the case where the first determining step determines that the combination is located in the first word tone mapping table, determining information about the index or position of the Chinese character in the first word tone mapping table according to 14 predetermined bits among the remaining bits, and determining whether the pronunciation is a light reading (neutral tone) according to the remaining one bit; or, in the case where the first determining step determines that the combination is located in the second word tone mapping table, determining information about the index or position of the Chinese character in the second word tone mapping table according to 12 predetermined bits among the remaining bits, and determining information about the position of the pronunciation of the Chinese character in the second word tone mapping table according to the remaining 3 bits.
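The two determining steps of claim 11 can be sketched as a small decoder. This is an illustration under stated assumptions, not the patented implementation: the claim only says the bits are "predetermined", so the positions used here are assumed (most significant bit as the table selector, index next, flag or pronunciation position in the lowest bits). The names `Decoded`, `decode`, and `lookup`, and the toy tables, are hypothetical.

```python
from typing import NamedTuple

class Decoded(NamedTuple):
    table: int       # 1 = first mapping table, 2 = second mapping table
    char_index: int  # index of the character in that table
    pron_pos: int    # position of the pronunciation (second table only)
    light: bool      # light-reading flag (first table only)

def decode(code):
    """Split a 16-bit code into its fields (assumed bit layout)."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("code must fit in 16 bits")
    # First determining step: the selector bit picks the table.
    if code >> 15 == 0:
        # Second determining step, first table: 14-bit index, 1 flag bit.
        return Decoded(1, (code >> 1) & 0x3FFF, 0, bool(code & 1))
    # Second determining step, second table: 12-bit index, 3-bit position.
    return Decoded(2, (code >> 3) & 0x0FFF, code & 0x7, False)

def lookup(code, first_table, second_table):
    """Decompress one code into (character, pronunciation, light flag)."""
    d = decode(code)
    if d.table == 1:
        ch, pron = first_table[d.char_index]
    else:
        ch, alternatives = second_table[d.char_index]
        pron = alternatives[d.pron_pos]
    return ch, pron, d.light

# Hypothetical toy tables, not data from the patent.
first_table = [("人", "ren2"), ("行", "xing2")]
second_table = [("行", ["hang2"])]
```

With these toy tables, `lookup(0x0002, first_table, second_table)` resolves to 行 with its default pronunciation, while `lookup(0x8000, first_table, second_table)` resolves to the same character with its non-default pronunciation from the second table.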
12. A word processing system for electronic equipment, the electronic equipment comprising a compressed dictionary, the combination of each Chinese character in each word in the compressed dictionary and its pronunciation having been compressed into a 16-bit code using the method of any one of claims 1-4, the system comprising:
a word tone mapping table obtaining device for obtaining the first word tone mapping table and the second word tone mapping table; and
a decompression device for decompressing, using the first word tone mapping table or the second word tone mapping table, the 16-bit code corresponding to the combination of a Chinese character in a word in the compressed dictionary and its pronunciation.
13. The word processing system of claim 12, wherein the decompression device comprises:
a first determining device for determining, according to a predetermined bit in the 16-bit code, whether the combination is located in the first word tone mapping table or the second word tone mapping table; and
a second determining device for, in the case where the first determining device determines that the combination is located in the first word tone mapping table, determining information about the index or position of the Chinese character in the first word tone mapping table according to 14 predetermined bits among the remaining bits, and determining whether the pronunciation is a light reading (neutral tone) according to the remaining one bit; or, in the case where the first determining device determines that the combination is located in the second word tone mapping table, determining information about the index or position of the Chinese character in the second word tone mapping table according to 12 predetermined bits among the remaining bits, and determining information about the position of the pronunciation of the Chinese character in the second word tone mapping table according to the remaining 3 bits.
14. The word processing system of claim 12, further comprising:
a word input device through which a word is input by a user;
a search device for searching the compressed dictionary for the compressed form of the input word using the first word tone mapping table and the second word tone mapping table, the compressed form of the input word consisting of a plurality of 16-bit codes, each 16-bit code representing the combination of one Chinese character and its pronunciation, wherein the decompression device decompresses, using the first word tone mapping table or the second word tone mapping table, each 16-bit code in the compressed form of said word found in the compressed dictionary into the combination of the Chinese character and its pronunciation; and
an output device for outputting, by text and/or voice, the pronunciation of the word obtained by the decompression device.
15. The word processing system of claim 14, wherein the decompression device comprises:
a first determining device for determining, according to a predetermined bit in the 16-bit code, whether the combination is located in the first word tone mapping table or the second word tone mapping table; and
a second determining device for, in the case where the first determining device determines that the combination is located in the first word tone mapping table, determining information about the index or position of the Chinese character in the first word tone mapping table according to 14 predetermined bits among the remaining bits, and determining whether the pronunciation is a light reading (neutral tone) according to the remaining one bit; or, in the case where the first determining device determines that the combination is located in the second word tone mapping table, determining information about the index or position of the Chinese character in the second word tone mapping table according to 12 predetermined bits among the remaining bits, and determining information about the position of the pronunciation of the Chinese character in the second word tone mapping table according to the remaining 3 bits.
16. Electronic equipment, comprising:
a compressed dictionary, the combination of each Chinese character in each word in the compressed dictionary and its pronunciation having been compressed into a 16-bit code using the method of any one of claims 1-4; and
the word processing system of any one of claims 12-13.
17. The electronic equipment of claim 16, wherein the electronic equipment is one of an electronic dictionary device, an electronic learning machine, a personal digital assistant, a mobile phone, a video camera, and a multi-function peripheral.
18. Electronic equipment, comprising:
a compressed dictionary, the combination of each Chinese character in each word in the compressed dictionary and its pronunciation having been compressed into a 16-bit code using the method of any one of claims 1-4; and
the word processing system of any one of claims 14-15.
19. The electronic equipment of claim 18, wherein the electronic equipment is one of an electronic dictionary device, an electronic learning machine, a personal digital assistant, a mobile phone, a video camera, and a multi-function peripheral.
20. The electronic equipment of claim 18, further comprising at least one of a screen, a keyboard, and a mouse for interfacing with said word input device.
21. The electronic equipment of claim 18, further comprising at least one of a loudspeaker, an earphone, and a display device for interfacing with said output device.
22. A text-to-speech system for converting text into speech using a compressed dictionary, the text-to-speech system comprising:
a compressed dictionary, the combination of each Chinese character in each word in the compressed dictionary and its pronunciation having been compressed into a 16-bit code using the method of any one of claims 1-4;
a text input device for inputting text;
a text processing device for segmenting the text into words according to the compressed dictionary and annotating said words with their pronunciations; and
a speech generation device for generating speech based on the result of the text processing device.
23. The text-to-speech system of claim 22, wherein the text processing device comprises:
a text segmentation device for segmenting the text into words; and
the word processing system of any one of claims 12-15.
24. Electronic equipment, comprising:
the text-to-speech system of claim 22 or 23;
at least one of a screen, a keyboard, and a mouse for interfacing with said text input device; and
at least one of a loudspeaker, an earphone, and a display for interfacing with said speech generation device.
25. The electronic equipment of claim 24, wherein the electronic equipment is one of an electronic dictionary device, an electronic learning machine, a personal digital assistant, a mobile phone, a video camera, and a multi-function peripheral.
CN 200910176368 2009-09-28 2009-09-28 Method and system for compressing dictionary and processing words, text-to-speed system and electronic equipment Expired - Fee Related CN102033859B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200910176368 CN102033859B (en) 2009-09-28 2009-09-28 Method and system for compressing dictionary and processing words, text-to-speed system and electronic equipment

Publications (2)

Publication Number Publication Date
CN102033859A CN102033859A (en) 2011-04-27
CN102033859B true CN102033859B (en) 2013-04-10

Family

ID=43886774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200910176368 Expired - Fee Related CN102033859B (en) 2009-09-28 2009-09-28 Method and system for compressing dictionary and processing words, text-to-speed system and electronic equipment

Country Status (1)

Country Link
CN (1) CN102033859B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102779141B (en) * 2011-05-12 2017-03-01 阿尔派株式会社 Facility data retrieval device and navigation system
CN104599670B (en) * 2015-01-30 2017-12-26 泰顺县福田园艺玩具厂 The audio recognition method of talking pen

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1194504A (en) * 1997-03-26 1998-09-30 富士通株式会社 Data compression/decompression apparatus/method and program recording medium
CN1512308A (en) * 2002-12-27 2004-07-14 佳能株式会社 Character processing method, device and storage medium
CN1614584A (en) * 2003-11-07 2005-05-11 佳能株式会社 Electronic dictionary and its data structure forming method and spelling information determining method
CN1779624A (en) * 2005-08-02 2006-05-31 高明利 Chinese coding and input method on syllable compression platform and keyboard
CN1883959A (en) * 2005-06-21 2006-12-27 容毅 Compression method for words and phonetic alphabet in English electronic dictionary data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130410

Termination date: 20170928