TW527784B - Method for compressing statistical data characteristics - Google Patents


Info

Publication number
TW527784B
Authority
TW
Taiwan
Prior art keywords
data
language
compressing
compression
statistical
Prior art date
Application number
TW89127106A
Other languages
Chinese (zh)
Inventor
Fred H Y Chen
Wei-Guo Wu
Original Assignee
Inventec Besta Co Ltd
Priority date
2000-12-18
Filing date
2000-12-18
Publication date
2003-04-11
Application filed by Inventec Besta Co Ltd
Priority to TW89127106A
Application granted
Publication of TW527784B

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present invention provides a method of compressing data by means of its statistical characteristics. The method applies statistical techniques to find the characteristics of the data (such as the language units of dictionary data and the repetition frequency of each unit) and then divides the data into several blocks according to the uniqueness of those characteristics. The characteristics are also sorted by a specific rule and each is given a serial number, which is used when compression-coding the blocks: each code is replaced by the corresponding serial number, while the parts of the data that have no characteristic are coded with the Huffman compression algorithm. This reduces the storage space required and so improves compression efficiency. For very large data, especially dictionary data whose language units recur frequently, the method therefore allows more efficient compression in offline processing.
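The abstract names Huffman coding for the parts of the data that carry no characteristic. As a reminder of what that step involves, here is a minimal sketch of standard Huffman code-length construction over bytes. It is a textbook illustration, not code from the patent; the function names `huffman_code_lengths` and `huffman_encoded_bits` are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict[int, int]:
    """Return the Huffman code length in bits for each byte value in `data`."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tie_breaker, {symbol: depth_so_far}).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def huffman_encoded_bits(data: bytes) -> int:
    """Payload size in bits when every byte of `data` is Huffman-coded."""
    lengths = huffman_code_lengths(data)
    return sum(lengths[b] for b in data)
```

Frequent symbols get short codes and rare ones long codes, which is why the method reserves this coder for the residue that has no repeated language unit to refer to.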

Description

Background of the invention:

The present invention is a method of compressing data by means of its statistical characteristics, and in particular a method that, during offline processing of very large volumes of data, uses statistical techniques to obtain the characteristics of the data and applies a better-suited compression algorithm to improve compression efficiency.

Prior art:

The electronics industry is developing rapidly, and high-tech products such as computers are advancing quickly. With the widespread use of handheld consumer electronic products, especially personal digital assistants (PDAs), users' demands on such products keep rising, and whether a future handheld product can provide [...] knowledge and other information services has become a mark of whether a high-tech product is technologically leading.

However, current handheld consumer electronic products, and PDAs of all kinds in particular, have limited read-only memory (ROM) space, and so cannot effectively solve the problem of storing very large volumes of data.

Conventionally, data processing such as compression uses a single uniform coding pass: all of the content is coded in the same way, without regard to the repetition rate of the language in the data or to whether the data could be compressed in blocks. For very large data, and especially for data with a high language repetition rate such as dictionary data, failing to work out the best compression scheme for the characteristics of the data may waste read-only memory storage space.

Summary of the invention:

In view of this, and to remedy the shortcomings of the conventional approach, the inventors, after extended research and experiment, developed the compression method of the present invention.

One object of the invention is to provide a method of compressing data by means of its statistical characteristics. The method uses the characteristics found by statistics to divide the data into blocks and to compression-code those blocks, while the parts of the data that have no such characteristics are coded with the Huffman compression algorithm, so as to reduce memory usage and improve data compression efficiency.

To give the examiner a fuller understanding of the purpose, form, structure, features, and effects of the invention, an embodiment is described in detail below with reference to the drawings.

Detailed description:

The invention is a "method of compressing data by means of its statistical characteristics". It is applied during the offline processing of data in an electronic device to help optimize compression and thereby improve compression efficiency, and it is especially effective for data with a high repetition rate.

In the invention, statistical techniques are first used to obtain the characteristics of the data. The data is then divided into a number of small blocks according to the uniqueness of those characteristics, the characteristics are sorted by a specific rule, and each characteristic is assigned a serial number according to that order. The small blocks are then compression-coded: the codes are replaced by the corresponding serial numbers, and the parts of the data that have no characteristic are coded with the Huffman compression algorithm, thereby improving compression efficiency.

The data may in particular be dictionary-type data serving as a learning aid. The language units in such data are highly repetitive, and these language units together with their repetition frequencies are the characteristics of the dictionary data.
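The characteristics described above (language units, their repetition frequencies, and a serial number per unit after sorting) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: treating each run of word characters as a language unit, the `min_count` threshold, and the name `build_unit_table` are all hypothetical, and ordering by word length is the example sorting rule the description itself mentions.

```python
import re
from collections import Counter


def build_unit_table(text: str, min_count: int = 2) -> list[tuple[int, str, int]]:
    """Return (serial, unit, count) rows for units seen at least `min_count` times.

    Assumption: a "language unit" is approximated by a run of word characters.
    """
    counts = Counter(re.findall(r"\w+", text))
    repeated = [unit for unit, c in counts.items() if c >= min_count]
    # Sort by word length; ties are broken alphabetically so the
    # serial numbers are deterministic.
    repeated.sort(key=lambda unit: (len(unit), unit))
    return [(serial, unit, counts[unit]) for serial, unit in enumerate(repeated)]


table = build_unit_table("need mother need ideal mother need")
# "ideal" occurs once, so it gets no serial number and would be
# left to the Huffman coder in the scheme the text describes.
```

Units that fall below the threshold stay out of the table on purpose: giving a serial number to a unit that never repeats would cost table space without saving anything.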
For example, in the Oxford bilingual dictionary, English items such as "shouldn't" and the phonetic notation for "need", and Chinese items such as "理想" (ideal) and "母親" (mother), are language units that each occur many times throughout the data.

To give a more concrete and clear picture of the processing steps the invention applies to such data and of the results it achieves, a preferred embodiment is described in detail below with reference to Figure 1.

The invention is a compression method for dictionary-type data, and in particular an algorithm that helps optimize compression during the offline processing of dictionary data for a personal digital assistant (PDA), thereby improving compression efficiency. It is especially effective when the language units in the dictionary data have a high repetition rate. The method counts the repeated units in the data and replaces each repeated unit with a serial number, codes the remaining non-repeated text with the Huffman compression algorithm, and then uses the characteristics of the data to split the large data set into small blocks for compression, so that less memory is needed and compression efficiency improves. The method can be used when processing dictionary-type data for all kinds of handheld consumer electronic products, and is especially effective for very large data in which language recurs frequently.

Consider a passage of text such as "......" and compile statistics on it:
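Replacing repeated units with serial numbers while leaving the remaining text for a separate coder can be sketched as below. This is an illustration only: the greedy longest-match rule, the token layout, and the names `substitute_units` and `restore_units` are assumptions, and the literal runs are left uncoded here where the text would apply Huffman coding.

```python
def substitute_units(text: str, serial_of: dict[str, int]) -> list[tuple[str, object]]:
    """Split `text` into ("ref", serial) and ("lit", chars) tokens.

    Greedy longest match: at each position the longest known unit wins.
    """
    units = sorted((u for u in serial_of if u), key=len, reverse=True)
    tokens: list[tuple[str, object]] = []
    literal: list[str] = []  # pending characters that match no unit
    i = 0
    while i < len(text):
        match = next((u for u in units if text.startswith(u, i)), None)
        if match is None:
            literal.append(text[i])
            i += 1
            continue
        if literal:
            tokens.append(("lit", "".join(literal)))
            literal = []
        tokens.append(("ref", serial_of[match]))
        i += len(match)
    if literal:
        tokens.append(("lit", "".join(literal)))
    return tokens


def restore_units(tokens, unit_of: dict[int, str]) -> str:
    """Inverse of substitute_units: expand serial numbers back into units."""
    return "".join(unit_of[v] if kind == "ref" else v for kind, v in tokens)


serial_of = {"need": 0, "mother": 1}
tokens = substitute_units("you need your mother, need", serial_of)
unit_of = {serial: unit for unit, serial in serial_of.items()}
restored = restore_units(tokens, unit_of)
```

Each `("ref", n)` token costs only the width of a serial number regardless of how long the unit is, which is where the saving on highly repetitive dictionary text comes from.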
First, the language units in the dictionary data and their repetition frequencies are counted, giving a list of language units and repetition frequencies (as shown in Figure 2).

Second, the language units and repetition frequencies in that list are used to summarize the characteristics of the whole data set and so to arrive at the best scheme for compressing it. According to those characteristics, the data is divided into a number of small data blocks and the address index of each block is stored; this avoids the situation in which the data is so large that the codes become too long and the compression rate suffers. The language units are also sorted by a specific rule of their own (for example, in order of word length), and each is assigned a serial number according to that order.

Finally, the data is coded block by block according to its characteristics: repeated units are coded as their serial numbers in the sorted order, while non-repeated language units are still simply coded with the Huffman compression method. This greatly shortens the coded data, reducing the storage space needed after compression and improving the compression rate.

As the above shows, counting the language units and their repetition frequencies is the key to improving the compression rate, so the algorithm used for this counting is described next.

During processing, an index value is used in place of each actual language unit in the data; that is, an index file is built to help count the repetition frequency of the language units.

Once the language units and their repetition frequencies have been counted, the list of language units and repetition frequencies can be built and the characteristics of the data obtained.

Having set out the optimizations adopted in the design of the method, a specific implementation is illustrated below.
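The block-and-address-index step can be sketched as follows. This is a hedged illustration, not the patent's format: `zlib` stands in for the serial-number-plus-Huffman coder, and the offset-list layout and the names `pack_blocks` and `read_block` are assumptions.

```python
import zlib


def pack_blocks(blocks: list[bytes], encode=zlib.compress) -> tuple[bytes, list[int]]:
    """Encode each block independently.

    Returns the concatenated payload and an address index holding the
    start offset of every block, plus one final end-of-payload offset.
    """
    payload = bytearray()
    offsets: list[int] = []
    for block in blocks:
        offsets.append(len(payload))
        payload += encode(block)
    offsets.append(len(payload))  # sentinel: end of the last block
    return bytes(payload), offsets


def read_block(payload: bytes, offsets: list[int], n: int, decode=zlib.decompress) -> bytes:
    """Decode only block `n`, located through the address index."""
    return decode(payload[offsets[n]:offsets[n + 1]])


entries = [b"need: to require", b"mother: a female parent", b"ideal: perfect"]
payload, offsets = pack_blocks(entries)
second = read_block(payload, offsets, 1)
```

The index costs extra storage, but it lets a device fetch one entry without decoding everything before it, which is presumably why the method accepts the overhead.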

The index file is built by turning the problem of how often language units recur in the data into a sorting task. An index table is built, and the language units that recur most often in the data file are represented by their index values in that table instead of by the actual language units, so that during compression the index table helps count how often each language unit recurs.

For example (Embodiment 1), take an Oxford English-Chinese dictionary data set whose original file is 8,777,139 bytes. The language units and their repetition frequencies are counted first, and the language units are stored in order of word length in a .rep file, which comes to 447,432 bytes. The data is then divided into blocks according to its uniqueness, an address index is created at the head of each block, and the address indexes are stored in an .idx file, which comes to 431,280 bytes. With this done, the data is compressed, giving a compressed result of 4,328,554 bytes. Compressing the same dictionary data with the conventional Huffman compression algorithm gives 5,854,608 bytes. Figure 3 is the comparison table of compression results for this Oxford English-Chinese dictionary.

The data is divided into 107,818 blocks. The compression rate without the address index is [...], the compression rate with the address index is 59.3%, and the compression rate of the general Huffman method is 66.7%. Here the compression rate is the compressed size as a percentage of the original size, so a lower figure is better.

These figures show how compressing the data in blocks compares with compressing it without blocks, and that the compression rate of the present method is a marked improvement on that of conventional Huffman compression.

Embodiment 2 is another edition of the Oxford English-Chinese dictionary, as installed in a PDA product. Figure 4 is the comparison table of compression results for it.

The data is divided into 92,242 small blocks. The compression rate without the address index is 57.7%, the compression rate with the address index is 61.5%, and the compression rate of the general Huffman method is 71.8%.

The analysis of these embodiments shows that the steps of the method, namely counting how often language units recur in the data, replacing repeated units with serial numbers, coding the non-repeated parts with the Huffman compression algorithm, and then using the characteristics of the data to split the large data set into small blocks for compression, reduce the storage space used and improve compression efficiency.

The comparison of actual figures shows that the method not only compresses more efficiently than conventional processing but also, for very large data and especially data whose language units recur frequently, makes counting the repetition frequency of strings faster and more convenient.

The above is only a preferred embodiment of the method and is not intended to limit its scope. Any equivalent change or modification made by a person skilled in the art on the basis of the technical content disclosed here, without departing from the spirit of the invention, falls within the scope of the patent.

Brief description of the drawings:

Figure 1 is a flowchart of the data compression process of the invention.
Figure 2 is the list used by the invention to count the language units of the data and their repetition frequencies.
Figure 3 is the comparison table of compression results for the Oxford English-Chinese dictionary of Embodiment 1.
Figure 4 is the comparison table of compression results for the Oxford English-Chinese dictionary in a PDA product of Embodiment 2.
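The byte counts given for the first Oxford dictionary example can be checked against the reported percentages. The short calculation below is a reader's cross-check, not part of the patent. It assumes that the rate "with the address index" counts the compressed data plus the .rep and .idx files, which reproduces the reported 59.3%; the rate "without the address index" is not legible in the source, so the value computed here is an inference under that same assumption.

```python
ORIGINAL = 8_777_139       # original dictionary file, bytes
UNIT_TABLE = 447_432       # .rep file (language units), bytes
ADDRESS_INDEX = 431_280    # .idx file (block address index), bytes
COMPRESSED = 4_328_554     # compressed data, bytes
HUFFMAN_ONLY = 5_854_608   # conventional Huffman result, bytes
BLOCKS = 107_818           # number of blocks


def rate(size: int) -> float:
    """Compressed size as a percentage of the original, to one decimal."""
    return round(100 * size / ORIGINAL, 1)


with_index = rate(COMPRESSED + UNIT_TABLE + ADDRESS_INDEX)  # matches the reported 59.3
without_index = rate(COMPRESSED + UNIT_TABLE)               # inferred, not in the source
huffman = rate(HUFFMAN_ONLY)                                # matches the reported 66.7
index_bytes_per_block = ADDRESS_INDEX / BLOCKS              # close to 4 bytes per block

print(with_index, without_index, huffman, round(index_bytes_per_block, 2))
```

On these assumptions the address index accounts for roughly five percentage points of the total, and works out to about four bytes per block.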


Claims (1)

1. A method of compressing data by means of its statistical characteristics, in which statistical techniques are used to obtain the characteristics of the data; the data is then divided into a number of small blocks according to the uniqueness of the characteristics; the characteristics are sorted by a specific rule and each is assigned a serial number according to that order; and the small blocks are compression-coded, the codes being replaced by the respective serial numbers, while the parts of the data that have no characteristic are coded with the Huffman compression algorithm.

2. The method of claim 1, wherein, when the small blocks are compression-coded, the address index of each small block may also be stored.

3. The method of claim 1, wherein the data may be dictionary-type data serving as a learning aid, the language units in the data being highly repetitive, and the language units together with their repetition frequencies being the characteristics of the dictionary-type data.

4. The method of claim 3, wherein the data may be statistically compressed by the following steps:
first, counting the language units in the dictionary data and their repetition frequencies to obtain a list of language units and repetition frequencies;
second, using the language units and repetition frequencies in the list to summarize the characteristics of the whole data set and so arrive at the best scheme for compressing it, namely dividing the data, according to its characteristics, into a number of small data blocks and storing the address index of each small block;
also sorting the language units by a specific rule of their own and assigning each a serial number according to that order; and
finally, coding the data block by block according to its characteristics, repeated units being coded as their serial numbers in the sorted order and non-repeated language units still being simply coded with the Huffman compression algorithm.

5. The method of claim 4, wherein the algorithm used for counting replaces each actual language unit in the data with an index value, that is, builds an index file to help count the repetition frequency of the language units in the data.

6. The method of claim 5, wherein the index file is built by converting the frequency with which language units recur in the data into a sorting task and building an index table, the language units that recur most often in the data file being represented by their index values in the index table in place of the actual language units, so that during compression of the data the index table helps count how often the language units recur.
Priority Applications (1)

Application Number Priority Date Filing Date Title
TW89127106A TW527784B (en) 2000-12-18 2000-12-18 Method for compressing statistical data characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW89127106A TW527784B (en) 2000-12-18 2000-12-18 Method for compressing statistical data characteristics

Publications (1)

Publication Number Publication Date
TW527784B true TW527784B (en) 2003-04-11

Family

ID=28787553

Family Applications (1)

Application Number Title Priority Date Filing Date
TW89127106A TW527784B (en) 2000-12-18 2000-12-18 Method for compressing statistical data characteristics

Country Status (1)

Country Link
TW (1) TW527784B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110442489A (en) * 2018-05-02 2019-11-12 阿里巴巴集团控股有限公司 The method and storage medium of data processing
CN110442489B (en) * 2018-05-02 2024-03-01 阿里巴巴集团控股有限公司 Method of data processing and storage medium
CN109412604A (en) * 2018-12-05 2019-03-01 云孚科技(北京)有限公司 A kind of data compression method based on language model

Similar Documents

Publication Publication Date Title
Williams et al. Compressing integers for fast file access
US6047298A (en) Text compression dictionary generation apparatus
Barbay et al. On compressing permutations and adaptive sorting
EP2157573B1 (en) An encoding and decoding method
CN112507065A (en) Code searching method based on annotation semantic information
CN101350624A (en) Method for compressing Chinese text supporting ANSI encode
JPH1079672A (en) Method and device for compressing and decompressing message
TW200824306A (en) Huffman decoding method
EP0450049A1 (en) Character encoding.
TW527784B (en) Method for compressing statistical data characteristics
Hahn A new technique for compression and storage of data
Cooper et al. Text compression using variable‐to fixed‐length encodings
Awajan et al. Hybrid technique for Arabic text compression
Shanmugasundaram et al. Text preprocessing using enhanced intelligent dictionary based encoding (EIDBE)
TW490622B (en) Method for summing up repeat of string in data document
Fraenkel et al. Combinatorial compression and partitioning of large dictionaries
Nagaprasad et al. Authorship attribution based on data compression for telugu text
JPH07182354A (en) Method for generating electronic document
Bell et al. Compressing the digital library
JPH05152971A (en) Data compressing/restoring method
JPH05241776A (en) Data compression system
Prachumrak Weighted Finite Automata encoding over Thai language
JPS5822434A (en) Japanese document processing system
Nithya et al. The Study of Text Compression Algorithms and their Efficiencies Under Different Types of Files
TW385596B (en) A compression method applicable to wide character Unicode file

Legal Events

Date Code Title Description
GD4A Issue of patent certificate for granted invention patent
MM4A Annulment or lapse of patent due to non-payment of fees