TWI579830B

TWI579830B - Chinese Text Normalization System and Method with Semantic Collaborative Processing

Info

Publication number: TWI579830B
Application number: TW104144149A
Authority: TW
Inventors: Pao Ching Chen; Chen Ming Pan; I Bin Liao; Guan Ting Liou; Chen Yu Chiang
Original assignee: Chunghwa Telecom Co Ltd
Priority date: 2015-12-29
Filing date: 2015-12-29
Publication date: 2017-04-21
Also published as: TW201724083A

Description

Chinese text normalization system and method with semantic collaborative processing

The present invention relates to a Chinese text normalization system and method, and in particular to a Chinese text normalization system and method that combines keywords with semantic collaborative processing, intended mainly for text-to-speech conversion.

Text-to-speech systems serve many purposes, such as automated services, voice response, computer-assisted instruction, and reading aids. Some simple text-to-speech systems merely pronounce plain text in sequence. In Chinese, however, semantic and grammatical attributes, together with the symbols and English letters interspersed in a passage, can be read aloud in many different ways depending on the meaning of the passage and on established convention. Reading each character or symbol by its literal pronunciation, or skipping or literally pronouncing passages that combine text with symbols carrying special meaning, makes it hard for the listener to understand what the passage intends to express and can even cause misunderstanding. Neither is an appropriate way to handle text-to-speech conversion.

Many text normalization methods have been studied and can be applied to document recognition, keyword search, statistics, and similar tasks. What is still lacking is a method that normalizes text according to how semantics, context, and ingrained reading habits affect pronunciation, so that the text can subsequently be converted to speech. Such a method is needed, and it is one of the problems practitioners in the field are eager to solve.

The present invention is a Chinese text normalization system and method with semantic collaborative processing. Its main purpose is to apply, in successive stages, finite state machine rules, discriminative keywords, and a semantic collaborative processing module for text and symbols, each stage resolving the reading problems left unresolved by the previous one, so that the normalized Chinese text yields the correct reading of each item and can be used effectively for speech broadcast.

In the method of the invention, a word segmentation tagging module first receives text information and tags its content according to stored part-of-speech rules, producing part-of-speech-tagged text information for subsequent processing. A word segmentation recombination module then receives the tagged text information and processes it according to a cutting and recombination model to produce recombination-tagged text information. The word segmentation recombination module consists of a re-segmentation range marker, a cutter, a recombination range marker, and a recombiner; after these run in sequence, the text information has been provisionally organized and tagged into plausible phrase groupings, which constitute the recombination-tagged text information.

Next, a text normalization tagging module receives the recombination-tagged text information and tags its content according to a text normalization range model to produce normalization-tagged text information. The text normalization range model marks spans in specific formats as ranges that need further processing; unmarked spans already conform to the format and can be processed directly. A text normalization module then receives the normalization-tagged text information and processes it according to general rules, conflict rules, and big rules to produce normalized text information, where a big rule is a decision rule that combines general rules and conflict rules.

A keyword normalization module then receives the normalized text information and processes it according to keyword decision rules to produce preliminarily resolved text information. The keyword decision rules tag the content of the normalized text information using a stored set of keyword phrases, and use the distance between a keyword phrase and the normalized text content as the priority for deciding the correct reading of that content.

Finally, a semantic reading processing module receives the preliminarily resolved text information and processes it according to a semantic reading-label model and a semantic collaborative analysis model to produce fully normalized text information, which improves the accuracy of speech broadcast. The semantic reading-label model is built as follows: a two-dimensional matrix is constructed from the occurrence counts of specific phrases and of the punctuation marks to be normalized across all documents used in repeated training; the matrix is weighted by term frequency-inverse document frequency (TF-IDF) and then reduced in dimension by singular value decomposition (SVD); and the reduced matrix parameters are used to build a Gaussian mixture model (GMM) for each reading of the same character or phrase. The semantic collaborative analysis model applies document fold-in and maximum likelihood estimation to the preliminarily resolved text information to produce the fully normalized text information.

As for the keyword decision rules: keywords can determine how some of the text in a sentence is pronounced, and different keywords can give the same combination of characters and symbols different readings. Take days and months as an example. These two kinds of keyword share the same reading behavior: if a sentence contains 12月, 12日, 11-12月, or 28-29日, then under Chinese grammar the "-" should be read not as "減" (minus) but as "到" or "至" (to). The Arabic numerals, meanwhile, are read the same way in each case, so the numerals can be grouped into one category. After repeated training, when a document is later processed, finding a keyword of a given category near a text normalization range is usually enough to determine the reading.

In short, the method of the invention normalizes a passage or document of Chinese text in which symbols, Arabic numerals, English words, and Chinese words are interleaved, so that the text can be broadcast as speech with readings that accurately reflect its intended meaning.

101‧‧‧Sentence information

102‧‧‧Word segmentation tagging module

103‧‧‧Cutting and recombination model

104‧‧‧Word segmentation recombination module

105‧‧‧Text normalization range model

106‧‧‧Text normalization tagging module

107‧‧‧Text normalization module

108‧‧‧Finite state machine text normalization rules

109‧‧‧Keyword normalization module

110‧‧‧Keyword text normalization rules

111‧‧‧Semantic reading processing module

112‧‧‧Semantic collaborative analysis model

113‧‧‧Semantic reading-label model

114‧‧‧Text with correct readings

301‧‧‧Re-segmentation range marker

302‧‧‧Cutter

303‧‧‧Recombination range marker

304‧‧‧Recombiner

S1001~S1002‧‧‧Method steps

S1501~S1504‧‧‧Method steps

S2101~S2107‧‧‧Method steps

S2601~S2602‧‧‧Method steps

FIG. 1 is the module architecture diagram of the invention.
FIG. 2 is an example of sentence information tagged by the word segmentation tagging module in an embodiment of the invention.
FIG. 3 is a schematic diagram of the composition of the word segmentation recombination module.
FIG. 4 shows the content marked on the sentence information by the re-segmentation range marker.
FIG. 5 is an example of the cutter's output.
FIG. 6 is an example of the recombination range marker's output.
FIG. 7 is an example of the output of the recombiner 304.
FIG. 8 is an example of the output of the text normalization tagging module 106.
FIG. 9 is an example of the finite state machine text normalization rules followed by the text normalization module.
FIG. 10 is a flowchart of the steps for executing the finite state machine text normalization rules.
FIG. 11 is an example of the sentence format after tagging with general rules and conflict rules.
FIG. 12 is an example of the sentence format after tagging with big rules.
FIG. 13 shows sentence content after processing with the keyword text normalization rules.
FIG. 14 is a schematic diagram of the before/after keyword priority rule.
FIG. 15 is a flowchart of the keyword search steps.
FIG. 16 is an example of article classification.
FIG. 17 is an example of word segmentation and re-segmentation.
FIG. 18 is an example of counting the occurrences of words in each document and building a two-dimensional matrix.
FIG. 19 is an example of extracting usable keywords.
FIG. 20 shows the result of the keyword normalization module applying the keyword text normalization rules.
FIG. 21 is a flowchart of the steps by which the semantic reading processing module builds the semantic reading-label model and the semantic collaborative analysis model.
FIG. 22 is an example of article classification.
FIG. 23 is an example of the TF-IDF document importance weighting.
FIG. 24 is an example of singular value decomposition.
FIG. 25 is an example of dimensionality reduction by characteristic energy.
FIG. 26 is a flowchart of the semantic collaborative processing module for text and symbols.

FIG. 27 is an example of the output of the invention.

The present invention is a Chinese text normalization system and method with semantic collaborative processing. Refer to the module architecture diagram of FIG. 1 together with FIG. 2, an example of text tagged by the word segmentation tagging module in an embodiment. Processing is divided into four parts: the first is carried out by three modules, namely the word segmentation tagging module 102, the word segmentation recombination module 104, and the text normalization tagging module 106; the second by the text normalization module 107; the third by the keyword normalization module 109; and the fourth by the semantic reading processing module 111.

The first part is implemented as follows. The sentence information 101 is first processed by the word segmentation tagging module 102; FIG. 2 shows an example of the tags it produces. The leftmost (first) block contains the phrases produced by word segmentation. The second and third blocks hold the part-of-speech tags assigned to those phrases and the composition of each tagged phrase, which serve as parameters for the subsequent modules. The second block holds categories: its first column is the sub-structure category (Subpos), for example Neu (numeral determiner) or Nf (measure word); its second column is the part-of-speech name (POS), for example Neu (numeral determiner), Da (quantity adverb), or PM (mark); and its third column is the phrase type (PHtype). The rightmost (third) block is a string of digits followed by one English letter and indicates which kinds of content the phrase contains. Reading the five-digit string from left to right, the first digit indicates whether the phrase contains Arabic digits, the second whether it contains Chinese numerals, the third whether it contains symbols, and the fourth whether it contains English.
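The composition flags described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `composition_flags`, the list of Chinese numeral characters, and the rule that anything neither alphanumeric nor whitespace counts as a symbol are all hypothetical, and only the four flags the text explains are produced (the trailing letter is not modeled because the text does not explain it).

```python
def composition_flags(phrase: str) -> str:
    """Return four 0/1 flags: Arabic digit, Chinese numeral, symbol, English letter."""
    # Assumed set of Chinese numeral characters; the patent does not list them.
    chinese_numerals = set("零一二三四五六七八九十百千萬億兩")
    has_digit = any(ch.isascii() and ch.isdigit() for ch in phrase)
    has_chinese_numeral = any(ch in chinese_numerals for ch in phrase)
    # Assumption: a symbol is any character that is neither alphanumeric nor whitespace.
    has_symbol = any(not ch.isalnum() and not ch.isspace() for ch in phrase)
    has_english = any(ch.isascii() and ch.isalpha() for ch in phrase)
    flags = (has_digit, has_chinese_numeral, has_symbol, has_english)
    return "".join("1" if flag else "0" for flag in flags)


if __name__ == "__main__":
    for phrase in ("11-12月", "九萬", "2.75%"):
        print(phrase, composition_flags(phrase))
```

Later modules could then test a flag directly instead of rescanning the phrase.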

FIG. 3 shows the composition of the word segmentation recombination module 104, which consists of the re-segmentation range marker 301, the cutter 302, the recombination range marker 303, and the recombiner 304. FIG. 4 shows the content that the re-segmentation range marker 301 marks on the sentence information. The rightmost field holds one English letter giving the position of the phrase within the segmented sentence: B (begin), M (middle), E (end), or S (single). Segmentation errors are identified by rule as follows: if the sentence is two units long, its phrases are labeled B, E in order; if it is three units long, they are labeled B, M, E; and so on. The re-segmentation range marker 301 uses a conditional random field (CRF) model to build a probability field over candidate re-segmentation ranges and uses that field to find the ranges to re-segment. The cutter 302, whose output is shown in FIG. 5, breaks the words inside each range marked by the re-segmentation range marker 301 into individual characters, so that the recombination range marker 303 can recombine them.
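A minimal sketch of the B/M/E/S position labels and of the cutter follows. It assumes the re-segmentation ranges are already known (here passed in as a list of booleans); the CRF model that finds them is not implemented, and the names `position_labels` and `cut` are hypothetical.

```python
def position_labels(length: int) -> list[str]:
    """Label the units of a sentence: S for one unit, otherwise B, M..., E."""
    if length <= 0:
        return []
    if length == 1:
        return ["S"]
    return ["B"] + ["M"] * (length - 2) + ["E"]


def cut(words: list[str], in_range: list[bool]) -> list[str]:
    """Split every word flagged as inside a re-segmentation range into characters."""
    units: list[str] = []
    for word, flagged in zip(words, in_range):
        units.extend(list(word) if flagged else [word])
    return units


if __name__ == "__main__":
    print(position_labels(3))
    print(cut(["28-29", "日"], [True, False]))
```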

FIG. 6 shows the output of the recombination range marker 303. The rightmost field is an English letter optionally followed by a digit, with these meanings: B (begin), B2 (second from the beginning), B3 (third from the beginning), M (middle), E (end), and S (single). Each row is one unit. In more detail: a phrase two units long is labeled B, E; three units, B, M, E; four units, B, B2, B3, E; five units, B, B2, B3, M, E; and six or more units, B, B2, B3, M…M, E.

The recombination range marker 303 uses a conditional random field model to build a probability field over candidate recombination ranges and thereby marks the ranges to recombine. These ranges consist of digits and symbols: how digits are recombined affects how their reading is later decided, and digits together with symbols can form the rules that decide a reading. Recombination therefore groups digits with the symbols that make such rules applicable.

FIG. 7 shows the output of the recombiner 304, which takes the output of the recombination range marker 303 and merges the characters inside each marked range back into a single string, in the format read by the subsequent text normalization tagging module 106.
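The recombination labels and the merge performed by the recombiner can be sketched as follows. The label sequences follow the lengths listed in the description; the function names and the decision to take the labels as given input (rather than predicting them with a CRF) are assumptions made for illustration.

```python
def recombination_labels(length: int) -> list[str]:
    """Labels for a phrase of the given length, following the listed patterns."""
    if length <= 0:
        return []
    fixed = {1: ["S"], 2: ["B", "E"], 3: ["B", "M", "E"], 4: ["B", "B2", "B3", "E"]}
    if length in fixed:
        return fixed[length]
    # Five or more units: B, B2, B3, then one or more M, then E.
    return ["B", "B2", "B3"] + ["M"] * (length - 4) + ["E"]


def recombine(units: list[str], labels: list[str]) -> list[str]:
    """Merge each run of units from B through E into one string; S stays alone."""
    merged: list[str] = []
    buffer: list[str] = []
    for unit, label in zip(units, labels):
        if label == "S":
            merged.append(unit)
            continue
        buffer.append(unit)
        if label == "E":
            merged.append("".join(buffer))
            buffer = []
    return merged


if __name__ == "__main__":
    units = list("2.75%") + ["的"]
    labels = recombination_labels(5) + ["S"]
    print(recombine(units, labels))
```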

FIG. 8 shows the output of the text normalization tagging module 106. This module handles items that, after word segmentation, do not conform to the recombination format: it marks the ranges whose format needs further processing, and the remaining ranges already conform to the format and are ready to be processed.

The second part is implemented as follows. FIG. 9 shows the rules followed by the text normalization module 107. They are defined in three ways, as general rules, conflict rules, and big rules, collectively called the finite state machine text normalization rules 108. A rule that determines a reading is a general rule (R); a rule that cannot determine a reading is a conflict rule (CR); and a big rule (BR) is built from the relationship between R and CR. The lower part of the figure gives two example state machines. In the first, the text to be decided is 11：16：661, and the machine has four states, 0, 1, 2, and 3. State 0 has one outgoing transition (to state 1 on ADL); state 1 has three (to state 1 on ADL, to state 2 on "：", and to state 3 on END); state 2 has one (to state 1 on ADL); and state 3 has none. In the second, the text to be decided is 2.75%, and the machine has five states, 0, 1, 2, 3, and 4. State 0 has one outgoing transition (to state 1 on ADL); state 1 has two (to state 1 on ADL, and to state 2 on "‧"); state 2 has one (to state 3 on ADL); and state 3 has two (to state 3 on ADL, and to state 4 on "%").
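The two state machines can be written out as transition tables. The sketch below assumes that ADL stands for a single Arabic digit, that the full-width colon and the "‧" separator are equivalent to ASCII ":" and ".", and that the last state of each machine is its accepting state; none of this is confirmed by the text, and the names are hypothetical.

```python
# Transition tables keyed by (state, input class), copied from the description.
COLON_SEPARATED = {
    "transitions": {(0, "ADL"): 1, (1, "ADL"): 1, (1, ":"): 2, (1, "END"): 3, (2, "ADL"): 1},
    "accepting": {3},
}
PERCENTAGE = {
    "transitions": {(0, "ADL"): 1, (1, "ADL"): 1, (1, "."): 2,
                    (2, "ADL"): 3, (3, "ADL"): 3, (3, "%"): 4},
    "accepting": {4},
}

# Assumed normalization of full-width or alternate separators to ASCII.
EQUIVALENTS = {"：": ":", "‧": "."}


def input_class(ch: str) -> str:
    """Map a character to the input class used in the transition tables."""
    if ch.isdigit():
        return "ADL"
    return EQUIVALENTS.get(ch, ch)


def accepts(machine: dict, text: str) -> bool:
    """Run the machine over text followed by an END marker."""
    state = 0
    for symbol in [input_class(ch) for ch in text] + ["END"]:
        next_state = machine["transitions"].get((state, symbol))
        if next_state is None:
            if symbol == "END":
                break  # machine defines no END transition from this state
            return False
        state = next_state
    return state in machine["accepting"]


if __name__ == "__main__":
    print(accepts(COLON_SEPARATED, "11：16：661"))
    print(accepts(PERCENTAGE, "2.75%"))
```

A text that matches one of these machines would then be tagged with the corresponding rule; which rules count as R and which as CR is defined by the rule set, not by this sketch.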

FIG. 10 is a flowchart of the steps for executing the finite state machine text normalization rules. Step S1001, general rule and conflict rule tagging, searches row by row for matching finite state machine text rules. Step S1002, big rule tagging, then uses big rules (BR) to search for rules that match combinations of rows.

FIG. 11 shows the sentence format after the general rule and conflict rule tagging of step S1001. Step S1001 tags each row in order with the rule it matches. The result is a set of general rule (R) tags, whose readings are determined, and conflict rule (CR) tags, whose readings are not. Because the big rules (BR) that follow can resolve some of the cases that conflict rules leave undecided, the tag information is retained for the big rule stage.

FIG. 12 shows the sentence format after big rule (BR) tagging. General rules already determine a reading, and the items whose reading cannot be determined are the conflict rule items, so big rules target the conflict rule items and look for rules that span consecutive text normalization ranges before and after them. The previously tagged items are therefore checked against the big rules. After this stage the readings of most text normalization items are determined: items matched by general rules or big rules can be converted to their spoken-text form by looking up the corresponding rule. The remaining conflict rule items, whose readings still cannot be decided, go on to the keyword normalization module 109 to have their readings confirmed.

The third part is implemented as follows. FIG. 13 shows sentence content after the keyword normalization module 109 has processed it with the keyword text normalization rules 110. Keyword categories indicate how each keyword affects the reading of a text normalization item. Keywords may, however, be found both before and after the marked text normalization content, and a sentence may contain several keywords, so a rule is needed to choose among them. By ordinary Chinese grammar, a keyword closer to the marked text should have more influence on its reading. The invention therefore establishes a before/after keyword priority: it takes the first keyword found before the marked part and the first found after it, and chooses whichever is closer to the text normalization range. If the two are equally distant, neither is chosen. FIG. 14 illustrates this priority rule.
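The priority rule can be sketched as a nearest-keyword search. The sketch assumes that distance is counted in tokens and that the marked range is given as a half-open token span; the text does not state the unit of distance, and the function name is hypothetical.

```python
def nearest_keyword(tokens, start, end, keywords):
    """Return the keyword governing tokens[start:end], or None if absent or tied."""
    # First keyword found scanning backwards from the marked range.
    before = next(
        ((start - i, tokens[i]) for i in range(start - 1, -1, -1) if tokens[i] in keywords),
        None,
    )
    # First keyword found scanning forwards from the marked range.
    after = next(
        ((j - end + 1, tokens[j]) for j in range(end, len(tokens)) if tokens[j] in keywords),
        None,
    )
    if before and after:
        if before[0] == after[0]:
            return None  # equal distance: neither keyword is chosen
        return min(before, after)[1]
    chosen = before or after
    return chosen[1] if chosen else None


if __name__ == "__main__":
    # The "-" at index 1 is the marked item; "月" is the only keyword nearby.
    print(nearest_keyword(["11", "-", "12", "月"], 1, 2, {"月", "日"}))
```

With "月" selected, a lookup from keyword category to reading would give "到" rather than "減" for the "-", as in the date examples earlier in the description. Items for which the function returns None are the ones left for the fourth part.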

FIG. 15 is the keyword search flowchart. Its steps are, in order: step S1501, article classification; step S1502, word segmentation and re-segmentation; step S1503, counting the occurrences of each word in each document and building a two-dimensional matrix; and step S1504, extracting usable keywords.

FIG. 16 is an example of the article classification of step S1501. Sentences containing the symbol whose reading is to be determined are grouped into sets according to how the symbol should be pronounced. In the figure, the colon "：" denotes a score or ratio in every sentence shown and should be read as "比". This reading is common in documents about competitions, exchange rates, statistical surveys, and the like, and serves here as one example.

FIG. 17 is an example of the word segmentation and re-segmentation of step S1502. From the processed content, nouns, measure words, and verbs are selected as keyword candidates.

FIG. 18 is an example of step S1503, which counts how often each word occurs in each document and builds a two-dimensional matrix. In the figure, four words are shown with their occurrence counts in three documents.

FIG. 19 is an example of the keyword extraction of step S1504. Every word that occurs in any sentence is counted across the sentences, and the words with the highest counts are selected as keywords. In the example of FIG. 18, word 2 and word 3 occur most often across the documents and are therefore very likely to be keywords, so in this example word 2 and word 3 are selected as new usable keywords.
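Steps S1503 and S1504 can be sketched with a word-by-document count matrix. The documents below are made-up placeholders (the actual counts of FIG. 18 are not reproduced in the text), and selecting the top k words by total count is an assumed reading of "highest counts".

```python
from collections import Counter


def count_matrix(documents):
    """Return (vocabulary, matrix) where matrix[t][d] counts word t in document d."""
    vocabulary = sorted({word for doc in documents for word in doc})
    counts = [Counter(doc) for doc in documents]
    matrix = [[doc_counts[word] for doc_counts in counts] for word in vocabulary]
    return vocabulary, matrix


def top_keywords(vocabulary, matrix, k):
    """Pick the k words with the highest total count (ties broken alphabetically)."""
    totals = {word: sum(row) for word, row in zip(vocabulary, matrix)}
    return sorted(vocabulary, key=lambda word: (-totals[word], word))[:k]


if __name__ == "__main__":
    # Hypothetical segmented documents: four words across three documents.
    docs = [
        ["word1", "word2", "word2", "word3"],
        ["word2", "word3", "word3"],
        ["word2", "word3", "word4"],
    ]
    vocab, matrix = count_matrix(docs)
    print(matrix)
    print(top_keywords(vocab, matrix, 2))
```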

FIG. 20 shows the result of the keyword normalization module 109 applying the keyword text normalization rules 110. After big rule (BR) tagging, the readings of most items in text normalization ranges are determined. For each remaining item, the module searches for the nearest keywords before and after it, uses the priority rule to decide which keyword governs the item, and looks up that keyword's category to decide the reading. In the figure, suppose the digit 9 was tagged as an item whose reading was unknown. Because the nearest keyword, just below it, is "萬" (ten thousand), the 9 is judged to be a modifier attached to "萬", and the reading is "九萬" (ninety thousand).

The fourth part is implemented as follows. After processing by the text normalization module 107 in the second part and the keyword normalization module 109 in the third, a small number of text normalization items still have no determined reading: those that match neither a general rule nor a big rule, and for which either no keyword was found or the keywords before and after are equally distant. These items are processed by the semantic reading processing module 111, using its semantic reading-label model 113 and its semantic collaborative analysis model 112, to produce the text 114 with correct readings.

FIG. 21 is a flowchart of the steps by which the semantic reading processing module 111 builds the semantic reading-label model and the semantic collaborative analysis model. The first step, S2101, is article classification: sentences containing the same text normalization item are consolidated into one document, and sentences containing a given text normalization symbol (such as "：") are grouped into a set. FIG. 22 shows an example of this classification.

Next is step S2102, word segmentation and re-segmentation, which can reuse the word segmentation tagging module 102 and the word segmentation recombination module 104. Step S2103 then takes every word that occurs in any sentence, counts its occurrences in each document, and builds a two-dimensional matrix. Step S2104 applies term frequency-inverse document frequency (TF-IDF) weighting, which evaluates how important an individual word is to one document within a collection or corpus; FIG. 23 shows the resulting table.
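A minimal TF-IDF sketch over a word-by-document count matrix follows. The text does not say which TF-IDF variant is used, so the formulas here (term frequency as count divided by document length, inverse document frequency as the natural log of N over document frequency) are an assumption, as are the function name and the sample counts.

```python
import math


def tf_idf(matrix):
    """Weight matrix[t][d] (count of word t in document d) by an assumed TF-IDF."""
    n_docs = len(matrix[0])
    doc_lengths = [sum(row[d] for row in matrix) for d in range(n_docs)]
    weighted = []
    for row in matrix:
        document_frequency = sum(1 for count in row if count > 0)
        idf = math.log(n_docs / document_frequency) if document_frequency else 0.0
        weighted.append([
            (count / doc_lengths[d]) * idf if doc_lengths[d] else 0.0
            for d, count in enumerate(row)
        ])
    return weighted


if __name__ == "__main__":
    counts = [[1, 0, 0], [2, 1, 1], [1, 2, 1], [0, 0, 1]]  # hypothetical counts
    for row in tf_idf(counts):
        print([round(value, 3) for value in row])
```

Under this variant a word that occurs in every document gets weight zero; other variants smooth the IDF term to avoid that.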

Next is step S2105, singular value decomposition (SVD), which yields the result shown in FIG. 24. The three matrices U, Σ, and V form the semantic collaborative analysis model: U represents the correlation between each document and the document categories, Σ the importance of each document category, and V the correlation between each document category and each word. Each row of U represents one document, and each column of U holds the parameters of the projection onto one basis; Σ holds the energy value of each basis; and after transposition, each row of V represents one basis and each column holds the parameters contributed by one word.
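In standard notation, and assuming the TF-IDF matrix A is arranged with one row per document and one column per word (as the description of U and V implies), the decomposition reads as below. The symbols m, n, and r (numbers of documents, words, and bases) are introduced here for illustration only.

```latex
A = U \Sigma V^{\mathsf{T}}, \qquad
A \in \mathbb{R}^{m \times n},\quad
U \in \mathbb{R}^{m \times r},\quad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r),\quad
V \in \mathbb{R}^{n \times r},\qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0 .
```

Row i of U gives the coordinates of document i over the r bases, and row j of V gives the contribution of word j to each basis.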

Next is step S2106, dimension reduction by feature energy, in which the feature energy is used to reduce the dimension of matrix U. As shown in the example data of FIG. 25, λ is the matrix formed by squaring the diagonal values of the Σ matrix obtained from the SVD. Let e_I = 0.8 (that is, only 80% of the total energy is required) serve as the reference value. Setting a diagonal element of λ to zero removes one column from U: after the original diagonal value 5 of λ is changed to 0, the corresponding column of U also becomes 0, so the matrix dimension is reduced in this way. The dimension-reduction parameter is defined in terms of E_k and E_N, where E_k is the total energy of the dimensions to be retained and E_N is the original total energy. The purpose of dimension reduction is to discard the bases with small energy (small influence), so the diagonal values of λ are used to determine which bases to retain. This is the result of step S2106. Finally, step S2107 builds the reading model: in the present invention, the U matrix parameters after SVD dimension reduction are used to build, from the document files that share the same reading, a Gaussian Mixture Model (GMM) that serves as the reading model.
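The energy-based reduction of step S2106 may be sketched as follows. Because the formula itself is not reproduced in this text, the sketch assumes the criterion is the smallest k for which E_k / E_N ≥ e_I, with E_k the sum of the k largest squared singular values. The function names are hypothetical, and the singular values are assumed to be sorted in descending order.

```python
def select_rank(singular_values, energy_ratio=0.8):
    """Return the smallest k whose leading k energies reach the requested share."""
    energies = [s * s for s in singular_values]  # diagonal of lambda
    total = sum(energies)                        # E_N, the original total energy
    kept = 0.0                                   # E_k, the retained energy so far
    for k, e in enumerate(energies, start=1):
        kept += e
        if kept / total >= energy_ratio:
            return k
    return len(energies)

def truncate_columns(U, k):
    """Keep only the first k columns (bases) of U."""
    return [row[:k] for row in U]

# Energies are 16, 9, 4, 1 (total 30); the first two bases hold 25/30 of the energy.
k = select_rank([4.0, 3.0, 2.0, 1.0], energy_ratio=0.8)
U_reduced = truncate_columns([[1, 2, 3, 4], [5, 6, 7, 8]], k)
```

Dropping the trailing columns of U has the same effect as zeroing the corresponding diagonal entries of λ, as described above.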

The text is then processed by the flow of the semantic collaborative processing module for text and symbols. Referring to FIG. 26, the flow has two steps: step S2601, document fold-in, followed by step S2602, which applies Maximum Likelihood estimation against the Gaussian mixture reading models to establish the correct reading. FIG. 27 is a schematic of the document fold-in technique of step S2601, which works as follows. Sentences for which no rule was found and which were marked by the discriminative keyword tagging system, or sentences containing an undetermined pronunciation (marked only by conflict rules), are folded in to obtain their parameter matrix, where l is the number of inserted documents, K is the number of bases, and the parameter matrix is the one produced for the sentence to be normalized after SVD transformation and dimension reduction. This method exploits matrix properties so that a document requiring text normalization need not have its SVD recomputed together with the previously trained files in order to obtain its U parameters, which saves considerable time.
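Steps S2601 and S2602 may be sketched as below. The fold-in uses the standard projection u = d · V_K · Σ_K⁻¹, assumed here because the formula is not reproduced in this text. The mixtures are given as fixed diagonal-covariance parameters rather than trained, and the reading labels "bi" and "dian" are hypothetical.

```python
import math

def fold_in(doc_vector, Vt, singular_values, k):
    """Project a new weighted word vector onto the first k bases: u = d V_k S_k^-1."""
    return [
        sum(d * v for d, v in zip(doc_vector, Vt[i])) / singular_values[i]
        for i in range(k)
    ]

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of x under a mixture given as (weight, mean, variance) tuples."""
    component_logs = []
    for weight, mean, var in gmm:
        log_p = math.log(weight)
        for xi, mi, vi in zip(x, mean, var):
            log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        component_logs.append(log_p)
    top = max(component_logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(c - top) for c in component_logs))

def choose_reading(x, reading_models):
    """Pick the reading whose mixture gives x the highest likelihood (step S2602)."""
    return max(reading_models, key=lambda name: gmm_log_likelihood(x, reading_models[name]))

# Toy model: two words, two bases; all values are illustrative.
Vt = [[1.0, 0.0], [0.0, 1.0]]
s = [2.0, 1.0]
reading_models = {
    "bi":   [(1.0, [2.0, 3.0], [1.0, 1.0])],    # hypothetical reading label
    "dian": [(1.0, [-2.0, -3.0], [1.0, 1.0])],  # hypothetical reading label
}
u = fold_in([4.0, 3.0], Vt, s, k=2)
reading = choose_reading(u, reading_models)
```

Only the new sentence is projected; the trained `Vt` and singular values are reused, which is why no new decomposition is needed.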

As shown in FIG. 27, the SVD parameters of the sentence in the semantic space are found, and Maximum Likelihood estimation against the Gaussian mixture models is then used to find the reading. The final result is the normalized text with the preferred reading obtained after the semantic collaborative processing of text and symbols, which is also an example of the output of the present invention. Once the correct reading of the document has been determined, the result can be used for purposes such as downstream speech broadcasting.

The above is a detailed description of a preferred embodiment of the present invention. The embodiment is not intended to limit the patent scope of the invention, and any equivalent implementation or modification that does not depart from the technical spirit of the invention shall be included in the patent scope of this case.

In summary, the present invention is innovative in its technical concept and provides multiple effects not achieved by the prior art. It fully satisfies the statutory requirements of novelty and inventive step for an invention patent. This application is therefore filed in accordance with the law, and the Office is respectfully requested to approve this invention patent application so as to encourage invention.

101‧‧‧sentence information

102‧‧‧word segmentation tagging module

103‧‧‧segmentation recombination model

104‧‧‧word segmentation recombination module

107‧‧‧text normalization module

109‧‧‧keyword normalization module

111‧‧‧semantic reading processing module

113‧‧‧semantic-labeled reading model

Claims

A Chinese text normalization method with semantic collaborative processing, comprising the steps of: a keyword normalization module receiving normalized text information, the keyword normalization module processing the content of the normalized text information according to a keyword determination rule to generate preliminarily determined text information; and a semantic reading processing module receiving the preliminarily determined text information, the semantic reading processing module determining and processing the preliminarily determined text information according to a semantic-labeled reading model and a semantic collaborative analysis model to generate fully normalized text information, the fully normalized text information being used to improve the correctness of speech broadcasting; wherein the keyword determination rule marks the content of the normalized text information according to a plurality of internally stored keyword phrases, and uses the distance between a keyword phrase and the normalized text content as a priority to determine the correct pronunciation of the content of the normalized text information; wherein the semantic-labeled reading model is built by establishing a two-dimensional matrix from the number of occurrences of specific phrases and of normalized punctuation marks in all repeatedly trained documents, passing the two-dimensional matrix through Term Frequency-Inverse Document Frequency (TF-IDF) weighting and then through Singular Value Decomposition (SVD) to reduce its dimension, and then using the reduced-dimension matrix parameters to build a Gaussian Mixture Model (GMM) for the same character or phrase pronunciation; and wherein the semantic collaborative analysis model applies document fold-in and the Maximum Likelihood method to process the content of the preliminarily determined text information to generate the fully normalized text information.

The Chinese text normalization method of claim 1, further comprising the steps of: a word segmentation tagging module receiving text information, the word segmentation tagging module marking the content of the text information according to part-of-speech rules to generate part-of-speech-tagged text information; a word segmentation recombination module receiving the part-of-speech-tagged text information, the word segmentation recombination module processing the content of the part-of-speech-tagged text information according to a segmentation recombination model to generate recombined tagged text information; a text normalization tagging module receiving the recombined tagged text information, the text normalization tagging module marking the content of the recombined tagged text information according to a text normalization range model to generate normalization-tagged text information; and a text normalization module receiving the normalization-tagged text information, the text normalization module processing the content of the normalization-tagged text information according to a general rule, a conflict rule and a large rule to generate the normalized text information.

A Chinese text normalization system with semantic collaborative processing, comprising: a keyword normalization module for receiving normalized text information, the keyword normalization module processing the content of the normalized text information according to a keyword determination rule to generate preliminarily determined text information; and a semantic reading processing module for receiving the preliminarily determined text information, the semantic reading processing module determining and processing the preliminarily determined text information according to a semantic-labeled reading model and a semantic collaborative analysis model to generate fully normalized text information, the fully normalized text information being used to improve the correctness of speech broadcasting; wherein the keyword determination rule marks the content of the normalized text information according to a plurality of internally stored keyword phrases, and uses the distance between a keyword phrase and the normalized text content as a priority to determine the correct pronunciation of the content of the normalized text information; wherein the semantic-labeled reading model is built by establishing a two-dimensional matrix from the number of occurrences of specific phrases and of normalized punctuation marks in all repeatedly trained documents, passing the two-dimensional matrix through Term Frequency-Inverse Document Frequency (TF-IDF) weighting and then through Singular Value Decomposition (SVD) to reduce its dimension, and then using the reduced-dimension matrix parameters to build a Gaussian Mixture Model (GMM) for the same character or phrase pronunciation; and wherein the semantic collaborative analysis model applies document fold-in and the Maximum Likelihood method to process the content of the preliminarily determined text information to generate the fully normalized text information.