TWI639091B

TWI639091B - Big data based automated analysis processing system

Info

Publication number: TWI639091B
Application number: TW106143079A
Authority: TW
Inventors: 鐘振聲; 徐凡耘; 李錦和
Original assignee: 鐘振聲; 徐凡耘; 李錦和
Priority date: 2017-12-08
Filing date: 2017-12-08
Publication date: 2018-10-21
Also published as: TW201926074A

Abstract

The present invention relates to an automated analysis and processing system based on big data, comprising a data conversion module, a data pool module, a statistical and machine learning analysis module, and a model monitoring and evaluation module. The data conversion module is configured to receive raw data and to convert, integrate, and preliminarily analyze it to generate feature field data. The data pool module is configured to store the feature field data. The statistical and machine learning analysis module is configured to search the data pool module for, and retrieve, the feature field data relevant to a previously set analysis topic in order to build a prediction model. The model monitoring and evaluation module is configured to monitor the accuracy of the prediction model based on the latest raw data and to evaluate whether to automatically update or rebuild the prediction model.

Description

Automated analysis and processing system based on big data

The present invention relates to an automated analysis and processing system, and more particularly to a big-data-based automated analysis and processing system with automated learning and updating.

With the advancement and maturation of information technology, big data has risen rapidly, and its applications have become a hot topic in recent years.

In general, when big data is applied to practical analysis, a large amount of data must first be accumulated so that, within a limited set of possibilities, the predictions with a higher probability of occurrence can be estimated and computed. However, current big data analysis systems not only present a high technical barrier to entry, but their data analysis and computation are also often very time-consuming. By the time a preliminary analysis report is produced, the best moment for the user to make a decision has often already passed, or the report can serve only to verify results after the fact. Consequently, big data analysis often cannot keep pace with the timing of business operations promptly and effectively, and so cannot deliver its highest application value.

Therefore, there is an urgent need for a novel and innovative big-data-based automated analysis and processing system that provides a simple, intuitive, and visually operated data integration and analysis platform, lowers the technical barrier for users, and builds prediction models quickly, so that predictive analyses for specific target needs are produced promptly and effectively, giving users reliable decision support.

The main object of the present invention is to provide an automated analysis and processing system based on big data that offers users a one-stop data analysis interface, so that they can work intuitively through web pages, which effectively lowers the barrier to advanced data analysis. The system can also profile and prepare the data in advance, so that prediction models can be built quickly and analysis insight reports can be produced quickly. In addition, the prediction model can have automatic failure warning and automatic rebuilding upon model degradation, without manual intervention by the user.

To achieve the above object, one aspect of the present invention provides an automated analysis and processing system based on big data, comprising: a data conversion module configured to receive at least one piece of raw data and to convert, integrate, and preliminarily analyze the raw data to generate feature field data; a data pool module configured to store the feature field data; a statistical and machine learning analysis module configured to automatically search the data pool module for, and retrieve, the feature field data related to a set analysis topic, so as to build a prediction model and thereby generate an analysis result; and a model monitoring and evaluation module configured to monitor the accuracy of the prediction model based on the latest raw data received by the data conversion module, and thereby evaluate whether to automatically update the coefficients of the feature field data of the prediction model or to rebuild the prediction model.

Next, in the big-data-based automated analysis and processing system of the present invention, the raw data comprises at least one of structured data and unstructured data.

Furthermore, in the big-data-based automated analysis and processing system of the present invention, the data conversion module comprises: a data integration unit for converting and integrating the raw data into integrated data having a compatible data format; and a data checking unit for checking the integrated data and, according to a preset data threshold check value, excluding or imputing the integrated data, thereby generating the feature field data.

In addition, in the big-data-based automated analysis and processing system of the present invention, the data pool module comprises a frequency detection unit configured to detect how frequently the feature field data is retrieved and used, so that the feature field data is classified, labeled, and stored accordingly.

Moreover, in the big-data-based automated analysis and processing system of the present invention, the statistical and machine learning analysis module builds the prediction model by applying a preselected algorithm to the feature data of the feature field data related to the set analysis topic.

Also, in the big-data-based automated analysis and processing system of the present invention, the preselected algorithm comprises at least one of multiple regression, a logistic regression algorithm, a deep neural network algorithm, and a random forest decision tree algorithm.

Further, in the big-data-based automated analysis and processing system of the present invention, the model monitoring and evaluation module monitors whether to update or rebuild the prediction model according to a preset parameter. In addition, the model monitoring and evaluation module includes a model deployment unit, which can deploy (put on the shelf) the prediction model built by the statistical and machine learning analysis module so that new data arriving later can be scored and monitored automatically. Besides deploying models, the model monitoring and evaluation module can also withdraw (take off the shelf) a prediction model to stop scoring and monitoring.

Preferably, in the big-data-based automated analysis and processing system of the present invention, the preset parameter is a Gini index or a preset update schedule.

Furthermore, the big-data-based automated analysis and processing system of the present invention further comprises a work order adjustment module, wherein the work order adjustment module comprises: a computing resource monitoring unit for monitoring the resource status of each computing device in a cluster computing framework; and a work order dispatch unit for adjusting the computing workload dispatched to each computing device according to the resource status of that computing device.

In addition, the big-data-based automated analysis and processing system of the present invention further comprises a user operation interface that lets a user operate and configure, on an operation web page, the data conversion module, the data pool module, the statistical and machine learning analysis module, the model monitoring and evaluation module, and the work order adjustment module.

Further, in the big-data-based automated analysis and processing system of the present invention, the user operation interface displays the classification label topics of the feature field data so that the user can run quick queries against the data pool module.

1‧‧‧Automated analysis and processing system based on big data

2‧‧‧Data conversion module

3‧‧‧Data pool module

4‧‧‧Statistical and machine learning analysis module

5‧‧‧Model monitoring and evaluation module

6‧‧‧User operation interface

7‧‧‧Work order adjustment module

21‧‧‧Data integration unit

22‧‧‧Data checking unit

31‧‧‧Frequency detection unit

32‧‧‧Topic classification unit

71‧‧‧Computing resource monitoring unit

72‧‧‧Work order dispatch unit

FIG. 1A is a schematic diagram of an automated analysis and processing system based on big data according to an embodiment of the present invention.

FIG. 1B is a functional block diagram of the data conversion module of FIG. 1A.

FIG. 1C is a functional block diagram of a data pool module according to another preferred embodiment of the present invention.

FIG. 2 is a schematic diagram of the user operation interface of an automated analysis and processing system according to another embodiment of the present invention.

FIG. 3 is a functional block diagram of a work order adjustment module according to another embodiment of the present invention.

Before the big-data-based automated analysis and processing system of the present invention is described in detail in the embodiments, it should be noted that in the following description similar elements are denoted by the same reference numerals. Furthermore, the drawings of the present invention are schematic only; they are not necessarily drawn to scale, and not all details are necessarily shown in them.

Please refer to FIG. 1A, which is a schematic diagram of an automated analysis and processing system based on big data according to an embodiment of the present invention. As shown, the big-data-based automated analysis and processing system 1 includes a data conversion module 2, a data pool module 3, a statistical and machine learning analysis module 4, and a model monitoring and evaluation module 5. The data conversion module 2 is configured to receive at least one piece of raw data and to convert, integrate, and preliminarily analyze the raw data to generate feature field data.

Please also refer to FIG. 1B, which is a functional block diagram of the data conversion module 2 of FIG. 1A. As shown in FIG. 1B, the data conversion module 2 may include a data integration unit 21 and a data checking unit 22. More specifically, the data integration unit 21 converts and integrates the received raw data into integrated data having a compatible data format. In one embodiment of the present invention, the raw data may be at least one of structured data, semi-structured data, and unstructured data. In this way, the automated analysis and processing system 1 can, through the data integration unit 21 in the data conversion module 2, receive and connect to heterogeneous data without restricting the format of the input data, which increases the diversity and convenience with which the system collects and accumulates large amounts of data.

Next, the data checking unit 22 is configured to check the integrated data compiled by the data integration unit 21 after data format conversion. The data checking unit 22 may also, according to a preset data threshold check value, exclude or impute the integrated data, thereby producing feature field data of higher data quality. More specifically, the imputation disclosed in this embodiment is handled differently depending on the attribute of the field. For example, categorical feature field data is imputed with the mode; continuous feature field data is imputed with the mean or the median; or a data mining algorithm may be used for imputation, depending on the needs of the actual application. To give a concrete case, suppose that in one record of the integrated data the gender field is null or contains contact address information, and therefore does not satisfy the preset data threshold check value that the system sets for the gender field, namely male or female. The data checking unit 22 can then impute the gender field from the other fields of that record, such as the name or the first two characters of the national ID number, and automatically fill in the corresponding gender so that the record can be analyzed effectively by the automated analysis and processing system 1. Alternatively, the data checking unit 22 can, according to the user's preference or settings, exclude the record outright and treat it as invalid, erroneous information.
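As an illustration only, the mode/mean/median imputation rule described above might be sketched in Python as follows. The function name `impute_column` and its parameters (`kind`, `allowed`, `strategy`) are hypothetical and do not appear in the specification; the `allowed` set stands in for the preset data threshold check value of a field, and inferring gender from other fields of the same record is not shown.

```python
from statistics import mean, median, mode

def impute_column(values, kind, allowed=None, strategy="mean"):
    """Return a copy of `values` with invalid entries filled in.

    An entry is invalid if it is None or, when `allowed` is given,
    not a member of `allowed` (a stand-in for the field's check value).
    Categorical columns are filled with the mode; continuous columns
    with the mean or, if strategy == "median", the median.
    """
    def valid(v):
        return v is not None and (allowed is None or v in allowed)

    observed = [v for v in values if valid(v)]
    if not observed:
        raise ValueError("no valid values to impute from")
    if kind == "categorical":
        fill = mode(observed)      # most frequent valid category
    elif strategy == "median":
        fill = median(observed)
    else:
        fill = mean(observed)
    return [v if valid(v) else fill for v in values]

# A gender column containing a null and a stray address string.
genders = impute_column(["F", "M", None, "F", "12 Main St."],
                        "categorical", allowed={"F", "M"})
# A continuous column with one missing value.
ages = impute_column([30, None, 40, 50], "continuous", strategy="median")
```

Excluding a record instead of imputing it, as the specification also allows, would simply drop the rows for which `valid` is false.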

In this way, the data conversion module 2 can, through the data integration unit 21 and the data checking unit 22, perform preliminary format integration and content profiling on the raw data. The data checking unit 22 can even merge in the specific information field categories and their classification thresholds that the system user has entered in advance, so that the resulting feature field data is more accurate and more useful as a reference. In subsequent data analysis applications, this not only improves processing efficiency but also spares users from having to clean the data themselves.

Please continue to refer to FIG. 1A. The data pool module 3 is configured to store the feature field data produced by the preliminary profiling of the data conversion module 2. In addition, in another preferred embodiment of the present invention, the data pool module 3 not only stores the feature field data but can also store the raw data at the same time, which effectively protects the integrity of the backup data.

Next, please refer to FIG. 1A and FIG. 1C together, where FIG. 1C is a functional block diagram of a data pool module according to another preferred embodiment of the present invention. More specifically, the data pool module 3 may include a frequency detection unit 31. The frequency detection unit 31 is configured to detect and record how frequently the system as a whole retrieves and uses the feature field data, so that the data is classified by usage frequency and stored on storage media with different access speeds. For example, when the frequency detection unit 31 identifies feature field data with a high usage frequency, that data is stored on a solid-state drive (SSD) with faster access; conversely, when the frequency detection unit 31 identifies feature field data with a low usage frequency, that data is stored on a conventional hard disk drive (HDD) with relatively slower access.
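The frequency-based placement described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the class name `FrequencyDetector`, the `hot_threshold` cutoff, and the tier labels `"ssd"` and `"hdd"` are all hypothetical, and the specification does not say how "high" and "low" usage frequency are distinguished.

```python
from collections import Counter

class FrequencyDetector:
    """Counts accesses per feature field and maps each field to a storage tier."""

    def __init__(self, hot_threshold=3):
        # Assumed cutoff: fields accessed at least this often count as "hot".
        self.hot_threshold = hot_threshold
        self.hits = Counter()

    def record_access(self, field_name):
        self.hits[field_name] += 1

    def tier(self, field_name):
        # Hot fields go to the faster medium, cold fields to the slower one.
        return "ssd" if self.hits[field_name] >= self.hot_threshold else "hdd"

    def placement(self, field_names):
        return {name: self.tier(name) for name in field_names}

detector = FrequencyDetector(hot_threshold=3)
for name in ["monthly_bill"] * 5 + ["signup_channel"]:
    detector.record_access(name)
plan = detector.placement(["monthly_bill", "signup_channel", "never_used"])
```

A real data pool would also have to migrate data between media when a field changes tier; that step is outside this sketch.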

In addition, the data pool module 3 may also include a topic classification unit 32, which is configured to further classify, label, and store the feature field data according to a key topic, greatly reducing the time needed to search for and retrieve data in subsequent analysis. The key topic may be a key index term, together with its corresponding data field category, that is preset by the system or entered by the user in real time.

Please continue to refer to FIG. 1A. The statistical and machine learning analysis module 4 is configured to automatically cross-check against the data pool module 3, according to a set analysis topic that is predefined in the system or entered by the user in real time, and to search for and retrieve feature field data related to the set analysis topic, so as to build a prediction model and thereby generate an analysis result. More specifically, the statistical and machine learning analysis module 4 builds the prediction model by applying a preselected algorithm to the feature data of the feature field data related to the set analysis topic. In this embodiment, the preselected algorithm may include at least one of a logistic regression algorithm, a neural network algorithm (which may include deep learning), a decision tree, and a random forest algorithm. The present invention is not limited to these: a suitable algorithm may be chosen according to the actual needs of the user, or the system 1 may recommend a preferable algorithm based on the characteristics of the set analysis topic.

In addition, after the statistical and machine learning analysis module 4 has used the feature data as training samples to produce the prediction model, it retrieves from the data pool module 3 another set of feature field data related to the set analysis topic to serve as validation data, and performs a preliminary accuracy validation of the prediction model, thereby arriving at a better-fitting and more precise prediction model.
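The train-then-validate step described above might look like the following sketch. It is an assumption-laden toy: the helper names (`holdout_split`, `fit_stump`, `accuracy`) are hypothetical, and a one-feature threshold rule stands in for the preselected algorithms named in the specification, purely to keep the example self-contained.

```python
import random

def holdout_split(rows, validation_ratio=0.25, seed=0):
    """Shuffle deterministically, then split into training and validation sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - validation_ratio))
    return rows[:cut], rows[cut:]

def fit_stump(train):
    """Pick the threshold on x that classifies the training rows best."""
    best_t, best_acc = None, -1.0
    for t in sorted({x for x, _ in train}):
        acc = sum((x >= t) == bool(y) for x, y in train) / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def accuracy(threshold, rows):
    """Share of rows whose label matches the rule `x >= threshold`."""
    return sum((x >= threshold) == bool(y) for x, y in rows) / len(rows)

# Toy data: label is 1 when the single feature is at least 50.
rows = [(x, int(x >= 50)) for x in range(0, 100, 5)]
train, valid = holdout_split(rows)
threshold = fit_stump(train)
train_acc = accuracy(threshold, train)
valid_acc = accuracy(threshold, valid)   # preliminary accuracy check
```

The validation accuracy is computed only on rows the model never saw during fitting, which is the point of retrieving a separate set of feature field data for validation.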

Please refer to FIG. 1A. The model monitoring and evaluation module 5 included in the automated analysis and processing system 1 is configured to produce an analysis result from the latest record of raw data received by the data conversion module 2, and to monitor the accuracy of the prediction model on that basis. That is, the model monitoring and evaluation module 5 can apply the prediction model to the feature field data converted and integrated from the most recently received raw data to obtain a real-time analysis result, and judge whether that result falls within an expected range, so as to monitor the accuracy of the prediction model. It then evaluates whether to automatically update and adjust local parameters or coefficients of the feature field data in the prediction model, or to rebuild the prediction model from the updated and accumulated feature field data, in order to improve prediction accuracy.

Furthermore, because the prediction model is built from previously integrated data, its accuracy decays the longer it is in live use. To keep the prediction model from distorting the data the system is actually receiving, which would invalidate the analysis results and leave them seriously out of step with reality, the model monitoring and evaluation module 5 is configured to monitor the accuracy of the prediction model in real time, based on the latest raw data received by the data conversion module 2 and the feature field data converted from it, and thereby evaluate whether the coefficients of the feature field data of the prediction model need to be updated automatically or the prediction model needs to be rebuilt. In other words, the model monitoring and evaluation module 5 monitors the effectiveness of the analysis results to determine whether the prediction model has failed, and triggers an early-warning mechanism that rebuilds or updates the prediction model automatically, with no additional intervention from the user.

In a preferred embodiment of the present invention, the model monitoring and evaluation module 5 uses a preset parameter to judge whether the analysis results provided by the monitored prediction model have become distorted and are no longer valid, and then decides whether to update or rebuild the prediction model. Depending on the needs of the actual design and application, the preset parameter may be the Gini index, used as the basis for evaluating accuracy, or a time-related preset parameter such as a preset update schedule, under which the system automatically updates the prediction model at a preset interval (for example, one week or one month) to maintain the accuracy and validity of the prediction model and of the analysis results it produces.
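One way to read the two preset parameters above is sketched below. Several things here are assumptions rather than statements of the specification: the Gini index is taken to be the Gini coefficient commonly used for scoring models, 2·AUC − 1; the cutoffs `update_below` and `rebuild_below` and the 30-day `max_age` are hypothetical values; and the rule for choosing between "update" and "rebuild" is invented for illustration.

```python
from datetime import date, timedelta

def gini(scores, labels):
    """Gini coefficient as 2*AUC - 1, with AUC computed over all (pos, neg) pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # A pair counts 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return 2 * auc - 1

def decide(current_gini, last_built, today,
           update_below=0.4, rebuild_below=0.2, max_age=timedelta(days=30)):
    """Return 'rebuild', 'update', or 'keep' (all thresholds are hypothetical)."""
    if current_gini < rebuild_below:
        return "rebuild"                  # model judged to have failed
    if current_gini < update_below or today - last_built >= max_age:
        return "update"                   # degraded, or the schedule is due
    return "keep"

g = gini([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])          # perfectly ranked scores
action = decide(g, date(2024, 1, 1), date(2024, 1, 10))
```

The schedule check and the Gini check are independent triggers here; the specification presents them as alternatives, so an implementation could equally use only one of them.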

In this way, through the accuracy early-warning mechanism of the model monitoring and evaluation module 5 and its corresponding automatic rebuilding or updating mechanism, the automated analysis and processing system 1 can learn and correct itself autonomously and provide users with more precise analysis results. For example, the automated analysis and processing system 1 can produce, from existing customer information, a list of customers with high development potential for a new business category, allowing marketing campaigns to achieve better results with less effort.

In another specific embodiment of the present invention, the model monitoring and evaluation module 5 may further include a model deployment unit, which can deploy (put on the shelf) the prediction model built by the statistical and machine learning analysis module 4 so that new data arriving later can be scored and monitored automatically. Conversely, the model monitoring and evaluation module 5 may also include a model withdrawal unit, which can withdraw (take off the shelf) a prediction model to stop scoring and monitoring. The model monitoring and evaluation module 5 thus provides a mechanism for controlling the deployment and withdrawal of prediction models, allowing their accuracy and validity to be managed more effectively.
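The on-shelf and off-shelf mechanism described above can be pictured as a small registry of active models. The sketch below is hypothetical throughout: the class `ModelShelf`, its method names, and the toy churn rule are illustrative only and are not taken from the specification.

```python
class ModelShelf:
    """Keeps the set of models that are currently scoring new records."""

    def __init__(self):
        self._models = {}

    def put_on(self, name, predict):
        # Deploy: from now on, new records are scored by this model.
        self._models[name] = predict

    def take_off(self, name):
        # Withdraw: stop scoring with this model (no error if it is absent).
        self._models.pop(name, None)

    def score(self, record):
        return {name: predict(record) for name, predict in self._models.items()}

shelf = ModelShelf()
shelf.put_on("churn", lambda r: r["complaints"] > 2)   # toy rule, not a real model
before = shelf.score({"complaints": 3})
shelf.take_off("churn")
after = shelf.score({"complaints": 3})
```

Monitoring would hook into `score`, for example by logging each prediction for later comparison with actual outcomes; that part is omitted here.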

Please refer to FIG. 2, which is a schematic diagram of the user operation interface of an automated analysis and processing system according to another embodiment of the present invention. As shown, the automated analysis and processing system 1 may further include a user operation interface 6, which lets the user operate and configure, on an operation web page, the data conversion module 2, the data pool module 3, the statistical and machine learning analysis module 4, the model monitoring and evaluation module 5, and the work order adjustment module 7. Through the user operation interface 6, the automated analysis and processing system 1 thus offers one-stop, simple, and intuitively visualized data analysis, effectively lowering the barrier to advanced, in-depth data mining analysis.

For example, suppose a telecom operator (a user who is not a machine learning expert) wants to identify customers at risk of churning early, so that it can act in time to retain them and effectively prevent customer loss. Through the intuitive user operation interface 6, the user can graphically select the raw data to be analyzed (for example, existing customer information) and send it to the data conversion module 2. The data integration unit 21 and the data checking unit 22 then integrate the data formats of the existing customer information and, where a data field is null or a field value is an outlier, directly exclude or impute it, producing feature field data with more useful information content. A visual analysis chart shows the progress of data conversion and checking in real time.

Next, once the feature field data has been sent to the data pool module 3, it can be classified and stored, and the data groups currently stored in the data pool module 3 and available for analysis can be presented by topic, according to system-preset topics or topic keywords entered by the user. For example, the data pool module 3 can classify and display data by usage frequency, where frequently used data is treated as hot data and infrequently used data is treated as cold data.

Furthermore, the statistical and machine learning analysis module 4 generates and displays the analysis result corresponding to the analysis topic set by the user, here potential churn customers. The user can thus obtain, promptly and easily from the graphical analysis result, the list of potential churn customers along with related distribution and comparison information. In addition, through the display fields and graphical information of the model monitoring and evaluation module 5 that are integrated into the user operation interface 6, the user can see the reliability and accuracy of the analysis result and observe the analysis quality in real time.

Please refer to FIG. 2 and FIG. 3 together, where FIG. 3 is a functional block diagram of the work order adjustment module of an automated analysis and processing system according to another embodiment of the present invention. As shown, the automated analysis and processing system 1 may further include a work order adjustment module 7, which adjusts the computing workload of the clustered computing devices so that the processing performance of each computing device 73 tends toward uniformity, improving the efficiency of model building and analysis computation. More specifically, the work order adjustment module 7 includes a computing resource monitoring unit 71 and a work order dispatch unit 72. The computing resource monitoring unit 71 is configured to monitor the resource status of each computing device 73 in the cluster computing framework, so that the amount of idle resources on each computing device 73 can be monitored and recorded in real time. The work order dispatch unit 72 is configured to plan the dispatch of computing work according to the resource status of each computing device 73 and the amount of resources it has available, and to adjust the computing workload dispatched to each computing device 73 accordingly, so that the system 1 achieves higher processing performance and can deliver analysis information more promptly.
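A minimal sketch of resource-aware dispatch, in the spirit of the work order dispatch unit 72, is given below. The greedy policy (largest job first, always to the device with the most idle capacity) is an assumption chosen for illustration; the specification does not prescribe a scheduling algorithm, and the function name `dispatch`, the job costs, and the node names are hypothetical.

```python
import heapq

def dispatch(jobs, free_capacity):
    """Assign jobs to devices, largest job first, most idle device first.

    jobs:          {job_id: estimated cost in resource units}
    free_capacity: {device_id: idle resource units reported by monitoring}
    Returns {device_id: [job_id, ...]}.
    """
    # Max-heap on remaining idle capacity (heapq is a min-heap, so negate).
    heap = [(-cap, dev) for dev, cap in free_capacity.items()]
    heapq.heapify(heap)
    plan = {dev: [] for dev in free_capacity}
    for job, cost in sorted(jobs.items(), key=lambda kv: -kv[1]):
        neg_cap, dev = heapq.heappop(heap)
        plan[dev].append(job)
        # Remaining capacity shrinks by the job cost; it may go negative
        # if the cluster is overloaded, which this sketch does not prevent.
        heapq.heappush(heap, (neg_cap + cost, dev))
    return plan

plan = dispatch({"a": 4, "b": 3, "c": 2, "d": 1},
                {"node1": 6, "node2": 4})
```

In this example both devices end up with no idle capacity left, which matches the stated goal of evening out the load across computing devices.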

The above embodiments are merely examples given for convenience of description. The scope of rights claimed by the present invention shall be as set forth in the claims and is not limited to the above embodiments.

Claims

An automated analysis processing system based on big data, comprising: a data conversion module configured to receive at least one item of raw data and to convert, integrate, and preliminarily analyze the raw data to generate feature field data, wherein the data conversion module includes: a data integration unit for converting and integrating the raw data into integrated data having a compatible data format; and a data checking unit for checking the integrated data and, according to a preset data threshold detection value, excluding or imputing values in the integrated data to generate the feature field data; a data pool module configured to store the feature field data; a statistical and machine learning analysis module configured to automatically search for and retrieve, from the data pool module, the feature field data related to a set analysis topic so as to build a prediction model and generate an analysis result; and a model monitoring evaluation module configured to monitor the accuracy of the prediction model according to the latest raw data received by the data conversion module, and to evaluate whether to automatically update the coefficients of the feature field data of the prediction model or to rebuild the prediction model.

The big data-based automated analysis processing system of claim 1, wherein the raw data includes at least one of structured data, semi-structured data, and unstructured data.

The big data-based automated analysis processing system of claim 1, wherein the data pool module includes a frequency detection unit configured to detect how frequently the feature field data is retrieved and used, and to classify, label, and store the feature field data accordingly.

The big data-based automated analysis processing system of claim 1, wherein the statistical and machine learning analysis module builds the prediction model by applying a preselected algorithm to the feature data of the feature field data related to the set analysis topic.

The big data-based automated analysis processing system of claim 4, wherein the preselected algorithm includes at least one of a logistic regression algorithm, an artificial neural network algorithm, and a random forest algorithm.

The big data-based automated analysis processing system of claim 1, wherein the model monitoring evaluation module monitors, according to a preset parameter, whether to update the coefficients of the feature field data of the prediction model or to rebuild the prediction model, and wherein the preset parameter is a Gini index or a preset update schedule.

The big data-based automated analysis processing system of claim 1, further comprising a work order adjustment module, wherein the work order adjustment module includes: a computing resource monitoring unit for monitoring the resource status of each computing device in a cluster computing framework; and a work order dispatch unit for adjusting the computing workload dispatched to each computing device according to the resource status of each computing device.

The big data-based automated analysis processing system of claim 7, further comprising a user operation interface that allows a user to operate and configure, on an operation web page, the data conversion module, the data pool module, the statistical and machine learning analysis module, the model monitoring evaluation module, and the work order adjustment module.

The big data-based automated analysis processing system of claim 8, wherein the user operation interface displays the classification label topics of the feature field data so that the user can perform quick queries on the data pool module.