Applying a Virtual Sample Method to Improve Classification Performance on Imbalanced Big Data

Abstract

In the age of big data, enterprises can often obtain large amounts of data to build a learning model for decision making. With big data, such a model is likely to be affected by an imbalanced data set, which biases training and makes the model favor the majority class. Building a reliable learning model from a class-imbalanced data set is therefore one of the most important challenges enterprises currently face. To address this problem, this paper proposes a new over-sampling method that increases the number of samples in the minority class. The method uses the mega-trend-diffusion (MTD) technique to generate virtual samples and a plausibility assessment mechanism (PAM) to assess the suitability of each virtual sample. The aim is to reduce the false positive rate (FPR) of classification without degrading the other indices used to assess classification performance, such as accuracy, geometric mean (Gmean), and F1-measure (F1). A simulated data set is used to build a support vector machine (SVM) classification model, and the experimental results show that the proposed method can effectively improve classification performance on imbalanced big data sets.
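The four indices named in the abstract can all be derived from a binary confusion matrix. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, label `1` is assumed to be the positive class, and Gmean is assumed to be the usual square root of sensitivity times specificity, which the abstract does not spell out.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem (hypothetical helper)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_indices(y_true, y_pred, positive=1):
    """Return accuracy, FPR, Gmean and F1 as a dict.

    Assumption: Gmean = sqrt(sensitivity * specificity).
    Ratios with a zero denominator are reported as 0.0.
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)

    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "fpr": ratio(fp, fp + tn),     # false positive rate = 1 - specificity
        "gmean": math.sqrt(sensitivity * specificity),
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


if __name__ == "__main__":
    # Toy imbalanced labels: 2 positives, 8 negatives (made-up data).
    y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    print(classification_indices(y_true, y_pred))
```

On the same toy labels, a classifier that predicts the majority class for every sample still reaches an accuracy of 0.8 while its Gmean and F1 fall to 0, which is why accuracy alone is a poor guide on imbalanced data.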

Key Words

Imbalanced Big Data (IBD)
Over-sampling
Virtual Sample (VS)
False Positive Rate (FPR)
