Decision trees: Easy to interpret and explain (for some people, at least; I'm not sure I'm among them). They handle feature interactions effortlessly and are non-parametric, so you don't have to worry about outliers or whether the data is linearly separable (for example, a decision tree easily handles the case where class A sits at the low end of some feature x, class B in the middle, and class A again at the high end). One drawback is that they don't support online learning, so the tree has to be rebuilt from scratch whenever new examples arrive. Another is that they overfit easily, but that's where ensemble methods like random forests (or boosted trees) come in. Moreover, random forests are frequently the winner on classification problems (usually a bit ahead of SVMs, I believe); they're fast and scalable, and you don't have to tune a pile of parameters the way you do with an SVM, so they seem to be quite popular these days.
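A minimal sketch of the A/B/A case mentioned above, assuming scikit-learn is available: class A at both ends of feature x, class B in the middle. No single linear threshold on x can separate these, but a decision tree splits the axis twice and handles it directly. The data here is made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Class A at the low end of x, class B in the middle, class A again at the top.
X = [[0.1], [0.2], [0.3],    # class A
     [0.45], [0.5], [0.55],  # class B
     [0.8], [0.9], [1.0]]    # class A again
y = ["A", "A", "A", "B", "B", "B", "A", "A", "A"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
preds = tree.predict([[0.15], [0.5], [0.95]])
print(preds)  # expect ['A' 'B' 'A']: two splits on x recover the A/B/A bands
```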
Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn’t linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive, hard to interpret, and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.
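To illustrate the kernel point above, here is a hedged sketch (assuming scikit-learn) on XOR-style data, the textbook example of a problem that no linear boundary in the base feature space can solve, while an RBF kernel separates it easily. The data and hyperparameters are illustrative, not tuned recommendations.

```python
from sklearn.svm import SVC

# XOR pattern, repeated so each class has several examples.
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 10
y = [0, 1, 1, 0] * 10

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0, C=10).fit(X, y)

# No linear classifier can get XOR fully right; the RBF kernel can.
print("linear:", linear.score(X, y))
print("rbf:   ", rbf.score(X, y))
```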
But…
Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, then whichever classification algorithm you use might not matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).
And to reiterate what I said above, if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize (and Middle Earth), just use an ensemble method to choose them all.
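The "try a bunch of classifiers and pick by cross-validation" advice can be sketched in a few lines, assuming scikit-learn; the synthetic dataset and the three candidate models here are illustrative stand-ins, not a recommendation for any particular problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# A synthetic binary classification problem stands in for real data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM (RBF kernel)": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
}

# Score each candidate by 5-fold cross-validation and keep the best.
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Ensembling the candidates instead of picking one (the Netflix Prize lesson) is a one-line change with scikit-learn's `VotingClassifier`.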