How do you know what machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out a couple different ones (making sure to try different parameters within each algorithm as well), and select the best one by cross-validation. But if you’re simply looking for a “good enough” algorithm for your problem, or a place to start, here are some general guidelines I’ve found to work well over the years.
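In code, the cross-validation approach might look something like this minimal sketch (assuming Python with scikit-learn; the synthetic dataset and the particular candidate classifiers are just placeholders for your own data and shortlist):

    # Compare a few candidate classifiers by cross-validated accuracy.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    # Synthetic stand-in for your real training data.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    candidates = {
        "Naive Bayes": GaussianNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    }

    for name, clf in candidates.items():
        scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
        print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")

In practice you'd also sweep each model's own parameters (for example with GridSearchCV) before comparing, as noted above.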
How large is your training set?
If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren’t powerful enough to provide accurate models.
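One way to see this tradeoff is to plot learning curves for a high-bias and a high-variance classifier side by side. Here's a rough sketch (again assuming scikit-learn; Gaussian Naive Bayes and 5-nearest-neighbors on a synthetic dataset are stand-ins for whatever pair you care about):

    # How do Naive Bayes (high bias) and kNN (high variance) behave
    # as the training set grows? Cross-validated learning curves show it.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import learning_curve
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    fractions = np.linspace(0.05, 1.0, 8)  # fractions of the training data to use

    for name, clf in [("Naive Bayes", GaussianNB()),
                      ("kNN (k=5)", KNeighborsClassifier(n_neighbors=5))]:
        sizes, _, test_scores = learning_curve(clf, X, y, train_sizes=fractions, cv=5)
        for n, score in zip(sizes, test_scores.mean(axis=1)):
            print(f"{name}: {n:5d} examples -> CV accuracy {score:.3f}")

The expectation (not a guarantee, since it depends on the data) is that Naive Bayes looks relatively better at the small end and kNN catches up or overtakes it as the training set grows.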
You can also think of this as a generative model vs. discriminative model distinction.
Advantages of some particular algorithms
Advantages of Naive Bayes: Super simple, you’re just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn’t hold, a NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs pretty well. Its main disadvantage is that it can’t learn interactions between features (e.g., it can’t learn that although you love movies with Brad Pitt and Tom Cruise, you hate movies where they’re together).
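To make the "bunch of counts" point concrete, here's a tiny sketch of a bag-of-words Naive Bayes text classifier (assuming scikit-learn; the four-document corpus is obviously made up for illustration):

    # Multinomial Naive Bayes over word counts: fitting is essentially
    # counting how often each word shows up in each class.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["loved this movie", "great acting and plot",
            "terrible movie", "boring plot and bad acting"]
    labels = [1, 1, 0, 0]  # 1 = positive review, 0 = negative review

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)   # sparse matrix of word counts
    clf = MultinomialNB().fit(X, labels)

    X_new = vectorizer.transform(["great movie"])
    print(clf.predict(X_new), clf.predict_proba(X_new))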
Advantages of Logistic Regression: Lots of ways to regularize your model, and you don’t have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.
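Here's a rough sketch of that online-update point (assuming scikit-learn; SGDClassifier with a logistic loss is one way to get an incrementally trainable logistic regression, and the "new data arriving later" is just a slice of a synthetic dataset):

    # Online logistic regression: fit an initial batch, then fold in new
    # examples with partial_fit instead of retraining from scratch.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    first_X, first_y = X[:500], y[:500]   # data you have today
    later_X, later_y = X[500:], y[500:]   # data that shows up later

    # loss="log_loss" gives logistic regression trained by SGD
    # (older scikit-learn releases called this loss "log").
    clf = SGDClassifier(loss="log_loss", random_state=0)
    clf.partial_fit(first_X, first_y, classes=np.unique(y))
    clf.partial_fit(later_X, later_y)     # quick update with the new data

    # Probabilities let you move the decision threshold or report uncertainty.
    print(clf.predict_proba(X[:3]))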
Advantages of Decision Trees: Easy to interpret and explain (for some people – I’m not sure I fall into this camp). They easily handle feature interactions and they’re non-parametric, so you don’t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). One disadvantage is that they don’t support online learning, so you have to rebuild your tree when new examples come in. Another disadvantage is that they easily overfit, but that’s where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they’re fast and scalable, and you don’t have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.
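And a minimal random-forest sketch, since it's such a common default (again assuming scikit-learn; the tree count and the synthetic data are placeholders, not tuned choices):

    # Random forests tend to give a strong baseline with little tuning.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=300, random_state=0)
    forest.fit(X_train, y_train)
    print("held-out accuracy:", forest.score(X_test, y_test))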