首先分清聚類和分類的區別:
分類——監督學習算法,需要給定訓練數據
聚類——無監督學習算法,無訓練數據。
聚類分為 層次方法和非層次方法:
層次方法——最后形成一棵tree,每個node或者有k個分支,或者是葉子節點。( 過程似huffman tree)
非層次方法——是一個迭代過程,直至滿足某個閥值退出。(主要包括k-mean 和 EM算法)
k-mean算法的步驟:(每個樣本只能屬于一個聚類)
1)隨機選出k個centroid(質心)
2)將每個樣本分配給與之距離最近的centroid
3)重新計算centroid
4)重復2) 3)直至centroid所對應的set不變
EM算法:(每個樣本可以屬于不同的聚類,但概率不同)
理論基礎—計算極大似然估計,需要求似然函數的極值。輸入是樣本,但這個樣本并不是完整數據,它只是觀測數據。
完整數據(Z)包括觀測數據(X)和未知數據(Y)。
log似然函數:
? ( θ; Z ) = log p( Z|
θ
)
=
log p( X,Y|
θ
)-----(1)
它的步驟跟EM所代表的單詞有關,
<wbr><wbr><wbr><wbr>E——expectation,<br><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr>已知: 觀測數據X,<span style="word-wrap:normal; word-break:normal; line-height:17px; font-size:11pt">參數θ的當前值</span><span style="word-wrap:normal; word-break:normal; line-height:17px; font-size:11pt">θt</span><br><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><span style="word-wrap:normal; word-break:normal; line-height:17px; font-size:11pt">未知:Y<br><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr>對公式(1)求Y的期望,</wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></span>得表達式Q(<span style="word-wrap:normal; word-break:normal; line-height:17px; font-size:11pt">θ,</span><span style="word-wrap:normal; word-break:normal; line-height:17px; font-size:11pt">θt</span>), Q中不含有變量Y。<br><wbr><wbr><wbr><wbr>M——maximization,對E步得到的Q(<span style="word-wrap:normal; word-break:normal; line-height:17px; font-size:11pt">θ,</span><span style="word-wrap:normal; word-break:normal; line-height:17px; font-size:11pt">θt</span>)求極大值,<br><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><span style="word-wrap:normal; word-break:normal; line-height:17px; font-size:11pt">即θt+1 =</span>argmax Q(<span style="word-wrap:normal; word-break:normal; line-height:17px; font-size:11pt">θ,</span><span style="word-wrap:normal; word-break:normal; line-height:17px; font-size:11pt">θt )</span><br><wbr><wbr><wbr><wbr>每次參數更新會增加非完整似然函數值<br><wbr><wbr><wbr><wbr>重復E、M兩步,直至似然函數收斂到局部極大值。<br><br> EM會收斂到局部極值,但不保證收斂到全局最優;對初值很敏感,需要一個好的、快速的初始化過程。<br></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr>
<wbr></wbr>
<wbr></wbr>
k-mean聚類算法如下:
1.<wbr><wbr><wbr><wbr><wbr><wbr>從數據點中,隨機選取k個數據中心作為初始的聚類中心。例如k=3,則選擇3個數據點</wbr></wbr></wbr></wbr></wbr></wbr>
2.<wbr><wbr><wbr><wbr><wbr><wbr>分別計算每一個點到k個中心點的距離(本文計算的是歐式距離),如果當前計算的數據點離第i個(i=1,2,…,k)中心點最近,則把當前點歸到第i類.</wbr></wbr></wbr></wbr></wbr></wbr>
3.<wbr><wbr><wbr><wbr><wbr><wbr>重新計算k個聚類中心點。計算方式如下,如果第i類有n個數據點,則第i類新的中心為:</wbr></wbr></wbr></wbr></wbr></wbr>
<wbr><a target="_blank" style="text-decoration:none; color:rgb(27,113,155)"></a><br><a target="_blank" style="text-decoration:none; color:rgb(27,113,155)"><img src="http://s3.sinaimg.cn/bmiddle/48b8276ftc1ef9044f7c3&690" name="image_operate_65101339134613281" alt="分類與聚類" title="分類與聚類" style="margin:0px; padding:0px; border:0px; list-style:none"></a><br><br><br></wbr>
<wbr><wbr><wbr><wbr>4.如果新的聚類中心跟上一次的聚類中心比較變化小于某值算法結束,否則轉到第二步。</wbr></wbr></wbr></wbr>
聚類結果如下:
<wbr></wbr>
代碼如下:http://download.csdn.net/source/3374443
load gaussdata; %由于我先前生成好了,直接load進來
maxiter = 50;<wbr><wbr>%設定最大頻數</wbr></wbr>
iter = 1;
num = size(X,2);<wbr>%num為數據點個數</wbr>
index = randperm(num); %產生1到num個數字的一個隨機排列
center = X(:,index(1:3)); %選擇出3個初始的聚類中心
?nter = X(:,1:3);
old = center;<wbr><wbr><wbr><wbr><wbr><wbr>%記錄舊的聚類中心</wbr></wbr></wbr></wbr></wbr></wbr>
<wbr></wbr>
hold on ;
plot(X(1,:),X(2,:),'g.');<wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr>%繪制數據點</wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr>
plot(center(1,:),center(2,:),'ro'); %繪制數據中心
title(num2str(iter));<wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr>%顯示迭代步數</wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr>
hold off;
xnum = size(X,2);
cdim = size(center,2);
while iter<=maxiter
<wbr><wbr>clf;</wbr></wbr>
<wbr><wbr>hold on;</wbr></wbr>
<wbr><wbr>5-39行計算每一個點到k個中心的距離,一個很神奇的技巧,自己想吧,呵呵</wbr></wbr>
<wbr><wbr>sumX = sum(X.^2,1);</wbr></wbr>
<wbr><wbr>sumC = sum(center.^2,1);</wbr></wbr>
<wbr><wbr>XY = (2*X'*center)';</wbr></wbr>
<wbr><wbr>distance = repmat(sumX,cdim,1)+repmat(sumC',1,xnum)-XY;</wbr></wbr>
<wbr><wbr>[v,idx] = min(distance,[],1);<wbr><wbr>%求出數據點到哪一個中心的距離最近</wbr></wbr></wbr></wbr>
<wbr><wbr>Y = idx;<wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr>%對數據點進行分類</wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr>
<wbr><wbr>idx1 = find(idx==1);</wbr></wbr>
<wbr><wbr>idx2 = find(idx==2);</wbr></wbr>
<wbr><wbr>idx3 = find(idx==3);</wbr></wbr>
<wbr><wbr>%下面三行計算出新的聚類中心</wbr></wbr>
<wbr><wbr>center(:,1) = mean(X(:,idx1),2);</wbr></wbr>
<wbr><wbr>center(:,2) = mean(X(:,idx2),2);</wbr></wbr>
<wbr><wbr>center(:,3) = mean(X(:,idx3),2);</wbr></wbr>
<wbr><wbr>title(num2str(iter));</wbr></wbr>
<wbr><wbr>plot_data(X,Y,center);</wbr></wbr>
<wbr><wbr>hold off;</wbr></wbr>
<wbr><wbr>pause(0.1);</wbr></wbr>
<wbr><wbr>error = sum((center(:,1)-old(:,1)).^2,1);<wbr><wbr><wbr>%計算迭代中止條件</wbr></wbr></wbr></wbr></wbr>
<wbr><wbr>if error<0.000001</wbr></wbr>
<wbr><wbr><wbr><wbr><wbr>break;</wbr></wbr></wbr></wbr></wbr>
<wbr><wbr>end</wbr></wbr>
<wbr><wbr>old = center;</wbr></wbr>
<wbr><wbr>iter = iter+1;</wbr></wbr>
end
更多文章、技術交流、商務合作、聯系博主
微信掃碼或搜索:z360901061

微信掃一掃加我為好友
QQ號聯系: 360901061
您的支持是博主寫作最大的動力,如果您喜歡我的文章,感覺我的文章對您有幫助,請用微信掃描下面二維碼支持博主2元、5元、10元、20元等您想捐的金額吧,狠狠點擊下面給點支持吧,站長非常感激您!手機微信長按不能支付解決辦法:請將微信支付二維碼保存到相冊,切換到微信,然后點擊微信右上角掃一掃功能,選擇支付二維碼完成支付。
【本文對您有幫助就好】元
