Machine Learning (Zhou Zhihua), the Watermelon Book: Chapter 9, Exercise 9.10 in Python
-
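Problem
Design an improved k-means algorithm that determines the number of clusters automatically, implement it, and run it on watermelon dataset 4.0.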
-
Principle
How k-means works
The k-means algorithm
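As a reminder, and not as a reproduction of the original figure, the standard k-means objective can be written as follows. For a partition of the samples into clusters C_1, ..., C_k with mean vectors mu_1, ..., mu_k, k-means looks for the partition that minimizes the squared error:

```latex
E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert_2^2,
\qquad
\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
```

Because this error can only decrease as k grows, it cannot be used on its own to pick k, which is why a modified score is needed.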
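The value of k is chosen automatically by minimizing a score E (computed by calc_E in the program listing).
The smaller E is, the more similar the samples within each cluster and the less similar the samples in different clusters. The score also keeps k small, so that the clusters stay as large as possible (the samples here are assumed to fall into only two classes, so k should tend toward 2).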
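The formula itself did not survive in this copy of the post. Reading it back from the `calc_E` function in the program listing, the score appears to be the following, where m is the number of samples; treat it as a reconstruction from the code rather than the author's original notation:

```latex
E = \frac{1}{m} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert_2^2
  \;-\; \frac{1}{k} \sum_{a=1}^{k} \sum_{\substack{b=1 \\ b \neq a}}^{k} \lVert \mu_a - \mu_b \rVert_2^2
  \;+\; \frac{2k}{m}
```

The first term rewards compact clusters, the second rewards well-separated mean vectors (each pair is counted twice, as in the code), and the last term penalizes a large k.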
-
Procedure
Obtaining the dataset
Save watermelon dataset 4.0 as data_4.txt (the header row reads: ID, density, sugar content):
編號,密度,含糖率
1,0.697,0.460
2,0.774,0.376
3,0.634,0.264
4,0.608,0.318
5,0.556,0.215
6,0.403,0.237
7,0.481,0.149
8,0.437,0.211
9,0.666,0.091
10,0.243,0.267
11,0.245,0.057
12,0.343,0.099
13,0.639,0.161
14,0.657,0.198
15,0.360,0.370
16,0.593,0.042
17,0.719,0.103
18,0.359,0.188
19,0.339,0.241
20,0.282,0.257
21,0.748,0.232
22,0.714,0.346
23,0.483,0.312
24,0.478,0.437
25,0.525,0.369
26,0.751,0.489
27,0.532,0.472
28,0.473,0.376
29,0.725,0.445
30,0.446,0.459
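A quick way to sanity-check the file format before clustering is to parse a few rows with the standard library. The sketch below is only an illustration: `parse_rows` is a hypothetical helper that does not appear in the original program (which reads the file with pandas), and the inline string holds just the first three rows of the dataset.

```python
import csv
import io

# First three rows of data_4.txt, inlined so the sketch is self-contained
SAMPLE = """編號,密度,含糖率
1,0.697,0.460
2,0.774,0.376
3,0.634,0.264
"""

def parse_rows(text):
    """Return (density, sugar_content) tuples, skipping the header and the ID column."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row
    return [(float(row[1]), float(row[2])) for row in reader if row]

rows = parse_rows(SAMPLE)
print(len(rows), rows[0])
```

Passing `skipinitialspace=True` makes the parser tolerate a stray space after a comma, which is easy to introduce when copying the table by hand.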
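Implementation
Read the data (loadData).
Compute the Euclidean distance between two sample vectors (calc_distance).
Compute the mean vector of a given cluster (calc_mu).
Static k-means: partition the data into k clusters (k_means).
Score the resulting partition with the metric used to choose k automatically (calc_E).
Dynamic k-means: return the best value of k (Dynamic_K_means).
main: call the functions above and print the partition obtained with the automatically chosen k.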
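The distance, mean-vector, and static k-means steps above boil down to repeating one assignment pass and one update pass. The following is a minimal standard-library sketch of a single such iteration on toy data. The helpers `assign` and `update` and the four points are assumptions made for illustration; they are not taken from the original program, which works on a pandas DataFrame.

```python
import math

def assign(points, centers):
    """Group point indices by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for idx, p in enumerate(points):
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(idx)
    return clusters

def update(points, clusters, centers):
    """Recompute each center as the mean of its members; keep it if the cluster is empty."""
    new_centers = []
    for members, old in zip(clusters, centers):
        if not members:
            new_centers.append(old)
            continue
        dims = zip(*(points[i] for i in members))
        new_centers.append(tuple(sum(d) / len(members) for d in dims))
    return new_centers

# Toy data (not from the watermelon dataset): two well-separated groups
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]
clusters = assign(points, centers)
centers = update(points, clusters, centers)
print(clusters, centers)
```

Static k-means repeats these two passes until no center moves or an iteration limit is reached.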
-
Results
-
Program listing:
import random as rd
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
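
def loadData(filename):
    # Read the CSV file; skipinitialspace tolerates a space after a comma
    dataSet = pd.read_csv(filename, skipinitialspace=True)
    # Drop the ID column ('編號') so only the two features remain
    dataSet.drop(columns=['編號'], inplace=True)
    return dataSet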

# Euclidean distance between a sample x and a mean vector mu
def calc_distance(x, mu):
    distance = 0
    for xi, yi in zip(x, mu):
        distance += (xi - yi) ** 2
    return distance ** 0.5

# Mean vector of the samples whose row labels are listed in indexs
def calc_mu(dataSet, indexs):
    Ci = dataSet.loc[indexs]
    return np.array(Ci.mean())

def k_means(dataSet, k, iterate=100):
    # Pick k distinct samples as the initial mean vectors
    Mu_indexs = rd.sample(range(dataSet.shape[0]), k)
    Mu = [np.array(dataSet.loc[index]) for index in Mu_indexs]
    now, flag = 0, True
    # Repeat until no mean vector changes or the iteration limit is reached
    while flag and now < iterate:
        C = [[] for _ in range(k)]
        # Assign every sample to the cluster with the nearest mean vector
        for index, row in dataSet.iterrows():
            distances = [calc_distance(row, mu) for mu in Mu]
            label = int(np.argmin(distances))
            C[label].append(index)
        flag = False
        # Recompute each mean vector and record whether any of them moved
        for i in range(len(Mu)):
            if not C[i]:
                # Empty cluster: keep the old mean vector instead of producing NaN
                continue
            new_mu = calc_mu(dataSet, C[i])
            if (new_mu != Mu[i]).any():
                flag = True
                Mu[i] = new_mu
        now += 1
    return C, Mu
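
def calc_E(dataSet, C, Mu, k):
    E_inside, E_outside, size = 0, 0, dataSet.shape[0]
    # Within-cluster term: squared distance of every sample to its mean vector
    for Ci, mu in zip(C, Mu):
        for index in Ci:
            distance = calc_distance(dataSet.loc[index], mu)
            E_inside += distance ** 2
    # Normalize by the number of samples to keep the terms balanced
    E_inside /= size
    # Between-cluster term: squared distance between every ordered pair of mean vectors
    for a in range(k):
        for b in range(k):
            if a == b:
                continue
            distance = calc_distance(Mu[a], Mu[b])
            E_outside += distance ** 2
    E_outside /= k
    # Compact, well-separated clusters lower E; the 2*k/size term penalizes a large k
    return E_inside - E_outside + 2 * k / size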

def Dynamic_K_means(dataSet):
    size, before_E = len(dataSet), 9999
    for k in range(2, size):
        Es = []
        # Run k-means several times and average the score E
        for _ in range(10):
            C, Mu = k_means(dataSet, k)
            E = calc_E(dataSet, C, Mu, k)
            Es.append(E)
        Best_E = sum(Es) / len(Es)
        # Stop as soon as the average score no longer improves
        if before_E <= Best_E:
            return k - 1
        else:
            before_E = Best_E
    return 1

if __name__ == '__main__':
    filename = 'data_4.txt'
    dataSet = loadData(filename)
    k = Dynamic_K_means(dataSet)
    Best_E, Best_C = float('inf'), None
    # Run several times and keep the best partition
    for _ in range(10):
        C, Mu = k_means(dataSet, k)
        E = calc_E(dataSet, C, Mu, k)
        if E < Best_E:
            Best_E = E
            Best_C = C
    print('k =', k)
    for Ci in Best_C:
        print(Ci)
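To see how the score behaves, it can help to evaluate it by hand on a tiny example. The sketch below restates the arithmetic of `calc_E` with the standard library only. The function name `score` and the four toy points are assumptions for illustration and are not part of the original program.

```python
import math

def score(points, clusters, centers):
    """Restates calc_E: within-cluster term - between-cluster term + 2k/m."""
    m, k = len(points), len(centers)
    # Mean squared distance of each point to the center of its cluster
    inside = sum(math.dist(points[i], mu) ** 2
                 for members, mu in zip(clusters, centers) for i in members) / m
    # Squared distance between every ordered pair of centers, divided by k
    outside = sum(math.dist(centers[a], centers[b]) ** 2
                  for a in range(k) for b in range(k) if a != b) / k
    return inside - outside + 2 * k / m

# Toy data (not from the watermelon dataset): two well-separated groups
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
two = score(points, [[0, 1], [2, 3]], [(0.0, 1.0), (4.0, 1.0)])
one = score(points, [[0, 1, 2, 3]], [(2.0, 1.0)])
print(two, one)
```

On this toy input the two-cluster partition scores lower than the single cluster, which is the behavior the selection loop in `Dynamic_K_means` relies on.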