Introduction to sklearn
scikit-learn is a simple and efficient tool for data mining and data analysis. It depends on NumPy, SciPy, and matplotlib.
Grouped by functionality, it mainly covers the following areas:
Classification
Regression
Clustering
Dimensionality reduction
Model selection
The parts used most often are clustering, classification (svm, tree, linear models, etc.), decomposition, preprocessing, and metrics.
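Metrics, for instance, provides ready-made evaluation functions. A minimal sketch (the label arrays here are made up purely for illustration):
from sklearn import metrics

y_true = [0, 1, 2, 2]
y_pred = [0, 1, 1, 2]
print(metrics.accuracy_score(y_true, y_pred))    # fraction of correctly predicted labels
print(metrics.confusion_matrix(y_true, y_pred))  # per-class breakdown of hits and misses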
cluster
Reading the sklearn.cluster API, you will find two kinds of entries: classes for the various clustering methods, such as cluster.KMeans, and functions that can be called directly to perform clustering, for example:
sklearn.cluster.k_means(X, n_clusters, init='k-means++', precompute_distances='auto', n_init=10, max_iter=300, verbose=False, tol=0.0001, random_state=None, copy_x=True, n_jobs=1, algorithm='auto', return_n_iter=False)
So in actual use there are correspondingly two ways to run a clustering.
sklearn.cluster provides nine clustering methods in total:
AffinityPropagation: affinity propagation
AgglomerativeClustering: agglomerative (hierarchical) clustering
Birch
DBSCAN
FeatureAgglomeration: feature agglomeration
KMeans: k-means clustering
MiniBatchKMeans
MeanShift
SpectralClustering: spectral clustering
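All of these classes follow the same basic pattern: construct, fit, then read labels_. As a quick sketch with DBSCAN, for instance (the eps and min_samples values below are arbitrary choices, not recommendations):
import numpy as np
from sklearn.cluster import DBSCAN

data = np.random.rand(100, 3)
db = DBSCAN(eps=0.3, min_samples=5).fit(data)
labels = db.labels_  # cluster label per sample; -1 marks noise points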
Take the most familiar method, KMeans, as an example.
Using the class constructor to build a KMeans clusterer. The KMeans constructor in the API is:
sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')
Meaning of the parameters:
n_clusters: the number of clusters, i.e. how many groups you want
init: how the initial cluster centers are obtained
n_init: how many times the algorithm is run with freshly chosen initial centers (the best run is kept)
max_iter: maximum number of iterations (the k-means algorithm is iterative)
tol: tolerance, i.e. the convergence criterion for stopping
precompute_distances: whether to precompute distances
verbose: verbosity of progress output (usually left at its default)
random_state: controls the random initialization of the cluster centers
copy_x: whether to work on a copy of the data; if True, the input is copied and not modified
n_jobs: parallelism setting
algorithm: which k-means implementation to use: 'auto', 'full', or 'elkan', where 'full' is the classical EM-style implementation
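For instance, to make a run reproducible and adjust a few of these defaults, one might construct the clusterer like this (the values are purely illustrative):
from sklearn.cluster import KMeans
estimator = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, random_state=42)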
Here is a simple example:
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 3)  # generate random data: 100 samples, 3 features
# suppose we want a clusterer with 3 clusters
estimator = KMeans(n_clusters=3)  # construct the clusterer
estimator.fit(data)  # run the clustering
label_pred = estimator.labels_  # get the cluster label of each sample
centroids = estimator.cluster_centers_  # get the cluster centers
inertia = estimator.inertia_  # get the final value of the clustering criterion
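Once fitted, the estimator can also assign brand-new points to the learned clusters; a small usage sketch (new_points is made up here):
new_points = np.random.rand(5, 3)
new_labels = estimator.predict(new_points)  # index of the nearest cluster center for each point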
Using the k_means function directly:
import numpy as np
from sklearn import cluster

data = np.random.rand(100, 3)  # generate random data: 100 samples, 3 features
k = 3  # suppose we want 3 clusters
[centroid, label, inertia] = cluster.k_means(data, k)
classification
Commonly used classification methods include:
KNN (nearest neighbors): sklearn.neighbors
Logistic regression: sklearn.linear_model.LogisticRegression
SVM (support vector machines): sklearn.svm
Naive Bayes: sklearn.naive_bayes
Decision trees: sklearn.tree
Neural networks: sklearn.neural_network
Let us take KNN (specifically, nearest-neighbors classification) as an example of how to use these methods:
import numpy as np
from sklearn import neighbors, datasets

# import some data to play with
iris = datasets.load_iris()
n_neighbors = 15
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
y = iris.target
weights = 'distance'  # can also be set to 'uniform'
clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
clf.fit(X, y)

# if you have test data, just predict with the following functions
# for example, xx, yy is constructed test data
h = .02  # step size of the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # Z is the label_pred
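A mesh like this is usually built in order to draw the decision boundary; a sketch of that final plotting step (not part of the original snippet, assuming matplotlib is available):
import matplotlib.pyplot as plt

Z = Z.reshape(xx.shape)  # back to the grid shape for plotting
plt.pcolormesh(xx, yy, Z)  # colored decision regions
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')  # training points on top
plt.title("3-class classification (k = %i, weights = '%s')" % (n_neighbors, weights))
plt.show()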
Another example, SVM:
from sklearn import svm

X = [[0, 0], [1, 1]]
y = [0, 1]
# build a support vector classification model
clf = svm.SVC()
# fit the training data to obtain the model parameters
clf.fit(X, y)
# predict the test points [2., 2.] and [3., 3.]
res = clf.predict([[2., 2.], [3., 3.]])
# print the predicted values
print(res)
# get support vectors
print("support vectors:", clf.support_vectors_)
# get indices of support vectors
print("indices of support vectors:", clf.support_)
# get number of support vectors for each class
print("number of support vectors for each class:", clf.n_support_)
SVM of course also has a corresponding regression model, SVR:
from sklearn import svm

X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
clf = svm.SVR()
clf.fit(X, y)
res = clf.predict([[1, 1]])
print(res)
Logistic regression:
from sklearn import linear_model

X = [[0, 0], [1, 1]]
y = [0, 1]
logreg = linear_model.LogisticRegression(C=1e5)
# we create an instance of the logistic regression classifier and fit the data
logreg.fit(X, y)
res = logreg.predict([[2, 2]])
print(res)
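Logistic regression also exposes class probabilities, which is often the reason to choose it; a small follow-up sketch:
proba = logreg.predict_proba([[2, 2]])  # one row per sample, one column per class
print(proba)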
preprocessing
What I usually need here is the scaling operations. There are several scaler types, including:
StandardScaler
MaxAbsScaler
MinMaxScaler
RobustScaler
Normalizer
as well as other preprocessing operations.
Each has a corresponding function that can be used directly: scale(), maxabs_scale(), minmax_scale(), robust_scale(), normalize().
import numpy as np
from sklearn import preprocessing

X = np.random.rand(3, 4)
# using the Scaler class
scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# using the scale function directly
X_scaled_convenient = preprocessing.minmax_scale(X)
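The practical difference between the two approaches: the Scaler object remembers the statistics learned in fit, so it can apply exactly the same transformation to later data. A sketch of the usual train/test pattern (X_train and X_test are made up here):
X_train = np.random.rand(80, 4)
X_test = np.random.rand(20, 4)
scaler = preprocessing.MinMaxScaler().fit(X_train)  # learn min/max from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the same min/max on the test set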
decomposition
NMF
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
model = NMF(n_components=2, init='random', random_state=0)
model.fit(X)
print(model.components_)  # the factor matrix H
print(model.reconstruction_err_)  # reconstruction error
print(model.n_iter_)  # actual number of iterations
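The factorization can also be used to reconstruct the data: transform gives the coefficient matrix W, and its product with components_ approximates X. A small sketch:
W = model.transform(X)  # coefficient matrix W: (n_samples, n_components)
X_approx = np.dot(W, model.components_)  # W times H, an approximation of X
print(X_approx)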
PCA
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
model = PCA(n_components=2)
model.fit(X)
print(model.components_)  # principal axes in feature space
print(model.n_components_)  # number of components kept
print(model.explained_variance_)  # variance explained by each component
print(model.explained_variance_ratio_)  # fraction of total variance per component
print(model.mean_)  # per-feature empirical mean
print(model.noise_variance_)  # estimated noise variance
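To actually reduce the dimensionality, call transform (or fit_transform); a sketch keeping only one component (n_components=1 is an assumed choice here):
model_1d = PCA(n_components=1).fit(X)
X_reduced = model_1d.transform(X)  # project onto the first principal component
print(X_reduced.shape)  # (6, 1)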
datasets
sklearn itself also ships several common datasets, such as iris, diabetes, digits, covtype, kddcup99, boston, and breast_cancer, each loadable with a method like sklearn.datasets.load_iris. The loader returns a dataset object; the data and labels are obtained as follows:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
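As a shortcut, newer sklearn versions let the loaders return the (data, target) pair directly via the return_X_y flag:
X, y = load_iris(return_X_y=True)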