
Author: chen_h
WeChat & QQ: 862251340
WeChat official account: coderpai

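(1) An introduction to ensemble learning in machine learning

(2) The bagging method

(3) The random forest algorithm for trading with Python

(4) Implementing and interpreting random forests in Python

(5) How to implement the Bagging algorithm from scratch in Python

(6) How to implement the random forest algorithm from scratch in Python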

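Introduction

Random forest is one of the main algorithms in ensemble learning. In short, an ensemble method combines the predictions of several weak learners into a single strong learner. Intuitively, a random forest does better than a single decision tree because it reduces overfitting. Both decision trees and random forests can be used for regression and classification problems. In this article, we use a random forest to tackle such problems.

Theory

Before we start writing code, we need to understand some basic theory:

1. Feature bagging: bootstrapping is the process of sampling from the original sample with replacement. In feature bagging, we randomly sample from the original features and pass the sampled features to different trees. (Features are sampled without replacement, because redundant features would be pointless.) This is done to reduce the correlation between trees. Our goal is to build highly uncorrelated decision trees.

2. Aggregation: the core of what makes a random forest better than a decision tree is the aggregation of uncorrelated trees. The idea is to create several shallow tree models and then average them to create a better random forest, so that some of the random errors average out to zero. For regression, we average the predictions of the trees (the mean); for classification, we simply take the majority class voted for by the trees.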
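The effect of aggregation can be illustrated with a small simulation. This is only a sketch under stated assumptions, not part of the original article: each "tree" is replaced by the true value plus independent zero-mean noise (a hypothetical stand-in for an uncorrelated weak learner), and the function names `aggregate_regression` and `aggregate_classification` are made up for illustration.

```python
from collections import Counter

import numpy as np


def aggregate_regression(tree_predictions):
    """Average per-tree predictions (rows = trees, columns = samples)."""
    return np.mean(tree_predictions, axis=0)


def aggregate_classification(tree_votes):
    """Return the majority class among the per-tree votes for one sample."""
    return Counter(tree_votes).most_common(1)[0][0]


rng = np.random.default_rng(0)
true_value = 5.0
n_trees, n_samples = 100, 1000

# Assumption: every "tree" predicts the truth plus independent noise.
tree_predictions = true_value + rng.normal(0.0, 1.0, size=(n_trees, n_samples))

single_tree_error = np.std(tree_predictions[0] - true_value)
forest_error = np.std(aggregate_regression(tree_predictions) - true_value)

print(single_tree_error)  # close to the noise level of one tree
print(forest_error)       # much smaller, on the order of 1 / sqrt(n_trees)

majority = aggregate_classification(["up", "down", "up"])
print(majority)
```

The reduction only holds to the extent that the trees' errors are uncorrelated, which is why feature bagging tries to decorrelate them.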

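Python code

To code our random forest from scratch, we will follow a top-down approach. We start with a single black box and break it down into several black boxes, with progressively lower levels of abstraction and more detail, until nothing is abstract anymore.

Random forest class

We are creating a random forest regressor; if you want a classifier, only minor adjustments to this code are needed. First, we need to know the inputs and outputs of our black box, so the parameters that define our random forest are:

x: the independent variables of the training set. To keep things simple, I do not create a separate fit method, so the class constructor accepts the training set;
y: the corresponding dependent variable needed for supervised learning (random forest is a supervised learning technique);
n_trees: the number of uncorrelated trees we combine to create the random forest;
n_features: the number of features to sample and pass to each tree; this is where feature bagging happens. It can be sqrt, log2, or an integer. With sqrt, the number of features sampled for each tree is the square root of the total number of features; with log2, it is the base-2 logarithm of the total number of features;
sample_sz: the number of rows randomly selected and passed to each tree. This usually equals the total number of rows, but in some cases it can be reduced to improve performance and lower the correlation between trees (bagging of trees is an entirely separate machine learning technique);
depth: the depth of each decision tree. Greater depth means more splits, which increases each tree's tendency to overfit, but since we aggregate several uncorrelated trees, overfitting in individual trees hardly disturbs the forest as a whole;
min_leaf: the minimum number of rows a node needs in order to be split further. The lower min_leaf is, the deeper the tree will be;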
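How the three accepted forms of n_features turn into an actual feature count can be sketched as follows. The helper name `resolve_n_features` is hypothetical and only mirrors the rule described in the parameter list; truncating with `int` is an assumption about how fractional results are handled.

```python
import numpy as np


def resolve_n_features(n_features, total_features):
    """Translate 'sqrt', 'log2', or an integer into a feature count."""
    if n_features == 'sqrt':
        return int(np.sqrt(total_features))
    if n_features == 'log2':
        return int(np.log2(total_features))
    return int(n_features)


# With 64 features in total, the three settings give different subset sizes.
for setting in ('sqrt', 'log2', 5):
    print(setting, resolve_n_features(setting, total_features=64))
```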

Let's start defining our random forest class.

            
              
    import math
    import numpy as np

    class RandomForest():
        def __init__(self, x, y, n_trees, n_features, sample_sz, depth=10, min_leaf=5):
            np.random.seed(12)
            # Resolve how many features each tree receives.
            if n_features == 'sqrt':
                self.n_features = int(np.sqrt(x.shape[1]))
            elif n_features == 'log2':
                self.n_features = int(np.log2(x.shape[1]))
            else:
                self.n_features = n_features
            print(self.n_features, "sha: ", x.shape[1])
            self.x, self.y, self.sample_sz, self.depth, self.min_leaf = x, y, sample_sz, depth, min_leaf
            self.trees = [self.create_tree() for i in range(n_trees)]

        def create_tree(self):
            # Random rows and random features (feature bagging) for this tree.
            idxs = np.random.permutation(len(self.y))[:self.sample_sz]
            f_idxs = np.random.permutation(self.x.shape[1])[:self.n_features]
            return DecisionTree(self.x.iloc[idxs], self.y[idxs], self.n_features, f_idxs,
                        idxs=np.array(range(self.sample_sz)), depth=self.depth, min_leaf=self.min_leaf)

        def predict(self, x):
            # The forest prediction is the mean of the individual tree predictions.
            return np.mean([t.predict(x) for t in self.trees], axis=0)

    def std_agg(cnt, s1, s2): return math.sqrt((s2/cnt) - (s1/cnt)**2)

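__init__: the constructor simply defines the random forest from our parameters and creates the required number of trees;
create_tree: creates a new decision tree by calling the constructor of the DecisionTree class. For now, assume it is a black box; we will write its code later. Each tree receives a random subset of the features (feature bagging) and a random set of rows;
predict: the prediction of our random forest is simply the average of the predictions of all the decision trees;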
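The sampling performed for each tree can be looked at in isolation. The sketch below is illustrative only: the function name `sample_rows_and_features` is hypothetical, and it uses a NumPy `Generator` where the class above seeds the global random state. It makes visible that shuffling and slicing yields distinct indices, so rows as well as features are drawn without replacement in this implementation.

```python
import numpy as np


def sample_rows_and_features(n_rows, n_cols, sample_sz, n_features, rng):
    """Pick distinct row indices and distinct feature indices for one tree."""
    row_idxs = rng.permutation(n_rows)[:sample_sz]
    feature_idxs = rng.permutation(n_cols)[:n_features]
    return row_idxs, feature_idxs


rng = np.random.default_rng(12)
row_idxs, feature_idxs = sample_rows_and_features(
    n_rows=10, n_cols=6, sample_sz=7, n_features=2, rng=rng)

print(row_idxs)      # 7 distinct row positions out of 10
print(feature_idxs)  # 2 distinct column positions out of 6
```

A classical bootstrap would instead draw rows with replacement, for example with `rng.integers(0, n_rows, size=sample_sz)`; which variant to use is a design choice.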

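Think how easy a random forest is if we can magically create trees. Now we lower the level of abstraction and write the code that creates a decision tree.

Decision tree class

The decision tree will have the following parameters:

idxs: this parameter keeps track of which indices of the original set go to the right subtree and which go to the left subtree. Every tree therefore has this idxs parameter, which stores the indices of the rows it contains. Predictions are made by averaging these rows.
min_leaf: the minimum number of row samples a node needs before it can be split. A node with too few rows cannot be split any further and becomes a leaf.
depth: the maximum depth, or maximum number of splits, possible within each tree.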

            
              
    class DecisionTree():
        def __init__(self, x, y, n_features, f_idxs, idxs, depth=10, min_leaf=5):
            self.x, self.y, self.idxs, self.min_leaf, self.f_idxs = x, y, idxs, min_leaf, f_idxs
            self.depth = depth
            self.n_features = n_features
            self.n, self.c = len(idxs), x.shape[1]
            # The node's prediction is the mean target of the rows it holds.
            self.val = np.mean(y[idxs])
            # No split has been made yet, so the score starts at infinity.
            self.score = float('inf')
            self.find_varsplit()

        def find_varsplit(self):
            # Will make it recursive later
            for i in self.f_idxs: self.find_better_split(i)

        def find_better_split(self, var_idx):
            # Let's write it later
            pass

        @property
        def split_name(self): return self.x.columns[self.var_idx]

        @property
        def split_col(self): return self.x.values[self.idxs, self.var_idx]

        @property
        def is_leaf(self): return self.score == float('inf') or self.depth <= 0

        def predict(self, x):
            return np.array([self.predict_row(xi) for xi in x])

        def predict_row(self, xi):
            if self.is_leaf: return self.val
            # Go left when the row's value is at or below the split point.
            t = self.lhs if xi[self.var_idx] <= self.split else self.rhs
            return t.predict_row(xi)

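We use the property decorator to make our code more concise.

__init__: the decision tree constructor. It has several interesting snippets worth examining:

a. if idxs is None: idxs = np.arange(len(y)). If we do not specify row indices for this particular tree, it simply takes all the rows;

b. self.val = np.mean(y[idxs]): each decision tree predicts a value that is the mean of all the rows it holds. The variable self.val holds the prediction for each node of the tree. For the root node, this value is simply the mean of all observations, because the root holds all the rows as long as no split has been made. I use the word "node" here because a decision tree is essentially just a node with a decision tree on its left and a decision tree on its right.

c. self.score = float('inf'): the score of a node is calculated from how "well" it splits the original data set. We will define this "well" later; for now, assume we have a way to measure such a quantity. The score is initially set to infinity because we have not made any split yet, so our current (non-existent) split is infinitely bad, which means any split is better than no split.

d. self.find_varsplit(): we make our first split!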

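find_varsplit: we use brute force to find the best split. This function loops through the columns in f_idxs in order and finds the best split among them. The function is still incomplete because it makes only a single split; later we extend it to build left and right decision trees for every split until we reach the leaf nodes.
split_name: a property that returns the name of the column we split on. var_idx is the index of this column; we compute this index, together with the value at which we split the column, in the find_better_split function.
split_col: a property that returns the column at index var_idx, restricted to the elements at the indices given by the idxs variable. Basically, it isolates a column together with the selected rows.
find_better_split: this function finds the best possible split in a given column. It is complicated, so we treat it as a black box in the code above and define it later.
is_leaf: a leaf node is a node that has never made a split, so it has an infinite score; this function uses that to identify leaf nodes. Likewise, if we have passed the maximum depth, i.e. self.depth <= 0, the node is a leaf, because we cannot go any deeper.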
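The way predict_row walks down the tree can be tried out on a tiny hand-built tree. This is a simplified sketch, not the article's class: `Node` is a hypothetical stand-in whose is_leaf merely checks for missing children (instead of the score and depth test described above), and the split point and leaf values are invented for illustration.

```python
import numpy as np


class Node:
    """Hypothetical minimal node: a leaf with a value, or a split with two children."""

    def __init__(self, val, var_idx=None, split=None, lhs=None, rhs=None):
        self.val, self.var_idx, self.split = val, var_idx, split
        self.lhs, self.rhs = lhs, rhs

    @property
    def is_leaf(self):
        # Simplification: a node without children is a leaf.
        return self.lhs is None and self.rhs is None

    def predict_row(self, xi):
        if self.is_leaf:
            return self.val
        # Rows at or below the split point go left, the others go right.
        child = self.lhs if xi[self.var_idx] <= self.split else self.rhs
        return child.predict_row(xi)

    def predict(self, x):
        return np.array([self.predict_row(xi) for xi in x])


# The root splits on column 0 at 2.5; each leaf stores the mean target of its rows.
tree = Node(val=20.0, var_idx=0, split=2.5,
            lhs=Node(val=10.0), rhs=Node(val=30.0))

predictions = tree.predict(np.array([[1.0, 7.0], [4.0, 7.0]]))
print(predictions)
```

Iterating over a NumPy array yields rows, which is what predict_row expects. Iterating over a pandas DataFrame yields column labels instead, so an array (for example `df.values`) is assumed as input here.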

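How do we find the best split?

A decision tree is trained by recursively splitting the data into two halves based on some condition. If the data set has 10 columns with 10 data points in each column, a total of 10*10 = 100 splits are possible, and our task is to find which of these splits is best for our data.

We judge a split by how "similar" the data within each of the two resulting halves is. One way to increase this similarity is to reduce the variance, or standard deviation, of both halves. We therefore want to minimize the weighted average of the standard deviations of the two sides. We use a greedy algorithm to find the split: for every value in the column we divide the data into two halves and compute the weighted average of the standard deviations of the two halves, looking for the minimum.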
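The scoring idea can be sketched by brute force on a single column. The function name `split_score` and the toy data are assumptions made for illustration. The score is the sum of the standard deviations weighted by the number of rows on each side, which ranks splits the same way as the weighted average because the total row count is the same for every candidate.

```python
import numpy as np


def split_score(x, y, threshold):
    """Count-weighted sum of the standard deviations of y on each side of the threshold."""
    left = y[x <= threshold]
    right = y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return float('inf')  # one side is empty, so this is not a real split
    return left.std() * len(left) + right.std() * len(right)


# Toy column: small x values go with small targets, large x values with large targets.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])

# Try every observed value as a threshold and keep the lowest score.
scores = {t: split_score(x, y, t) for t in x.tolist()}
best_threshold = min(scores, key=scores.get)
print(best_threshold)
```

On this data the lowest score is reached when the threshold separates the three small targets from the three large ones.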

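To speed this up, we can make a copy of the column and sort it. The weighted average for a split at index n+1 can then be computed from the sum and the sum of squares of the values in the two halves created by the split at index n, updated by the one value that changes sides. This is based on the following formula for the standard deviation, which is what std_agg computes:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2}$$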
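The identity behind this running-sum trick can be checked numerically. In the sketch below, std_agg follows the article's definition except for a guard against tiny negative values caused by rounding, which is an addition of this sketch. The toy targets are assumed to be already sorted by the feature column.

```python
import math

import numpy as np


def std_agg(cnt, s1, s2):
    """Standard deviation from a count, a sum, and a sum of squares."""
    variance = (s2 / cnt) - (s1 / cnt) ** 2
    return math.sqrt(max(variance, 0.0))  # guard added here, not in the article


y_sorted = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])

# Start with every row on the right, then move one row at a time to the left.
rhs_cnt, rhs_sum, rhs_sum2 = len(y_sorted), y_sorted.sum(), (y_sorted ** 2).sum()
lhs_cnt, lhs_sum, lhs_sum2 = 0, 0.0, 0.0

max_gap = 0.0
for i in range(len(y_sorted) - 1):
    yi = y_sorted[i]
    lhs_cnt += 1; rhs_cnt -= 1
    lhs_sum += yi; rhs_sum -= yi
    lhs_sum2 += yi ** 2; rhs_sum2 -= yi ** 2

    lhs_std = std_agg(lhs_cnt, lhs_sum, lhs_sum2)
    rhs_std = std_agg(rhs_cnt, rhs_sum, rhs_sum2)

    # Compare with the standard deviation recomputed from scratch on each half.
    gap = max(abs(lhs_std - y_sorted[:i + 1].std()),
              abs(rhs_std - y_sorted[i + 1:].std()))
    max_gap = max(max_gap, gap)

print(max_gap)  # should be at the level of floating-point rounding
```

Each step costs a constant amount of work, so scanning a sorted column is linear in the number of rows instead of recomputing both standard deviations for every candidate split.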

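The figures below illustrate the score computation. The last column in each figure is a single number, the split score, i.e. the weighted average of the left and right standard deviations.

We start by sorting each column:

Then we try the splits in order: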

[Figures: split score computed at index = 0, 1, 2 (best split), 3, 4 and 5]

With this simple greedy search we find that the split at index = 2 is the best one, because it has the lowest score. Later we repeat the same steps for every column and compare the results to find the overall minimum.
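The same search can be sketched by brute force on a small column. The data here is made up for illustration and is not the data shown in the article's figures. Every split position of the sorted column is scored with the count-weighted standard deviations of the two halves.

```python
import numpy as np

# Made-up toy column (x) and target (y); not the data from the figures.
x = np.array([4.0, 1.0, 3.0, 6.0, 2.0, 5.0])
y = np.array([9.0, 1.0, 2.0, 11.0, 1.5, 10.0])

order = np.argsort(x)
sort_x, sort_y = x[order], y[order]

best_score, best_index = float('inf'), None
for i in range(len(sort_x) - 1):          # split between position i and i + 1
    left, right = sort_y[:i + 1], sort_y[i + 1:]
    score = np.std(left) * len(left) + np.std(right) * len(right)
    if score < best_score:
        best_score, best_index = score, i

print(best_index, sort_x[best_index], best_score)
```

This version recomputes both standard deviations from scratch at every position, which is simple to read but quadratic in the number of rows. The running-sum approach avoids that cost.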

The following code implements the procedure illustrated in the figures above:

            
              
import math
import numpy as np

def std_agg(cnt, s1, s2):
    # Standard deviation from the count, the sum (s1) and the sum of squares (s2).
    # max(..., 0.0) guards against a tiny negative value caused by rounding.
    return math.sqrt(max((s2/cnt) - (s1/cnt)**2, 0.0))

# Method of the DecisionTree class.
def find_better_split(self, var_idx):
    x, y = self.x.values[self.idxs, var_idx], self.y[self.idxs]
    sort_idx = np.argsort(x)
    sort_y, sort_x = y[sort_idx], x[sort_idx]
    # Start with every row on the right-hand side and none on the left.
    rhs_cnt, rhs_sum, rhs_sum2 = self.n, sort_y.sum(), (sort_y**2).sum()
    lhs_cnt, lhs_sum, lhs_sum2 = 0, 0., 0.

    for i in range(0, self.n - self.min_leaf - 1):
        xi, yi = sort_x[i], sort_y[i]
        # Move row i from the right half to the left half.
        lhs_cnt += 1; rhs_cnt -= 1
        lhs_sum += yi; rhs_sum -= yi
        lhs_sum2 += yi**2; rhs_sum2 -= yi**2
        # Skip splits that are too small or that fall between equal values.
        if i < self.min_leaf or xi == sort_x[i+1]:
            continue

        lhs_std = std_agg(lhs_cnt, lhs_sum, lhs_sum2)
        rhs_std = std_agg(rhs_cnt, rhs_sum, rhs_sum2)
        curr_score = lhs_std*lhs_cnt + rhs_std*rhs_cnt
        if curr_score < self.score:
            self.var_idx, self.score, self.split = var_idx, curr_score, xi

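The code above needs some explanation:

The function std_agg computes the standard deviation from the sum and the sum of squares of the values;
curr_score = lhs_std*lhs_cnt + rhs_std*rhs_cnt: the split score of each iteration is the count-weighted sum of the two standard deviations. Since the total number of rows is fixed, minimising it is equivalent to minimising their weighted average. A lower score means lower variance, and lower variance means similar data is grouped together, which leads to better predictions;
if curr_score < self.score: whenever the current score is lower than the best score found so far, we store the column index in self.var_idx, the score in self.score and the value we split at in self.split.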
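To experiment with the scoring loop outside the class, the same idea can be written as a standalone function on arrays. This is a sketch under stated assumptions: `best_split_1d` is a hypothetical helper, and it checks `min_leaf` against the row counts of both halves directly. That differs slightly from the loop bounds used in the method above.

```python
import math
import numpy as np

def std_agg(cnt, s1, s2):
    return math.sqrt(max(s2 / cnt - (s1 / cnt) ** 2, 0.0))

def best_split_1d(x, y, min_leaf=1):
    """Hypothetical array-only helper: returns (score, split_value) for one column."""
    order = np.argsort(x)
    sort_x, sort_y = x[order], y[order]
    n = len(y)
    rhs_cnt, rhs_sum, rhs_sum2 = n, sort_y.sum(), (sort_y ** 2).sum()
    lhs_cnt, lhs_sum, lhs_sum2 = 0, 0.0, 0.0
    best = (float('inf'), None)
    for i in range(n - 1):
        yi = sort_y[i]
        lhs_cnt += 1; rhs_cnt -= 1
        lhs_sum += yi; rhs_sum -= yi
        lhs_sum2 += yi ** 2; rhs_sum2 -= yi ** 2
        # Skip splits that leave too few rows on a side or fall between equal x values.
        if lhs_cnt < min_leaf or rhs_cnt < min_leaf or sort_x[i] == sort_x[i + 1]:
            continue
        score = (std_agg(lhs_cnt, lhs_sum, lhs_sum2) * lhs_cnt
                 + std_agg(rhs_cnt, rhs_sum, rhs_sum2) * rhs_cnt)
        if score < best[0]:
            best = (score, sort_x[i])
    return best

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])      # made-up column
y = np.array([1.0, 1.5, 2.0, 9.0, 10.0, 11.0])    # made-up target
score, split = best_split_1d(x, y)
print(score, split)
print(best_split_1d(x, y, min_leaf=4))   # no valid split: the score stays infinite
```

When no position satisfies the constraints the score stays at infinity, which is the same condition the node later uses to recognise itself as a leaf.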

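Now that we know how to find the best split for a given column, we need to split recursively within each decision tree. For each tree we find the best column and the value to split at, and then we recursively build two decision trees until we reach the leaf nodes. To do this, we extend the incomplete function find_varsplit: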

            
              
# Method of the DecisionTree class.
def find_varsplit(self):
    for i in self.f_idxs: self.find_better_split(i)
    if self.is_leaf: return
    x = self.split_col
    lhs = np.nonzero(x <= self.split)[0]
    rhs = np.nonzero(x > self.split)[0]
    lf_idxs = np.random.permutation(self.x.shape[1])[:self.n_features]
    rf_idxs = np.random.permutation(self.x.shape[1])[:self.n_features]
    self.lhs = DecisionTree(self.x, self.y, self.n_features, lf_idxs, self.idxs[lhs], depth=self.depth-1, min_leaf=self.min_leaf)
    self.rhs = DecisionTree(self.x, self.y, self.n_features, rf_idxs, self.idxs[rhs], depth=self.depth-1, min_leaf=self.min_leaf)
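Two indexing details in this function are easy to miss, so here is a small sketch with made-up values. `np.nonzero(...)[0]` returns positions within the node's own rows, which is why they are mapped back through `idxs` before being passed to the children. `np.random.permutation(n)[:k]` draws `k` distinct column indices, i.e. it samples without replacement. The seeded `RandomState` is only there to make the sketch repeatable.

```python
import numpy as np

rng = np.random.RandomState(0)   # seeded only to make the sketch repeatable

idxs = np.array([7, 2, 9, 4, 5])             # dataset rows held by the current node (made up)
col = np.array([0.3, 1.2, 0.8, 2.5, 0.1])    # split column restricted to those rows
split = 0.8

lhs = np.nonzero(col <= split)[0]    # positions within the node, not dataset row numbers
rhs = np.nonzero(col > split)[0]

left_rows, right_rows = idxs[lhs], idxs[rhs]   # map positions back to dataset rows
print(left_rows, right_rows)

n_columns, n_features = 6, 2
f_idxs = rng.permutation(n_columns)[:n_features]   # distinct column indices
print(f_idxs)
```

Every row of the node ends up in exactly one child, and each child receives its own freshly sampled set of candidate columns.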

Wrapping up

Finally, here is the complete code:

            
              
                class
              
              
                RandomForest
              
              
                (
              
              
                )
              
              
                :
              
              
                def
              
              
                __init__
              
              
                (
              
              self
              
                ,
              
               x
              
                ,
              
               y
              
                ,
              
               n_trees
              
                ,
              
               n_features
              
                ,
              
               sample_sz
              
                ,
              
               depth
              
                =
              
              
                10
              
              
                ,
              
               min_leaf
              
                =
              
              
                5
              
              
                )
              
              
                :
              
              
        np
              
                .
              
              random
              
                .
              
              seed
              
                (
              
              
                12
              
              
                )
              
              
                if
              
               n_features 
              
                ==
              
              
                'sqrt'
              
              
                :
              
              
            self
              
                .
              
              n_features 
              
                =
              
              
                int
              
              
                (
              
              np
              
                .
              
              sqrt
              
                (
              
              x
              
                .
              
              shape
              
                [
              
              
                1
              
              
                ]
              
              
                )
              
              
                )
              
              
                elif
              
               n_features 
              
                ==
              
              
                'log2'
              
              
                :
              
              
            self
              
                .
              
              n_features 
              
                =
              
              
                int
              
              
                (
              
              np
              
                .
              
              log2
              
                (
              
              x
              
                .
              
              shape
              
                [
              
              
                1
              
              
                ]
              
              
                )
              
              
                )
              
              
                else
              
              
                :
              
              
            self
              
                .
              
              n_features 
              
                =
              
               n_features
        
              
                print
              
              
                (
              
              self
              
                .
              
              n_features
              
                ,
              
              
                "sha: "
              
              
                ,
              
              x
              
                .
              
              shape
              
                [
              
              
                1
              
              
                ]
              
              
                )
              
                  
        self
              
                .
              
              x
              
                ,
              
               self
              
                .
              
              y
              
                ,
              
               self
              
                .
              
              sample_sz
              
                ,
              
               self
              
                .
              
              depth
              
                ,
              
               self
              
                .
              
              min_leaf  
              
                =
              
               x
              
                ,
              
               y
              
                ,
              
               sample_sz
              
                ,
              
               depth
              
                ,
              
               min_leaf
        self
              
                .
              
              trees 
              
                =
              
              
                [
              
              self
              
                .
              
              create_tree
              
                (
              
              
                )
              
              
                for
              
               i 
              
                in
              
              
                range
              
              
                (
              
              n_trees
              
                )
              
              
                ]
              
              
                def
              
              
                create_tree
              
              
                (
              
              self
              
                )
              
              
                :
              
              
        idxs 
              
                =
              
               np
              
                .
              
              random
              
                .
              
              permutation
              
                (
              
              
                len
              
              
                (
              
              self
              
                .
              
              y
              
                )
              
              
                )
              
              
                [
              
              
                :
              
              self
              
                .
              
              sample_sz
              
                ]
              
              
        f_idxs 
              
                =
              
               np
              
                .
              
              random
              
                .
              
              permutation
              
                (
              
              self
              
                .
              
              x
              
                .
              
              shape
              
                [
              
              
                1
              
              
                ]
              
              
                )
              
              
                [
              
              
                :
              
              self
              
                .
              
              n_features
              
                ]
              
              
                return
              
               DecisionTree
              
                (
              
              self
              
                .
              
              x
              
                .
              
              iloc
              
                [
              
              idxs
              
                ]
              
              
                ,
              
               self
              
                .
              
              y
              
                [
              
              idxs
              
                ]
              
              
                ,
              
               self
              
                .
              
              n_features
              
                ,
              
               f_idxs
              
                ,
              
              
                    idxs
              
                =
              
              np
              
                .
              
              array
              
                (
              
              
                range
              
              
                (
              
              self
              
                .
              
              sample_sz
              
                )
              
              
                )
              
              
                ,
              
              depth 
              
                =
              
               self
              
                .
              
              depth
              
                ,
              
               min_leaf
              
                =
              
              self
              
                .
              
              min_leaf
              
                )
              
              
                def
              
              
                predict
              
              
                (
              
              self
              
                ,
              
               x
              
                )
              
              
                :
              
              
                return
              
               np
              
                .
              
              mean
              
                (
              
              
                [
              
              t
              
                .
              
              predict
              
                (
              
              x
              
                )
              
              
                for
              
               t 
              
                in
              
               self
              
                .
              
              trees
              
                ]
              
              
                ,
              
               axis
              
                =
              
              
                0
              
              
                )
              
              
                def
              
              
                std_agg
              
              
                (
              
              cnt
              
                ,
              
               s1
              
                ,
              
               s2
              
                )
              
              
                :
              
              
                return
              
               math
              
                .
              
              sqrt
              
                (
              
              
                (
              
              s2
              
                /
              
              cnt
              
                )
              
              
                -
              
              
                (
              
              s1
              
                /
              
              cnt
              
                )
              
              
                **
              
              
                2
              
              
                )
              
              
                class
              
              
                DecisionTree
              
              
                (
              
              
                )
              
              
                :
              
              
                def
              
              
                __init__
              
              
                (
              
              self
              
                ,
              
               x
              
                ,
              
               y
              
                ,
              
               n_features
              
                ,
              
               f_idxs
              
                 , idxs, depth=10, min_leaf=5):
        self.x, self.y, self.idxs, self.min_leaf, self.f_idxs = x, y, idxs, min_leaf, f_idxs
        self.depth = depth
        print(f_idxs)
        # print(self.depth)
        self.n_features = n_features
        self.n, self.c = len(idxs), x.shape[1]
        self.val = np.mean(y[idxs])
        self.score = float('inf')
        self.find_varsplit()
              
              
    def find_varsplit(self):
        # Try every candidate feature and keep the best split found so far.
        for i in self.f_idxs:
            self.find_better_split(i)
        # No usable split (or depth exhausted): this node stays a leaf.
        if self.is_leaf:
            return
        x = self.split_col
        lhs = np.nonzero(x <= self.split)[0]
        rhs = np.nonzero(x > self.split)[0]
        # Each child draws its own random subset of n_features columns.
        lf_idxs = np.random.permutation(self.x.shape[1])[:self.n_features]
        rf_idxs = np.random.permutation(self.x.shape[1])[:self.n_features]
        self.lhs = DecisionTree(self.x, self.y, self.n_features, lf_idxs, self.idxs[lhs],
                                depth=self.depth-1, min_leaf=self.min_leaf)
        self.rhs = DecisionTree(self.x, self.y, self.n_features, rf_idxs, self.idxs[rhs],
                                depth=self.depth-1, min_leaf=self.min_leaf)
              
              
    def find_better_split(self, var_idx):
        x, y = self.x.values[self.idxs, var_idx], self.y[self.idxs]
        sort_idx = np.argsort(x)
        sort_y, sort_x = y[sort_idx], x[sort_idx]
        # Start with every row on the right-hand side, then move rows to the left one by one.
        rhs_cnt, rhs_sum, rhs_sum2 = self.n, sort_y.sum(), (sort_y**2).sum()
        lhs_cnt, lhs_sum, lhs_sum2 = 0, 0., 0.

        for i in range(0, self.n - self.min_leaf - 1):
            xi, yi = sort_x[i], sort_y[i]
            lhs_cnt += 1; rhs_cnt -= 1
            lhs_sum += yi; rhs_sum -= yi
            lhs_sum2 += yi**2; rhs_sum2 -= yi**2
            # Skip splits that leave too few rows on the left or that fall between equal values.
            if i < self.min_leaf or xi == sort_x[i+1]:
                continue

            lhs_std = std_agg(lhs_cnt, lhs_sum, lhs_sum2)
            rhs_std = std_agg(rhs_cnt, rhs_sum, rhs_sum2)
            # Score is the row-weighted sum of the two standard deviations (lower is better).
            curr_score = lhs_std*lhs_cnt + rhs_std*rhs_cnt
            if curr_score < self.score:
                self.var_idx, self.score, self.split = var_idx, curr_score, xi

    @property
    def split_name(self):
        return self.x.columns[self.var_idx]

    @property
    def split_col(self):
        return self.x.values[self.idxs, self.var_idx]

    @property
    def is_leaf(self):
        return self.score == float('inf') or self.depth <= 0
              
              
    def predict(self, x):
        return np.array([self.predict_row(xi) for xi in x])
              
              
    def predict_row(self, xi):
        if self.is_leaf:
            return self.val
        t = self.lhs if xi[self.var_idx] <= self.split else self.rhs
        # Recurse into the chosen child until a leaf is reached.
        return t.predict_row(xi)
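
The split search in find_better_split can be hard to follow inside the class, so here is a minimal standalone sketch of the same idea for a single feature. It keeps running counts, sums, and sums of squares on each side, so every candidate split is scored without recomputing a standard deviation from scratch. The function name best_split_1d, the loop bounds, and the max(..., 0.0) rounding guard in std_agg are assumptions made for this illustration; they are not taken from the article's code.

```python
import math
import numpy as np


def std_agg(cnt, s1, s2):
    """Standard deviation from a count, a sum and a sum of squares."""
    # The max(..., 0.0) guard is an addition for this sketch: rounding can
    # make the variance slightly negative when all values are equal.
    return math.sqrt(max(s2 / cnt - (s1 / cnt) ** 2, 0.0))


def best_split_1d(x, y, min_leaf=1):
    """Return (score, threshold) of the best split of y on one feature x.

    Hypothetical helper: rows with x <= threshold go left, the rest go right.
    The score is the row-weighted sum of both sides' standard deviations.
    """
    order = np.argsort(x)
    sort_x, sort_y = x[order], y[order]
    n = len(sort_y)

    # Every row starts on the right-hand side.
    rhs_cnt, rhs_sum, rhs_sum2 = n, sort_y.sum(), (sort_y ** 2).sum()
    lhs_cnt, lhs_sum, lhs_sum2 = 0, 0.0, 0.0
    best_score, best_split = float("inf"), None

    # Stop early enough that the right side always keeps min_leaf rows.
    for i in range(n - min_leaf):
        xi, yi = sort_x[i], sort_y[i]
        # Move row i from the right side to the left side.
        lhs_cnt += 1
        rhs_cnt -= 1
        lhs_sum += yi
        rhs_sum -= yi
        lhs_sum2 += yi ** 2
        rhs_sum2 -= yi ** 2

        # Skip splits with too few rows on the left or between equal values.
        if lhs_cnt < min_leaf or xi == sort_x[i + 1]:
            continue

        score = (std_agg(lhs_cnt, lhs_sum, lhs_sum2) * lhs_cnt
                 + std_agg(rhs_cnt, rhs_sum, rhs_sum2) * rhs_cnt)
        if score < best_score:
            best_score, best_split = score, xi

    return best_score, best_split


# Toy data: the target jumps from 1 to 5 between x = 3 and x = 10.
x = np.array([1., 2., 3., 10., 11., 12.])
y = np.array([1., 1., 1., 5., 5., 5.])
score, split = best_split_1d(x, y)
print(score, split)
```

On this toy data the chosen threshold is 3.0, because both sides are then constant and the weighted score drops to zero. Each step only updates six running totals, so once the feature is sorted the search costs one pass over the rows.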

