Gradient Descent
機(jī)器學(xué)習(xí)中很多模型的參數(shù)估計(jì)都要用到優(yōu)化算法,梯度下降是其中最簡(jiǎn)單也用得最多的優(yōu)化算法之一。梯度下降(Gradient Descent)[3]也被稱之為最快梯度(Steepest Descent),可用于尋找函數(shù)的局部最小值。梯度下降的思路為,函數(shù)值在梯度反方向下降是最快的,只要沿著函數(shù)的梯度反方向移動(dòng)足夠小的距離到一個(gè)新的點(diǎn),那么函數(shù)值必定是非遞增的,如圖1所示。
梯度下降思想的數(shù)學(xué)表述如下: \begin{equation} b=a-\alpha \nabla F(a)\Rightarrow f(a)\geq f(b) \end{equation} 其中\(zhòng)(f(x)\)為存在下界的可導(dǎo)函數(shù)。根據(jù)該思路,如果我們從\(x_0\)為出發(fā)點(diǎn),每次沿著當(dāng)前函數(shù)梯度反方向移動(dòng)一定距離\(\alpha_k\),得到序列\(zhòng)(x_0,x_1,\cdots,x_n\): \begin{equation} x_{k+1}=x_k-\alpha_k\nabla f(x_k), 0\leq k\leq n \end{equation} 對(duì)應(yīng)的各點(diǎn)函數(shù)值序列之間的關(guān)系為: \begin{equation} f(x_0)\geq f(x_1)\geq f(x_2)\geq\cdots\geq f(x_n) \end{equation} 很顯然,當(dāng)\(n\)達(dá)到一定值時(shí),函數(shù)\(f(x)\)是會(huì)收斂到局部最小值的。算法1簡(jiǎn)單描述了一般化的梯度優(yōu)化方法。
在算法1中,我們需要選擇一個(gè)搜索方向\(d_k\)滿足以下關(guān)系: \begin{equation} f(x_k+\alpha d_k)<f(x_k)\; \forall\alpha\in (0,\epsilon] \end{equation} 當(dāng)\(d_k=-\nabla f(x)\)時(shí)\(f(x)\)下降最快,但是只要滿足\(\nabla f(x_k)^Td_k<0\)的\(d_k\)都可以作為搜素方向。一般搜索方向表述為如下形式: \begin{equation} d_k=-B_k\nabla f(x_k) \end{equation} 其中\(zhòng)(B_k\)為正定矩陣。當(dāng)\(B_k=I\)時(shí)對(duì)應(yīng)最快梯度下降算法;當(dāng)\(B_k=H(x_k)^{-1}\)時(shí)對(duì)應(yīng)牛頓法,如果\(H(x_k)=\nabla^2f(x_k)\)為正定矩陣。 在迭代過(guò)程中用于更新\(x_k\)的步長(zhǎng)\(\alpha_k\)可以是常數(shù)也可以是變化的。如果\(\alpha_k\)足夠小,收斂是可以得到保證的,但這意味這迭代次數(shù)\(n\)要很大時(shí)函數(shù)才會(huì)收斂(圖2(a));如果\(\alpha_k\)比較大,更新后的點(diǎn)很可能越過(guò)局部最優(yōu)解(圖2(b))。有什么方法可以幫助我們自動(dòng)確定最優(yōu)步長(zhǎng)呢?下面要說(shuō)的線性搜索就包含一組解決方案。
Line Search
在給定搜索方向\(d_k\)的前提下,線性搜索要解決的問(wèn)題如下: \begin{equation} \alpha=arg\;\underset{\alpha\geq 0}{\min}h(\alpha)=arg\;\underset{\alpha\geq 0}{\min}f(x_k+\alpha d_k) \end{equation} 如果\(h(\alpha)\)是可微的凸函數(shù),我們能通過(guò)解析解直接求得上式最優(yōu)的步長(zhǎng);但非線性的優(yōu)化問(wèn)題需要通過(guò)迭代形式求得近似的最優(yōu)步長(zhǎng)。對(duì)于上式,局部或全局最優(yōu)解對(duì)應(yīng)的導(dǎo)數(shù)為\(h'(\alpha)=\nabla f(x_k+\alpha d_k)^Td_k=0\)。因?yàn)閈(d_k\)與\(f(x_k)\)在\(x_k\)處的梯度方向夾角大于90度,因此\(h'(0)\leq 0\),如果能找到\(\hat{\alpha}\)使得\(h'(\hat{\alpha})>0\),那么必定存在\(\alpha^{\star}\in [0,\hat{\alpha})\)使得\(h'(\alpha^{\star})=0\)。有多種迭代算法可以求得\(\alpha^{\star}\)的近似值,下面選擇幾種典型的介紹。
Bisection Search
二分線性搜索(Bisection Line Search)[2]可用于求解函數(shù)的根,其思想很簡(jiǎn)單,就是不斷將現(xiàn)有區(qū)間劃分為兩半,選擇必定含有使\(h'(\alpha)=0\)的半個(gè)區(qū)間作為下次迭代的區(qū)間,直到尋得\(h'(\alpha^{\star})\approx 0\)為止,算法描述見(jiàn)2。
二分線性搜素可以確保\(h(\alpha)\)是收斂的,只要\(h(\alpha)\)在區(qū)間\((0,\hat{\alpha})\)上是連續(xù)的且\(h'(0)\)和\(h(\hat{\alpha})\)異號(hào)。經(jīng)歷\(n\)次迭代后,當(dāng)前區(qū)間\([\alpha_l,\alpha_h]\)的長(zhǎng)度為: \begin{equation} L=\left(\frac{1}{2}\right)^n\hat{\alpha} \end{equation} 由迭代的終止條件之一\(\alpha_h-\alpha_l\geq\epsilon\)知迭代次數(shù)的上界為: \begin{equation} L\leq \epsilon\Rightarrow k\leq\left[\log_2\left(\frac{\hat{\alpha}}{\epsilon}\right)\right] \end{equation} 下面給出二分搜索的Python代碼
1 def bisection(dfun,theta,args,d,low,high,maxiter=1e4 ): 2 """ 3 #Functionality:find the root of the function(fun) in the interval [low,high] 4 #@Parameters 5 #dfun:compute the graident of function f(x) 6 #theta:Parameters of the model 7 #args:other variables needed to compute the value of dfun 8 #[low,high]:the interval which contains the root 9 #maxiter:the max number of iterations 10 """ 11 eps=1e-6 12 val_low=np.sum(dfun(theta+low*d,args)* d.T) 13 val_high=np.sum(dfun(theta+high*d,args)* d.T) 14 if val_low*val_high> 0: 15 raise Exception( ' Invalid interval! ' ) 16 iter_num=1 17 while iter_num< maxiter: 18 mid=(low+high)/2 19 val_mid=np.sum(dfun(theta+mid*d,args)* d.T) 20 if abs(val_mid)<eps or abs(high-low)< eps: 21 return mid 22 elif val_mid*val_low> 0: 23 low= mid 24 else : 25 high= mid 26 iter_num+=1
Backtracking
回溯線性搜索(Backing Line Search)[1]基于Armijo準(zhǔn)則計(jì)算搜素方向上的最大步長(zhǎng),其基本思想是沿著搜索方向移動(dòng)一個(gè)較大的步長(zhǎng)估計(jì)值,然后以迭代形式不斷縮減步長(zhǎng),直到該步長(zhǎng)使得函數(shù)值\(f(x_k+\alpha d_k)\)相對(duì)與當(dāng)前函數(shù)值\(f(x_k)\)的減小程度大于期望值(滿足Armijo準(zhǔn)則)為止。Armijo準(zhǔn)則(見(jiàn)圖3)的數(shù)學(xué)描述如下: \begin{equation} f(x_k+\alpha d_k)\leq f(x_k)+c_1\alpha f'(x_k)^Td_k \end{equation} 其中\(zhòng)(f:\mathbb{R}^n\rightarrow\mathbb{R}\),\(c_1\in(0,1)\),\(\alpha\)為步長(zhǎng),\(d_k\in\mathbb{R}^n\)為滿足\(f'(x_k)^Td_k<0\)的搜索方向。
但是僅憑Armijo準(zhǔn)則不足以求得較好的步長(zhǎng),根據(jù)前面的梯度下降的知識(shí)可知,只要\(\alpha\)足夠小就能滿足Armijo準(zhǔn)則。因此常用的策略就是從較大的步長(zhǎng)開(kāi)始,然后以\(\tau\in(0,1)\)的速度縮短步長(zhǎng),直到滿足Armijo準(zhǔn)則為止,這樣選出來(lái)的步長(zhǎng)不至于太小,對(duì)應(yīng)的算法描述見(jiàn)3。前面介紹的二分線性搜索的目標(biāo)是求得滿足\(h'(\alpha)\approx 0\)的最優(yōu)步長(zhǎng)近似值,而回溯線性搜索放松了對(duì)步長(zhǎng)的約束,只要步長(zhǎng)能使函數(shù)值有足夠大的變化即可。前者可以少計(jì)算幾次搜索方向,但在計(jì)算最優(yōu)步長(zhǎng)上花費(fèi)了不少代價(jià);后者退而求其次,找到一個(gè)差不多的步長(zhǎng)即可,那么代價(jià)就是要多計(jì)算幾次搜索方向。
接下來(lái),我們要證明回溯線性搜索在Armijo準(zhǔn)則下的收斂性問(wèn)題[6]。因?yàn)閈(h'(0)=f'(x_k)^Td_k<0\),且\(0<c_1<1\),則有 \begin{equation} h'(0)<c_1h'(0)<0 \end{equation} 根據(jù)導(dǎo)數(shù)的基本定義,結(jié)合上式,有如下關(guān)系: \begin{equation} h'(0)=\lim_{\alpha\rightarrow 0}\frac{h(\alpha)-h(0)}{\alpha}=\lim_{\alpha\rightarrow 0}\frac{f(x_k+\alpha d_k)-f(x_k)}{\alpha}<ch'(0) \end{equation} 因此,存在一個(gè)步長(zhǎng)\(\hat{\alpha}>0\),對(duì)任意的\(\alpha\in(0,\hat{\alpha})\),下式均成立 \begin{equation} \frac{f(x_k+\alpha d_k)-f(x_k)}{\alpha}<cf'(x_k)^Td_k \end{equation} 即\(\forall \alpha\in(0,\hat{\alpha}),f(x_k+\alpha d_k)<f(x_k)+c\alpha f'(x_k)^Td_k\)。 下面給出基于Armijo準(zhǔn)則的線性搜索Python代碼:
1 def ArmijoBacktrack(fun,dfun,theta,args,d,stepsize=1,tau=0.5,c1=1e-3 ): 2 """ 3 #Functionality:find an acceptable stepsize via backtrack under Armijo rule 4 #@Parameters 5 #fun:compute the value of objective function 6 #dfun:compute the gradient of objective function 7 #theta:a vector of parameters of the model 8 #stepsize:initial step size 9 #c1:sufficient decrease Parameters 10 #tau:rate of shrink of stepsize 11 """ 12 slope=np.sum(dfun(theta,args)* d.T) 13 obj_old= costFunction(theta,args) 14 theta_new=theta+stepsize* d 15 obj_new= costFunction(theta_new,args) 16 while obj_new>obj_old+c1*stepsize* slope: 17 stepsize*= tau 18 theta_new=theta+stepsize* d 19 obj_new= costFunction(theta_new,args) 20 return stepsize
Interpolation
基于Armijo準(zhǔn)則的回溯線性搜索的收斂速度無(wú)法得到保證,特別是要回退很多次后才能落入滿足Armijo準(zhǔn)則的區(qū)間。如果我們根據(jù)已有的函數(shù)值和導(dǎo)數(shù)信息,采用多項(xiàng)式插值法(Interpolation)[12,6,5,9]擬合函數(shù),然后根據(jù)該多項(xiàng)式函數(shù)估計(jì)函數(shù)的極值點(diǎn),這樣選擇合適步長(zhǎng)的效率會(huì)高很多。 假設(shè)我們只有\(zhòng)(x_k\)處的函數(shù)值\(f(x_k)\)及其倒數(shù)\(f'(x_k)\),且第一次嘗試的步長(zhǎng)為\(\alpha_0\)。如果\(\alpha_0\)不滿足條件,那么我們根據(jù)這些信息可以構(gòu)造一個(gè)二次近似函數(shù)\(h_q(\alpha)\) \begin{equation} h_q(\alpha)=\left(\frac{h(\alpha_0)-h(0)-\alpha_0h'(0)}{\alpha_0^2}\right)\alpha^2+h'(0)\alpha+h(0) \end{equation} 注意,該二次函數(shù)滿足\(h_q(0)=h(0)\),\(h_q'(0)=h'(0)\)和\(h_q(\alpha_0)=h(\alpha_0)\),如圖4(a)所示。接下來(lái),根據(jù)\(h_q(\alpha)\)的最小值估計(jì)下一個(gè)步長(zhǎng): \begin{equation} \alpha_1=\frac{h'(0)\alpha_0^2}{2[h(0)+h'(0)\alpha_0-h(\alpha_0)]} \end{equation} 如果\(\alpha_1\)仍然不滿足條件,我們可以繼續(xù)重復(fù)上述過(guò)程,直到得到的步長(zhǎng)滿足條件為止。假設(shè)我們?cè)谡麄€(gè)線性搜索過(guò)程中都用二次插值函數(shù),那么最好有\(zhòng)(c_1\in(0,0.5]\),為什么呢?簡(jiǎn)單證明一下:如果\(\alpha_0\)不滿足Armijo準(zhǔn)則,那么必定存在比\(\alpha_0\)小的步長(zhǎng)滿足該準(zhǔn)則,所以利用二次插值函數(shù)估算的步長(zhǎng)\(\alpha_1<\alpha_0\)才合理。結(jié)合\(\alpha_0\)不滿足Armijo準(zhǔn)則和\(\alpha_1<\alpha_0\),可知\(c_1\leq 0.5\)。 如果我們已經(jīng)嘗試了多個(gè)步長(zhǎng),卻每次只用上一次步長(zhǎng)的相關(guān)信息構(gòu)造二次函數(shù),未免是對(duì)計(jì)算資源的浪費(fèi),其實(shí)我們可以利用多個(gè)步長(zhǎng)的信息構(gòu)造信息量更大更準(zhǔn)確的插值函數(shù)的。在計(jì)算導(dǎo)數(shù)的代價(jià)大于計(jì)算函數(shù)值的代價(jià)時(shí),應(yīng)盡量避免計(jì)算\(h'(\alpha)\),下面給出一個(gè)三次插值函數(shù)\(h_c(\alpha)\),如圖4(b)所示 \begin{equation} h_c(\alpha)=a\alpha^3+b\alpha^2+h'(0)\alpha+h(0) \end{equation} 其中 \begin{equation} \left[\begin{array}{c} a\\ b \end{array}\right] =\frac{1}{\alpha_{i-1}^2\alpha_i^2(\alpha_i-\alpha_{i-1})} \left[\begin{array}{cc} \alpha_{i-1}^2 & -\alpha_i^2\\ -\alpha_{i-1}^3 & \alpha_i^3 \end{array}\right] \left[\begin{array}{c} h(\alpha_i)-h(0)-h'(0)\alpha_i\\ h(\alpha_{i-1})-h(0)-h'(0)\alpha_{i-1} \end{array}\right] \end{equation} 對(duì)\(h_c(\alpha)\)求導(dǎo),可得極值點(diǎn)\(\alpha_{i+1}\in[0,\alpha_i]\)的形式如下: \begin{equation} \alpha_{i+1}=\frac{-b+\sqrt{b^2-3ah'(0)}}{3a} \end{equation} 利用以上的三次插值函數(shù)求解下一個(gè)步長(zhǎng)的過(guò)程不斷重復(fù),直到步長(zhǎng)滿足條件為止。如果出現(xiàn)\(a=0\)的情況,三次插值函數(shù)退化為二次插值函數(shù),在實(shí)現(xiàn)該算法時(shí)需要注意這點(diǎn)。在此過(guò)程中,如果\(\alpha_i\)太小或\(\alpha_{i-1}\)與\(\alpha_i\)太接近,需要重置\(\alpha_i=\alpha_{i-1}/2\),該保護(hù)措施(safeguards)保證下一次的步長(zhǎng)不至于太小[6,5]。為什么會(huì)有這個(gè)作用呢?1)因?yàn)閈(\alpha_{i+1}\in[0,\alpha_i]\),所以當(dāng)\(\alpha_i\)很小時(shí)\(\alpha_{i+1}\)也很小;2)當(dāng)\(\alpha_{i-1}\)與\(\alpha_i\)太靠近時(shí)有\(zhòng)(a\approx b\approx\infty\),根據(jù)\(\alpha_{i+1}\)的表達(dá)式可知\(\alpha_{i+1}\approx 0\)。 但是,在很多情況下,計(jì)算函數(shù)值后只需付出較小的代價(jià)就能順帶計(jì)算出導(dǎo)數(shù)值或其近似值,這使得我們可以用更精確的三次Hermite多項(xiàng)式[6]進(jìn)行插值,如圖4(c)所示 \begin{equation} \begin{array}{rl} H_3(\alpha)=&\left[1+2\frac{\alpha_i-\alpha}{\alpha_i-\alpha_{i-1}}\right]\left[\frac{\alpha-\alpha_{i-1}}{\alpha_i-\alpha_{i-1}}\right]^2h(\alpha_{i})\\ &+\left[1+2\frac{\alpha-\alpha_{i-1}}{\alpha_i-\alpha_{i-1}}\right]\left[\frac{\alpha_{i+1}-\alpha}{\alpha_i-\alpha_{i-1}}\right]^2h(\alpha_{i-1})\\ &+(\alpha-\alpha_i)\left[\frac{\alpha-\alpha_{i-1}}{\alpha_i-\alpha_{i-1}}\right]^2h'(\alpha_i)\\ &+(\alpha-\alpha_{i-1})\left[\frac{\alpha_i-\alpha}{\alpha_i-\alpha_{i-1}}\right]^2h'(\alpha_{i-1}) \end{array} \end{equation} 其中,三次Hermite多項(xiàng)式滿足\(H_3(\alpha_{i-1})=h(\alpha_{i-1})\),\(H_3(\alpha_{i})=h(\alpha_{i})\),\(H_3'(\alpha_{i-1})=h'(\alpha_{i-1})\)和\(H_3'(\alpha_{i})=h'(\alpha_{i})\)。\(H(\alpha)\)的最小值只可能在兩側(cè)的端點(diǎn)或者極值點(diǎn)處,令\(H_3'(\alpha)=0\)可得下一個(gè)步長(zhǎng): \begin{equation} \alpha_{i+1}=\alpha_i-(\alpha_i-\alpha_{i-1})\left[\frac{h'(\alpha_i)+d_2-d_1}{h'(\alpha_i)-h'(\alpha_{i-1})+2d_2}\right] \end{equation} 其中 \begin{equation} d_1=h'(\alpha_i)+h'(\alpha_{i-1})-3\left[\frac{h(\alpha_i)-h(\alpha_{i-1})}{\alpha_i-\alpha_{i-1}}\right] \end{equation} \begin{equation} d_2=sign(\alpha_i-\alpha_{i-1})\sqrt{d_1^2-h'(\alpha_{i-1})h'(\alpha_i)} \end{equation}
下面給出二次插值及三次插值的Python代碼:
1 def quadraticInterpolation(a,h,h0,g0): 2 """ 3 #Functionality:Approximate h(a) with a quadratic function and return its stationary point 4 #@Parameters 5 #a:current stepsize 6 #h:a function value about stepsize,h(a)=f(x_k+a*d) 7 #h:h(0)=f(x_k) 8 #g0:h'(0)=f'(0) 9 """ 10 numerator=g0*a**2 11 denominator=2*(g0*a+h0- h) 12 if abs(denominator)<1e-12: # indicates that a is almost 0 13 return a 14 return numerator/denominator
def cubicInterpolation(a0,h0,a1,h1,h,g): """ #Functionality:Approximate h(x) with a cubic function and return its stationary point #This version of cubic interpolation computes h'(x) as few as possible,suitable for the case in which computing derivative is more expensive than computing function values #@Parameters #a0 and a1 are stepsize it previous two iterations #h0:h(a0) #h1:h(a1) #h:h(0)=f(x) #g:h'(0) """ mat =matlib.matrix([[a0**2,-a1**2],[-a0**3,a1**3 ]]) vec =matlib.matrix([[h1-h-g*a1],[h0-h-g* a0]]) ab =mat*vec/(a0**2*a1**2*(a1- a0)) a = ab[0,0] b =ab[1 ,0] if abs(a)<1e-12: # a=0 and cubic function is a quadratic one return -g/(2* b) return (-b+np.sqrt(b**2-3*a*g))/(3*a)
def cubicInterpolationHermite(a0,h0,g0,a1,h1,g1): """ #Functionality:Approximate h(a) with a cubic Hermite polynomial function and return its stationary point #This version of cubic interpolation computes h(a) as few as possible,suitable for the case in which computing derivative is easier than computing function values #@Parameters #a0 and a1 are stepsize it previous two iterations #h0:h(a0) #g0:h'(a0) #h1:h(a1) #g1:h'(a1) """ d1 =g0+g1-3*(h1-h0)/(a1- a0) d2 =np.sign(a1-a0)*np.sqrt(d1**2-g0* g1) res =a1-(a1-a0)*(g1+d2-d2)/(g1-g0+2* d2) return res
基于Armijo準(zhǔn)則的線性搜索的算法描述如下[4]
對(duì)應(yīng)的Armijo線性搜索的Python代碼如下:
1 def ArmijoLineSearch(fun,dfun,theta,args,d,a0=1,c1=1e-3,a_min=1e-7,max_iter=1e5 ): 2 """ 3 #Functionality:Line search under Armijo condition with quadratic and cubic interpolation 4 #@Parameters 5 #fun:objective Function 6 #dfun:compute the gradient of fun 7 #theta:a vector of parameters of the model 8 #args:other variables needed for fun and func 9 #d:search direction 10 #a0:initial stepsize 11 #c1:constant used in Armijo condition 12 #a_min:minimun of stepsize 13 #max_iter:maximum of the number of iterations 14 """ 15 eps=1e-6 16 c1=min(c1,0.5) # c1 should<=0.5 17 a_pre=h_pre=g_pre= 0 18 a_cur= a0 19 f_val=fun(theta,args) # h(0)=f(0) 20 g_val=np.sum(dfun(theta,args)*d.T) # h'(0)=f'(x)^Td 21 h_cur=g_cur= 0 22 k= 0 23 while a_cur>a_min and k< max_iter: 24 h_cur=fun(theta+a_cur* d,args) 25 g_cur=np.sum(dfun(theta+a_cur*d,args)* d.T) 26 if h_cur<=f_val+c1*a_cur*g_val: # meet Armijo condition 27 return a_cur 28 if not k: # k=0,use quadratic interpolation 29 a_new= quadraticInterpolation(a_cur,h_cur,f_val,g_val) 30 else : # k>0,use cubic Hermite interpolation 31 a_new= cubicInterpolationHermite(a_pre,h_pre,g_pre,a_cur,h_cur,g_cur) 32 if abs(a_new-a_cur)<eps or abs(a_new)<eps: # safeguard procedure 33 a_new=a_cur/2 34 a_pre= a_cur 35 a_cur= a_new 36 h_pre= h_cur 37 g_pre= g_cur 38 k+=1 39 return a_min # failed search
Wolfe Search
前面說(shuō)到單憑Armijo準(zhǔn)則(不考慮回溯策略)選出的步長(zhǎng)可能太小,為了排除這些微小的步長(zhǎng),我們加上曲率的約束條件(如圖5所示) \begin{equation} h'(\alpha)=f'(x_k+\alpha d_k)^Td_k\geq c_2f'(x_k)^Td_k \end{equation}
其中\(zhòng)(c_2\in(c_1,1)\),\(c_1\)為Armijo準(zhǔn)則中的常量。
當(dāng)\(h'(\alpha)\)為很小的負(fù)數(shù)甚至為正數(shù)時(shí),說(shuō)明從\(x_k\)沿著\(d_k\)移動(dòng)\(\alpha\)后的函數(shù)梯度方向與搜索方向的夾角接近90度,繼續(xù)向前移動(dòng)已經(jīng)不能很明顯減小函數(shù)值了,此時(shí)可以停止沿著\(d_k\)繼續(xù)搜索;反之,說(shuō)明繼續(xù)減小函數(shù)值的空間還是很大的,可以繼續(xù)向前搜索。Armijo準(zhǔn)則與曲率約束兩者合起來(lái)稱為Wolfe準(zhǔn)則[5]:
\begin{equation} \left\lbrace \begin{array}{rl} f(x_k+\alpha d_k)&\leq f(x_k)+c_1\alpha f'(x_k)^Td_k\\ f'(x_k+\alpha d_k)^Td_k&\geq c_2f'(x_k)^Td_k \end{array} \right. \end{equation} 其中\(zhòng)(0<c_1<c_2<1\)。如圖6所示,滿足Wolfe準(zhǔn)則的步長(zhǎng)也許離\(h(\alpha)\)的極值點(diǎn)較遠(yuǎn)。我們可以修改曲率約束條件使得步長(zhǎng)落到\(h(\alpha)\)的極值點(diǎn)的一個(gè)較寬的領(lǐng)域中,強(qiáng)Wolfe準(zhǔn)則對(duì)步長(zhǎng)\(\alpha\)的約束如下: \begin{equation} \left\lbrace \begin{array}{rl} f(x_k+\alpha d_k)&\leq f(x_k)+c_1\alpha f'(x_k)^Td_k\\ |f'(x_k+\alpha d_k)^Td_k|&\geq c_2|f'(x_k)^Td_k| \end{array} \right. \end{equation} 強(qiáng)Wolfe準(zhǔn)則不允許\(h'(\alpha)\)為太大的正數(shù),可以排除遠(yuǎn)離極值點(diǎn)的區(qū)間。
那么到底是否存在滿足強(qiáng)Wolfe準(zhǔn)則的步長(zhǎng)呢?假設(shè)\(h(\alpha)=f(x_k+\alpha d_k)\)連續(xù)可微,在整個(gè)\(\alpha>0\)的定義域上存在下界。因?yàn)閈(0<c_1<1\),所以\(l(\alpha)=f(x_k)+\alpha c_1f'(x_k)^Td_k\)必然與\(h(\alpha)\)至少有一個(gè)交點(diǎn)。假設(shè)\(\alpha'\)為最小的交點(diǎn)對(duì)應(yīng)的步長(zhǎng),則有 \begin{equation} f(x_k+\alpha'd_k)=f(x_k)+\alpha'c_1f'(x_k)^Td_k \end{equation} 那么對(duì)于滿足\(\alpha\in(0,\alpha')\)的步長(zhǎng)必然都滿足Armijo準(zhǔn)則。根據(jù)零值定理,存在\(\alpha''\in(0,\alpha')\)滿足 \begin{equation} f(x_k+\alpha'd_k)-f(x_k)=\alpha'f'(x_k+\alpha''d_k)^Td_k \end{equation} 結(jié)合上面兩個(gè)關(guān)系式,由\(0<c_1<c_2\)和\(f'(x_k)^Td_k<0\),可得 \begin{equation} f'(x_k+\alpha''d_k)^Td_k=c_1f'(x_k)^Td_k>c_2f'(x_k)^Td_k \end{equation} 由此可知,\(\alpha''\)滿足強(qiáng)Wolfe準(zhǔn)則。如果\(h(\alpha)\)是一個(gè)較為平滑的函數(shù),那么包含\(\alpha''\)的較小領(lǐng)域都會(huì)滿足強(qiáng)Wolfe準(zhǔn)則。 如果在線性搜索過(guò)程中利用強(qiáng)Wolfe準(zhǔn)則,可以更精確得找到更靠近極值點(diǎn)的步長(zhǎng),在目前線性搜索中用得很多。基于強(qiáng)Wolfe準(zhǔn)則的線性搜索包含兩個(gè)階段:第一個(gè)階段從初始步長(zhǎng)開(kāi)始,不斷增大步長(zhǎng),直到找到一個(gè)滿足強(qiáng)Wolfe準(zhǔn)則的步長(zhǎng)或包含該步長(zhǎng)的區(qū)間為止;第二個(gè)階段是在已知包含滿足強(qiáng)Wolfe準(zhǔn)則步長(zhǎng)區(qū)間的基礎(chǔ)上,不斷縮減區(qū)間,直到找到滿足強(qiáng)Wolfe準(zhǔn)則的步長(zhǎng)為止。基于強(qiáng)Wolfe準(zhǔn)則的線性搜索算法描述如下5
在算法5中,\(\alpha_{new}\)的更新是因?yàn)樵趨^(qū)間\((\alpha_{pre},\alpha_{cur})\)內(nèi)沒(méi)有滿足Wolfe準(zhǔn)則的步長(zhǎng),所以要選取一個(gè)大于\(\alpha_{cur}\)的步長(zhǎng)\(\alpha_{new}\)。在算法中,我們是用二次插值函數(shù)計(jì)算\(\alpha_{new}\)的,所以要求\(0<c_1<0.5\)。當(dāng)然,也可以用其他方法,比如讓\(\alpha_{cur}\)乘以一個(gè)大于1的常數(shù),只要能較快找到一個(gè)包含滿足Wolfe的區(qū)間即可。所以,該算法每次嘗試的步長(zhǎng)\(\alpha_{cur}\)的逐漸遞增的;一旦找到了包含滿足Wolfe準(zhǔn)則的步長(zhǎng)的區(qū)間,立即調(diào)用\(zoom\)函數(shù)不斷縮短區(qū)間,并返回滿足Wolfe的步長(zhǎng)。根據(jù)算法邏輯,我們可以推斷出\(\alpha_{pre}\)滿足Armijo準(zhǔn)則,但違背曲率約束,而且導(dǎo)數(shù)為負(fù)數(shù)。由上述三個(gè)條件,可知\(\alpha_{pre}\)必定位于滿足Wolfe準(zhǔn)則的區(qū)間的左側(cè)的呈下降趨勢(shì)的曲線上,只要\(\alpha_{cur}\)位于該區(qū)間的右側(cè)即可。那么怎樣判斷區(qū)間\((\alpha_{pre},\alpha_{cur})\)是否包含滿足Wolfe準(zhǔn)則的步長(zhǎng)呢?下面給出三種\(\alpha_{cur}\)位于該區(qū)間的右側(cè)的充分條件:
- \(\alpha_{cur}\)不滿足Armijo準(zhǔn)則;
- \(h(\alpha_{cur})\geq h(\alpha_{pre})\);
- \(h'(\alpha_{cur})\geq 0\)
這一點(diǎn)結(jié)合圖7就很容易理解了,我在圖中分別用紅色和綠色點(diǎn)標(biāo)注了\(\alpha_{pre}\)和\(\alpha_{cur}\)可能的位置,藍(lán)色帶數(shù)字的圓圈注明了\(\alpha_{cur}\)滿足哪些條件。
基于Wolfe準(zhǔn)則的線性搜索Python代碼如下:
1 def WolfeLineSearch(fun,dfun,theta,args,d,a0=1,c1=1e-4,c2=0.9,a_min=1e-7,max_iter=1e5 ): 2 """ 3 #Functionality:find a stepsize meeting Wolfe condition 4 #@Parameters 5 #fun:objective Function 6 #dfun:compute the gradient of fun 7 #theta:a vector of parameters of the model 8 #args:other variables needed for fun and func 9 #d:search direction 10 #a0:intial stepsize 11 #c1:constant used in Armijo condition 12 #c2:constant used in curvature condition 13 #a_min:minimun of stepsize 14 #max_iter:maximum of the number of iterations 15 """ 16 eps=1e-16 17 c1=min(c1,0.5 ) 18 a_pre= 0 19 a_cur= a0 20 f_val=fun(theta,args) # h(0)=f(x) 21 g_val=np.sum(dfun(theta,args)* d.T) 22 h_pre=f_val # h'(0)=f'(x)^Td 23 k= 0 24 while k<max_iter and abs(a_cur-a_pre)>= eps: 25 h_cur=fun(theta+a_cur*d,args) # f(x+ad) 26 if h_cur>f_val+c1*a_cur*g_val or h_cur>=h_pre and k> 0: 27 return zoom(fun,dfun,theta,args,d,a_pre,a_cur,c1,c2) 28 g_cur=np.sum(dfun(theta+a_cur*d,args)* d.T) 29 if abs(g_cur)<=-c2*g_val: # satisfy Wolfe condition 30 return a_cur 31 if g_cur>= 0: 32 return zoom(fun,dfun,theta,args,d,a_pre,a_cur,c1,c2) 33 a_new= quadraticInterpolation(a_cur,h_cur,f_val,g_val) 34 a_pre= a_cur 35 a_cur= a_new 36 h_pre= h_cur 37 k+=1 38 return a_min
zoom函數(shù)的算法描述見(jiàn)6。zoom函數(shù)中需要傳入搜尋區(qū)間\([\alpha_{low},\alpha_{high}]\),其中\(zhòng)(\alpha_{low}<\alpha_{high}\)。本文中的zoom函數(shù)與文獻(xiàn)[5]中的內(nèi)容略有差異,但是本文的zoom函數(shù)思路更簡(jiǎn)單和清晰。由算法5中分析得到的調(diào)用zoom函數(shù)的條件,知道\(\alpha_{low}\)必須滿足Armijo準(zhǔn)則,且位于所有滿足Wolfe準(zhǔn)則的步長(zhǎng)的左側(cè)。我們先取\([\alpha_{low},\alpha_{high}]\)區(qū)間的中值作為下一個(gè)測(cè)試的步長(zhǎng)\(\alpha_{new}\),如果恰好滿足Wolfe準(zhǔn)則,則直接返回。如果\(\alpha_{new}\)違反Armijo準(zhǔn)則或大于\(h(\alpha_{low})\),顯然區(qū)間\([\alpha_{low},\alpha_{new}]\)包含滿足Wolfe準(zhǔn)則的步長(zhǎng),因此用\(\alpha_{new}\)替換\(\alpha_{high}\)以縮短區(qū)間長(zhǎng)度;否則,\(\alpha_{new}\)必然也滿足Armijo準(zhǔn)則,如果\(h'(\alpha_{new})>0\),則\(\alpha_{new}\)與\(\alpha_{high}\)都在滿足Wolfe準(zhǔn)則的區(qū)間右側(cè),用\(\alpha_{new}\)替代\(\alpha_{high}\),反之則用\(\alpha_{new}\)替代\(\alpha_{low}\)。上述的迭代過(guò)程不斷縮短步長(zhǎng),知道求得滿足Wolfe準(zhǔn)則的步長(zhǎng)為止。如果在有限迭代次數(shù)內(nèi)搜索失敗,則返回必然滿足Armijo準(zhǔn)則的步長(zhǎng)\(\alpha_{low}\)。
zoom函數(shù)對(duì)應(yīng)的Python代碼如下:
1 def zoom(fun,dfun,theta,args,d,a_low,a_high,c1=1e-3,c2=0.9,max_iter=1e4 ): 2 """ 3 #Functionality:enlarge the interval to find a stepsize meeting Wolfe condition 4 #@Parameters 5 #fun:objective Function 6 #dfun:compute the gradient of fun 7 #theta:a vector of parameters of the model 8 #args:other variables needed for fun and func 9 #d:search direction 10 #[a_low,a_high]:interval containing a stepsize satisfying Wolfe condition 11 #c1:constant used in Armijo condition 12 #c2:constant used in curvature condition 13 #max_iter:maximum of the number of iterations 14 """ 15 if a_low> a_high: 16 print ( ' low:%f,high:%f ' % (a_low,a_high)) 17 raise Exception( ' Invalid interval of stepsize in zoom procedure ' ) 18 eps=1e-16 19 h=fun(theta,args) # h(0)=f(x) 20 g=np.sum(dfun(theta,args)*d.T) # h'(0)=f'(x)^Td 21 k= 0 22 h_low=fun(theta+a_low* d,args) 23 h_high=fun(theta+a_high* d,args) 24 if h_low>h+c1*a_low* g: 25 raise Exception( ' Left endpoint violates Armijo condition in zoom procedure ' ) 26 while k<max_iter and abs(a_high-a_low)>= eps: 27 a_new=(a_low+a_high)/2 28 h_new=fun(theta+a_new* d,args) 29 if h_new>h+c1*a_new*g or h_new> h_low: 30 a_high= a_new 31 h_high= h_new 32 else : 33 g_new=np.sum(dfun(theta+a_new*d,args)* d.T) 34 if abs(g_new)<=-c2*g: # satisfy Wolfe condition 35 return a_new 36 if g_new*(a_high-a_low)>= 0: 37 a_high= a_new 38 h_high= h_new 39 else : 40 a_low= a_new 41 h_low= h_new 42 k+=1 43 return a_low # a_low definitely satisfy Armijo condition
Newton's Method
牛頓法(Newton's method)[8]以迭代方式求解函數(shù)的根,其基本思想是從一個(gè)初始點(diǎn)出發(fā),不斷在當(dāng)前點(diǎn)\(x_k\)處用切線近似函數(shù)\(f(x)\),并求得該切線與\(x\)軸的交點(diǎn)作為下一次的迭代初始點(diǎn)\(x_{k+1}\),直到找到\(f(x)=0\)的近似解為止。Newton法可用于二次可微函數(shù)\(f(x)\)的最優(yōu)化問(wèn)題。 在\(x_k\)處用二階泰勒展開(kāi)來(lái)對(duì)\(f(x_k)\)其進(jìn)行逼近。 \begin{equation} f(x_{k}+\triangle x)\approx f(x_k)+f'(x_k)\triangle x+\frac{1}{2}{\triangle x}^TB_k\triangle x \end{equation} 現(xiàn)在,我們的目標(biāo)是在\(x^{k}\)附近求得使\(f(x)\)取得極小值的\(\triangle x\)。將上式對(duì)\(\triangle x\)求導(dǎo)可得函數(shù)\(f'(x)\)在\(x_{k+1}=x_k+\triangle x\)處的線性近似如下: \begin{equation} f'(x_{k+1})=f'(x_k)+B_k(x_{k+1}-x_k) \end{equation} 其中\(zhòng)(B_k=\nabla^2f(x_k)\)為\(f(x)\)在\(x_k\)處對(duì)應(yīng)的Hessian矩陣。由于函數(shù)的極值點(diǎn)一般都對(duì)應(yīng)\(f'(x)=0\),令\(f'(x_{k+1})=0\)并化簡(jiǎn)可得迭代公式為: \begin{equation} x_{k+1}=x_k-B_k^{-1}f'(x_k) \end{equation} 牛頓迭代法收斂速度很快,對(duì)于二次函數(shù)可以一次性找到最優(yōu)解。但用于求解優(yōu)化問(wèn)題時(shí),需要付出很大的代價(jià)求得函數(shù)的一階導(dǎo)數(shù)、二階導(dǎo)數(shù)及其逆矩陣。此外,有的函數(shù)還存在不可導(dǎo)、Hessian矩陣不可逆、迭代點(diǎn)之間存在循環(huán)(即\(x_{k+t}=x_k\))等情形,這些都成為了牛頓法的應(yīng)用障礙。牛頓迭代法用于求解極值點(diǎn)\(f'(x)=0\)的步驟見(jiàn)算法7。當(dāng)然,也可用牛頓法求最優(yōu)步長(zhǎng),只需將算法7中的函數(shù)\(f(x)\)替換為關(guān)于步長(zhǎng)的函數(shù)\(h(\alpha)\)即可。
Quasi-Newton Method
擬牛頓(Quasi-Newton)[11]算法可用于求解函數(shù)的局部最優(yōu)解,也就是那些導(dǎo)數(shù)為0的駐點(diǎn)。牛頓法用于解決優(yōu)化問(wèn)題時(shí),事先假設(shè)原函數(shù)可用二次函數(shù)近似,然后用一階和二階導(dǎo)數(shù)尋找局部最優(yōu)解。而在擬牛頓算法中,不需要準(zhǔn)確計(jì)算Hessian矩陣,取而代之的是運(yùn)用下面的擬牛頓條件分析連續(xù)兩個(gè)梯度向量得到的近似值矩陣\(B_k\): \begin{equation} f'(x_{k+1})-f'(x_k)\approx B_{k+1}(x_{k+1}-x_k) \end{equation} 擬牛頓算法的算法流程見(jiàn)8,其基本思想是利用矩陣\(B_k\)計(jì)算牛頓方向的近似值\(d_k\)。各種擬牛頓算法的主要差異在于近似Hessian矩陣的更新策略,表1列出了部分主流的擬牛頓算法的迭代更新規(guī)則,其中\(zhòng)(s_k=x_{k+1}-x_k=-\alpha_kB_k^{-1}f'(x_k)\),\(y_k=f'(x_{k+1})-f'(x_k)\)。
擬牛頓算法中最常用的是BFGS,其針對(duì)有限內(nèi)存的機(jī)器的算法變種L-BFGS[4]在機(jī)器學(xué)習(xí)領(lǐng)域又備受青睞。BFGS需要存儲(chǔ)\(n\times n\)的矩陣\(H_k\)用于近似Hessian矩陣的逆矩陣;而L-BFGS僅需要存儲(chǔ)過(guò)去\(m\)(\(m\)一般小于10)對(duì)\(n\)維的更新數(shù)據(jù)\((x,f'(x_i))\)即可。L-BFGS的空間復(fù)雜度為線性,特別適用于變量非常多的優(yōu)化問(wèn)題。BFGS的算法描述很容易寫出來(lái),如下:
1 def BFGS(fun,dfun,theta,args,H=None,mode=0,eps=1e-12,max_iter=1e4 ): 2 """ 3 #Functionality:find the minimum of objective function f(x) 4 #@Parameters 5 #fun:objective function f(x) 6 #dfun:compute the gradient of f(x) 7 #args:parameters needed by fun and dfun 8 #theta:start vector of parameters of the model 9 #H:initial inverse Hessian approximation 10 #mode:index of line search algorithm 11 """ 12 x_pre=x_cur= theta 13 g= dfun(x_cur,args) 14 I= matlib.eye(theta.size) 15 if not H: # initialize H as an identity matrix 16 H= I 17 k= 0 18 while k<max_iter and np.sum(np.abs(g))> eps: 19 d=-g* H 20 step=LineSearch(fun,dfun,x_pre,args,d,1 ,mode) 21 x_cur=x_pre+step* d 22 s=step* d 23 y=dfun(x_cur,args)- dfun(x_pre,args) 24 ys=np.sum(y* s.T) 25 if abs(ys)< eps: 26 return x_cur 27 change=(ys+np.sum(y*H*y.T))*(s.T*s)/(ys**2)-(H*y.T*s+s.T*y*H)/ ys 28 H+= change 29 g= dfun(x_cur,args) 30 x_pre= x_cur 31 k+=1 32 return x_cur
下面我們分析如何構(gòu)造下L-BFGS的算法[10,13]。假設(shè)我們現(xiàn)在處于優(yōu)化過(guò)程的第\(k(k\geq m)\)次迭代,參數(shù)為\(x_k\),梯度\(g_k=f'(x_k)\),已經(jīng)保存的\(m\)條更新數(shù)據(jù)為\(s_k=x_{k+1}-x_k\)及\(y_k=g_{k+1}-g_k\)。我們最終需要計(jì)算的是搜索方向\(d_k=-H_kg_k\),于是令\(V_k=(I-\rho_ky_ks_k^T)\),\(\rho_k=1/(y_k^Ts_k)\),將表1中BFGS的\(H_{k}\)的更新規(guī)則展開(kāi),我們可以得到下式: \begin{equation} \begin{array}{rl} &H_{k}g_k\\ =&V_{k-1}^TH_{k-1}V_{k-1}g_k+s_{k-1}\rho_{k-1}s_{k-1}^Tg_k\\ =&V_{k-1}^TV_{k-2}^TH_{k-2}V_{k-2}V_{k-1}g_k+V_{k-1}^Ts_{k-2}\rho_{k-2}s_{k-2}^TV_{k-1}g_k+s_{k-1}\rho_{k-1}s_{k-1}^Tg_k\\ =&(V_{k-1}^TV_{k-2}^T\cdots V_{k-m}^T)H_{k-m}(V_{k-m}\cdots V_{k-2}V_{k-1})g_k\\ &+(V_{k-1}^TV_{k-2}^T\cdots V_{k-m+1}^T)s_{k-m}\rho_{k-m}s_{k-m}^T(V_{k-m+1}\cdots V_{k-1}V_k)g_k\\ &+(V_{k-1}^TV_{k-2}^T\cdots V_{k-m+2}^T)s_{k-m+1}\rho_{k-m+1}s_{k-m+1}^T(V_{k-m+2}\cdots V_{k-2}V_{k-1})g_k\\ &+ \cdots\\ &+V_{k-1}^Ts_{k-2}\rho_{k-2}s_{k-2}^TV_{k-1}g_k\\ &+ s_{k-1}\rho_{k-1}s_{k-1}^Tg_k \end{array} \end{equation} 上式非常有規(guī)律,這就為迭代求解奠定了很好的基礎(chǔ)。我們令\(q_0=g_k\),則當(dāng)\(1\leq i\leq m\)時(shí)有 \begin{equation} q_i=(V_{k-i}\cdots V_{k-2}V_{k-1})g_k \end{equation} \begin{equation} a_i=\rho_{k-i}s_{k-i}^Tq_{i-1} \end{equation} 那么可以得到如下的迭代規(guī)則: \begin{equation} \begin{array}{rl} q_i&=V_{k-i+1}q_{i-1}\\ &=q_{i-1}-\rho_{k-i+1}y_{k-i+1}s_{k-i+1}^Tq_{i-1}\\ &=q_{i-1}-a_{i-1}y_{k-i+1} \end{array} \end{equation} 到目前為止,我們已經(jīng)可以求解出\(H_{k}g_k\)所有項(xiàng)的右半部分,那左半部分如何處理?在這里采用不斷提前最左端的公因式的方法完成迭代過(guò)程: \begin{equation} H_{k}g_k=P_1=V_{k-1}^TP_{2}+s_{k-1}a_1 \end{equation} \begin{equation} P_{2}=V_{k-2}^TP_{3}+s_{k-2}a_2 \end{equation} 重復(fù)該過(guò)程,很快就可以發(fā)現(xiàn)規(guī)律: \begin{equation} \begin{array}{rl} P_{i}&=V_{k-i}^TP_{i+1}+s_{k-i}a_i\\ &=P_{i+1}+s_{k-i}(a_i-\rho_{k-i}y_{k-i}^TP_{i+1}) \end{array} \end{equation} 其中\(zhòng)(P_{m+1}=H_{k-m}q_m\)。 根據(jù)上述分析,我們可以得到L-BFGS的求解搜索方向的算法9。根據(jù)算法9的整個(gè)流程,可知通過(guò)兩個(gè)循環(huán)\(m\)次的迭代運(yùn)算即可出計(jì)算當(dāng)前的搜索方向,需要存儲(chǔ)歷史數(shù)據(jù)\(\{s_{k-i},y_{k-i}|i=1,\cdots,m\}\)和臨時(shí)數(shù)據(jù)\(\{a_{k-i}|i=1,\cdots,m\}\),所以算法的時(shí)間和空間復(fù)雜度均為\(O(mn)\)。如果目前處于迭代的初期,已有的歷史數(shù)據(jù)少于\(m\),那么就用這些已有的數(shù)據(jù),在后續(xù)迭代過(guò)程中不斷新增歷史數(shù)據(jù)即可;若干當(dāng)前的迭代次數(shù)不小于\(m\),那么在每次計(jì)算出搜索方向后,即可用\(s_k\)和\(y_k\)替換\(s_{k-m}\)和\(y_{k-m}\)組成新的\(m\)對(duì)歷史更新數(shù)據(jù)。
在算法9中,需要給出矩陣\(H_{k-m}\)。在第一次迭代時(shí),\(H_{k-m}\)被初始化為單位陣,在隨后的迭代過(guò)程中\(zhòng)(H_{k-m}=\gamma_kI\),其中 \begin{equation} \gamma_k=\frac{y_{k-1}^Ts_{k-1}}{y_{k-1}^Ty_{k-1}} \end{equation} 另外,在內(nèi)存受限的系統(tǒng)中存儲(chǔ)\(n\times n\)不是很現(xiàn)實(shí)的想法。用上述的方法,我們僅需存儲(chǔ)一個(gè)標(biāo)量\(\gamma_k\)即可,這是一個(gè)簡(jiǎn)單卻又高效的做法[13]。 最后,附上L-BFGS的Python版本代碼:
1 def LBFGS(fun,dfun,theta,args,mode=0,eps=1e-12,max_iter=1e4 ): 2 """ 3 #Functionality:find the minimum of objective function f(x) with LBFGS 4 #@Parameters 5 #fun:objective function f(x) 6 #dfun:compute the gradient of f(x) 7 #args:parameters needed by fun and dfun 8 #theta:start vector of parameters of the model 9 #H:initial inverse Hessian approximation 10 #mode:index of line search algorithm 11 """ 12 x_pre=x_cur= theta 13 s_arr= [] 14 y_arr= [] 15 Hscale=1 16 k= 0 17 while k< max_iter: 18 g= dfun(x_cur,args) 19 d=LBFGSSearchDirection(y_arr,s_arr,Hscale,- g) 20 step=LineSearch(fun,dfun,x_pre,args,d,1 ,mode) 21 s=step* d 22 x_cur=x_pre+ s 23 y=dfun(x_cur,args)- dfun(x_pre,args) 24 ys=np.sum(y* s.T) 25 if np.sum(np.abs(s))< eps: 26 return x_cur 27 x_pre= x_cur 28 k+=1 29 y_arr,s_arr,Hscale= LBFGSUpdate(y,s,y_arr,s_arr) 30 return x_cur 31 32 33 def LBFGSSearchDirection(y_arr,s_arr,Hscale,g): 34 """ 35 #Functionality:estimate search direction using with LBFGS 36 #@Parameters 37 #y_arr:m*dim matrix,where y_arr[i,:]=f'(x_{i+1})-f'(x_i) 38 #s_arr:m*dim matrix,where s_arr[i,:]=x_{k+1}-x_k 39 #Hscale:a scale to initilize the inverse of Hessian matrix 40 #g:a row vector representing -f'(x_{k}) 41 """ 42 histNum=len(s_arr) # number of update data stored 43 if not histNum: 44 return g 45 dim= s_arr[0].size 46 a_arr=[0 for i in range(histNum)] 47 rho=[0 for i in range(histNum)] 48 q= g 49 for i in range(1,histNum+1 ): 50 s=s_arr[histNum- i] 51 y=y_arr[histNum- i] 52 rho[histNum-i]=1/np.sum(s* y.T) 53 a_arr[i-1]=rho[histNum-i]*np.sum(s* q.T) 54 q-=(a_arr[i-1]* y) 55 P=Hscale* q 56 for i in range(histNum,0,-1 ): 57 y=y_arr[histNum- i] 58 s=s_arr[histNum- i] 59 beta=rho[histNum-i]*np.sum(y* P.T) 60 P+=s*(a_arr[i-1]- beta) 61 return P 62 63 64 def LBFGSUpdate(y,s,oldy,olds,m=1e2 ): 65 """ 66 #Functionality:refresh the historical update data 67 #@Parameters 68 #y:f'(x_{k+1})-f'(x_k) 69 #s:x_{k+1}-x_k 70 #oldy:[y0,y1,...],which is a list 71 #olds:[s0,s1,...],which is a list 72 #m:number of historical data to store(default:100) 73 """ 74 eps=1e-12 75 Hscale=np.sum(y*s.T/y*y.T) # a scale to initialize H_{k-m} 76 if Hscale<eps: # skip update 77 return oldy,olds,Hscale 78 79 cur_m= len(oldy) 80 if cur_m>= m: 81 oldy.pop(0) 82 olds.pop(0) 83 oldy.append(copy.deepcopy(y)) 84 olds.append(copy.deepcopy(s)) 85 return oldy,olds,Hscale?
References
[1] Backtracking line search. http://en.wikipedia.org/wiki/Backtracking_line_search .
[2] Bisection method. http://en.wikipedia.org/wiki/Bisection_method .
[3] Gradient descent. http://en.wikipedia.org/wiki/Gradient_descent .
[4] Limited-memory bfgs. http://en.wikipedia.org/wiki/Limited-memory_BFGS .
[5] Line search methods. http://pages.cs.wisc.edu/~ferris/cs730/chap3.pdf .
[6] Line search methods:step length selection. http://terminus.sdsu.edu/SDSU/Math693a_f2013/Lectures/06/lecture.pdf .
[7] Math 408a line search methods. https://www.math.washington.edu/~burke/crs/408/lectures/L7-line-search.pdf .
[8] Newton’s method. http://en.wikipedia.org/wiki/Newton%27s_method .
[9] Nonlinear programming algorithms. http://www.math.bme.hu/~bog/GlobOpt/Chapter5.pdf .
[10] Oerview of quasi-newton optimization methods. https://homes.cs.washington.edu/~galen/files/quasi-newton-notes.pdf .
[11] Quasi-newton method. http://en.wikipedia.org/wiki/Quasi-Newton_method .
[12] Unconstrained minimization. http://www.ing.unitn.it/~bertolaz/2-teaching/2011-2012/AA-2011-2012-OPTIM/lezioni/slides-mND.pdf .
[13] Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528,1989.
更多文章、技術(shù)交流、商務(wù)合作、聯(lián)系博主
微信掃碼或搜索:z360901061

微信掃一掃加我為好友
QQ號(hào)聯(lián)系: 360901061
您的支持是博主寫作最大的動(dòng)力,如果您喜歡我的文章,感覺(jué)我的文章對(duì)您有幫助,請(qǐng)用微信掃描下面二維碼支持博主2元、5元、10元、20元等您想捐的金額吧,狠狠點(diǎn)擊下面給點(diǎn)支持吧,站長(zhǎng)非常感激您!手機(jī)微信長(zhǎng)按不能支付解決辦法:請(qǐng)將微信支付二維碼保存到相冊(cè),切換到微信,然后點(diǎn)擊微信右上角掃一掃功能,選擇支付二維碼完成支付。
【本文對(duì)您有幫助就好】元
