Off-Policy Monte Carlo and Importance Sampling
Preface:

The basic Monte Carlo learning loop:

Policy Evaluation: generate state-action trajectories and use them to estimate the value function.
Policy Improvement: use the estimated value function to improve the policy.
On-policy: the policy that generates the sampled trajectories and the policy being improved are the same.

    Policy Evaluation: an ε-greedy policy generates the (state, action, reward) trajectories.
    Policy Improvement: the policy being improved is that same ε-greedy policy; the value-function estimates are used to update the ε-greedy policy.

Off-policy: the policy that generates the sampled trajectories and the policy being improved are different.

    Policy Evaluation: an ε-greedy behavior policy generates the sampled (state, action, reward) trajectories.
    Policy Improvement: the original (target) policy is improved.
Two advantages of off-policy learning:

1: The original (target) policy may be hard to sample from directly.
2: It can reduce the variance of the estimate.

A minimal sketch of such a target/behavior policy pair is given below.
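The following sketch is not from the original post; the names target_policy, epsilon_greedy, and n_actions are assumed for illustration. It pairs a deterministic target policy (a lookup table) with the ε-greedy behavior policy derived from it, which is the setting used in the rest of this article.

import numpy as np

def epsilon_greedy(target_policy, state, n_actions, epsilon=0.2):
    # Behavior policy: with probability epsilon pick a uniformly random action,
    # otherwise follow the deterministic target policy. Every action therefore
    # keeps a non-zero probability of being sampled.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return target_policy[state]

# Hypothetical usage: 3 states, 2 actions, target policy as a lookup table.
target_policy = {0: 1, 1: 0, 2: 1}
action = epsilon_greedy(target_policy, state=0, n_actions=2)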
The most common technique for off-policy learning is importance sampling (IS).
Importance sampling is a Monte Carlo method for evaluating properties of a particular distribution, while only having samples generated from a different distribution than the distribution of interest. Its introduction in statistics is generally attributed to a paper by Teun Kloek and Herman K. van Dijk in 1978,[1] but its precursors can be found in statistical physics as early as 1949.[2][3] Importance sampling is also related to umbrella sampling in computational physics. Depending on the application, the term may refer to the process of sampling from this alternative distribution, the process of inference, or both.
一 Importance Sampling

1.1 Principle
The original problem: estimate the expectation of f(x) under a distribution p(x):

$$E_{x\sim p}[f(x)] = \int f(x)\,p(x)\,dx$$

Sampling N points x_i from p(x) gives the Monte Carlo estimate

$$E_{x\sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i), \quad x_i \sim p(x)$$

Problem: p(x) is hard to sample from (the sample space is large, and often only a small part of it can be reached).

Introduce an importance distribution q(x) (also a proper distribution, but easy to sample from), and let

$$w(x) = \frac{p(x)}{q(x)}$$

be the importance weight. Then

$$E_{x\sim p}[f(x)] = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)\,w(x_i), \quad x_i \sim q(x)$$

(by the law of large numbers).
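As a quick sanity check (a sketch, not part of the original post), the lines below estimate E_p[x^2] for a standard normal p by sampling from a uniform q on [-10, 10] and weighting by p/q; the true value is 1.

import numpy as np
from scipy.stats import norm

N = 100_000
xs = np.random.uniform(-10, 10, N)   # samples from q(x) = Uniform(-10, 10)
w = norm.pdf(xs) / (1.0 / 20.0)      # importance weights p(x) / q(x)
print(np.mean(xs**2 * w))            # estimate of E_p[x^2]; should be close to 1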
In the example below we normalize the importance weights so that the relative contribution of each sample is easier to see. The code performs the normalization in log space:

$$\log w_i = \log p(x_i) - \log q(x_i)$$

$$\log \tilde{w}_i = \log w_i - \log\sum_{j=1}^{N} e^{\log w_j} \quad (\text{logsumexp})$$

$$\tilde{w}_i = \frac{w_i}{\sum_{j=1}^{N} w_j}, \qquad \sum_{i=1}^{N}\tilde{w}_i = 1$$
# -*- coding: utf-8 -*-
"""
Created on Wed Nov 8 16:38:34 2023

@author: chengxf2
"""
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import logsumexp


class pdf:
    def __call__(self, x):
        pass

    def sample(self, n):
        pass


class Norm(pdf):
    # target distribution p(x): Gaussian density
    def __init__(self, mu=0, sigma=1):
        self.mu = mu
        self.sigma = sigma

    def __call__(self, x):
        # unnormalized log p(x); the constant term is dropped
        return -(x - self.mu) ** 2 / (2 * self.sigma ** 2)

    def sample(self, N):
        # draw N points from the Gaussian
        return np.random.normal(self.mu, self.sigma, N)


class Uniform(pdf):
    # proposal distribution q(x): uniform density on [low, high]
    def __init__(self, low, high):
        self.low = low
        self.high = high

    def __call__(self, x):
        # log q(x) = -log(high - low), repeated for every sample
        N = len(x)
        return np.repeat(-np.log(self.high - self.low), N)

    def sample(self, N):
        # draw N points from the uniform distribution
        return np.random.uniform(self.low, self.high, N)


class ImportanceSampler:
    def __init__(self, p_dist, q_dist):
        self.p_dist = p_dist
        self.q_dist = q_dist

    def sample(self, N):
        # sample from q(x) and compute normalized log-weights
        samples = self.q_dist.sample(N)
        weights = self.calc_weights(samples)
        normal_weights = weights - logsumexp(weights)
        return samples, normal_weights

    def calc_weights(self, samples):
        # log w = log(p/q) = log p - log q
        return self.p_dist(samples) - self.q_dist(samples)


if __name__ == "__main__":
    N = 10000
    p = Norm()
    q = Uniform(-10, 10)
    sampler = ImportanceSampler(p, q)

    # samples are drawn from q(x); weight_sample holds their normalized log-weights
    samples, weight_sample = sampler.sample(N)
    # resample N points from samples with probabilities exp(weight_sample)
    samples = np.random.choice(samples, N, p=np.exp(weight_sample))
    plt.hist(samples, bins=100)
    plt.show()
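If the resample-by-weight step works as intended, the histogram of the resampled points should be close to the standard normal density of p, even though every point was originally drawn from Uniform(-10, 10); subtracting logsumexp guarantees the exponentiated weights sum to 1, so they can be used directly as probabilities in np.random.choice.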
二 Off-Policy Principle
target policy: the original policy π to be improved.

$$\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \dots, s_{T-1}, a_{T-1}, r_T)$$

denotes a trajectory obtained by following the original policy.

p(τ): the probability of that trajectory under π.
f(τ): the cumulative reward of that trajectory.

Expected cumulative reward:

$$E_{\tau\sim p}[f(\tau)] = \sum_{\tau} p(\tau)\,f(\tau) \approx \frac{1}{N}\sum_{i=1}^{N} f(\tau_i), \quad \tau_i \sim p(\tau)$$

behavior policy: the policy π' that actually generates the samples.
q(τ): the probability of sampling each trajectory under π'.

The expected cumulative reward f under p can then be rewritten equivalently as

$$E_{\tau\sim p}[f(\tau)] = \sum_{\tau} q(\tau)\,\frac{p(\tau)}{q(\tau)}\,f(\tau) \approx \frac{1}{N}\sum_{i=1}^{N} \frac{p(\tau_i)}{q(\tau_i)}\,f(\tau_i), \quad \tau_i \sim q(\tau)$$

p(τ_i) and q(τ_i) are the probabilities that the two policies generate the i-th trajectory. For a given trajectory (s_0, a_0, s_1, a_1, ..., s_T):

the probability that the original policy π generates it is

$$p(\tau) = \prod_{t=0}^{T-1} \pi(s_t, a_t)\, P(s_{t+1}\mid s_t, a_t)$$

the probability that the behavior policy π' generates it is

$$q(\tau) = \prod_{t=0}^{T-1} \pi'(s_t, a_t)\, P(s_{t+1}\mid s_t, a_t)$$

so the state-transition probabilities cancel and

$$\frac{p(\tau)}{q(\tau)} = \prod_{t=0}^{T-1} \frac{\pi(s_t, a_t)}{\pi'(s_t, a_t)}$$
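A minimal sketch (not from the original post) of this ratio: target_prob and behavior_prob are assumed callables returning π(s, a) and π'(s, a); the transition probabilities never need to be known because they cancel.

def trajectory_weight(trajectory, target_prob, behavior_prob):
    # trajectory: list of (state, action) pairs sampled from the behavior policy.
    # Returns p(tau)/q(tau) as the product of per-step policy ratios.
    w = 1.0
    for state, action in trajectory:
        w *= target_prob(state, action) / behavior_prob(state, action)
    return w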
If π is a deterministic policy and π' is the ε-greedy policy derived from π:

original policy: π(s_t, a_t) = 1 if a_t = π(s_t), and 0 otherwise.

behavior policy: π'(s_t, a_t) = 1 - ε + ε/|A| if a_t = π(s_t), and ε/|A| otherwise.

Now we measure the weight w of a trajectory produced by the behavior policy. In theory it should be the product of the per-step ratios above, but since only a comparison of the two probabilities is needed, the code in section 四 substitutes each per-step ratio with

$$w_t = e^{\,\pi(s_t, a_t) - \pi'(s_t, a_t)}$$

(a more flexible use of importance sampling). The core idea is still to compare the two probabilities: the earlier example took the log of the ratio and then normalized, while here the exponentiated difference of the two probabilities is used instead.
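For example, with ε = 0.2 and |A| = 2 (the setting used in the code below), the behavior policy picks the greedy action with probability 1 - 0.2 + 0.2/2 = 0.9 and the other action with probability 0.2/2 = 0.1. The exact per-step ratio π/π' would be 1/0.9 ≈ 1.11 when the sampled action matches the target policy and 0/0.1 = 0 otherwise, while the substituted weights are e^(1-0.9) ≈ 1.105 and e^(0-0.1) ≈ 0.905.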
三 Impact on Variance
四 Code

Note that the return R in the code below is computed differently from the derivation above: for each step t, the weighted sum of the remaining rewards is averaged over the remaining T - t steps.
# -*- coding: utf-8 -*-
"""
Created on Wed Nov 8 11:56:26 2023

@author: chengxf2
"""
import numpy as np
import random
from enum import Enum


class State(Enum):
    # state space
    shortWater = 1  # short of water
    health = 2      # healthy
    overflow = 3    # over-watered
    apoptosis = 4   # dead


class Action(Enum):
    # action space A
    water = 1    # water the plant
    noWater = 2  # do not water


class Env():
    def __init__(self):
        self.name = "environment"

    def reward(self, state):
        # reward for transitioning into the new state
        r = -100
        if state is State.shortWater:
            r = -1
        elif state is State.health:
            r = 1
        elif state is State.overflow:
            r = -1
        else:  # State.apoptosis
            r = -100
        return r

    def action(self, state, action):
        # state-transition model: returns (next state, reward)
        if state is State.shortWater:
            if action is Action.water:
                newState = [State.shortWater, State.health]
                p = [0.4, 0.6]
            else:
                newState = [State.shortWater, State.apoptosis]
                p = [0.4, 0.6]
        elif state is State.health:
            if action is Action.water:
                newState = [State.health, State.overflow]
                p = [0.6, 0.4]
            else:
                newState = [State.shortWater, State.health]
                p = [0.6, 0.4]
        elif state is State.overflow:
            if action is Action.water:
                newState = [State.overflow, State.apoptosis]
                p = [0.6, 0.4]
            else:
                newState = [State.health, State.overflow]
                p = [0.6, 0.4]
        else:  # State.apoptosis is absorbing
            newState = [State.apoptosis]
            p = [1.0]
        nextState = random.choices(newState, p)[0]
        r = self.reward(nextState)
        return nextState, r


class Agent():
    def __init__(self):
        self.S = [State.shortWater, State.health, State.overflow, State.apoptosis]
        self.A = [Action.water, Action.noWater]
        self.Q = {}       # accumulated reward per (state, action)
        self.count = {}   # visit count per (state, action)
        self.policy = {}  # target policy
        self.maxIter = 500
        self.epsilon = 0.2
        self.T = 10

    def initPolicy(self):
        # initialize Q, the visit counts, and the deterministic target policy
        self.Q = {}
        self.count = {}
        for state in self.S:
            for action in self.A:
                self.Q[state, action] = 0.0
                self.count[state, action] = 0
            self.policy[state] = Action.noWater  # start by never watering

    def randomAction(self):
        # uniformly random action
        return random.choices(self.A, [0.5, 0.5])[0]

    def behaviorPolicy(self):
        # generate one trajectory with the ε-greedy behavior policy
        state = State.shortWater  # start from the shortWater state
        env = Env()
        trajectory = {}  # t -> [state, action, reward]
        for t in range(self.T):
            rnd = np.random.rand()
            if rnd < self.epsilon:
                action = self.randomAction()
            else:
                action = self.policy[state]  # follow the target policy
            newState, reward = env.action(state, action)
            trajectory[t] = [state, action, reward]
            state = newState
        return trajectory

    def calcW(self, trajectory):
        # per-step weights exp(pi(s,a) - pi'(s,a)), as described in section 二
        q1 = 1.0 - self.epsilon + self.epsilon / 2.0  # action matches the target policy
        q2 = self.epsilon / 2.0                       # action differs from the target policy
        w = {}
        for t, (state, action, reward) in trajectory.items():
            if action == self.policy[state]:
                p, q = 1, q1
            else:
                p, q = 0, q2
            w[t] = round(np.exp(p - q), 3)
        return w

    def getReward(self, t, wDict, trajectory):
        # weighted average return from step t onward
        p = 1.0
        r = 0.0
        for i in range(t, self.T):
            r += trajectory[i][-1]
            p = p * wDict[i]
        R = p * r
        m = self.T - t
        return R / m

    def improve(self):
        # greedy policy improvement with respect to Q
        for state in self.S:
            maxR = self.Q[state, Action.noWater]
            for action in self.A:
                R = self.Q[state, action]
                if R >= maxR:
                    maxR = R
                    self.policy[state] = action

    def learn(self):
        self.initPolicy()
        for s in range(1, self.maxIter):
            # sample the s-th trajectory with the behavior (ε-greedy) policy
            trajectory = self.behaviorPolicy()
            w = self.calcW(trajectory)
            print("\n iteration %d" % s,
                  "\t shortWater:", self.policy[State.shortWater].name,
                  "\t health:", self.policy[State.health].name,
                  "\t overflow:", self.policy[State.overflow].name,
                  "\t apoptosis:", self.policy[State.apoptosis].name)
            # policy evaluation: incremental update of Q
            for t in range(self.T):
                R = self.getReward(t, w, trajectory)
                state = trajectory[t][0]
                action = trajectory[t][1]
                Q = self.Q[state, action]
                count = self.count[state, action]
                self.Q[state, action] = (Q * count + R) / (count + 1)
                self.count[state, action] = count + 1
            # policy improvement
            self.improve()


if __name__ == "__main__":
    agent = Agent()
    agent.learn()