Transformer and BERT can be regarded as the foundation models behind LLMs, so understanding them thoroughly is essential. The Transformer was originally conceived as a text-translation model, and BERT is built from a subset of the Transformer's components, so once you understand the Transformer, BERT is easy to understand.
I. Transformer Model Architecture
1. Encoder
(1) Multi-Head Attention
The input x is first embedded, then transformed into Q, K and V via the WQ, WK and WV matrices, fed into Scaled Dot-Product Attention, and finally passed through the Feed Forward layer. The encoder's output then serves as the K and V inputs to the second sub-layer of the decoder.
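To make this data flow concrete, below is a minimal, self-contained sketch of a single attention head; the dimensions (d_model = 32) and variable names are illustrative and are not taken from the code later in this article:

import torch

d_model = 32                     # illustrative embedding dimension
x = torch.randn(1, 50, d_model)  # one sentence of 50 already-embedded tokens

# Projection matrices WQ, WK and WV, implemented as linear layers
W_Q = torch.nn.Linear(d_model, d_model)
W_K = torch.nn.Linear(d_model, d_model)
W_V = torch.nn.Linear(d_model, d_model)
Q, K, V = W_Q(x), W_K(x), W_V(x)

# Scaled Dot-Product Attention: softmax(Q K^T / sqrt(d_k)) V
# (with a single head, d_k equals d_model here)
score = Q @ K.transpose(-2, -1) / d_model ** 0.5
out = torch.softmax(score, dim=-1) @ V
print(out.shape)  # torch.Size([1, 50, 32])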
(2) Feed Forward
2. Decoder
(1) Masked Multi-Head Attention
The mask is the combination of an upper-triangular mask (excluding the diagonal) and the PAD mask. Its purpose is to keep self-attention from attending to any token after the current one, so each position only attends to itself and the tokens before it. During training, Teacher Forcing is used to avoid error accumulation and to allow parallel training.
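As a minimal illustration of how the causal mask and Teacher Forcing fit together, here is a sketch with a made-up five-token sequence (not the article's task code):

import torch

y = torch.tensor([[0, 5, 7, 9, 1]])  # e.g. [<SOS>, t1, t2, t3, <EOS>]
decoder_input = y[:, :-1]            # Teacher Forcing: feed [<SOS>, t1, t2, t3]
target = y[:, 1:]                    # and train against [t1, t2, t3, <EOS>]

# Causal (upper-triangular) mask: True marks positions that must not be attended to,
# so position i can only see positions <= i
seq_len = decoder_input.shape[1]
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)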
(2) Encoder-Decoder Multi-Head Attention
The encoder's output is used as K and V in the decoder's second sub-layer, while the output of the decoder's first sub-layer is used as Q.
(3) Feed Forward
II. A Simple Translation Task
1. Defining the Dataset
This part is only covered briefly: a data generator simulates samples in which a source sequence is "translated" into a target sequence. The implementation is as follows:
# Imports used by the code in this article
import math
import random

import numpy as np
import torch

# Define the source and target vocabularies
vocab_x = '<SOS>,<EOS>,<PAD>,0,1,2,3,4,5,6,7,8,9,q,w,e,r,t,y,u,i,o,p,a,s,d,f,g,h,j,k,l,z,x,c,v,b,n,m'
vocab_x = {word: i for i, word in enumerate(vocab_x.split(','))}
vocab_xr = [k for k, v in vocab_x.items()]
vocab_y = {k.upper(): v for k, v in vocab_x.items()}
vocab_yr = [k for k, v in vocab_y.items()]
print('vocab_x=', vocab_x)
print('vocab_y=', vocab_y)

# Define the data-generating function
def get_data():
    # Define the word set
    words = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
             'q', 'w', 'e', 'r', 't', 'y', 'u', 'i', 'o', 'p',
             'a', 's', 'd', 'f', 'g', 'h', 'j', 'k', 'l',
             'z', 'x', 'c', 'v', 'b', 'n', 'm']
    # Define the probability of each word being sampled
    p = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                  1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                  11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                  21, 22, 23, 24, 25, 26])
    p = p / p.sum()
    # Randomly choose n words
    n = random.randint(30, 48)  # between 30 and 48 words
    x = np.random.choice(words, size=n, replace=True, p=p)  # sample n words with probabilities p; replace=True allows repeats
    # The sampling result is x
    x = x.tolist()
    # y is obtained by transforming x:
    # letters are upper-cased, digits are replaced by their complement to 9
    def f(i):
        i = i.upper()
        if not i.isdigit():
            return i
        i = 9 - int(i)
        return str(i)
    y = [f(i) for i in x]
    # Reverse y
    y = y[::-1]
    # Double the first token of y
    y = [y[0]] + y
    # Add start and end tokens
    x = ['<SOS>'] + x + ['<EOS>']
    y = ['<SOS>'] + y + ['<EOS>']
    # Pad with <PAD> up to a fixed length
    x = x + ['<PAD>'] * 50
    y = y + ['<PAD>'] * 51
    x = x[:50]
    y = y[:51]
    # Encode tokens to indices
    x = [vocab_x[i] for i in x]
    y = [vocab_y[i] for i in y]
    # Convert to tensors
    x = torch.LongTensor(x)
    y = torch.LongTensor(y)
    return x, y

# Define the dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self):  # initialization
        super(Dataset, self).__init__()

    def __len__(self):  # return the size of the dataset
        return 1000

    def __getitem__(self, i):  # return one sample by index
        return get_data()
The data loader is then defined with loader = torch.utils.data.DataLoader(dataset=Dataset(), batch_size=8, drop_last=True, shuffle=True, collate_fn=None).
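A quick way to look at one batch from this loader is sketched below; decoding the indices back to tokens via vocab_xr and vocab_yr is added here purely for illustration:

# Draw one batch and decode the first sample back to tokens (illustrative)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([8, 50]) torch.Size([8, 51])
print(''.join(vocab_xr[i] for i in x[0].tolist()))
print(''.join(vocab_yr[i] for i in y[0].tolist()))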
2. Defining the PAD MASK Function
The main purpose of the PAD mask is to keep attention away from the meaningless <PAD> positions, as shown below:
def mask_pad(data):
    # b sentences, 50 tokens each, not yet embedded
    # data = [b, 50]
    # Check whether each token is <PAD>
    mask = data == vocab_x['<PAD>']
    # [b, 50] -> [b, 1, 1, 50]
    mask = mask.reshape(-1, 1, 1, 50)
    # Attention is computed between all 50 tokens and all 50 tokens, i.e. a 50*50 matrix.
    # PAD columns are True, meaning no token may attend to a PAD token;
    # PAD tokens themselves may still attend to other tokens, so PAD rows are not set to True.
    # Expand along the specified dimensions
    # [b, 1, 1, 50] -> [b, 1, 50, 50]
    mask = mask.expand(-1, 1, 50, 50)
    return mask
if __name__ == '__main__':
    # Test mask_pad on one batch drawn from the loader defined above
    x, y = next(iter(loader))
    print(mask_pad(x[:1]))
The output has shape (1, 1, 50, 50), as shown below:
tensor([[[[False, False, False,  ..., False, False,  True],
          [False, False, False,  ..., False, False,  True],
          [False, False, False,  ..., False, False,  True],
          ...,
          [False, False, False,  ..., False, False,  True],
          [False, False, False,  ..., False, False,  True],
          [False, False, False,  ..., False, False,  True]]]])
3. Defining the Upper-Triangular MASK Function
The upper-triangular mask is combined (added) with the PAD mask; the final output has the same shape as that of the PAD mask function, namely (b, 1, 50, 50):
# Define the mask_tril function
def mask_tril(data):
    # b sentences, 50 tokens each, not yet embedded
    # data = [b, 50]
    # A 50*50 matrix indicates whether each token can see every other token.
    # The upper-triangular matrix (diagonal excluded) means each token can only see
    # itself and the tokens before it, not the tokens after it.
    # [1, 50, 50]
    """
    [[0, 1, 1, 1, 1],
     [0, 0, 1, 1, 1],
     [0, 0, 0, 1, 1],
     [0, 0, 0, 0, 1],
     [0, 0, 0, 0, 0]]
    """
    tril = 1 - torch.tril(torch.ones(1, 50, 50, dtype=torch.long))  # torch.tril gives the lower triangle, so 1 - tril gives the upper triangle
    # Check whether each token in y is <PAD>; PAD tokens are not visible
    # [b, 50]
    mask = data == vocab_y['<PAD>']  # mask has shape [b, 50]
    # Reshape and cast for the following computation
    # [b, 1, 50]
    mask = mask.unsqueeze(1).long()  # insert a dimension; mask has shape [b, 1, 50]
    # Take the union of mask and tril
    # [b, 1, 50] + [1, 50, 50] -> [b, 50, 50]
    mask = mask + tril
    # Convert to boolean
    mask = mask > 0  # mask has shape [b, 50, 50]
    # Add a dimension for later computation
    mask = (mask == 1).unsqueeze(dim=1)  # mask has shape [b, 1, 50, 50]
    return mask
if __name__ == '__main__':
    # Test mask_tril on one batch drawn from the loader defined above
    x, y = next(iter(loader))
    print(mask_tril(x[:1]))
The output has shape (b, 1, 50, 50), here with b = 1, as shown below:
tensor([[[[False,  True,  True,  ...,  True,  True,  True],
          [False, False,  True,  ...,  True,  True,  True],
          [False, False, False,  ...,  True,  True,  True],
          ...,
          [False, False, False,  ...,  True,  True,  True],
          [False, False, False,  ...,  True,  True,  True],
          [False, False, False,  ...,  True,  True,  True]]]])
4. Defining the Attention Layer
The attention layer here is Scaled Dot-Product Attention, computed as

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

where $d_k$ equals the embedding dimension divided by the number of attention heads, for example 64 = 512 / 8. The implementation is as follows:
# Define the attention function
def attention(Q, K, V, mask):
    """
    Q: torch.randn(8, 4, 50, 8)
    K: torch.randn(8, 4, 50, 8)
    V: torch.randn(8, 4, 50, 8)
    mask: torch.zeros(8, 1, 50, 50, dtype=torch.bool)
    """
    # b sentences, 50 tokens each, each token encoded as a 32-dim vector,
    # 4 heads, each head gets an 8-dim vector
    # Q, K, V = [b, 4, 50, 8]
    # [b, 4, 50, 8] * [b, 4, 8, 50] -> [b, 4, 50, 50]
    # Multiply Q and K to get the attention of each token with respect to all other tokens
    score = torch.matmul(Q, K.permute(0, 1, 3, 2))  # permute(0, 1, 3, 2) swaps the last two dimensions of K
    # Scale by the square root of the per-head dimension
    score /= 8 ** 0.5
    # Apply the mask: positions where mask is True are replaced with -inf,
    # so they are squashed to 0 by the softmax
    # mask = [b, 1, 50, 50]
    score = score.masked_fill_(mask, -float('inf'))  # masked_fill_ fills masked positions with the given value
    score = torch.softmax(score, dim=-1)  # softmax over the last dimension
    # Multiply the attention scores with V to get the attention output
    # [b, 4, 50, 50] * [b, 4, 50, 8] -> [b, 4, 50, 8]
    score = torch.matmul(score, V)
    # Merge the results of the heads
    # [b, 4, 50, 8] -> [b, 50, 32]
    score = score.permute(0, 2, 1, 3).reshape(-1, 50, 32)
    return score
if __name__ == '__main__':
    # Test attention
    print(attention(torch.randn(8, 4, 50, 8),
                    torch.randn(8, 4, 50, 8),
                    torch.randn(8, 4, 50, 8),
                    torch.zeros(8, 1, 50, 50, dtype=torch.bool)).shape)  # torch.Size([8, 50, 32])
5. Comparing BatchNorm and LayerNorm
PyTorch provides two commonly used normalization layers, BatchNorm and LayerNorm. BatchNorm comes as BatchNorm1d, BatchNorm2d and BatchNorm3d depending on the dimensionality of the data. The difference between BatchNorm1d and LayerNorm is that BatchNorm1d normalizes across different samples, whereas LayerNorm normalizes across the channels within each sample.
# Comparison of BatchNorm1d and LayerNorm
# After normalization, the mean is 0 and the standard deviation is 1
# BN normalizes across different samples
# LN normalizes across different channels within a sample
# affine=True / elementwise_affine=True: apply a learnable linear mapping after normalization
norm = torch.nn.BatchNorm1d(num_features=4, affine=True)
print(norm(torch.arange(32, dtype=torch.float32).reshape(2, 4, 4)))
norm = torch.nn.LayerNorm(normalized_shape=4, elementwise_affine=True)
print(norm(torch.arange(32, dtype=torch.float32).reshape(2, 4, 4)))
The output is as follows:
tensor([[[-1.1761, -1.0523, -0.9285, -0.8047],
         [-1.1761, -1.0523, -0.9285, -0.8047],
         [-1.1761, -1.0523, -0.9285, -0.8047],
         [-1.1761, -1.0523, -0.9285, -0.8047]],

        [[ 0.8047,  0.9285,  1.0523,  1.1761],
         [ 0.8047,  0.9285,  1.0523,  1.1761],
         [ 0.8047,  0.9285,  1.0523,  1.1761],
         [ 0.8047,  0.9285,  1.0523,  1.1761]]],
       grad_fn=<NativeBatchNormBackward0>)
tensor([[[-1.3416, -0.4472,  0.4472,  1.3416],
         [-1.3416, -0.4472,  0.4472,  1.3416],
         [-1.3416, -0.4472,  0.4472,  1.3416],
         [-1.3416, -0.4472,  0.4472,  1.3416]],

        [[-1.3416, -0.4472,  0.4472,  1.3416],
         [-1.3416, -0.4472,  0.4472,  1.3416],
         [-1.3416, -0.4472,  0.4472,  1.3416],
         [-1.3416, -0.4472,  0.4472,  1.3416]]],
       grad_fn=<NativeLayerNormBackward0>)
6. Defining the Multi-Head Attention Layer
The multi-head attention layer used here includes the projection matrices (WQ, WK and WV) and the multi-head attention computation itself, plus layer normalization, a residual connection and Dropout, as shown below:
# Multi-head attention layer
class MultiHead(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc_Q = torch.nn.Linear(32, 32)    # linear projection, dimension unchanged
        self.fc_K = torch.nn.Linear(32, 32)    # linear projection, dimension unchanged
        self.fc_V = torch.nn.Linear(32, 32)    # linear projection, dimension unchanged
        self.out_fc = torch.nn.Linear(32, 32)  # linear projection, dimension unchanged
        self.norm = torch.nn.LayerNorm(normalized_shape=32, elementwise_affine=True)  # normalization
        self.DropOut = torch.nn.Dropout(p=0.1)  # dropout with probability 0.1

    def forward(self, Q, K, V, mask):
        # b sentences, 50 tokens each, each token encoded as a 32-dim vector
        # Q, K, V = [b, 50, 32]
        b = Q.shape[0]  # batch size
        # Keep the original Q for the residual connection
        clone_Q = Q.clone()
        # Normalization
        Q = self.norm(Q)
        K = self.norm(K)
        V = self.norm(V)
        # Linear projections, dimension unchanged
        # [b, 50, 32] -> [b, 50, 32]
        K = self.fc_K(K)  # the weights are WK
        V = self.fc_V(V)  # the weights are WV
        Q = self.fc_Q(Q)  # the weights are WQ
        # Split into heads
        # b sentences, 50 tokens each, 32-dim vectors, 4 heads, 8 dims per head
        # [b, 50, 32] -> [b, 4, 50, 8]
        Q = Q.reshape(b, 50, 4, 8).permute(0, 2, 1, 3)
        K = K.reshape(b, 50, 4, 8).permute(0, 2, 1, 3)
        V = V.reshape(b, 50, 4, 8).permute(0, 2, 1, 3)
        # Compute attention
        # [b, 4, 50, 8] -> [b, 50, 32]
        score = attention(Q, K, V, mask)
        # Output projection, dimension unchanged
        # [b, 50, 32] -> [b, 50, 32]
        score = self.DropOut(self.out_fc(score))  # dropout with probability 0.1
        # Residual connection
        score = clone_Q + score
        return score
7. Defining the Positional Encoding Layer
The positional encoding is computed as follows, where $d_{model}$ is the embedding dimension (e.g. 512):

$PE(pos, 2i) = \sin\left(pos / 10000^{2i/d_{model}}\right)$
$PE(pos, 2i+1) = \cos\left(pos / 10000^{2i/d_{model}}\right)$
# Define the positional encoding layer
class PositionEmbedding(torch.nn.Module):
    def __init__(self):
        super().__init__()

        # pos is the token position, i is the embedding dimension index,
        # d_model is the total embedding dimension
        def get_pe(pos, i, d_model):
            d = 1e4 ** (i / d_model)
            pe = pos / d
            if i % 2 == 0:
                return math.sin(pe)  # sin for even dimensions
            return math.cos(pe)      # cos for odd dimensions

        # Initialize the positional encoding matrix
        pe = torch.empty(50, 32)
        for i in range(50):
            for j in range(32):
                pe[i, j] = get_pe(i, j, 32)
        pe = pe.unsqueeze(0)  # add a dimension; shape becomes [1, 50, 32]
        # Register as a constant that is not updated during training
        self.register_buffer('pe', pe)
        # Token embedding layer
        self.embed = torch.nn.Embedding(39, 32)  # 39 tokens, each encoded as a 32-dim vector
        # Initialize the weights with a normal distribution
        self.embed.weight.data.normal_(0, 0.1)

    def forward(self, x):
        # [8, 50] -> [8, 50, 32]
        embed = self.embed(x)
        # Add the positional encoding to the token embedding
        # [8, 50, 32] + [1, 50, 32] -> [8, 50, 32]
        embed = embed + self.pe
        return embed
8. Defining the Fully Connected Output Layer
Compared with the standard Transformer, the fully connected output layer defined here moves layer normalization to the front of the sub-layer (pre-norm instead of post-norm), as shown below:
# Define the fully connected output layer
class FullyConnectedOutput(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Sequential(  # position-wise feed-forward network
            torch.nn.Linear(in_features=32, out_features=64),
            torch.nn.ReLU(),
            torch.nn.Linear(in_features=64, out_features=32),
            torch.nn.Dropout(p=0.1),
        )
        self.norm = torch.nn.LayerNorm(normalized_shape=32, elementwise_affine=True)

    def forward(self, x):
        # Keep the original x for the residual connection
        clone_x = x.clone()
        # Normalization
        x = self.norm(x)
        # Feed-forward computation
        # [b, 50, 32] -> [b, 50, 32]
        out = self.fc(x)
        # Residual connection
        out = clone_x + out
        return out
9. Defining the Encoder
The encoder consists of multiple encoder layers (three in the code below). Each encoder layer contains one multi-head attention layer and one fully connected output layer, as shown below:
# Define the encoder
# Encoder layer
class EncoderLayer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mh = MultiHead()             # multi-head attention layer
        self.fc = FullyConnectedOutput()  # fully connected output layer

    def forward(self, x, mask):
        # Self-attention, dimension unchanged
        # [b, 50, 32] -> [b, 50, 32]
        score = self.mh(x, x, x, mask)  # Q = K = V
        # Fully connected output, dimension unchanged
        # [b, 50, 32] -> [b, 50, 32]
        out = self.fc(score)
        return out
# Encoder
class Encoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1 = EncoderLayer()  # encoder layer
        self.layer_2 = EncoderLayer()  # encoder layer
        self.layer_3 = EncoderLayer()  # encoder layer

    def forward(self, x, mask):
        x = self.layer_1(x, mask)
        x = self.layer_2(x, mask)
        x = self.layer_3(x, mask)
        return x
10. Defining the Decoder
The decoder consists of multiple decoder layers (three in the code below). Each decoder layer contains two multi-head attention layers (a masked multi-head attention layer and an encoder-decoder multi-head attention layer) and one fully connected output layer, as shown below:
class DecoderLayer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mh1 = MultiHead()            # masked multi-head attention layer
        self.mh2 = MultiHead()            # encoder-decoder multi-head attention layer
        self.fc = FullyConnectedOutput()  # fully connected output layer

    def forward(self, x, y, mask_pad_x, mask_tril_y):
        # Self-attention on y first, dimension unchanged
        # [b, 50, 32] -> [b, 50, 32]
        y = self.mh1(y, y, y, mask_tril_y)
        # Attention combining x and y, dimension unchanged
        # [b, 50, 32], [b, 50, 32] -> [b, 50, 32]
        y = self.mh2(y, x, x, mask_pad_x)
        # Fully connected output, dimension unchanged
        # [b, 50, 32] -> [b, 50, 32]
        y = self.fc(y)
        return y
# Decoder
class Decoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1 = DecoderLayer()  # decoder layer
        self.layer_2 = DecoderLayer()  # decoder layer
        self.layer_3 = DecoderLayer()  # decoder layer

    def forward(self, x, y, mask_pad_x, mask_tril_y):
        y = self.layer_1(x, y, mask_pad_x, mask_tril_y)
        y = self.layer_2(x, y, mask_pad_x, mask_tril_y)
        y = self.layer_3(x, y, mask_pad_x, mask_tril_y)
        return y
11. Defining the Main Transformer Model
The forward pass of the Transformer model works as follows: given a batch of x and y, compute the PAD mask for x and the upper-triangular mask for y; embed x and y (adding positional information); feed x through the encoder; feed the encoder output together with y through the decoder; and finally feed the decoder output through the fully connected output layer. The implementation is shown below:
# Define the main model
class Transformer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.embed_x = PositionEmbedding()  # positional encoding layer for x
        self.embed_y = PositionEmbedding()  # positional encoding layer for y
        self.encoder = Encoder()            # encoder
        self.decoder = Decoder()            # decoder
        self.fc_out = torch.nn.Linear(32, 39)  # fully connected output layer

    def forward(self, x, y):
        # [b, 1, 50, 50]
        mask_pad_x = mask_pad(x)    # PAD mask
        mask_tril_y = mask_tril(y)  # upper-triangular mask
        # Embed and add positional information
        # x = [b, 50] -> [b, 50, 32]
        # y = [b, 50] -> [b, 50, 32]
        x, y = self.embed_x(x), self.embed_y(y)
        # Encoder computation
        # [b, 50, 32] -> [b, 50, 32]
        x = self.encoder(x, mask_pad_x)
        # Decoder computation
        # [b, 50, 32], [b, 50, 32] -> [b, 50, 32]
        y = self.decoder(x, y, mask_pad_x, mask_tril_y)
        # Fully connected output
        # [b, 50, 32] -> [b, 50, 39]
        y = self.fc_out(y)
        return y
12. Defining the Prediction Function
The prediction function simply generates y from x. During prediction the decoder works autoregressively, generating one token at a time starting from <SOS> until the sequence is complete:
# Define the prediction function
def predict(x):
    # x = [1, 50]
    model.eval()
    # [1, 1, 50, 50]
    mask_pad_x = mask_pad(x)
    # Initialize the output; this is a fixed value
    # [1, 50]
    # [[0, 2, 2, 2, ...]]
    target = [vocab_y['<SOS>']] + [vocab_y['<PAD>']] * 49
    target = torch.LongTensor(target).unsqueeze(0)  # add a dimension; shape becomes [1, 50]
    # Embed x and add positional information
    # [1, 50] -> [1, 50, 32]
    x = model.embed_x(x)
    # Encoder computation, dimension unchanged
    # [1, 50, 32] -> [1, 50, 32]
    x = model.encoder(x, mask_pad_x)
    # Generate tokens 1 to 49 one at a time
    for i in range(49):
        # [1, 50]
        y = target
        # [1, 1, 50, 50]
        mask_tril_y = mask_tril(y)  # upper-triangular mask
        # Embed y and add positional information
        # [1, 50] -> [1, 50, 32]
        y = model.embed_y(y)
        # Decoder computation, dimension unchanged
        # [1, 50, 32], [1, 50, 32] -> [1, 50, 32]
        y = model.decoder(x, y, mask_pad_x, mask_tril_y)
        # Fully connected output, 39 classes
        # [1, 50, 32] -> [1, 50, 39]
        out = model.fc_out(y)
        # Take the output at the current position
        # [1, 50, 39] -> [1, 39]
        out = out[:, i, :]
        # Take the predicted class
        # [1, 39] -> [1]
        out = out.argmax(dim=1).detach()
        # Use the current prediction as the next input token
        target[:, i + 1] = out
    return target
13. Defining the Training Function
The training function follows the usual pattern: define the loss function and the optimizer, then iterate over epochs and batches while computing and printing the current epoch, batch index, learning rate, loss and accuracy. The code is shown below:
# Define the training function
def train():
    loss_func = torch.nn.CrossEntropyLoss()  # cross-entropy loss
    optim = torch.optim.Adam(model.parameters(), lr=2e-3)  # optimizer
    sched = torch.optim.lr_scheduler.StepLR(optim, step_size=3, gamma=0.5)  # learning-rate decay
    for epoch in range(1):
        for i, (x, y) in enumerate(loader):
            # x = [8, 50]
            # y = [8, 51]
            # During training each token of y is used as input to predict the next token,
            # so the last token is not needed as input
            # [8, 50, 39]
            pred = model(x, y[:, :-1])  # forward pass
            # [8, 50, 39] -> [400, 39]
            pred = pred.reshape(-1, 39)
            # [8, 51] -> [400]
            y = y[:, 1:].reshape(-1)
            # Ignore PAD positions
            select = y != vocab_y['<PAD>']
            pred = pred[select]
            y = y[select]
            loss = loss_func(pred, y)  # compute the loss
            optim.zero_grad()  # clear the gradients
            loss.backward()    # backpropagation
            optim.step()       # update the parameters
            if i % 20 == 0:
                # [select, 39] -> [select]
                pred = pred.argmax(1)  # predicted classes
                correct = (pred == y).sum().item()  # number of correct predictions
                accuracy = correct / len(pred)      # accuracy
                lr = optim.param_groups[0]['lr']    # current learning rate
                print(epoch, i, lr, loss.item(), accuracy)  # epoch, batch, learning rate, loss, accuracy
        sched.step()  # update the learning rate
After training, the correspondence between y and the model's predictions can be inspected.
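For completeness, here is a hedged sketch of how the pieces above can be wired together; the original article does not show this glue code, so the instantiation and the decoding step are assumptions:

# Hypothetical end-to-end usage of the components defined above
model = Transformer()
train()

# Compare one target sequence with the model's prediction
x, y = next(iter(loader))
pred = predict(x[:1])
print('y    =', ''.join(vocab_yr[i] for i in y[0].tolist()))
print('pred =', ''.join(vocab_yr[i] for i in pred[0].tolist()))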