IMDb Sentiment Analysis with Keras: MLP, RNN, and LSTM
Contents
- Download the IMDb data
- Read the IMDb data
- Build the tokenizer
- Convert the reviews to sequences of indices
- Pad the sequences to a uniform length
- Add an embedding layer
- Build the multilayer perceptron model
- Add a flatten layer
- Add a hidden layer
- Add the output layer
- View the model summary
- Train the model
- Evaluate model accuracy
- Make predictions
- View predictions on the test data
- A complete prediction function
- IMDb sentiment analysis with an RNN
- IMDb sentiment analysis with an LSTM
GitHub repository: https://github.com/fz861062923/Keras
Download the IMDb data
#Download page: http://ai.stanford.edu/~amaas/data/sentiment/
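If you would rather fetch the archive in code, here is a minimal sketch; the file name aclImdb_v1.tar.gz is the one published on that page (adjust if it changes):
import os
import tarfile
import urllib.request
url='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
filepath='aclImdb_v1.tar.gz'
if not os.path.isfile(filepath):#download once (~80 MB)
    urllib.request.urlretrieve(url,filepath)
if not os.path.isdir('aclImdb'):#extract to ./aclImdb/
    with tarfile.open(filepath,'r:gz') as tar:
        tar.extractall('.')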
Read the IMDb data
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
Using TensorFlow backend.
#The reviews were scraped from the web, so we use a regular expression to strip the HTML tags
import re
def remove_html(text):
    r=re.compile(r'<[^>]+>')
    return r.sub('',text)
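A quick sanity check: the raw reviews are full of &lt;br /&gt; tags, which remove_html strips out:
print(remove_html('A great film.<br /><br />Highly recommended.'))
#A great film.Highly recommended.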
#Inspect the aclImdb directory layout and read it with a function
import os
def read_file(filetype):
    path='./aclImdb/'
    file_list=[]
    positive=path+filetype+'/pos/'
    for f in os.listdir(positive):
        file_list+=[positive+f]
    negative=path+filetype+'/neg/'
    for f in os.listdir(negative):
        file_list+=[negative+f]
    print('filetype:',filetype,'file_length:',len(file_list))
    label=([1]*12500+[0]*12500)#both train and test contain 12500 positive and 12500 negative reviews
    text=[]
    for f_ in file_list:
        with open(f_,encoding='utf8') as f:
            text+=[remove_html(''.join(f.readlines()))]
    return label,text
#Note the naming: x holds the labels, y holds the review texts
x_train,y_train=read_file('train')
filetype: train file_length: 25000
x_test,y_test=read_file('test')
filetype: test file_length: 25000
y_train[0]
'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'
Build the tokenizer
See the official documentation for details: https://keras.io/preprocessing/text/
token=Tokenizer(num_words=2000)#build a dictionary containing the 2000 most frequent words
token.fit_on_texts(y_train)#scan all training reviews, rank words by how often they appear, and keep the top 2000 in the dictionary
#check how many documents the tokenizer has read
token.document_count
25000
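To see what the dictionary looks like, token.word_index maps each word to its frequency rank (1 = most frequent); only the top-ranked words, as limited by num_words, are used when converting texts later. The exact words below are illustrative:
print(sorted(token.word_index.items(),key=lambda kv:kv[1])[:5])
#e.g. [('the', 1), ('and', 2), ('a', 3), ('of', 4), ('to', 5)]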
Convert the reviews to sequences of indices
train_seq=token.texts_to_sequences(y_train)
test_seq=token.texts_to_sequences(y_test)
print(y_train[0])
Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
print(train_seq[0])
[308, 6, 3, 1068, 208, 8, 29, 1, 168, 54, 13, 45, 81, 40, 391, 109, 137, 13, 57, 149, 7, 1, 481, 68, 5, 260, 11, 6, 72, 5, 631, 70, 6, 1, 5, 1, 1530, 33, 66, 63, 204, 139, 64, 1229, 1, 4, 1, 222, 899, 28, 68, 4, 1, 9, 693, 2, 64, 1530, 50, 9, 215, 1, 386, 7, 59, 3, 1470, 798, 5, 176, 1, 391, 9, 1235, 29, 308, 3, 352, 343, 142, 129, 5, 27, 4, 125, 1470, 5, 308, 9, 532, 11, 107, 1466, 4, 57, 554, 100, 11, 308, 6, 226, 47, 3, 11, 8, 214]
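To sanity-check the encoding we can invert word_index and map the numbers back to words; words outside the top-2000 dictionary were silently dropped, which is why the decoded text is slightly shorter than the original. A small illustrative helper:
index_word={i:w for w,i in token.word_index.items()}
print(' '.join(index_word[i] for i in train_seq[0][:10]))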
Pad the sequences to a uniform length
#Truncate or pad so that every index list has length 100
_train=sequence.pad_sequences(train_seq,maxlen=100)
_test=sequence.pad_sequences(test_seq,maxlen=100)
print(train_seq[0])
[308, 6, 3, 1068, 208, 8, 29, 1, 168, 54, 13, 45, 81, 40, 391, 109, 137, 13, 57, 149, 7, 1, 481, 68, 5, 260, 11, 6, 72, 5, 631, 70, 6, 1, 5, 1, 1530, 33, 66, 63, 204, 139, 64, 1229, 1, 4, 1, 222, 899, 28, 68, 4, 1, 9, 693, 2, 64, 1530, 50, 9, 215, 1, 386, 7, 59, 3, 1470, 798, 5, 176, 1, 391, 9, 1235, 29, 308, 3, 352, 343, 142, 129, 5, 27, 4, 125, 1470, 5, 308, 9, 532, 11, 107, 1466, 4, 57, 554, 100, 11, 308, 6, 226, 47, 3, 11, 8, 214]
print(_train[0])
[ 29 1 168 54 13 45 81 40 391 109 137 13 57 149 7 1 481 68 5 260 11 6 72 5 631 70 6 1 5 1 1530 33 66 63 204 139 64 1229 1 4 1 222 899 28 68 4 1 9 693 2 64 1530 50 9 215 1 386 7 59 3 1470 798 5 176 1 391 9 1235 29 308 3 352 343 142 129 5 27 4 125 1470 5 308 9 532 11 107 1466 4 57 554 100 11 308 6 226 47 3 11 8 214]
_train.shape
(25000, 100)
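Note that pad_sequences pads and truncates at the front by default, which is why _train[0] above starts in the middle of the review: the 106-element list was cut to its last 100 entries. Both behaviors can be set explicitly:
tail=sequence.pad_sequences(train_seq,maxlen=100)#default: padding='pre', truncating='pre'
head=sequence.pad_sequences(train_seq,maxlen=100,truncating='post')#keep the first 100 indices instead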
Add an embedding layer
This converts each list of word indices into a list of dense vectors (it is worth thinking about why this conversion is needed: raw indices carry no notion of word similarity, while learned embedding vectors do).
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation,Flatten
from keras.layers.embeddings import Embedding
model=Sequential()
model.add(Embedding(output_dim=32,#map each word index to a 32-dimensional vector
                    input_dim=2000,#input dimension is 2000, matching the dictionary built earlier
                    input_length=100))#each index list has length 100
model.add(Dropout(0.25))
Build the multilayer perceptron model
Add a flatten layer
model.add(Flatten())
Add a hidden layer
model.add(Dense(units=256,activation='relu'))
model.add(Dropout(0.35))
Add the output layer
model.add(Dense(units=1,#a single output neuron: 1 means a positive review, 0 a negative one
                activation='sigmoid'))
View the model summary
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 100, 32) 64000
_________________________________________________________________
dropout_1 (Dropout) (None, 100, 32) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 3200) 0
_________________________________________________________________
dense_1 (Dense) (None, 256) 819456
_________________________________________________________________
dropout_2 (Dropout) (None, 256) 0
_________________________________________________________________
dense_2 (Dense) (None, 1) 257
=================================================================
Total params: 883,713
Trainable params: 883,713
Non-trainable params: 0
_________________________________________________________________
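The parameter counts in the summary can be verified by hand, which is a useful habit when debugging layer shapes:
assert 2000*32==64000#Embedding: one 32-d vector per dictionary word
assert 100*32==3200#Flatten: 100 positions x 32 dimensions
assert 3200*256+256==819456#hidden layer: weights plus biases
assert 256*1+1==257#output layer: 256 weights plus 1 bias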
Train the model
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
train_history=model.fit(_train,x_train,batch_size=100,epochs=10,verbose=2,validation_split=0.2)
Train on 20000 samples, validate on 5000 samples
Epoch 1/10 - 21s - loss: 0.4851 - acc: 0.7521 - val_loss: 0.4491 - val_acc: 0.7894
Epoch 2/10 - 20s - loss: 0.2817 - acc: 0.8829 - val_loss: 0.6735 - val_acc: 0.6892
Epoch 3/10 - 13s - loss: 0.1901 - acc: 0.9285 - val_loss: 0.5907 - val_acc: 0.7632
Epoch 4/10 - 12s - loss: 0.1066 - acc: 0.9622 - val_loss: 0.7522 - val_acc: 0.7528
Epoch 5/10 - 13s - loss: 0.0681 - acc: 0.9765 - val_loss: 0.9863 - val_acc: 0.7404
Epoch 6/10 - 13s - loss: 0.0486 - acc: 0.9827 - val_loss: 1.0818 - val_acc: 0.7506
Epoch 7/10 - 14s - loss: 0.0380 - acc: 0.9859 - val_loss: 0.9823 - val_acc: 0.7780
Epoch 8/10 - 17s - loss: 0.0360 - acc: 0.9860 - val_loss: 1.1297 - val_acc: 0.7634
Epoch 9/10 - 13s - loss: 0.0321 - acc: 0.9891 - val_loss: 1.2459 - val_acc: 0.7480
Epoch 10/10 - 14s - loss: 0.0281 - acc: 0.9899 - val_loss: 1.4111 - val_acc: 0.7304
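The log shows classic over-fitting: training accuracy climbs toward 0.99 while validation accuracy stalls after the first epoch. A small matplotlib helper (a sketch, assuming matplotlib is installed; the 'acc'/'val_acc' keys match this Keras version's metric names seen above) makes this easy to see:
import matplotlib.pyplot as plt
def show_train_history(history,train,validation):
    plt.plot(history.history[train])
    plt.plot(history.history[validation])
    plt.title('Train History')
    plt.xlabel('Epoch')
    plt.ylabel(train)
    plt.legend(['train','validation'],loc='upper left')
    plt.show()
show_train_history(train_history,'acc','val_acc')
show_train_history(train_history,'loss','val_loss')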
Evaluate model accuracy
scores=model.evaluate(_test,x_test)#the first argument is the features, the second the labels
25000/25000 [==============================] - 4s 148us/step
scores[1]
0.80972
Make predictions
predict=model.predict_classes(_test)
predict[:10]
array([[1],[0],[1],[1],[1],[1],[1],[1],[1],[1]])
#reshape into a one-dimensional array
predict=predict.reshape(-1)
predict[:10]
array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1])
View predictions on the test data
_dict={1:'positive review',0:'negative review'}
def display(i):
    print(y_test[i])
    print('true label:',_dict[x_test[i]],'prediction:',_dict[predict[i]])
display(0)
I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge.
true label: positive review prediction: positive review
A complete prediction function
def review(input_text):
    input_seq=token.texts_to_sequences([input_text])
    pad_input_seq=sequence.pad_sequences(input_seq,maxlen=100)
    predict_result=model.predict_classes(pad_input_seq)
    print(_dict[predict_result[0][0]])
#predict on a review found on IMDb
review('''
Going into this movie, I had low expectations. I'd seen poor reviews, and I also kind of hate the idea of remaking animated films for no reason other than to make them live action, as if that's supposed to make them better some how. This movie pleasantly surprised me!Beauty and the Beast is a fun, charming movie, that is a blast in many ways. The film very easy on the eyes! Every shot is colourful and beautifully crafted. The acting is also excellent. Dan Stevens is excellent. You can see him if you look closely at The Beast, but not so clearly that it pulls you out of the film. His performance is suitably over the top in anger, but also very charming. Emma Watson was fine, but to be honest, she was basically just playing Hermione, and I didn't get much of a character from her. She likes books, and she's feisty. That's basically all I got. For me, the one saving grace for her character, is you can see how much fun Emma Watson is having. I've heard interviews in which she's expressed how much she's always loved Belle as a character, and it shows.The stand out for me was Lumieré, voiced by Ewan McGregor. He was hilarious, and over the top, and always fun! He lit up the screen (no pun intended) every time he showed up!The only real gripes I have with the film are some questionable CGI with the Wolves and with a couple of The Beast's scenes, and some pacing issues. The film flows really well, to such an extent that in some scenes, the camera will dolly away from the character it's focusing on, and will pan across the countryside, and track to another, far away, with out cutting. This works really well, but a couple times, the film will just fade to black, and it's quite jarring. It happens like 3 or 4 times, but it's really noticeable, and took me out of the experience. Also, they added some stuff to the story that I don't want to spoil, but I don't think it worked on any level, story wise, or logically.Overall, it's a fun movie! I would recommend it to any fan of the original, but those who didn't like the animated classic, or who hate musicals might be better off staying aw''')
positive review
review('''
This is a horrible Disney piece of crap full of completely lame singsongs, a script so wrong it is an insult to other scripts to even call it a script. The only way I could enjoy this is after eating two complete space cakes, and even then I would prefer analysing our wallpaper!''')
negative review
- That completes sentiment prediction with a multilayer perceptron in Keras. Reflecting on what the experiment could improve: besides the number of neurons, the dictionary size could be increased (it was 2000 words) and the sequence length maxlen could be made longer; a sketch of these tweaks follows below.
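A sketch of those tweaks, with illustrative (untuned) values of a 3800-word dictionary and maxlen=380; the Embedding layer's input_dim and input_length must be changed to match:
token_big=Tokenizer(num_words=3800)
token_big.fit_on_texts(y_train)
_train_big=sequence.pad_sequences(token_big.texts_to_sequences(y_train),maxlen=380)
_test_big=sequence.pad_sequences(token_big.texts_to_sequences(y_test),maxlen=380)
#then rebuild the model with Embedding(output_dim=32,input_dim=3800,input_length=380) before retraining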
IMDb sentiment analysis with an RNN
- It is also worth thinking about what an RNN buys you here: it reads the sequence in order, so word order can influence the prediction, whereas the flattened MLP treats every position independently.
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN
model_rnn=Sequential()
model_rnn.add(Embedding(output_dim=32,input_dim=2000,input_length=100))
model_rnn.add(Dropout(0.25))
model_rnn.add(SimpleRNN(units=16))#an RNN layer with 16 units
model_rnn.add(Dense(units=256,activation='relu'))
model_rnn.add(Dropout(0.25))
model_rnn.add(Dense(units=1,activation='sigmoid'))
model_rnn.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 100, 32) 64000
_________________________________________________________________
dropout_1 (Dropout) (None, 100, 32) 0
_________________________________________________________________
simple_rnn_1 (SimpleRNN) (None, 16) 784
_________________________________________________________________
dense_1 (Dense) (None, 256) 4352
_________________________________________________________________
dropout_2 (Dropout) (None, 256) 0
_________________________________________________________________
dense_2 (Dense) (None, 1) 257
=================================================================
Total params: 69,393
Trainable params: 69,393
Non-trainable params: 0
_________________________________________________________________
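The SimpleRNN entry's 784 parameters break down into input weights, recurrent weights, and biases:
assert 32*16+16*16+16==784#(input_dim x units)+(units x units)+units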
model_rnn.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
train_history=model_rnn.fit(_train,x_train,batch_size=100,epochs=10,verbose=2,validation_split=0.2)
Train on 20000 samples, validate on 5000 samples
Epoch 1/10 - 13s - loss: 0.5200 - acc: 0.7319 - val_loss: 0.6095 - val_acc: 0.6960
Epoch 2/10 - 12s - loss: 0.3485 - acc: 0.8506 - val_loss: 0.4994 - val_acc: 0.7766
Epoch 3/10 - 12s - loss: 0.3109 - acc: 0.8710 - val_loss: 0.5842 - val_acc: 0.7598
Epoch 4/10 - 13s - loss: 0.2874 - acc: 0.8833 - val_loss: 0.4420 - val_acc: 0.8136
Epoch 5/10 - 12s - loss: 0.2649 - acc: 0.8929 - val_loss: 0.6818 - val_acc: 0.7270
Epoch 6/10 - 14s - loss: 0.2402 - acc: 0.9035 - val_loss: 0.5634 - val_acc: 0.7984
Epoch 7/10 - 16s - loss: 0.2084 - acc: 0.9190 - val_loss: 0.6392 - val_acc: 0.7694
Epoch 8/10 - 16s - loss: 0.1855 - acc: 0.9289 - val_loss: 0.6388 - val_acc: 0.7650
Epoch 9/10 - 14s - loss: 0.1641 - acc: 0.9367 - val_loss: 0.8356 - val_acc: 0.7592
Epoch 10/10 - 19s - loss: 0.1430 - acc: 0.9451 - val_loss: 0.7365 - val_acc: 0.7766
scores=model_rnn.evaluate(_test,x_test)
25000/25000 [==============================] - 14s 567us/step
scores[1]#about two percentage points higher than the MLP
0.82084
IMDb sentiment analysis with an LSTM
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
model_lstm=Sequential()
model_lstm.add(Embedding(output_dim=32,input_dim=2000,input_length=100))
model_lstm.add(Dropout(0.25))
model_lstm.add(LSTM(32))
model_lstm.add(Dense(units=256,activation='relu'))
model_lstm.add(Dropout(0.25))
model_lstm.add(Dense(units=1,activation='sigmoid'))
model_lstm.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_2 (Embedding) (None, 100, 32) 64000
_________________________________________________________________
dropout_3 (Dropout) (None, 100, 32) 0
_________________________________________________________________
lstm_1 (LSTM) (None, 32) 8320
_________________________________________________________________
dense_3 (Dense) (None, 256) 8448
_________________________________________________________________
dropout_4 (Dropout) (None, 256) 0
_________________________________________________________________
dense_4 (Dense) (None, 1) 257
=================================================================
Total params: 81,025
Trainable params: 81,025
Non-trainable params: 0
_________________________________________________________________
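The LSTM's 8320 parameters are exactly four times a SimpleRNN-style block of the same size, one block per gate (input, forget, output) plus the cell candidate:
assert 4*(32*32+32*32+32)==8320#4 x ((input_dim x units)+(units x units)+units)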
model_lstm.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
train_history=model_lstm.fit(_train,x_train,batch_size=100,epochs=10,verbose=2,validation_split=0.2)
scores=model_lstm.evaluate(_test,x_test)
25000/25000 [==============================] - 13s 522us/step
scores[1]
0.82084
The score is about the same as the RNN's. One possible reason is that the time spans in review data are short (each input is truncated to 100 tokens), so the LSTM's strength at remembering long-range dependencies has little chance to show.