TF-IDF, word2vec, Machine Learning and Deep Learning Algorithms Compared on Sina News Classification

小标 2018-10-25

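Abstract: This article compares TF-IDF and word2vec features, combined with classical machine learning and deep learning algorithms, on the task of classifying Sina news, and walks through the implementation in detail.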

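System development tools and platform

For this design, Python was chosen as the main development language. Because it covers data analysis, machine learning and web development alike, it shortened development considerably and left more time for refining the logic and tuning the algorithms.

The overall tools and environment are as follows:

·Development language: Python; database: MongoDB

·Editors: PyCharm, Sublime Text 3

·Data analysis: Pandas, NumPy, SciPy, Matplotlib

·Word segmentation: jieba

·Machine learning libraries: scikit-learn, TensorFlow, Keras, gensim

·Server framework: Django

·Front end: CSS, JavaScript, Ajax, Vue, jQuery

Hardware used to run the experiments:

CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz (3601 MHz)

Memory: 8.00 GB (2400 MHz)

Graphics: Intel(R) HD Graphics 630 (1024 MB) (this project runs on the CPU)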

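Crawler module implementation

As shown in the figure, the crawler module is implemented in the following 5 steps:

1. Use Python's built-in urllib library to send HTTP requests to Sina News and retrieve the data behind its API. The response is a JSON-encoded batch of news records containing the title, publication time, link, number of comments, source and other fields.

2. Use Python's built-in json library to parse the JSON data and extract the title, publication time, number of comments and link of each news item.

3. Asynchronously visit each news link parsed in the previous step and scrape the corresponding article body.

4. Store all of the data in a MongoDB database, which provides the dataset for the later machine learning models.

5. Use multiprocessing to start a process pool and threading to start multiple threads, looping over the result pages until all required records have been fetched.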
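The fetch, parse and parallelise steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the JSON field names (`result`, `data`, `ctime`, `comment_total`), the example URL and the helper names are hypothetical, since the article does not give the real API layout, and `fetch_page` is a stand-in that returns a canned payload instead of calling `urllib.request.urlopen`.

```python
import json
from multiprocessing.pool import ThreadPool

# Hypothetical payload; the real Sina API field names are not given in the article.
SAMPLE_PAGE = json.dumps({
    "result": {"data": [
        {"title": "Example headline", "ctime": "1540425600",
         "url": "https://example.com/news/1", "comment_total": 12},
    ]}
})


def fetch_page(page):
    """Stand-in for an urllib request to the news API; returns a JSON string."""
    return SAMPLE_PAGE


def parse_page(raw):
    """Step 2: pull the fields of interest out of one JSON page."""
    items = json.loads(raw)["result"]["data"]
    return [{"title": it["title"], "time": it["ctime"],
             "url": it["url"], "comments": it["comment_total"]}
            for it in items]


def crawl(pages, workers=4):
    """Step 5: fetch pages concurrently with a thread pool, then flatten the records."""
    with ThreadPool(workers) as pool:
        raw_pages = pool.map(fetch_page, pages)
    records = []
    for raw in raw_pages:
        records.extend(parse_page(raw))
    return records


records = crawl(range(1, 4))
print(len(records), records[0]["title"])
```

In a real crawler the records would then be written to MongoDB, and the thread pool could itself run inside a process pool, as step 5 describes.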

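Because crawling through the API is efficient, and is combined with multithreading and multiprocessing, the crawler module can collect over a million news records in half an hour. A sample of the crawled data is shown in the figure below:

(figure)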

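Preprocessing module implementation

The news records are read from the database into memory with pandas as a DataFrame for preprocessing. Some of the crawled records have no article body; these are removed from the 1.02 million records, leaving 480,000 usable news items.

These 480,000 items are grouped by category, giving 15 classes in total. The per-category counts are shown in the table:

Labels 14 and 15 contain only a few hundred items each, labels 8 and 11 contain none at all, and the overall distribution is very uneven. Taking this into account, the remaining 11 labels were kept and 2,000 news items were randomly sampled from each.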
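A minimal sketch of this balanced sampling step, using only the standard library rather than pandas. The function name, the `min_count` threshold and the toy rows are assumptions made for illustration; the article only states that sparse labels were dropped and that 2,000 items were drawn per remaining label.

```python
import random
from collections import defaultdict


def sample_per_label(rows, per_label=2000, min_count=1000, seed=0):
    """Group (label, text) rows by label, drop sparse labels, sample per_label each."""
    groups = defaultdict(list)
    for label, text in rows:
        groups[label].append(text)
    rng = random.Random(seed)
    sampled = {}
    for label, texts in groups.items():
        if len(texts) < min_count:       # hypothetical cut-off for sparse labels
            continue
        k = min(per_label, len(texts))   # a smaller class simply keeps every item
        sampled[label] = rng.sample(texts, k)
    return sampled


# Toy data: label 1 is large, label 2 is medium, label 14 is too sparse to keep.
rows = ([(1, f"a{i}") for i in range(50)]
        + [(2, f"b{i}") for i in range(9)]
        + [(14, f"c{i}") for i in range(3)])
balanced = sample_per_label(rows, per_label=10, min_count=8)
print({label: len(texts) for label, texts in balanced.items()})
```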

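After word segmentation, stop words need to be removed. The figure shows the word frequencies (top 10) before stop-word removal:

Before stop-word removal the documents are dominated by words such as ‘的’, ‘在’ and ‘是’, which contribute very little to classification. After removing stop words we obtain the following result: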
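The effect of stop-word removal on the frequency ranking can be reproduced on a toy example. The sketch assumes the text has already been segmented (for instance with jieba); the three-word stop list and the sample tokens are illustrative only.

```python
from collections import Counter

STOPWORDS = {"的", "在", "是"}  # tiny illustrative stop list


def top_words(tokens, stopwords=frozenset(), n=10):
    """Count tokens, skipping stop words, and return the n most frequent."""
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


tokens = ["新闻", "的", "分类", "是", "新闻", "的", "任务", "在", "的"]
before = top_words(tokens)
after = top_words(tokens, STOPWORDS)
print(before[0], after[0])
```

Before filtering, the most frequent token is the function word ‘的’; afterwards the ranking is led by a content word.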

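4.5 Classifier module implementation

1 CNN

The CNN model is built with Keras, a third-party Python library that provides a convenient high-level interface on top of TensorFlow. It lowers the learning curve to some extent: with Keras a neural network can be assembled quickly, without spending a lot of time learning TensorFlow's lower-level programming model for encoding, decoding and feature construction.

For the configuration we chose:
a maximum length of 100 for each news item,
a word-vector dimension of 200,
20% of the data as the test set,
and 16% of the data as the validation set.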
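As a quick check of what these ratios mean in practice, the following sketch computes the split boundaries for the dataset described in this article (ten categories of 2,000 items plus one category of 1,924). The function name is an assumption; the arithmetic mirrors the index-based slicing used in the source listings.

```python
def split_points(n, validation_split=0.16, test_split=0.2):
    """Return the indices where the validation and test portions start."""
    p1 = int(n * (1 - validation_split - test_split))
    p2 = int(n * (1 - test_split))
    return p1, p2


n_docs = 10 * 2000 + 1924          # 11 categories, one of them with 1,924 items
p1, p2 = split_points(n_docs)
sizes = (p1, p2 - p1, n_docs - p2)  # train, validation, test
print(sizes)
```

Slicing by position like this only gives a fair split if the rows are in random order, so the sketch assumes the data was shuffled beforehand.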

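a. Steps for training the CNN model without word2vec:

·Use Tokenizer to extract features from all of the text, converting each news text into a sequence of word indices.

·Split the data into training, validation and test sets according to the configured ratios.

·Use an embedding layer to map each index sequence to a dense, low-dimensional representation, producing a 100*200 two-dimensional matrix per news item.

· Add one convolution layer and one pooling layer to shorten the sequence, a Flatten layer to collapse the 2-D output to 1-D, and finally two Dense layers that reduce the vector length to 11, matching the 11 categories of the news dataset.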
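To make the layer stack concrete, the output sizes and parameter counts can be worked out by hand. The sketch below assumes the hyperparameters from the CNN source listing in this article (250 filters, kernel size 3, pool size 3, a 200-unit Dense layer) and the vocabulary size of 65,604 words reported for the embedding matrix; it is plain arithmetic, not a Keras call.

```python
def conv1d_len(length, kernel, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (length - kernel) // stride + 1


def pool1d_len(length, pool):
    """Output length of max pooling whose stride equals the pool size."""
    return (length - pool) // pool + 1


VOCAB = 65604 + 1            # +1 for the padding index 0
SEQ_LEN, EMB_DIM = 100, 200
FILTERS, KERNEL, POOL, CLASSES = 250, 3, 3, 11

emb_params = VOCAB * EMB_DIM                       # embedding table
conv_len = conv1d_len(SEQ_LEN, KERNEL)             # sequence length after Conv1D
conv_params = KERNEL * EMB_DIM * FILTERS + FILTERS
pool_len = pool1d_len(conv_len, POOL)              # sequence length after pooling
flat = pool_len * FILTERS                          # size after Flatten
dense1_params = flat * EMB_DIM + EMB_DIM           # Dense(200)
dense2_params = EMB_DIM * CLASSES + CLASSES        # Dense(11)
total = emb_params + conv_params + dense1_params + dense2_params

print(emb_params, conv_len, pool_len, flat, total)
```

Under these assumptions the embedding table accounts for about 13.12 million of the roughly 14.9 million parameters, which is why replacing it with pretrained vectors changes the training cost noticeably.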

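The experimental results are as follows:

Training accuracy: 0.8652

Test accuracy: 0.81450513

Time: 96 s

A simply constructed neural network reaches 81% on the 11 news categories, but the gap to the training accuracy of 86.5% indicates some overfitting. Overall, training takes rather long and efficiency is low.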

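b. Steps for the CNN model with word2vec:

· Use a pretrained word2vec model in place of the 13.12 million trainable parameters of the embedding layer. The resulting embedding matrix is 65604 x 200, where 65604 is the number of distinct words.

· The remaining steps are the same as above.

Experimental results:

Training accuracy: 0.8629

Test accuracy: 0.8262257

Time: 77 s

The model has the same shape as before, the overfitting is reduced, and the test accuracy rises from 81% to 82.6%. This suggests that word2vec, which captures semantic relationships between words, can improve both the accuracy and the running time of the model to some extent.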

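2 LSTM

LSTM is one of the deep learning models comparatively well suited to text classification. The LSTM network is built with Keras in the same way as above.

a. Training the LSTM model without word2vec

The steps and parameters for building the LSTM network are the same as for the CNN and are not repeated here.

Training accuracy: 0.7899

Test accuracy: 0.7539733

Time: 161.9 s

Because the training set is fairly small, the LSTM does not show its usual advantage in natural language processing. Training also takes about 1.7 times as long as the CNN (161.9 s versus 96 s), which makes it somewhat inefficient.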

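b. LSTM model with word2vec:

Training accuracy: 0.8746

Test accuracy: 0.821892816

Time: 162 s

This indirectly shows how much the amount of text matters for the LSTM: the large number of pretrained word2vec parameters effectively enlarges the corpus the model draws on, which raises the accuracy.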

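3. Naive Bayes

The Bayes models are built with the Python machine learning library scikit-learn. With a classical machine learning model there is no need to turn each document into a two-dimensional matrix of word vectors and train on text as if it were an image; after processing, each document is a 65604*1 vector.

a. Building the Bayes model with TF-IDF:

TF-IDF is used for feature extraction. CountVectorizer is used to build the vocabulary, which is the set of all words that appear in the documents.

The training and test sets are then each transformed with it:

Because the dataset was split in advance, care is needed with the position of each word in the vocabulary when extracting TF-IDF features. The vectorizer should be fitted on the whole corpus and then applied to both the training and the test set. If the two sets are fitted separately, the same column index refers to different words in the training and test matrices, which causes large errors. Finally, a naive Bayes model is trained and its accuracy is checked on the test set.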
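The index-consistency point can be illustrated without scikit-learn. The sketch below builds one vocabulary for the whole corpus, as the text describes, and vectorises both sets against it; the helper names, the smoothed IDF formula and the toy documents are assumptions for illustration, not the article's code.

```python
import math
from collections import Counter


def build_vocab(docs):
    """Assign each distinct token a column index in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def idf_weights(docs, vocab):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}


def tfidf_vector(doc, vocab, idf):
    """Map one tokenised document onto the shared vocabulary columns."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(doc).items():
        if tok in vocab:
            vec[vocab[tok]] = count * idf[tok]
    return vec


train = [["股市", "上涨"], ["球队", "夺冠"]]
test = [["球队", "上涨"]]

vocab = build_vocab(train + test)          # one vocabulary for everything
idf = idf_weights(train + test, vocab)
x_train = [tfidf_vector(d, vocab, idf) for d in train]
x_test = [tfidf_vector(d, vocab, idf) for d in test]

separate = build_vocab(test)               # what fitting the test set alone gives
print(vocab["球队"], separate["球队"])
```

With a separately fitted vocabulary the same word lands in a different column, which is exactly the mismatch the paragraph warns about. Fitting on the training set only and reusing that vocabulary for the test set would keep the columns aligned as well.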

Despite its simplicity, the Bayes model is slightly more accurate than the LSTM, and it runs in under 2 s, roughly one-ninetieth of the LSTM's running time.

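b. Training the Bayes model with word2vec:

The configuration is: a maximum news length of 100, a context window of size 4, multi-core CPU acceleration, skip-gram for feature extraction, and 10 training iterations.

Text matrix after word2vec processing:

Each word now maps to a large number of weights, and word2vec gives the resulting matrix continuous values. The count-based naive Bayes model cannot handle such continuous vectors, so Gaussian naive Bayes is used instead, under the assumption that the features are normally distributed.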
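For readers unfamiliar with the Gaussian variant, here is a minimal standard-library sketch of the idea: each class stores a per-feature mean and variance, and prediction picks the class with the highest log posterior. The two-dimensional toy vectors, the label names and the `eps` variance floor are assumptions; the article does not say how word vectors were combined into document vectors, and its experiments use scikit-learn rather than this code.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate the log prior and per-feature mean/variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(col) / n for col in cols]
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(cols, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one feature vector."""
    def log_posterior(params):
        log_prior, means, variances = params
        log_lik = sum(-0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                      for x, m, var in zip(row, means, variances))
        return log_prior + log_lik
    return max(model, key=lambda label: log_posterior(model[label]))


# Hypothetical 2-D document vectors; real word2vec features would have 200 dimensions.
X = [[0.9, -0.2], [1.1, 0.1], [-1.0, 0.8], [-0.8, 1.2]]
y = ["finance", "finance", "sports", "sports"]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.0, 0.0]), predict_gaussian_nb(model, [-0.9, 1.0]))
```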

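The training results are as follows:

The accuracy is surprisingly low, and training takes 22 times as long as the naive Bayes model. Intuitively, since naive Bayes is already fairly accurate and word2vec compensates for its word-independence assumption, the accuracy should have gone up. The experimental results in the next chapter discuss this question.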

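4. SVM

The SVM follows the same approach as naive Bayes. Note that the project uses an SVM with a linear kernel, chosen after comparing it with the Gaussian and polynomial kernels.

a. Building the SVM model with TF-IDF:

The features are built in the same way, so no further explanation is given here; see the naive Bayes section above for details.

Accuracy: 84.4%. This is slightly below naive Bayes, but the overall result is still good.

b. Building the SVM model with word2vec:

The accuracy is only 77.08%, although this is already much better than word2vec with the Gaussian Bayes model. With the same parameters the running time reaches 134.9 s, which is very inefficient.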

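4.4 System interface implementation

The system interface is developed with the Python Django server framework, with Python's built-in sqlite3 as the database. The front end uses JavaScript and CSS to build a fairly minimal UI. The home page has a publish button.

Clicking the publish button (1) opens the dialog shown in the figure:

In the dialog, a URL can be entered in the add-link field (2). Because the system was developed for Sina News, it currently only supports extracting content from Sina News links. After adding a link, clicking "get title" (3) automatically crawls the corresponding news title and summary (4); then click the publish button (5). The back end receives the title and summary, segments them into words, and matches the words against the vocabulary and TF-IDF values saved during feature extraction to convert the text into a feature vector. This vector is passed to the offline-trained Bayes model, which computes the probability of each category and thereby predicts the class.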
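The back-end prediction path (tokens, then a vector over the saved vocabulary, then class probabilities from the trained Bayes model) might look roughly like this. Everything here is a simplified stand-in: the four-word vocabulary, the two categories and the helper names are hypothetical, raw counts are used instead of TF-IDF weights, and the real system loads a scikit-learn model trained offline.

```python
import math

VOCAB = {"股市": 0, "上涨": 1, "球队": 2, "夺冠": 3}  # hypothetical saved vocabulary


def vectorize(tokens):
    """Count tokens over the saved vocabulary; unknown words are ignored."""
    vec = [0.0] * len(VOCAB)
    for tok in tokens:
        if tok in VOCAB:
            vec[VOCAB[tok]] += 1.0
    return vec


def fit_multinomial_nb(X, y, alpha=1.0):
    """Offline training: log prior and Laplace-smoothed log word probabilities."""
    model = {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + alpha * len(totals)
        log_prob = [math.log((t + alpha) / denom) for t in totals]
        model[c] = (math.log(len(rows) / len(X)), log_prob)
    return model


def predict_proba(model, row):
    """Online prediction: normalised posterior probability of every class."""
    scores = {c: prior + sum(x * lp for x, lp in zip(row, log_prob))
              for c, (prior, log_prob) in model.items()}
    top = max(scores.values())
    expd = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(expd.values())
    return {c: v / z for c, v in expd.items()}


X = [vectorize(["股市", "上涨"]), vectorize(["球队", "夺冠"])]
y = ["finance", "sports"]
model = fit_multinomial_nb(X, y)
probs = predict_proba(model, vectorize(["球队", "夺冠", "未知词"]))
print(max(probs, key=probs.get))
```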

The figure above shows that the Bayes algorithm is quite accurate and classifies the news well.

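Chapter 5 Analysis of experimental results

5.1 System evaluation metrics (ROC, AUC, training time)

a. ROC, AUC

Before introducing the evaluation metrics, we first define four concepts for a given news category:

·True Positive (TP): the number of items predicted as this category whose true category is also this category.

·False Positive (FP): the number of items predicted as this category whose true category is a different one;

·True Negative (TN): the number of items predicted as not this category whose true category is indeed not this category;

·False Negative (FN): the number of items predicted as not this category whose true category is in fact this category;

From these we can compute the precision (precision rate) and the recall (recall rate): precision = TP/(TP+FP) and recall = TP/(TP+FN).

Taking the table above as an example, 200 items actually belong to the category: TP is 170, FN is 30, FP is 300, and TN is 2000.

Then precision = 170/(170+300) = 36.17% and recall = 170/(170+30) = 85%.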
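The arithmetic can be checked with a few lines of Python, using the counts from the worked example (170 true positives, 30 false negatives, 300 false positives, 2000 true negatives). The function names are illustrative; the false positive rate is included because it is the x-axis of a ROC curve.

```python
def precision(tp, fp):
    """Share of predicted positives that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were found (the true positive rate)."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Share of actual negatives that were wrongly flagged."""
    return fp / (fp + tn)


tp, fn, fp, tn = 170, 30, 300, 2000
print(f"precision = {precision(tp, fp):.2%}")
print(f"recall    = {recall(tp, fn):.2%}")
print(f"FPR       = {false_positive_rate(fp, tn):.2%}")
```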

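The ROC curve plots the true positive rate (recall) against the false positive rate, FP/(FP+TN), as the decision threshold varies. The closer the ROC curve is to the top-left corner, the better the classifier. AUC is the area under the ROC curve; the larger the AUC, the better the classifier.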
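A small hand-rolled sketch shows how a ROC curve and its AUC come out of a list of scores for one category. It sweeps the threshold from the highest score downwards and integrates with the trapezoidal rule; the labels and scores are made-up toy values, and the source listings later in this article use scikit-learn's `roc_curve` and `auc` for the real experiments.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold score by score."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Items with the same score cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
        i = j
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))


labels = [1, 1, 0, 1, 0, 0]                 # 1 = belongs to the category
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]     # predicted probability of the category
fpr, tpr = roc_points(labels, scores)
area = auc(fpr, tpr)
print(round(area, 4))
```

For a multi-class problem such as the 11 news categories, this is computed once per category by treating that category as positive and all others as negative.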

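5.2 Dataset used to fit the algorithms

So that every category is represented evenly, 2,000 news items were randomly sampled from each category to fit the algorithms. Only 1,924 items were crawled for the culture category, so all 1,924 were used as its sample.

For the deep learning algorithms (CNN, LSTM), 20% of the data is used as the test set and 16% as the validation set.

For the classical machine learning algorithms, 80% of the data is used as the training set and 20% as the validation set.

5.3 Evaluation of the classification algorithms

In terms of training time, naive Bayes with TF-IDF needs only 2 s to fit a model on more than 14,000 items. Comparing the ROC plots of the different models, the naive Bayes ROC curve is overall the closest to the top-left corner, and naive Bayes also has the largest AUC, so it gives the best classification performance.

On this basis, naive Bayes with TF-IDF is the most suitable model for the news classifier in this project. Its accuracy is 0.85430157, and it can also handle classification of large volumes of news data.

Source code:

The word vectors used by the deep learning models were trained on a Wikipedia corpus.

CNN: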

#coding:utf-8
import sys
import keras
import matplotlib.pyplot as plt

VECTOR_DIR = 'vectors.bin'

MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 200
VALIDATION_SPLIT = 0.16
TEST_SPLIT = 0.2

print ('(1) load texts...')
train_texts = open('train_contents.txt',encoding='utf-8').read().split('\n')
train_labels = open('train_labels.txt',encoding='utf-8').read().split('\n')
test_texts = open('test_contents.txt',encoding='utf-8').read().split('\n')
test_labels = open('test_labels.txt',encoding='utf-8').read().split('\n')
all_texts = train_texts + test_texts
all_labels = train_labels + test_labels

print ('(2) doc to var...')
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np

tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_texts)
sequences = tokenizer.texts_to_sequences(all_texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(all_labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

print ('(3) split data set...')
# split the data into training set, validation set, and test set
p1 = int(len(data)*(1-VALIDATION_SPLIT-TEST_SPLIT))
p2 = int(len(data)*(1-TEST_SPLIT))
x_train = data[:p1]
y_train = labels[:p1]
x_val = data[p1:p2]
y_val = labels[p1:p2]
x_test = data[p2:]
y_test = labels[p2:]
print ('train docs: '+str(len(x_train)))
print ('val docs: '+str(len(x_val)))
print ('test docs: '+str(len(x_test)))

print ('(5) training model...')
from keras.layers import Dense, Input, Flatten, Dropout
from keras.layers import Conv1D, MaxPooling1D, Embedding, GlobalMaxPooling1D
from keras.models import Sequential

model = Sequential()
model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(Dropout(0.2))
model.add(Conv1D(250, 3, padding='valid', activation='relu', strides=1))
model.add(MaxPooling1D(3))
model.add(Flatten())
model.add(Dense(EMBEDDING_DIM, activation='relu'))
model.add(Dense(labels.shape[1], activation='softmax'))
model.summary()
#plot_model(model, to_file='model.png',show_shapes=True)

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
print (model.metrics_names)
# fit() returns a History object; the original passed an undefined `history` as a callback
history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2, batch_size=128)
#model.save('cnn.h5')

print ('(6) testing model...')
print (model.evaluate(x_test, y_test))

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve,auc
import numpy as np
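from numpy import interp  # recent SciPy releases no longer export interp; numpy.interp takes the same arguments
from itertools import cycle  # needed for the colour cycle in the ROC plot below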

y_score  = model.predict(x_test)
lw = 2
n_classes = 11
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# Compute macro-average ROC curve and ROC area

# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

# Then interpolate all ROC curves at this points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])

# Finally average it and compute AUC
mean_tpr /= n_classes

fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)

plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)

colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Some extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()

CNN+word2vec

#coding:utf-8
import sys
import keras

VECTOR_DIR = 'vectors.bin'

MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 128  # must equal the vector size stored in vectors.bin (the article text reports 200)
VALIDATION_SPLIT = 0.16
TEST_SPLIT = 0.2

print ('(1) load texts...')
train_texts = open('train_contents.txt',encoding='utf-8').read().split('\n')
train_labels = open('train_labels.txt',encoding='utf-8').read().split('\n')
test_texts = open('test_contents.txt',encoding='utf-8').read().split('\n')
test_labels = open('test_labels.txt',encoding='utf-8').read().split('\n')
all_texts = train_texts + test_texts
all_labels = train_labels + test_labels

print ('(2) doc to var...')
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np

tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_texts)
sequences = tokenizer.texts_to_sequences(all_texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(all_labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

print ('(3) split data set...')
# split the data into training set, validation set, and test set
p1 = int(len(data)*(1-VALIDATION_SPLIT-TEST_SPLIT))
p2 = int(len(data)*(1-TEST_SPLIT))
x_train = data[:p1]
y_train = labels[:p1]
x_val = data[p1:p2]
y_val = labels[p1:p2]
x_test = data[p2:]
y_test = labels[p2:]
print ('train docs: '+str(len(x_train)))
print ('val docs: '+str(len(x_val)))
print ('test docs: '+str(len(x_test)))

print ('(4) load word2vec as embedding...')

import gensim
from keras.utils import plot_model
w2v_model = gensim.models.KeyedVectors.load_word2vec_format(VECTOR_DIR, binary=True)
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
not_in_model = 0
in_model = 0
for word, i in word_index.items():
    if word in w2v_model:
        in_model += 1
        embedding_matrix[i] = np.asarray(w2v_model[word], dtype='float32')
    else:
        not_in_model += 1
print (str(not_in_model)+' words not in w2v model')
from keras.layers import Embedding
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

print ('(5) training model...')
from keras.layers import Dense, Input, Flatten, Dropout
from keras.layers import Conv1D, MaxPooling1D, Embedding, GlobalMaxPooling1D
from keras.models import Sequential

model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(Conv1D(250, 3, padding='valid', activation='relu', strides=1))
model.add(MaxPooling1D(3))
model.add(Flatten())
model.add(Dense(EMBEDDING_DIM, activation='relu'))
model.add(Dense(labels.shape[1], activation='softmax'))
model.summary()
#plot_model(model, to_file='model.png',show_shapes=True)

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
print( model.metrics_names)
# validate on the validation split so the test set is only used for the final evaluation
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2, batch_size=128)
model.save('word_vector_cnn.h5')

print ('(6) testing model...')
print (model.evaluate(x_test, y_test))



import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve,auc
import numpy as np
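from numpy import interp  # recent SciPy releases no longer export interp; numpy.interp takes the same arguments
from itertools import cycle  # needed for the colour cycle in the ROC plot below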

y_score  = model.predict(x_test)
lw = 2
n_classes = 11
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# Compute macro-average ROC curve and ROC area

# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

# Then interpolate all ROC curves at this points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])

# Finally average it and compute AUC
mean_tpr /= n_classes

fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)

plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)

colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Some extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()

LSTM

#coding:utf-8

import keras
import matplotlib.pyplot as plt
VECTOR_DIR = 'vectors.bin'

MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 200
VALIDATION_SPLIT = 0.16
TEST_SPLIT = 0.2

print ('(1) load texts...')
train_texts = open('train_contents.txt',encoding='utf-8').read().split('\n')
train_labels = open('train_labels.txt',encoding='utf-8').read().split('\n')
test_texts = open('test_contents.txt',encoding='utf-8').read().split('\n')
test_labels = open('test_labels.txt',encoding='utf-8').read().split('\n')
all_texts = train_texts + test_texts
all_labels = train_labels + test_labels

print ('(2) doc to var...')
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np

tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_texts)
sequences = tokenizer.texts_to_sequences(all_texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(all_labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

print ('(3) split data set...')
p1 = int(len(data)*(1-VALIDATION_SPLIT-TEST_SPLIT))
p2 = int(len(data)*(1-TEST_SPLIT))
x_train = data[:p1]
y_train = labels[:p1]
x_val = data[p1:p2]
y_val = labels[p1:p2]
x_test = data[p2:]
y_test = labels[p2:]
print ('train docs: '+str(len(x_train)))
print ('val docs: '+str(len(x_val)))
print ('test docs: '+str(len(x_test)))

from keras.layers import Dense, Input, Flatten, Dropout
from keras.layers import LSTM, Embedding
from keras.models import Sequential

model = Sequential()
model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(LSTM(200, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(Dense(labels.shape[1], activation='softmax'))
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

history = model.fit(x_train, y_train,validation_data=(x_val, y_val), epochs=2, batch_size=128)
#model.save('lstm.h5')

print (model.evaluate(x_test, y_test))

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve,auc
import numpy as np
from numpy import interp  # recent SciPy releases no longer export interp; numpy.interp takes the same arguments

y_score  = model.predict(x_test)
lw = 2
n_classes = 11
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# Compute macro-average ROC curve and ROC area

# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

# Then interpolate all ROC curves at this poin