Implementing Deep Learning for Sentiment Analysis

Johann Mitloehner, 2018

  • Sentiment Detection
  • Data Preparation
  • Word Embeddings
  • Two-Layer Feed-Forward
  • Keras
In [91]:
import gzip
from nltk import word_tokenize
import numpy as np
import sys

GloVe Word Embeddings

  • read lines from glove file
  • split into word and vector
  • store in dict
In [92]:
def readglove():
  print 'reading glove..'
  glove = {}
  for line in gzip.open('glove.txt.gz'):
    wds = line.strip().split()
    glove[wds[0]] = [ float(x) for x in wds[1:] ]
  print 'done.'
  return glove
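
For reference, each line of glove.txt.gz is expected to hold a word followed by its vector components separated by spaces (200 of them for the embeddings used below); the parsing then mirrors readglove. A minimal sketch with made-up, truncated values:

In [ ]:
# illustrative sketch of the expected line format (values made up, vector truncated)
sample = 'the 0.418 0.250 -0.412 0.122'
wds = sample.strip().split()
print(wds[0])                           # the word
print([ float(x) for x in wds[1:] ])    # its embedding vector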

Task 4 of SemEval 2017

  • approx 50000 tweets
  • user, sentiment, text (tab-separated; see the peek below)
  • manually annotated sentiment via CrowdFlower
    • positive
    • negative
    • neutral
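
The layout can be checked by peeking at the first few lines of the input file; a minimal sketch, assuming the tweets.txt file name used further below:

In [ ]:
# peek at the first few lines of the tab-separated input (user/id, sentiment, quoted text)
for i, line in enumerate(open('tweets.txt')):
  print(line.strip()[:60])
  if i >= 2: break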

Embed a single word

In [93]:
# retrieve word embedding, or return 0s if not found
def emb(glove, w):
  if w in glove: return glove[w]
  else: return [ 0.0 for x in glove['the'] ]
In [94]:
def embtoks(glove, toks):
  # average the embeddings of the tokens found in the glove dict; zeros if none found
  e = [ glove[t] for t in toks if t in glove ]
  if len(e) == 0:
    return np.asarray([ 0.0 for x in glove['the'] ])
  else:
    return np.asarray(e).sum(axis=0) / len(e)
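
A small usage sketch with a toy two-dimensional embedding dict (values made up); tokens missing from the dict are skipped and the remaining vectors are averaged:

In [ ]:
# toy usage of embtoks; 'unknownword' is skipped, the other two vectors are averaged
toy = { 'the': [0.1, 0.2], 'cat': [0.3, -0.1] }
print(embtoks(toy, ['the', 'cat', 'unknownword']))   # [ 0.2   0.05]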

Embed tweets

  • get glove dict
  • read tweets from file
  • store the sentiment
  • tokenize the tweet text
  • embed each token
  • sum them and divide by the count to get the averaged tweet embedding
  • append to training data: X for input, y for correct response
In [95]:
# read tweets from file and create numeric X, y training data
def embed(filename):
  glove = readglove()
  X = []
  y = []
  sents = { 'negative': 0, 'neutral': 1, 'positive': 2 }
  print "embedding.."
  for line in open(filename):
    wds = line.strip().split("\t")
    if len(wds) == 3:
      sent = wds[1]
      tweet = wds[2].strip('"')
      e = np.asarray([ 0.0 for x in glove['the'] ])
      toks = word_tokenize(tweet)
      n = 1 # start at 1 to avoid division by zero for tweets with no known words
      for w in [ x.lower() for x in toks ]:
        if w in glove:
          e += glove[w]
          n += 1
      X += [ e/n ] # embedded tweet
      y += [ sents[sent] ] # encoded sentiment
      if len(X) <= 10: # check a few lines
        print 'tweet:', line[:60]
        print 'embedding:', X[-1][:5], '...  sentiment:', y[-1]
  print "done."
  return np.asarray(X), np.asarray(y)

Embed the tweets

In [96]:
x, y = embed('tweets.txt')
reading glove..
done.
embedding..
tweet: 100000794790727680	positive	One Night like In Vegas I make d
embedding: [-0.05292273  0.19408755  0.06798509  0.10263991 -0.20382291] ...  sentiment: 2
tweet: 100000831528632320	positive	Walking through Chelsea at this 
embedding: [ 0.03866978 -0.0181638   0.15413442 -0.00767709 -0.07895079] ...  sentiment: 2
tweet: 100000950005145600	neutral	"And on the very first play of th
embedding: [ 0.10328479  0.13142791  0.08679648  0.04456704  0.00694168] ...  sentiment: 1
tweet: 100000974885748736	neutral	"Drove the bike today, about 40 m
embedding: [ 0.10516658  0.07653628  0.00464372 -0.07494705 -0.00748709] ...  sentiment: 1
tweet: 100001038454624257	negative	looking at the temp outside....h
embedding: [ 0.05622543  0.06396979 -0.04814796 -0.05173475  0.02989893] ...  sentiment: 0
tweet: 100001071937748992	neutral	"RT @RedArmy49: Therefore i still
embedding: [ 0.13206838  0.14300169  0.01165134  0.02783608 -0.164611  ] ...  sentiment: 1
tweet: 100001160882176000	positive	"@QuietusCyn @ShatteredYuuki Yea
embedding: [ 0.0489815   0.07140772  0.0040643   0.01825142 -0.07752615] ...  sentiment: 2
tweet: 100001174920511489	negative	"Criticism of Rick Perry is ridi
embedding: [ 0.12544556  0.02385883 -0.12746172  0.10105752 -0.00330126] ...  sentiment: 0
tweet: 100001181002252288	positive	"My brothers GF is on me to use 
embedding: [ 0.01637573  0.13751979 -0.03177031 -0.02556538 -0.04039707] ...  sentiment: 2
tweet: 100001199650123776	negative	I'm stuck in London again... :( 
embedding: [ 0.03564071  0.05724262  0.2163214   0.01232763 -0.18059729] ...  sentiment: 0
done.

Softmax

Start with a two-layer feed-forward net

First layer: dense with ReLU activation

Second layer: dense with softmax

Predict class based on probability vector

In [97]:
# return class with highest prob
def softpred(x):
  return [ np.argmax(pr) for pr in x ]

Predict training batch X with given weights W and W2

  • hidden layer activation is ReLU on X and W dot product
  • output layer is exponential on dot product of hidden layer with W2
  • normalise by sum of scores to interpret as probabilities
In [98]:
def netpred(X, W, W2):
  hid = np.maximum(0, np.dot(X, W))
  escor = np.exp(np.dot(hid, W2))
  return escor / np.sum(escor, axis=1, keepdims=True), hid
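
A quick sanity check with random inputs and weights (sizes made up): every row of the returned probability matrix should sum to 1, and softpred picks the class with the highest probability.

In [ ]:
# sanity check on random data: each row of the probability matrix sums to 1
Xr = np.random.randn(4, 200)
Wr = 0.1 * np.random.randn(200, 50)
W2r = 0.1 * np.random.randn(50, 3)
probs_r, _ = netpred(Xr, Wr, W2r)
print(probs_r.sum(axis=1))   # all (approximately) 1
print(softpred(probs_r))     # predicted class per row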

Percentage of correct classifications

A small helper function that formats the accuracy as a percentage

In [99]:
# % correct classifications
def accur(scor, y):
  return '%.1f' % ((100.0 * sum(softpred(scor) == y )) / len(y))

Softmax classifier

  • get number of inputs and output classes from training data
  • initialise weights
  • perform stochastic gradient descent for given number of steps
    • choose random minibatch
    • compute the hidden layer activations and the class probabilities
    • compute the gradient
    • back-propagate the errors to get the weight gradients (see the worked example after the code)
  • print accuracy on whole input data
  • return weights
In [100]:
# introduce hidden layer, SGD
def netsoftmax(dataset, h, stepsize=0.5, steps=50):
  X_, y_ = dataset
  reg = 0.001
  D = len(X_[0])
  K = len(set(y_))
  print 'netsoftmax: number of classes =', K
  W = 0.1 * np.random.randn(D,h)
  W2 = 0.1 * np.random.randn(h, K)
  for step in range(steps):
    # minibatch
    ix = np.random.choice(len(y_), min(200, len(y_)), replace=False)
    y = y_[ix]
    X = X_[ix]
    probs, hid = netpred(X, W, W2)
    if (step % 1000) == 0: print 'accuracy on training set:', accur(probs, y)
    # gradient
    dscor = probs
    dscor[range(len(y)), y] -= 1
    dscor /= len(y)
    # backprop 
    dW2 = np.dot(hid.T, dscor)
    dhid = np.dot(dscor, W2.T)
    dhid[hid <= 0] = 0
    dW = np.dot(X.T, dhid)
    dW += reg * W
    dW2 += reg * W2
    # parameter update in the negative gradient direction to decrease loss
    W += -stepsize * dW
    W2 += -stepsize * dW2
  probs, hid = netpred(X_, W, W2)
  print 'accuracy on training set: ', accur(probs, y_)
  return W, W2
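
The dscor step above is the usual gradient of the softmax cross-entropy loss with respect to the class scores: the predicted probabilities, with 1 subtracted at the correct class, divided by the batch size. A tiny worked example with made-up numbers:

In [ ]:
# worked example of the softmax cross-entropy gradient w.r.t. the scores (one example)
p = np.array([[0.2, 0.5, 0.3]])   # predicted probabilities
yt = np.array([1])                # correct class
d = p.copy()
d[range(len(yt)), yt] -= 1
print(d)   # [[ 0.2 -0.5  0.3]]: the update raises the score of class 1, lowers the others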

Main function for Softmax

  • use the embedded tweets x, y from above
  • split into training and validation sets
  • train the softmax classifier
  • print accuracy on validation set
In [110]:
# split the embedded tweets, train the sentiment classifier
def mainsoftmax():
  n = int(len(x) * 0.8)
  x_train, y_train, x_val, y_val = x[:n], y[:n], x[n:], y[n:]
  print "size of training set:", len(x_train)
  print 'training netsoftmax with ReLU units in layer 1:'
  W, W2 = netsoftmax((x_train, y_train), 200, steps=10000, stepsize=0.1)
  probs, hid = netpred(x_val, W, W2)
  print 'accuracy on validation set: ', accur(probs, y_val)

Call mainsoftmax

In [115]:
np.random.seed(1337)
mainsoftmax()
size of training set: 39762
training netsoftmax with ReLU units in layer 1:
netsoftmax: number of classes = 3
accuracy on training set: 17.0
accuracy on training set: 61.5
accuracy on training set: 55.5
accuracy on training set: 71.0
accuracy on training set: 64.0
accuracy on training set: 64.0
accuracy on training set: 65.5
accuracy on training set: 68.5
accuracy on training set: 60.0
accuracy on training set: 66.0
accuracy on training set:  64.1
accuracy on validation set:  58.3

Keras

In [116]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop, SGD
import sys
import numpy as np

np.random.seed(1337)
batch_size = 128
epochs = 20

num_classes = len(set(y))
n = int(len(x) * 0.8)
x_train, y_train, x_val, y_val = x[:n], y[:n], x[n:], y[n:]

print 'training data sample:'
for i in range(5):
  print x_train[i][:3], '...  ', y_train[i]
print 'size of training set:  ', x_train.shape[0]
print 'size of validation set:', x_val.shape[0]

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_val = keras.utils.to_categorical(y_val, num_classes)

model = Sequential()
model.add(Dense(200, activation='relu', input_shape=(200,)))
model.add(Dropout(0.2))
model.add(Dense(200, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])
print 'train model..'
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_val, y_val))
score = model.evaluate(x_val, y_val, verbose=0)
print 'accuracy on validation set:', score[1]
training data sample:
[-0.05292273  0.19408755  0.06798509] ...   2
[ 0.03866978 -0.0181638   0.15413442] ...   2
[ 0.10328479  0.13142791  0.08679648] ...   1
[ 0.10516658  0.07653628  0.00464372] ...   1
[ 0.05622543  0.06396979 -0.04814796] ...   0
size of training set:   39762
size of validation set: 9941
train model..
Train on 39762 samples, validate on 9941 samples
Epoch 1/20
39762/39762 [==============================] - 2s 47us/step - loss: 0.8517 - acc: 0.5953 - val_loss: 0.8147 - val_acc: 0.6010
Epoch 2/20
39762/39762 [==============================] - 4s 95us/step - loss: 0.7798 - acc: 0.6382 - val_loss: 0.8350 - val_acc: 0.5868
Epoch 3/20
39762/39762 [==============================] - 4s 105us/step - loss: 0.7616 - acc: 0.6484 - val_loss: 0.8096 - val_acc: 0.6040
Epoch 4/20
39762/39762 [==============================] - 4s 104us/step - loss: 0.7478 - acc: 0.6582 - val_loss: 0.7930 - val_acc: 0.6117
Epoch 5/20
39762/39762 [==============================] - 4s 111us/step - loss: 0.7360 - acc: 0.6640 - val_loss: 0.8137 - val_acc: 0.6088
Epoch 6/20
39762/39762 [==============================] - 5s 127us/step - loss: 0.7261 - acc: 0.6701 - val_loss: 0.7980 - val_acc: 0.6111
Epoch 7/20
39762/39762 [==============================] - 5s 120us/step - loss: 0.7146 - acc: 0.6749 - val_loss: 0.8099 - val_acc: 0.6085
Epoch 8/20
39762/39762 [==============================] - 5s 118us/step - loss: 0.7069 - acc: 0.6780 - val_loss: 0.8476 - val_acc: 0.6073
Epoch 9/20
39762/39762 [==============================] - 5s 125us/step - loss: 0.6964 - acc: 0.6844 - val_loss: 0.8165 - val_acc: 0.6070
Epoch 10/20
39762/39762 [==============================] - 4s 98us/step - loss: 0.6883 - acc: 0.6872 - val_loss: 0.8021 - val_acc: 0.6146
Epoch 11/20
39762/39762 [==============================] - 3s 85us/step - loss: 0.6798 - acc: 0.6917 - val_loss: 0.8260 - val_acc: 0.6087
Epoch 12/20
39762/39762 [==============================] - 3s 86us/step - loss: 0.6713 - acc: 0.6965 - val_loss: 0.8233 - val_acc: 0.6129
Epoch 13/20
39762/39762 [==============================] - 4s 99us/step - loss: 0.6620 - acc: 0.6983 - val_loss: 0.8370 - val_acc: 0.6165
Epoch 14/20
39762/39762 [==============================] - 4s 107us/step - loss: 0.6541 - acc: 0.7043 - val_loss: 0.8499 - val_acc: 0.6087
Epoch 15/20
39762/39762 [==============================] - 4s 110us/step - loss: 0.6438 - acc: 0.7114 - val_loss: 0.8628 - val_acc: 0.6053
Epoch 16/20
39762/39762 [==============================] - 5s 124us/step - loss: 0.6341 - acc: 0.7143 - val_loss: 0.8422 - val_acc: 0.6036
Epoch 17/20
39762/39762 [==============================] - 5s 130us/step - loss: 0.6260 - acc: 0.7211 - val_loss: 0.8568 - val_acc: 0.5985
Epoch 18/20
39762/39762 [==============================] - 4s 95us/step - loss: 0.6185 - acc: 0.7249 - val_loss: 0.8634 - val_acc: 0.6082
Epoch 19/20
39762/39762 [==============================] - 4s 106us/step - loss: 0.6110 - acc: 0.7250 - val_loss: 0.8833 - val_acc: 0.6078
Epoch 20/20
39762/39762 [==============================] - 4s 108us/step - loss: 0.6030 - acc: 0.7315 - val_loss: 0.8864 - val_acc: 0.6076
accuracy on validation set: 0.607584750061
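
To classify a new tweet with the trained model, it has to be embedded the same way as the training data; a minimal sketch, with a made-up tweet text and the GloVe dict re-read via readglove:

In [ ]:
# sketch: embed a new tweet and predict its sentiment with the trained model
labels = [ 'negative', 'neutral', 'positive' ]   # matches the encoding used in embed()
glove = readglove()
tweet = "what a wonderful day"                   # made-up example text
e = np.asarray([ 0.0 for x in glove['the'] ])
n = 1
for w in [ t.lower() for t in word_tokenize(tweet) ]:
  if w in glove:
    e += glove[w]
    n += 1
probs = model.predict(np.asarray([ e / n ]))
print(labels[np.argmax(probs[0])])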
In [ ]: