Implementing Deep Learning for Sentiment Analysis

Johann Mitloehner, 2018

  • Sentiment Detection
  • Data Preparation
  • Word Embeddings
  • Two-Layer Feed-Forward
  • Keras
In [91]:
import gzip
from nltk import word_tokenize
import numpy as np
import sys

GloVe Word Embeddings

  • read lines from glove file
  • split into word and vector
  • store in dict
In [92]:
def readglove():
  print 'reading glove..'
  glove = {}
  for line in gzip.open('glove.txt.gz'):
    wds = line.strip().split()
    glove[wds[0]] = [ float(x) for x in wds[1:] ]
  print 'done.'
  return glove
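
For reference, each line of glove.txt.gz is expected to hold a word followed by its vector components separated by spaces (200 of them for the embeddings used below); the parsing then mirrors readglove. A minimal sketch with made-up, truncated values:

In [ ]:
# illustrative sketch of the expected line format (values made up, vector truncated)
sample = 'the 0.418 0.250 -0.412 0.122'
wds = sample.strip().split()
print(wds[0])                           # the word
print([ float(x) for x in wds[1:] ])    # its embedding vector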

Task 4 of SemEval 2017

  • approx 50000 tweets
  • user, sentiment, text (tab-separated; see the peek below)
  • manually annotated sentiment via CrowdFlower
    • positive
    • negative
    • neutral
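
The layout can be checked by peeking at the first few lines of the input file; a minimal sketch, assuming the tweets.txt file name used further below:

In [ ]:
# peek at the first few lines of the tab-separated input (user/id, sentiment, quoted text)
for i, line in enumerate(open('tweets.txt')):
  print(line.strip()[:60])
  if i >= 2: break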

Embed a single word

In [93]:
# retrieve word embedding, or return 0s if not found
def emb(glove, w):
  if w in glove: return glove[w]
  else: return [ 0.0 for x in glove['the'] ]
In [94]:
def embtoks(glove, toks):
  # average the embeddings of the tokens found in the glove dict; zeros if none found
  e = [ glove[t] for t in toks if t in glove ]
  if len(e) == 0:
    return np.asarray([ 0.0 for x in glove['the'] ])
  else:
    return np.asarray(e).sum(axis=0) / len(e)
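
A small usage sketch with a toy two-dimensional embedding dict (values made up); tokens missing from the dict are skipped and the remaining vectors are averaged:

In [ ]:
# toy usage of embtoks; 'unknownword' is skipped, the other two vectors are averaged
toy = { 'the': [0.1, 0.2], 'cat': [0.3, -0.1] }
print(embtoks(toy, ['the', 'cat', 'unknownword']))   # [ 0.2   0.05]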

Embed tweets

  • get glove dict
  • read tweets from file
  • store the sentiment
  • tokenize the tweet text
  • embed each token
  • sum them and divide by the count to get the averaged tweet embedding
  • append to training data: X for input, y for correct response
In [95]:
# read tweets from file and create numeric X, y training data
def embed(filename):
  glove = readglove()
  X = []
  y = []
  sents = { 'negative': 0, 'neutral': 1, 'positive': 2 }
  print "embedding.."
  for line in open(filename):
    wds = line.strip().split("\t")
    if len(wds) == 3:
      sent = wds[1]
      tweet = wds[2].strip('"')
      e = np.asarray([ 0.0 for x in glove['the'] ])
      toks = word_tokenize(tweet)
      n = 1 # start at 1 to avoid division by zero for tweets with no known words
      for w in [ x.lower() for x in toks ]:
        if w in glove:
          e += glove[w]
          n += 1
      X += [ e/n ] # embedded tweet
      y += [ sents[sent] ] # encoded sentiment
      if len(X) <= 10: # check a few lines
        print 'tweet:', line[:60]
        print 'embedding:', X[-1][:5], '...  sentiment:', y[-1]
  print "done."
  return np.asarray(X), np.asarray(y)

Embed the tweets

In [96]:
x, y = embed('tweets.txt')
reading glove..
done.
embedding..
tweet: 100000794790727680	positive	One Night like In Vegas I make d
embedding: [-0.05292273  0.19408755  0.06798509  0.10263991 -0.20382291] ...  sentiment: 2
tweet: 100000831528632320	positive	Walking through Chelsea at this 
embedding: [ 0.03866978 -0.0181638   0.15413442 -0.00767709 -0.07895079] ...  sentiment: 2
tweet: 100000950005145600	neutral	"And on the very first play of th
embedding: [ 0.10328479  0.13142791  0.08679648  0.04456704  0.00694168] ...  sentiment: 1
tweet: 100000974885748736	neutral	"Drove the bike today, about 40 m
embedding: [ 0.10516658  0.07653628  0.00464372 -0.07494705 -0.00748709] ...  sentiment: 1
tweet: 100001038454624257	negative	looking at the temp outside....h
embedding: [ 0.05622543  0.06396979 -0.04814796 -0.05173475  0.02989893] ...  sentiment: 0
tweet: 100001071937748992	neutral	"RT @RedArmy49: Therefore i still
embedding: [ 0.13206838  0.14300169  0.01165134  0.02783608 -0.164611  ] ...  sentiment: 1
tweet: 100001160882176000	positive	"@QuietusCyn @ShatteredYuuki Yea
embedding: [ 0.0489815   0.07140772  0.0040643   0.01825142 -0.07752615] ...  sentiment: 2
tweet: 100001174920511489	negative	"Criticism of Rick Perry is ridi
embedding: [ 0.12544556  0.02385883 -0.12746172  0.10105752 -0.00330126] ...  sentiment: 0
tweet: 100001181002252288	positive	"My brothers GF is on me to use 
embedding: [ 0.01637573  0.13751979 -0.03177031 -0.02556538 -0.04039707] ...  sentiment: 2
tweet: 100001199650123776	negative	I'm stuck in London again... :( 
embedding: [ 0.03564071  0.05724262  0.2163214   0.01232763 -0.18059729] ...  sentiment: 0
done.

Softmax

Start with a two-layer feed-forward net

First layer: dense with ReLU activation

Second layer: dense with softmax

Predict class based on probability vector

In [97]:
# return class with highest prob
def softpred(x):
  return [ np.argmax(pr) for pr in x ]

Predict training batch X with given weights W and W2

  • hidden layer activation is ReLU on X and W dot product
  • output layer is exponential on dot product of hidden layer with W2
  • normalise by sum of scores to interpret as probabilities
In [98]:
def netpred(X, W, W2):
  hid = np.maximum(0, np.dot(X, W))
  escor = np.exp(np.dot(hid, W2))
  return escor / np.sum(escor, axis=1, keepdims=True), hid
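
A quick sanity check with random inputs and weights (sizes made up): every row of the returned probability matrix should sum to 1, and softpred picks the class with the highest probability.

In [ ]:
# sanity check on random data: each row of the probability matrix sums to 1
Xr = np.random.randn(4, 200)
Wr = 0.1 * np.random.randn(200, 50)
W2r = 0.1 * np.random.randn(50, 3)
probs_r, _ = netpred(Xr, Wr, W2r)
print(probs_r.sum(axis=1))   # all (approximately) 1
print(softpred(probs_r))     # predicted class per row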

Percentage of correct classifications

A small helper function that formats the accuracy as a percentage

In [99]:
# % correct classifications
def accur(scor, y):
  return '%.1f' % ((100.0 * sum(softpred(scor) == y )) / len(y))

Softmax classifier

  • get number of inputs and output classes from training data
  • initialise weights
  • perform stochastic gradient descent for given number of steps
    • choose random minibatch
    • compute the hidden layer activations and the class probabilities
    • compute the gradient
    • back-propagate the errors to get the weight gradients (see the worked example after the code)
  • print accuracy on whole input data
  • return weights
In [100]:
# introduce hidden layer, SGD
def netsoftmax(dataset, h, stepsize=0.5, steps=50):
  X_, y_ = dataset
  reg = 0.001
  D = len(X_[0])
  K = len(set(y_))
  print 'netsoftmax: number of classes =', K
  W = 0.1 * np.random.randn(D,h)
  W2 = 0.1 * np.random.randn(h, K)
  for step in range(steps):
    # minibatch
    ix = np.random.choice(len(y_), min(200, len(y_)), replace=False)
    y = y_[ix]
    X = X_[ix]
    probs, hid = netpred(X, W, W2)
    if (step % 1000) == 0: print 'accuracy on training set:', accur(probs, y)
    # gradient
    dscor = probs
    dscor[range(len(y)), y] -= 1
    dscor /= len(y)
    # backprop 
    dW2 = np.dot(hid.T, dscor)
    dhid = np.dot(dscor, W2.T)
    dhid[hid <= 0] = 0
    dW = np.dot(X.T, dhid)
    dW += reg * W
    dW2 += reg * W2
    # parameter update in the negative gradient direction to decrease loss
    W += -stepsize * dW
    W2 += -stepsize * dW2
  probs, hid = netpred(X_, W, W2)
  print 'accuracy on training set: ', accur(probs, y_)
  return W, W2
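
The dscor step above is the usual gradient of the softmax cross-entropy loss with respect to the class scores: the predicted probabilities, with 1 subtracted at the correct class, divided by the batch size. A tiny worked example with made-up numbers:

In [ ]:
# worked example of the softmax cross-entropy gradient w.r.t. the scores (one example)
p = np.array([[0.2, 0.5, 0.3]])   # predicted probabilities
yt = np.array([1])                # correct class
d = p.copy()
d[range(len(yt)), yt] -= 1
print(d)   # [[ 0.2 -0.5  0.3]]: the update raises the score of class 1, lowers the others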

Main function for Softmax

  • use the embedded tweets x, y from above
  • split into training and validation sets
  • train the softmax classifier
  • print accuracy on validation set
In [110]:
# split the embedded tweets, train the sentiment classifier
def mainsoftmax():
  n = int(len(x) * 0.8)
  x_train, y_train, x_val, y_val = x[:n], y[:n], x[n:], y[n:]
  print "size of training set:", len(x_train)
  print 'training netsoftmax with ReLU units in layer 1:'
  W, W2 = netsoftmax((x_train, y_train), 200, steps=10000, stepsize=0.1)
  probs, hid = netpred(x_val, W, W2)
  print 'accuracy on validation set: ', accur(probs, y_val)

Call mainsoftmax

In [115]:
np.random.seed(1337)
mainsoftmax()
size of training set: 39762
training netsoftmax with ReLU units in layer 1:
netsoftmax: number of classes = 3
accuracy on training set: 17.0
accuracy on training set: 61.5
accuracy on training set: 55.5
accuracy on training set: 71.0
accuracy on training set: 64.0
accuracy on training set: 64.0
accuracy on training set: 65.5
accuracy on training set: 68.5
accuracy on training set: 60.0
accuracy on training set: 66.0
accuracy on training set:  64.1
accuracy on validation set:  58.3

Keras

In [116]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop, SGD
import sys
import numpy as np

np.random.seed(1337)
batch_size = 128
epochs = 20

num_classes = len(set(y))
n = int(len(x) * 0.8)
x_train, y_train, x_val, y_val = x[:n], y[:n], x[n:], y[n:]

print 'training data sample:'
for i in range(5):
  print x_train[i][:3], '...  ', y_train[i]
print 'size of training set:  ', x_train.shape[0]
print 'size of validation set:', x_val.shape[0]

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_val = keras.utils.to_categorical(y_val, num_classes)

model = Sequential()
model.add(Dense(200, activation='relu', input_shape=(200,)))
model.add(Dropout(0.2))
model.add(Dense(200, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])
print 'train model..'
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_val, y_val))
score = model.evaluate(x_val, y_val, verbose=0)
print 'accuracy on validation set:', score[1]
training data sample:
[-0.05292273  0.19408755  0.06798509] ...   2
[ 0.03866978 -0.0181638   0.15413442] ...   2
[ 0.10328479  0.13142791  0.08679648] ...   1
[ 0.10516658  0.07653628  0.00464372] ...   1
[ 0.05622543  0.06396979 -0.04814796] ...   0
size of training set:   39762
size of validation set: 9941
train model..
Train on 39762 samples, validate on 9941 samples
Epoch 1/20
39762/39762 [==============================] - 2s 47us/step - loss: 0.8517 - acc: 0.5953 - val_loss: 0.8147 - val_acc: 0.6010
Epoch 2/20
39762/39762 [==============================] - 4s 95us/step - loss: 0.7798 - acc: 0.6382 - val_loss: 0.8350 - val_acc: 0.5868
Epoch 3/20
39762/39762 [==============================] - 4s 105us/step - loss: 0.7616 - acc: 0.6484 - val_loss: 0.8096 - val_acc: 0.6040
Epoch 4/20
39762/39762 [==============================] - 4s 104us/step - loss: 0.7478 - acc: 0.6582 - val_loss: 0.7930 - val_acc: 0.6117
Epoch 5/20
39762/39762 [==============================] - 4s 111us/step - loss: 0.7360 - acc: 0.6640 - val_loss: 0.8137 - val_acc: 0.6088
Epoch 6/20
39762/39762 [==============================] - 5s 127us/step - loss: 0.7261 - acc: 0.6701 - val_loss: 0.7980 - val_acc: 0.6111
Epoch 7/20
39762/39762 [==============================] - 5s 120us/step - loss: 0.7146 - acc: 0.6749 - val_loss: 0.8099 - val_acc: 0.6085
Epoch 8/20
39762/39762 [==============================] - 5s 118us/step - loss: 0.7069 - acc: 0.6780 - val_loss: 0.8476 - val_acc: 0.6073
Epoch 9/20
39762/39762 [==============================] - 5s 125us/step - loss: 0.6964 - acc: 0.6844 - val_loss: 0.8165 - val_acc: 0.6070
Epoch 10/20
39762/39762 [==============================] - 4s 98us/step - loss: 0.6883 - acc: 0.6872 - val_loss: 0.8021 - val_acc: 0.6146
Epoch 11/20
39762/39762 [==============================] - 3s 85us/step - loss: 0.6798 - acc: 0.6917 - val_loss: 0.8260 - val_acc: 0.6087
Epoch 12/20
39762/39762 [==============================] - 3s 86us/step - loss: 0.6713 - acc: 0.6965 - val_loss: 0.8233 - val_acc: 0.6129
Epoch 13/20
39762/39762 [==============================] - 4s 99us/step - loss: 0.6620 - acc: 0.6983 - val_loss: 0.8370 - val_acc: 0.6165
Epoch 14/20
39762/39762 [==============================] - 4s 107us/step - loss: 0.6541 - acc: 0.7043 - val_loss: 0.8499 - val_acc: 0.6087
Epoch 15/20
39762/39762 [==============================] - 4s 110us/step - loss: 0.6438 - acc: 0.7114 - val_loss: 0.8628 - val_acc: 0.6053
Epoch 16/20
39762/39762 [==============================] - 5s 124us/step - loss: 0.6341 - acc: 0.7143 - val_loss: 0.8422 - val_acc: 0.6036
Epoch 17/20
39762/39762 [==============================] - 5s 130us/step - loss: 0.6260 - acc: 0.7211 - val_loss: 0.8568 - val_acc: 0.5985
Epoch 18/20
39762/39762 [==============================] - 4s 95us/step - loss: 0.6185 - acc: 0.7249 - val_loss: 0.8634 - val_acc: 0.6082
Epoch 19/20
39762/39762 [==============================] - 4s 106us/step - loss: 0.6110 - acc: 0.7250 - val_loss: 0.8833 - val_acc: 0.6078
Epoch 20/20
39762/39762 [==============================] - 4s 108us/step - loss: 0.6030 - acc: 0.7315 - val_loss: 0.8864 - val_acc: 0.6076
accuracy on validation set: 0.607584750061
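
To classify a new tweet with the trained model, it has to be embedded the same way as the training data; a minimal sketch, with a made-up tweet text and the GloVe dict re-read via readglove:

In [ ]:
# sketch: embed a new tweet and predict its sentiment with the trained model
labels = [ 'negative', 'neutral', 'positive' ]   # matches the encoding used in embed()
glove = readglove()
tweet = "what a wonderful day"                   # made-up example text
e = np.asarray([ 0.0 for x in glove['the'] ])
n = 1
for w in [ t.lower() for t in word_tokenize(tweet) ]:
  if w in glove:
    e += glove[w]
    n += 1
probs = model.predict(np.asarray([ e / n ]))
print(labels[np.argmax(probs[0])])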
In [ ]: