Taking a first look at the data

I started taking a look at the data and have a few observations.

1. Lots of typos.

2. The essays seem to range over a variety of subjects, but there is nothing too controversial. I can pick out a few obvious ones, like electoral law.

Also, one interesting thing is that the hook is pretty on-topic. In French schools, we were taught instead to open with an off-topic lead that serves as a hook (e.g. “Seagulls go off to die when they feel they are no longer needed by their flock. Just like seagulls, teenagers commit suicide whenever they feel they got dumped by their cliques.”).

I was thinking of using Word2Vec or some other approach to detect the subject, so that a change of topic could serve as a feature to differentiate the Lead from the other parts, but since the leads here stay on-topic, I guess this approach will have to be abandoned.

I don’t know anything about NLP, so I’ll start with the most obvious place: the Kaggle discussion board for this challenge.

There I found a notebook by this gentleman, with several n-grams computed in R, that answered most of my questions: https://www.kaggle.com/erikbruin/nlp-on-student-writing-eda/notebook

We have a couple of important things:

1. About 40% of essays don’t have a lead statement.

2. Some don’t even have a concluding statement.

3. The discourse type seems to be correlated with the average length of the text.

4. Counterclaims and rebuttals aren’t as frequent as the other types. In other words, this is an imbalanced set, and I’m probably going to have to augment those classes in order to train whatever model(s) are needed.

5. There are 32162 gaps of text that are unclassified. Most of these tend to be quite small (the median is probably around 30 or 40 characters). Consideration should be given to how to include them in our model (if at all).

6. The predictionstring, not discourse_start/end, dictates the position of the annotation.

7. Annotators make mistakes or make questionable annotations. Here is an example: “The first of many reasons why we should limit car usage is because of (greenhouse gas emissions)”. The questionable part is the one in parentheses, so the model has to account for human error.
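To check observations like these myself, a few lines of plain Python would probably do. This is only a rough sketch, not what the notebook above does (that one is in R). The rows are made-up toy data, the helper names are mine, and I am assuming that each annotation row has `id`, `discourse_type`, `discourse_start`, `discourse_end` and `predictionstring` fields, that `discourse_end` is exclusive, and that the predictionstring is a space-separated list of word indices.

```python
from collections import Counter, defaultdict
from statistics import median

# Toy rows standing in for the annotation file; all values are made up.
rows = [
    {"id": "A", "discourse_type": "Lead", "discourse_start": 0, "discourse_end": 40, "predictionstring": "0 1 2 3 4 5 6"},
    {"id": "A", "discourse_type": "Claim", "discourse_start": 45, "discourse_end": 90, "predictionstring": "8 9 10 11 12 13"},
    {"id": "A", "discourse_type": "Evidence", "discourse_start": 91, "discourse_end": 200, "predictionstring": "14 15 16 17 18"},
    {"id": "B", "discourse_type": "Position", "discourse_start": 10, "discourse_end": 60, "predictionstring": "2 3 4 5 6 7 8"},
    {"id": "B", "discourse_type": "Counterclaim", "discourse_start": 100, "discourse_end": 150, "predictionstring": "15 16 17 18"},
]


def share_without(rows, label):
    """Fraction of essays that have no annotation of the given type."""
    essays = {r["id"] for r in rows}
    with_label = {r["id"] for r in rows if r["discourse_type"] == label}
    return 1 - len(with_label) / len(essays)


def gap_lengths(rows):
    """Character lengths of the unannotated stretches between consecutive annotations.

    Text before the first or after the last annotation is ignored, because
    measuring it would require the essay text itself.
    """
    by_essay = defaultdict(list)
    for r in rows:
        by_essay[r["id"]].append((r["discourse_start"], r["discourse_end"]))
    gaps = []
    for spans in by_essay.values():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start > prev_end:
                gaps.append(next_start - prev_end)
    return gaps


def word_indices(predictionstring):
    """Turn a predictionstring such as "3 4 5" into a list of word positions."""
    return [int(tok) for tok in predictionstring.split()]


label_counts = Counter(r["discourse_type"] for r in rows)  # class balance
no_lead = share_without(rows, "Lead")
gaps = gap_lengths(rows)

print(label_counts)
print(f"essays without a Lead: {no_lead:.0%}")
print("gap lengths:", gaps, "median:", median(gaps))
print("first annotation covers words:", word_indices(rows[0]["predictionstring"]))
```

On the real data, the same functions would be fed the rows of the training CSV instead of the toy list. The label counts are where the scarcity of counterclaims and rebuttals should show up.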

There are some obvious breadcrumbs to follow.

Based on the n-grams, American educators seem to have a beef with the electoral college system.

But why do they have a beef with Venus? And what do they have against Facial Action Coding?

In the early 20th century, that would have been a mystery for Hercule Poirot. Unfortunately, he is a fictional character and thus cannot help us.
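For reference, counting n-grams like the ones mentioned above does not need much. Here is a minimal sketch in Python rather than R, so it is not the notebook's code. The two sentences are invented stand-ins for essays, the `ngrams` helper is my own, and the tokenizer is a deliberately crude regex.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Yield the n-grams of a text as tuples of lowercase word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Zip the token list against shifted copies of itself to form the n-grams.
    return zip(*(tokens[i:] for i in range(n)))


# Invented example sentences standing in for real essays.
essays = [
    "The electoral college should be abolished.",
    "Many people think the electoral college is unfair.",
]

counts = Counter(gram for essay in essays for gram in ngrams(essay, 2))
print(counts.most_common(3))
```

Stop words are not removed here, so on real essays the top of the list would be dominated by pairs like "of the" until they are filtered out.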

I could start digging further into the data, but I think it is a good idea to first get an overview of where we’re going and what architecture I’m going to use. I know very little about NLP, so I think it’s best to take a second look once I have a fresh perspective.
