Start of literature review

I reviewed some of the academic literature that might be of use.

Typically, I start with the Wikipedia pages and a quick Google search. The wiki is useful as always. In this case, though, I am completely ignorant of text mining, so I don't know exactly which terms to search for.

Luckily, the community has already compiled a list of resources, including a number of survey papers accessible to relative beginners:

https://www.kaggle.com/c/feedback-prize-2021/discussion/295208

Now there are about 50 papers in the solutioning document. Reading each in depth is probably overkill, so I decided to do a first pass: cherry-pick a few, do a quick flyover, and summarize them.

Then I'll do a second pass on the chosen few and read them in depth (or save them in the drawer for a later read-through).

A mathematics teacher once said: When you don’t know what to do, grab whatever buoy you have.

Without further ado, it makes sense to start with a quick historical survey of transfer learning in the context of NLP.

https://arxiv.org/abs/1910.07370

In chronological order, we have:

Previous methods -> Recurrent Neural Networks (RNN) -> Long Short-Term Memory (LSTM), then Gated Recurrent Units (GRU) -> AWD-LSTM -> NT-ASGD -> Seq2Seq -> Attention mechanism -> ULMFiT -> ELMo -> OpenAI transformer (GPT) -> GPT-2/BERT -> Universal Sentence Encoder.

An RNN is much like a multi-layer perceptron, except there is a memory element: the hidden state is fed back in at every time step. NB: such networks are trained with "backpropagation through time".
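
To make that "memory element" concrete, here is a minimal sketch of a single vanilla RNN step in NumPy (all names and sizes are illustrative, not taken from any particular library):

```python
import numpy as np

hidden_size, input_size = 4, 3
W_xh = np.random.randn(hidden_size, input_size) * 0.1   # input-to-hidden weights
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden ("memory") weights
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # h_prev is the memory element: it carries information from earlier steps.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)
for x_t in np.random.randn(5, input_size):  # a toy sequence of 5 input vectors
    h = rnn_step(x_t, h)                    # the same weights are reused at every step
```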

LSTMs have the capacity to "forget", with GRUs being a computationally cheaper variant. This is done through gates: small learned sigmoid units that decide how much information to keep, write, or expose at each step.
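
As a rough sketch of what those gates look like, here is one LSTM step following the standard equations (biases omitted for brevity; the sizes are toy values of my choosing):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
# One weight matrix per gate (forget, input, output) plus the candidate cell update.
W_f, W_i, W_o, W_c = (rng.normal(0, 0.1, (hidden_size, input_size + hidden_size))
                      for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W_f @ z)   # forget gate: how much of the old cell state to keep
    i = sigmoid(W_i @ z)   # input gate: how much new information to write
    o = sigmoid(W_o @ z)   # output gate: how much of the cell state to expose
    c = f * c_prev + i * np.tanh(W_c @ z)   # cell state = the network's "memory"
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(rng.normal(size=input_size), h, c)
```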

The contribution of AWD-LSTM was to remedy the problem of overfitting. Typically, neural nets use dropout (zeroing certain activations at random) to avoid it, but in RNNs that keeps erasing the memory carried in the hidden state. The DropConnect algorithm instead drops weights.
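
A toy comparison of the two ideas (this is only the one-step intuition; AWD-LSTM actually reuses the same weight mask across all time steps of a sequence):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))   # a recurrent (hidden-to-hidden) weight matrix
h = rng.normal(size=4)        # hidden activations at some time step
p = 0.5                       # drop probability

# Standard dropout: zero out random *activations*.
# Done independently at every time step, this disrupts the RNN's memory.
activation_mask = rng.random(4) > p
h_dropped = h * activation_mask / (1 - p)

# DropConnect (what AWD-LSTM applies to its hidden-to-hidden weights):
# zero out random *weights* instead, leaving the activations themselves intact.
weight_mask = rng.random(W.shape) > p
W_dropped = W * weight_mask / (1 - p)
```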

NT-ASGD is a custom optimizer (non-monotonically triggered averaged SGD). I predict there will be plenty of experimentation with optimizers on our part, so I will spare the details.

Seq2Seq adopts an encoder-decoder architecture. Essentially, an RNN encodes the input sequence into a single vector, and another RNN decodes it into the final output. Squeezing everything into that one fixed-size vector often loses context, especially on long sequences.
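
Here is a minimal sketch of that idea in PyTorch (the GRU modules and sizes are placeholders I chose for illustration, not the original Seq2Seq setup):

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim, vocab_size = 32, 64, 1000

embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)

src = torch.randint(0, vocab_size, (1, 7))     # a toy source sentence (batch of 1, length 7)
_, context = encoder(embed(src))               # the whole input squeezed into ONE vector

tgt_in = torch.randint(0, vocab_size, (1, 5))  # decoder inputs (e.g. the shifted target)
dec_out, _ = decoder(embed(tgt_in), context)   # decoding starts from that single vector
logits = to_vocab(dec_out)                     # a prediction for every output position
```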

The attention mechanism builds on the same concept as Seq2Seq, except that instead of relying on one summary vector, the model computes an additional context vector at each step: a weighted sum of all encoder states, with the weights indicating which input tokens matter most right now.
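
A small sketch of the simplest (dot-product) flavour of this, assuming we already have the encoder states and the current decoder state:

```python
import torch
import torch.nn.functional as F

hidden_dim = 64
enc_states = torch.randn(1, 7, hidden_dim)   # one encoder state per source token
dec_state = torch.randn(1, 1, hidden_dim)    # the decoder state at the current step

# Score every source position against the current decoder state...
scores = torch.bmm(dec_state, enc_states.transpose(1, 2))   # shape (1, 1, 7)
weights = F.softmax(scores, dim=-1)                          # attention weights
# ...and build a context vector as the weighted sum of encoder states.
context = torch.bmm(weights, enc_states)                     # shape (1, 1, hidden_dim)
```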

Next up are transformers, which are an extension of attention mechanisms. There's a whole page on the topic, but it has a fair amount of jargon, so I'll save it for when I get to actually choosing the algorithms I'll use.

Next up:

ULMFiT innovated by introducing transfer learning to NLP. Transfer learning is when you take a model that has already been trained on a general task, then retrain only the final few layers for a specific task.
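
As a minimal sketch of that recipe (using a Hugging Face backbone purely as an example; ULMFiT itself uses an AWD-LSTM plus gradual unfreezing and discriminative learning rates):

```python
import torch.nn as nn
from transformers import AutoModel

# Load a pretrained backbone ("bert-base-uncased" is just an example choice).
backbone = AutoModel.from_pretrained("bert-base-uncased")

# Freeze the pretrained layers...
for param in backbone.parameters():
    param.requires_grad = False

# ...and train only a small task-specific head on top of them.
num_labels = 7  # hypothetical number of target classes
head = nn.Linear(backbone.config.hidden_size, num_labels)
```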

ELMo differs from its predecessors in that, whereas previously a single vector represented each token (word) independently of context, ELMo uses a bidirectional LSTM to create a vector for each token that takes its context into consideration.
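
A toy illustration of "contextual" versus "static" vectors (this is just a randomly initialised biLSTM, not ELMo's actual character-based language model):

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim, vocab_size = 32, 64, 1000
embed = nn.Embedding(vocab_size, emb_dim)   # context-independent lookup (word2vec-style)
bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

sent_a = torch.tensor([[12, 5, 900]])   # toy token ids; id 5 sits in the middle
sent_b = torch.tensor([[44, 5, 900]])   # same middle token, different first word

ctx_a, _ = bilstm(embed(sent_a))
ctx_b, _ = bilstm(embed(sent_b))

# The static embedding of token 5 is identical in both sentences...
assert torch.equal(embed(sent_a)[0, 1], embed(sent_b)[0, 1])
# ...but its contextual vector differs, because the biLSTM saw different neighbours.
print(torch.allclose(ctx_a[0, 1], ctx_b[0, 1]))  # False (almost surely)
```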

What GPT does is use unsupervised learning to pre-train a generic language model, then use supervised learning to fine-tune it for a specific task.

BERT introduces bidirectionality. The idea is that instead of reading the input only left to right (e.g. "A brat goes to White Castle"), we also go the other way around (e.g. "Castle White to goes brat A").

In other words, it takes the whole context into account at once. In particular, it masks some words at random and trains the model to predict them.
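
A rough sketch of that masking objective (the real BERT recipe also sometimes keeps or randomly swaps the selected tokens; this only shows the basic idea):

```python
import random

tokens = "a brat goes to white castle".split()
masked, targets = [], []
for tok in tokens:
    if random.random() < 0.15:     # select ~15% of tokens
        masked.append("[MASK]")    # hide the token from the model...
        targets.append(tok)        # ...and train it to predict the original
    else:
        masked.append(tok)
print(masked)
```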

Now, GPT-2 is essentially a scaled-up GPT trained on a much larger web corpus. OK, I will likely not use it; I think I read that GPT-2 has some pretty high hardware requirements, so I'll need to get clever about it. Also, it does note it in the paper, but…

Now, the Universal Sentence Encoder produces embeddings for whole sentences rather than individual words. It differs a little bit from the Transformer architecture under the hood.

Transformer-XL is a variant of the transformer that can handle longer, variable-length context (via a segment-level recurrence mechanism).

XLNet was made to address shortcomings of BERT that stem from masking tokens during training.

Terms that keep coming back are "attention" and "multi-headed", so I'll have to look them up later on.

Anyways, this is only a quick summary of what has happened in the field of NLP over the last few years. I am completely new to this, so please excuse the heavy simplification and the occasional mistake. I'll only go into further detail later on for a chosen few.
