We’ll go back to analyzing what’s already out there. I found a very interesting post from Mr. Chris Deotte on the Kaggle boards, which finds that there are 15 distinct subjects:

https://www.kaggle.com/c/feedback-prize-2021/discussion/301481

From Mr. Deotte:

“First we convert each text into Tfidf embedding of length 25,000. Next we use UMAP to reduce this to 2 dimensions. Next use KMeans for color and clustering”

The unenlightened beginner, such as yours truly, would do well to ask: what is a TF-IDF embedding? What is UMAP? And what is KMeans? This post answers the first question.

An embedding is a way to represent whatever input we’re giving our model as real numbers.

The simplest, one-hot encoding, is a matrix representation where each column represents the presence of one word.

For example:

“King” = (1,0,0)

“Peasant” = (0,1,0)

“Queen” = (0,0,1)
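To make the three vectors above concrete, here is a minimal sketch in plain Python. The `vocabulary` list and the `one_hot` helper are names I made up for illustration, and the column order is an arbitrary choice.

```python
# Vocabulary from the example above; the column order is an arbitrary choice.
vocabulary = ["King", "Peasant", "Queen"]

def one_hot(word, vocab):
    # A vector of zeros with a single 1 at the word's position in the vocabulary.
    return tuple(1 if word == entry else 0 for entry in vocab)

for word in vocabulary:
    print(word, one_hot(word, vocabulary))
```

Note that each vector is exactly as long as the vocabulary, which is fine for three words and much less fine for a whole language.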

As you can imagine, there are countless ways to “embed”, each with its advantages and disadvantages. If you have 5 tokens, the representation is trivial, but if you have the whole of the English language, it becomes more problematic, since every vector is as long as the vocabulary.

**TF-IDF** is defined as a measure of how often a term appears in a document (*term frequency*) multiplied by a measure of how rare that term is across the corpus as a whole (*inverse document frequency*). We can think of TF-IDF as a “function” (I will use that term extremely loosely) taking a document and a particular word as input.

Now **term frequency** can be defined in a number of ways, but I believe the default is the raw count of the term *t* in document *d* divided by the total number of terms in *d*.

We define **inverse document frequency** as the logarithm of the number of documents in the corpus, *N*, divided by the number of documents in which the term appears.
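Written out as formulas, the textbook definitions look roughly like this. The symbols are my own notation, not from the sources above: $f_{t,d}$ is the raw count of term $t$ in document $d$, and $D$ is the corpus of $N$ documents.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
\qquad
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
\qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```

Libraries often use slightly different variants (smoothing, normalization), so treat this as the plain version rather than what any particular package computes.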

Why logarithmic? Log has an interesting property: it grows very, very slowly. Since we’re mostly interested in ratios here, it prevents the TF-IDF from blowing up when a term appears in only a handful of documents.
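Here is a small sketch of the plain textbook version in Python, to see the log at work. The toy `corpus` and the function names are hypothetical, made up for this illustration, and the IDF helper assumes the term occurs in at least one document.

```python
import math

# Hypothetical toy corpus, already tokenized (not from the original post).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred", "loudly"],
]

def term_frequency(term, doc):
    # Raw count of the term divided by the total number of terms in the document.
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    # log(N / number of documents containing the term).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

for term in ["the", "cat", "sat"]:
    print(term, tf_idf(term, corpus[0], corpus))
```

A word like “the” that shows up in every document gets an IDF of log(1) = 0, so its TF-IDF vanishes, while the rarer “sat” scores higher than “cat”.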

This fella conveniently provides some nice code to illustrate it:

medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a

If we run the first code cell he uses, it’s easier to visualize what is meant with an example.

Suppose we have two “documents”:

Document 1: “Mullah Omar has a nice beard”

Document 2: “Saddam Hussein has a nice mustache”

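Count Vectorizer (one-hot encoding)

| | beard | hussein | mullah | mustache | nice | omar | saddam |
|---|---|---|---|---|---|---|---|
| Doc1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| Doc2 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |

TF-IDF Vectorizer (tf-idf encoding)

| | beard | hussein | mullah | mustache | nice | omar | saddam |
|---|---|---|---|---|---|---|---|
| Doc1 | 0.534046 | 0.000000 | 0.534046 | 0.000000 | 0.379978 | 0.534046 | 0.000000 |
| Doc2 | 0.000000 | 0.534046 | 0.000000 | 0.534046 | 0.379978 | 0.000000 | 0.534046 |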
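Where do 0.534046 and 0.379978 come from? Below is a hand-rolled sketch in plain Python that recomputes the TF-IDF rows. It rests on assumptions of mine rather than anything stated above: that “has” and “a” were dropped as stop words (they are missing from the columns), and that the numbers come from scikit-learn's default settings, which use the raw count as term frequency, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. The helper names are made up for illustration.

```python
import math

docs = ["Mullah Omar has a nice beard", "Saddam Hussein has a nice mustache"]

# Assumption: "has" and "a" are removed as stop words, since they are
# absent from the table's columns.
stop_words = {"has", "a"}

tokenized = [[w for w in d.lower().split() if w not in stop_words] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def smoothed_idf(term):
    # Assumed smoothed variant: ln((1 + N) / (1 + df)) + 1.
    df = sum(1 for doc in tokenized if term in doc)
    return math.log((1 + len(tokenized)) / (1 + df)) + 1

def tfidf_row(doc):
    # Raw count times smoothed IDF, then scale the row to unit L2 length.
    raw = [doc.count(term) * smoothed_idf(term) for term in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

print(vocab)
for doc in tokenized:
    print([round(x, 6) for x in tfidf_row(doc)])
```

Under those assumptions the rows match the table: “nice” appears in both documents, so it gets the lowest weight, while the words unique to one document share the larger value.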

Next up: What is UMAP (a light theoretical perspective)?

