What are TF-IDF embeddings?

OK, long time no see, everyone. Been busy IRL.

We’ll go back to continuing our analysis of what’s already out there. I found this very interesting post by Mr. Chris Deotte on the Kaggle boards, which finds that there are 15 distinct subjects:

https://www.kaggle.com/c/feedback-prize-2021/discussion/301481

From Mr. Deotte:

“First we convert each text into Tfidf embedding of length 25,000. Next we use UMAP to reduce this to 2 dimensions. Next use KMeans for color and clustering”

The unenlightened beginner, such as yours truly, may well ask: what is a TF-IDF embedding? What is UMAP? And what is KMeans? This post answers the first question.

An embedding is a way to represent whatever input we’re giving our model as vectors of real numbers.

The simplest, one-hot encoding, is a matrix representation where each column represents the presence of one word.

For example:

“King” = (1,0,0)

“Peasant” = (0,1,0)

“Queen”=(0,0,1)
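A minimal sketch of this in scikit-learn, assuming CountVectorizer with binary=True (presence/absence rather than counts):

from sklearn.feature_extraction.text import CountVectorizer

# Three one-word "documents"; binary=True marks presence/absence only.
vectorizer = CountVectorizer(binary=True)
print(vectorizer.fit_transform(["King", "Peasant", "Queen"]).toarray())
# Columns come out in alphabetical order (king, peasant, queen):
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]]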

As you can imagine, there are countless ways to “embed”, each with its advantages and disadvantages. If you have 5 tokens, the representation is trivial, but if you have the whole of the English language, it becomes more problematic.

TF-IDF is defined as a measure of how often a term appears in a document (term frequency) multiplied by a measure of how rare that term is across the corpus as a whole (inverse document frequency). We can think of TF-IDF as a “function” (I will use that term extremely loosely) taking a document and a particular word as input.
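In symbols, with t a term, d a document, and D the whole corpus, that “function” is just a product:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$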

Now term frequency can be defined in a number of ways, but I believe the default is the raw count of the term in said document d divided by the total number of terms in d:
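$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

where f_{t,d} is the raw count of term t in document d. (For what it’s worth, scikit-learn’s TfidfVectorizer skips the denominator and just uses the raw count.)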

We define inverse document frequency as the logarithm of the number of documents in the corpus, N, divided by the number of documents in which the term appears:
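$$\mathrm{idf}(t, D) = \log\left(\frac{N}{\lvert\{d \in D : t \in d\}\rvert}\right)$$

(scikit-learn’s default is the smoothed variant $\ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1$, so that unseen terms don’t cause a division by zero.)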

Why logarithmic? Log has interesting properties. For one, it grows very, very slowly. Since we’re mostly interested in ratios here, the log keeps the TF-IDF of very rare terms from blowing up.
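A toy snippet just to see how slowly the log grows:

import math

# A term that appears in 1 document out of n: the raw ratio n/1 explodes,
# but its log stays small.
for n in [10, 1_000, 1_000_000]:
    print(n, round(math.log(n), 2))
# 10 2.3
# 1000 6.91
# 1000000 13.82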

This fella conveniently produces some nice code to illustrate it:

medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a

If we run the first code cell he uses, it’s easier to visualize what all this means with an example.

Suppose we have two “documents”:

Document 1: ”Mullah Omar has a nice beard”

Document 2: ”Saddam Hussein has a nice mustache”

CountVectorizer (count encoding; every word here appears at most once per document, so it looks just like one-hot encoding)

      beard  hussein  mullah  mustache  nice  omar  saddam
Doc1      1        0       1         0     1     1       0
Doc2      0        1       0         1     1     0       1

TfidfVectorizer (TF-IDF encoding)

         beard   hussein    mullah  mustache      nice      omar    saddam
Doc1  0.534046  0.000000  0.534046  0.000000  0.379978  0.534046  0.000000
Doc2  0.000000  0.534046  0.000000  0.534046  0.379978  0.000000  0.534046
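
In case the link rots, here’s a minimal sketch of what that first cell roughly does. I’m assuming stop_words="english" (that’s what makes “has” and “a” vanish from the columns above) and pandas for the pretty-printing:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Mullah Omar has a nice beard",
        "Saddam Hussein has a nice mustache"]

# Raw counts; the English stop-word list removes "has" and "a".
cv = CountVectorizer(stop_words="english")
print(pd.DataFrame(cv.fit_transform(docs).toarray(),
                   columns=cv.get_feature_names_out(),
                   index=["Doc1", "Doc2"]))

# TF-IDF weights; scikit-learn smooths the IDF and L2-normalizes each row
# by default, which is where the 0.534... and 0.379... values come from.
tv = TfidfVectorizer(stop_words="english")
print(pd.DataFrame(tv.fit_transform(docs).toarray(),
                   columns=tv.get_feature_names_out(),
                   index=["Doc1", "Doc2"]))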

Next up: What is UMAP (a light theoretical perspective)?
