OK, long time no see everyone. Been busy IRL.
We'll go back to continuing our analysis of what's already out there. I found this very interesting post from Mr. Chris Deotte on the Kaggle boards, which found that there are 15 distinct subjects:
From Mr. Deotte:
“First we convert each text into Tfidf embedding of length 25,000. Next we use UMAP to reduce this to 2 dimensions. Next use KMeans for color and clustering”
An unenlightened beginner such as yours truly does well to ask: what is a TF-IDF embedding? What is UMAP? And what is KMeans? This post answers the first question.
An embedding is a way to represent whatever input we’re giving our model as real numbers.
The simplest, one-hot encoding, is a matrix representation where each column represents the presence of one word:
“King” = (1,0,0)
“Peasant” = (0,1,0)
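The idea is simple enough to sketch by hand. A minimal version in Python (the third vocabulary word is my own invention, added only to fill out the length-3 vectors from the example):

```python
# Hypothetical three-word vocabulary; "wizard" is an assumption to
# account for the third slot in the length-3 vectors above.
vocab = ["king", "peasant", "wizard"]

def one_hot(word, vocab):
    # 1 in the position matching the word, 0 everywhere else
    return [1 if w == word else 0 for w in vocab]

print(one_hot("king", vocab))     # [1, 0, 0]
print(one_hot("peasant", vocab))  # [0, 1, 0]
```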
As you can imagine, there are countless ways to "embed", each with its advantages and disadvantages. If you have 5 tokens, the representation is trivial, but if you have the whole of the English language, it becomes more problematic.
TF-IDF is a measure of how often a term appears in a document (term frequency), weighted by how rare it is across the corpus as a whole (inverse document frequency). We can think of TF-IDF as a "function" (I will use that term extremely loosely) taking a document and a particular word as input.
Now, term frequency can be defined in a number of ways, but I believe the default is the raw count of the term in a given document d divided by the total number of terms in that document:
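In symbols (a standard formulation; the notation is mine, not from the original post), with $f_{t,d}$ the raw count of term $t$ in document $d$:

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
```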
We define inverse document frequency as the logarithm of the number of documents in the corpus, N, divided by the number of documents in which the term appears:
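In symbols (again, notation mine), with $\mathrm{df}(t)$ the number of documents containing $t$, and the final score being the product of the two pieces:

```latex
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)},
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)
```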
Why logarithmic? Log has interesting properties. For one, it grows very, very slowly. Since we're mostly interested in ratios here, it prevents the TF-IDF score from blowing up for rare terms.
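To see the damping concretely, here is a toy comparison (the corpus size and document frequencies are made up for illustration):

```python
import math

N = 1_000_000  # hypothetical number of documents in the corpus
for df in (1, 10, 1_000, 1_000_000):
    ratio = N / df                 # raw N/df blows up for rare terms
    damped = math.log(ratio)       # log keeps the scale manageable
    print(df, ratio, round(damped, 3))
```

The raw ratio ranges from 1 up to a million, while its log stays under 14.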
This fella conveniently provides some nice code to illustrate it:
If we run the first code cell he uses, it's easier to visualize what is meant by way of an example.
Suppose we have two “documents”:
Document 1: "Mullah Omar has a nice beard"
Document 2: "Saddam Hussein has a nice mustache"
Count Vectorizer (one-hot encoding)

        beard  hussein  mullah  mustache  nice  omar  saddam
Doc1    1      0        1       0         1     1     0
Doc2    0      1        0       1         1     0     1

TF-IDF Vectorizer (tf-idf encoding)

        beard     hussein   mullah    mustache  nice      omar      saddam
Doc1    0.534046  0.000000  0.534046  0.000000  0.379978  0.534046  0.000000
Doc2    0.000000  0.534046  0.000000  0.534046  0.379978  0.000000  0.534046
Next up: What is UMAP (a light theoretical perspective)?