Playing with stats

I have continued playing with some more graphs:

library(tidyverse)  # provides %>%, mutate, map, map_int and ggplot

# location holds, per row, a vector of character offsets
test2 = test2 %>%
  mutate(max_loc = location %>% map(strtoi) %>% map_int(max)) %>%
  mutate(min_loc = location %>% map(strtoi) %>% map_int(min)) %>%
  mutate(perc_of_pn = (max_loc - min_loc) / unlist(pn_len) * 100) %>%
  mutate(relative_start = min_loc / unlist(pn_len) * 100)

ggplot(test2, aes(x = perc_of_pn)) +
  geom_histogram(color = "black", fill = "white") +
  ggtitle("How long each annotation is")


ggplot(test2, aes(x = relative_start)) +
  geom_histogram(color = "black", fill = "white") +
  ggtitle("Length of patient note per case number")


The second ggplot turned out to be nonsensical: R could not find a way to logically display what I wanted to express. So this is the point where I have to go for more numerical estimates.

(NB: Yes I know my code is atrocious and not fit for public release, I will go back and edit it)

So first, I’ll try some out of the box tools. The first that comes to mind is correlation.

That is, whether there is a linear relationship between two variables. E.g. if you plot the two variables on a graph and it roughly looks like a line, they’re correlated. However, there are weaknesses to such an approach. For example, if the relation is better described by a parabola, you’re out of luck.
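To make the parabola point concrete, here is a minimal sketch on made-up data (not the competition data). It only uses base R's `cor`, which computes Pearson's correlation by default.

```r
# Toy data: x is symmetric around zero.
x <- seq(-5, 5, by = 0.5)
y_line <- 2 * x + 1      # exact linear relationship
y_parabola <- x^2        # exact, but non-linear, relationship

cor_line <- cor(x, y_line)          # essentially 1
cor_parabola <- cor(x, y_parabola)  # essentially 0

print(cor_line)
print(cor_parabola)
```

Even though y_parabola is completely determined by x, the correlation comes out at (numerically) zero, which is the "out of luck" case.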

I’m not a statistical expert, and if there were easy linear correlations, this competition wouldn’t exist so I don’t want to spend an inordinate amount of time on statistical correlation.

After some googling, I found this paper:

There are two main classes of correlation: Pearson’s for normally distributed variables and Spearman’s for skewed variables.
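As a rough illustration of the difference, Spearman's coefficient works on the ranks of the values rather than the values themselves, so it only cares whether the relationship is monotonic. The sketch below uses made-up data and the `method` argument of base R's `cor`.

```r
# Toy data: y always increases with x, but not along a straight line.
x <- 1:20
y <- exp(x / 4)

pearson <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

print(pearson)   # noticeably below 1
print(spearman)  # 1, because the ranks of x and y agree exactly
```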

Unfortunately, these deal purely with continuous variables. All of the variables here are discrete. I imagine there are descendants that deal with discrete variables.

After some more googling, I found this paper:

I summarize the relevant parts quickly. First a few definitions:

The level of measurement (or measurement scale) of a variable is its designation as continuous or discrete.

A continuous variable seems fairly self-explanatory, but it is a tad more complicated than that in a real-world context. Even though the recorded values may be integers, as long as the underlying quantity could be measured more finely, it is considered continuous. For example, if you wanted to tell the age of a baby, you’d say “this baby is 60 days old”, but a more precise measure would likely be something like 60.52789 days old.

A discrete variable can be subdivided into two further categories:

An ordinal variable has order encoded in its values. E.g. length of any kind is ordered and thus ordinal.

A nominal variable doesn’t have an order of any kind encoded. A watermelon could be encoded as 1 and an orange as 0, but one isn’t superior to the other.

So as you can see, there is some discretion as to how these terms are defined.
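In R, one way to express the nominal/ordinal distinction is through factors. This is only a sketch with hypothetical values (the fruit and size labels are made up for illustration), using base R's `factor` and `is.ordered`.

```r
# Nominal: labels only, no ordering between the levels.
fruit <- factor(c("orange", "watermelon", "orange"))

# Ordinal: the levels carry an order (hypothetical size categories).
size <- factor(c("short", "long", "medium"),
               levels = c("short", "medium", "long"),
               ordered = TRUE)

is.ordered(fruit)   # FALSE
is.ordered(size)    # TRUE
size[1] < size[2]   # TRUE: "short" comes before "long"
```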

In this paper, there are six combinations that are encountered in practice; for our needs, however, we only need one: 5. Ordinal-nominal, since features are nominal but lengths and counts are ordinal.

We need the rank-biserial correlation coefficient.

Uuuuuunfortunately, the original paper is behind a paywall.

After some more googling, it turns out to be used mostly for dichotomous variables. Ours isn’t.

I’m sure that one could dig into the details and either torture the data enough to fit the tools or MacGyver something with the formula, but it defeats the purpose of “picking the low hanging fruit first”.

So I decided upon something else.

After some more googling, I got wind of an alternative that expressed the same idea as correlation:

which seems to be more promising and intuitive.

In a nutshell, it starts by explaining why it’s just generally a bad idea to try to force correlation when nominal values are involved. I run the danger of trying to force a square peg through a round hole. Or seeing everything as a nail or [insert your own popular analogy here]. It makes no intuitive sense.

Before going further and digging into the code, I need to know a little bit more about R, in particular the tilde operator:

There is far more to know about how to use formulas in R:

However, for our purposes, the very basics will do. In a nutshell, you can think of tilde as meaning “depends on” with + as “and”.

E.g. Z ~ X + Y --> Z depends on X and Y.
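A small sketch of what a formula is in R: it is an unevaluated object, so Z, X and Y do not need to exist when it is created. The names are the placeholder ones from the example above; `all.vars` and `terms` are base R functions.

```r
f <- Z ~ X + Y            # nothing is computed here; Z, X, Y need not exist

class(f)                  # "formula"

vars <- all.vars(f)       # every variable mentioned: "Z" "X" "Y"
predictors <- attr(terms(f), "term.labels")  # right-hand side: "X" "Y"

print(vars)
print(predictors)
```

Functions such as lm only look the names up in the data frame passed as `data` when the model is actually fitted.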

He uses the lm function to build a linear model. Aside: “linear” model is a bit of a misnomer, as there can be non-linearities in the model itself (the “linear” part actually refers to the model being linear in its coefficients). Anyways, I digress.

If we use this:

test3 = test2

test3$case_num = test3$case_num %>% as.character

test3$feature_num = test3$feature_num %>% as.character

boxplot(annon_len ~ case_num, data = test3, ylab = "Annotation Length")

lm(annon_len ~ case_num, data = test3) %>% summary
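One possible way to repeat the same fit for several columns is to build each formula from a column name with base R's `reformulate`. This is only a sketch: the `toy` data frame below is made up (it borrows the column names of test3, not its contents), and the response is constructed to depend on feature_num only.

```r
set.seed(1)

# Hypothetical stand-in for test3, with categorical predictors as strings.
toy <- data.frame(
  case_num = as.character(rep(0:2, each = 20)),
  feature_num = as.character(rep(0:5, times = 10))
)
# Made-up response: driven by feature_num plus noise, not by case_num.
toy$annon_len <- as.integer(toy$feature_num) * 3 + rnorm(nrow(toy))

predictors <- c("case_num", "feature_num")

# Fit annon_len ~ <predictor> for each column and collect the R^2 values.
r_squared <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "annon_len"), data = toy)
  summary(fit)$r.squared
})

print(r_squared)
```

Because the predictors are character columns, lm treats them as categorical and fits one mean per level, which is what the boxplot shows visually.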

While trying to figure out how to loop through a list containing all the columns, I came across this interesting ycombinator article:

Also this:

But anyways I digress. Learning Racket will be an exercise for another time.

The OP notes the use of the R^2. This deserves a few explanations. I’ll shamelessly steal from Wikipedia.

Suppose there exists a dataset, with values y_1,y_2…y_n.

As well as a function f that approximates it (e.g. each y_i would have an associated predicted value f_i).

Define the residuals as a vector representing the errors between the function f and dataset y, i.e. e_i = y_i - f_i.

The SS_res or sum of squares of residuals is what it says on the tin: we compute it by squaring the residuals and adding them together.

The SS_tot or total sum of squares is computed by taking the difference between each y_i of the dataset and their mean, squaring it and adding them together. Essentially it measures how spread out the dataset is.

We define R squared as 1 - SS_res/SS_tot. There are several other equivalent definitions, but at this point, it will suffice as we just want some quick intuitive idea of what it does.
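The definition can be checked by hand against what summary reports. The sketch below uses R's built-in mtcars dataset purely for illustration (it has nothing to do with the competition data).

```r
# Fit a simple linear model on a built-in dataset.
fit <- lm(mpg ~ wt, data = mtcars)

y <- mtcars$mpg        # the observed values y_i
f <- fitted(fit)       # the model's predictions f_i

ss_res <- sum((y - f)^2)         # sum of squares of residuals
ss_tot <- sum((y - mean(y))^2)   # total sum of squares

r2_manual <- 1 - ss_res / ss_tot
r2_lm <- summary(fit)$r.squared

print(c(manual = r2_manual, from_summary = r2_lm))
```

Both numbers agree, so the "Multiple R-squared" line in the lm summary is this same quantity.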

The reasons for squaring are mostly technical: later on down the line, it makes things more convenient for statisticians. You can take derivatives cleanly (unlike with absolute values, whose derivative is just a constant sign and is undefined at zero), you don’t get negative numbers, etc. But the resulting statistic still expresses the same intuitive idea.

If we run the code above, we get an R^2 of 0.04389 for annotation length against case number.

For features, we get an R^2 of 0.6143. Now how do we interpret this?

The lower the errors and the higher the variance, the higher the R^2 and so the better. 0.61 doesn’t sound that bad, but then is it?

One of my pet peeves about applied mathematical education is that we don’t talk enough about the weaknesses of different techniques.

Googling R^2 weaknesses reveals this:

The lecture notes are these in particular, where the author shows several ways to game the R^2 definition:

So have I completely wasted my time, revisiting ?

No. IIRC, I have used correlation in other pet projects, and it seems that to an extent the criticism also extends to Pearson correlation, since R^2 can be derived from it when applied to observed and predicted values:

Intuitively, we’re also not computing the equivalent of the correlation between the feature number and annotation length. Rather, our intent is to test the performance of a linear model. That means we can simply use another measure of performance.
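The link between R^2 and Pearson correlation mentioned above can be checked numerically: for a least-squares model with an intercept, R^2 equals the squared correlation between the observed and the fitted values. A minimal sketch, again on the built-in mtcars dataset with a categorical predictor (cyl stands in for something like feature_num):

```r
# Categorical predictor, analogous to feature_num in the post.
fit <- lm(mpg ~ factor(cyl), data = mtcars)

r2 <- summary(fit)$r.squared

# Pearson correlation between observed and predicted values, squared.
r2_from_cor <- cor(mtcars$mpg, fitted(fit))^2

print(c(r2 = r2, r2_from_cor = r2_from_cor))
```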

OK I really like this guy.

Important clue from Shalizi’s note:

“Mean squared error is a much better measure of how good predictions are; better yet are estimates of out-of-sample error which we’ll cover later in the course.”

Hey, who am I to argue? I did google what out-of-sample errors were:

It is EXTREMELY interesting. It links back to functional analysis and provides a (possible) suggestion for this particular problem, but also a generic framework from which a whole bunch of other things can be evaluated.
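The simplest version of the out-of-sample idea, as far as I understand it, is a hold-out split: fit on one part of the data and measure the error on the part the model never saw. The sketch below is only that, on the built-in mtcars dataset; the 70/30 split and the seed are arbitrary choices of mine.

```r
set.seed(42)

n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.7 * n))  # rows used for fitting

train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]                   # rows held out

fit <- lm(mpg ~ wt, data = train)

# In-sample error: how well the model fits the data it was trained on.
mse_in <- mean(residuals(fit)^2)

# Out-of-sample error: how well it predicts rows it has not seen.
mse_out <- mean((test$mpg - predict(fit, newdata = test))^2)

print(c(in_sample = mse_in, out_of_sample = mse_out))
```

A numeric predictor is used here on purpose: with a categorical predictor, the held-out rows could contain a level that never appeared in the training rows, and predict would fail.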

I have heard the point of view that machine learning practitioners were essentially doing what statisticians have been doing for decades. I am inclined to agree after writing this blog post.

However: there are only two or so weeks remaining in the competition, and I haven’t made even one submission yet due to my lack of practical knowledge. Maybe I am just too tired to attack it right now, but .

One of the . There are tons of rabbit holes that I can get lost in.

We’ll go for the simplest for now and start with MSE:

We can do this:

model_summ = lm(annon_len ~ feature_num, data = test3) %>% summary


model_summ$residuals^2 %>% mean

It gives us an MSE of 47.73412. What does that mean exactly? Right now, not much. The weakness in this context, I think, is that this figure is relative: we need something to compare it with. We’ll keep that number in mind and come back to it when we have another model.
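One cheap thing to compare against, while waiting for a second model, is a baseline that always predicts the mean. This is a sketch on the built-in mtcars dataset, not on test3; the intercept-only formula `mpg ~ 1` is standard R for "predict the mean".

```r
# Model with a categorical predictor (stand-in for feature_num).
fit <- lm(mpg ~ factor(cyl), data = mtcars)
mse_model <- mean(residuals(fit)^2)

# Baseline: intercept-only model, i.e. always predict the overall mean.
baseline <- lm(mpg ~ 1, data = mtcars)
mse_baseline <- mean(residuals(baseline)^2)

# Fraction of the baseline error that is left over; this equals 1 - R^2.
ratio <- mse_model / mse_baseline

print(c(model = mse_model, baseline = mse_baseline, ratio = ratio))
```

Taking the square root of an MSE also helps with reading it: for the 47.73412 above that is roughly 6.9, expressed in the same units as annon_len.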

There’s also a value called eta^2 that is equivalent in this case to R… I don’t think it’ll be useful in this particular case, so we’ll move on.
