The plan going forward

OK, so now it’s time for some data analysis. For this, I’ve chosen R.


The first reason is that, in my experience, RStudio is easier than PyCharm or Jupyter for data exploration.

The second is that R is, at its core, a functional language (see my previous post about the advantages of functional programming).

The third is mainly to buff up the resume. The more the merrier: having two or three languages is better than one. I already have a few projects in Python. While I did dabble in R here and there, I am still very new to the language, so I’ll dig right in.


Now there are two ways I could proceed.

I could do what most anyone new to a language would do: mostly tackle trivial or solved problems, tailor-made to maximize my learning.


Or I could just jump in, not knowing what I don’t know, and hope for the best.

The first option seems reasonable, but the second is far more fun, so we’ll go for #2.

Eyeballing the data

So now we have our data. The first thing I notice is that in the train.csv file, the annotations are all really clean. They’re short, concise, to the point, and relatively uniform in length.

If you take a look at the location column, however, you’ll soon realize that the annotations simplify the patient notes by excluding extraneous words.

We want those words back, as they’re part of the dataset. EDIT: Actually, no we don’t. Or at least I’m guessing not, for now. Upon closer inspection, it seems that the extraneous words are very short.

E.g. “Father (52) have heart palpitations” becomes “Father have heart palpitations”.

This may not be true for all cases: it’s a conclusion drawn from a few glances with my Mk I eyeball.

The issue is that I’m very limited in time, so I’d like to exhaust the “bang-for-your-buck” approaches first.

Since the competition values accuracy above all else, I can start by just predicting “Father have heart palpitations” and then go for a more granular approach later on.
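The “extraneous words” in my example above look like simple parentheticals, so a cheap first pass at recovering the cleaned-up form might just be a regex strip. This is a guess from a handful of rows, not something I’ve verified against the whole dataset:

```r
# Strip parenthetical asides (and the space before them) from a note.
# Pattern: optional whitespace, then "(...)" with no nested parens.
note <- "Father (52) have heart palpitations"
cleaned <- gsub("\\s*\\([^)]*\\)", "", note)
cleaned  # "Father have heart palpitations"
```

If the pattern holds up on more of the data, this could become the baseline normalisation step; if not, it’s a five-minute experiment lost.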

Divide and Conquer

I don’t have a lot of compute power (I have an RTX 3060…), so I intend to use a divide-and-conquer approach: subdivide the problem into as many tasks as is practical, then train a model for each task before aggregating the results somehow. Furthermore, I will be putting a premium on feature engineering.
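To convince myself the plumbing of that idea works, here’s a toy sketch: one “model” per task (here, per feature), with a deliberately trivial stand-in model that just predicts the most common annotation. The column names (`feature_num`, `annotation`) are my assumption about the data at this point:

```r
# Toy stand-in for train.csv
train <- data.frame(
  feature_num = c(0, 0, 1, 1, 1),
  annotation  = c("17 y/o male", "17 y/o male",
                  "palpitations", "palpitations", "racing heart"),
  stringsAsFactors = FALSE
)

# Divide: one "model" per feature -- here just the modal annotation
models <- lapply(split(train, train$feature_num), function(task) {
  names(which.max(table(task$annotation)))
})

# Conquer: look up the per-feature prediction when aggregating
predict_feature <- function(feature_num) models[[as.character(feature_num)]]
predict_feature(1)  # "palpitations"
```

Swapping the modal-annotation stand-in for a real per-task model later shouldn’t change this structure, which is the point of sketching it now.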

The questions I am asking myself are:

1. Does the relative position of an annotation matter?

E.g. It makes sense that a doctor would write “17 y/o male” at the very beginning rather than at the end.

2. What are the lengths of the annotations? And their counts? Is length correlated with the features?

3. Do some patients require more verbose notes across the board?
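Question 2 is the easiest to start poking at. Here’s a first-pass sketch using a toy stand-in for train.csv; the real column names (`feature_num`, `annotation`) are still my assumption until I actually load the file:

```r
# Toy stand-in for train.csv
train <- data.frame(
  feature_num = c(0, 0, 1, 1, 1),
  annotation  = c("17 y/o male", "17yo M",
                  "heart palpitations", "palpitations", "racing heart"),
  stringsAsFactors = FALSE
)

# Annotation length in characters
train$ann_length <- nchar(train$annotation)

# Per-feature counts and typical lengths
res <- aggregate(ann_length ~ feature_num, data = train,
                 FUN = function(x) c(n = length(x), mean = mean(x)))
res
```

Pointing the same few lines at the real file should give a quick read on whether length and count vary enough by feature to be worth engineering around.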

There are probably more that will pop up as time goes by; I will update this post as they do.

After this, I will explore some simple models I can use and treat them as a benchmark before really delving in.
