Looking at stuff case by case

A review of the competition reveals that the training and test sets both contain only 10 patient cases.

So I thought it would be useful to recreate the previous exercise, this time grouping each feature by case.

So we start:

The code to generate the plot by case would be something like this:

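library(tidyverse)  # loads dplyr, purrr and ggplot2

# annotation length in characters
test2 = test2 %>% mutate(annon_len = annotation %>% map_int(nchar))

ggplot(test2, aes(x=annon_len)) +
  geom_histogram(color="black", fill="white") +
  ggtitle("Length of annotation per case number") +
  facet_wrap(~ case_num)  # assumption: the original snippet is cut off here; one panel per case matches the title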


We can also do a boxplot to get a better idea.


This code for example:

ggplot(test2, aes(x=case_num, y=annon_len, group=case_num)) +
  geom_boxplot()

will return a boxplot.

Visualization matters. If you play around with it a little bit, you’ll realize that having both the histogram and the boxplot tells a more complete story than either alone. The boxplots suggest that the cases are more similar than different, and the histograms make that similarity clear.
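If you want the numbers behind the boxes as well, a quick per-case quantile table is one way to get them. This is only a minimal sketch in base R: the `toy` data frame and its values are made up for illustration and merely mimic the `case_num` and `annon_len` columns of `test2`.

```r
# Toy stand-in for test2 (hypothetical values, not competition data)
toy <- data.frame(
  case_num  = c(0, 0, 0, 1, 1, 1),
  annon_len = c(10, 20, 30, 12, 22, 50)
)

# Split the lengths by case, then compute min, quartiles and max for each case.
# The result is a matrix: one column per case, one row per quantile.
by_case <- sapply(split(toy$annon_len, toy$case_num), quantile)
print(by_case)
```

Swapping `toy` for `test2` should give one column per case, which is easier to compare across cases than reading values off a plot.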

We can also cut stuff by intervals (see my upcoming post on R basics) if we really want to dig in:

for (x in 0:9) {
  # bin this case's annotation lengths into 5 equal-width intervals and count each bin
  test2[test2$case_num == x,]$annon_len %>% cut(breaks=5) %>% table %>% print
}

There is one weakness to such an ad-hoc approach, namely that we have to check things one by one. In this case, even if we don’t want to go down to the level of individual features, 10 cases * 3 or 4 features/columns = 30 to 40 things to check, so we have to find a way to visualize and digest them easily.
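One possible way to cut down on the one-by-one checking is to bin the lengths once and cross-tabulate them against the case number, so that all cases land in a single table. The sketch below uses base R only, and the `toy` data frame is made up for illustration (it just mimics the `case_num` and `annon_len` columns of `test2`). Note one difference from a per-case loop: here every case shares the same five intervals, which is my assumption about what makes the rows comparable.

```r
# Toy stand-in for test2 (hypothetical values, not competition data)
toy <- data.frame(
  case_num  = rep(0:2, each = 4),
  annon_len = c(5, 10, 15, 40, 8, 12, 30, 35, 6, 9, 11, 38)
)

# Bin all lengths once so that every case uses the same 5 intervals
toy$len_bin <- cut(toy$annon_len, breaks = 5)

# One table for everything: rows are cases, columns are length intervals
counts <- table(case = toy$case_num, length = toy$len_bin)
print(counts)
```

With the real data this would be a 10-row table for one feature, so 3 or 4 such tables would cover the 30 to 40 checks.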
