Tidying up the data part deux

aka Basic graphs in R and more

Next order of business: I want to check how long each patient note is.

The reason is that, IIRC, the SOTA for NLP is a descendant of one of the Sesame Street models (BERT/ELMo/etc.), which can’t take very long inputs (IIRC BERT originally topped out at 512 tokens, which is a token limit rather than a character limit).

I just want to start thinking about what to do with it.

So first we do this:

library(tidyverse)  # loads dplyr, purrr, ggplot2, stringr and tidyr, all used in this post
pn = pn %>% mutate(pn_len = pn_history %>% map_int(nchar))

It looks like the smallest note is 30 and the biggest one is 950 characters.
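
A quick way to double-check those extremes is to summarise the new column. Here is a minimal sketch on a made-up stand-in for pn (the three toy notes are invented for illustration; only the column names pn_history and pn_len come from the code above):

```r
library(dplyr)
library(purrr)

# Toy stand-in for the real `pn` table; the notes are made up.
pn_toy <- data.frame(
  pn_history = c("17 y/o F with headaches",
                 "Pt c/o chest pain",
                 "Mom has heart problems, dad had a recent heart attack"),
  stringsAsFactors = FALSE
)

# Same length computation as above, then the two extremes in one row.
pn_toy <- pn_toy %>% mutate(pn_len = pn_history %>% map_int(nchar))
len_range <- pn_toy %>% summarise(shortest = min(pn_len), longest = max(pn_len))
print(len_range)
```

On the real table, swapping pn_toy for pn should give the smallest and biggest note lengths directly.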

Not sure if a detailed breakdown would be useful at this point. We just want to get an eyeball view of what it looks like.

In addition to our trusty friend Google, I’m gonna use this:


There is a template to follow:

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Here’s also another reference with everything you wanted to know about , and far more than is likely useful:


In practice it looks something like this:

ggplot(pn, aes(x=pn_len)) + 
  geom_histogram(color="black", fill="white")
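
One thing worth knowing: without a bin setting, geom_histogram falls back to 30 bins and prints a message suggesting you pick a better value with binwidth. Below is a small sketch with an explicit bin width. The lengths are simulated, not the real notes, and the choice of 50 characters per bar is just my guess at something readable:

```r
library(ggplot2)

# Simulated note lengths, purely for illustration (not the real data).
set.seed(1)
pn_toy <- data.frame(pn_len = sample(30:950, size = 200, replace = TRUE))

# Explicit binwidth (50 characters per bar) instead of the default 30 bins.
p <- ggplot(pn_toy, aes(x = pn_len)) +
  geom_histogram(binwidth = 50, color = "black", fill = "white") +
  labs(x = "Note length (characters)", y = "Number of notes")
print(p)
```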

I just spent half an hour thinking up what to do so it’s time to draw up a battle plan.

First, this is what I know: I have very little compute power, so I cannot in any way, shape or form compete on brute force.

I assume that if somebody is willing to pay bokou money (hey, for a broke student, this is bokou money OK?), said somebody also has bokou money to just throw the data into an off-the-shelf model as is. I also assume, since the contest exists, that this approach is unsatisfactory.

So instead, I’ll skip that part and adopt a divide and conquer approach. My thinking right now is to develop “one-versus-all” models for each feature individually, starting from the easiest and going up to the hardest, then create or choose a model to judge which one is right.
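
To make “one-versus-all” a bit more concrete, here is a rough sketch of how the labels for one such model could be built. The toy table and the helper make_ova_labels are hypothetical names of mine, and every feature number except 913 is made up:

```r
library(dplyr)

# Toy stand-in for the annotation table; values are made up.
test_toy <- data.frame(
  annotation  = c("female", "17 y/o", "F", "intermittent migraine"),
  feature_num = c(913, 900, 913, 905),
  stringsAsFactors = FALSE
)

# One binary target per feature: TRUE for "this feature", FALSE for "all the rest".
make_ova_labels <- function(df, target_feature) {
  df %>% mutate(is_target = feature_num == target_feature)
}

ova_913 <- make_ova_labels(test_toy, 913)
print(ova_913)
```

Each feature would get its own is_target column like this, and each per-feature model only has to answer yes or no.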

There are also other ways we can chop up the problem: we can, for example, separate the task of identifying information-rich segments from the task of figuring out which feature they belong to, but that comes further down the road.

So let’s describe the features some more, starting with a laundry list before tackling each of them.

The questions I am asking myself:

Are there commonalities I can eyeball with basic tools? In particular:

1) Is the length of the annotation correlated somehow to the features?

Take feature 913 for example which denotes the gender “female”:

It would be natural to believe the length is correlated to the feature in this case, but are there others?

We can get the data with this command:

test %>% mutate(annon_len = annotation %>% map_int(nchar))
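
To actually eyeball whether length goes with the feature, one option is to summarise the lengths per feature. This is only a sketch on invented rows (feature 905 and all the annotations are made up); a small spread within a feature would hint that length carries some signal for it:

```r
library(dplyr)
library(purrr)

# Toy stand-in for `test`; annotations and feature 905 are made up.
test_toy <- data.frame(
  annotation  = c("female", "F", "woman",
                  "intermittent migraine", "migraine in the morning"),
  feature_num = c(913, 913, 913, 905, 905),
  stringsAsFactors = FALSE
)

# Mean and spread of annotation length within each feature.
len_by_feature <- test_toy %>%
  mutate(annon_len = annotation %>% map_int(nchar)) %>%
  group_by(feature_num) %>%
  summarise(n = n(), mean_len = mean(annon_len), sd_len = sd(annon_len)) %>%
  arrange(sd_len)
print(len_by_feature)
```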

2) Is the number of words correlated somehow to the features?

E.g. “Female” and “17 y/o” can be summed up in one or two words, but “patient has intermittent migraine in the morning” might not.

We can get the data with this command:

test2 = test2 %>% mutate(word_num_of_annotation = annotation %>% strsplit(" ") %>% lengths)
ggplot(test2, aes(x=word_num_of_annotation)) +
  geom_histogram(color="black", fill="white")
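
A small caveat on the word count: splitting on a single space counts an empty “word” whenever two spaces sit side by side. The sketch below compares that with counting runs of non-whitespace via str_count. The example strings are made up, and whether the real annotations contain double spaces is something I have not checked:

```r
library(stringr)

# Made-up annotations; the last one has a double space on purpose.
annotations <- c("Female", "17 y/o", "patient has  intermittent migraine")

# Splitting on " " yields an empty string between the two spaces.
words_split <- lengths(strsplit(annotations, " "))

# Counting runs of non-whitespace characters ignores the extra space.
words_regex <- str_count(annotations, "\\S+")

print(data.frame(annotations, words_split, words_regex))
```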

3) Do annotations with “gaps” correlate to the features?

For example, an annotator might mark up “Mom (52) have hearts problems”.

(52) would be a gap in this case. Could it denote a more complex idea or just the bias of a particular annotator? It begs the question: can we detect if there’s a similar style of annotation (probably not, just brainstorming here)?

To count the gaps, we can do this:

test2 = test %>%
  # keep rows where location and annotation contain the same number of commas
  filter(str_count(location, ",") == str_count(annotation, ",")) %>%
  # separate rows on those matching commas
  separate_rows(location, annotation, sep = ",") %>%
  # each ";" inside a location marks one gap
  mutate(gap_count = location %>% str_count(";"))
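
To see what that pipeline does, here it is on three invented rows. The character offsets are made up, and I am assuming, as the code above does, that commas separate spans and a “;” inside a location marks a gap:

```r
library(dplyr)
library(stringr)
library(tidyr)

# Toy stand-in for `test`; texts and offsets are invented.
test_toy <- data.frame(
  annotation = c("Mom have hearts problems",      # one span with a gap
                 "female,17 y/o",                 # two spans, no gaps
                 "chest pain, worse at night"),   # comma inside the text itself
  location   = c("10 13;19 39", "0 6,8 14", "50 76"),
  stringsAsFactors = FALSE
)

gaps_toy <- test_toy %>%
  # the third row is dropped: its comma counts do not match
  filter(str_count(location, ",") == str_count(annotation, ",")) %>%
  separate_rows(location, annotation, sep = ",") %>%
  mutate(gap_count = location %>% str_count(";"))
print(gaps_toy)
```

The filter is there so that annotations whose text contains a comma of its own do not get split in the wrong place; the price is that those rows are thrown away.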

If you’re thinking “you’re not going to get anywhere with simplistic thinking like this”, you’re totally right. Truth is, I don’t know a lot about the subject. I *could* start playing with gigantic models like BERT and its descendants right off the bat. I am trying very hard, however, to avoid using techniques whose motivations I don’t understand very well.

Obviously, NLP is a gargantuan topic, so right now the goal is to just pick the lowest hanging fruit and then see where this technique fails in the future. This way, I feel that I can put SOTA techniques into historical perspective, instead of just treating them like a black box.

Anyways, I digress. Even going with such gut feelings, going forward is not that simple.

If we run

test$feature_num %>% unique %>% length

we find out there are about 143 features. Not all of them are distinct (for example, there are several features that denote “female”, but for different patients), so for a slight dimension reduction it’s an obvious avenue of attack. There are a few things that come to mind, but we just want to get the easiest, juiciest, low-hanging fruit here.
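
One rough way to spot features that mean the same thing across patients is to group them by their description. This sketch assumes a lookup table with columns feature_num, case_num and feature_text; those column names are my assumption, and every value except feature 913 is made up:

```r
library(dplyr)

# Hypothetical feature lookup table; only 913 = "Female" comes from this post.
features_toy <- data.frame(
  feature_num  = c(913, 300, 301, 500),
  case_num     = c(9, 3, 3, 5),
  feature_text = c("Female", "Female", "Chest-pain", "Female"),
  stringsAsFactors = FALSE
)

# Descriptions that show up under more than one feature number.
shared <- features_toy %>%
  group_by(feature_text) %>%
  summarise(n_features = n_distinct(feature_num),
            n_cases    = n_distinct(case_num)) %>%
  filter(n_features > 1)
print(shared)
```

Any description that survives the filter is a candidate for merging into a single target.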
