Tidying up the data part 1


In this dataset, you'll notice that in some patient notes, some features have several annotations each, while others have none at all. For now, we'll keep only the former:

#Get rid of all the junk characters
clean = function(column){
  return(str_replace_all(column, c("\\[" = "", "\\]" = "", "\\'" = "")))
}

train = train %>% filter(location != "[]")

test$location = test$location %>% clean %>% strsplit(",")
test$annotation = test$annotation %>% clean %>% strsplit(",")
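A quick sanity check on a made-up string (not an actual cell from the data) shows what clean() does before the split:

# hypothetical example string, just to illustrate the bracket/apostrophe stripping
clean("['diarrhea', 'loose stools']")
# [1] "diarrhea, loose stools"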

Another question popped into my mind: how likely is it that a single feature has several annotations instead of just one?
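One quick way to eyeball this is to tabulate how many pieces each annotation cell was split into (a rough sketch, relying on base R's lengths() over the list-column created above):

# distribution of annotation counts per (note, feature) row
table(lengths(test$annotation))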

Suddenly, while playing around with this, I came across this error:

Error in `fn()`:
! In row 218, can't recycle input of size 3 to size 5.

I check out the test dataframe, and sure enough, I see:

c("diarrhea", " loose", " watery stool") in the annotation column and

c("275 283", " 342 361") in the location one.

Clearly, something is off: three annotations, but only two locations.

First order of business is to check just how many of these there are. If there are only 5 or 6, all the better: I'll clean them up manually.

Conceptually, the easiest way to do it would be to check the length of each cell of both the annotation and location columns. I really wish there were an obvious way to do this in one line. However, either my google-fu is weak or I just don't have the vocab to express myself.

I tried this:

test %>% filter( (location %>% map_int(length)) != (annotation %>% map_int(length)) )

There are 144 erroneous entries… OK. Short enough that the pigeon can't claim complete victory, but just long enough that it'd be annoying to go through them one by one.

There seem to be several possible types of errors: wrong use of notation (e.g. using ";" instead of ","), just plain forgetting to annotate, or even wrong ordering of the annotations.
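A hedged sketch for spotting the first kind (a stray ";"), using the list-columns produced by the strsplit() calls earlier:

# flag rows where any annotation fragment contains a ";" instead of a ","
test %>% filter(map_lgl(annotation, ~ any(str_detect(.x, ";"))))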

Now here's the thing: if the training set has these mistakes, so does the test set. It wouldn't be wise to fix them, so I decided to leave them alone for now.

Also, those mistakes constitute way less than 1% of our training dataset… since I’m on a tight schedule, I’ll just dump them for now while being aware that there are likely other errors in the dataset.
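Dumping them would just be the complement of the length check from before; a sketch of what I mean (shown on the frame I've been splitting above):

# keep only the rows whose annotation and location counts agree
test = test %>% filter(map_int(annotation, length) == map_int(location, length))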

BONUS OBJECTIVE: Can I train a model to see if there’s an error in the annotations? Just how many errors are there in the dataset itself?

In a nutshell, I now want to see whether the annotations concord with their locations.

My code thus far is as follows:

#about 1/3rd of rows are missing annotations
train = train %>% filter(location != "[]")

test$location = test$location %>% clean #%>% strsplit(",")
test$annotation = test$annotation %>% clean #%>% strsplit(",")

test$feature_num = test$feature_num %>% map_int(chr_to_int)
test$pn_num = test$pn_num %>% map_int(chr_to_int)

#get only those rows with (likely) no mistakes
test2 = test %>% filter( (location %>% map_int(str_count(","))) != (annotation %>% map_int(str_count(","))) )

…aaaand… it doesn't work. Everything is kosher up to the point where I try using filter, where RStudio throws me:

Error: Can't coerce element 1 from a character to a integer

After playing around with it a bit and trying to be a bit more specific, I get:

> test$location %>% map_int(str_count(., pattern = ","))
Error in `stop_bad_type()`:
! Result 1 must be a single integer, not NULL of length 0

Great.

So after working out my google-fu, I come across this:

https://stackoverflow.com/questions/55397509/purrrmap-int-cant-coerce-element-1-from-a-double-to-a-integer

OK, to make a long story short: playing around with str_count seems to return integers, so what's the big deal, R?

I don’t know.

I tried simply: test2 = test %>% filter( (location %>% str_count(",")) != (annotation %>% str_count(",")) )

Turns out str_count will take a vector, no problem.
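For the record, the purrr-friendly spelling would have been to hand map_int() an actual function (or its formula shorthand) rather than an already-evaluated str_count() call; a minimal sketch of what I mean:

# the ~ formula builds a one-argument function, so str_count runs once per element
comma_counts = test$location %>% map_int(~ str_count(.x, ","))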

Moral of the story: keep it simple, stupid. Don't let the hammer make everything look like a nail.

I also found out that my second attempt produced a false positive: someone had a typo and wrote something like "loose stools," in the annotation. There is probably more than one such mistake, but since time is short and they are so few, they're hardly representative.
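If I wanted to guard against that particular typo, one option (just a sketch, not something I've applied here) would be to strip trailing commas before counting:

# drop a trailing comma (and surrounding spaces) so it doesn't inflate the comma count
test$annotation = test$annotation %>% str_remove("\\s*,\\s*$")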
