P-Diff Learning Classifier with noisy labels based on probability difference distributions
Label noise in digital Pathology In the field of digital pathology and other health related deep learning applications, label noise is an important challenge to consider during training.
It’s inherent to the medical fields as the problems are extremely challenging even for trained experts, so there is high intra- as well as inter-observer variability.
This blog post dives into the idea of the paper P-DIFF: Learning Classifier with Noisy Labels based on Probability Difference Distributions which is authored by researchers of Microsoft in China.
Meta-learning from noisy labels
Label noise introduction Training machine learning models requires a lot of data. Often, it is quite costly to obtain sufficient data for your problem. Sometimes, you might even need domain experts which don’t have much time and are expensive.
One option that you can look into is getting cheaper, lower quality data, i.e. have less experienced people annotate data. This usually has the side effect of your labels becoming more noisy.
Refactoring machine learning code - namedtuple
Instead of using sometimes confusing indexing in your code, use a namedtuple instead. It’s backwards compatible, so you can still use the index, but you can make your code much more readable.
This is especially helpful when you transform between PIL and numpy based code, where PIL uses a column, row notation while numpy uses a row, column notation.
Let’s consider this piece of code where we want to get the pixel locations of several points which are in the numpy format:
Refactoring machine learning code - einops
Einops is a really great library to improve your machine learning code. It supports Numpy, PyTorch, Tensorflow and many more machine learning libraries. It helps to give more semantic meaning to your code and can also save you a lot of headaches when transforming data.
As a primer let’s look at a typical use-case in machine learning where you have a bunch of data and you want to reshape it, so some dimensions are merged together like this:
Refactoring machine learning code - comments as code
I find that in the field of data science and machine learning some coding principles that are standard in traditional software engineering sometimes are lacking. One such principle is to strive to rather specify everything that is possible in code rather than as comments.
Why does it make sense to do that?
Comments often don’t age well. You write them in the context of the current code, but then over time as the code gets changed and readapted to other use cases, the context changes.