Predicting Breast Cancer Proliferation Scores with Apache Spark and Apache SystemML

Monday, November 7, 2016 - 12:00 pm

Mike Dusenberry and Madison Myers work on the Machine Learning team at IBM’s Spark Technology Center. The project they will present was developed for the Tumor Proliferation Assessment Challenge 2016, in which they were tasked with predicting tumor proliferation scores, which reflect how quickly a tumor is growing. This kind of tumor growth is an important indicator of a breast cancer patient’s prognosis, so automating its assessment would greatly help pathologists. After applying to the challenge, Mike and Madison were given 500 images of breast cancer tissue of varying sizes, starting at 7 GB each, at 20-40x zoom and at least 50,000 by 50,000 pixels. Because the data amounted to 7 terabytes, they had to overcome big data issues, and they approached the problem by applying machine learning and deep convolutional neural networks using Apache Spark and Apache SystemML.

Engineer
IBM Spark Technology Center

Mike Dusenberry is an engineer at the IBM Spark Technology Center, where he is creating a deep learning library for SystemML and working on performant deep learning at scale. He was on his way to an M.D. and a career as a physician in his home state of North Carolina when he teamed up with professors on a medical machine learning research project. Two years later, now in San Francisco, Mike contributes to Apache SystemML as a committer and researches medical applications of deep learning.

Data Scientist
IBM Spark Technology Center

After receiving her BA from NYU and her MA from King's College London, Madison J. Myers started her career in political science, where she focused on food policy and social justice. Shortly afterward, she became interested in data science, and she now works at IBM’s Spark Technology Center as a data science intern while studying as a MIDS graduate student at UC Berkeley. Madison currently focuses on Apache SystemML and Apache Spark, developing use cases in the medical and social science domains.