PhyloSoC: Apply Machine Learning Algorithm(s) to Ecology Data

From Phyloinformatics
Revision as of 06:41, 18 June 2012

Author and Relevant Links

Student: Abu Zaher Md. Faridee / email:

Mentors: Primary Mentor: Kathryn Iverson, Secondary Mentor: Sarah Westcott

Project Homepage: PhyloSoC: Apply Machine Learning Algorithm(s) to Ecology Data

Project blog: Wiki on GitHub and Issue Tracker on GitHub

Source Code: Hosted on GitHub



Interim Period before Acceptance: April 7 - April 22

  • Became familiar with microbial ecology terms
  • Got mothur up and running with Xcode and gfortran on my Mac OS X setup
  • Followed the mothur tutorial on the wiki to get to know the workflow
  • Created a private git branch for mothur on GitHub
  • Took a closer look at the R implementation
  • Created the initial schematics for the command line programs that would be added to mothur ('train' and 'inquire')
  • Went through textbooks to make sure my knowledge of the classifier algorithms is fresh

Week 1: April 23 - April 29 (Community Bonding Period Starts)

  • Registered for the Machine Learning course at Coursera
  • Set up the wiki, issue tracker, and other infrastructure on the GitHub page

  • Determine Performance Evaluation Criteria - Issue #1
  • Prepare Datasets for The Random Forest Algorithm - Issue #2
  • Classification/Regression or Both! - Issue #3
  • Make a List of Reusable Classes/Functions from Mothur - Issue #4
  • Continued the discussion on all open issues

Week 2: April 30 - May 6

  • Find Mothur's Common Practices - Issue #5
  • Find a Way to Merge 'train' and 'inquire' Commands Into One Single Command - Issue #6
  • Investigate Mothur's Multithreading/Multiprocessing API - Issue #7
  • Investigate The Possible Places of Parallelization in the Random Forest Classifier - Issue #8
  • Create Code Schematics/Pseudocode for the Random Forest Implementation - Issue #9
  • Investigate the Best Practices for Parameter Tuning/Estimation for Random Forest - Issue #10
  • Continued the discussion on all open issues

Important Note: There were some major revelations during this week. Following the discussion in Issue #3, we came to realize that the initial problem statement falls under neither classification nor regression; what we are seeking is actually a feature selection problem. Therefore, there will be many changes to the project plan in the upcoming weeks, as we may need to learn a lot of new material and repurpose our timeline accordingly.

Week 3: May 7 - May 13

  • Created a shared folder on a file-sharing service (Dropbox), shared it with mentors Kathryn and Sarah, and put together all the papers/resources we come across.

  • Review the Literature of 'Parameter Selection' and Create a wiki Page to Document the Findings - Issue #11

  • Track Upstream Mothur Branch and Update Our Master Branch from That - Issue #12
  • Review the article titled A Review of Feature Selection Techniques in Bioinformatics (Yvan Saeys, Iñaki Inza, Pedro Larrañaga - 2007) as per Issue #11

  • Continued the discussion on all open issues

Week 4: May 14 - May 20

  • Review the article titled An Introduction to Variable and Feature Selection (Isabelle Guyon, André Elisseeff - 2003) as per Issue #11

  • Review the article titled A Review of Feature Selection Methods on Synthetic Data (Verónica Bolón-Canedo, Noelia Sánchez-Maroño, Amparo Alonso-Betanzos - 2012) as per Issue #11
  • Take a Deeper Review on the Paper "Feature Selection via Regularized Trees" - Issue #13
  • Continued the discussion on all open issues

Important Note: As we realized in Week 2 that what we are looking for is neither classification nor regression but rather feature selection, we spent most of Weeks 3 and 4 reviewing the feature selection literature. Since our knowledge of feature selection was not as deep as our knowledge of classification or regression, we started with survey papers (both recent and older) to learn which algorithms would suit us best. We discussed variable ranking versus variable subset selection with respect to our problem set, and choosing among filter, wrapper, and embedded methods. In the end we decided to go with Regularized Random Forest (RRF), keeping a few things in mind. First, it allows us to recycle most of the project plans we drew up earlier, with only minor variations, since RRF is a variant of the basic Random Forest algorithm. Second, this embedded method can be cleverly converted into a wrapper method with the help of an optimizer such as a genetic algorithm or ant colony optimization. The paper describing RRF is recent (2012), so it can be considered state of the art. Most of the decisions/discussions can be followed on the page for Issue #11.
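
As a rough illustration of why RRF lets us recycle the Random Forest plans, its regularized split criterion can be sketched in a few lines. This is illustrative Python, not mothur code (mothur is C++), and all names here are ours: a feature outside the currently selected set only wins a split when its penalized gain still beats every already-selected feature.

```python
def regularized_gain(gain, feature, selected, penalty=0.8):
    """RRF-style penalized gain: a feature not yet in the selected
    set has its information gain multiplied by a coefficient in
    (0, 1], so a new feature is chosen only when it is clearly
    better than the features we already use."""
    return gain if feature in selected else penalty * gain

def choose_split(candidate_gains, selected, penalty=0.8):
    """Pick the feature with the highest regularized gain among the
    candidates (a dict of feature -> raw gain) and record it in the
    selected set, which grows as the forest is built."""
    best = max(candidate_gains,
               key=lambda f: regularized_gain(candidate_gains[f], f,
                                              selected, penalty))
    selected.add(best)
    return best
```

The penalty coefficient plays the role of RRF's regularization parameter: the smaller it is, the harder it is for new features to enter, and the smaller the final feature subset.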

Milestone 1 Reached: Preliminary investigation complete

Week 5: May 21 - May 27 (Start of Official Coding)

  • Prepare Datasets for The Random Forest Algorithm - Issue #2
  • Week 5: Implement the Basic Building Blocks of Random Forest Algorithm - Issue #14

Week 6: May 28 - June 3

  • Week 6: Implement the Core Part of Random Forest Algorithm - Issue #15

Week 7: June 4 - June 10

  • Week 7: Finish the Implementation of Random Forest Algorithm - Issue #16

Week 8: June 11 - June 17

  • Week 8: Start of Parameter Tuning & Preliminary Evaluation of the Implementation - Issue #17
  • Choosing a Good Random Number Generator (RNG) a.k.a "Treading Into the Mucky Waters of RNG" - Issue #18
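
Since bootstrap sampling leans heavily on the random number generator, Issue #18 matters more than it may appear. A minimal illustration of the practice we are aiming for (sketched in Python, whose random module happens to be a Mersenne Twister; mothur itself is C++, and this is our own sketch rather than mothur code): give each tree a dedicated, explicitly seeded generator instead of sharing global RNG state.

```python
import random

def bootstrap_indices(n, seed):
    """Draw a bootstrap sample (n draws with replacement from
    range(n)) using a dedicated, seeded generator rather than the
    shared module-level RNG, so each tree's sample is reproducible
    and independent of any other code that also draws randoms."""
    rng = random.Random(seed)  # Mersenne Twister instance
    return [rng.randrange(n) for _ in range(n)]
```

With per-tree seeds derived from one master seed, whole runs become repeatable, which is exactly what parameter tuning and debugging in Issue #17 need.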

Milestone 2 Reached: We have a working Random Forest implementation that can work with microbial ecology data
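
For readers unfamiliar with the algorithm, the core loop the milestone refers to can be summarized by a deliberately tiny sketch. This is illustrative Python with one-level stumps standing in for full decision trees; mothur's implementation is in C++, and every name below is ours, not mothur's API. The three Random Forest ingredients are all present: bootstrap the samples, train each tree on a randomly chosen feature, and predict by majority vote.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(X, y, f):
    """One-level 'tree' on feature f: split at the mean value and
    predict the majority class on each side (falling back to the
    overall majority if a side is empty)."""
    thresh = sum(row[f] for row in X) / len(X)
    left = [y[i] for i, row in enumerate(X) if row[f] <= thresh] or y
    right = [y[i] for i, row in enumerate(X) if row[f] > thresh] or y
    return (f, thresh, majority(left), majority(right))

def train_forest(X, y, n_trees=101, seed=0):
    """Toy random forest: each stump is trained on a bootstrap
    sample of the rows and a randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        forest.append(train_stump(Xb, yb, rng.randrange(len(X[0]))))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [(l if row[f] <= t else r) for f, t, l, r in forest]
    return majority(votes)
```

In the real implementation the stumps are full CART-style trees and the feature subset per split (mtry) is tunable, but the bootstrap-then-vote structure is the same.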

Week 9: June 18 - June 24

  • Week 9: Implement Regularized Random Forest Framework on top of Current Random Forest Implementation and Tune Parameters - Issue #19

Week 10: June 25 - July 1

Week 11: July 2 - July 8

Mid-term evaluation (from July 9 to July 13)


Week 12: July 9 - July 15

Week 13: July 16 - July 22

Week 14: July 23 - July 29

Milestone 3 Reached

Week 15: July 30 - August 5

Week 16: August 6 - August 12

August 13: Suggested ‘Pencils Down’


Week 17: August 13 - August 19

August 20: Firm ‘Pencils Down’


Final Evaluation


Further work after Google Summer of Code