Machine Learning for Muscular Dystrophy Diagnosis

This project utilized the NCBI Gene Expression Omnibus (GEO) to evaluate the diagnostic power of classical machine learning methods for muscular dystrophy classification.

IndependentBioinformaticsData Pre-ProcessingMachine LearningNumPyGene Expression Omnibus (GEO)
Machine Learning for Muscular Dystrophy Diagnosis cover

OVERVIEW

This project applies classical machine learning, preprocessing, and feature selection to gene-expression data from the NCBI Gene Expression Omnibus (GEO) to evaluate which models are best suited for diagnosing Duchenne’s Muscular Dystrophy. A core theme of the work is treating differentially expressed genes as molecular biomarkers, using statistical filtering and feature selection to isolate genes whose expression patterns most strongly align with pathological status, then training supervised models on those biomarker signals.

WHAT I DID

  • Identified and curated two compatible gene-expression microarray datasets (GSE3307, GSE6011), ensuring overlap in gene features for valid dataset merging.
  • Trained and evaluated supervised models (Logistic Regression, Naive Bayes, KNN, Random Forest) using an 80/20 train-test split.
  • Implemented a reusable evaluation function to compute accuracy, precision, recall, and F1 score, enabling systematic comparison.
  • Authored the full research manuscript, documenting model performance and methodological limitations.

RESULTS / IMPACT

  • Achieved 90%+ accuracy using Logistic Regression and Random Forest, with Random Forest providing the strongest balance of precision, recall, and F1 score.
  • Identified a subset of genes with statistically significant associations to muscular dystrophy status, reinforcing known pathways and surfacing candidates for follow-up study.
  • Produced a reproducible ML pipeline adaptable to other genetically driven neuromuscular disorders using GEO.
  • Published the work as a public preprint, contributing an accessible end-to-end example of applied ML in genomic diagnostics.

LESSONS + NEXT STEPS

  • Learned how preprocessing and feature selection shape biological ML pipelines, including extracting genomic data from GEO.
  • Extend the pipeline to RNA-seq data and evaluate modern deep learning architectures for diagnostic performance.
  • Explore other genetic disorders with publicly accessible genomic datasets.

Scroll through the full paper directly below.

If the preview doesn't load, open the PDF or view the full paper.