Blog

  • This project will be executed in three distinct phases from September 2025 to April 2026. The plan is designed to systematically explore, develop, and evaluate a deep learning-based approach for genetic variant detection in polyploid plants.

    Phase 1: Literature Review, Data Preparation, and Baseline Evaluation (September 2025 – October 2025)

    1.1 Systematic Literature Review:

    Conduct an in-depth review of the complexities of polyploid genomes, including the challenges posed by highly similar homologous and homoeologous gene copies and by genomic instability, and how these complicate variant detection.

    Survey state-of-the-art computational methods for variant calling, with a focus on both deep learning approaches (e.g., DeepVariant, Clair3) and traditional statistical methods (e.g., GATK).

    1.2 Data Collection and Preprocessing:

    Identify, collect, and organize public Whole-Genome Sequencing (WGS) datasets for representative polyploid plants (e.g., wheat, cotton, or potato).

    Establish a standardized data preprocessing pipeline. This will involve aligning raw sequencing reads to the reference genome using tools like BWA or Minimap2, followed by sorting, indexing, and duplicate removal using SAMtools.
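
    The preprocessing steps above could be sketched as follows. This is a minimal sketch, assuming paired-end short reads aligned with BWA (Minimap2 would replace the alignment step for long reads). The function name and all file names are hypothetical placeholders, and the script only builds the command strings without running them.

```python
import shlex


def build_preprocessing_commands(reference, reads_1, reads_2, sample, threads=4):
    """Return shell commands for alignment, sorting, duplicate removal, and indexing.

    All paths are illustrative placeholders; nothing is executed here.
    """
    ref, r1, r2 = (shlex.quote(path) for path in (reference, reads_1, reads_2))
    sorted_bam = shlex.quote(f"{sample}.sorted.bam")
    dedup_bam = shlex.quote(f"{sample}.dedup.bam")
    return [
        # Build the BWA index (needed once per reference genome).
        f"bwa index {ref}",
        # Align, add the mate tags that markdup relies on, then coordinate-sort.
        f"bwa mem -t {threads} {ref} {r1} {r2}"
        f" | samtools fixmate -m - -"
        f" | samtools sort -@ {threads} -o {sorted_bam} -",
        # Remove duplicates (-r drops them instead of only flagging them).
        f"samtools markdup -r {sorted_bam} {dedup_bam}",
        # Index the final BAM so downstream tools can fetch regions quickly.
        f"samtools index {dedup_bam}",
    ]


commands = build_preprocessing_commands(
    "ref.fa", "s_R1.fastq.gz", "s_R2.fastq.gz", "sample1"
)
for command in commands:
    print(command)
```

    Keeping the commands as data makes it easy to log exactly what was run for each sample, which helps with the reproducibility goal in Phase 3.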

    1.3 Baseline Performance Evaluation:

    Apply a state-of-the-art variant caller designed for diploid genomes (e.g., Clair3) to the preprocessed polyploid datasets.

    Systematically record its performance metrics, including precision, recall, and F1 score, to establish a clear performance baseline for subsequent model comparisons.
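
    For reference, the three metrics can be computed from counts of true positives, false positives, and false negatives, as in the sketch below. The counts in the example are made up for illustration; in practice they would come from comparing the called variants against a trusted truth set.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from variant-level counts."""
    called = true_positives + false_positives   # everything the caller reported
    actual = true_positives + false_negatives   # everything in the truth set
    precision = true_positives / called if called else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts, not real results.
precision, recall, f1 = precision_recall_f1(
    true_positives=90, false_positives=10, false_negatives=30
)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```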

    Phase 2: Deep Learning Model Design, Development, and Preliminary Training (November 2025 – January 2026)

    2.1 Model Architecture Exploration and Design:

    Design and implement several deep learning models. Initially, the effectiveness of established architectures will be assessed:

    CNN-based models: Convert read alignments into pileup images so that convolutional layers, which are strong at image recognition, can capture the visual signatures of SNPs and indels.

    RNN/LSTM-based models: Treat sequencing reads as sequences and capture long-range dependencies along them.

    Building on this, explore novel or hybrid architectures tailored to polyploid data, for example by integrating attention mechanisms or Transformers to better distinguish subtle differences between homologous and homoeologous gene copies.
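
    To make the pileup idea behind the CNN-based models concrete, here is a deliberately simplified sketch. It assumes gapless alignments given as hypothetical (start position, read sequence) pairs and encodes only per-position base counts. Production callers use richer channels such as base quality and strand, so this is an illustration and not the encoding of any existing tool.

```python
BASES = "ACGT"


def pileup_counts(reads, window_start, window_length):
    """Count A/C/G/T at each reference position inside a window.

    `reads` is a list of (start_position, sequence) pairs, assumed gapless.
    """
    counts = [[0] * len(BASES) for _ in range(window_length)]
    for read_start, sequence in reads:
        for offset, base in enumerate(sequence):
            column = read_start + offset - window_start
            if 0 <= column < window_length and base in BASES:
                counts[column][BASES.index(base)] += 1
    return counts


def to_frequencies(counts):
    """Normalize each column so that sequencing depth does not dominate."""
    frequencies = []
    for column in counts:
        depth = sum(column)
        frequencies.append([c / depth if depth else 0.0 for c in column])
    return frequencies


# Toy reads: four overlapping reads, one carrying a T instead of G at position 102.
reads = [(100, "ACGTA"), (101, "CGTAC"), (102, "TTACG"), (102, "GTACG")]
counts = pileup_counts(reads, window_start=100, window_length=6)
frequencies = to_frequencies(counts)
print(frequencies[2])  # base frequencies at position 102, ordered A, C, G, T
```

    In a tetraploid, an allele present on one of four chromosome copies is expected at a read fraction near 0.25 rather than 0.5, which is one reason a caller tuned for diploid genomes can struggle with such sites.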

    2.2 Model Training and Tuning:

    Train the designed models on the prepared datasets.

    Conduct systematic hyperparameter tuning (e.g., learning rate, batch size, network depth) and use cross-validation to estimate how well the models generalize.
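
    One way the tuning loop might be organized is a grid over the hyperparameters combined with k-fold cross-validation. The sketch below only enumerates configurations and index splits. The specific values in the search space are placeholders rather than planned settings, and the training call itself is left out.

```python
import itertools
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield training, validation


# Placeholder search space; the values are illustrative only.
search_space = {
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [32, 64],
    "network_depth": [4, 8],
}
configurations = [
    dict(zip(search_space, values))
    for values in itertools.product(*search_space.values())
]
splits = list(k_fold_splits(n_samples=10, k=5))
print(len(configurations), "configurations x", len(splits), "folds")
```

    Each configuration would then be trained once per fold and scored on the held-out indices, with the mean validation score used to rank configurations.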

    2.3 Mid-term Progress Summary:

    Compile and submit a mid-term report. This report will document the literature review, the data processing workflow, the results of the baseline evaluation, and the preliminary performance of the newly developed deep learning models.

    Phase 3: Model Optimization, Comprehensive Validation, and Thesis Writing (February 2026 – April 2026)

    3.1 In-depth Optimization and Validation:

    Based on the mid-term results, select the best-performing model architecture for further in-depth optimization. This may involve incorporating more complex network modules, refining the loss function, or employing advanced training strategies.

    Conduct extensive performance validation on more diverse datasets or under varying conditions (e.g., different sequencing depths) to test the model’s robustness and application boundaries.
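
    Lower sequencing depths could be simulated by randomly subsampling an existing alignment file. The sketch below builds `samtools view -s` commands, where the integer part of the argument is the random seed and the fractional part is the proportion of reads to keep. The file names, depths, and helper names are hypothetical, and the commands are only constructed, not run.

```python
def subsample_argument(current_depth, target_depth, seed=42):
    """Build the SEED.FRACTION value expected by `samtools view -s`."""
    if current_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    if target_depth >= current_depth:
        raise ValueError("target depth must be below the current depth")
    fraction = target_depth / current_depth
    # Keep four decimal digits and stay strictly between 0 and 1.
    digits = min(9999, max(1, round(fraction * 10000)))
    return f"{seed}.{digits:04d}"


def downsample_commands(bam, sample, current_depth, target_depths, seed=42):
    """Return one subsampling command per target depth."""
    commands = []
    for depth in target_depths:
        argument = subsample_argument(current_depth, depth, seed)
        output = f"{sample}.{depth}x.bam"
        commands.append(f"samtools view -b -s {argument} -o {output} {bam}")
    return commands


# Hypothetical example: a 60x sample reduced to 30x and 15x.
depth_commands = downsample_commands("sample1.dedup.bam", "sample1", 60, [30, 15])
for command in depth_commands:
    print(command)
```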

    3.2 Performance Comparison and Analysis:

    Perform a comprehensive comparison of the final, optimized model against the baseline established in Phase 1 (e.g., Clair3) and other established variant calling methods.

    Employ statistical methods to analyze the performance differences between models on various types of variants (e.g., homozygous/heterozygous SNPs, complex Indels) and provide an interpretation of these differences from a model architecture perspective.
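
    As one candidate for the statistical comparison, an exact McNemar test can be applied when two callers are evaluated on the same truth sites. It uses only the discordant sites, where exactly one of the two models is correct. The choice of test is an assumption made for illustration and not a decision of the plan, and the counts below are invented.

```python
from math import comb


def mcnemar_exact_p(only_a_correct, only_b_correct):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant site is equally likely
    to favour either model, so the counts follow Binomial(n, 0.5).
    """
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0
    smaller = min(only_a_correct, only_b_correct)
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts per variant type: (new model only, baseline only).
discordant = {
    "heterozygous SNP": (15, 5),
    "complex indel": (7, 7),
}
p_values = {
    variant_type: mcnemar_exact_p(*pair) for variant_type, pair in discordant.items()
}
for variant_type, p_value in p_values.items():
    print(f"{variant_type}: p = {p_value:.4f}")
```

    Running the test separately per variant type matches the stratified analysis described above, although testing several strata would call for a multiple-comparison correction.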

    3.3 Final Report and Thesis Writing:

    Consolidate all research findings into the final project report/thesis. The thesis will be structured to include the research background, related works, methodology, experimental design, results, discussion, and conclusion.

    Organize all code, data, and experimental workflows, preparing detailed documentation or a public repository (e.g., on GitHub) to ensure the reproducibility of the research.

    Prepare a presentation for the final project defense and draft a manuscript for potential publication in a scientific journal.

    Final Presentation
