
In2Research Journeys


Week 2 – Baseline Model Development

20 Jun 2025 - Hirra Asif - NLP, machine learning, social media

This week I focused on preparing the data splits, training baseline models, assessing their performance with multiple metrics, and identifying challenges to address before moving on to deep learning models.

Data Pipeline & Splits:

For each topic, the three datasets were individually shuffled and split 80/20 into training and testing sets using a stratified split (random_state=42). They were also combined, shuffled and split 80/20 to create training and testing sets containing examples from all sources.
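As a concrete reference, here is a minimal sketch of one such split using scikit-learn's train_test_split; the file name and the "text"/"label" column names are placeholders rather than the project's actual ones:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for one topic's dataset, with "text" and "label" columns.
df = pd.read_csv("dataset.csv")

# Stratifying on the label keeps class proportions equal across the two splits;
# random_state=42 makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"],
    df["label"],
    test_size=0.2,
    stratify=df["label"],
    random_state=42,
)
```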


Baseline models trained:

I built three pipelines that perform feature extraction with TF-IDF vectorisation followed by Chi-Square feature selection (a sketch appears after the list below). Four models were tested:

⭐ Multinomial Naïve Bayes – a probabilistic model based on Bayes’ theorem that assumes feature independence, well suited to discrete features such as term frequencies.

⭐ XGBoost – a gradient boosting framework that builds an ensemble of decision trees in sequence, each one trained to minimise the errors made by the previous trees.

⭐ Logistic Regression – a linear model that estimates class probabilities using the logistic function.

⭐ Random Forest – an ensemble of decision trees trained on bootstrapped samples with random feature selection at each split.
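
Here is the pipeline sketch promised above, shown with Multinomial Naïve Bayes as the final step and continuing from the split sketch; the vectoriser defaults and k=5000 are illustrative assumptions, not the project's actual settings:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),         # sparse TF-IDF term weights
    ("chi2", SelectKBest(chi2, k=5000)),  # keep the k terms most associated with the labels
    ("clf", MultinomialNB()),             # swap in XGBoost, Logistic Regression or Random Forest here
])

pipeline.fit(X_train, y_train)  # X_train / y_train from the split sketch above
```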

Hyperparameters were tuned using GridSearchCV, which evaluates every parameter combination in a grid with 5-fold cross-validation, averaging performance across folds to balance bias and variance. For each model, the optimal hyperparameters were recorded and the corresponding best model was saved.
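
A minimal sketch of the tuning step over the pipeline above; this particular parameter grid, the f1_macro scoring choice, and the output file name are illustrative assumptions:

```python
import joblib
from sklearn.model_selection import GridSearchCV

# Keys follow scikit-learn's "<step>__<parameter>" convention for pipelines.
param_grid = {
    "chi2__k": [1000, 5000, 10000],
    "clf__alpha": [0.1, 0.5, 1.0],  # Naïve Bayes smoothing strength
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro", n_jobs=-1)
search.fit(X_train, y_train)

print(search.best_params_)                                # record the optimal hyperparameters
joblib.dump(search.best_estimator_, "best_model.joblib")  # save the corresponding best model
```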


Performance:

Model performance was evaluated using ROC AUC, precision, recall, balanced accuracy, F1-score, and specificity. For the multiclass COVID-19 task, all metrics were calculated using the macro averaging method.
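
As a sketch of how these metrics could be computed for the multiclass task with scikit-learn, continuing from the tuning sketch above; specificity is not built into scikit-learn, so deriving it per class from the confusion matrix is my assumption about the exact method:

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)

y_pred = search.best_estimator_.predict(X_test)
y_prob = search.best_estimator_.predict_proba(X_test)

print("precision :", precision_score(y_test, y_pred, average="macro"))
print("recall    :", recall_score(y_test, y_pred, average="macro"))
print("F1        :", f1_score(y_test, y_pred, average="macro"))
print("bal. acc. :", balanced_accuracy_score(y_test, y_pred))

# For the binary ADR task, pass y_prob[:, 1] and drop multi_class instead.
print("ROC AUC   :", roc_auc_score(y_test, y_prob, multi_class="ovr", average="macro"))

# Macro specificity: per-class TN / (TN + FP), averaged over classes.
cm = confusion_matrix(y_test, y_pred)
fp = cm.sum(axis=0) - np.diag(cm)
tn = cm.sum() - (cm.sum(axis=1) + fp)
print("specificity:", np.mean(tn / (tn + fp)))
```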


Challenges 🤔❓

  • Balancing two label schemes (binary ADR vs. multiclass sentiment) required custom evaluation scripts.
  • Hyperparameter searches via GridSearchCV introduce long training times and require more compute.
  • Class imbalance in ADR detection: most forum posts report reactions, leaving very few non-ADR examples. Oversampling the minority class with SMOTE could help address this (a sketch follows below).
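
Since SMOTE is only under consideration at this stage, here is a minimal sketch of how it might slot in using the imbalanced-learn library, applied to the vectorised training features only so the test set stays untouched:

```python
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer

# SMOTE interpolates between neighbouring minority-class samples, so it needs
# numeric features - vectorise the training texts first.
vectoriser = TfidfVectorizer()
X_train_vec = vectoriser.fit_transform(X_train)

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train_vec, y_train)
```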

➡️ Up next

The deep learning 🧠 phase of the project: implementing Transformer-based models (BERT and RoBERTa), training them on the cleaned dataset, and tuning hyperparameters for optimal performance.