Week 2 – Baseline Model Development
20 Jun 2025 - Hirra Asif
This week I added COVID-19 sentiment 🦠📱 alongside adverse drug reaction (ADR) detection 💊🫨 to my dataset pipelines. I chose these two topics because the pandemic offers a large, accessible source for sentiment analysis datasets with reliable labels, especially around topics like long-COVID trends, while ADR posts deliver crucial early signals for pharmacovigilance, as mentioned last week.
Data Pipeline & Splits
I concatenated multiple cleaned datasets into a single DataFrame, then applied a random, stratified 80/20 train/test split (using random_state=42) so that both the training and testing sets include examples from every source.
I also tested the best models on entirely separate datasets to assess overfitting and generalisability.
Task | Training & Test (80/20) | External Validation |
---|---|---|
COVID sentiment | Instagram, Tweets, Reddit API | A different dataset of tweets |
ADR detection | PsyTAR (annotated patient forum posts), Reddit API | CADEC-v2 (annotated patient forum posts) |
⭐ These pipeline splits ensure that models only see part of each source during training and are then challenged on completely new data during external evaluation.
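As a rough sketch of this step (the file paths and the text/label column names below are placeholders, not the actual pipeline code):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical cleaned source files; the real file and column names differ.
sources = ["instagram_clean.csv", "tweets_clean.csv", "reddit_clean.csv"]
df = pd.concat([pd.read_csv(path) for path in sources], ignore_index=True)

# Stratify on the label so the 80/20 split keeps the class mix intact;
# with a random split, every source ends up represented in both sets.
train_df, test_df = train_test_split(
    df, test_size=0.20, stratify=df["label"], random_state=42
)
```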
- For ADR detection, I used a comprehensive drug term list and minimal tokenisation to preserve medication names.
- For COVID sentiment, I applied more cleaning (lowercasing, stopword removal, stemming) and handled emojis as sentiment cues to reduce noise.
These steps ensured each task received the right level of text normalisation; a rough sketch of the two cleaning paths follows below.
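A minimal sketch of what the two paths could look like (the emoji map and the exact cleaning rules are illustrative assumptions; the comprehensive drug term list used for ADR matching is not shown):

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()
EMOJI_CUES = {"😊": " emo_pos ", "😡": " emo_neg "}  # placeholder emoji-to-token map

def clean_adr(text: str) -> str:
    # Minimal tokenisation: only collapse whitespace, so medication names,
    # casing and hyphens survive for the drug-term matching step.
    return re.sub(r"\s+", " ", text).strip()

def clean_sentiment(text: str) -> str:
    # Heavier cleaning: map emojis to sentiment tokens, lowercase,
    # drop stopwords and stem the remaining tokens.
    for emoji, token in EMOJI_CUES.items():
        text = text.replace(emoji, token)
    tokens = re.findall(r"[a-z_']+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS)
```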
Baseline models trained
I built three pipelines using TF-IDF vectorisation and tuned hyperparameters with 5-fold GridSearchCV (a code sketch follows the list below):

⭐ GridSearchCV tests different hyperparameter values with cross-validation to find settings that best balance bias and variance.
- Naïve Bayes (α = 0.01, 0.1, 1.0, 5.0)
- Logistic Regression (C = 0.01, 0.1, 1, 10; with class weighting)
- Random Forest (n_estimators = 100, 200; max_depth = None, 10; min_samples_split = 2)
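A sketch of this setup, reusing the placeholder column names from the split above (the pipeline structure itself is an assumption, but the grids match the ones listed):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# One TF-IDF + classifier pipeline per model, with the grids listed above.
configs = {
    "nb": (MultinomialNB(), {"clf__alpha": [0.01, 0.1, 1.0, 5.0]}),
    "lr": (LogisticRegression(max_iter=1000, class_weight="balanced"),
           {"clf__C": [0.01, 0.1, 1, 10]}),
    "rf": (RandomForestClassifier(random_state=42),
           {"clf__n_estimators": [100, 200],
            "clf__max_depth": [None, 10],
            "clf__min_samples_split": [2]}),
}

best_models = {}  # name -> fitted best estimator, reused for every later evaluation
for name, (clf, grid) in configs.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    search = GridSearchCV(pipe, grid, cv=5, n_jobs=-1,
                          scoring="roc_auc_ovr")  # "roc_auc" for the binary ADR task
    search.fit(train_df["text"], train_df["label"])
    print(name, search.best_params_)
    best_models[name] = search.best_estimator_
```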
I recorded each model’s optimal hyperparameters and stored the corresponding best model in a dictionary, ensuring that every evaluation reused the exact same models for reproducibility.
Performance: Macro-AUC
Before summarising, I plotted ROC curves to inspect the true-positive vs. false-positive trade-offs. Below are the macro-averaged ROC-AUC scores for the train/test split vs. the external dataset tests:
COVID sentiment (one-vs-rest)
Model | Train/Test Split | Test only: Twitter |
---|---|---|
Naïve Bayes | 0.84 | 0.75 |
Logistic Regression | **0.92** | **0.82** |
Random Forest | **0.92** | 0.78 |
ADR detection (binary)
Model | Train/Test Split | Test only: CADEC-v2 |
---|---|---|
Naïve Bayes | 0.86 | 0.84 |
Logistic Regression | **0.89** | 0.89 |
Random Forest | **0.89** | **0.90** |
*Bold indicates the top performer in each column. ADR's AUC stays steady on CADEC-v2, while the sentiment dip on Twitter could indicate overfitting.
#### Below are the macro-averaged ROC curves for the train/test split:
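The plotting code isn't reproduced verbatim here, but a minimal sketch of how the macro-averaged one-vs-rest score and the per-class curves can be computed with scikit-learn (for the sentiment task, using the same placeholder column names as above) looks like this:

```python
from sklearn.metrics import RocCurveDisplay, roc_auc_score
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt

model = best_models["lr"]                       # any fitted pipeline from the dictionary
y_score = model.predict_proba(test_df["text"])  # class probabilities on the held-out 20%

# Macro-averaged one-vs-rest ROC-AUC, as reported in the tables above.
macro_auc = roc_auc_score(test_df["label"], y_score,
                          multi_class="ovr", average="macro")

# One ROC curve per sentiment class, drawn on shared axes.
classes = model.classes_
y_true_bin = label_binarize(test_df["label"], classes=classes)
fig, ax = plt.subplots()
for i, cls in enumerate(classes):
    RocCurveDisplay.from_predictions(y_true_bin[:, i], y_score[:, i],
                                     name=str(cls), ax=ax)
ax.set_title(f"One-vs-rest ROC curves (macro AUC = {macro_auc:.2f})")
plt.show()
```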
Key takeaways 🥡:
- Logistic Regression offered the best mix of accuracy and stability, with minimal drop from train/test to external sets.
- Random Forest matched LR on ADR and even led on the CADEC test, but didn't perform as well for sentiment on new tweets.
- Naïve Bayes was quick to train but didn't perform as well as the other models.
- Class imbalance in ADR detection: most forum posts report reactions, so there are very few non-ADR examples; oversampling the minority class with SMOTE could help address this (see the sketch after this list).
- The more pronounced AUC decline on COVID sentiment highlights how rapidly changing slang and topic shifts can undermine model performance.
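If I go that route, one option (an assumption on my part, not something I have run yet) is the imbalanced-learn implementation of SMOTE placed inside the pipeline, so oversampling happens only on the training folds:

```python
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# SMOTE sits between the vectoriser and the classifier, so synthetic minority
# examples are generated from each cross-validation training fold only and
# never leak into the held-out or external test data.
pipe = ImbPipeline([
    ("tfidf", TfidfVectorizer()),
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])
```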
Challenges 🤔❓
- Balancing two label schemes (binary ADR vs. sentiment) required custom evaluation scripts.
- Hyperparameter searches via GridSearchCV introduce long training times and require more compute.
Summary
Completed this week:
✅ Added COVID-19 sentiment to the pipeline alongside ADR detection and confirmed that shared TF-IDF features work well for both social media and clinical terminology.
✅ Constructed and tuned three baseline pipelines (Naïve Bayes, Logistic Regression, Random Forest) with hyperparameter search, saving best models.
✅ Validated performance with macro averaged ROC-AUC on the train/test split and truly external datasets (new tweets and CADEC-v2), showing minimal overfitting.
Up next:
➡️ 🧠 Deep Learning phase of the project: implementing Transformer-based models (BERT and RoBERTa), training them on the cleaned dataset, and tuning hyperparameters for optimal performance.