Week 1 – Data Collection and Pre-processing

I’m building NLP tools that detect adverse drug reactions (ADRs) and the surrounding sentiment from social-media posts.

Task	Checklist
Selected four datasets: two patient forums (PsyTAR, CADEC‐v2), Reddit (collected via the Reddit API and self-labelled), and Twitter (SMM4H-20).	✔
Created preprocessing scripts: • `classic_preprocess` – heavy cleaning, stemming, tokenisation (for traditional ML). • `preprocess_transformer` – minimal cleaning, preserving negations, numbers, and drug terms (for deep-learning models).	✔
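
To make the contrast between the two scripts concrete, here is a minimal sketch of what they might look like. Only the two function names come from the checklist above; the stopword list, the suffix rules, the `toy_stem` helper, and the `<url>`/`<user>` placeholders are assumptions for illustration. The toy stemmer in particular is a crude stand-in for a real one (such as a Porter stemmer), not the actual implementation.

```python
import re

# Hypothetical stopword list and suffix rules (assumptions, not the real config).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "was", "it", "i", "my"}
SUFFIXES = ("ing", "ed", "es", "s")

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def toy_stem(token: str) -> str:
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def classic_preprocess(text: str) -> list[str]:
    """Heavy cleaning for traditional ML: lowercase, drop URLs, mentions,
    digits and punctuation, remove stopwords, then stem each token."""
    text = URL_RE.sub(" ", text.lower())
    text = MENTION_RE.sub(" ", text)
    tokens = re.findall(r"[a-z]+", text)  # keeps alphabetic runs only
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


def preprocess_transformer(text: str) -> str:
    """Minimal cleaning for transformer models: mask URLs and mentions,
    collapse whitespace, and keep case, negations, numbers and drug names."""
    text = URL_RE.sub("<url>", text)
    text = MENTION_RE.sub("<user>", text)
    return re.sub(r"\s+", " ", text).strip()


post = "@doc I didn't sleep after taking 20mg Lexapro, feeling dizzy http://t.co/x"
classic_tokens = classic_preprocess(post)
transformer_text = preprocess_transformer(post)
```

On this made-up post, the classic path splits "didn't" into fragments and drops the dose, while the transformer path leaves "didn't", "20mg", and "Lexapro" intact, which is the reason for keeping two separate scripts.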

Reddit data was straightforward to collect; Twitter required careful handling due to stricter access terms.