Introduction to R and RStudio


  • Use RStudio to write and run R programs.
  • R has the usual arithmetic operators and mathematical functions.
  • Use <- to assign values to variables.
  • Use ls() to list the variables in a program.
  • Use rm() to delete objects in a program.
  • Use install.packages() to install packages (libraries).
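
The points above can be tried in a few lines at the R console. This is a minimal sketch; the variable names (`radius`, `area`, `root`) are made up for illustration, and the `install.packages()` call is left commented out because it needs internet access.

```r
# Arithmetic operators and mathematical functions
radius <- 2                 # assign a value with <-
area <- pi * radius^2       # exponentiation and a built-in constant
root <- sqrt(16)            # a mathematical function

print(area)
print(ls())                 # names of the objects defined so far

rm(radius)                  # delete one object
print(exists("radius", inherits = FALSE))

# install.packages("ggplot2")  # run once per machine; needs internet access
```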

Project Management With RStudio


  • Use RStudio to create and manage projects with consistent layout.
  • Treat raw data as read-only.
  • Treat generated output as disposable.
  • Separate function definition and application.

Seeking Help


  • Use help() to get online help in R.

Data Structures


  • Use read.csv to read tabular data in R.
  • The basic data types in R are double, integer, complex, logical, and character.
  • Data structures such as data frames or matrices are built on top of lists and vectors, with some added attributes.
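
As a small illustration of the last two points, `typeof()` reports the basic type of a value, and `attributes()` shows what a data frame or matrix adds on top of a plain list or vector. The objects `scores` and `m` are hypothetical examples, not data from the lesson.

```r
# The five basic data types
types <- c(
  double    = typeof(3.14),
  integer   = typeof(2L),
  complex   = typeof(1 + 4i),
  logical   = typeof(TRUE),
  character = typeof("banana")
)
print(types)

# A data frame is a list of equal-length vectors plus attributes
scores <- data.frame(id = 1:3, grade = c("a", "b", "c"))
print(typeof(scores))
print(attributes(scores))   # names, class and row.names

# A matrix is a vector plus a dim attribute
m <- matrix(1:6, nrow = 2)
print(typeof(m))
print(attributes(m))
```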

Exploring Data Frames


  • Use cbind() to add a new column to a data frame.
  • Use rbind() to add a new row to a data frame.
  • Remove rows from a data frame.
  • Use str(), summary(), nrow(), ncol(), dim(), colnames(), head(), and typeof() to understand the structure of a data frame.
  • Read in a csv file using read.csv().
  • Understand what length() of a data frame represents.
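
A minimal sketch of these operations on a small, made-up data frame (`cats` and its values are assumptions for illustration only). Note that `length()` of a data frame counts columns, not rows, because a data frame is a list of columns.

```r
cats <- data.frame(
  coat   = c("calico", "black", "tabby", "white"),
  weight = c(2.1, 5.0, 3.2, 4.4),
  stringsAsFactors = FALSE
)

# Add a column, then a row
cats <- cbind(cats, age = c(2, 3, 5, 7))
cats <- rbind(
  cats,
  data.frame(coat = "tortoiseshell", weight = 3.3, age = 9,
             stringsAsFactors = FALSE)
)

# Remove the second row with a negative index
cats <- cats[-2, ]

str(cats)
print(dim(cats))      # rows, then columns
print(length(cats))   # number of columns
```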

Subsetting Data


  • Indexing in R starts at 1, not 0.
  • Access individual values by location using [].
  • Access slices of data using [low:high].
  • Access arbitrary sets of data using [c(...)].
  • Use logical operations and logical vectors to access subsets of data.
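
The subsetting forms above can be compared side by side on a named vector. The vector `x` and its values are hypothetical.

```r
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

first  <- x[1]            # indexing starts at 1
middle <- x[2:4]          # a slice: elements 2 through 4
picked <- x[c(1, 5)]      # an arbitrary set of positions
large  <- x[x > 6]        # a logical vector selects matching elements
both   <- x[x > 6 & x < 7.5]  # logical operations can be combined

print(first)
print(middle)
print(picked)
print(large)
print(both)
```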

Creating Publication-Quality Graphics with ggplot2


  • Use ggplot2 to create plots.
  • Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.
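
One way to see the layers is to build a plot piece by piece. This sketch assumes the ggplot2 package is installed and uses R's built-in `mtcars` dataset as a stand-in for the lesson's data.

```r
library(ggplot2)

p <- ggplot(mtcars,
            aes(x = wt, y = mpg, colour = factor(cyl))) +  # aesthetics and grouping
  geom_point() +                                           # geometry
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # statistics
  scale_x_log10() +                                        # scale transformation
  labs(x = "Weight (1000 lbs, log scale)",
       y = "Miles per gallon",
       colour = "Cylinders")

print(p)
```

Mapping `colour` to `factor(cyl)` also groups the data, so `geom_smooth()` fits one line per cylinder count.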

Writing Data


  • Save plots from RStudio using the ‘Export’ button.
  • Use write.table to save tabular data.
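
A minimal sketch of saving a data frame with `write.table()` and reading it back. The data frame `results` is made up, and the file is written to a temporary directory so that nothing in a project folder is overwritten.

```r
results <- data.frame(
  sample = c("s1", "s2", "s3"),
  value  = c(0.5, 1.25, 2)
)

out_file <- file.path(tempdir(), "results.csv")

# sep = "," writes comma-separated values; row.names = FALSE omits the
# row-number column; quote = FALSE leaves strings unquoted
write.table(results, file = out_file, sep = ",",
            quote = FALSE, row.names = FALSE)

reloaded <- read.csv(out_file)
print(reloaded)
```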

Data Frame Manipulation with dplyr


  • Use the dplyr package to manipulate data frames.
  • Use select() to choose variables from a data frame.
  • Use filter() to choose data based on values.
  • Use group_by() and summarize() to work with subsets of data.
  • Use mutate() to create new variables.
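
The five verbs above chain together naturally. This sketch assumes the dplyr package is installed and uses the built-in `mtcars` dataset in place of the lesson's data; the column names `wt_kg`, `mean_mpg`, and `n_cars` are invented for the example.

```r
library(dplyr)

by_cyl <- mtcars %>%
  select(mpg, cyl, wt) %>%           # choose variables
  filter(wt < 4) %>%                 # choose rows based on values
  mutate(wt_kg = wt * 453.6) %>%     # new variable (wt is in 1000 lbs)
  group_by(cyl) %>%                  # work with subsets of the data
  summarize(mean_mpg = mean(mpg),
            n_cars   = n())

print(by_cyl)
```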

Data Frame Manipulation with tidyr


  • Use the tidyr package to change the layout of data frames.
  • Use pivot_longer() to go from wide to longer layout.
  • Use pivot_wider() to go from long to wider layout.
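
A round trip between the two layouts, as a minimal sketch. It assumes the tidyr package is installed; the data frame `wide` and its values are made up.

```r
library(tidyr)

wide <- data.frame(
  country  = c("A", "B"),
  pop_2002 = c(10, 20),
  pop_2007 = c(12, 23)
)

# Wide to long: one row per country-year combination
long <- pivot_longer(wide,
                     cols = starts_with("pop_"),
                     names_to = "year",
                     names_prefix = "pop_",
                     values_to = "pop")
print(long)

# Long back to wide: one column per year
wide_again <- pivot_wider(long,
                          names_from = year,
                          values_from = pop,
                          names_prefix = "pop_")
print(wide_again)
```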

Basic Statistics: describing, modelling and reporting (Describing data, Inferential statistics, Regression Modelling)


  • R has a range of in-built functions to enable initial data exploration.
  • Linear models (lm) can be used with continuous and categorical variables.
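
As an illustration of both points, the sketch below explores the built-in `mtcars` dataset (an assumption; the lesson's own data will differ) and then fits a linear model with one continuous predictor (`wt`) and one categorical predictor (`cyl`, converted to a factor).

```r
# Initial exploration with built-in functions
print(summary(mtcars$mpg))
print(table(mtcars$cyl))
print(cor(mtcars$mpg, mtcars$wt))

# Linear model with a continuous and a categorical predictor
cars <- mtcars
cars$cyl <- factor(cars$cyl)
fit <- lm(mpg ~ wt + cyl, data = cars)

print(summary(fit))
```

R codes the factor as indicator variables, so the model reports one coefficient for each cylinder level other than the reference level.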

Logistic Regression


  • Logistic regression models the log-odds of an event as a linear combination of one or more independent variables.

  • Binary logistic regression uses a single binary dependent variable, coded as an indicator variable with values labeled “0” and “1”, to model the probability of a certain class or event. In these examples, the event is antimicrobial resistance to a particular antibiotic.
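
A minimal sketch of a binary logistic regression with `glm()`. The data here are simulated purely for illustration: `prior_days` (days of prior antibiotic use) and `resistant` (0 or 1) are hypothetical variables, and the true coefficients are chosen arbitrarily.

```r
set.seed(1)

# Simulated data: NOT real resistance data
n <- 200
prior_days <- runif(n, min = 0, max = 10)
log_odds   <- -2 + 0.4 * prior_days            # linear on the log-odds scale
resistant  <- rbinom(n, size = 1, prob = plogis(log_odds))

fit <- glm(resistant ~ prior_days, family = binomial)

print(coef(fit))        # coefficients on the log-odds scale
print(exp(coef(fit)))   # the same coefficients as odds ratios

# Predicted probability of resistance after 5 days of prior use
p5 <- predict(fit, newdata = data.frame(prior_days = 5), type = "response")
print(p5)
```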

Broom


  • Broom can be used to create reusable outputs from various analyses, in the form of tibbles.
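
For example, assuming the broom package is installed, `tidy()` returns one row per model term and `glance()` returns a one-row summary of the whole model, both as tibbles. The model below uses the built-in `mtcars` dataset as a stand-in.

```r
library(broom)

fit <- lm(mpg ~ wt, data = mtcars)

coef_table <- tidy(fit)     # one row per term: estimate, std.error, ...
model_info <- glance(fit)   # one row: r.squared, AIC, ...

print(coef_table)
print(model_info)
```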

Producing Reports With Quarto


  • Mix reporting written in R Markdown with software written in R.
  • Specify chunk options to control formatting.
  • Use knitr to convert these documents into PDF and other formats.

Best Practices for Writing R Code


  • Start each program with a description of what it does.
  • Then load all required packages.
  • Consider what working directory you are in when sourcing a script.
  • Use comments to mark off sections of code.
  • Put function definitions at the top of your file, or in a separate file if there are many.
  • Name and style code consistently.
  • Break code into small, discrete pieces.
  • Factor out common operations rather than repeating them.
  • Keep all of the source files for a project in one directory and use relative paths to access them.
  • Keep track of the memory used by your program.
  • Always start with a clean environment instead of saving the workspace.
  • Keep track of session information in your project folder.
  • Have someone else review your code.
  • Use version control.
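
Several of these practices can be seen in the layout of a single short script. The sketch below is a hypothetical example (the script purpose, the function `mean_by_group()`, and the file name `session_info.txt` are all made up); it writes the session information to a temporary directory, where a real project would use a path inside the project folder.

```r
# Summarise fuel economy by cylinder count using the built-in mtcars data.

# ---- Packages ----
library(stats)   # placeholder: list every package the script needs here

# ---- Functions ----
# Mean of `values` within each level of `groups`
mean_by_group <- function(values, groups) {
  tapply(values, groups, mean)
}

# ---- Analysis ----
mpg_by_cyl <- mean_by_group(mtcars$mpg, mtcars$cyl)
print(mpg_by_cyl)

# ---- Housekeeping ----
print(object.size(mtcars), units = "Kb")          # memory used by an object
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)
```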

Introduction to Reproducibility


Automated Version Control


  • Version control is like an unlimited ‘undo’.
  • Version control also allows many people to work in parallel.

Setting Up Git


  • Use git config with the --global option to configure a user name, email address, editor, and other preferences once per machine.

Creating a Repository


  • git init initializes a repository.
  • Git stores all of its repository data in the .git directory.

Tracking Changes


  • Diff shows the status of a repository.
  • Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded).
  • Stage puts files in the staging area.
  • Commit saves the staged content as a new commit in the local repository.
  • Write a commit message that accurately describes your changes.

Exploring History


  • Diff displays differences between commits.
  • git restore recovers old versions of files.

Ignoring Things


  • The .gitignore file tells Git what files to ignore.

Remotes in GitHub


  • A local Git repository can be connected to one or more remote repositories.
  • Use the SSH protocol to connect to remote repositories.
  • git push copies changes from a local repository to a remote repository.
  • git pull copies changes from a remote repository to a local repository.

Collaborating


  • git clone copies a remote repository to create a local repository with a remote called origin automatically set up.

Conflicts


  • Conflicts occur when two or more people change the same lines of the same file.
  • The version control system does not allow people to overwrite each other’s changes blindly, but highlights conflicts so that they can be resolved.

Branches


  • Branches provide a safe way to experiment with new ideas or explore solutions to bugs and other issues within your files.
  • Pull Requests provide the mechanism for bringing changes from other branches into main with varying levels of oversight.

Issues


  • Issues can be used to plan, discuss, and track work.

Open Science


  • Open scientific work is more useful and more highly cited than closed work.

Licensing


  • The LICENSE, LICENSE.md, or LICENSE.txt file is often used in a repository to indicate how the contents of the repo may be used by others.
  • People who incorporate software licensed under the General Public License (GPL) into their own software must also release the derived software under the GPL if they decide to share it; most other open licenses do not require this.
  • The Creative Commons family of licenses allows people to mix and match requirements and restrictions on attribution, creation of derivative works, further sharing, and commercialization.
  • People who are not lawyers should not try to write licenses from scratch.

Citation


  • Add a CITATION file to a repository to explain how you want your work cited.

Hosting


  • Projects can be hosted on university servers, on personal domains, or on a public hosting service.
  • Rules regarding intellectual property and storage of sensitive information apply no matter where code and data are hosted.

SQL and R


  • SQL is a powerful language used to interrogate and manipulate relational databases.
  • You can interact with relational databases from within R.
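
One common way to do this is through the DBI package with a database-specific driver. The sketch below assumes DBI and RSQLite are installed and uses a throwaway in-memory SQLite database loaded with the built-in `mtcars` dataset; the table name `cars` is made up.

```r
library(DBI)

# Connect to a temporary in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a data frame into the database as a table
dbWriteTable(con, "cars", mtcars)

# Run an SQL query and get the result back as a data frame
by_cyl <- dbGetQuery(con, "
  SELECT cyl, AVG(mpg) AS mean_mpg
  FROM cars
  GROUP BY cyl
  ORDER BY cyl
")
print(by_cyl)

dbDisconnect(con)
```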

Writing Good Software


  • Keep your project folder structured, organized and tidy.
  • Document what and why, not how.
  • Break programs into short single-purpose functions.
  • Write re-runnable tests.
  • Don’t repeat yourself.
  • Be consistent in naming, indentation, and other aspects of style.
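
A short single-purpose function paired with a re-runnable test might look like the sketch below. The function `fahr_to_celsius()` and its test are hypothetical examples; base R's `stopifnot()` is used here, though a dedicated testing package would also work.

```r
# Convert temperatures from Fahrenheit to Celsius.
# The comment says what the function is for; the code shows how.
fahr_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

# A test that can be re-run after every change to the function
test_fahr_to_celsius <- function() {
  stopifnot(
    isTRUE(all.equal(fahr_to_celsius(32), 0)),
    isTRUE(all.equal(fahr_to_celsius(212), 100)),
    isTRUE(all.equal(fahr_to_celsius(c(32, 212)), c(0, 100)))
  )
  invisible(TRUE)
}

test_fahr_to_celsius()
```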

Supplemental - Assumption Diagnostics and Regression Troubleshooting