Introduction to R and RStudio
- Use RStudio to write and run R programs.
- R has the usual arithmetic operators and mathematical functions.
- Use `<-` to assign values to variables.
- Use `ls()` to list the variables in a program.
- Use `rm()` to delete objects in a program.
- Use `install.packages()` to install packages (libraries).
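A minimal sketch of the points above, using made-up variable names and values (`mass_kg`, `age_years`) purely for illustration:

```r
# Assign values to variables with <-
mass_kg <- 47.5
age_years <- 122

# Arithmetic operators and mathematical functions work on the stored values
mass_lb <- mass_kg * 2.2
root_age <- sqrt(age_years)

# ls() lists the objects defined so far
before <- ls()
print(before)

# rm() deletes an object by name; it no longer appears in ls()
rm(root_age)
after <- ls()
print(after)
```

Installing a package with `install.packages()` needs a network connection, so it is left out of this sketch.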
Project Management With RStudio
- Use RStudio to create and manage projects with consistent layout.
- Treat raw data as read-only.
- Treat generated output as disposable.
- Separate function definition and application.
Seeking Help
- Use `help()` to get online help in R.
Data Structures
- Use `read.csv` to read tabular data in R.
- The basic data types in R are double, integer, complex, logical, and character.
- Data structures such as data frames or matrices are built on top of lists and vectors, with some added attributes.
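The relationship between basic types and data structures can be checked directly. The sketch below uses a small made-up data frame (`cats`, hypothetical values) rather than a file read with `read.csv`:

```r
# typeof() reports the basic type of a value
typeof(3.14)      # "double"
typeof(1L)        # "integer"
typeof(1 + 2i)    # "complex"
typeof(TRUE)      # "logical"
typeof("cat")     # "character"

# A small data frame with made-up values (hypothetical example data)
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(TRUE, FALSE, TRUE)
)

typeof(cats)             # "list": a data frame is a list of equal-length vectors
class(cats)              # "data.frame": the class attribute adds the table behaviour
names(attributes(cats))  # the attributes stored on top of the list

# A matrix is a vector with a dim attribute
m <- matrix(1:6, nrow = 2)
typeof(m)                # "integer"
attributes(m)
```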
Exploring Data Frames
- Use `cbind()` to add a new column to a data frame.
- Use `rbind()` to add a new row to a data frame.
- Remove rows from a data frame.
- Use `str()`, `summary()`, `nrow()`, `ncol()`, `dim()`, `colnames()`, `head()`, and `typeof()` to understand the structure of a data frame.
- Read in a csv file using `read.csv()`.
- Understand what `length()` of a data frame represents.
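As a rough illustration of these operations, the following sketch builds a small data frame by hand (the `cats` data and its values are invented for the example) instead of reading a csv file:

```r
# Made-up example data
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2)
)

# cbind() adds a column; the new vector needs one value per row
age <- c(2, 3, 5)
cats <- cbind(cats, age)

# rbind() adds a row; a one-row data frame keeps the column types intact
new_row <- data.frame(coat = "tortoiseshell", weight = 3.3, age = 9)
cats <- rbind(cats, new_row)

# A negative row index removes rows (here, the second one)
cats_trimmed <- cats[-2, ]

# Inspect the structure
str(cats)
dim(cats)            # number of rows, then number of columns
nrow(cats_trimmed)
length(cats)         # number of columns, because a data frame is a list of columns
```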
Subsetting Data
- Indexing in R starts at 1, not 0.
- Access individual values by location using `[]`.
- Access slices of data using `[low:high]`.
- Access arbitrary sets of data using `[c(...)]`.
- Use logical operations and logical vectors to access subsets of data.
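The subsetting forms listed above can be tried on a short named vector. The vector `x` and its values are invented for this sketch:

```r
# A named numeric vector with made-up values
x <- c(a = 5.4, b = 6.2, c = 7.1, d = 4.8, e = 7.5)

first <- x[1]            # first element: indexing starts at 1
middle <- x[2:4]         # slice: elements 2 through 4
ends <- x[c(1, 5)]       # arbitrary set: elements 1 and 5

# A logical vector keeps the elements where the condition is TRUE
large <- x[x > 6]

# Logical operations combine conditions
large_not_e <- x[x > 6 & names(x) != "e"]

print(large_not_e)
```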
Creating Publication-Quality Graphics with ggplot2
- Use `ggplot2` to create plots.
- Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.
Writing Data
- Save plots from RStudio using the ‘Export’ button.
- Use `write.table` to save tabular data.
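A short sketch of saving tabular data, assuming a made-up data frame and a temporary output path (`out_file` is hypothetical; a real project would more likely write into its own output folder):

```r
# Made-up example data
cats <- data.frame(
  coat = c("calico", "black"),
  weight = c(2.1, 5.0)
)

# Hypothetical output location in the session's temporary directory
out_file <- tempfile(fileext = ".csv")

# sep = "," writes comma-separated values; quote and row.names are switched
# off so the file holds only the header and the data
write.table(cats, file = out_file, sep = ",", quote = FALSE, row.names = FALSE)

# Read the file back to check what was written
round_trip <- read.csv(out_file)
print(round_trip)
```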
Data Frame Manipulation with dplyr
- Use the `dplyr` package to manipulate data frames.
- Use `select()` to choose variables from a data frame.
- Use `filter()` to choose data based on values.
- Use `group_by()` and `summarize()` to work with subsets of data.
- Use `mutate()` to create new variables.
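The verbs above are usually chained together. The sketch below assumes the `dplyr` package is installed and uses an invented data frame (`surveys`, with hypothetical countries and values):

```r
library(dplyr)

# Made-up example data
surveys <- data.frame(
  country = c("A", "A", "B", "B"),
  year = c(2000, 2010, 2000, 2010),
  pop = c(10, 12, 50, 55),
  gdp_per_cap = c(1000, 1500, 20000, 25000)
)

result <- surveys %>%
  select(country, year, pop, gdp_per_cap) %>%    # choose variables
  filter(year >= 2000) %>%                       # choose rows by value
  mutate(gdp_total = pop * gdp_per_cap) %>%      # create a new variable
  group_by(country) %>%                          # work within each country
  summarize(mean_gdp_total = mean(gdp_total))    # one summary row per group

print(result)
```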
Data Frame Manipulation with tidyr
- Use the `tidyr` package to change the layout of data frames.
- Use `pivot_longer()` to go from wide to longer layout.
- Use `pivot_wider()` to go from long to wider layout.
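A small round trip between the two layouts, assuming the `tidyr` package is installed. The data frame `wide` and its column names are invented for the example:

```r
library(tidyr)

# Made-up wide data: one population column per year
wide <- data.frame(
  country = c("A", "B"),
  pop_2000 = c(10, 50),
  pop_2010 = c(12, 55)
)

# Wide to longer: one row per country and year
long <- pivot_longer(
  wide,
  cols = c(pop_2000, pop_2010),
  names_to = "year",
  names_prefix = "pop_",   # strip the prefix so "year" holds only the year
  values_to = "pop"
)

# Long to wider: back to one column per year
wide_again <- pivot_wider(
  long,
  names_from = year,
  values_from = pop,
  names_prefix = "pop_"
)

print(long)
print(wide_again)
```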
Basic Statistics: describing, modelling and reporting
Describing data
Inferential statistics
Regression Modelling
- R has a range of in-built functions to enable initial data exploration.
- Linear models (lm) can be used with continuous and categorical variables.
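As one possible illustration, the sketch below uses R's built-in `mtcars` dataset (an assumption for the example, not necessarily the data used in the lesson) to explore a variable and fit a linear model with one continuous and one categorical predictor:

```r
# Work on a copy of the built-in mtcars data
cars <- mtcars

# am is coded 0 = automatic, 1 = manual; turn it into a categorical variable
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Built-in functions for initial exploration
summary(cars$mpg)
table(cars$am)

# Linear model: fuel economy explained by weight (continuous)
# and transmission type (categorical)
fit <- lm(mpg ~ wt + am, data = cars)

coef(fit)
summary(fit)
```

The categorical predictor appears in the coefficient table as `ammanual`, the estimated difference between manual and automatic cars at a given weight.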
Logistic Regression
Logistic regression models the log-odds of an event as a linear combination of one or more independent variables.
Binary logistic regression uses a single binary dependent variable, coded by an indicator variable whose two values are labeled “0” and “1”, to model the probability of a certain class or event taking place. In these examples, the event is antimicrobial resistance to a particular antibiotic.
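A minimal sketch of a binary logistic regression with `glm()`. The data are simulated for illustration only: the variable names (`age`, `resistant`) and the coefficients used to generate them are assumptions, not results from any real resistance dataset.

```r
set.seed(42)

# Simulated data: resistance (0/1) whose log-odds rise linearly with age
n <- 200
age <- round(runif(n, min = 18, max = 90))
resistant <- rbinom(n, size = 1, prob = plogis(-3 + 0.04 * age))
amr <- data.frame(age, resistant)

# family = binomial fits the log-odds of resistant == 1 as a linear function of age
fit <- glm(resistant ~ age, data = amr, family = binomial)

coef(fit)        # coefficients on the log-odds scale
exp(coef(fit))   # the same coefficients expressed as odds ratios

# Predicted probability of resistance for a hypothetical 70-year-old patient
p70 <- predict(fit, newdata = data.frame(age = 70), type = "response")
print(p70)
```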
Broom
- Broom can be used to create reusable outputs from various analyses, in the form of tibbles.
Producing Reports With Quarto
- Mix reporting written in R Markdown with software written in R.
- Specify chunk options to control formatting.
- Use `knitr` to convert these documents into PDF and other formats.
Best Practices for Writing R Code
- Start each program with a description of what it does.
- Then load all required packages.
- Consider what working directory you are in when sourcing a script.
- Use comments to mark off sections of code.
- Put function definitions at the top of your file, or in a separate file if there are many.
- Name and style code consistently.
- Break code into small, discrete pieces.
- Factor out common operations rather than repeating them.
- Keep all of the source files for a project in one directory and use relative paths to access them.
- Keep track of the memory used by your program.
- Always start with a clean environment instead of saving the workspace.
- Keep track of session information in your project folder.
- Have someone else review your code.
- Use version control.
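Several of these practices can be seen in the layout of a single script. The sketch below is one possible arrangement; the file name, the function `fahr_to_celsius`, and the session-info path are all hypothetical:

```r
# analysis.R (hypothetical file name)
# Purpose: convert a set of temperature readings from Fahrenheit to Celsius.

# ---- Packages ----
# Load every required package here, at the top of the script.
library(stats)

# ---- Function definitions ----
# Factor out the conversion so it is written once and reused.
fahr_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

# ---- Analysis ----
temps_f <- c(32, 212)
temps_c <- fahr_to_celsius(temps_f)
print(temps_c)

# ---- Session information ----
# Record the R version and attached packages alongside the results.
# A temporary directory is used here; a project would use a relative path.
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)
```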
Introduction to Reproducibility
Automated Version Control
- Version control is like an unlimited ‘undo’.
- Version control also allows many people to work in parallel.
Setting Up Git
- Use `git config` with the `--global` option to configure a user name, email address, editor, and other preferences once per machine.
Creating a Repository
- `git init` initializes a repository.
- Git stores all of its repository data in the `.git` directory.
Tracking Changes
- `Diff` shows the status of a repository.
- Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded).
- `Stage` puts files in the staging area.
- `Commit` saves the staged content as a new commit in the local repository.
- Write a commit message that accurately describes your changes.
Exploring History
- `Diff` displays differences between commits.
- `git restore` recovers old versions of files.
Ignoring Things
- The `.gitignore` file tells Git what files to ignore.
Remotes in GitHub
- A local Git repository can be connected to one or more remote repositories.
- Use the SSH protocol to connect to remote repositories.
- `git push` copies changes from a local repository to a remote repository.
- `git pull` copies changes from a remote repository to a local repository.
Collaborating
- `git clone` copies a remote repository to create a local repository with a remote called `origin` automatically set up.
Conflicts
- Conflicts occur when two or more people change the same lines of the same file.
- The version control system does not allow people to overwrite each other’s changes blindly, but highlights conflicts so that they can be resolved.
Branches
- Branches provide a safe way to experiment with new ideas or explore solutions to bugs and other issues within your files.
- Pull requests provide the mechanism for bringing changes from other branches into main with varying levels of oversight.
Issues
- Issues can be used to plan, discuss, and track work.
Open Science
- Open scientific work is more useful and more highly cited than closed.
Licensing
- The `LICENSE`, `LICENSE.md`, or `LICENSE.txt` file is often used in a repository to indicate how the contents of the repo may be used by others.
- People who incorporate General Public License (GPL) software into their own software must also release the derived software under the GPL if they decide to share it; most other open licenses do not require this.
- The Creative Commons family of licenses allows people to mix and match requirements and restrictions on attribution, creation of derivative works, further sharing, and commercialization.
- People who are not lawyers should not try to write licenses from scratch.
Citation
- Add a CITATION file to a repository to explain how you want your work cited.
Hosting
- Projects can be hosted on university servers, on personal domains, or on a public hosting service.
- Rules regarding intellectual property and storage of sensitive information apply no matter where code and data are hosted.
SQL and R
- SQL is a powerful language used to interrogate and manipulate relational databases.
- You can interact with relational databases from within R.
Writing Good Software
- Keep your project folder structured, organized and tidy.
- Document what and why, not how.
- Break programs into short single-purpose functions.
- Write re-runnable tests.
- Don’t repeat yourself.
- Be consistent in naming, indentation, and other aspects of style.
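A short single-purpose function paired with a re-runnable test might look like the sketch below. The function `rescale01` and its test are invented for illustration, and the test uses base R's `stopifnot()` rather than a dedicated testing package:

```r
# Rescale values to the range 0 to 1 so that variables measured in
# different units can be compared on a common scale.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# A test that can be re-run after every change to the function.
# stopifnot() raises an error as soon as one expectation fails.
test_rescale01 <- function() {
  stopifnot(
    isTRUE(all.equal(rescale01(c(0, 5, 10)), c(0, 0.5, 1))),
    isTRUE(all.equal(rescale01(c(2, 4)), c(0, 1))),
    is.na(rescale01(c(1, NA, 3))[2])   # missing values stay missing
  )
  invisible(TRUE)
}

test_rescale01()
```

The comment above the function states what it does and why; the body is short enough that the how needs no extra explanation.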