Welcome to R
R is a powerful, specialized programming language and environment for statistical computing and graphics. Developed in the early 1990s, it has become the tool of choice for statisticians, data analysts, and researchers worldwide. R is highly extensible through user-contributed packages, offering capabilities for data manipulation, complex statistical modeling, machine learning, and the creation of publication-quality visualizations. Its strong focus on data analysis, combined with an active community, makes R an indispensable skill for anyone working deeply with data.
Introduction to R
How to install R and RStudio and write your first "Hello, World!" program.
R is one of the most popular languages for statistical analysis and data visualization. Its ecosystem, especially with RStudio, provides an excellent environment for both beginners and experts.
Step 1: Download and Install R
1. Visit the Official Website: Go to The R Project website and download R for your operating system (Windows, macOS, or Linux).
2. Run the Installer: Follow the setup instructions for your OS.
Step 2: Install RStudio (Recommended IDE)
RStudio is an Integrated Development Environment (IDE) that makes working with R much easier.
1. Download RStudio Desktop from the RStudio website.
2. Install it on your system.
Step 3: Write and Run Your First R Script
1. Open RStudio.
2. Go to File → New File → R Script.
3. Type the following code in the script editor:
print("Hello, World!")
4. To run the code, you can:
- Select the line and press Ctrl+Enter (Cmd+Return on Mac).
- Click the 'Run' button in the script editor.
5. You should see the output in the Console pane below:
> print("Hello, World!")
[1] "Hello, World!"
Congratulations! You have successfully installed R and run your first script. 🎉
For more details, visit the official R documentation at The R Manuals. Happy coding! 🚀
Syntax of R
1. R Uses Curly Braces for Code Blocks
R uses curly braces {} to define blocks of code, such as in functions, loops, and conditionals. While indentation is not enforced, it is highly recommended for readability.
Correct Example:
if (5 > 3) {
print("Five is greater than three") # Code block inside braces
}
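Example Without Braces (runs, but is fragile):
if (5 > 3)
print("Only this one statement belongs to the if") # No braces: a single statement is allowed
Without braces, the if applies only to the single statement that follows it; any further lines run unconditionally. Always use braces for multi-line blocks to avoid such bugs and improve code clarity.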
2. One-Based Indexing
R follows one-based indexing, meaning the first element of a vector, list, or other data structure is at index 1, not 0.
Example:
fruits <- c("apple", "banana", "cherry")
print(fruits[1]) # Outputs: "apple" (first element)
print(fruits[3]) # Outputs: "cherry" (third element)
Incorrect Example (Returns empty vector):
print(fruits[0]) # Returns character(0), an empty vector
Since R starts counting from 1, accessing index 0 does not give an error but returns an empty vector of the same type.
3. Comments in R
Comments are used to explain code and are ignored by R. R only supports single-line comments, which start with a #.
Single-Line Comment:
# This is a single-line comment
print("Hello, World!") # This is an inline comment
For multi-line explanations, you must start each line with a #. There is no native multi-line comment syntax in R.
4. Case Sensitivity
R is a case-sensitive language. Variable names, function names, and keywords must be used with consistent capitalization.
Example:
myVar <- 5
MyVar <- 10
print(myVar) # Outputs: 5
print(MyVar) # Outputs: 10 (different variable)
Incorrect Example:
Print("Hello") # Error: could not find function "Print"
Function names must be typed exactly as they are defined; print() is written entirely in lowercase.
5. Variables and Dynamic Typing
R is dynamically typed. You do not need to declare a variable's type; it is determined by the value you assign.
Example:
x <- 10 # Whole number, but stored as numeric (double)
class(x) # "numeric" (all numbers are numeric by default)
y <- 3.14 # Float/Double
class(y) # "numeric"
z <- "Hello" # String/Character
class(z) # "character"
Reassigning Different Data Types:
a <- 5 # Initially numeric
a <- "R" # Now reassigned as a character
print(a) # Outputs: "R"
Because R uses dynamic typing, a variable's type can change during execution.
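As a small illustrative sketch (the variable names here are made up for this example), typeof() reports the underlying storage type, which shows what class() hides behind "numeric":

```r
# class() reports the high-level class; typeof() reports the storage type.
whole <- 10      # stored as a double, even though it looks like an integer
counted <- 10L   # the L suffix requests integer storage

class(whole)     # "numeric"
typeof(whole)    # "double"
class(counted)   # "integer"
typeof(counted)  # "integer"

# Reassignment can change the type held under the same name.
value <- 5
value <- "R"
is.character(value)  # TRUE
```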
Conclusion
Understanding R's basic syntax is essential for writing correct and efficient code. Key takeaways include:
- R uses {} for code blocks and encourages good indentation.
- One-based indexing means the first element is at index 1.
- Comments are single-line and start with #.
- R is case-sensitive.
- R supports dynamic typing, allowing variables to change types.
print and cat
The print() and cat() functions in R are used to display output. print() is the standard function for displaying any R object, while cat() is better for concatenating and outputting multiple items in a more controlled format.
1. Basic print() Usage
The print() function displays a single R object, such as a string, number, or variable.
print("Hello, World!") # Outputs: [1] "Hello, World!"
print(42) # Outputs: [1] 42
2. Using cat() for Concatenated Output
The cat() function is useful for outputting multiple items. It converts its arguments to character strings, concatenates them, and prints them. Unlike print(), it does not add a newline by default or include vector indices like [1].
cat("Hello", "R", "World", "!") # Outputs: Hello R World !
You can control the separator using the sep parameter.
cat("Hello", "R", sep = "-") # Outputs: Hello-R
3. Using the sep Parameter
The sep parameter in cat() defines the separator between items.
cat("apple", "banana", "cherry", sep = ", ")
# Outputs: apple, banana, cherry
4. Printing Variables
You can print variables directly with print() or concatenate them with cat().
name <- "Alice"
age <- 25
print(name) # Outputs: [1] "Alice"
cat("Name:", name, "Age:", age, "\n")
# Outputs: Name: Alice Age: 25
For more complex formatting, use sprintf() or paste().
cat(sprintf("Name: %s, Age: %d\n", name, age))
# Outputs: Name: Alice, Age: 25
5. Printing Special Characters
Use escape sequences for formatting:
cat("Hello\nWorld!") # \n adds a new line
cat("This is a tab:\tR") # \t adds a tab space
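The difference between the two functions shows most clearly with escape sequences. The following sketch uses capture.output() only to make the printed lines inspectable; the variable names are illustrative.

```r
text <- "Line one\nLine two"

# print() shows the string as R stores it, with the escape sequence visible.
printed <- capture.output(print(text))

# cat() writes the characters themselves, so \n becomes a real line break.
catted <- capture.output(cat(text, "\n", sep = ""))

length(printed)  # 1: a single line containing the literal \n
length(catted)   # 2: "Line one" and "Line two"
```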
Conclusion
The print() and cat() functions are fundamental for displaying output in R. Key takeaways:
- Use print(object) for standard output of any R object.
- Use cat(...) for concatenating and controlling the output format with sep.
- cat() does not add a newline by default; use \n when needed.
- Use sprintf() for precise, C-style string formatting.
Arithmetic Operators in R
Arithmetic operators perform mathematical calculations. R supports addition, subtraction, multiplication, division, exponentiation, modulus, and integer division.
Examples:
a <- 15
b <- 4
cat("a + b =", a + b, "\n") # 19 (addition)
cat("a - b =", a - b, "\n") # 11 (subtraction)
cat("a * b =", a * b, "\n") # 60 (multiplication)
cat("a / b =", a / b, "\n") # 3.75 (true division)
cat("a %/% b =", a %/% b, "\n") # 3 (integer division)
cat("a %% b =", a %% b, "\n") # 3 (modulus)
cat("a ^ b =", a ^ b, "\n") # 50625 (exponentiation, also a ** b)
Arithmetic operators are vectorized, meaning they work element-wise on vectors.
x <- c(1, 2, 3)
y <- c(4, 5, 6)
print(x + y) # Outputs: 5 7 9
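Integer division and modulus behave in a way that can surprise with negative operands. A minimal sketch, with values chosen only for illustration:

```r
a <- -7
b <- 2

quotient  <- a %/% b   # -4: integer division rounds toward negative infinity
remainder <- a %% b    #  1: the remainder takes the sign of the divisor

# The two results always recombine to the original value.
quotient * b + remainder == a   # TRUE
```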
Comparison Operators in R
Comparison operators compare two values and return a logical (boolean) result, TRUE or FALSE. They include equality, inequality, and greater/less than comparisons.
Examples:
x <- 7
y <- 10
print(x == y) # FALSE (equal to)
print(x != y) # TRUE (not equal to)
print(x > y) # FALSE
print(x < y) # TRUE
print(x >= 7) # TRUE
print(y <= 7) # FALSE
Comparison results are often used in if statements or for subsetting data.
vec <- 1:5
print(vec > 3) # Outputs: FALSE FALSE FALSE TRUE TRUE
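One caveat when comparing doubles: == tests exact equality, which can fail after floating-point arithmetic. The sketch below shows a tolerance-based comparison; the tolerance 1e-9 is an arbitrary choice for this example.

```r
total <- 0.1 + 0.2

total == 0.3                   # FALSE: the two doubles differ in the last bits
isTRUE(all.equal(total, 0.3))  # TRUE: equal within a small tolerance
abs(total - 0.3) < 1e-9        # TRUE: an explicit tolerance check
```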
Logical Operators in R
Logical operators combine logical values and return TRUE or FALSE. The main operators are & (and), | (or), and ! (not).
Examples:
is_sunny <- TRUE
is_warm <- FALSE
print(is_sunny & is_warm) # FALSE (both must be TRUE)
print(is_sunny | is_warm) # TRUE (at least one is TRUE)
print(!is_warm) # TRUE (negates the value)
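For control flow (e.g., in if conditions), use the short-circuiting operators && and ||. They expect single (length-one) logical values and skip the right-hand side when the left-hand side already decides the result; since R 4.3.0, passing a longer vector is an error.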
if (x > 0 && y > 0) {
print("Both are positive.")
}
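To make the contrast between the element-wise and short-circuiting operators concrete, here is a brief sketch; the vector values is invented for the example.

```r
values <- c(5, -2, 8)

# & and | work element by element and return a vector.
in_range <- values > 0 & values < 6      # TRUE FALSE FALSE

# && stops as soon as the answer is known, so stop() is never reached.
checked <- FALSE && stop("this is never evaluated")   # FALSE, no error

# any() and all() reduce a logical vector to one value for use in if().
any(values < 0)   # TRUE
all(values > 0)   # FALSE
```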
Bitwise Operators in R
Bitwise operators work on the binary representations of integers. They are less common in R for data analysis but are available for low-level programming. R uses functions like bitwAnd() for these operations.
Examples:
a <- as.raw(6) # binary: 110
b <- as.raw(3) # binary: 011
print(as.integer(a & b)) # 2 (binary 010, AND)
print(as.integer(a | b)) # 7 (binary 111, OR)
print(as.integer(xor(a, b))) # 5 (binary 101, XOR)
For integer inputs, use the bitw family of functions.
print(bitwAnd(6, 3)) # 2
print(bitwOr(6, 3)) # 7
print(bitwShiftL(1, 2)) # 4 (left shift)
print(bitwShiftR(8, 1)) # 4 (right shift)
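A common use of bitwise functions is packing several on/off flags into one integer. The flag names below are hypothetical and exist only for this sketch.

```r
# Hypothetical permission flags, each occupying its own bit.
READ  <- 1L   # binary 001
WRITE <- 2L   # binary 010
EXEC  <- 4L   # binary 100

perms <- bitwOr(READ, WRITE)               # 3 (binary 011)

can_write <- bitwAnd(perms, WRITE) != 0    # TRUE: the write bit is set
can_exec  <- bitwAnd(perms, EXEC) != 0     # FALSE: the execute bit is not set

toggled <- bitwXor(perms, WRITE)           # 1 (binary 001): write switched off
bitwXor(6L, 3L)                            # 5 (binary 101)
```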
Assignment Operators in R
The main assignment operator in R is <-. The equals sign = can also be used, but <- is the conventional choice. R has no compound assignment operators such as +=; to update a variable, reassign it using its current value.
Examples:
x <- 10 # standard assignment
x = 10 # alternative assignment (less common in scripts)
print(x)
# Reassignment with arithmetic
x <- x + 5 # x is now 15
x <- x - 3 # x is now 12
x <- x * 2 # x is now 24
Base R has no syntax for assigning several variables at once. The zeallot package adds this through its %<-% operator.
library(zeallot)
c(a, b) %<-% c(4, 5) # a is 4, b is 5 (requires the zeallot package)
# Or, more simply:
a <- 4
b <- 5
Integers in R
In R, whole numbers are usually stored as the numeric type (double-precision floating point). To explicitly create an integer, use the L suffix or as.integer().
1. Creating Integers and Numerics
a <- 10 # This is a 'numeric' by default
class(a) # "numeric"
b <- 10L # This is an 'integer'
class(b) # "integer"
c <- as.integer(10) # Another way to create an integer
2. Integer Operations
Integers and numerics work together in arithmetic operations. R will often return a numeric result.
print(5L + 3L) # 8 (integer)
print(5L / 2L) # 2.5 (numeric: / always returns a double, even with integers)
print(5L %/% 2L) # 2 (integer division)
3. Type Conversion
num <- as.integer("123")
print(num) # 123 (integer)
num <- as.integer(3.1415)
print(num) # 3 (truncates towards zero)
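The following sketch illustrates how result types follow from the operand types, and what happens at the upper limit of the integer range. It relies on .Machine$integer.max, which base R provides.

```r
typeof(5L + 3L)   # "integer": both operands are integers
typeof(5L + 3)    # "double": mixing with a numeric promotes the result
typeof(5L / 2L)   # "double": / always returns a double

# Integers are 32-bit, so the largest value is 2147483647.
biggest <- .Machine$integer.max
overflow <- suppressWarnings(biggest + 1L)   # NA, normally with a warning
is.na(overflow)                              # TRUE

# Adding a double instead of an integer avoids the overflow.
as_double <- biggest + 1                     # 2147483648
```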
Numeric (Doubles/Floats) in R
The numeric data type in R is used for real numbers (floats/doubles). It is the default type for numbers.
1. Creating Numerics
a <- 3.1415
b <- -0.5
c <- 100.0 # Even this is numeric
print(c(a, b, c))
class(a) # "numeric"
2. Numeric Operations
print(2.5 + 1.5) # 4
print(5.0 / 2) # 2.5
print(3.1415 ^ 2) # Exponentiation
print(7.3 %% 3) # Modulus (remainder)
3. Type Conversion and Special Values
num <- as.numeric("123.45")
print(num) # 123.45
num <- as.numeric(TRUE) # TRUE coerces to 1
print(num) # 1
# Special numeric values
print(Inf - Inf) # NaN (Not a Number)
print(1 / 0) # Inf
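Special values can be detected with dedicated test functions. A short sketch, using an invented vector of results:

```r
results <- c(1 / 0, -1 / 0, 0 / 0, NA, 42)

is.finite(results)    # FALSE FALSE FALSE FALSE  TRUE
is.infinite(results)  #  TRUE  TRUE FALSE FALSE FALSE
is.nan(results)       # FALSE FALSE  TRUE FALSE FALSE
is.na(results)        # FALSE FALSE  TRUE  TRUE FALSE (NaN also counts as NA)

# Keep only the values that are safe to use in arithmetic.
usable <- results[is.finite(results)]   # 42
```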
Character Strings in R
Strings in R are stored in the character vector type. They can be defined using either single (') or double (") quotes.
1. Creating Character Strings
a <- 'Hello, World!'
b <- "R is great"
# Both are equivalent
print(a)
print(b)
R has no triple-quote syntax for multi-line strings. An ordinary quoted string may span several lines, or you can build one from pieces using the paste() function with a newline separator.
multiline <- paste("This is", "a multi-line", "string", sep = "\n")
cat(multiline)
2. String Operations
String concatenation is done with paste().
print(paste("Hello", "World", sep = " ")) # Concatenation
print(paste0("Hello", "World")) # Concatenation with no separator
# Get string length with nchar()
print(nchar("Hello")) # 5
3. String Functions
text <- "hello world"
print(toupper(text)) # "HELLO WORLD"
print(tolower("HELLO")) # "hello"
print(sub("world", "earth", text)) # Substitute first occurrence: "hello earth"
String Functions: Trimming, Splitting, and Replacing
R has many useful functions for string manipulation, often found in base R or the stringr package.
Trimming with trimws()
text <- " hello world "
print(trimws(text)) # "hello world"
print(trimws(text, "left")) # "hello world "
print(trimws(text, "right")) # " hello world"
Splitting Strings with strsplit()
line <- "apple,banana,cherry"
print(strsplit(line, ",")) # Returns a list: [[1]] "apple" "banana" "cherry"
words <- "one two three"
print(strsplit(words, " ")[[1]]) # Extract the vector: "one" "two" "three"
Replacing Text with sub() and gsub()
sentence <- "I like cats and cats are nice"
print(sub("cats", "dogs", sentence)) # Replaces only first: "I like dogs and cats are nice"
print(gsub("cats", "dogs", sentence)) # Replaces all: "I like dogs and dogs are nice"
Joining Strings with paste()
parts <- c("2025", "09", "26")
date <- paste(parts, collapse = "-")
print(date) # "2025-09-26"
Common Pitfalls
- trimws() only removes whitespace by default.
- strsplit() returns a list; use [[1]] to get the first element as a vector.
- sub() replaces only the first occurrence; gsub() replaces all.
- paste() with collapse turns a vector into a single string.
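These functions are often combined. The sketch below cleans a messy comma-separated line; the input string and the replacement are invented for illustration.

```r
raw_line <- "  apple , banana,cherry  "

# Split on commas, then trim the whitespace around each piece.
pieces <- trimws(strsplit(raw_line, ",")[[1]])
pieces                      # "apple" "banana" "cherry"

# Replace text in every element, then join back into one string.
pieces <- gsub("banana", "mango", pieces)
cleaned <- paste(pieces, collapse = ", ")
cleaned                     # "apple, mango, cherry"
```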
String Formatting & Case Methods
R provides several ways to format strings and combine them with variables.
Formatting with sprintf()
# %s for strings, %d for integers, %f for floats
template <- "The sum of %d and %d is %d"
msg <- sprintf(template, 2, 4, 2+4)
print(msg) # The sum of 2 and 4 is 6
# Control decimal places with %.2f
print(sprintf("Pi is approximately %.2f", pi)) # "Pi is approximately 3.14"
Case Conversion
text <- "Hello World"
print(toupper(text)) # "HELLO WORLD"
print(tolower(text)) # "hello world"
The stringr Package
The stringr package provides a more consistent and user-friendly set of string functions.
library(stringr)
str_to_upper("hello") # "HELLO"
str_replace_all("a-a-a", "a", "b") # "b-b-b" (like gsub)
Common Pitfalls
- Using wrong format specifiers in sprintf() can cause errors or unexpected output.
- toupper() and tolower() are base R functions, not methods on the string object.
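sprintf() also supports width, padding, and alignment flags, and it is vectorized over its arguments. A minimal sketch with made-up items and prices:

```r
# sprintf() is vectorized: one result per element of its arguments.
items  <- c("apple", "kiwi")
prices <- c(1.5, 12.25)
rows <- sprintf("%-6s %6.2f", items, prices)
rows
# "apple    1.50" "kiwi    12.25"

padded  <- sprintf("%05d", 42L)        # "00042": zero-padded to width 5
percent <- sprintf("%5.1f%%", 87.456)  # " 87.5%": %% prints a literal percent sign
```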
Vectors
Vectors are the fundamental data structure in R. They are one-dimensional arrays that can hold numeric, character, or logical data, but all elements must be of the same type. They are created using the c() function (combine).
1. Creating a Vector
num_vec <- c(1, 2, 3, 4, 5) # Numeric vector
char_vec <- c("a", "b", "c") # Character vector
log_vec <- c(TRUE, FALSE, TRUE) # Logical vector
print(num_vec)
print(char_vec)
2. Vectorized Operations
A key feature of R is that most operations are vectorized, meaning they are applied to each element of the vector without the need for explicit loops.
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)
print(v1 + v2) # 5, 7, 9
print(v1 * 2) # 2, 4, 6
3. Vector Recycling
When performing operations on two vectors of different lengths, R will recycle the shorter vector to match the length of the longer one.
short <- c(1, 2)
long <- c(10, 20, 30, 40)
print(short + long) # 11, 22, 31, 42 (short is recycled to c(1,2,1,2))
Warning: Recycling can produce unexpected results if the longer vector's length is not a multiple of the shorter one's length; R still returns a result in that case but issues a warning.
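The mismatched-length case can be observed directly. This sketch uses tryCatch() to capture the warning instead of letting it print; the exact wording of the message may vary with locale.

```r
# Lengths 4 and 2: clean recycling, no warning.
clean <- c(10, 20, 30, 40) * c(1, 2)        # 10 40 30 80

# Lengths 3 and 2: R would compute 11 22 13 but warns about the mismatch.
mismatch <- tryCatch(
  c(1, 2, 3) + c(10, 20),
  warning = function(w) conditionMessage(w)
)
mismatch   # the warning text rather than the numeric result
```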
Indexing Vectors
You access elements of a vector using square brackets []. R uses 1-based indexing, so the first element is at position 1.
Basic Indexing
fruits <- c("apple", "banana", "cherry")
print(fruits[1]) # Outputs: "apple" (first element)
print(fruits[3]) # Outputs: "cherry" (third element)
Negative Indexing
Negative indices are used to exclude elements.
print(fruits[-1]) # All elements except the first: "banana" "cherry"
print(fruits[-c(1,3)]) # Excludes first and third: "banana"
Logical Indexing
You can use a logical vector to select elements where the condition is TRUE.
numbers <- c(10, 20, 30, 40)
print(numbers[numbers > 25]) # Outputs: 30, 40
Common Pitfalls
- Accessing an index of 0 returns an empty vector.
- Accessing an index beyond the vector's length returns NA.
- Negative and positive indices cannot be mixed in a single subset operation.
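Each of these pitfalls can be reproduced in a few lines. In this sketch the mixed-sign error is caught with tryCatch() so the script keeps running:

```r
fruits <- c("apple", "banana", "cherry")

length(fruits[0])       # 0: an empty character vector, no error
is.na(fruits[5])        # TRUE: past the end of the vector gives NA

# Mixing signs is an error, so this sketch captures it instead of stopping.
mixed <- tryCatch(fruits[c(-1, 2)], error = function(e) "error")
mixed                   # "error"

# which() converts a logical condition into positions.
which(fruits == "banana")   # 2
```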
Slicing Vectors
Slicing allows you to extract a contiguous portion of a vector using the syntax vector[start:end]. Both start and end are inclusive.
Basic Slicing
numbers <- c(0, 1, 2, 3, 4, 5)
print(numbers[2:4]) # Outputs: 1, 2, 3 (elements 2 through 4)
print(numbers[1:3]) # Outputs: 0, 1, 2 (first three elements)
print(numbers[4:6]) # Outputs: 3, 4, 5 (from 4th to last)
Using Sequences
print(numbers[seq(1, 5, by=2)]) # Outputs: 0, 2, 4 (every 2nd element)
print(numbers[c(1,3,5)]) # Outputs: 0, 2, 4 (specific indices)
Reversing a Vector
print(rev(numbers)) # Outputs: 5, 4, 3, 2, 1, 0
print(numbers[6:1]) # Also reverses
Common Pitfalls
- The end index is inclusive, unlike Python where it's exclusive.
- Slicing returns a new vector; it does not modify the original.
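Base R's head() and tail() offer a convenient alternative to explicit ranges when slicing from either end. A short sketch:

```r
numbers <- c(0, 1, 2, 3, 4, 5)

head(numbers, 3)     # 0 1 2: first three elements
tail(numbers, 2)     # 4 5: last two elements
head(numbers, -2)    # 0 1 2 3: everything except the last two
numbers[-(1:2)]      # 2 3 4 5: everything except the first two

# The original vector is unchanged by any of the calls above.
length(numbers)      # 6
```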
Vector-Specific Functions & Methods
R provides many built-in functions for creating, manipulating, and summarizing vectors.
Useful Vector Creation Functions
seq1 <- 1:5 # Creates a sequence: 1,2,3,4,5
seq2 <- seq(1, 10, by=2) # 1,3,5,7,9
rep1 <- rep(1, times=5) # 1,1,1,1,1 (repeat)
rep2 <- rep(1:2, each=2) # 1,1,2,2
Summary Functions
nums <- c(1, 2, 3, 4, NA, 6) # Note the NA (missing value)
print(length(nums)) # Number of elements (6)
print(sum(nums, na.rm = TRUE)) # Sum, removing NA (16)
print(mean(nums, na.rm = TRUE)) # Mean, removing NA (3.2)
print(max(nums, na.rm = TRUE)) # Maximum value (6)
print(min(nums, na.rm = TRUE)) # Minimum value (1)
Logical Vector Functions
log_vec <- c(TRUE, FALSE, TRUE)
print(all(log_vec)) # FALSE (are all values TRUE?)
print(any(log_vec)) # TRUE (is any value TRUE?)
Common Pitfalls
- Many functions return NA if the vector contains missing values; use na.rm = TRUE to ignore them.
- The length() function returns the total number of elements, not the count of TRUE values.
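To count or locate TRUE values rather than all elements, sum(), mean(), and which() can be applied to a logical vector. The scores and the pass mark of 75 are invented for this sketch.

```r
scores <- c(72, 88, 95, 60, 81)
passed <- scores >= 75          # FALSE TRUE TRUE FALSE TRUE

length(passed)   # 5: total number of elements
sum(passed)      # 3: TRUE counts as 1, FALSE as 0
mean(passed)     # 0.6: proportion of TRUE values
which(passed)    # 2 3 5: positions of the TRUE values
```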
Lists
Lists are versatile R objects that can contain elements of different types (e.g., numbers, strings, vectors, even other lists). They are created with the list() function.
1. Creating a List
my_list <- list(1, "a", TRUE, c(2, 5, 7))
print(my_list)
2. Accessing List Elements
Elements can be accessed by position using single brackets [] (which returns a list) or double brackets [[]] (which returns the element itself).
print(my_list[2]) # Returns a list containing "a"
print(my_list[[2]]) # Returns the element "a" itself
# Accessing a vector inside the list
print(my_list[[4]][2]) # Accesses the 2nd element of the vector in the 4th list item: 5
3. Named Lists
List elements can be named, which allows for access with the $ operator.
named_list <- list(name = "Alice", age = 30, scores = c(85, 92, 78))
print(named_list$name) # "Alice"
print(named_list[["age"]]) # 30
print(named_list$scores[2]) # 92
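Lists can also be modified after creation. The sketch below adds, replaces, and removes elements of a small named list made up for the example.

```r
person <- list(name = "Alice", age = 30)

person$city <- "Paris"          # add a new element by name
person[["age"]] <- 31           # replace an existing element
person$name <- NULL             # assigning NULL removes an element

names(person)                   # "age" "city"
length(person)                  # 2

# str() gives a compact overview of the structure.
str(person)
```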
Data Frames
Data frames are the most important data structure for data analysis in R. They are used to store tabular data, where each column can be a different type (e.g., numeric, character), but all elements within a column must be the same type. This is analogous to a spreadsheet or a Python pandas DataFrame.
1. Creating a Data Frame
df <- data.frame(
name = c("Alice", "Bob", "Charlie"),
age = c(25, 30, 35),
score = c(85, 92, 78)
)
print(df)
2. Accessing Data Frame Elements
You can access columns, rows, and individual elements using various methods.
print(df$name) # Access the 'name' column as a vector
print(df[["age"]]) # Another way to access a column
print(df[2, ]) # Get the second row
print(df[2, 3]) # Get the element in the 2nd row, 3rd column
print(df[, "score"]) # Get the 'score' column
3. Adding and Modifying Columns
df$new_col <- c("A", "B", "C") # Add a new column
df$age <- df$age + 1 # Modify an existing column
print(df)
Data Frame Functions
R provides many functions to inspect and manipulate data frames.
Inspecting Data Frames
print(dim(df)) # Dimensions (rows, columns)
print(nrow(df)) # Number of rows
print(ncol(df)) # Number of columns
print(names(df)) # Column names
str(df) # Structure of the data frame (str() prints its own output)
print(summary(df)) # Summary statistics for each column
Subsetting Data Frames
# Select specific columns
print(df[, c("name", "score")])
# Filter rows based on a condition
print(df[df$score > 80, ])
# Use the subset() function
print(subset(df, age >= 30 & score < 90))
Common Pitfalls
- Using $ for a column name that doesn't exist returns NULL.
- Selecting a single column with single brackets and no comma, as in df["score"], returns a one-column data frame; use df[["score"]], df$score, or df[, "score"] to get a vector.
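The return type of each subsetting form can be checked with class(). This sketch rebuilds a small data frame so it runs on its own; it also shows sorting rows with order(), which base R provides.

```r
df <- data.frame(
  name  = c("Alice", "Bob", "Charlie"),
  score = c(85, 92, 78)
)

class(df[, "score"])                 # "numeric": a plain vector
class(df["score"])                   # "data.frame": one-column data frame
class(df[, "score", drop = FALSE])   # "data.frame": drop = FALSE keeps the shape

is.null(df$missing_col)              # TRUE: no error for an unknown column

# order() returns row positions, which can be used to sort the data frame.
sorted <- df[order(df$score, decreasing = TRUE), ]
as.character(sorted$name)            # "Bob" "Alice" "Charlie"
```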
Variables
Variables in R are used to store data. They are dynamically typed, meaning the type is determined by the assigned value. Variable names can contain letters, numbers, dots, and underscores, but cannot start with a number or an underscore.
1. Creating Variables
a <- 10
b <- "Hello"
c <- 3.14
print(a)
print(b)
print(c)
2. Assigning Multiple Variables
# R does not have built-in multiple assignment like Python.
# You can assign separately or use a list:
a <- 4
b <- 5
# Or, for multiple return values from a function:
values <- list(4, 5)
a <- values[[1]]
b <- values[[2]]
3. Checking and Converting Variable Types
x <- "123"
print(class(x)) # "character"
# Type conversion
num <- as.numeric(x)
print(class(num)) # "numeric"
print(num) # 123
4. The Environment
You can see all defined variables in the environment using ls().
ls() # Lists all variables in the current environment
Conditionals: if, else and ifelse
Conditional statements in R allow you to execute different code blocks based on logical conditions. The basic structure uses if and else. For vectorized conditional checks, use ifelse().
1. Basic if Statement
x <- 10
if (x > 5) {
print("x is greater than 5")
}
2. if-else Statement
x <- 3
if (x > 5) {
print("x is greater than 5")
} else {
print("x is not greater than 5")
}
3. if-else-if Chain
x <- 5
if (x > 5) {
print("x is greater than 5")
} else if (x == 5) {
print("x is exactly 5")
} else {
print("x is less than 5")
}
4. Vectorized ifelse()
The ifelse() function is useful for applying a conditional check to each element of a vector.
vec <- 1:5
result <- ifelse(vec > 3, "High", "Low")
print(result) # "Low" "Low" "Low" "High" "High"
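ifelse() calls can be nested to produce more than two outcomes, and missing inputs propagate to the result. The grade thresholds below are arbitrary values chosen for this sketch.

```r
scores <- c(95, 72, NA, 40)

# Nested ifelse() handles more than two outcomes, element by element.
grade <- ifelse(scores >= 90, "A",
         ifelse(scores >= 60, "B", "C"))
grade    # "A" "B" NA "C": a missing input gives a missing result

# A plain if() needs a single TRUE or FALSE, so reduce a vector first.
if (any(is.na(scores))) {
  message("scores contains missing values")
}
```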
for-loop (Basics)
A for loop in R repeats a block of code for each element in a sequence (like a vector or list).
Looping Over a Vector
fruits <- c("apple", "banana", "cherry")
for (fruit in fruits) {
print(paste("I like", fruit))
}
Looping with an Index
for (i in seq_along(fruits)) { # seq_along() is safe even if fruits is empty
print(paste("Fruit", i, "is", fruits[i]))
}
Using break and next
Use break to exit a loop early and next to skip to the next iteration (similar to continue in Python).
for (i in 1:5) {
if (i == 3) {
next # skip number 3
}
if (i == 5) {
break # stop the loop at 5
}
print(i)
}
# Prints 1, 2, 4
Common Pitfalls
- Looping is often not the most efficient way to work with data in R; vectorized operations are preferred.
- The loop variable (e.g., fruit) is not limited to the loop's scope; it remains in the environment after the loop finishes.
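One further pitfall concerns index loops over a vector that might be empty. The sketch below compares 1:length(x) with seq_along(x); the empty vector is created only for demonstration.

```r
empty <- character(0)

1:length(empty)      # 1 0: counts down, so a loop would run twice
seq_along(empty)     # integer(0): a loop would not run at all

iterations <- 0
for (i in seq_along(empty)) {
  iterations <- iterations + 1
}
iterations           # 0
```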
while-loop (Basics)
A while loop repeats a block of code as long as its condition is TRUE. The number of repetitions is not fixed in advance.
Simple Counting Example
x <- 1
while (x <= 5) {
print(paste("Count is", x))
x <- x + 1 # Crucial: update the counter
}
Stopping Early with break
x <- 1
while (TRUE) { # runs until broken
if (x %% 5 == 0) {
print(paste(x, "is a multiple of 5"))
break
}
x <- x + 1
}
Common Pitfalls
- Infinite loops: If the condition never becomes FALSE, the loop will run forever. Always ensure the condition can change.
- Forgetting to update the variable in the condition (e.g., x <- x + 1) is a common cause of infinite loops.
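One defensive pattern is to add an iteration cap to the loop condition. In this sketch the starting value 6 and the cap max_steps are arbitrary choices made for the example.

```r
x <- 6             # arbitrary starting value for this sketch
steps <- 0
max_steps <- 1000  # guard against running forever

# Halve even numbers, triple-plus-one odd numbers, until reaching 1.
while (x != 1 && steps < max_steps) {
  x <- if (x %% 2 == 0) x / 2 else 3 * x + 1
  steps <- steps + 1
}
steps   # 8: the sequence is 6, 3, 10, 5, 16, 8, 4, 2, 1
```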
Loop Control Statements: break and next
R provides break and next to control the flow of loops.
break — Exiting a Loop
for (i in 1:10) {
if (i > 5) {
break
}
print(i)
}
# Prints 1,2,3,4,5
next — Skipping an Iteration
for (i in 1:5) {
if (i %% 2 == 0) {
next # Skip even numbers
}
print(paste("Odd:", i))
}
# Prints Odd: 1, Odd: 3, Odd: 5
There is no pass
R does not have a pass statement. If a block needs to be empty, you can simply use an empty block {} or a comment.
for (i in 1:3) {
# Placeholder for future logic - do nothing for now
# An empty block is valid
}
Nested Loops
Nested loops are loops inside loops. They are useful for working with multi-dimensional data, like matrices, or for generating combinations.
Example: Multiplication Table
for (i in 1:3) {
for (j in 1:3) {
cat(i, "x", j, "=", i * j, " ")
}
cat("\n") # New line after each inner loop
}
Nested while Loop
i <- 1
while (i <= 2) {
j <- 1
while (j <= 2) {
print(paste(i, j))
j <- j + 1
}
i <- i + 1
}
Pitfalls
- Complexity: Nested loops can quickly become slow if the inner loops run many times.
- Consider using vectorized operations or the outer() function instead of nested loops for mathematical operations.
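As an illustration of the outer() alternative, the multiplication table from the nested-loop example can be built in one call. The variable name mult_table is made up for this sketch.

```r
# outer() applies a function to every pair (i, j) without explicit loops.
mult_table <- outer(1:3, 1:3)     # the default function is multiplication
mult_table
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    2    4    6
# [3,]    3    6    9

mult_table[2, 3]                  # 6, the same as 2 * 3

# Any vectorized function can be supplied, such as addition.
sum_table <- outer(1:2, 1:3, "+")
dim(sum_table)                    # 2 3
```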
Functions in R
Functions in R are defined using the function keyword and are assigned to a variable. They are first-class objects, meaning you can pass them as arguments to other functions.
1. Defining Functions
greet <- function(name) {
return(paste0("Hello, ", name, "!")) # paste0() avoids a stray space before "!"
}
print(greet("Alice"))
2. Default Arguments
power <- function(x, exponent = 2) {
return(x ^ exponent)
}
print(power(3)) # Uses default: 9
print(power(3, 3)) # Overrides default: 27
3. Anonymous Functions
You can create functions without a name (anonymous functions), often used with functions like lapply().
# An anonymous function that adds two numbers
(function(a, b) { a + b })(3, 4) # Returns 7
# More commonly used with apply functions
lapply(1:3, function(x) x^2) # Returns list(1, 4, 9)
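Two further conventions are worth a sketch: a function returns the value of its last expression even without return(), and several results can be returned together in a list. The function describe() below is hypothetical.

```r
# The value of the last evaluated expression is returned automatically.
describe <- function(x) {
  stopifnot(is.numeric(x))          # fail early on the wrong input type
  list(mean = mean(x), range = range(x))
}

info <- describe(c(2, 4, 9))
info$mean       # 5
info$range      # 2 9

# Functions are values, so they can be passed to other functions.
totals <- sapply(list(1:3, 4:6), sum)   # 6 15
```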
Importing Packages in R
R's functionality is extended through packages (libraries). Packages can be installed from CRAN (Comprehensive R Archive Network) or other repositories and then loaded into your session.
1. Installing Packages
# Install a package from CRAN
install.packages("dplyr")
# Install multiple packages at once
install.packages(c("ggplot2", "tidyr"))
2. Loading Packages
# Load a package into the current session
library(dplyr)
# Alternatively, use require(), but library() is more common
require(ggplot2)
3. Using Package Functions
Once a package is loaded, you can use its functions directly.
library(dplyr)
# Now you can use dplyr functions like filter(), select(), etc.
4. Accessing Functions Without Loading
You can use a specific function from a package without loading the entire package using ::.
dplyr::filter(mtcars, mpg > 20)
5. Common Data Science Packages
- dplyr, tidyr: Data manipulation and cleaning.
- ggplot2: Data visualization.
- readr, readxl: Reading data from files.
- shiny: Building interactive web apps.
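Scripts sometimes check whether a package is available before trying to use it. A minimal sketch using base R's requireNamespace(); the helper has_package() is a hypothetical name introduced here.

```r
# Hypothetical helper: returns TRUE if a package is installed and loadable.
has_package <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

has_package("stats")        # TRUE: ships with R

# Install only when missing (commented out to avoid network access here).
# if (!has_package("dplyr")) install.packages("dplyr")
```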
Reading and Writing Files
R provides several functions to read data from and write data to files. Common formats include CSV, text, and Excel files.
1. Reading a CSV File
# Read a CSV file into a data frame
df <- read.csv("path/to/your/file.csv")
# Control whether strings become factors
# Since R 4.0.0, strings are no longer converted to factors by default,
# so this argument is only needed on older versions
df <- read.csv("file.csv", stringsAsFactors = FALSE)
2. Writing to a CSV File
write.csv(df, "path/to/output/file.csv", row.names = FALSE)
3. Reading Text Files
# Read entire file as a character vector
lines <- readLines("textfile.txt")
# Read with a connection, useful for large files
con <- file("textfile.txt", "r")
first_line <- readLines(con, n = 1)
close(con)
4. Basic File Functions
file.exists("myfile.csv") # Check if file exists
file.remove("old_file.csv") # Delete a file
dir.create("new_folder") # Create a directory
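The reading, writing, and file functions above can be tried together in a round trip. This sketch writes to a temporary file from tempfile() so no real project files are affected; the data frame is invented for the example.

```r
scores <- data.frame(name = c("Alice", "Bob"), score = c(85, 92))

# tempfile() gives a throwaway path, so nothing in the project is touched.
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)

file.exists(csv_path)            # TRUE
restored <- read.csv(csv_path)
restored$score                   # 85 92

removed <- file.remove(csv_path) # clean up; returns TRUE on success
```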
5. The here Package (Recommended)
The here package helps manage file paths in a project, making your code more reproducible.
library(here)
csv_path <- here("data", "my_data.csv") # Constructs a reliable path
df <- read.csv(csv_path)
dplyr — Install & Import
dplyr is a core package for data manipulation in R. It provides a grammar of data manipulation with easy-to-understand verb-based functions.
Install and Load
install.packages("dplyr") # Install the package
library(dplyr) # Load it into your session
Why use dplyr?
- Intuitive function names (verbs) like filter(), select(), mutate().
- Efficient computation, including on large datasets.
- Consistent syntax and excellent integration with the pipe operator %>%.
dplyr — Key Verbs
dplyr's functionality is built around a set of core 'verbs' for data manipulation.
filter() — Select Rows
filter(mtcars, mpg > 20, cyl == 4) # Cars with mpg>20 and 4 cylinders
select() — Select Columns
select(mtcars, mpg, cyl, hp) # Select only these columns
select(mtcars, -mpg) # Select all columns except mpg
mutate() — Create or Modify Columns
mutate(mtcars, kpl = mpg * 0.425144) # Add a new column for km per liter
arrange() — Sort Rows
arrange(mtcars, desc(mpg)) # Sort by mpg, highest first
summarize() — Aggregate Data
summarize(mtcars, avg_mpg = mean(mpg, na.rm = TRUE)) # Average mpg
These verbs are most powerful when used with group_by().
mtcars %>%
group_by(cyl) %>%
summarize(avg_mpg = mean(mpg)) # Average mpg for each cylinder group
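The verbs can be combined into a single pipeline. The sketch below assumes dplyr is installed; the hp > 100 cutoff and the column names n_cars and avg_kpl are arbitrary choices for illustration.

```r
library(dplyr)

# Assumes dplyr is installed; mtcars ships with R.
by_cyl <- mtcars %>%
  filter(hp > 100) %>%                 # keep the more powerful cars
  mutate(kpl = mpg * 0.425144) %>%     # km per liter, as in the mutate() example
  group_by(cyl) %>%
  summarize(
    n_cars  = n(),                     # number of rows in each group
    avg_kpl = mean(kpl)
  ) %>%
  arrange(desc(avg_kpl))               # highest average first

by_cyl
```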
dplyr — The Pipe Operator %>%
The pipe operator %>% (from the magrittr package and re-exported by dplyr) allows you to chain multiple operations together in a readable, left-to-right fashion. It takes the output of the expression on its left and passes it as the first argument to the function on its right.
Example Without Pipes
cyl_4 <- filter(mtcars, cyl == 4)
cyl_4_mpg <- select(cyl_4, mpg, cyl)
result <- arrange(cyl_4_mpg, mpg)
Example With Pipes
result <- mtcars %>%
filter(cyl == 4) %>%
select(mpg, cyl) %>%
arrange(mpg)
The pipe makes the sequence of operations clear and avoids creating intermediate variables.
The Native Pipe |>
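R 4.1.0 introduced a native pipe operator |>. Its behavior is very similar to %>%, with a few differences: the right-hand side must be a function call with parentheses, and the argument placeholder is _ (available since R 4.2.0, and only as a named argument) rather than the dot used by %>%.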
result <- mtcars |>
filter(cyl == 4) |>
select(mpg, cyl) |>
arrange(mpg)
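The native pipe needs no packages, so it can be tried with base functions alone. A small sketch, assuming R 4.1.0 or later:

```r
# Requires R 4.1.0 or later; uses only base functions.
total <- c(1, 4, 9) |>
  sqrt() |>
  sum()
total      # 6

# An anonymous function can be used as a pipeline step
# when wrapped in parentheses and then called.
squares <- 1:3 |> (\(x) x^2)()
squares    # 1 4 9
```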
ggplot2 — Install & Import
ggplot2 is a powerful and popular package for creating static, publication-quality graphics in R based on the Grammar of Graphics.
Install and Load
install.packages("ggplot2")
library(ggplot2)
Why use ggplot2?
- Consistent and logical syntax based on layers.
- High flexibility and customization for complex plots.
- Produces elegant graphics with relatively little code.
ggplot2 — Basic Usage
The fundamental syntax for a ggplot2 graph involves:
- The ggplot() function, which defines the data and aesthetic mappings (aes()).
- Adding layers with geom_ functions (e.g., geom_point(), geom_line()).
- Using + to add components together.
Simple Scatter Plot
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point()
Adding a Smooth Line
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
geom_smooth(method = "lm")
Using Color for Groups
ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
geom_point()
ggplot2 — Modifiers & Styling
You can customize almost every aspect of a ggplot2 graph by adding more layers and theme elements.
Labels and Title
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
labs(
title = "Car Weight vs. MPG",
x = "Weight (1000 lbs)",
y = "Miles Per Gallon"
)
Axis Limits and Scales
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
xlim(0, 6) +
ylim(10, 35)
Themes
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
theme_bw() # Use a black-and-white theme
# Customize the theme in detail
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
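A plot can also be stored in a variable and written to a file with ggsave(). This sketch assumes ggplot2 is installed; the output path comes from tempfile() and the width and height are arbitrary values in inches.

```r
library(ggplot2)

# Build the plot as an object so it can be reused or saved.
p <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point() +
  labs(title = "Car Weight vs. MPG", color = "Cylinders") +
  theme_minimal()

# Write to a temporary PDF; the size values are arbitrary for this sketch.
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
file.exists(out_file)   # TRUE
```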
Advanced R Concepts
Beyond the basics, R offers many advanced features for writing efficient, powerful, and reusable code.
1. The Apply Family
The apply family of functions (apply(), lapply(), sapply(), vapply()) are used to apply a function to margins of an array or to elements of a list/vector, often as an alternative to loops.
# Apply a function to each column of a data frame (margin=2)
apply(mtcars, 2, mean)
# Apply a function to each element of a list
my_list <- list(a = 1:3, b = 4:6)
lapply(my_list, mean) # Returns a list
sapply(my_list, mean) # Tries to simplify the result to a vector
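vapply(), listed above, is a stricter variant of sapply() that checks every result against a template. The sketch below shows the difference on an empty input, where sapply() changes its return type.

```r
my_list <- list(a = 1:3, b = 4:6)

# vapply() is like sapply() but checks each result against a template.
means <- vapply(my_list, mean, FUN.VALUE = numeric(1))
means          # a = 2, b = 5

# sapply() changes its return type with the input: an empty list gives list().
class(sapply(list(), mean))                          # "list"
class(vapply(list(), mean, FUN.VALUE = numeric(1)))  # "numeric"
```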
2. Functional Programming with purrr
The purrr package enhances R's functional programming capabilities, providing a more consistent and powerful set of tools than the base apply functions.
library(purrr)
map(my_list, mean) # Similar to lapply
map_dbl(my_list, mean) # Returns a numeric vector
3. Writing Efficient R Code
R can be slow with loops on large data. Key strategies for efficiency include:
- Vectorization: Use built-in vectorized functions whenever possible.
- Avoid growing objects in loops: Pre-allocate memory for results.
- Use efficient data structures: Data frames for tabular data, matrices for homogeneous numeric data.
- Profile your code: Use system.time() or the profvis package to find bottlenecks.
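The first two strategies can be compared side by side. This is a minimal sketch with an arbitrary size n; all three approaches compute the same squares, and wrapping each in system.time() would show the timing difference on larger inputs.

```r
n <- 10000
values <- seq_len(n)

# Slow pattern: the result vector is copied each time it grows.
grown <- c()
for (v in values) grown <- c(grown, v^2)

# Better: allocate the full vector once, then fill it in.
filled <- numeric(n)
for (i in seq_along(values)) filled[i] <- values[i]^2

# Best: a single vectorized expression.
vectorized <- values^2

identical(filled, vectorized)   # TRUE
```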