R: From Programming to Statistics
Organizers: Ewan Dunbar, Alex Drummond
Email: umdlingstats@gmail.com
Description
This course is divided in two. The first week is about programming and R, and the second week is about statistics and data analysis. However, the material in this course is CUMULATIVE. If you miss the first week, you will have a hard time with the second, unless you are already fairly proficient with both programming and R.
This mini-course is intended to familiarize you, first, with the basics of computer programming, and, second, with the fundamentals of statistics for data analysis. Both parts of the course will use the R Software Environment. Both parts of the course are intended to be accessible to absolute beginners, and at the same time to be useful for those who simply feel uncertain about their knowledge or abilities in programming or in statistics, or who would like to have a more in-depth knowledge of the most important elements of either. The course will be interactive and will require you to do take-home exercises.
Location
MMH 1304
Time
Technical help and homefun review session: 9:30 AM - 10:00 AM
Main session: 10:00 AM - 12:00 PM
Please see "Before coming" below to find out what you should do before the first session.
You might find these links helpful
Analyzing Linguistic Data (book)
Before coming
1. Download R from the R website and install it on your computer
(Instructional video for Windows) (Instructional video for Mac)
2. Install the R package languageR, making sure to also install its dependencies (the packages that it depends on) when given the option, as happens in the Mac version and in the terminal version.
(Instructional video for Windows) (Instructions for Mac) (Instructions for terminal version)
Submitting homefun
You will be asked to write functions. Put them in an R source file so that you can edit them (you can use R's built-in text editor via File -> New Document in the GUI) and save it. Then source your file (using the Source File... menu option) and call your functions from inside R to test them out. You might find the "debug" function useful: if you have a function called "foo", then "debug(foo)" turns on debugging mode for every call to "foo". To exit the debugger, type "Q". To turn debugging off, type "undebug(foo)".
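As a small sketch of the debugging commands just described (the body of "foo" here is made up for illustration), you can check whether debugging mode is on with the base R function isdebugged, without stepping through anything:

```r
# A hypothetical function to practice on; only the name "foo" comes from the text.
foo <- function(x) {
  y <- x * 2
  y + 1
}

debug(foo)        # from now on, calling foo() would step through it line by line
isdebugged(foo)   # TRUE while debugging mode is on
undebug(foo)      # turn debugging mode off again
isdebugged(foo)   # FALSE

result <- foo(2)  # runs normally, without the debugger
```

If you call foo while debugging mode is on, R stops at each line and waits for you; that part only makes sense in an interactive session.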
If you are asked to do anything which is *not* writing a function (creating a variable, etc), put all your functions at the top of your file, and everything else at the bottom.
Submit your homefun to umdlingstats@gmail.com in a file called hw1.R, or hw2.R, etc, depending on whether we are on day 1, or 2, etc. Send whatever you can, but come to the help session at 9:30 if you have problems.
Plan
| | Monday | Tuesday | Wednesday | Thursday | Friday |
| 9:30 - 10 | Tech support | Homefun help | Homefun help | Homefun help | Homefun help |
| 10 - 12 | Programming | Programming | Programming | Programming | Transition |
| Topics | Basics, functions, variables | Conditionals, loops | Practice session | Practice session | Working with data in R |
| Notes | | | | | |
| In-class exercises | New Programming exercises; Answers to more in-class exercises | New Programming exercises; Tagged corpus; Alex's part-of-speech functions | See notes | | |
| Homefun | "Variables" exercises from in-class exercises; Homefun 1 (part 2 optional) | Homefun 2; you will find the examples linked above (answers to in-class exercises) useful | | | |
| Answers | See hints below | | | | |

| | Monday | Tuesday | Wednesday | Thursday | Friday |
| 9:30 - 10 | | Homefun help | Homefun help | Homefun help | Homefun help |
| 10 - 12 | | Statistics | Statistics | Statistics | Statistics |
| Topics | | Fundamental concepts | Linear models: t-test and ANOVA | Linear models: factorial ANOVA and regression | Linear models: Random/mixed effects, hierarchical models |
| Notes | | | | | |
| In-class exercises | | See notes | See notes | See notes; 2007 U of T stats comp question | |
| Homefun | | First do exercise 2 from section 2.4; then Day 6 Homefun | | | |
| Answers | | See suggestions below | | | |
Extra R examples
Day 1: Indexing
c(2,4,6,8)[1]
c(2,4,6,8)[2]
c(2,4,6,8)[3]
c(2,4,6,8)[4]
c(2,4,6,8)[c(1,3)]
c(2,4,6,8)[c(TRUE,FALSE,TRUE,FALSE)]
(1:100)[(1:100) < 17]
(1:100)[(1:100)*2 == 16]
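The indexing examples above use three kinds of index: positions, TRUE/FALSE vectors, and conditions that produce TRUE/FALSE vectors. Here is a short sketch of the same ideas with the results stored in variables (the variable names are made up for illustration):

```r
v <- c(2, 4, 6, 8)

first_and_third <- v[c(1, 3)]                       # pick elements by position
same_by_logical <- v[c(TRUE, FALSE, TRUE, FALSE)]   # pick elements where the index is TRUE

n <- 1:100
below_17 <- n[n < 17]            # n < 17 is itself a TRUE/FALSE vector of length 100
doubles_to_16 <- n[n * 2 == 16]  # only one element doubles to 16
```

The last two lines work because the comparison is computed for every element at once, and the resulting TRUE/FALSE vector is then used as the index.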
Day 3: More loops
For loops with index vectors (a very common practice)
Instead of this while loop
v <- c(5, 4, 3, 11, 6, 1)
i <- 1
while (i <= length(v)) {
  print(v[i])
  i <- i+1
}
you can use a for loop like this:
v <- c(5, 4, 3, 11, 6, 1)
for (i in 1:length(v)) {
  print(v[i])
}
In this case it's not necessary, since you could just as easily have written this:
v <- c(5, 4, 3, 11, 6, 1)
for (x in v) {
  print(x)
}
But you will often find this useful, and easier than writing a while loop.
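One caveat with index vectors, offered as a side note rather than as part of the course material: when the vector is empty, 1:length(v) counts down from 1 to 0 instead of giving nothing. The base R function seq_along avoids this. A minimal sketch (the variable "count" is just for illustration):

```r
v <- c()                       # an empty vector, so length(v) is 0

backwards <- 1:length(v)       # this is 1:0, i.e. the two numbers 1 and 0
empty_index <- seq_along(v)    # this has no elements at all

count <- 0
for (i in seq_along(v)) {      # the loop body never runs for an empty vector
  count <- count + 1
}
```

For a non-empty vector, seq_along(v) and 1:length(v) give the same indices.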
Nested loop example
Often, in R, you can do things without using loops. When you can, it is better to do so, because it is faster. Here's a loop, but inside it the seq function produces a whole vector of numbers in one step.
v <- c(5, 4, 3, 11, 6, 1)
for (i in 1:length(v)) {
  w <- seq(v[i], v[i]+5)
  print(w)
}
But to see what kind of thing R's doing inside, you can "seq" for yourself.
v <- c(5, 4, 3, 11, 6, 1)
for (i in 1:length(v)) {
  w <- c()
  for (j in 1:6) {
    w[j] <- v[i] + j - 1
  }
  print(w)
}
Sometimes, you might find you have to do this. One common situation is where you have vectors inside a list, and you need to traverse the vectors in the outer loop, and the list in the inner loop -- that is, you can't process the vectors independently.
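Here is a sketch of that last situation, using made-up data: three vectors of equal length stored in a list, where we want the total at each position across all the vectors. The outer loop walks along the positions inside the vectors and the inner loop walks through the list.

```r
# Hypothetical data: three vectors of length 3 inside a list.
l <- list(c(1, 2, 3), c(10, 20, 30), c(100, 200, 300))

totals <- c()
for (j in 1:3) {                 # outer loop: positions within the vectors
  total <- 0
  for (i in 1:length(l)) {       # inner loop: the vectors in the list
    total <- total + l[[i]][j]   # j-th element of the i-th vector
  }
  totals[j] <- total
}
print(totals)
```

Note the double brackets: l[[i]] pulls the i-th vector out of the list, and [j] then picks one element from that vector.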
Day 4: Warmup
R has a lot of functions for drawing random (actually "pseudo-random") numbers. The simplest is called "runif". This draws numbers between 0 and 1 by default. All you need to do is tell R how many numbers to sample. So "runif(1)" would give you a vector of length one, with a random number between 0 and 1. If you wanted more, you could, of course, just change 1 to 1000 or whatever you like. Try writing your own function, though, that takes one argument, n, and returns a vector with n different random numbers drawn using runif(1).
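If you get stuck, here is one possible way to write it, as a sketch only; the function name my_runif is made up, and there are other reasonable answers.

```r
# One possible answer to the warmup (my_runif is a hypothetical name).
my_runif <- function(n) {
  result <- numeric(n)          # a vector of n zeros to fill in
  for (i in seq_len(n)) {       # seq_len(n) is 1, 2, ..., n (and empty when n is 0)
    result[i] <- runif(1)       # draw one number at a time, as the exercise asks
  }
  result
}

draws <- my_runif(5)
length(draws)
```

In practice you would just call runif(5), which does the same job in one step; the point of the exercise is to practice loops and functions.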
parse_tagged_words(c("The:Det", "man:N"))
[[1]]
[[1]]$word
[1] "The"
[[1]]$tag
[1] "Det"
[[2]]
[[2]]$word
[1] "man"
[[2]]$tag
[1] "N"
POS Exercise
Write a function (call it whatever you want) using Alex's POS tag functions that returns a list of the counts of all the tags in the corpus. E.g.
$N
[1] 17

$V
[1] 30

$Det
[1] 40
(But those aren't the actual numbers.)
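To show the shape of such a function, here is a sketch that does not use Alex's POS tag functions (they are not reproduced here). Instead it assumes the tagged words are plain strings in the "word:tag" format shown above and splits them with the base R function strsplit. The data and the name count_tags are made up.

```r
# Made-up tagged words in the "word:tag" format.
tagged <- c("The:Det", "man:N", "saw:V", "the:Det", "dog:N")

count_tags <- function(tagged_words) {
  counts <- list()
  for (tw in tagged_words) {
    # Split "man:N" into c("man", "N") and keep the second piece, the tag.
    tag <- strsplit(tw, ":", fixed = TRUE)[[1]][2]
    if (is.null(counts[[tag]])) {
      counts[[tag]] <- 1                    # first time we see this tag
    } else {
      counts[[tag]] <- counts[[tag]] + 1    # seen before: add one
    }
  }
  counts
}

tag_counts <- count_tags(tagged)
```

Your own version should get the tags from the corpus through Alex's functions; only the counting part would look similar.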
Homefun 4 Hints
plot(ChickWeight$weight ~ ChickWeight$Time, xlim=c(max(ChickWeight$Time),min(ChickWeight$Time)), ylim=c(max(ChickWeight$weight), min(ChickWeight$weight)))
plot(ChickWeight$weight ~ ChickWeight$Time, xlim=c(max(ChickWeight$Time),min(ChickWeight$Time)), ylim=c(max(ChickWeight$weight), min(ChickWeight$weight)), pch=as.character(ChickWeight$Time))
plot(ChickWeight$weight ~ ChickWeight$Time, xlim=c(max(ChickWeight$Time),min(ChickWeight$Time)), ylim=c(max(ChickWeight$weight), min(ChickWeight$weight)), pch=as.character(ChickWeight$Time), family="Times")
Solutions to short exercises (day 5)
2.2.6.1.1
nrow(english[english$RTlexdec>=4,])
2.2.6.1.2
nrow(english[(english$FrequencyInitialDiphoneWord+english$Ncount)>=15.5 & (english$FrequencyInitialDiphoneWord+english$Ncount)<=17.5,])
2.2.6.2
exercise1a <- english[english$RTlexdec>=4,]
2.2.6.3
nrow(english[english$AgeSubject == "young" & english$LengthInLetters <= 4,])
2.2.6.4
mean(c(nrow(english[english$RTlexdec>=4,]),
       nrow(english[(english$FrequencyInitialDiphoneWord+english$Ncount)>=15.5 &
                    (english$FrequencyInitialDiphoneWord+english$Ncount)<=17.5,]),
       nrow(english[english$AgeSubject == "young" &
                    english$LengthInLetters <= 4,])))
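All of these solutions follow the same pattern: build a TRUE/FALSE vector with one entry per row, use it to pick rows, and count them. Here is a sketch of that pattern on a tiny made-up data frame (the data frame "toy" and its columns are hypothetical stand-ins, not the languageR "english" data):

```r
# A made-up data frame, small enough to check by eye.
toy <- data.frame(
  RT     = c(3.8, 4.1, 4.5, 3.9),
  Age    = c("young", "old", "young", "young"),
  Length = c(3, 5, 4, 6)
)

keep <- toy$Age == "young" & toy$Length <= 4   # one TRUE/FALSE per row
subset_rows <- toy[keep, ]                     # the rows where keep is TRUE
n_rows <- nrow(subset_rows)                    # how many rows were kept
n_same <- sum(keep)                            # same count: TRUE counts as 1
```

The comma in toy[keep, ] matters: what comes before it selects rows, and leaving the part after it empty keeps all the columns.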
