R

R: From Programming to Statistics

Organizers: Ewan Dunbar, Alex Drummond

Email: umdlingstats@gmail.com

Description

This course is divided in two. The first week is about programming and R, and the second week is about statistics and data analysis. However, the material in this course is CUMULATIVE. If you miss the first week, you will have a hard time with the second, unless you are already fairly proficient with both programming and R.

This mini-course is intended to familiarize you, first, with the basics of computer programming, and, second, with the fundamentals of statistics for data analysis. Both parts of the course will use the R Software Environment. Both parts of the course are intended to be accessible to absolute beginners, and at the same time to be useful for those who simply feel uncertain about their knowledge or abilities in programming or in statistics, or who would like to have a more in-depth knowledge of the most important elements of either. The course will be interactive and will require you to do take-home exercises.

Location

MMH 1304

Time

Technical help and homefun review session: 9:30 AM - 10:00 AM

Main session: 10:00 AM - 12:00 PM

Please see "Before coming" below to find out what you should do before the first session.

Analyzing Linguistic Data (book)

R for programmers

Before coming

1. Download R from the R website and install it on your computer

(Instructional video for Windows) (Instructional video for Mac)

2. Install the R package languageR. On the Mac version and in the terminal version, make sure to also install its dependencies (the packages that it depends on) when given the option.

(Instructional video for Windows) (Instructions for Mac) (Instructions for terminal version)
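In the terminal version, step 2 comes down to one command typed at the R prompt. The snippet below is a minimal sketch, assuming your machine can reach a CRAN mirror: it only checks whether languageR is already available and, if it is not, prints the command to run. The variable name is_installed is just for illustration.

```r
# Check whether languageR is already installed (this does not load it).
is_installed <- requireNamespace("languageR", quietly = TRUE)

if (is_installed) {
  message("languageR is installed.")
} else {
  # dependencies = TRUE also installs the packages languageR depends on.
  message('Run: install.packages("languageR", dependencies = TRUE)')
}
```

Once the package is installed, library(languageR) loads it for the current session.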

Submitting homefun

You will be asked to write functions. Put them in an R source file so that you can edit them (in the GUI, File -> New Document opens R's built-in text editor) and save the file. Then source your file (using the Source File... menu option) and call your functions from inside R to test them out. You might find the "debug" function useful. If I have a function called "foo", then "debug(foo)" will turn on debugging mode for every call to "foo". To exit the debugger, type "Q". To turn debugging off, type "undebug(foo)".
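As a rough illustration of the debugging workflow, here is a small sketch. The function foo and its body are made up for the example; isdebugged() simply reports whether debugging mode is currently switched on for a function.

```r
# A made-up function to practise on.
foo <- function(x) {
  y <- x * 2
  y + 1
}

debug(foo)        # turn debugging mode on for foo
isdebugged(foo)   # TRUE

# Calling foo(3) now would pause inside the function.
# At the debugger prompt, "n" runs the next line and "Q" quits.

undebug(foo)      # turn debugging mode off again
isdebugged(foo)   # FALSE

result <- foo(3)  # runs normally
print(result)     # 7
```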

If you are asked to do anything which is *not* writing a function (creating a variable, etc), put all your functions at the top of your file, and everything else at the bottom.
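A homefun file laid out this way might look like the sketch below. The function and variable names are invented for illustration and are not taken from any actual homefun.

```r
# hw1.R (hypothetical example)

# --- Functions go at the top ---
double_it <- function(x) {
  x * 2
}

add_one <- function(x) {
  x + 1
}

# --- Everything else goes at the bottom ---
my_number <- 21
my_answer <- double_it(my_number)
print(my_answer)   # 42
```

Keeping the functions first means that every function is already defined by the time the code at the bottom uses it when the file is sourced.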

Submit your homefun to umdlingstats@gmail.com in a file called hw1.R, or hw2.R, etc, depending on whether we are on day 1, or 2, etc. Send whatever you can, but come to the help session at 9:30 if you have problems.

Plan

Week 1 | Monday | Tuesday | Wednesday | Thursday | Friday
9:30 - 10 | Tech support | Homefun help | Homefun help | Homefun help | Homefun help
10 - 12 | Programming | Programming | Programming | Programming | Transition
Topics | Basics, functions, variables | Conditionals, loops | Practice session | Practice session | Working with data in R
Notes | Programming Notes | Programming Notes | Programming Notes | Programming Notes | Day 5 Notes (PDF)
In-class exercises | Programming exercises | Programming exercises; Answers to some in-class exercises | New Programming exercises; Answers to more in-class exercises | New Programming exercises; Tagged corpus; Alex's part-of-speech functions | See notes
Homefun | "Variables" exercises from in-class exercises; Homefun 1 (part 2 optional) | Homefun 2; you will find the examples linked above (answers to in-class exercises) useful | Homefun 3 | Homefun 4 | Day 5 Homefun
Answers | Homefun 1 Solutions (Ewan); Homefun 1 Solutions (Alex) | Homefun 2 Solutions (from class) | Homefun 3 Solutions | See hints below | Homefun 5 Solutions; Homefun 5 Solutions (Q1 and Q3; Alex)

Week 2 | Monday | Tuesday | Wednesday | Thursday | Friday
9:30 - 10 | Homefun help | Homefun help | Homefun help | Homefun help |
10 - 12 | Statistics | Statistics | Statistics | Statistics |
Topics | Fundamental concepts | Linear models: t-test and ANOVA | Linear models: factorial ANOVA and regression | Linear models: Random/mixed effects, hierarchical models |
Notes | Day 6 Notes (PDF) | Day 7 Notes (PDF) | Day 8 Notes (PDF) | Day 9 Notes (brief) (PDF) |
In-class exercises | See notes | See notes | See notes; 2007 U of T stats comp question | |
Homefun | First do exercise 2 from section 2.4; then Day 6 Homefun | Day 7 Homefun | 2008 U of T stats comp question (try) | |
Answers | Possible solution | See suggestions below | 2007 answers; 2008 answers | |

Extra R examples

Day 1: Indexing

c(2,4,6,8)[1]   # 2

c(2,4,6,8)[2]   # 4

c(2,4,6,8)[3]   # 6

c(2,4,6,8)[4]   # 8

c(2,4,6,8)[c(1,3)]   # 2 6 (index with a vector of positions)

c(2,4,6,8)[c(TRUE,FALSE,TRUE,FALSE)]   # 2 6 (keep the positions marked TRUE)

(1:100)[(1:100) < 17]   # the numbers 1 to 16

(1:100)[(1:100)*2 == 16]   # 8
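The last two lines pack two steps into one expression. Splitting them up, as in the sketch below, shows what is going on: the comparison first produces a vector of TRUE and FALSE values, and that vector is then used as the index. The variable names x, keep and small are only for illustration.

```r
x <- 1:100

# Step 1: the comparison gives one TRUE or FALSE per element of x.
keep <- x < 17
print(head(keep, 20))

# Step 2: indexing with that logical vector keeps the TRUE positions.
small <- x[keep]
print(small)        # the numbers 1 to 16

# TRUE counts as 1, so sum() tells you how many elements passed the test.
print(sum(keep))    # 16
```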

Day 3: More loops

For loops with index vectors (a very common practice)

Instead of this while loop

v <- c(5, 4, 3, 11, 6, 1)
i <- 1
while (i <= length(v)) {
  print(v[i])
  i <- i+1
}

you can use a for loop like this:

v <- c(5, 4, 3, 11, 6, 1)
for (i in 1:length(v)) {
  print(v[i])
}

In this case it's not necessary, since you could just as easily have written this:

v <- c(5, 4, 3, 11, 6, 1)
for (x in v) {
  print(x)
}

But you will often find this useful, and easier than writing a while loop.
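One caveat with index vectors: if the vector happens to be empty, 1:length(v) is 1:0, which counts down from 1 to 0, so the loop body runs twice instead of not at all. Base R's seq_along(v) avoids this. The sketch below compares the two; the counter names are made up for the example.

```r
v <- c()   # an empty vector, so length(v) is 0

# 1:length(v) is 1:0, i.e. c(1, 0), so this loop runs twice.
count_colon <- 0
for (i in 1:length(v)) {
  count_colon <- count_colon + 1
}
print(count_colon)   # 2

# seq_along(v) is empty when v is empty, so this loop never runs.
count_seq <- 0
for (i in seq_along(v)) {
  count_seq <- count_seq + 1
}
print(count_seq)     # 0
```

For a non-empty vector, 1:length(v) and seq_along(v) give the same indices.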

Nested loop example

Often, in R, you can do things without using loops. When you can, it is better to do this. It is faster. Here's a loop, but inside it the seq function really does a bunch of things at once.

v <- c(5, 4, 3, 11, 6, 1)
for (i in 1:length(v)) {
  w <- seq(v[i], v[i]+5)
  print(w)
}

But to see what kind of thing R is doing inside, you can do the work of "seq" yourself with a second loop inside the first.

v <- c(5, 4, 3, 11, 6, 1)
for (i in 1:length(v)) {
  w <- c()
  for (j in 1:6) {
    w[j] <- v[i] + j - 1
  }
  print(w)
}

Sometimes, you might find you have to do this. One common situation is where you have vectors inside a list, and you need to traverse the vectors in the outer loop, and the list in the inner loop -- that is, you can't process the vectors independently.
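Here is a minimal sketch of that situation, assuming a list of three vectors of equal length (the data and the names are invented for the example). The outer loop walks through the positions inside the vectors, and the inner loop walks through the list, adding up the values found at that position.

```r
vectors <- list(c(1, 2, 3), c(10, 20, 30), c(100, 200, 300))

n <- length(vectors[[1]])   # assumes all the vectors have the same length
totals <- c()

for (i in 1:n) {                     # outer loop: positions inside the vectors
  total <- 0
  for (j in 1:length(vectors)) {     # inner loop: the elements of the list
    total <- total + vectors[[j]][i]
  }
  totals[i] <- total
}

print(totals)   # 111 222 333
```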

Day 4: Warmup

R has a lot of functions for drawing random (actually "pseudo-random") numbers. The simplest is called "runif". This draws numbers between 0 and 1 by default. All you need to do is tell R how many numbers to sample. So "runif(1)" would give you a vector of length one, with a random number between 0 and 1. If you wanted more, you could, of course, just change 1 to 1000 or whatever you like. Try writing your own function, though, that takes one argument, n, and returns a vector of n random numbers, each drawn by a separate call to runif(1).
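Before starting on the function, it may help to try a few calls to runif directly. The sketch below is only a warm-up, not a solution. set.seed() is a base R function that fixes the starting point of the pseudo-random sequence so that the same numbers come out on every run; the seed value 1 is arbitrary.

```r
set.seed(1)   # optional: makes the "random" draws repeatable

one <- runif(1)    # a vector of length one, between 0 and 1
five <- runif(5)   # five numbers at once

# min and max change the range the numbers are drawn from.
scaled <- runif(3, min = 10, max = 20)

print(one)
print(five)
print(scaled)
```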

parse_tagged_words(c("The:Det", "man:N"))
[[1]]
[[1]]$word
[1] "The"

[[1]]$tag
[1] "Det"


[[2]]
[[2]]$word
[1] "man"

[[2]]$tag
[1] "N"
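To get a feel for how output of this shape can be produced, here is a hypothetical re-implementation. It is not Alex's actual code, which may work differently; it assumes every input string has the form "word:tag" with exactly one colon, and it is given a different name to avoid confusion with the real function.

```r
# Hypothetical sketch, not the course's parse_tagged_words.
parse_tagged_words_sketch <- function(tagged) {
  # strsplit gives one character vector per input string,
  # e.g. "The:Det" becomes c("The", "Det").
  pieces <- strsplit(tagged, ":", fixed = TRUE)
  # Turn each pair into a small list with named parts.
  lapply(pieces, function(p) list(word = p[1], tag = p[2]))
}

parsed <- parse_tagged_words_sketch(c("The:Det", "man:N"))
print(parsed[[1]]$word)   # "The"
print(parsed[[2]]$tag)    # "N"
```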

POS Exercise

Write a function (call it whatever you want) using Alex's POS tag functions that returns a list of the counts of all the tags in the corpus. E.g.

$N
[1] 17

$V
[1] 30

$Det
[1] 40

(But those aren't the actual numbers.)
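As a hint, the counting part can be sketched on its own, without the corpus or the POS tag functions. The vector of tags below is made up. The idea is to use a list whose names are the tags, and to add 1 to the matching entry each time a tag is seen.

```r
tags <- c("Det", "N", "V", "Det", "N")   # made-up tags, not from the corpus

counts <- list()
for (tag in tags) {
  if (is.null(counts[[tag]])) {
    counts[[tag]] <- 1                    # first time we see this tag
  } else {
    counts[[tag]] <- counts[[tag]] + 1    # seen before: add one
  }
}

print(counts)
```

With these made-up tags, counts$Det and counts$N are both 2 and counts$V is 1.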

Homefun 4 Hints

# Both axes reversed: each limit runs from the maximum down to the minimum
plot(ChickWeight$weight ~ ChickWeight$Time, xlim=c(max(ChickWeight$Time),min(ChickWeight$Time)), ylim=c(max(ChickWeight$weight), min(ChickWeight$weight)))

# Same plot, with the plotting character taken from the Time value of each point
plot(ChickWeight$weight ~ ChickWeight$Time, xlim=c(max(ChickWeight$Time),min(ChickWeight$Time)), ylim=c(max(ChickWeight$weight), min(ChickWeight$weight)), pch=as.character(ChickWeight$Time))

# Same plot again, drawn in the Times font family
plot(ChickWeight$weight ~ ChickWeight$Time, xlim=c(max(ChickWeight$Time),min(ChickWeight$Time)), ylim=c(max(ChickWeight$weight), min(ChickWeight$weight)), pch=as.character(ChickWeight$Time), family="Times")

Solutions to short exercises (day 5)

2.2.6.1.1

nrow(english[english$RTlexdec>=4,])

2.2.6.1.2

nrow(english[(english$FrequencyInitialDiphoneWord+english$Ncount)>=15.5 & (english$FrequencyInitialDiphoneWord+english$Ncount)<=17.5,])

2.2.6.2

exercise1a <- english[english$RTlexdec>=4,]

2.2.6.3

nrow(english[english$AgeSubject == "young" & english$LengthInLetters <= 4,])

2.2.6.4

{{{mean(c(nrow(english[english$RTlexdec>=4,]),