Introduction to information system
Introduction to R
Bowei Chen
School of Computer Science
University of Lincoln
CMP3036M/CMP9063M
Data Science 2016 – 2017 Workshop
What is R?
• R is a free software environment for
statistical computing and graphics.
• R compiles and runs on a wide
variety of UNIX platforms, Windows
and MacOS.
• R can be downloaded at:
https://cran.r-project.org/
Comprehensive R Archive Network (CRAN)
• CRAN includes packages which provide additional functionalities.
• Over 7,801 additional packages (as of January 2016) available at CRAN,
Bioconductor, Omegahat, GitHub, and other repositories.
• R packages are written mainly by academics and company staff.
• The R Foundation is seated in Vienna, Austria and currently hosted by
the Vienna University of Economics and Business. It is a registered
association under Austrian law and active worldwide.
Short History of R (1/2)
• S is a statistical programming language developed primarily by John
Chambers, Rick Becker and Allan Wilks at Bell Laboratories since 1976.
• The two modern implementations of S are:
– R: part of the GNU free software project
– S-PLUS (or S+): A commercial product sold by TIBCO Software
Short History of R (2/2)
• S-PLUS is a commercial implementation of the S programming language sold
by TIBCO Software Inc.
• R was created by Ross Ihaka and Robert Gentleman at the University of
Auckland, New Zealand, and is currently developed by the R Development
Core Team, of which John Chambers is a member. R is named partly after
the first names of the first two R authors and partly as a play on the name of S.
What Can You Do Using R? (1/2)
• Data entry and manipulation
– Input data
• from keyboard
• from spreadsheet
• from another statistics package
– Manipulate data
• Statistical analysis
– Descriptive statistics
– Statistical inference
What Can You Do Using R? (2/2)
• Graphical display
– Predefined plots for some models
– Flexible, powerful options
– Save to image files in various formats
• Write new functions
– Make a change to an existing function
– Create new functions tailored to your exact needs
– Contribute a new package
• Create documents (with Sweave, knitr)
– PDF (article and slides)
– HTML
Why Use R for Data Science Computing? (1/2)
• Open source (R is the GNU implementation of S)
• Good visualisations (ggplot2, lattice, standard plot library)
• Easier for writing custom packages and functions
• Closer to the statistics and machine learning community
• Better LaTeX support (Sweave, knitr)
• Works with big data (RHadoop, Rspark, Rcpp)
By Gregory Piatetsky, KDnuggets
http://www.kdnuggets.com/author/gregory-piatetsky
Limitations of R
• The quality of some packages is less than perfect. They are not error-free!
• Many R commands give little thought to memory management, and so R
can very quickly consume all available memory. This can be a restriction
when doing data mining. There are various solutions, including using 64-bit
operating systems that can access much more memory than 32-bit ones.
• Documentation is sometimes patchy and terse, and impenetrable to
the non-statistician. However, some very high-standard books are
increasingly plugging the documentation gaps.
RGui
When R is waiting
for us to tell it what
to do, it begins the
line with >
Type
• ‘demo()’ for some demos
• ‘help()’ for on-line help
• ‘help.start()’ for an
HTML browser interface
• ‘q()’ to quit R
Editors and IDEs
• RStudio
• Jupyter Notebook
• Vim
• Emacs (ESS)
• Eclipse (StatET)
• Tinn-R
• Notepad++
• LaTeX/LyX (knitr, Sweave)
• …
https://www.rstudio.com/
R source editor (Ctrl+1)
R console (Ctrl+2)
Environment (Ctrl+8)
History (Ctrl+4)
Help (Ctrl+3)
Files (Ctrl+5)
Plots (Ctrl+6)
Packages (Ctrl+7)
Objects
• Everything in R is an object, and every object has a class.
• Data and intermediate results are stored in R objects.
• The class of an object describes both what the object contains and what
many standard functions do with it.
• Objects are usually accessed by name.
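As a small illustration of the points above, class() reports the class of an object, and a generic function such as summary() behaves differently depending on that class. The object names below are made up for the example.

```r
# Hypothetical objects, used only for illustration
n <- 42                        # a number
s <- "forty-two"               # a string
f <- factor(c("a", "b", "a"))  # a factor

class(n)    # "numeric"
class(s)    # "character"
class(f)    # "factor"

# The same generic function acts differently depending on the class
summary(n)  # numeric summary (min, quartiles, mean, max)
summary(f)  # counts per level
```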
R Commands
• R commands are either assignments or expressions
• Commands are separated either by a semicolon ; or newline
x <- 1+2
`<-`(x, 1+2) #same thing
x = 1+2 #same thing
Assignment Operations
An assignment command evaluates an expression and passes the value to a
variable but the result is not printed.
Expression Operations
An expression command is evaluated and (normally) printed. If the statement
results in a value, R will print that value automatically.
> 1+2
[1] 3
> 1+2*3
[1] 7
> (1+2)*3
[1] 9
In R, any number that you print
out in the console is interpreted as
a vector. A vector is an ordered
collection of numbers. The “[1]”
means that the index of the first
item displayed in the row is 1.
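A short sketch of the difference between the two kinds of command, and of the claim that a single number is a vector. The variable names are arbitrary.

```r
x <- 1 + 2      # assignment: nothing is printed
x               # expression: prints [1] 3
(y <- x * 2)    # parentheses turn the assignment into a printed expression

# Even a single number is a vector of length 1
length(x)       # 1
is.vector(x)    # TRUE
```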
Workspace
• R stores objects in a workspace that is kept in memory.
• When you quit, R asks whether you want to save the workspace.
• A saved workspace, containing all the objects you worked on, can be restored
the next time you work with R, along with a history of the commands used.
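The same save-and-restore cycle can be tried by hand. This is only a sketch: the object names and the temporary file are assumptions made for the example. save() writes chosen objects to a file, while save.image() would write the whole workspace.

```r
# Illustrative objects
dose <- c(3, 5, 2)
label <- "trial A"

ls()                                   # names of objects in the workspace

# Save selected objects to a temporary file (the path is just for this example)
ws_file <- tempfile(fileext = ".RData")
save(dose, label, file = ws_file)

rm(dose, label)                        # remove them from the workspace

load(ws_file)                          # restore the saved objects
dose                                   # the vector 3 5 2 is back
```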
Variables (1/3)
A variable is a symbol that holds a
value, which can be any R object.
The types of variables are:
• Integer
• Double
• Character
• Logical
• Factor or categorical
Variables (2/3)
Integer, double (numerical values)
> a = 49
> sqrt(a)
[1] 7
> a <- pi
> print(a)
[1] 3.141593
Character, string, logical
> a = "The dog ate my homework"
> sub("dog","cat",a)
[1] "The cat ate my homework"
> a = (1+1==3)
> a
[1] FALSE
Variables (3/3)
Factor
> a <- factor(c("H", "e", "l", "l", "o"))
> print(a)
[1] H e l l o
Levels: e H l o
> class(a)
[1] "factor"
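To check which of these types a variable actually has, typeof() and class() can be used. The sketch below uses made-up values. Note that a plain number such as 49 is stored as a double, and the L suffix is needed to get an integer.

```r
a  <- 49L                              # the L suffix makes an integer
b  <- 49                               # without it, numbers are stored as doubles
s  <- "homework"                       # character
ok <- (1 + 1 == 2)                     # logical
f  <- factor(c("low", "high", "low"))  # factor (categorical)

typeof(a)   # "integer"
typeof(b)   # "double"
typeof(s)   # "character"
typeof(ok)  # "logical"
class(f)    # "factor"
levels(f)   # "high" "low" (levels are sorted by default)
```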
Types of Numerical Variables (1/2)
When we use numerical objects, in
mathematical terms, variables can be
classified as:
• Scalars
• Vectors
• Matrices
A scalar is a single number
> x <- 5
> Y <- 100
Types of Numerical Variables (2/2)
A vector is a sequence of numbers
> x <- c(3, 5, 2)
> x
[1] 3 5 2
A matrix is a two-way table of numbers
> x <- matrix(c(2, 3, 4, 5, 6, 7), nrow=3, ncol=2)
> x
     [,1] [,2]
[1,]    2    5
[2,]    3    6
[3,]    4    7
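The output above shows that matrix() fills the table column by column. A brief sketch of indexing such a matrix, and of the byrow argument that changes the fill order:

```r
x <- matrix(c(2, 3, 4, 5, 6, 7), nrow = 3, ncol = 2)
dim(x)        # 3 2
x[2, 1]       # 3  (row 2, column 1)
x[, 2]        # 5 6 7  (second column)

# byrow = TRUE fills the matrix row by row instead
y <- matrix(c(2, 3, 4, 5, 6, 7), nrow = 3, ncol = 2, byrow = TRUE)
y[2, 1]       # 4
```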
Variable Names
• You can use simple variable names like x, y, A, and a (note that A and a are
different variable names). You can also use longer names like counter,
index1, or subject_id.
• A variable name can contain digits, but it cannot begin with a digit.
• Be careful not to reuse the names of built-in functions or symbols for your
own variables! For example, you could create a variable named log, which makes
code confusing to read, and a function named log would hide the built-in
logarithm function.
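A minimal sketch of what happens when the name log is reused. When a name is called as a function, R skips objects that are not functions, so a plain variable does not break log(), but a user-defined function of the same name does mask it.

```r
log <- 5                   # a variable that reuses the name of a built-in function
log(100, base = 10)        # still 2: R looks for a function when log() is called
log                        # 5: but the bare name now refers to our variable

log <- function(x) "not a logarithm"   # a function with the same name
log(100)                   # "not a logarithm": the built-in is now masked
base::log(100, base = 10)  # the original is still reachable via base::

rm(log)                    # remove our definition; log() is the built-in again
```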
Comments
A comment is anything you write in
your program code that is ignored by
the computer.
Comments help others understand
your code. Anything following a “#”
character is a comment in R.
> x <- c(3, 5, 2) ## These are the doses of the new drug formulation.
Arithmetic Operators
Addition                          +
Subtraction                       -
Multiplication                    *
Division                          /
Exponentiation                    ^ or **
Modulus (x mod y) 5%%2 is 1       x %% y
Integer division 5%/%2 is 2       x %/% y
Comparison Operators
Equal                             ==
Not equal                         !=
Greater than                      >
Greater than or equal             >=
Less than                         <
Less than or equal                <=
Logical Operators
x and y                           x & y
x or y                            x | y
Not x                             !x
Test if x is TRUE                 isTRUE(x)
Numeric Functions
Absolute value                    abs(x)
Square root                       sqrt(x)
ceiling(3.475) is 4               ceiling(x)
floor(3.475) is 3                 floor(x)
round(3.475, digits=2) is 3.48    round(x, digits=n)
signif(3.475, digits=2) is 3.5    signif(x, digits=n)
Cosine, sine, tan, …              cos(x), sin(x), tan(x)
Natural logarithm                 log(x)
Common logarithm                  log10(x)
Exponential of x                  exp(x)
Control Structures: if
Syntax:
if (cond) { cmd1 }
> if (TRUE) {
+ "this will be printed if it is TRUE"
+ }
[1] "this will be printed if it is TRUE"
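In practice the condition is usually built from the arithmetic, comparison and logical operators listed earlier. The following sketch uses an arbitrary value of x. It uses &&, the single-value form of &, which is the usual choice inside if.

```r
x <- 17   # illustrative value

x %% 2            # 1   (remainder)
x %/% 2           # 8   (integer division)
x^2               # 289
sqrt(x) > 4       # TRUE (sqrt(17) is about 4.12)
floor(sqrt(x))    # 4

# Comparison and logical operators combine into a single condition
if (x %% 2 == 1 && x > 10) {
  print("x is an odd number greater than 10")
}
```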
Control Structures: if-else
Syntax:
if (cond) { cmd1 } else { cmd2 }
> if(1==0) {
+ print(1)
+ } else {
+ print(2)
+ }
[1] 2
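Several conditions can be chained with else if. The function below is a hypothetical helper written only to illustrate the pattern. Note that else has to follow the closing brace on the same line.

```r
# Hypothetical helper for illustration
sign_label <- function(v) {
  if (v > 0) {
    "positive"
  } else if (v < 0) {
    "negative"
  } else {
    "zero"
  }
}

sign_label(3)    # "positive"
sign_label(-2)   # "negative"
sign_label(0)    # "zero"
```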
Control Structures: ifelse
Syntax:
ifelse(cond, yes, no)
> ifelse(1 == 0,
+ "this will be printed if 1==0",
+ "this will be printed if 1!=0")
[1] "this will be printed if 1!=0"
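Unlike if, ifelse() works on a whole vector at once, which is where it is most useful. A small sketch with made-up marks and an assumed pass threshold of 40:

```r
marks <- c(35, 72, 58, 40)   # made-up data

# ifelse() tests every element and returns a vector of the same length
result <- ifelse(marks >= 40, "pass", "fail")
result   # "fail" "pass" "pass" "pass"
```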
Control Structures: for
Syntax:
for (var in seq) { expr }
> x <- c("a", "a", "a", "a", "a")
> for (i in x){
+ print(i)
+ }
[1] "a"
[1] "a"
[1] "a"
[1] "a"
[1] "a"
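A loop can also run over positions rather than values, which is handy when a result has to be stored for each element. This sketch uses made-up data, and seq_along() is one common way to generate the positions.

```r
fruit <- c("apples", "oranges", "bananas")   # made-up data
n_chars <- integer(length(fruit))            # pre-allocate the result

# seq_along(fruit) gives 1, 2, 3 and is safe even for an empty vector
for (i in seq_along(fruit)) {
  n_chars[i] <- nchar(fruit[i])
}
n_chars   # 6 7 7
```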
Control Structures: repeat
Syntax:
repeat { expr; if (cond) break }
> i <- 10
> repeat {
+ if (i > 25)
+ break
+ else {
+ print(i); i <- i + 5;
+ }
+ }
[1] 10
[1] 15
[1] 20
[1] 25
Control Structures: while
Syntax:
while (cond) { expr }
> i <- 10
> while (i <= 25) {
+ print(i); i <- i + 5
+ }
[1] 10
[1] 15
[1] 20
[1] 25
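Inside repeat and while loops, break leaves the loop and next skips to the next iteration. A minimal sketch that sums the odd numbers from 1 to 10 (the variable names are arbitrary):

```r
i <- 0
total <- 0
while (TRUE) {
  i <- i + 1
  if (i > 10) break          # leave the loop
  if (i %% 2 == 0) next      # skip even numbers
  total <- total + i
}
total   # 25  (1 + 3 + 5 + 7 + 9)
```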
Control Structures: switch
Syntax:
switch(expr, ...)
> AA = 'foo'
> switch(AA,
+ foo = {
+ print('foo') # case 'foo'
+ },
+ bar = {
+ print('bar') # case 'bar'
+ },
+ {
+ print('default')
+ })
[1] "foo"
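switch() also returns a value, so it can be used to map names to results. In the sketch below, describe_drink is a hypothetical helper. An empty case such as beer = , falls through to the next case, and the final unnamed argument is the default.

```r
# Hypothetical helper for illustration
describe_drink <- function(drink) {
  switch(drink,
         beer = ,                  # empty: falls through to the next case
         wine = "alcoholic",
         water = "non-alcoholic",
         "unknown")                # default when nothing matches
}

describe_drink("beer")    # "alcoholic"
describe_drink("water")   # "non-alcoholic"
describe_drink("tea")     # "unknown"
```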
Installing R and RStudio on Your Machine
• Download R from https://cran.r-project.org/
• Download RStudio at https://www.rstudio.com/
Exercise 1/10
demo(graphics)
demo(plotmath)
demo(Japanese)
demo(lm.glm)
demo(hclColors)
Exercise 2/10
x<-c(4,2,6)
y<-c(1,0,-1)
length(x)
sum(x)
sum(x^2)
x+y
x*y
x-2
x^2
Exercise 3/10
7:11
seq(2,9)
seq(4,10,by=2)
seq(3,30,length=10)
seq(6,-4,by=-2)
Exercise 4/10
rep(2,4)
rep(c(1,2),4)
rep(c(1,2),c(4,4))
rep(1:4,4)
rep(1:4,rep(3,4))
Exercise 5/10
c(T,T,F,F) & c(T,F,F,T)
x <- as.logical(0); !x
x <- seq(-3,3,length=200) > 0
1:3 + c(T,F,T)
intersect(1:10,5:15)
drinks <- factor(c("beer","beer","wine","water"))
Exercise 6/10
x<-c(5,7,9); y<-c(6,3,4); z<-cbind(x,y); print(z)
c(1, 2, 3, . . . , 19, 20)
x <- c(3,6,8); y <- c(2,5,1); x[y>1.5]
x <- c(3,6,8); y <- c(2,5,1); y[x==6]
Exercise 7/10
x <- 1:15
if (sample(x, 1) <= 10) {
  print("x is less than or equal to 10")
} else {
  print("x is greater than 10")
}
Clean all the variables (the workspace)
rm(list=ls())
Clean one variable
rm(x)
Exercise 8/10
x <- c("apples", "oranges", "bananas", "strawberries")
for (i in x) {
  print(i)
}
for (i in 1:4) {
  print(x[i])
}
for (i in seq(x)) {
  print(x[i])
}
for (i in 1:4) print(x[i])
Exercise 9/10
i <- 1
while (i < 10) {
  print(i)
  i <- i + 1
}
Exercise 10/10
z <- c("Alec", "Dan", "Rob", "Karthik"); typeof(z)
x <- c(0.5, 0.7)
x <- c(TRUE, FALSE)
x <- c("a", "b", "c", "d", "e")
x <- 9:100
x <- c(1 + (0+0i), 2 + (0+4i))
Additional Exercises
1) Create a number series that repeats 1 to 10 ten times
2) Create a number series that repeats each number from 1 to 10 ten times
3) Find the numbers that are the same (i.e. same integer and same index) in
series 1) and 2)
4) Create a series from 1 to 30
5) Create a 30-number geometric progression with ratio 1.2 (starting from 1)
6) Compare series 4) and 5) to get a series of TRUE/FALSE values stating
whether each number in series 4) is larger than the number in series 5)
7) Find the numbers in series 4) that are larger than the number with the
same index in series 5)
References
• W. Venables, D. Smith, and the R Core Team (2015) An Introduction to R.
• P. Teetor (2011) R Cookbook. O'Reilly.
• J. Adler (2012) R in a Nutshell, 2nd Edition. O'Reilly.
Thank You!
bchen@lincoln.ac.uk