mosaic
packageNSF-funded project to develop a new way to introduce mathematics, statistics, computation and modeling to students in colleges and universities.
more information at mosaic-web.org
mosaic
package is available via
This document was originally created as an R presentation to be used as slides accompanying various presentations. It has been converted into a more traditional document for use as a vignette in the mosaic
package.
The examples below use the mosaic
and mosaicData
packages.
require(mosaic)
require(mosaicData)
Many of the guiding principles of the mosaic
package reflect the “Less Volume, More Creativity” mantra of Mike McCarthy who had a large poster with those words placed in the “war room” (where assistant coaches decide on the game plan for the upcoming opponent) as a constant reminder not to add too much complexity to the game plan.
|
A lot of times you end up putting in a lot more volume, because you are teaching fundamentals and you are teaching concepts that you need to put in, but you may not necessarily use because they are building blocks for other concepts and variations that will come off of that … In the offseason you have a chance to take a step back and tailor it more specifically towards your team and towards your players." Mike McCarthy, Head Coach, Green Bay Packers |
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. — Antoine de Saint-Exupery (writer, poet, pioneering aviator) |
|
One key to successfully introducing R is finding a set of commands that is
It is not enough to use R, it must be used elegantly.
Two examples of this principle: * the mosaic package * the dplyr package (Hadley Wickham)
Goal: a minimal set of R commands for Intro Stats
Result: Minimal R Vignette (vignette("MinimalR")
)
Much of the work on the mosaic
package has been motivated by
If you (or your students) are just getting started with R, it is good to keep the following in mind:
The following template is important because we can do so much with it.
It is useful to name the components of the template:
We’re hiding a bit of complexity in the template, and there will be times that we will want to gussy things up a bit. We’ll indicate that by adding ...
to the end of the template. Just don’t let ...
become a distractor early on.
Here are some variations on the template.
# simpler version
goal( ~ x, data = mydata )
# fancier version
goal( y ~ x | z , data = mydata )
# unified version
goal( formula , data = mydata )
Using the template generally requires answering two questions. (These questions are useful in the context of nearly all computer tools, just substitute “the computer” in for R in the questions.)
xyplot()
)births ~ date
)data=Births78
)
?Births78
for documentationxyplot( births ~ date, data=Births78)
Some things you will need to know:
Command: bwplot()
HELPrct
age
, substance
?HELPrct
for info about databwplot( age ~ substance, data=HELPrct)
Some things you will need to know:
Command: bwplot()
HELPrct
age
, substance
?HELPrct
for info about databwplot( substance ~ age, data=HELPrct )
Note that we have reversed which variable is mapped to the x-axis and which to the y-axis by reversing their order in the formula.
histogram( ~ age, data=HELPrct)
Note: When there is one variable it is on the right side of the formula.
histogram( ~age, data=HELPrct )
densityplot( ~age, data=HELPrct )
bwplot( ~age, data=HELPrct )
qqmath( ~age, data=HELPrct )
freqpolygon( ~age, data=HELPrct )
bargraph( ~sex, data=HELPrct )
xyplot( i1 ~ age, data=HELPrct )
bwplot( age ~ substance, data=HELPrct )
bwplot( substance ~ age, data=HELPrct )
Note: i1
is the average number of drinks (standard units) consumed per day in the past 30 days (measured at baseline)
histogram()
, qqmath()
, densityplot()
, freqpolygon()
, bargraph()
xyplot()
, bwplot()
Create a plot of your own choosing with one of these data sets
names(KidsFeet) # 4th graders' feet
?KidsFeet
names(Utilities) # utility bill data
?Utilities
require(NHANES) # load package
names(NHANES) # body shape, etc.
?NHANES
Add groups =
group to overlay.
Use y ~ x | z
to create multipanel plots.
Here is an example.
densityplot( ~ age | sex, data=HELPrct,
groups=substance,
auto.key=TRUE)
Beginners can create plots with 3 or 4 variables easily and quickly using this template.
The lattice
graphics system includes lots of bells and whistles including
I used to introduce these too early. My current approach:
require(lubridate)
xyplot( births ~ date, data=Births78,
groups=wday(date, label=TRUE, abbr=TRUE),
type='l',
auto.key=list(columns=4, lines=TRUE, points=FALSE),
par.settings=list(
superpose.line=list( lty=1 ) ))
Notes
wday()
is in the lubridate
packageThe mosaic
package provides functions that make it simple to create numerical summaries using the same template used for graphing (and later for describing linear models).
Big idea:
histogram( ~ age, data=HELPrct ) # width=5 (or 10) might be good here
mean( ~ age, data=HELPrct )
## [1] 35.65342
The mosaic package includes formula aware versions of mean()
, sd()
, var()
, min()
, max()
, sum()
, IQR()
, …
Also provides favstats()
to compute our favorites.
favstats( ~ age, data=HELPrct )
## min Q1 median Q3 max mean sd n missing
## 19 30 35 40 60 35.65342 7.710266 453 0
favstats()
quickly becomes a go-to function in our courses.
tally( ~ sex, data=HELPrct)
## sex
## female male
## 107 346
tally( ~ substance, data=HELPrct)
## substance
## alcohol cocaine heroin
## 177 152 124
There are three ways to think about this. All do the same thing.
sd( age ~ substance, data=HELPrct )
sd( ~ age | substance, data=HELPrct )
sd( ~ age, groups=substance, data=HELPrct )
## alcohol cocaine heroin
## 7.652272 6.692881 7.986068
This makes it possible to easily convert three different types of plots into the (same) corresponding numerical summary.
2-way tables are just tallies of 2 variables.
tally( sex ~ substance, data=HELPrct )
## substance
## sex alcohol cocaine heroin
## female 36 41 30
## male 141 111 94
tally( ~ sex + substance, data=HELPrct )
## substance
## sex alcohol cocaine heroin
## female 36 41 30
## male 141 111 94
Other output formats are available
tally( sex ~ substance, data=HELPrct, format="proportion" )
## substance
## sex alcohol cocaine heroin
## female 0.2033898 0.2697368 0.2419355
## male 0.7966102 0.7302632 0.7580645
tally( substance ~ sex, data=HELPrct, format="proportion", margins=TRUE )
## sex
## substance female male
## alcohol 0.3364486 0.4075145
## cocaine 0.3831776 0.3208092
## heroin 0.2803738 0.2716763
## Total 1.0000000 1.0000000
tally( ~ sex + substance, data=HELPrct, format="proportion", margins=TRUE )
## substance
## sex alcohol cocaine heroin Total
## female 0.07947020 0.09050773 0.06622517 0.23620309
## male 0.31125828 0.24503311 0.20750552 0.76379691
## Total 0.39072848 0.33554084 0.27373068 1.00000000
tally( sex ~ substance, data=HELPrct, format="percent" )
## substance
## sex alcohol cocaine heroin
## female 20.33898 26.97368 24.19355
## male 79.66102 73.02632 75.80645
mean( age ~ substance | sex, data=HELPrct )
## A.F C.F H.F A.M C.M H.M F M
## 39.16667 34.85366 34.66667 37.95035 34.36036 33.05319 36.25234 35.46821
mean( age ~ substance | sex, data=HELPrct, .format="table" )
## substance sex mean
## 1 A F 39.16667
## 2 A M 37.95035
## 3 C F 34.85366
## 4 C M 34.36036
## 5 H F 34.66667
## 6 H M 33.05319
mutate()
(in the dplyr
package) or transform()
.median()
, min()
, max()
, sd()
, var()
, favstats()
, etc.This master template can be used to do a large portion of what needs doing in an Intro Stats course.
mean( age ~ sex, data=HELPrct )
bwplot( age ~ sex, data=HELPrct )
lm( age ~ sex, data=HELPrct )
## female male
## 36.25234 35.46821
## (Intercept) sexmale
## 36.2523364 -0.7841284
It can be learned early and practiced often so that students become secure in their ability to use these functions.
The mosaic
package includes some other things, too
mosaicData
and NHANES
packages)xchisq.test()
, xpnorm()
, xqqmath()
x
mplot()
mplot(HELPrct)
interactive plot creationplot()
in some situationshistogram()
controls (e.g., width
)add = TRUE
works in many situations)xpnorm( 700, mean=500, sd=100)
##
## If X ~ N(500, 100), then
## P(X <= 700) = P(Z <= 2) = 0.9772
## P(X > 700) = P(Z > 2) = 0.02275
##
## [1] 0.9772499
xpnorm( c(300, 700), mean=500, sd=100)
##
## If X ~ N(500, 100), then
## P(X <= 300) = P(Z <= -2) = 0.02275 P(X <= 700) = P(Z <= 2) = 0.97725
## P(X > 300) = P(Z > -2) = 0.97725 P(X > 700) = P(Z > 2) = 0.02275
##
## [1] 0.02275013 0.97724987
xchisq.test(phs)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: x
## X-squared = 24.429, df = 1, p-value = 7.71e-07
##
## 104.00 10933.00
## ( 146.52) (10890.48)
## [12.05] [ 0.16]
## <-3.51> < 0.41>
##
## 189.00 10845.00
## ( 146.48) (10887.52)
## [12.05] [ 0.16]
## < 3.51> <-0.41>
##
## key:
## observed
## (expected)
## [contribution to X-squared]
## <Pearson residual>
Modeling is really the starting point for the mosaic
design.
lm()
and glm()
) defined the templatelattice
graphics use the template (so we chose lattice
)model <- lm(width ~ length * sex,
data=KidsFeet)
Width <- makeFun(model)
Width( length=25, sex="B")
## 1
## 9.167675
Width( length=25, sex="G")
## 1
## 8.939312
Once models have been converted into functions, we can easily add them to out plots using plotFun()
.
xyplot( width ~ length, data=KidsFeet,
groups=sex, auto.key=TRUE )
plotFun( Width(length, sex="B") ~ length,
col=1, add=TRUE)
## converting numerical color value into a color using lattice settings
plotFun( Width(length, sex="G") ~ length,
col=2, add=TRUE)
## converting numerical color value into a color using lattice settings
## converting numerical color value into a color using lattice settings
If you want to teach using randomization tests and bootstrap intervals, the mosaic
package provides some functions to simplify creating the random distirubtions involved.
Often used on first day of class
Story
woman claims she can tell whether milk has been poured into tea or vice versa.
Question: How do we test this claim?
We use rflip()
to simulate flipping coins
rflip()
##
## Flipping 1 coin [ Prob(Heads) = 0.5 ] ...
##
## H
##
## Number of Heads: 1 [Proportion Heads: 1]
Note: We do this with students who do not (yet) know what a binomial distribution is, so we want to avoid using rbinom()
at this point.
Rather than flip each coin separately, we can flip multiple coins at once.
rflip(10)
##
## Flipping 10 coins [ Prob(Heads) = 0.5 ] ...
##
## H H H T T T H H H T
##
## Number of Heads: 6 [Proportion Heads: 0.6]
heads
= correct; tails
= incorrect than to compare with a given patternrflip(10)
simulates 1 lady tasting 10 cups 1 time.
We can do that many times to see how guessing ladies do:
do(2) * rflip(10)
## n heads tails prop
## 1 10 6 4 0.6
## 2 10 5 5 0.5
do()
is clever about what it remembers (in many common situations)Now let’s simulate 5000 guessing ladies
Ladies <- do(5000) * rflip(10)
head(Ladies, 1)
## n heads tails prop
## 1 10 4 6 0.4
histogram( ~ heads, data=Ladies, width=1 )
Q. How often does guessing score 9 or 10?
Here are 3 ways to find out
tally( ~(heads >= 9), data=Ladies)
##
## TRUE FALSE
## 52 4948
tally( ~(heads >= 9), data=Ladies, format="prop")
##
## TRUE FALSE
## 0.0104 0.9896
prop( ~(heads >= 9), data=Ladies)
## TRUE
## 0.0104
The Lady Tasting Tea illustrates a 3-step process that can be reused in many situations:
mosaic
functions shuffle()
or resample()
diffmean(age ~ sex, data=HELPrct)
## diffmean
## -0.7841284
do(1) *
diffmean(age ~ shuffle(sex), data=HELPrct)
## diffmean
## 1 0.1090973
Null <- do(5000) *
diffmean(age ~ shuffle(sex), data=HELPrct)
prop( ~(abs(diffmean) > 0.7841), data=Null )
## TRUE
## 0.3616
histogram(~ diffmean, data=Null, v=-.7841)
Bootstrap <- do(5000) *
diffmean(age~sex, data= resample(HELPrct))
histogram( ~diffmean, data=Bootstrap,
v=-.7841, glwd=4 )
cdata(~diffmean, data=Bootstrap, p=.95)
## low hi central.p
## -2.4787104 0.8894968 0.9500000
confint(Bootstrap, method="quantile")
## name lower upper level method estimate
## 1 diffmean -2.47871 0.8894968 0.95 percentile -0.7841284
confint(Bootstrap) # default uses bootstrap st. err.
## name lower upper level method estimate
## 1 diffmean -2.47871 0.8894968 0.95 percentile -0.7841284
do(1) * lm(width ~ length, data=KidsFeet)
## Source: local data frame [1 x 9]
##
## Intercept length sigma r.squared F numdf dendf .row .index
## (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (int) (dbl)
## 1 2.862276 0.2479478 0.3963356 0.4110041 25.81878 1 37 1 1
do(3) * lm( width ~ shuffle(length), data=KidsFeet)
## Source: local data frame [3 x 9]
##
## Intercept length sigma r.squared F numdf dendf .row .index
## (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (int) (dbl)
## 1 11.639314 -0.10706623 0.4962420 0.076635651 3.0708562 1 37 1 1
## 2 11.541875 -0.10312500 0.4977279 0.071097404 2.8319481 1 37 1 2
## 3 9.754237 -0.03081856 0.5147824 0.006349662 0.2364388 1 37 1 3
do(1) *
lm(width ~ length + sex, data=KidsFeet)
## Source: local data frame [1 x 10]
##
## Intercept length sexG sigma r.squared F numdf dendf .row .index
## (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (int) (dbl)
## 1 3.641168 0.221025 -0.2325175 0.3848905 0.4595428 15.30513 2 36 1 1
do(3) *
lm( width ~ length + shuffle(sex), data=KidsFeet)
## Source: local data frame [3 x 10]
##
## Intercept length sexG sigma r.squared F numdf dendf .row .index
## (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (int) (dbl)
## 1 2.942851 0.2431545 0.07785347 0.3998086 0.4168355 12.86608 2 36 1 1
## 2 3.450856 0.2282462 -0.20833605 0.3878261 0.4512672 14.80285 2 36 1 2
## 3 2.842685 0.2484316 0.01565872 0.4017205 0.4112447 12.57297 2 36 1 3
Null <- do(5000) *
lm( width ~ length + shuffle(sex),
data=KidsFeet)
histogram( ~ sexG, data=Null,
v=-0.2325, glwd=4)
histogram(~sexG, data=Null,
v=-0.2325, glwd=4)
prop(~ (sexG <= -0.2325), data=Null)
## TRUE
## 0.037