Chapter 9 Assignments
9.1 Point Chart
9.2 Stratified Sampling
author: Dolores Ojeda, Gener Avilés R
date: 2017-03-05
9.2.1 What is Stratified Sampling?
The population is partitioned into non-overlapping groups, called strata, and a sample is collected from each stratum following a predetermined design.
9.2.2 Why use Stratified Sampling?
- It may produce a smaller estimation error than a simple random sample, especially when measurements within each stratum vary relatively little.
- The cost per observation is reduced.
- Estimates may be needed for a subgroup (stratum), comparable to those obtained for the population as a whole.
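The first reason can be illustrated with a small simulation. This is a minimal sketch on a synthetic population, not on the Titanic data: the stratum labels, sizes, means, the sample size `n`, and the proportional allocation are all assumptions made for the illustration.

```r
# Synthetic population (hypothetical values, not the Titanic data):
# three strata with very different means and little within-stratum spread.
set.seed(1)
population <- data.frame(
  stratum = rep(c("a", "b", "c"), times = c(500, 300, 200)),
  y = c(rnorm(500, mean = 10, sd = 1),
        rnorm(300, mean = 50, sd = 1),
        rnorm(200, mean = 90, sd = 1))
)

n <- 50  # total sample size used by both designs

# Simple random sample: draw n values from the whole population.
srs_mean <- function() mean(sample(population$y, n))

# Stratified sample with proportional allocation: sample within each
# stratum, then weight each stratum mean by the stratum's population share.
strata  <- split(population$y, population$stratum)
weights <- sapply(strata, length) / nrow(population)
alloc   <- round(n * weights)

strat_mean <- function() {
  stratum_means <- mapply(function(values, k) mean(sample(values, k)),
                          strata, alloc)
  sum(weights * stratum_means)
}

# Repeat both designs many times and compare the spread of the estimates.
srs_estimates   <- replicate(2000, srs_mean())
strat_estimates <- replicate(2000, strat_mean())

var(srs_estimates)
var(strat_estimates)
```

Under these assumptions the stratified estimator should vary far less than the simple random sample estimator, because almost all of the variability in `y` lies between strata rather than within them.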
9.2.3 Example
The Titanic Database:
## pclass survived name sex
## Min. :1.000 Min. :0.000 Length:1310 Length:1310
## 1st Qu.:2.000 1st Qu.:0.000 Class :character Class :character
## Median :3.000 Median :0.000 Mode :character Mode :character
## Mean :2.295 Mean :0.382
## 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :3.000 Max. :1.000
## NA's :1 NA's :1
## age sibsp parch ticket
## Min. : 0.1667 Min. :0.0000 Min. :0.000 Length:1310
## 1st Qu.:21.0000 1st Qu.:0.0000 1st Qu.:0.000 Class :character
## Median :28.0000 Median :0.0000 Median :0.000 Mode :character
## Mean :29.8811 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.0000 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.0000 Max. :8.0000 Max. :9.000
## NA's :264 NA's :1 NA's :1
## fare cabin embarked
## Min. : 0.000 Length:1310 Length:1310
## 1st Qu.: 7.896 Class :character Class :character
## Median : 14.454 Mode :character Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :2
## boat body home.dest
## Length:1310 Min. : 1.0 Length:1310
## Class :character 1st Qu.: 72.0 Class :character
## Mode :character Median :155.0 Mode :character
## Mean :160.8
## 3rd Qu.:256.0
## Max. :328.0
## NA's :1189
9.2.3.1 Variable Codes
- Pclass: 1 = Upper, 2 = Middle, 3 = Lower.
- SibSp: Number of siblings/spouses aboard.
- Parch: Number of parents/children aboard.
- Embarked: C = Cherbourg, Q = Queenstown, S = Southampton.
9.2.3.2 Calculating Probabilities to select people who embarked in Queenstown
\(P(A) = \frac{\text{Number of elements in } A}{n}\)
There are 1310 entries, and 123 of them embarked in Queenstown. Nevertheless, the risk of dying was as present for them as for the passengers from Southampton or Cherbourg.
If a uniform probability is calculated, the numbers are:
- \(P(Q) = \frac{123}{1310} =\) 0.0938931
- \(P(C) = \frac{270}{1310} =\) 0.2061069
- \(P(S) = \frac{914}{1310} =\) 0.6977099
Sampling with these uniform probabilities leaves the Queenstown passengers underrepresented, which will hinder the data mining process and, eventually, the generation of a machine learning model for survival prediction.
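The uniform probabilities above can be reproduced directly from the reported counts. This is a minimal sketch that uses only the numbers quoted in the text; the names `n_total`, `embarked_counts`, and `p_uniform` are chosen for the illustration.

```r
# Counts as reported in the text above.
n_total <- 1310
embarked_counts <- c(Q = 123, C = 270, S = 914)

# Uniform selection probability for each port of embarkation.
p_uniform <- embarked_counts / n_total
round(p_uniform, 7)

# The three counts add up to 1307 rather than 1310, so these
# probabilities do not sum to exactly 1.
sum(embarked_counts)
sum(p_uniform)
```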
9.2.3.3 Fixing the Problem
By using stratified sampling we can raise the probability that the group that boarded in Queenstown and survived is selected and is therefore taken into consideration when generating a survival prediction model. For this we will use conditional probability, where 44 is the number of Queenstown passengers who survived:
\(P(\text{Survived} \mid \text{EmbarkedQ}) = \frac{P(\text{Survived} \cap \text{EmbarkedQ})}{P(\text{EmbarkedQ})} = \frac{44/1310}{123/1310} = \frac{44}{123} =\) 0.3577236
library(dplyr)
# Passengers who embarked in Queenstown and survived
Q <- filter(titanic, embarked == "Q", survived == 1)
count(Q)
## # A tibble: 1 x 1
## n
## <int>
## 1 44
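One way to draw a stratified sample by port of embarkation is sketched below in base R. It is only an illustration under stated assumptions: `passengers` is a stand-in data frame built from the embarkation counts quoted earlier (the real `titanic` data frame would take its place), the equal allocation of 50 rows per stratum is hypothetical, and the strata are defined by port only, not by survival.

```r
# Stand-in data built only from the embarkation counts quoted in the text;
# replace `passengers` with the real `titanic` data frame when available.
passengers <- data.frame(
  id = seq_len(1307),
  embarked = rep(c("Q", "C", "S"), times = c(123, 270, 914))
)

set.seed(42)
n_per_stratum <- 50  # hypothetical allocation, not taken from the source

# Split row indices by port, sample within each stratum, and recombine.
rows_by_port <- split(seq_len(nrow(passengers)), passengers$embarked)
sampled_rows <- unlist(lapply(rows_by_port,
                              function(rows) sample(rows, n_per_stratum)))
stratified_sample <- passengers[sampled_rows, ]

# Every port now contributes the same number of rows.
table(stratified_sample$embarked)

# Selection probability of a Queenstown passenger under each design,
# for the same total sample size of 150:
n_per_stratum / 123   # stratified, equal allocation
150 / 1307            # simple random sample
```

With equal allocation, each Queenstown passenger is considerably more likely to be selected than under a simple random sample of the same total size, which is the effect described above.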