Wine quality exploration by Ger Inberg

Introduction

This document explores a wine dataset and looks for relations between its features. More information about the dataset can be found in wineQualityInfo. The analysis consists of a univariate section, a bivariate section, a multivariate section and a final plots and summary section. First, a summary of the dataset is given.

## [1] 6497   15
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.: 813   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500  
##  Median :1650   Median : 7.000   Median :0.2900   Median :0.3100  
##  Mean   :2044   Mean   : 7.215   Mean   :0.3397   Mean   :0.3186  
##  3rd Qu.:3274   3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :15.900   Max.   :1.5800   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00     
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00     
##  Median : 3.000   Median :0.04700   Median : 29.00     
##  Mean   : 5.443   Mean   :0.05603   Mean   : 30.53     
##  3rd Qu.: 8.100   3rd Qu.:0.06500   3rd Qu.: 41.00     
##  Max.   :65.800   Max.   :0.61100   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.: 77.0        1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300  
##  Median :118.0        Median :0.9949   Median :3.210   Median :0.5100  
##  Mean   :115.7        Mean   :0.9947   Mean   :3.219   Mean   :0.5313  
##  3rd Qu.:156.0        3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000  
##  Max.   :440.0        Max.   :1.0390   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality        color      quality_level
##  Min.   : 8.00   Min.   :3.000   red  :1599   Low   :2384  
##  1st Qu.: 9.50   1st Qu.:5.000   white:4898   Medium:2836  
##  Median :10.30   Median :6.000                High  :1277  
##  Mean   :10.49   Mean   :5.818                             
##  3rd Qu.:11.30   3rd Qu.:6.000                             
##  Max.   :14.90   Max.   :9.000
## 'data.frame':    6497 obs. of  15 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ color               : Factor w/ 2 levels "red","white": 1 1 1 1 1 1 1 1 1 1 ...
##  $ quality_level       : Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 1 1 1 3 3 1 ...
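
The code that produced the output above is not shown in this document. The following is a minimal sketch of how such a combined data frame could be built and summarised, assuming the red and white wines come from two separate tables that are stacked (the X index restarting and topping out at 4898, the number of white wines, hints at this, but it is an inference). The toy data frames `red` and `white` and their values are hypothetical stand-ins for the real files.

```r
# Hypothetical stand-ins for the two source tables (the real data is not loaded here).
red   <- data.frame(X = 1:2, alcohol = c(9.4, 9.8),       quality = c(5L, 5L))
white <- data.frame(X = 1:3, alcohol = c(8.8, 9.5, 10.1), quality = c(6L, 6L, 6L))

# Tag each table with its color, then stack them row-wise.
red$color   <- "red"
white$color <- "white"
wines <- rbind(red, white)
wines$color <- factor(wines$color, levels = c("red", "white"))

# The three calls that yield output of the kind shown above.
print(dim(wines))
print(summary(wines))
str(wines)
```

With the real data, `dim()` would report the 6497 rows and 15 columns listed above.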

Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.400   7.000   7.215   7.700  15.900
## [1] "Acids are an important component of wine and consist of a fixed and a volatile part. This feature represents the fixed part, for example tartaric acid. The mean and median are both about 7 g/dm^3. There are some outliers on the right side, with a maximum of almost 16 g/dm^3."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2300  0.2900  0.3397  0.4000  1.5800
## [1] "Volatile acidity is the amount of acetic acid in wine, which at too high a level can lead to an unpleasant vinegar taste. The median is about 0.29 g/dm^3 and the mean about 0.34 g/dm^3. There are some outliers on the right side, with a maximum of 1.58 g/dm^3."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2500  0.3100  0.3186  0.3900  1.6600
## [1] "Citric acid can add 'freshness' and flavor to wines. It has a 'normalish' distribution with a small peak at the left side and again some outliers at the right side, with a maximum value of 1.66 g/dm^3."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   3.000   5.443   8.100  65.800
## [1] "Residual sugar is the amount of sugar remaining after fermentation stops. The distribution looks like the right side of a normal distribution, with a peak at the low end of the range (the minimum is 0.6 g/dm^3). The mean is considerably larger than the median, which is caused by the long right tail with one or more large outliers."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100
## [1] "Chlorides is the amount of salt in the wine. It has a 'normalish' distribution around 0.05 g/dm^3, with some outliers on the right side."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   17.00   29.00   30.53   41.00  289.00
## [1] "Free sulfur dioxide is the free form of SO2. It has a 'normalish' distribution around 30 mg/dm^3, with some very high outliers (max = 289 mg/dm^3)."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    77.0   118.0   115.7   156.0   440.0
## [1] "The total sulfur dioxide is the amount of free and bound forms of SO2. It has a 'normalish' distribution with a mean of about 116 mg/dm^3. There are some outliers with a high value."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9923  0.9949  0.9947  0.9970  1.0390
## [1] "The density represents the density of the wine, which depends on the percent alcohol and sugar content. It has a 'normalish' distribution with a mean of about 0.995 and a max value of 1.0390, which is clearly an outlier."
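
Several of the descriptions call the maximum an outlier based on the plot. One way to make that judgment explicit is Tukey's 1.5 × IQR rule. This is a sketch of that rule, not necessarily the criterion used in this analysis. The quartiles and maxima are copied from the summary output in this document, and the data frame name `stats` is only illustrative.

```r
# Quartiles and maxima copied from the summary output in this document.
stats <- data.frame(
  feature = c("fixed.acidity", "density", "pH"),
  q1      = c(6.400, 0.9923, 3.110),
  q3      = c(7.700, 0.9970, 3.320),
  max     = c(15.900, 1.0390, 4.010)
)

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as high outliers.
stats$upper_fence    <- stats$q3 + 1.5 * (stats$q3 - stats$q1)
stats$max_is_outlier <- stats$max > stats$upper_fence

print(stats)
```

Under this rule the maximum of each of the three features lies above its upper fence, which agrees with the visual impression described in the text.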

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.110   3.210   3.219   3.320   4.010
## [1] "The pH feature describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic). It has a 'normalish' distribution with a mean of about 3.2 and a max value of about 4.0, which is an outlier."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4300  0.5100  0.5313  0.6000  2.0000
## [1] "Sulphates are a wine additive which acts as an antimicrobial and antioxidant. They have a 'normalish' distribution with a mean of about 0.53. It is clear from the plot that there are some outliers on the right side (max = 2.0)."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.30   10.49   11.30   14.90
## [1] "The percent alcohol content of the wine. The mean and median are a little more than 10%, with some outliers at the high percentages and a maximum of 14.9%."

##   red white 
##  1599  4898
## [1] "The wine color, which can be either red or white. There are about 3 times as many white wines as red wines in this dataset."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.818   6.000   9.000
## [1] "Quality is a score between 0 and 10. It has a 'normalish' distribution around 6 with a min and max of 3 and 9 respectively."
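
Each block in this section pairs a five-number summary with a description of a histogram that is not reproduced here. As a rough sketch of how one such block could be generated, the helper below prints the summary and computes histogram bins with base R. The function name `describe_feature`, the bin count and the toy data are assumptions for illustration, and the original plots may well have been drawn with a different plotting package.

```r
# Hypothetical helper: print the summary and return the histogram bins.
describe_feature <- function(x, bins = 30) {
  print(summary(x))
  # plot = FALSE computes the bins without opening a graphics device;
  # drop it to draw the histogram interactively.
  h <- hist(x, breaks = bins, plot = FALSE)
  invisible(h)
}

# Toy data loosely shaped like an alcohol column (not the real values).
set.seed(1)
toy_alcohol <- rnorm(200, mean = 10.5, sd = 1.2)

h <- describe_feature(toy_alcohol)
print(sum(h$counts))  # every observation falls into exactly one bin
```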

Univariate Analysis

What is the structure of your dataset?

There are 6497 wines in the dataset (1599 red and 4898 white) with 13 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, color and quality). In addition, the data frame contains an index column X and the derived factor quality_level, which gives the 15 variables shown above. The output variable quality is an integer, color is a factor variable and the other features are numeric.

Some other observations: The density of wine is close to 1, so it is approximately equal to the density of water. The median quality for a red wine is 6 and the max is 8. There are a lot more observations for white wine than for red wine.
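
The counts reported in the summary output can be cross-checked with a few lines of arithmetic. The numbers below are copied from that output, and the variable names are only illustrative.

```r
# Counts copied from the summary output at the top of this document.
color_counts <- c(red = 1599, white = 4898)
level_counts <- c(Low = 2384, Medium = 2836, High = 1277)

# Both groupings should add up to the total number of rows (6497).
print(sum(color_counts))
print(sum(level_counts))

# Ratio of white to red wines, rounded to two decimals.
white_to_red <- color_counts[["white"]] / color_counts[["red"]]
print(round(white_to_red, 2))
```

The ratio comes out slightly above 3, in line with the statement that there are about 3 times as many white wines as red wines.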

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is quality. I’d like to determine which features are best for predicting the quality of a red wine. I suspect a combination of the other variables can be used to build a predictive model to determine the quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think all of the other features, except density, can have an impact on the quality of the wine. Acidity and alcohol could be major factors, because too much or too little of these can make the wine unbalanced.

Did you create any new variables from existing variables in the dataset?

I have created a quality_level factor variable so that quality can be used as a factor in the plots. It has the levels “Low”, “Medium” and “High”.
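
The code that creates quality_level is not shown. In the str output above, a quality of 5 maps to the first level, 6 to the second and 7 to the third, so one way to reproduce that mapping is `cut()` with breaks after 5 and 6. The exact break points are an assumption consistent with those ten rows, not a confirmed part of the original analysis.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed bins: (0, 5] -> Low, (5, 6] -> Medium, (6, 10] -> High.
quality_level <- cut(
  quality,
  breaks = c(0, 5, 6, 10),
  labels = c("Low", "Medium", "High")
)

print(table(quality_level))
print(as.integer(quality_level))  # level codes, comparable to the str() output
```

The level codes match the `1 1 1 2 1 1 1 3 3 1` sequence shown for quality_level in the str output.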

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most features that I plotted have a ‘normalish’ distribution, i.e. they have a pattern that looks like a normal distribution. This means that most of the values of that feature are close to the mean and median. The plot of residual.sugar, however, is different: the highest count is at the lowest values, and after that the count only decreases. It looks like the right side of a normal distribution.
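
The gap between mean and median for residual.sugar (5.443 versus 3.000 in the summary) is typical of a right-skewed distribution. The sketch below illustrates this on simulated log-normal data, which is only a hypothetical stand-in for the real column, and shows how a log10 transform brings mean and median closer together. The transform is one common option for such data, not a step that this analysis applied.

```r
# Simulated right-skewed values (hypothetical, not the real residual.sugar data).
set.seed(42)
toy_sugar <- rlnorm(1000, meanlog = log(3), sdlog = 1)

# On the raw scale the long right tail pulls the mean above the median.
raw_gap <- mean(toy_sugar) - median(toy_sugar)
print(raw_gap)

# On the log10 scale the distribution is roughly symmetric,
# so the mean and median nearly coincide.
log_sugar <- log10(toy_sugar)
log_gap <- abs(mean(log_sugar) - median(log_sugar))
print(log_gap)
```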

Bivariate Plots Section