This document describes the exploration of a wine dataset and looks for relationships between its features. More information about the dataset can be found in wineQualityInfo. The analysis consists of a univariate section, a bivariate section, a multivariate section, and a final plots and summary section. First, a summary of the dataset is given.

`## [1] 6497 15`

```
##        X        fixed.acidity    volatile.acidity  citric.acid
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000
##  1st Qu.: 813   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500
##  Median :1650   Median : 7.000   Median :0.2900   Median :0.3100
##  Mean   :2044   Mean   : 7.215   Mean   :0.3397   Mean   :0.3186
##  3rd Qu.:3274   3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900
##  Max.   :4898   Max.   :15.900   Max.   :1.5800   Max.   :1.6600
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00
##  Median : 3.000   Median :0.04700   Median : 29.00
##  Mean   : 5.443   Mean   :0.05603   Mean   : 30.53
##  3rd Qu.: 8.100   3rd Qu.:0.06500   3rd Qu.: 41.00
##  Max.   :65.800   Max.   :0.61100   Max.   :289.00
##  total.sulfur.dioxide    density             pH          sulphates
##  Min.   :  6.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200
##  1st Qu.: 77.0        1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300
##  Median :118.0        Median :0.9949   Median :3.210   Median :0.5100
##  Mean   :115.7        Mean   :0.9947   Mean   :3.219   Mean   :0.5313
##  3rd Qu.:156.0        3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000
##  Max.   :440.0        Max.   :1.0390   Max.   :4.010   Max.   :2.0000
##     alcohol         quality        color      quality_level
##  Min.   : 8.00   Min.   :3.000   red  :1599   Low   :2384
##  1st Qu.: 9.50   1st Qu.:5.000   white:4898   Medium:2836
##  Median :10.30   Median :6.000                High  :1277
##  Mean   :10.49   Mean   :5.818
##  3rd Qu.:11.30   3rd Qu.:6.000
##  Max.   :14.90   Max.   :9.000
```

```
## 'data.frame': 6497 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ color : Factor w/ 2 levels "red","white": 1 1 1 1 1 1 1 1 1 1 ...
## $ quality_level : Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 1 1 1 3 3 1 ...
```
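The three outputs above have the shape of what `dim()`, `summary()` and `str()` print for a data frame. The report does not show the code or the file name, so the following is only a minimal sketch: `wines` is a hypothetical name, and the toy data frame reuses the first four rows of a few columns from the `str()` output instead of reading the real file.

```r
# Toy stand-in for the wine data (first four rows of a few columns).
# The name `wines` is an assumption; the real data would be read from a file.
wines <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2),
  alcohol       = c(9.4, 9.8, 9.8, 9.8),
  quality       = c(5L, 5L, 5L, 6L),
  color         = factor(c("red", "red", "red", "red"),
                         levels = c("red", "white"))
)

dim(wines)      # number of rows and columns
summary(wines)  # quartiles and mean per numeric column, counts per factor level
str(wines)      # type and first values of every column
```

On the full dataset the same three calls would report 6497 rows and 15 columns, as shown above.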

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.400 7.000 7.215 7.700 15.900
```

`## [1] "Acids are an important component of wine and consist of a fixed and a volatile part. This feature represents the fixed part, for example tartaric acid. The mean and median are both about 7 g/dm^3. There are some outliers on the right side, with a maximum of almost 16 g/dm^3."`
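The report identifies outliers visually and does not define a threshold. One common convention, used here purely as an assumption, is Tukey's rule: a value is flagged when it lies more than 1.5 times the interquartile range above the third quartile. A small sketch using the fixed.acidity quartiles from the summary above:

```r
# Quartiles and maximum of fixed.acidity, copied from the summary above.
q1        <- 6.4
q3        <- 7.7
max_value <- 15.9

# Upper fence under the 1.5 * IQR rule (an assumed convention,
# not necessarily the criterion used in this report).
upper_fence <- q3 + 1.5 * (q3 - q1)

upper_fence               # threshold above which values count as outliers
max_value > upper_fence   # is the maximum beyond the fence?
```

The same check can be repeated for any other feature by substituting its quartiles and maximum.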

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2300 0.2900 0.3397 0.4000 1.5800
```

`## [1] "Volatile acidity is the amount of acetic acid in wine, which at too high a level can lead to an unpleasant vinegar taste. The median is about 0.29 g/dm^3 and the mean about 0.34 g/dm^3. There are some outliers on the right side, with a maximum of 1.58 g/dm^3."`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2500 0.3100 0.3186 0.3900 1.6600
```

`## [1] "Citric acid can add 'freshness' and flavor to wines. It has a 'normalish' distribution with a small peak at the left side and again some outliers at the right side, with a maximum value of 1.66 g/dm^3."`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 3.000 5.443 8.100 65.800
```

`## [1] "Residual sugar is the amount of sugar remaining after fermentation stops. The distribution looks like the right half of a normal distribution, with a peak close to 0 g/dm^3. The mean (5.44 g/dm^3) is considerably larger than the median (3.00 g/dm^3), which is caused by one or more large outliers (max = 65.8 g/dm^3)."`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100
```

`## [1] "Chlorides is the amount of salt in the wine. It has a 'normalish' distribution centered at approximately 0.05 g/dm^3, with some outliers on the right side."`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 17.00 29.00 30.53 41.00 289.00
```

`## [1] "Free sulfur dioxide is the free form of SO2. It has a 'normalish' distribution centered at approximately 30 mg/dm^3, with some very high outliers (max = 289 mg/dm^3)."`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.0 77.0 118.0 115.7 156.0 440.0
```

`## [1] "The total sulfur dioxide is the amount of free and bound forms of SO2. It has a 'normalish' distribution with a mean of approximately 116 mg/dm^3. There are some outliers with high values."`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9923 0.9949 0.9947 0.9970 1.0390
```

`## [1] "The density of the wine depends on its alcohol and sugar content. It has a 'normalish' distribution with a mean of approximately 0.995 and a maximum of 1.0390, which is clearly an outlier."`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.110 3.210 3.219 3.320 4.010
```

`## [1] "The pH feature describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic). It has a 'normalish' distribution with a mean of approximately 3.2 and a maximum of approximately 4.0, which is an outlier."`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4300 0.5100 0.5313 0.6000 2.0000
```

`## [1] "Sulphates are a wine additive which acts as an antimicrobial and antioxidant. The feature has a 'normalish' distribution with a mean of approximately 0.53. It is clear from the plot that there are some outliers on the right side (max = 2.0)."`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.30 10.49 11.30 14.90
```

`## [1] "Alcohol is the percent alcohol content of the wine. The mean and median are a little more than 10%, with some outliers at the high percentages and a maximum of 14.9%."`

```
## red white
## 1599 4898
```

`## [1] "The wine color, which can be either red or white. There are about 3 times as many white wines as red wines in this dataset."`
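The "about 3 times" figure follows directly from the counts in the table above. A short sketch of the arithmetic, where the vector name `counts` is only illustrative:

```r
# Counts per color, copied from the table above.
counts <- c(red = 1599, white = 4898)

sum(counts)                         # total number of wines
counts["white"] / counts["red"]     # white wines per red wine
round(counts / sum(counts), 3)      # share of each color in the dataset
```

The two counts add up to the 6497 observations reported at the start of this section.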

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.818 6.000 9.000
```

`## [1] "Quality is a score between 0 and 10. It has a 'normalish' distribution around 6 with a min and max of 3 and 9 respectively."`

There are 6497 wines in the dataset (1599 red and 4898 white) with 13 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, color and quality). The remaining two of the 15 columns are the row index X and the derived quality_level. The (output) variable quality is an integer and color is a factor variable, while the other features are numeric.

Some other observations: The density of wine is close to 1, so it is close to the density of water. The median quality for a red wine is 6 and the max is 8. There are far more observations for white wine than for red wine.

The main feature in the data set is quality. I’d like to determine which features are best for predicting the quality of a red wine. I suspect a combination of the other variables can be used to build a predictive model to determine the quality.

I think all of the other features, except density, can have an impact on the quality of the wine. Acidity and alcohol could be major factors, because too much or too little of either can make the wine unbalanced.

I have created a quality_level factor variable so that quality can be used as a factor in the plots. It has the levels “Low”, “Medium” and “High”.
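The code that builds quality_level is not shown in this report. In the `str()` output, quality scores of 5, 6 and 7 map to the first, second and third level respectively, so the cut points below (up to 5, exactly 6, 7 and above) are inferred from that output and remain an assumption. A minimal sketch using `cut()` on a few example scores:

```r
# A few example quality scores (illustrative values, not the full dataset).
quality <- c(3L, 5L, 5L, 6L, 7L, 9L)

# Bin the scores into three ordered groups.
# With the default right = TRUE the intervals are (0,5], (5,6] and (6,10].
# The break points are inferred from the str() output, not taken from the
# original code.
quality_level <- cut(
  quality,
  breaks = c(0, 5, 6, 10),
  labels = c("Low", "Medium", "High")
)

table(quality_level)  # number of wines per level
```

Storing the result as a factor keeps the levels in the order Low, Medium, High, which is convenient for coloring or faceting plots.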

Most features that I plotted have a ‘normalish’ distribution, i.e. they have a pattern that looks like a normal distribution. This means that most of the values of that feature are close to the mean and median. The plot of residual.sugar, however, is different: the highest count is at the lowest values, close to zero, and the count only decreases after that. It looks like the right half of a normal distribution.
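One quick numeric check of this observation is to compare the mean with the median: for a roughly symmetric feature they nearly coincide, while a long right tail pulls the mean upward. The sketch below uses the values from the summaries above; the relative gap is just an illustrative measure chosen here, not a statistic used elsewhere in this report.

```r
# Mean and median copied from the summaries above.
stats <- data.frame(
  feature = c("residual.sugar", "pH", "alcohol"),
  mean    = c(5.443, 3.219, 10.49),
  median  = c(3.000, 3.210, 10.30),
  stringsAsFactors = FALSE
)

# Relative gap between mean and median; a large positive value
# hints at a right-skewed distribution.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Sort so the most skewed feature comes first.
stats[order(-stats$rel_gap), ]
```

Of the three features, residual.sugar shows by far the largest gap, which matches the shape seen in its plot.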