White Wine – The Good Parts
By Joel Dazo

I have a dataset which consists of observations on the physicochemical properties of Portuguese vinho verde white wines. Each wine is judged by three wine experts and assigned a numeric score based on the assessment of its quality. Through exploratory data analysis, I intend to find and identify those features that may be common among wines that received higher quality scores.


Univariate Plots


The dataset has been loaded using read_csv(). I am curious about the shape of our dataset. How many rows and columns are in the white wine dataset?

dim(wines)
## [1] 4898   12

The dataset has 4,898 observations of 12 variables. Each observation represents one white wine along with the measurements of some of its physicochemical attributes (e.g. acidity, sugar and alcohol level) as well as a numerical assessment of its overall desirability as determined by a group of oenologists.

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

We can see from the results of the str() function above that the dataset is made up of 12 attributes: 11 continuous (num) attributes and one discrete (int) attribute, quality. Next, I’ll use the summary() function to get an idea of the range of values we can expect from the fields in the wines dataset.
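The type tally can also be done programmatically. The following is a minimal sketch that rebuilds the first three rows of four columns from the str() output; the name `wines_head` is only an illustrative stand-in for the full `wines` data frame.

```r
# First three rows of four columns copied from the str() output above;
# `wines_head` is an illustrative stand-in for the full `wines` data frame.
wines_head <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9),
  alcohol        = c(8.8, 9.5, 10.1),
  quality        = c(6L, 6L, 6L)
)

# Tally the storage class of every column.
type_counts <- table(sapply(wines_head, class))
print(type_counts)
```

Run against the full data frame, the same tally should report 11 numeric columns and 1 integer column.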

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

A brief look at the results of the summary() function tells us that we can expect most of the variables, such as residual.sugar, to be positively skewed because their mean is greater than their median, while some attributes, such as pH, have nearly equal mean and median and so look roughly symmetric.
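The mean-versus-median comparison can be made a little more systematic. The sketch below copies four means and medians from the summary() output and scales their gap by the IQR (computed here as Q3 minus Q1). The `skew_hint` column is a crude indicator I made up for illustration, not a formal skewness statistic.

```r
# Means, medians and IQRs (Q3 - Q1) taken from the summary() output above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "pH", "alcohol"),
  mean     = c(6.391, 0.04577, 3.188, 10.51),
  median   = c(5.2, 0.043, 3.18, 10.4),
  iqr      = c(8.2, 0.014, 0.19, 1.9)
)

# Gap between mean and median, scaled by the IQR so variables are comparable.
# Positive values hint at a right (positive) skew.
stats$skew_hint <- (stats$mean - stats$median) / stats$iqr
print(stats[order(-stats$skew_hint), ])
```

Among these four variables, pH shows the smallest scaled gap, which is consistent with its nearly symmetric shape.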

I checked the dataset for missing values (NAs) in the variables and did not find any.
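One possible way to run such a check is sketched below on a small hypothetical data frame (`toy` is not part of the wine data and deliberately contains one NA so that the check has something to find).

```r
# Toy data frame with one deliberately missing value (hypothetical data).
toy <- data.frame(pH = c(3.0, 3.3, NA), alcohol = c(8.8, 9.5, 10.1))

# Count missing values per column; all zeros would mean no NAs at all.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# A single TRUE/FALSE answer for the whole data frame.
has_missing <- anyNA(toy)
```

On the wine data, `colSums(is.na(wines))` would be expected to return zero for every column.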

I shall discuss each variable and provide some brief exploration of the structure and distribution of the variable’s data.

Fixed Acidity

The fixed.acidity feature measures the concentration of tartaric acid in grams/dm3. Tartaric acid is an organic acid that helps make the wine taste fresh and also helps in its preservation.

Here’s a summary of the fixed acidity values found in our white wine dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The concentration of fixed acidity is given in grams per cubic decimeter (g/dm3). The lowest fixed acidity in the dataset is 3.8 g/dm3 and the highest is 14.2 g/dm3, an outlier that lies well above the rest of the data given the interquartile range of only 1.0 g/dm3.

The wines in our sample seem to follow a roughly Gaussian shape in their tartaric acid levels, and the histogram shows a tail of wines with high concentrations (well above the third quartile of 7.3 g/dm3) that pulls the mean upward.

The boxplot confirms the roughly normal shape of the fixed acidity data. The minimum (3.8 g/dm3) and the maximum (14.2 g/dm3) both indicate the presence of outliers. I have added jittered data points to the boxplot; points well beyond the 1.5 × IQR whiskers are common, especially on the positive end.
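The outlier thresholds can be reproduced from the quartiles alone. This is a small sketch of the usual 1.5 × IQR rule applied to the fixed acidity summary; the variable names are my own.

```r
# Quartiles of fixed.acidity taken from the summary above (g/dm3).
q1  <- 6.3
q3  <- 7.3
iqr <- q3 - q1

# Tukey's rule: points more than 1.5 * IQR beyond the quartiles are flagged.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# The reported minimum and maximum both fall outside the fences.
min_is_outlier <- 3.8 < lower_fence
max_is_outlier <- 14.2 > upper_fence
cat("fences:", lower_fence, upper_fence, "\n")
```

With the full data in hand, `boxplot.stats(wines$fixed.acidity)$out` would list the flagged observations directly.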

In unripe grapes, tartaric acid can reach up to 15 grams per cubic decimeter.

Volatile Acidity

volatile.acidity, which pertains to the amount of acetic acid, affects how the wine smells. At high enough amounts, the wine can become vinegary in smell and taste.

A summary of the volatile acidity values is given below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Again, the summary indicates the presence of outliers. With an IQR of 0.11, the maximum of 1.10 g/dm3 lies far above the upper 1.5 × IQR fence (about 0.485 g/dm3), while the minimum of 0.08 g/dm3 still falls inside the lower fence (about 0.045 g/dm3), so the outliers are on the positive end of the volatile acidity distribution.

The volatile acidity levels in our white wine dataset are roughly normally distributed, yet relative to their center they appear more dispersed than the fixed.acidity values. There is a considerable number of acetic acid outliers on the positive end, which gives the distribution a positive skew.

From the histogram, I can see that most wines have volatile acidity in the 0.22 to 0.28 g/dm3 range. In fact, a glimpse of the volatile.acidity vector using the table() function shows that 263 observations possess the modal acetic acid value of 0.28 g/dm3.
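For readers unfamiliar with table(), here is a minimal sketch of how a modal value and its count can be read off. The vector `va` is a hypothetical handful of readings, not the real volatile.acidity column.

```r
# Hypothetical handful of volatile acidity readings (not the real column).
va <- c(0.27, 0.30, 0.28, 0.23, 0.28, 0.32, 0.28, 0.22)

# table() counts each distinct value; sorting puts the most frequent first.
counts <- sort(table(va), decreasing = TRUE)
modal_value <- as.numeric(names(counts)[1])
modal_count <- counts[[1]]
```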

I’m not a wine expert, but I would think that the values from the two sets of acidity levels (tartaric and acetic acid) should exhibit similar dispersions. Why would the acetic acid set show a more dispersed distribution of values?

Citric Acid

Citric acid can enhance the freshness and flavor of wines and it is a type of acid that can usually be found or added in very small quantities.

Here is a summary of the citric acid values.

summary(wines$citric.acid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The citric acid amounts in our white wine sample range from a minimum of 0.00 g/dm3 to a maximum of 1.66 g/dm3. The average amount of citric acid is 0.33 g/dm3, while fifty percent of the wines have citric acid levels at or below the median of 0.32 g/dm3, slightly less than the mean.

The distribution of citric acid values is roughly normal. The maximum citric acid value in the sample is remarkably high (1.66 g/dm3), which indicates the presence of positive outliers.

Although the values are small, just like those for acetic acid, the citric acid values tend to lie close to each other.

Half of the citric acid values bunch up between 0.27 g/dm3 and 0.39 g/dm3, an IQR of 0.12 g/dm3. There is a prominent bar in the histogram that stands out against its neighbors, above the third quartile of the distribution. A run of the table() command shows that 215 white wines share this citric acid amount of 0.49 g/dm3.

Residual Sugar

Residual sugar refers to the amount of sugar left in wine after the fermentation process is allowed to finish. It is mostly responsible for the sweetness in wine.

A summary of residual sugar values in our sample is given below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The median residual sugar in our white wine sample is 5.2 g/dm3 (the mean is 6.391 g/dm3), but there is a substantial amount of spread around the center as the interquartile range comes in at 8.2 g/dm3. The extreme maximum (65.8 g/dm3), together with the wider gap between the median and the third quartile (4.7 g/dm3) than between the first quartile and the median (3.5 g/dm3), indicates a strong positive skew in the distribution of residual sugar.

Taking a log transformation of the highly positively skewed residual sugar data lets us discover the bimodal shape of its distribution. There is a considerable peak and some local variance around the 1.4 g/dm3 data point and another modal level around the 8.0 g/dm3 data point.
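To see why the log scale helps, the sketch below compares the gaps between the five summary values of residual.sugar before and after a base-10 log transform. It only uses the numbers printed in the summary, so it illustrates the compression of the right tail but cannot show the bimodality itself.

```r
# Five-number summary of residual.sugar from the output above (g/dm3).
five_num <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# On the raw scale the maximum sits far away from the rest of the data.
raw_gaps <- diff(five_num)

# log10 compresses the long right tail, so the gaps become comparable.
log_gaps <- diff(log10(five_num))

print(round(raw_gaps, 2))
print(round(log_gaps, 2))
```

In a ggplot2 histogram, the same effect is typically obtained by adding `scale_x_log10()` to the plot.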

The boxplot in turn shows that the amount of residual sugar (measured in grams per liter, which is equivalent to g/dm3) is strongly positively skewed, with the mean of 6.391 g/dm3 substantially greater than the median of 5.2 g/dm3.

It is interesting to note the outliers revealed by the summary() function for the residual.sugar feature. At least one wine registered the maximum residual sugar amount of 65.8 grams per liter. This makes it very sweet, as any wine with more than 45 grams per liter is considered sweet.

Chlorides

The concentration of chlorides in wine depends on the kind of grapes used and their origin, and chlorides are known to contribute to the saltiness of the product.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

I noticed that the chloride data points are quite tight around the mean considering that the calculated IQR is about 0.014 g/dm3. There are considerable outliers in the positive side of the distribution with a maximum chloride concentration of about 0.35 g/dm3.

A log-transformed distribution of chlorides in the white wine dataset assumes a roughly normal shape. The untransformed mean is about 0.0458 g/dm3.

The boxplot shows the strong, positively skewed shape of the underlying chloride distribution. It also shows the small variance in the middle quartile intervals.

Free Sulfur Dioxide

Sulfur dioxide has been widely used to help protect the quality of wine by retarding the growth of microbes and controlling oxidation. Given below is a summary of the free sulfur dioxide data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The free sulfur dioxide values tend to be concentrated near the mean (35.31 ppm). The IQR is 23 ppm, and the standard deviation of about 17.01 ppm amounts to 48.17% of the mean (the coefficient of variation). The formula for the coefficient of variation is given below.

\[CV = \frac{\sigma}{\bar{x}} \times 100\%\]

I will use this coefficient later when I compare the free sulfur dioxide variance with the total sulfur dioxide variance.
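The coefficient of variation is easy to wrap in a helper. The function name `cv_percent` is my own, and the inputs are the rounded standard deviation and mean quoted above, so the result is only as precise as those figures.

```r
# Coefficient of variation: standard deviation as a percentage of the mean.
cv_percent <- function(sd_value, mean_value) {
  sd_value / mean_value * 100
}

# Rounded standard deviation and mean of free sulfur dioxide (ppm).
cv_free <- cv_percent(17.01, 35.31)
print(round(cv_free, 2))
```

Given the raw column, the equivalent call would be `sd(x) / mean(x) * 100`.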

The unbound sulfur dioxide values exhibited by the wines in our white wine dataset seem to be slightly positively skewed and widely dispersed, but centered around the mean of 35.31 ppm. The skewness is mainly due to the high-valued outliers, the highest of which is 289 ppm.

Total Sulfur Dioxide

The sum of the free (unbound) and bound sulfites in wine determines the total amount of sulfur dioxide. The smell and taste of sulfur become apparent in wine when sulfur dioxide is present in high enough concentrations (free sulfur dioxide above about 50 ppm).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The mean of the total sulfur dioxide data is about 138 ppm, which is close to the median of 134 ppm. This suggests that the data is roughly symmetric around its center, although there are also extreme values in the data.

The total sulfur dioxide IQR is 59 ppm, and the standard deviation of about 42.50 ppm amounts to 30.72% of the mean (coefficient of variation). This tells us that total sulfur dioxide is quite spread out around the mean, although it is relatively less dispersed than the unbound sulfur dioxide values (CV = 48.17%).

The histogram of total sulfur dioxide amounts in our white wine dataset follows a roughly normal distribution centered around the mean of 138.4 ppm. The most frequently occurring total sulfur dioxide amount is 111 ppm.

Both the histogram and the boxplot show a slight right-side skew owing to the presence of some outliers beyond the threshold of about 227 ppm (the mean plus 1.5 × IQR).

I also observe how widely the total sulfur dioxide values are spread across the interquartile range.

Density

The density of wine is influenced by the concentration of alcohol, sugar, glycerol, and other dissolved solids. It serves as a measure of the conversion of sugar to alcohol. A low-density wine has more alcohol than sugar, and a high-density wine has more sugar than alcohol.

It is said that the density of wine is close to that of water. Dry wine has lower density compared to sweet wine, which has a higher density.

To provide perspective, we can compare our dataset’s white wine density with the density of some familiar substances.

Substance   Density (g/cm3)
Water       1.000
Ethanol     0.7890
Sugar       1.5870

Given below is a summary of the density of the white wine observations in our dataset (in g/cm3).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The summary tells us that white wines in general tend to have a density that is close to that of water but leaning towards the alcohol side.

The density values tend to cluster around the mean (0.994 g/cm3). This is also reflected in the low IQR (0.004 g/cm3) and low standard deviation (0.003 g/cm3). The close similarity between the mean and the median (0.9937 g/cm3) suggests that the distribution of density is roughly symmetric.

The outliers on the positive end have values that can be considered feasible given that they are still closely comparable to the density of water with which wines share a similar characteristic. I will consider these outliers as an example of good data about some extreme cases.

Acidity (pH)

The pH level of wine provides a way to gauge the ripeness of wine in relation to its acidity. pH values range from 0 (very acidic) through 14 (very alkaline or basic). Wines that have low pH values tend to taste crisp and tart. On the other hand, wines with a high pH value are more vulnerable to microbial growth. The general understanding is that there is an inverse relationship between pH and acidity: the lower the pH, the higher the acidity and vice versa.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

According to winespectator.com, the acidity level in wine works along with other attributes such as sweetness, alcohol, extract and tannin to influence how the wine tastes. It also mentions that pH levels from 3.0 to 3.4 are desirable for white wines.

Since the mean and the median are almost identical, we can expect the pH distribution of white wines in our dataset to have a normal shape. Outliers on the positive end still tend to fall within the range of the usual pH values in most wines (between 2.9 and 3.9). At the low end of the pH summary, I see that we have white wine outliers in our dataset that come close to the acidity of distilled white vinegar (pH = 2.4).
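As a rough worked illustration of what these pH figures imply (treating pH as the negative base-10 logarithm of the hydrogen ion concentration, which is the usual simplified definition), the most acidic wine in the sample (pH 2.72) can be compared with the least acidic one (pH 3.82):

\[\mathrm{pH} = -\log_{10}[\mathrm{H}^{+}] \quad\Longrightarrow\quad \frac{[\mathrm{H}^{+}]_{\mathrm{pH}=2.72}}{[\mathrm{H}^{+}]_{\mathrm{pH}=3.82}} = 10^{3.82 - 2.72} = 10^{1.10} \approx 12.6\]

So a pH range of only 1.1 units corresponds to roughly a twelvefold difference in hydrogen ion concentration.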

The histogram of white wine pH levels is Gaussian in shape, centered around the mean of about 3.19. The data points also tend to be spread out around the mean, as the jittered boxplot shows. However, the pH standard deviation of 0.15 makes the pH data relatively less varied than the other features, as it accounts for only about 4.74% of the mean (coefficient of variation).

One thing that I noticed is that the pH values between the mean and the maximum of 3.820 are relatively more spread out than those on the opposite side (values below the mean). This describes a slight positive skew owing to some outliers on the positive side. We can see this from the Q-Q plot.

The quantile-quantile plot follows a Gaussian reference line for pH values within about one standard deviation (0.15) of the mean.

Sulphates

Winemakers control the sulfur dioxide gas levels in their wine products by using sulphates. Sulfur dioxide is known to have antioxidant and antimicrobial properties that help preserve the quality of wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The average sulphate amount is slightly higher than the median sulphate amount which indicates a slightly positive skew for the sulphate distribution.

I can immediately see the presence of outliers in the sulphate data. The sulphate distribution has a normal shape with a slight positive skew.

Sulphates have a standard deviation of 0.11 g/dm3 and an IQR of 0.14 g/dm3. The standard deviation is 23.3% of the mean, and this coefficient of variation means that the sulphate data is considerably more dispersed than a variable such as pH (4.74%), though less so than the two sulfur dioxide variables.

Alcohol

The alcohol variable measures the percent alcohol content of the wine. Alcohol is among the characteristics of wine that influence its overall quality and taste. Although the optimal alcohol content is subjective and depends on the type of wine, some would say that the ideal alcohol content is about 13.6 percent, while some wines taste exceptionally good at concentrations higher than 14.0 percent.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The white wines in our dataset have at least 8% alcohol content and at most 14.2%. The typical wine has an alcohol level of about 10%. Since the mean alcohol content of 10.51% is close to, but slightly greater than, the median (by about 0.11 percentage points), we can expect a roughly symmetric set of values with a slight positive skew.

White wine alcohol levels in our dataset appear slightly skewed to the positive end. The boxplot tells us that about half of the wines have alcohol levels from 9.5% to 11.4%. The data points are quite varied, with a standard deviation of 1.23 percentage points. I look forward to finding out whether the wines that received high quality scores concentrate around any particular alcohol level.

Quality

The quality variable represents a numeric score from 0 (very bad) to 10 (excellent). This ordered classification is based on the sensory evaluation made by wine experts. The summary statistics are given below.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

The data points representing the quality feature appear to be clustered around 6.0, and a majority of the wines had scores of 5 or 6 (IQR = 1).

I decided to use a bar plot to visualize the distribution of white wine quality scores because of the discrete nature of its data type.

The bar plot confirms our observation about the distribution of wines according to their quality scores: the scale runs from 0 through 10, but there are far more wines of typical quality (scores of 5, 6 and 7) than wines that are considered poor or exceptional. What follows is a table breaking down the number of wines in our dataset according to their quality scores.

Frequency of Wines By Quality

Quality   Frequency
3                20
4               163
5              1457
6              2198
7               880
8               175
9                 5
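The frequency table is enough to recover the summary statistics reported for quality. The sketch below does this with a weighted mean and an expanded vector; the variable names are my own.

```r
# Quality scores and their counts, copied from the frequency table above.
scores <- 3:9
counts <- c(20, 163, 1457, 2198, 880, 175, 5)

# Weighted mean of the scores, using the counts as weights.
mean_quality <- sum(scores * counts) / sum(counts)

# Share of wines in the "typical" band of 5 to 7.
share_typical <- sum(counts[scores %in% 5:7]) / sum(counts)

# Median recovered by expanding the table back into individual scores.
median_quality <- median(rep(scores, times = counts))

print(round(c(mean = mean_quality, median = median_quality,
              share_5_to_7 = share_typical), 3))
```

The counts add up to the 4,898 observations in the dataset, and the weighted mean and median agree with the summary() output for quality.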


Univariate Analysis


I have given an overview of the 12 variables comprising each of the 4,898 sampled wines in the previous univariate plots section.

I briefly described each variable and some of its characteristics that may influence the quality of the wine. I also accompanied this with a short statistical summary of the feature’s data points (mean, median, minimum, maximum and quartile scores). I have collected those summaries in the table below to serve as a reference to the observations I have made regarding these variables.

Variable                    Min       Q1   Median       Q3      Max    Range      IQR     Mean       SE
fixed.acidity             3.800    6.300    6.800    7.300   14.200   10.400    1.000    6.855    0.012
volatile.acidity          0.080    0.210    0.260    0.320    1.100    1.020    0.110    0.278    0.001
citric.acid               0.000    0.270    0.320    0.390    1.660    1.660    0.120    0.334    0.002
residual.sugar            0.600    1.700    5.200    9.900   65.800   65.200    8.200    6.391    0.072
chlorides                 0.009    0.036    0.043    0.050    0.346    0.337    0.014    0.046    0.000
free.sulfur.dioxide       2.000   23.000   34.000   46.000  289.000  287.000   23.000   35.308    0.243
total.sulfur.dioxide      9.000  108.000  134.000  167.000  440.000  431.000   59.000  138.361    0.607
density                   0.987    0.992    0.994    0.996    1.039    0.052    0.004    0.994    0.000
pH                        2.720    3.090    3.180    3.280    3.820    1.100    0.190    3.188    0.002
sulphates                 0.220    0.410    0.470    0.550    1.080    0.860    0.140    0.490    0.002
alcohol                   8.000    9.500   10.400   11.400   14.200    6.200    1.900   10.514    0.018
quality                   3.000    5.000    6.000    6.000    9.000    6.000    1.000    5.878    0.013
quality.f*                1.000    3.000    4.000    4.000    7.000    6.000    1.000    3.878    0.013
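The SE column is not defined in the text. Assuming it is the standard error of the mean (the standard deviation divided by the square root of the sample size), two of its entries can be cross-checked from the rounded standard deviations quoted in the Alcohol and pH sections. The helper name `se_mean` is hypothetical.

```r
# Standard error of the mean: standard deviation over sqrt(n).
# Assumption: this is what the SE column of the table reports.
se_mean <- function(sd_value, n) sd_value / sqrt(n)

n_wines    <- 4898
se_alcohol <- se_mean(1.23, n_wines)  # sd quoted in the Alcohol section
se_ph      <- se_mean(0.15, n_wines)  # sd quoted in the pH section

print(round(c(alcohol = se_alcohol, pH = se_ph), 3))
```

Both values round to the table entries for alcohol and pH, which supports the assumption.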

Listed below are the observations I have made regarding the variables of the white wine dataset.


Bivariate Plots


Let us now look into the analysis that involves any two variables of our white wine dataset. We will be using pair-wise scatterplots as well as correlation coefficients to examine the associations between the features of our data.

We will begin with a matrix of scatterplots and correlation coefficients using all of the variables in our dataset.

We can see how most of the variables exhibit a positive skew. The pH variable is quite normal and the alcohol distribution is quite unusually shaped.

The pair-wise scatterplots help visualize any relationship that is present between variables. They also give an idea of how strong and in what direction the relationship goes.

The table below shows the Pearson’s correlation coefficient (\(r\)) for each pair of variables in the white wine dataset.

                     fixed.acidity volatile.acidity citric.acid residual.sugar
fixed.acidity                 1.00            -0.02        0.29           0.09
volatile.acidity             -0.02             1.00       -0.15           0.06
citric.acid                   0.29            -0.15        1.00           0.09
residual.sugar                0.09             0.06        0.09           1.00
chlorides                     0.02             0.07        0.11           0.09
free.sulfur.dioxide          -0.05            -0.10        0.09           0.30
total.sulfur.dioxide          0.09             0.09        0.12           0.40
density                       0.27             0.03        0.15           0.84
pH                           -0.43            -0.03       -0.16          -0.19
sulphates                    -0.02            -0.04        0.06          -0.03
alcohol                      -0.12             0.07       -0.08          -0.45
quality                      -0.11            -0.19       -0.01          -0.10

                     chlorides free.sulfur.dioxide total.sulfur.dioxide density
fixed.acidity             0.02               -0.05                 0.09    0.27
volatile.acidity          0.07               -0.10                 0.09    0.03
citric.acid               0.11                0.09                 0.12    0.15
residual.sugar            0.09                0.30                 0.40    0.84
chlorides                 1.00                0.10                 0.20    0.26
free.sulfur.dioxide       0.10                1.00                 0.62    0.29
total.sulfur.dioxide      0.20                0.62                 1.00    0.53
density                   0.26                0.29                 0.53    1.00
pH                       -0.09                0.00                 0.00   -0.09
sulphates                 0.02                0.06                 0.13    0.07
alcohol                  -0.36               -0.25                -0.45   -0.78
quality                  -0.21                0.01                -0.17   -0.31

                        pH sulphates alcohol quality
fixed.acidity        -0.43     -0.02   -0.12   -0.11
volatile.acidity     -0.03     -0.04    0.07   -0.19
citric.acid          -0.16      0.06   -0.08   -0.01
residual.sugar       -0.19     -0.03   -0.45   -0.10
chlorides            -0.09      0.02   -0.36   -0.21
free.sulfur.dioxide   0.00      0.06   -0.25    0.01
total.sulfur.dioxide  0.00      0.13   -0.45   -0.17
density              -0.09      0.07   -0.78   -0.31
pH                    1.00      0.16    0.12    0.10
sulphates             0.16      1.00   -0.02    0.05
alcohol               0.12     -0.02    1.00    0.44
quality               0.10      0.05    0.44    1.00

Equipped with the pair-wise visualizations as well as the correlation coefficient matrix, we can glean the following bivariate associations. These variable pairs produced some notable \(r\) values; and by notable, I mean an absolute value of at least 0.40.

Variable pairs with at least a weak positive relationship:

Variable 1             Variable 2             Correlation Coefficient
Total sulfur dioxide   Residual sugar          0.40
Free sulfur dioxide    Total sulfur dioxide    0.62
Residual sugar         Density                 0.84
Total sulfur dioxide   Density                 0.53
Alcohol                Quality                 0.44

Variable pairs with at least a weak negative relationship:

Variable 1             Variable 2             Correlation Coefficient
Fixed acidity          pH                     -0.43
Residual sugar         Alcohol                -0.45
Total sulfur dioxide   Alcohol                -0.45
Density                Alcohol                -0.78
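Picking out the notable pairs by eye is error-prone, so here is a sketch of how the filtering could be automated. It uses a 4 × 4 excerpt of the correlation matrix with values copied from the full matrix; the object names are my own, and on the real data `r` would come from `cor()` applied to the numeric columns of `wines`.

```r
# A 4 x 4 excerpt of the correlation matrix (values copied from the full matrix).
vars <- c("residual.sugar", "density", "alcohol", "quality")
r <- matrix(c( 1.00,  0.84, -0.45, -0.10,
               0.84,  1.00, -0.78, -0.31,
              -0.45, -0.78,  1.00,  0.44,
              -0.10, -0.31,  0.44,  1.00),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once (upper triangle) when |r| is at least 0.40.
idx <- which(upper.tri(r) & abs(r) >= 0.40, arr.ind = TRUE)
notable <- data.frame(var1 = rownames(r)[idx[, "row"]],
                      var2 = colnames(r)[idx[, "col"]],
                      r    = r[idx])

# Strongest associations first.
print(notable[order(-abs(notable$r)), ])
```

Restricting the search to the upper triangle avoids listing each pair twice and skips the diagonal of ones.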

I would like to examine each of these variable pairs by making some bivariate plots. Please note that since a majority of the variables have positively skewed distributions, I have constructed the plots so as to exclude the higher valued outliers. What appears in each pair-wise scatterplot are the data points that fall roughly within the whiskers of the x-axis variable’s boxplot.

We could think of it as zooming into the more relevant portion of the sets of related data points. I used the following formula to determine the boundaries of these x and y axis values: \(\bar{x} \ \pm \ 1.5 \cdot IQR\). Note that this range is centered on the mean, so it only approximates the boxplot whiskers, which reach at most 1.5 × IQR beyond the first and third quartiles.
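The zooming rule can be written as a small helper. This is a sketch only: `axis_bounds` is a name I chose, and the vector `x` is a hypothetical skewed sample rather than a column of the wine data.

```r
# Axis limits following the formula in the text: mean +/- 1.5 * IQR.
axis_bounds <- function(x) {
  mean(x) + c(-1.5, 1.5) * IQR(x)
}

# Hypothetical right-skewed sample standing in for one wine variable.
x <- c(1.2, 1.6, 2.0, 5.2, 6.9, 8.5, 9.9, 20.7, 65.8)
bounds <- axis_bounds(x)

# Keep only the observations that fall inside the bounds.
x_zoomed <- x[x >= bounds[1] & x <= bounds[2]]
print(bounds)
print(x_zoomed)
```

Because the mean of a right-skewed sample sits above its median, these bounds are shifted upward relative to boxplot whiskers, so a few low values can be cut along with the extreme high ones.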

I have sampled 2000 observations from the white wine dataset in constructing this first plot so as to minimize overplotting.
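A reproducible row sample might look like the sketch below. The seed value is arbitrary, and `toy` is a simulated stand-in with the same number of rows as the wine dataset.

```r
# Fix the random seed so the same rows are drawn on every run (arbitrary value).
set.seed(42)

# Simulated stand-in with as many rows as the wine dataset.
toy <- data.frame(x = rnorm(4898), y = rnorm(4898))

# Draw 2000 distinct row indices (sampling without replacement) and subset.
sample_rows <- sample(nrow(toy), 2000)
toy_sample  <- toy[sample_rows, ]
```

Sampling without replacement, the default for `sample()`, ensures that no wine is plotted twice.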

The Pearson’s r correlation coefficient of 0.40 indicates a weak and positive relationship between total sulfur dioxide and residual sugar.

We see a moderate and positive relationship between the total sulfur dioxide values and the free sulfur dioxide values. The r coefficient between the bivariate data is 0.62.

Density never comes close to zero, so the data points sit well above the x-axis. The correlation coefficient of 0.84 shows a strong positive relationship between residual.sugar and density.

A Pearson’s correlation coefficient of 0.53 indicates a moderate positive relationship between total sulfur dioxide and white wine density.

Above is a jittered scatterplot showing a correlation coefficient of 0.44 that indicates a moderately positive relationship between quality and alcohol.

The next round of plots would deal with the negative bivariate relationships among the features of the white wine dataset.

Overall acidity (pH) is negatively correlated with fixed acidity levels. This correlation is weak (r = -0.43).

Residual sugar and total sulfur dioxide, which share the same correlation coefficient of -0.45 with alcohol, are both weakly and negatively associated with it.

The inverse relationship between alcohol and density is the most notable plot in this group. With a correlation coefficient of -0.78 (the value given in the correlation matrix), it describes a fairly strong negative relationship between the two explanatory features.


Bivariate Analysis


The quality variable is the natural variable of choice when attempting bivariate analysis on the white wine dataset. It is common, and well reasoned, to consider how the other features might influence the quality score.

However, I am inclined to examine the interactions among the explanatory variables first because they might provide some ideas on how to approach the examination of the response variable. Moreover, there may be some practical benefit in paying closer attention to these relationships, as their analysis may provide useful information for downstream statistical inference or machine learning (e.g. feature selection).

Sulfur Dioxide and Sulphates

The graph below shows again the moderately positive relationship between free.sulfur.dioxide and total.sulfur.dioxide.

The Pearson’s correlation coefficient of 0.62 between the free sulfur dioxide and the total sulfur dioxide in the white wine dataset may be described as an expected result because one can be said to be a part of the other.

If the total sulfur dioxide of white wine comprises all the different forms that sulfur dioxide may take (like free sulfur dioxide), then it is reasonable to say that whenever the quantity of free sulfur dioxide increases, the quantity of total sulfur dioxide may be assumed to increase as well.

The same reasoning may be applied to the relationship of sulphates towards free.sulfur.dioxide and total.sulfur.dioxide, albeit such a relationship is shown to be very weak and positive (0.06 and 0.13 respectively).

The advantage of having a moderately close relationship between variables that are found to be composites of each other is that we can opt to use just one of them in our statistical calculations, since one can be partly explained by the other.

Alcohol, Residual Sugar and Density

The variables alcohol, residual.sugar and density were listed frequently in our correlation coefficient bivariate pairs list.

The prominent graph among the three visualizations above is the strong positive relationship between residual sugar and density (\(r = 0.84\)). A high amount of residual sugar is associated with a high density value; a low amount of residual sugar is in turn associated with a low density value.

This is followed by the graph showing the fairly strong negative relationship between alcohol and density \((r = -0.78)\). A low alcohol content value is associated with a high density value while a high alcohol content value is associated with a low density value.

We could glean another type of relationship from the observations we just described which both share a common variable: density. If density is proportionally related to residual sugar and inversely related to alcohol, then we could say that residual sugar would have some sort of relationship with alcohol.

A decreased amount of residual sugar in white wine is associated with a decreased measure of density. On the other hand, a decreased measure of density is associated with an increased alcohol value. Thus, chaining the two associations, we could expect that a decrease in residual sugar is accompanied by an increase in alcohol levels in white wine. Correlations are not transitive in general, but these two are strong enough for the chain to hold.

We can see how the plots above support these conjectures. In fact, the inverse relationship between residual sugar and alcohol is expressed by a correlation coefficient of \(r = -0.45\), which suggests a weak to moderate negative relationship between residual sugar and alcohol.
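The chained reasoning can be checked with a standard bound on correlations. For any three variables \(x\), \(y\) and \(z\), the correlation of the outer pair is constrained by the other two:

\[r_{xy}\,r_{yz} - \sqrt{(1 - r_{xy}^2)(1 - r_{yz}^2)} \;\le\; r_{xz} \;\le\; r_{xy}\,r_{yz} + \sqrt{(1 - r_{xy}^2)(1 - r_{yz}^2)}\]

Taking residual sugar as \(x\), density as \(y\) and alcohol as \(z\), with \(r_{xy} = 0.84\) and \(r_{yz} = -0.78\) from the correlation matrix:

\[r_{xy}\,r_{yz} = -0.6552, \qquad \sqrt{(1 - 0.7056)(1 - 0.6084)} = \sqrt{0.2944 \times 0.3916} \approx 0.340\]

\[-0.995 \;\le\; r_{xz} \;\le\; -0.316\]

The whole interval is negative, so a negative correlation between residual sugar and alcohol is guaranteed by the other two coefficients, and the observed value of \(-0.45\) falls inside it.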

What does this all mean for us? I think having this knowledge would be an advantage later if one decides to construct a statistical model using the variables in the white wine dataset. Since both residual.sugar and density have been shown to closely vary together, variance issues brought by multicollinearity that usually come into play in statistical modelling may be mitigated by excluding density as density can be explained by residual.sugar.

Alcohol and Wine Quality

White wine quality has been shown to exhibit a moderately positive association with alcohol level \((r = 0.44)\). This means that poor or low quality wines may be associated with low alcohol levels while high quality wines are associated with higher alcohol contents.

We can examine this relationship more closely by looking into the alcohol level among wines which were similarly classified in terms of quality.

The set of density plots starts off from the bottom, where we can see a widely dispersed distribution of alcohol values which represents the poor class of white wines (quality = 3).

As we move up the quality scale, the peak seems to move towards the lower end of the alcohol values, with the most frequent values around the 9 - 10% alcohol level in the average quality wine group (quality = 5). At this point, the right-skewed distribution is apparent. We can say that there are generally more white wines that have lower alcohol content levels among the poorer to average class wines.

A gradual shift begins to happen as we climb higher up the quality ladder. Beginning from the quality classification of 7, the general alcohol level steadily increases. Exceptional wines (quality 8 and 9) show a negatively skewed distribution. Wines which received the highest quality rating also show an interesting bimodal distribution.

Reclassifying Quality Scores

Another way of visualizing the relationship between alcohol and quality is by consolidating the quality factor levels into two major groups: white wines that received quality scores of 8 or above will be classified as “Excellent” while everything else will be classified as “Inferior”.

# Create a column to reclassify the quality score
# Excellent : { 8, 9 }
# Inferior  : { 3, 4, 5, 6, 7 }
wines$quality.q2 <- ifelse(wines$quality <= 7, "Inferior", "Excellent")

# Convert the variable to a factored variable
wines$quality.q2 <- factor(wines$quality.q2, levels=c("Inferior", "Excellent"))

I introduced a new variable quality.q2 to hold the factored results of our groupings. The table below tallies the frequencies of the wines that belong to each group based on their quality scores.

Reclassified Wine Quality Labels
Frequency
Inferior 4718
Excellent 180
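A tally like the one above can be produced with table(). The sketch below assumes a short made-up vector of scores (`quality_demo`, hypothetical) in place of wines$quality, so the counts it prints are not the ones in the table.

```r
# Hypothetical quality scores standing in for wines$quality
quality_demo <- c(3L, 5L, 6L, 6L, 7L, 8L, 9L)

# Same two-level rule as in the text: 8 and 9 are "Excellent", the rest "Inferior"
label_demo <- factor(
  ifelse(quality_demo %in% c(8L, 9L), "Excellent", "Inferior"),
  levels = c("Inferior", "Excellent")
)

# Frequency of each label
counts <- table(label_demo)
print(counts)
```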

Since we have just assigned each wine to one of two new classes, Excellent or Inferior, we can construct a couple of density line plots that make use of this new information.

The consolidation of several quality scores into a set of binary classification groups (Excellent and Inferior) simply reaffirms the results of our previous density plot.

It is quite convenient to see how inferior wines tend to have lower alcohol levels in general than wines whose quality scores were higher.

by(data = wines$alcohol, wines$quality.q2, mean)
## wines$quality.q2: Inferior
## [1] 10.47089
## -------------------------------------------------------- 
## wines$quality.q2: Excellent
## [1] 11.65111

The means of the alcohol content for the Inferior and Excellent wine groups are shown above by using the by() function. Excellent wines have a higher mean alcohol content, and the difference is about 1.18 percentage points.

Chlorides and Wine Quality

An interesting plot that I encountered is the one shown below about the distribution of chlorides (saltiness) within each quality group.

The binned density plots above seem to indicate that an increased amount of chlorides or salt is associated with a rise in the quality of poor wines, but only up to a point. Starting from a quality score of 6, improvements in quality scores are instead accompanied by a reduction in chloride levels.


Multivariate Plots


One of the goals of this project is to discover those features that may be common among white wines that received commendable quality scores.

The bivariate relationships between alcohol and density, between residual sugar and density, and between alcohol and residual sugar were relegated to their own separate discussions earlier. By doing so, I was able to discover and infer how density seems to take on an intermediate role in what seems like a more primary association between alcohol and residual sugar.

Let us study this further. I would like to try and visualize the alcohol, density and residual sugar features of wines through a multivariate plot.

I will start by adding a new field called density_quantile to the existing wines dataset. The new field will hold a decile value calculated from each wine observation’s density according to the density distribution. Below is the result of the density_quantile classification.
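One way such a decile field could be built is with quantile() and cut(). This is only a sketch under assumptions: `density_demo` is simulated data, and the actual code used to create density_quantile is not shown in this report.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated densities standing in for wines$density (hypothetical values)
density_demo <- rnorm(200, mean = 0.994, sd = 0.003)

# Decile break points at 0%, 10%, ..., 100%
decile_breaks <- quantile(density_demo, probs = seq(0, 1, by = 0.1))

# Assign each observation to a decile (1 = least dense, 10 = most dense)
density_quantile_demo <- cut(
  density_demo,
  breaks = decile_breaks,
  labels = 1:10,
  include.lowest = TRUE
)

print(table(density_quantile_demo))
```

With heavily tied real data, two decile break points could coincide and cut() would stop with an error about non-unique breaks, so the breaks may need checking first.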

The graph below shows the log-transformed alcohol level and residual.sugar data points for each white wine observation in the dataset. I used the observation’s density_quantile classification to assign a color to each data point.

The inverse relationship between alcohol and residual.sugar is depicted by the negatively sloped fit line.

It also makes apparent the proportional relationship between density and residual.sugar. Wines that belong to lower density quantiles (yellow) also appear in the lower valued portion of the residual sugar scale on the left and wines that have higher density quantiles (blue) accompany the data points that have a higher residual sugar content towards the right.

The inverse relationship between alcohol and density is also evident from the graph since yellow data points indicating a lower density quantile value are prevalent at high alcohol levels and blue data points indicating a higher density quantile value are prevalent at low alcohol levels.

For the next visualizations, I will reclassify the quality scores again into three levels: 3 for excellent wines (for scores above 6), 2 for fair (scores of 5 and 6), and 1 for poor (scores of 3 and 4).

Reclassified Wine Quality Labels
Group Frequency
1 183
2 3655
3 1060
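The three-level grouping can be sketched with cut(), which bins the scores into right-closed intervals. The vector `quality_demo` and the label names are illustrative assumptions; only the score ranges come from the text.

```r
# One example of each possible score (hypothetical input)
quality_demo <- c(3L, 4L, 5L, 6L, 7L, 8L, 9L)

# Intervals (2,4], (4,6], (6,9] map to Poor, Fair, Excellent
quality_group <- cut(
  quality_demo,
  breaks = c(2, 4, 6, 9),
  labels = c("Poor", "Fair", "Excellent")
)

# Integer codes follow the level order: 1 = Poor, 2 = Fair, 3 = Excellent
group_code <- as.integer(quality_group)
print(data.frame(quality_demo, quality_group, group_code))
```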

I noticed some interesting plots while examining different combinations of relationships among the available supporting variables. residual.sugar and density have an apparent relationship with each other. Also, pH and fixed.acidity seem to have an interesting association as well.

Density, Residual Sugar and Wine Quality

It has been discussed earlier that residual.sugar and density are positively and relatively strongly correlated \((r = 0.84)\). How does this correlation come into play when we include the available information about each wine’s quality?

It looks like there is a pattern among excellent and fair wines while I cannot discern any regular behavior among the data points that are described as poor quality wines.

Wines containing higher amounts of residual sugar conditioned on a particular level of density tend to have a higher quality rating. This can be seen on the plot where for a certain level of density measurement, there are relatively more blue points (excellent wines with quality scores of 7, 8 or 9) at the higher (vertical) location. This observation is prevalent among wines with densities up to about 0.995 g/cm3. The same can also be said about the fair wines (yellow).

Fixed Acidity, pH, and Wine Quality

Assigning each wine according to the three-score quality specification allows me to examine the multivariate relationships between pH, fixed.acidity and quality. Note that we have eliminated the outliers on the positive end and retained the data points up to the 99th percentile of both participating variables.

The same weak and negative association between pH and fixed.acidity can be seen in each of the quality groups \((r = -0.43)\). The quick and apparent difference between each group is how sparse and spread out the data points are among the excellent and poor wine groups. Fair wines (yellow) tend to cluster around the mean fixed.acidity of 6.85 g/dm3 and mean pH of 3.19.

Something quite subtle is going on, however, when we look closely at the cluster of the data points as the quality score increases.

The center of the fixed acidity seems to move to the left or towards the lower valued fixed acidity levels as the wine quality elevates from poor (quality scores of 3 or 4) to excellent (quality scores of 7, 8 or 9).

Quality Group Quality Score Mean Fixed Acidity (g/cu. dm.)
Poor 3, 4 7.18
Fair 5, 6 6.88
Excellent 7, 8, 9 6.73

We can verify how the mean fixed acidity seems to decrease by querying the fixed.acidity of wines conditioned on quality (1 = Poor, 2 = Fair, 3 = Excellent) as well as displaying the boxplot for fixed.acidity binned by the three quality levels (above).
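A query of this kind can be sketched with tapply(). The rows below are made up for illustration (`acidity_demo` is a hypothetical stand-in for the wines data), so the printed means will not match the table above.

```r
# Hypothetical rows standing in for the wines data frame
acidity_demo <- data.frame(
  fixed.acidity = c(7.4, 7.0, 6.9, 6.8, 6.9, 6.6, 6.8),
  quality       = c(3L, 4L, 5L, 6L, 6L, 7L, 8L)
)

# Three quality groups: 3-4 Poor, 5-6 Fair, 7-9 Excellent
acidity_demo$quality.group <- cut(
  acidity_demo$quality,
  breaks = c(2, 4, 6, 9),
  labels = c("Poor", "Fair", "Excellent")
)

# Mean fixed acidity within each quality group
mean_acidity <- tapply(acidity_demo$fixed.acidity, acidity_demo$quality.group, mean)
print(round(mean_acidity, 2))
```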


Multivariate Analysis


I did not set out to find relationships between the explanatory variables themselves, nor was I expecting them. My stated goal at the outset was to look for those features that make some wines more amenable to the tastes of experts.

Yet I was surprised to see several interactions among the predictors which the visualizations made more prominent. For instance, high density wines tend to be associated with high sugar content, and as sugar content increases, the alcohol content wanes. Since we know from a prior finding that alcohol content is moderately and positively correlated with the quality of wine, we can thus infer that exceptional wines might have the tendency to be less dense.

Another interesting interaction that I saw is how the exceptional and fair wines prominently exhibited a strong relationship between their residual sugar and wine density properties.

I have extracted the fair and exceptional wine data points and show each group’s scatterplot to display the tight proportional relationship between residual sugar and density.

Quality Group Quality Score Correlation Coefficient
Poor 3, 4 0.74
Fair 5, 6 0.86
Excellent 7, 8, 9 0.82

The exceptional and fair wine groups both show strong and positive correlations between their residual sugar content and density, and this is actually more prominent for the fair wines \((r = 0.86)\).
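Per-group correlation coefficients like those in the table can be computed by splitting the data on the quality group and applying cor() to each piece. The data here is simulated (`sugar_demo` and its coefficients are assumptions chosen only to give a positive relationship), so the resulting values are not the report's.

```r
set.seed(42)  # arbitrary seed for reproducibility
n <- 300

# Simulated stand-in for the wines data (hypothetical values)
sugar_demo <- data.frame(
  residual.sugar = runif(n, min = 1, max = 20),
  quality.group  = factor(
    sample(c("Poor", "Fair", "Excellent"), n, replace = TRUE),
    levels = c("Poor", "Fair", "Excellent")
  )
)
# Density rises with sugar, plus random noise
sugar_demo$density <- 0.990 + 0.0004 * sugar_demo$residual.sugar + rnorm(n, sd = 0.001)

# Correlation between residual sugar and density within each quality group
cor_by_group <- sapply(
  split(sugar_demo, sugar_demo$quality.group),
  function(g) cor(g$residual.sugar, g$density)
)
print(round(cor_by_group, 2))
```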

We then follow this with a similarly curious observation about pH, fixed.acidity and quality where the inverse relationship between pH and fixed.acidity is examined along quality classification groups. Upon closer scrutiny, we find that mean fixed acidity within quality groups decreases as the quality of wine increases.

Optional – Modeling White Wine Data

The presence of the input variables as well as the availability of a label in each of the examples in our white wine dataset certainly makes the idea of generating a model very appealing. So, I decided to pursue it using linear regression to model the interactions between the wines’ physicochemical properties and their assessed quality.

Preparing the Data

Part of the univariate analysis of each feature is the graphical presentation of its distribution using a boxplot. The boxplot lets us discover and identify outliers. Most of the extreme outliers are located on the positive end of the values and it has been shown that the variables’ distributions are positively skewed. Because we are dealing with multivariate data, we consider only those data points among predictors that lie within the limits bounded by the boxplots. This means that we shall deem those predictor values that fall above \(Q3 + 1.5 \times IQR\) as outliers.

The dataset will then be randomly split into training data and test data. The training data will consist of 80% of the main dataset while the remaining 20% will be set aside as test data.
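The two preparation steps can be sketched as follows. This is an illustration under assumptions, not the code behind the report: `prep_demo` is simulated data with only two predictors, and the helper `above_fence()` and the seed are hypothetical.

```r
set.seed(2024)  # arbitrary seed so the split is reproducible

# Simulated, positively skewed stand-in for the wines data (hypothetical)
prep_demo <- data.frame(
  residual.sugar = rexp(500, rate = 1 / 6),
  chlorides      = rexp(500, rate = 1 / 0.05),
  quality        = sample(3:9, 500, replace = TRUE)
)
predictors <- c("residual.sugar", "chlorides")

# TRUE where a value lies above the upper boxplot fence Q3 + 1.5 * IQR
above_fence <- function(x) {
  x > unname(quantile(x, 0.75)) + 1.5 * IQR(x)
}

# Keep only rows where no predictor is above its fence
is_outlier <- sapply(prep_demo[predictors], above_fence)
kept <- prep_demo[rowSums(is_outlier) == 0, ]

# Random 80/20 split into training and test sets
train_idx  <- sample(nrow(kept), size = round(0.8 * nrow(kept)))
train_demo <- kept[train_idx, ]
test_demo  <- kept[-train_idx, ]

print(c(original = nrow(prep_demo), kept = nrow(kept),
        train = nrow(train_demo), test = nrow(test_demo)))
```

The fences are computed once on the full data, before any rows are dropped, which is one of several reasonable choices.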

After the data preparation, we end up with the following datasets.

White Wine Data Records Name Record Count
Original Dataset wines.orig 4898
Outliers Excluded wines.outdf 4074
Training Set (80% of wines.outdf) trainSet 3259
Test Set (20% of wines.outdf) testSet 815

So, we now fit a linear regression model to the features of the white wine dataset using the lm() function.

## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + density + 
##     pH + sulphates + alcohol, data = trainSet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3992 -0.5197 -0.0497  0.4527  2.7696 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.985e+02  2.995e+01   6.627 4.00e-11 ***
## fixed.acidity        1.340e-01  3.068e-02   4.367 1.30e-05 ***
## volatile.acidity    -1.956e+00  1.790e-01 -10.927  < 2e-16 ***
## citric.acid          7.494e-02  1.549e-01   0.484  0.62845    
## residual.sugar       9.574e-02  1.131e-02   8.467  < 2e-16 ***
## chlorides           -3.632e+00  1.613e+00  -2.252  0.02439 *  
## free.sulfur.dioxide  5.866e-03  9.540e-04   6.148 8.79e-10 ***
## density             -1.995e+02  3.035e+01  -6.573 5.71e-11 ***
## pH                   9.431e-01  1.463e-01   6.447 1.31e-10 ***
## sulphates            8.017e-01  1.408e-01   5.693 1.36e-08 ***
## alcohol              1.207e-01  3.846e-02   3.139  0.00171 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7489 on 3248 degrees of freedom
## Multiple R-squared:  0.2545, Adjusted R-squared:  0.2522 
## F-statistic: 110.9 on 10 and 3248 DF,  p-value: < 2.2e-16

The lm() function generated a best fit model over the ten input variables comprising our training set of 3,259 white wine examples. All variables except citric.acid had coefficient estimates that are significant at the 0.05 level, with most of them being significant at the 0.001 level. The coefficient of determination is rather low at 0.2545.

Sum of Squared Errors (SSE)

SSE <- sum(model1$residuals^2)
SSE
## [1] 1821.442

The sum of the squares of the residuals from a model that used all available predictors is about 1,821. The histogram above shows the distribution of the residuals, which appear to be normally distributed.

Sample of Predicted Quality Scores

The multivariable linear regression model is applied on our test set and we show the first few predicted quality values in the table below.

# Make test set predictions
predictOnTest <- predict(model1, newdata = testSet)

Index Quality
2 5
4 6
14 7
16 6
19 6
22 6

Compute \(R^2\)

SSE = sum((testSet$quality - predictOnTest)^2)
SST = sum((testSet$quality - mean(wines.orig$quality))^2)
1 - SSE/SST
## [1] 0.2477108

The coefficient of multiple determination is calculated to be 0.2477. This can be interpreted to mean that our model explains only about 25% of the variability of the wine quality around its mean.
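The calculation above follows \(R^2 = 1 - SSE/SST\). As a sketch, it can be wrapped in a small helper; the function name `r_squared()`, its `baseline_mean` argument and the example vectors are all hypothetical and not part of the original analysis.

```r
# R^2 = 1 - SSE / SST, with SST measured around a chosen baseline mean
r_squared <- function(actual, predicted, baseline_mean = mean(actual)) {
  sse <- sum((actual - predicted)^2)      # unexplained variation
  sst <- sum((actual - baseline_mean)^2)  # total variation around the baseline
  1 - sse / sst
}

# Made-up actual and predicted quality scores
actual_demo    <- c(5, 6, 6, 7, 5, 6)
predicted_demo <- c(5.4, 5.9, 6.2, 6.4, 5.6, 5.9)

r2_demo <- r_squared(actual_demo, predicted_demo)
print(r2_demo)
```

Passing `baseline_mean = mean(wines.orig$quality)` would reproduce the baseline used in the calculation above; the training-set mean is another common choice.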


Final Plots and Summary


I began this exploratory data analysis of the white wine dataset with a cursory description of its dimensions, structure and data types. I also examined and plotted each feature’s distribution and found that most of the variables were positively skewed. Most of the variables have outliers at the high end of the distribution and if we remove these outliers, most of the variables except density and chlorides will exhibit quite a normal distribution.

Variable Mean Unit
Fixed acidity 6.86 g per cu. dm.
Volatile acidity 0.28 g per cu. dm.
Citric acid 0.33 g per cu. dm.
Residual sugar 6.39 g per cu. dm.
Chlorides 0.05 g per cu. dm.
Free sulfur dioxide 35.31 mg per cu. dm.
Total sulfur dioxide 138.40 mg per cu. dm.
Density 0.99 g per cu. cm.
pH 3.19
Sulphates 0.49 g per cu. dm.
Alcohol 10.51 % by volume
Quality 6 out of 10

From the summary of the attributes above, we can glean what a typical white wine looks like.

These descriptive measures give us a useful yet superficial understanding of our data. And so I had to dig deeper to find clues as to which features influence our wine experts’ assessment of each wine’s overall desirability. Computing a correlation matrix quickly shows that alcohol surpasses the others as a predictor of quality with a correlation coefficient of 0.44.

Since we know about the moderate and positive relationship between alcohol and wine quality, we can then explore the interactions between alcohol and the other variables because whatever relationships we can uncover might reasonably point to their indirect effect on wine quality.

To remind us of these interesting relationships, I will recapitulate the notable findings from the multivariate analysis that we previously conducted.

The visualization above shows how we can augment our understanding of the role of alcohol as a somewhat loose determiner of wine quality. Good wines tend to be associated with high alcohol content. It is apparent from this plot that high alcohol levels tend to be associated with low residual sugar content and low density levels.

The results of this data exploration about the relationship between alcohol and residual sugar (and density) are very interesting because they provide a data-centric affirmation of what actually transpires in wine-making: alcohol is a by-product of the fermentation process done by yeasts upon the natural sugars of grapes. So we may naturally expect the alcohol level to increase with a decrease in residual sugar content.

Sugar chemistry helps to explain its role in wine. One of these is fermentation, where yeasts metabolize sugars for energy, yielding alcohol as a major byproduct. - calwineries.com

Again, we can go further and verify the indirect influence of residual sugar and density on wine quality. And the scatterplot above manages to depict these subtle, indirect associations. Granted that the plot above shows an obvious direct and proportional relationship between wine density and sugar content, an interesting finding is that this relationship is stronger among fair quality wines \((r = 0.86)\) than among excellent \((r = 0.82)\) or poor wines \((r = 0.74)\).

The positive relationship between density and residual sugar as shown in the scatter plot also supports the practice of merely using wine density as an alternative measure to gauge the alcohol level that is present in wines as well as an indicator of the conversion of sugar to alcohol.


Reflection


Difficulties and Challenges I Encountered During the Analysis

One of the reasons I chose to tackle this study is because I like wine. It afforded me the opportunity to deepen my knowledge about this type of drink beyond the superficial appreciation of its taste, texture, smell, and how it enhances the enjoyment of the food it accompanies. Yet when I started digging into the physicochemical data, I realized how little I knew about what is actually involved in the making of this fine beverage.

The first challenge for me was determining where to even start with the data. There was a total of 12 variables shared by 4,898 observations. The way I managed this hurdle is to just let the data speak for itself. It is convenient that R provides functions such as str() and summary() that allowed me to describe the dataset as a whole and get an idea about the structure of each feature.

Another challenge that I encountered was figuring out which variables were important enough to deserve more attention. For this, I had to rely on studying the individual distributions of the variables through their boxplots and histograms. I also had to create multiple preliminary bivariate plots to distinguish which pairs or sets of variables are worthy of consideration.

Recall how the \(R^2\) calculation returned a weak coefficient of determination value of 0.2477. Although I was expecting a higher result, at this point, I must admit that I was no longer surprised. It somehow confirms my early struggles of trying to look for any relationship that would connect the features to the response variable. It turns out that almost none of them (save perhaps alcohol), even taken collectively, can offer a close explanation of the quality score that each wine received.

The closest approximation that I can come up with is by approaching the problem indirectly. Since I knew that the alcohol level had a high enough correlation coefficient value when paired with wine quality, I used this knowledge to inform my efforts at discovering the supporting variables that influence the alcohol level.

Along the way, I also discovered some other interesting interactions among the explanatory variables. The data showed how residual sugar and total sulfur dioxide are proportionally related but this relationship is somewhat weak. I did some research and discovered that this relationship between sulfur dioxide and sugar in white wine happens because of the need to protect the wine in order to compensate for its low sugar levels. According to morethanorganic.com:

White wines and rosés do not contain natural anti-oxidants because they are not left in contact with their skins after crushing. For this reason they are more prone to oxidation and tend to be given larger doses of sulphur dioxide

On the other hand, by doing bivariate analysis, I found that free sulfur dioxide is negatively related with volatile acidity with a correlation coefficient of -0.10. Once again, it is very interesting to know how this finding mirrors what happens in wine-making. According to winemakermag.com, free sulfur dioxide is used to manage volatile acidity levels in wine.

The moderately positive correlation between total sulfur dioxide and wine density can in part be understood in hindsight. This situation is not farfetched if we consider the positive correlation between total sulfur dioxide and residual sugar as well as the strong and positive relationship between residual sugar and wine density.

In the middle of the project, I felt that I was running out of options about other ways to present the white wine data. This is when the idea about grouping the wines based on quality struck me. Reclassifying and refactoring wines based on different quality sets (Exceptional, Fair, Poor) introduced a variety of different ways of looking at the data. I was able to closely analyze the relationship between residual sugar and wine density conditioned on quality because of this.

Some Ideas For Future Work or Inquiry

The close exploratory analysis conducted on the white wine dataset yielded these interesting findings. It also brought forth some curious topics that may need further inquiry.

The quality of white wine observations rose with increasing chloride levels but only up to a point. Once the chloride content reached a little below 0.05 g/dm3, the relationship became reversed. Higher quality wines began to be associated with lower saltiness or chloride levels. It would be interesting to determine if there are extraneous factors that might explain this behavior between chloride levels and wine quality. Finding these extraneous factors can be the subject of a future inquiry.

The low \(R^2\) value which coincided with low p values in our regression model is another topic which may need to be further examined. It is usual to see a high \(R^2\) value paired with low p values, which indicates that variation in the predictor variables is associated with variation in the response variable and that the model explains most of the variability in the response variable.

But in our case, our model does not explain much of the variability in the quality of wine despite the presence of significant input variables. We saw this in our scatterplot when we visualized the predicted and actual white wine quality data values from our test set. The data points were substantially spread out around the reference line which indicates the large variance in the residuals. This also means that it is possible to have a viable set of input variable coefficients that coincide with a substantially imprecise set of predicted values.
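A small simulation can illustrate how a significant coefficient and a low \(R^2\) can coexist. Everything in this sketch is assumed for illustration (the sample size, the true slope of 0.3 and the noise level); it is not fitted to the wine data.

```r
set.seed(7)  # arbitrary seed for reproducibility
n <- 3000

# A real but weak effect buried in noise (hypothetical data)
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)

fit_summary <- summary(lm(y ~ x))

# p value of the slope and the share of variance explained
slope_p <- fit_summary$coefficients["x", "Pr(>|t|)"]
r2_sim  <- fit_summary$r.squared

print(c(slope_p = slope_p, r_squared = r2_sim))
```

With a large sample, the slope is estimated precisely enough to be clearly distinguishable from zero, even though most of the variation in `y` comes from the noise term and stays unexplained.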

From the standpoint of the available data on white wines, a single variable (alcohol) does not wholly determine the desirability of white wine. And a model utilizing a combination of various multivariate characteristics could not substantially explain the change in quality score despite the significance of each individual input variable. Thus, increasing the explanatory viability of a model might require additional predictors or perhaps the white wine data may just contain a substantial amount of intrinsic, unexplainable variability. I think that a further and closer study would be needed to find additional predictors or to analyze and manage the complex nature of wines to arrive at a better inferential model.


Bibliography