I have a dataset which consists of observations on the physicochemical properties of Portuguese vinho verde white wines. Each wine is judged by three wine experts and assigned a numeric score based on the assessment of its quality. Through exploratory data analysis, I intend to find and identify those features that may be common among wines that received higher quality scores.
The dataset has been loaded using `read_csv()`. I am curious about the shape of our dataset. How many rows and columns are in the white wine dataset?
dim(wines)
## [1] 4898 12
The dataset has 4,898 observations with 12 variables. Each observation represents a unit of white wine along with the measurements of some of its physicochemical attributes (e.g. acidity, sugar and alcohol level) as well as a numerical assessment of its overall desirability as determined by a group of oenologists.
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
We can see from the results of the `str()` function above that the dataset is made up of 12 attributes: 11 continuous numeric (num) attributes and one discrete (int) attribute. Next, I’ll use the `summary()` function to get an idea of the range of values we can expect from the fields in the `wines` dataset.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
A brief look at the results of the `summary()` function tells us that most of the variables, like `residual.sugar`, are likely to be positively skewed because the mean is greater than the median, while some attributes like `pH`, whose mean and median nearly coincide, look roughly symmetric and may follow a normal distribution.
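The mean-versus-median comparison can be sketched numerically. The snippet below is only an illustration built from the summary values already printed; the column name `gap_in_iqr` is hypothetical, and scaling the gap by the IQR is just one possible yardstick, not a formal skewness test.

```r
# Summary values copied from the summary() output above
stats <- data.frame(
  variable = c("residual.sugar", "pH"),
  q1       = c(1.700, 3.090),
  median   = c(5.200, 3.180),
  mean     = c(6.391, 3.188),
  q3       = c(9.900, 3.280)
)

# Gap between mean and median, expressed as a fraction of the IQR
# (hypothetical helper column; larger positive values hint at right skew)
stats$gap_in_iqr <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

print(stats[, c("variable", "gap_in_iqr")])
```

For `residual.sugar` the gap is roughly 15% of the IQR, while for `pH` it is roughly 4%, which is in line with the reading above.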
I checked the dataset for missing values (NAs) in the variables and did not find any.
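A check of this kind might look like the sketch below. The data frame `toy` is a small stand-in with a deliberately missing value, not the wine data, so that the snippet runs on its own; with the real data one would pass `wines` instead.

```r
# Toy stand-in for the wines data frame (values are illustrative only)
toy <- data.frame(
  pH      = c(3.00, 3.30, NA, 3.19),
  alcohol = c(8.8, 9.5, 10.1, 9.9)
)

# Count missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# TRUE if at least one value anywhere in the data frame is missing
has_missing <- any(is.na(toy))
print(has_missing)
```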
I shall discuss each variable and provide some brief exploration of the structure and distribution of the variable’s data.
The `fixed.acidity` feature measures the concentration of tartaric acid in g/dm3. Tartaric acid is an organic acid that helps make the wine taste fresh and also helps in its preservation.
Here’s a summary of the fixed acidity values found in our white wine dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The concentration of fixed acidity is given in units of grams per cubic decimeter (g/dm3). The lowest fixed acidity in the dataset is 3.8 g/dm3 and the highest is 14.2 g/dm3, an outlier that lies far above the third quartile (7.3 g/dm3) given an interquartile range of only 1.0 g/dm3.
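To make the outlier reasoning concrete, here is a minimal sketch that applies the conventional 1.5 × IQR boxplot rule to the fixed acidity quartiles printed above. The variable names are illustrative, and the rule itself is an assumption about how outliers are being judged here.

```r
# Quartiles of fixed.acidity, copied from the summary above
q1 <- 6.3
q3 <- 7.3
iqr <- q3 - q1

# Conventional boxplot fences: 1.5 * IQR beyond each quartile
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Compare the observed extremes against the fences
observed <- c(min = 3.8, max = 14.2)
is_outlier <- observed < lower_fence | observed > upper_fence

print(c(lower = lower_fence, upper = upper_fence))
print(is_outlier)
```

Under this rule the fences sit at about 4.8 and 8.8 g/dm3, so both the minimum and the maximum fall outside them.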
The wines in our sample seem to follow a roughly gaussian shape in their tartaric acid levels, and the histogram shows a few wines in the upper tail (> 7.3 g/dm3) whose high concentrations pull the mean upwards.
The boxplot confirms the roughly normal shape of the fixed acidity distribution. The minimum (3.8 g/dm3) and maximum (14.2 g/dm3) values indicate the presence of outliers in the fixed acidity data. I have added jittered data points to the boxplot; points well beyond the whiskers are fairly common, especially at the positive end.
In unripe grapes, tartaric acid can reach up to 15 grams per cubic decimeter.
`volatile.acidity`, which pertains to the amount of acetic acid, affects how the wine smells. At high enough amounts, the wine can become vinegary in smell and taste.
A summary of the volatile acidity values is given below.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Again, the `summary()` output indicates the presence of outliers at the positive end of the volatile acidity distribution. With an IQR of 0.11, the maximum of 1.10 g/dm3 lies far above the upper boxplot fence of 0.485 g/dm3, whereas the minimum of 0.08 g/dm3 stays above the lower fence of 0.045 g/dm3.
The volatile acidity levels in our white wine dataset are roughly normally distributed, yet they appear to be more dispersed than the `fixed.acidity` values. There is a considerable number of acetic acid outliers at the positive end of this positively skewed distribution.
From the histogram, I can see that most wines have a volatile acidity in the range of 0.22 to 0.28 g/dm3. In fact, a glimpse of the `volatile.acidity` vector using the `table()` function shows that 263 observations possess the modal acetic acid value of 0.28 g/dm3.
I’m not a wine expert, but I would think that the values from the two sets of acidity levels (tartaric and acetic acid) should exhibit similar dispersions. Why would the acetic acid set show a more dispersed distribution of values?
Citric acid can enhance the freshness and flavor of wines and it is a type of acid that can usually be found or added in very small quantities.
Here is a summary of the citric acid values.
summary(wines$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The citric acid amounts in our white wine sample range from a minimum of 0.00 g/dm3 to a maximum of 1.66 g/dm3. The average amount of citric acid is 0.33 g/dm3, while fifty percent of the wines have citric acid levels at or below the median of 0.32 g/dm3.
The distribution of citric acid values is roughly normal. The maximum citric acid value in the sample is remarkably high (1.66 g/dm3), which indicates the presence of positive outliers.
Although the values are small, just like those for acetic acid, the citric acid values tend to lie close to each other.
Half of the citric acid values bunch up between 0.27 g/dm3 and 0.39 g/dm3, an IQR of 0.12 g/dm3. There is a prominent bar in the histogram that stands out against its neighbors, somewhat above the third quartile of the distribution. A run of the `table()` command shows that 215 white wines have a citric acid amount of 0.49 g/dm3.
Residual sugar refers to the amount of sugar left in wine after the fermentation process is allowed to finish. It is mostly responsible for the sweetness in wine.
A summary of residual sugar values in our sample is given below.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The median residual sugar in our white wine sample is 5.2 g/dm3, but there is a substantial amount of spread around it, as the interquartile range comes in at 8.2 g/dm3. The presence of an extreme observation at the maximum (65.8 g/dm3), together with the considerable gap between the median and the third quartile, indicates a strong positive skew in the distribution of residual sugar.
Taking a log transformation of the highly positively skewed residual sugar data lets us discover the bimodal shape of its distribution. There is a considerable peak and some local variance around the 1.4 g/dm3 data point and another modal level around the 8.0 g/dm3 data point.
The boxplot in turn shows that the amount of residual sugar (measured in grams per liter) is strongly positively skewed, with the mean of 6.391 g/dm3 substantially greater than the median of 5.2 g/dm3, the level at or below which half of the wines in our dataset fall.
It is interesting to note the outliers revealed by the `summary()` output for the `residual.sugar` feature. At least one wine registered the maximum residual sugar amount of 65.8 grams per liter. This makes it very sweet, as any wine with more than 45 grams per liter is considered sweet.
The kind of grapes used and their origin influence the concentration of chlorides in wine, which is known to contribute to the saltiness of the product.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
I noticed that the chloride data points are quite tight around the mean considering that the calculated IQR is about 0.014 g/dm3. There are considerable outliers in the positive side of the distribution with a maximum chloride concentration of about 0.35 g/dm3.
A log-transformed distribution of chlorides in the white wine dataset assumes a roughly normal shape; on the original scale the mean is about 0.0458 g/dm3.
The boxplot shows the strong, positively skewed shape of the underlying chloride distribution. It also shows the small variance in the middle quartile intervals.
Sulfur dioxide has been widely used to help protect the quality of wine by retarding the growth of microbes and controlling oxidation. Given below is a summary of the free sulfur dioxide data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
The distribution of free sulfur dioxide tends to be concentrated near the mean (35.31 ppm). The IQR is 23 ppm, and the standard deviation of about 17.01 ppm amounts to 48.17% of the mean (the coefficient of variation). The formula for determining the coefficient of variation is given below.
\[CV = \frac{\sigma}{\bar{x}} \times 100\%\]
I will use this coefficient later when I compare the free sulfur dioxide variance with the total sulfur dioxide variance.
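As a quick worked check of the formula, the sketch below recomputes the coefficient of variation from the standard deviation and mean quoted above. The function names `cv_percent` and `cv_of` are hypothetical helpers introduced only for illustration.

```r
# Coefficient of variation as a percentage, from summary statistics
cv_percent <- function(sd_value, mean_value) {
  100 * sd_value / mean_value
}

# Free sulfur dioxide: standard deviation and mean quoted in the text
cv_free <- cv_percent(17.01, 35.31)
print(round(cv_free, 2))

# The same idea applied directly to a numeric vector
cv_of <- function(x) 100 * sd(x) / mean(x)
```

With the full data available, `cv_of(wines$free.sulfur.dioxide)` would be the direct equivalent, assuming the column name shown in the `str()` output.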
The unbound sulfur dioxide values in our white wine dataset seem to be slightly positively skewed and widely dispersed, but centered around 35.31 ppm. The skewness is mainly due to the high-valued outliers, the highest of which is 289 ppm.
The sum of the free (unbound) and bound sulfites in wine determines the total amount of sulfur dioxide. The smell and taste of sulfur become apparent in wine when sulfur dioxide is present in high enough concentrations (free sulfur dioxide above about 50 ppm).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The mean of the total sulfur dioxide data is 138 ppm, which is close to the median of 134 ppm. This suggests that the data is roughly symmetric and approximately normally distributed, although there are also extreme values in the data.
The total sulfur dioxide IQR is 59 ppm, and the standard deviation of about 42.50 ppm amounts to 30.72% of the mean (coefficient of variation). This tells us that total sulfur dioxide is quite spread out around the mean, although it is less dispersed than the unbound sulfur dioxide values (CV = 48.17%).
The histogram of total sulfur dioxide amounts in our white wine dataset roughly follows a normal distribution centered around the mean of 138.4 ppm. The most frequently occurring total sulfur dioxide amount is 111 ppm.
Both the histogram and the boxplot show a slight right-side skew of the values in the distribution owing to the presence of some outliers beyond the IQR-based threshold of about 227 ppm (the mean plus 1.5 times the IQR).
I also observe how sparsely spread out the total sulfur dioxide values are around the interquartile area.
The density of wine is influenced by the concentration of alcohol, sugar, glycerol, and other dissolved solids. It serves as a measure of the conversion of sugar to alcohol. A low-density wine has more alcohol than sugar, and a high-density wine has more sugar than alcohol.
It is said that the density of wine is close to that of water. Dry wine has lower density compared to sweet wine, which has a higher density.
To provide perspective, we can compare our dataset’s white wine density with the density of some familiar substances.
Substance | Density in g/cm3 |
---|---|
Water | 1.000 |
Ethanol | 0.7890 |
Sugar | 1.5870 |
Given below is a summary of the density of the white wine observations in our dataset (in g/cm3).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The summary tells us that white wines in general tend to have a density that is close to that of water but leaning towards the alcohol side.
The wine density values tend to cluster around the mean (0.994 g/cm3). This is also reflected in the low IQR (0.004 g/cm3) and low standard deviation (0.003 g/cm3). The close agreement between the mean and the median (0.9937 g/cm3) suggests that the distribution of wine density is roughly normal.
The outliers on the positive end have values that can be considered feasible given that they are still closely comparable to the density of water with which wines share a similar characteristic. I will consider these outliers as an example of good data about some extreme cases.
The pH level of wine provides a way to gauge the ripeness of wine in relation to its acidity. pH values range from 0 (very acidic) through 14 (very alkaline or basic). Wines that have low pH values tend to taste crisp and tart. On the other hand, wines with a high pH value are more vulnerable to microbial growth. The general understanding is that there is an inverse relationship between pH and acidity: the lower the pH, the higher the acidity and vice versa.
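The inverse relationship between pH and acidity comes from the standard definition of pH as the negative base-10 logarithm of the hydrogen ion concentration. The sketch below illustrates this on two round pH values; the function name `h_conc` is hypothetical.

```r
# Hydrogen ion concentration (mol/L) implied by a given pH,
# using the standard definition pH = -log10([H+])
h_conc <- function(pH) 10^(-pH)

# One pH unit corresponds to a tenfold change in [H+]
ratio <- h_conc(3.0) / h_conc(4.0)
print(ratio)
```

Because the scale is logarithmic, even the small pH differences seen among wines correspond to noticeable differences in hydrogen ion concentration.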
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
According to winespectator.com, the acidity level of a wine works along with its other attributes, such as sweetness, alcohol, extract and tannin, to influence how the wine tastes. It also mentions that a pH level from 3.0 to 3.4 is desirable for white wines.
Since the mean and the median are almost identical, we can expect the pH distribution of white wines in our dataset to have a normal shape. Outliers on the positive end still tend to fall within the range of the usual pH values in most wines (between 2.9 and 3.9). At the low end of the pH summary, I see that we have white wine outliers in our dataset that come close to the acidity of distilled white vinegar (pH = 2.4).
The histogram of white wine pH levels is gaussian in shape centered around the mean of about 3.19. The data points also tend to be sparsely distributed and spread out around the mean as the jittered boxplot above shows. However, the pH standard deviation of 0.15 makes the pH data relatively less varied than the other feature distributions as this only accounts for about 4.74% of the mean (coefficient of variation).
One thing that I noticed is that the pH values between the mean and the maximum of 3.820 are relatively more spread out than those on the opposite side (values below the mean). This describes a slight positive skew owing to some outliers on the positive side. We can see this from the Q-Q plot shown below.
The quantile-quantile plot describes a gaussian distribution around the acidity scores within about 0.15 standard deviations from the mean.
Winemakers control the sulfur dioxide gas levels in their wine products by using sulphates. Sulfur dioxide is known to have antioxidant and antimicrobial properties that help preserve the quality of wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The average sulphate amount is slightly higher than the median sulphate amount which indicates a slightly positive skew for the sulphate distribution.
I can immediately see the presence of outliers in the sulphate data. The sulphate distribution has a normal shape with a slight positive skew.
Sulphates have a standard deviation of 0.11 g/dm3 and an IQR of 0.14 g/dm3. The standard deviation amounts to 23.3% of the mean, and this coefficient of variation means that the sulphate data is considerably dispersed compared to some of the other variables in our dataset.
The `alcohol` variable measures the percent alcohol content of the wine. Alcohol is among the characteristics of wine that influence its overall quality and taste. Although the optimal alcohol content is subjective and dependent on the type of wine, some would say that the ideal alcohol content is about 13.6 percent, while some wines taste exceptionally good at concentrations higher than 14.0 percent.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The white wines in our dataset have alcohol contents ranging from 8% to 14.2%. The typical wine has about a 10% alcohol level. Since the mean alcohol content of 10.51% is close to, but about 0.11 percentage points greater than, the median, we can expect a roughly symmetric set of values with a slight positive skew.
White wine alcohol levels in our dataset appear slightly skewed to the positive end. The boxplot tells us that about half of the wines have alcohol levels from 9.5% to 11.4%. The data points are quite varied, with a standard deviation of 1.23%. I look forward to finding out whether the wines that received high quality classifications concentrate at any particular alcohol level.
The `quality` variable represents a numeric score from 0 (very bad) to 10 (excellent). This ordered classification is based on the sensory evaluation made by wine experts. The summary statistics are given below.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
The data points representing the `quality` feature appear to be clustered around 6.0, and a majority of the wines had scores that fall within the one-point interval from 5 to 6 (IQR = 1).
I decided to use a bar plot to visualize the distribution of white wine quality scores because of the discrete nature of its data type.
The bar plot confirms our observation about the distribution of wines according to their quality scores - the scale runs from 0 through 10, but there are far more wines of typical quality (quality scores of 5, 6 and 7) than wines that are considered poor or exceptional. What follows is a table breaking down the number of wines in our dataset according to their quality classifications.
Quality | Frequency |
---|---|
3 | 20 |
4 | 163 |
5 | 1457 |
6 | 2198 |
7 | 880 |
8 | 175 |
9 | 5 |
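The frequency table can be turned into shares with a few lines of R. This is a minimal sketch that re-enters the counts from the table above by hand; `counts` and `typical_share` are illustrative names.

```r
# Frequencies copied from the quality table above
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

# Percentage of wines at each quality score
shares <- round(100 * counts / sum(counts), 1)
print(shares)

# Share of wines with a typical quality score of 5, 6 or 7
typical_share <- sum(counts[c("5", "6", "7")]) / sum(counts)
print(typical_share)
```

The counts add up to the 4,898 observations in the dataset, and scores 5 through 7 account for roughly 93% of them.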
I have given an overview of the 12 variables comprising each of the 4,898 sampled wines in the previous univariate plots section.
I briefly described each variable and some of its characteristics that may influence the quality of the wine. I also accompanied this with a short statistical summary of the feature’s data points (mean, median, minimum, maximum and quartile scores). I have collected those summaries in the table below to serve as a reference to the observations I have made regarding these variables.
Variable | Min | Q1 | Median | Q3 | Max | Range | IQR | Mean | SE |
---|---|---|---|---|---|---|---|---|---|
fixed.acidity | 3.800 | 6.300 | 6.800 | 7.300 | 14.200 | 10.400 | 1.000 | 6.855 | 0.012 |
volatile.acidity | 0.080 | 0.210 | 0.260 | 0.320 | 1.100 | 1.020 | 0.110 | 0.278 | 0.001 |
citric.acid | 0.000 | 0.270 | 0.320 | 0.390 | 1.660 | 1.660 | 0.120 | 0.334 | 0.002 |
residual.sugar | 0.600 | 1.700 | 5.200 | 9.900 | 65.800 | 65.200 | 8.200 | 6.391 | 0.072 |
chlorides | 0.009 | 0.036 | 0.043 | 0.050 | 0.346 | 0.337 | 0.014 | 0.046 | 0.000 |
free.sulfur.dioxide | 2.000 | 23.000 | 34.000 | 46.000 | 289.000 | 287.000 | 23.000 | 35.308 | 0.243 |
total.sulfur.dioxide | 9.000 | 108.000 | 134.000 | 167.000 | 440.000 | 431.000 | 59.000 | 138.361 | 0.607 |
density | 0.987 | 0.992 | 0.994 | 0.996 | 1.039 | 0.052 | 0.004 | 0.994 | 0.000 |
pH | 2.720 | 3.090 | 3.180 | 3.280 | 3.820 | 1.100 | 0.190 | 3.188 | 0.002 |
sulphates | 0.220 | 0.410 | 0.470 | 0.550 | 1.080 | 0.860 | 0.140 | 0.490 | 0.002 |
alcohol | 8.000 | 9.500 | 10.400 | 11.400 | 14.200 | 6.200 | 1.900 | 10.514 | 0.018 |
quality | 3.000 | 5.000 | 6.000 | 6.000 | 9.000 | 6.000 | 1.000 | 5.878 | 0.013 |
quality.f* | 1.000 | 3.000 | 4.000 | 4.000 | 7.000 | 6.000 | 1.000 | 3.878 | 0.013 |
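Assuming the SE column is the standard error of the mean, that is, the standard deviation divided by the square root of the sample size, it can be reproduced from the standard deviations quoted earlier. The sketch below does this for three variables; the vector names are illustrative.

```r
# Sample size and standard deviations quoted in the text
n <- 4898
sd_values <- c(free.sulfur.dioxide = 17.01, pH = 0.15, alcohol = 1.23)

# Standard error of the mean: sd / sqrt(n)
se <- sd_values / sqrt(n)
print(round(se, 3))
```

The rounded results (0.243, 0.002 and 0.018) agree with the SE column for these three variables.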
Listed below are the observations I have made regarding the variables of the white wine dataset.
The presence of outliers can be inferred from how the mean is greater than the median in every variable except `quality`. Also, for most variables the range is many times larger than the IQR. The outliers in these variables occur mostly at the right or positive end of the distribution.
The `alcohol` data has a dispersed, positively skewed distribution but the boxplot did not reveal any outliers.
Most variables like fixed acidity, volatile acidity, residual sugar, citric acid and chlorides have outliers. The distribution of these variables may be considered gaussian if their outliers were removed.
`density` and `free.sulfur.dioxide` are variables whose outliers are very far removed from the other data points.
The boxplot of `residual.sugar` shows a distribution so strongly positively skewed that removing the outliers at the positive end would have no significant effect on its shape.
A majority of the wines fell under the `quality` classifications of 5, 6 and 7. However, no wine received a quality score of 0, 1, 2 or 10. The frequency of wines that received scores near the tail ends of the distribution (3, 8 and 9) tapered down considerably.
Let us now look into analyses that involve any two variables of our white wine dataset. We will use pair-wise scatterplots as well as correlation coefficients to examine the associations between the features of our data.
We will begin with a matrix of scatterplots and correlation coefficients using all of the variables in our dataset.
We can see how most of the variables exhibit a positive skew. The `pH` variable is quite normal and the `alcohol` distribution is quite unusually shaped.
The pair-wise scatterplots help visualize any relationship that is present between variables. They also give an idea of how strong and in what direction the relationship goes.
The table below shows the Pearson’s correlation coefficient (\(r\)) for each pair of variables in the white wine dataset.
Variable | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
fixed.acidity | 1.00 | -0.02 | 0.29 | 0.09 | 0.02 | -0.05 | 0.09 | 0.27 | -0.43 | -0.02 | -0.12 | -0.11 |
volatile.acidity | -0.02 | 1.00 | -0.15 | 0.06 | 0.07 | -0.10 | 0.09 | 0.03 | -0.03 | -0.04 | 0.07 | -0.19 |
citric.acid | 0.29 | -0.15 | 1.00 | 0.09 | 0.11 | 0.09 | 0.12 | 0.15 | -0.16 | 0.06 | -0.08 | -0.01 |
residual.sugar | 0.09 | 0.06 | 0.09 | 1.00 | 0.09 | 0.30 | 0.40 | 0.84 | -0.19 | -0.03 | -0.45 | -0.10 |
chlorides | 0.02 | 0.07 | 0.11 | 0.09 | 1.00 | 0.10 | 0.20 | 0.26 | -0.09 | 0.02 | -0.36 | -0.21 |
free.sulfur.dioxide | -0.05 | -0.10 | 0.09 | 0.30 | 0.10 | 1.00 | 0.62 | 0.29 | 0.00 | 0.06 | -0.25 | 0.01 |
total.sulfur.dioxide | 0.09 | 0.09 | 0.12 | 0.40 | 0.20 | 0.62 | 1.00 | 0.53 | 0.00 | 0.13 | -0.45 | -0.17 |
density | 0.27 | 0.03 | 0.15 | 0.84 | 0.26 | 0.29 | 0.53 | 1.00 | -0.09 | 0.07 | -0.78 | -0.31 |
pH | -0.43 | -0.03 | -0.16 | -0.19 | -0.09 | 0.00 | 0.00 | -0.09 | 1.00 | 0.16 | 0.12 | 0.10 |
sulphates | -0.02 | -0.04 | 0.06 | -0.03 | 0.02 | 0.06 | 0.13 | 0.07 | 0.16 | 1.00 | -0.02 | 0.05 |
alcohol | -0.12 | 0.07 | -0.08 | -0.45 | -0.36 | -0.25 | -0.45 | -0.78 | 0.12 | -0.02 | 1.00 | 0.44 |
quality | -0.11 | -0.19 | -0.01 | -0.10 | -0.21 | 0.01 | -0.17 | -0.31 | 0.10 | 0.05 | 0.44 | 1.00 |
Equipped with the pair-wise visualizations as well as the correlation coefficient matrix, we can glean the following bivariate associations. These variable pairs produced some notable \(r\) values; and by notable, I mean an absolute value of at least 0.40 (\(|r| \geq 0.40\)).
Scatterplots with at least weak positive relationships:
Variable 1 | Variable 2 | Correlation Coefficient |
---|---|---|
Total sulfur dioxide | Residual sugar | 0.40 |
Free sulfur dioxide | Total sulfur dioxide | 0.62 |
Residual sugar | Density | 0.84 |
Total sulfur dioxide | Density | 0.53 |
Alcohol | Quality | 0.44 |
Scatterplots with at least weak negative relationships:
Variable 1 | Variable 2 | Correlation Coefficient |
---|---|---|
Fixed acidity | pH | -0.43 |
Residual sugar | Alcohol | -0.45 |
Total sulfur dioxide | Alcohol | -0.45 |
Density | Alcohol | -0.78 |
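The selection of notable pairs can be sketched programmatically. To keep the snippet self-contained, it uses only a four-variable excerpt of the correlation matrix above, typed in by hand; the object names are illustrative, and with the full data one would start from `cor(wines)` instead.

```r
# Excerpt of the correlation matrix above (four variables only)
vars <- c("residual.sugar", "density", "alcohol", "quality")
r <- matrix(c(
   1.00,  0.84, -0.45, -0.10,
   0.84,  1.00, -0.78, -0.31,
  -0.45, -0.78,  1.00,  0.44,
  -0.10, -0.31,  0.44,  1.00
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once (upper triangle) where |r| >= 0.40
idx <- which(upper.tri(r) & abs(r) >= 0.40, arr.ind = TRUE)
notable <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)

# Strongest associations first
notable <- notable[order(-abs(notable$r)), ]
print(notable)
```

Within this excerpt, four pairs clear the threshold, led by residual sugar and density.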
I would like to examine each pair of explanatory variables by making some bivariate plots. Please note that since a majority of the variables had positively skewed distributions, I have constructed the plots in such a way as to exclude the higher valued outliers. What appears in the pair-wise scatterplot are the data points that fall roughly within the whiskers of the x-axis variable’s boxplot.
We could think of it as zooming into the more relevant portion of the sets of related data points. I used the following formula to determine the boundaries of these x and y axis values: \(\bar{x} \ \pm \ 1.5 \cdot IQR\). Note that this is a mean-centered variant of the usual boxplot fences, which are \(Q_1 - 1.5 \cdot IQR\) and \(Q_3 + 1.5 \cdot IQR\).
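The limits given by this formula can be sketched for one variable. The snippet below uses the total sulfur dioxide values from the summary table and, for comparison only, also computes the conventional quartile-based boxplot fences. The helper `within_limits` and the four test values are hypothetical.

```r
# Summary values for total.sulfur.dioxide, copied from the summary table
x_mean <- 138.361
q1 <- 108
q3 <- 167
iqr <- q3 - q1

# Mean-centered limits, following the formula above
mean_limits <- c(lower = x_mean - 1.5 * iqr, upper = x_mean + 1.5 * iqr)

# Conventional quartile-based boxplot fences, for comparison
tukey_fences <- c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)

# Keep only the values that fall inside a pair of limits
within_limits <- function(x, limits) {
  x[x >= limits[["lower"]] & x <= limits[["upper"]]]
}

print(mean_limits)
print(tukey_fences)
kept <- within_limits(c(9, 134, 230, 440), mean_limits)
print(kept)
```

For this variable the mean-centered upper limit is about 227 ppm, somewhat tighter than the quartile-based fence of 255.5 ppm.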
I have sampled 2000 observations from the white wine dataset in constructing this first plot so as to minimize overplotting.
The Pearson’s r correlation coefficient of 0.40 indicates a weak and positive relationship between total sulfur dioxide and residual sugar.
We see a moderate and positive relationship between the total sulfur dioxide values and the free sulfur dioxide values. The r coefficient between the bivariate data is 0.62.
The `density` values never approach zero, so none of the data points touch the x-axis. The correlation coefficient of 0.84 shows a strong positive relationship between `residual.sugar` and `density`.
A Pearson’s correlation coefficient of 0.53 indicates a moderate positive relationship between total sulfur dioxide and white wine density.
Above is a jittered scatterplot showing a correlation coefficient of 0.44 that indicates a moderately positive relationship between `quality` and `alcohol`.
The next round of plots deals with the negative bivariate relationships among the features of the white wine dataset.
Overall acidity (pH) is negatively correlated with fixed acidity levels. This correlation is weak (r = -0.43).
Residual sugar and total sulfur dioxide, which share equivalent correlation coefficients at -0.45, are both weakly associated with alcohol.
The inverse relationship between `alcohol` and `density` is the most notable plot in this group. With a correlation coefficient of -0.78, it describes a strong negative relationship between the two explanatory features.
The `quality` variable is the natural variable of choice when attempting bivariate analysis on the white wine dataset. It is common, and well reasoned, to consider how the other features might influence the classified outcome.
However, I am inclined to examine the interactions among the explanatory variables because they might provide some ideas on how to approach the examination of the response variable. Moreover, there may be some practical benefit in paying closer attention to these relationships, as their analysis may provide useful information for downstream statistical inference or machine learning (e.g. feature selection).
The graph below shows again the moderately positive relationship between `free.sulfur.dioxide` and `total.sulfur.dioxide`.
The Pearson’s correlation coefficient of 0.62 between the free sulfur dioxide and the total sulfur dioxide in the white wine dataset may be described as an expected result because one can be said to be a part of the other.
If the total sulfur dioxide of white wine comprises all the different forms that sulfur dioxide may take (like free sulfur dioxide), then it is reasonable to say that whenever the quantity of free sulfur dioxide increases, the quantity of total sulfur dioxide may be assumed to increase as well.
The same reasoning may be applied to the relationship of `sulphates` to `free.sulfur.dioxide` and `total.sulfur.dioxide`, although these relationships are shown to be very weak and positive (0.06 and 0.13 respectively).
The advantage of having a moderately close relationship between variables that are found to be composites of each other is that we can opt to use just one of them in our statistical calculations with the reason being that one can always be explained by the other.
The variables `alcohol`, `residual.sugar` and `density` were listed frequently in our list of notable bivariate correlation pairs.
The prominent graph among the three visualizations above is the strong positive relationship between residual sugar and density (\(r = 0.84\)). A high amount of residual sugar is associated with a high density value; a low amount of residual sugar is in turn associated with a low density value.
This is followed by the graph showing the strong negative relationship between alcohol and density \((r = -0.78)\). A low alcohol content value is associated with a high density value while a high alcohol content value is associated with a low density value.
We could glean another type of relationship from the observations we just described, which both share a common variable: `density`. If density is positively related to residual sugar and inversely related to alcohol, then we could say that residual sugar would have some sort of relationship with alcohol.
A decreased amount of residual sugar in white wine is associated with a decreased measure of density. On the other hand, a decreased measure of density is associated with an increased alcohol value. Correlation is not transitive in general, but because both of these associations are strong, we can reasonably expect that a decrease in residual sugar is accompanied by an increase in alcohol levels in white wine.
We can see how the plots above support these conjectures. In fact, the inverse relationship between residual sugar and alcohol is expressed by a correlation coefficient (\(r = -0.45\)) which suggests a moderately negative relationship between residual sugar and alcohol.
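The chain of reasoning through `density` can be made a little more precise. Given two correlations that share a variable, the third correlation is confined to a range implied by the requirement that a correlation matrix be positive semidefinite. The sketch below computes that range for the coefficients in the correlation matrix; `r_bounds` is a hypothetical helper, not part of the original analysis.

```r
# Feasible range for r(x, z) given r(x, y) and r(y, z):
# r_xy * r_yz +/- sqrt((1 - r_xy^2) * (1 - r_yz^2))
r_bounds <- function(r_xy, r_yz) {
  centre <- r_xy * r_yz
  half_width <- sqrt((1 - r_xy^2) * (1 - r_yz^2))
  c(lower = centre - half_width, upper = centre + half_width)
}

# residual.sugar vs density (0.84) and density vs alcohol (-0.78)
bounds <- r_bounds(0.84, -0.78)
print(round(bounds, 2))
```

The range comes out at roughly -0.99 to -0.32, so with these two coefficients the correlation between residual sugar and alcohol has to be negative, and the observed value of -0.45 falls inside it.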
What does this all mean for us? I think having this knowledge would be an advantage later if one decides to construct a statistical model using the variables in the white wine dataset. Since `residual.sugar` and `density` have been shown to vary closely together, the variance inflation caused by multicollinearity in statistical modelling may be mitigated by excluding `density`, as much of the variation in `density` can be explained by `residual.sugar`.
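One common way to quantify this concern, which is not used in the analysis itself, is the variance inflation factor (VIF). The sketch below computes it from the pairwise correlation alone, so it should be read as a lower bound on the VIF that `density` would get in a model containing all the other predictors. The variable names are illustrative.

```r
# Pairwise correlation between residual.sugar and density (from the matrix)
r_sugar_density <- 0.84

# R-squared of a simple regression of density on residual.sugar
r_squared <- r_sugar_density^2

# Variance inflation factor implied by this single relationship
vif_pairwise <- 1 / (1 - r_squared)
print(round(vif_pairwise, 2))
```

A value of about 3.4 from one predictor alone already signals notable overlap; adding `alcohol` and the remaining variables can only raise it.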
White wine quality has been shown to exhibit a moderately positive association with alcohol level \((r = 0.44)\). This means that poor or low quality wines may be associated with low alcohol levels while high quality wines are associated with higher alcohol contents.
We can examine this relationship more closely by looking into the alcohol level among wines which were similarly classified in terms of quality.
The set of density plots starts off from the bottom, where we can see a widely dispersed distribution of alcohol values which represents the poor class of white wines (`quality = 3`).
As we rise up the quality scale, the median seems to move towards the lower end of the alcohol values, with the 9 - 10% alcohol level becoming most frequent in the average quality wine group (`quality = 5`). At this point, the right-skewed distribution is apparent. We can say that there are generally more white wines that have lower alcohol content levels among poorer to average class wines.
A gradual shift begins to happen as we climb higher in the quality ladder. Beginning from the quality classification of 7, the general alcohol level steadily increases. Exceptional wines (quality 8 and 9) show a negatively skewed distribution. Wines which received the highest quality ratings show an interesting bimodal distribution.
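The shifts in center and skew described above can also be checked numerically instead of by eye. Below is a rough sketch that tabulates the median and a moment-based skewness of alcohol for each quality score. The data are simulated stand-ins, and the `skewness` helper is a hypothetical name introduced here, not something from the original analysis.

```r
# Simulated stand-in data (NOT the real wine data): alcohol is drawn so that
# scores above 5 tend to have higher alcohol.
set.seed(1)
quality <- sample(3:9, size = 2000, replace = TRUE,
                  prob = c(0.01, 0.03, 0.30, 0.45, 0.17, 0.035, 0.005))
alcohol <- 8 + rgamma(2000, shape = 2, scale = 0.6) + 0.35 * pmax(quality - 5, 0)
toy <- data.frame(quality = quality, alcohol = alcohol)

# Hypothetical helper: moment-based sample skewness.
# Positive values indicate a right skew, negative values a left skew.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

med  <- tapply(toy$alcohol, toy$quality, median)
skew <- tapply(toy$alcohol, toy$quality, skewness)
by_quality <- data.frame(quality  = names(med),
                         n        = as.vector(table(toy$quality)),
                         median   = as.vector(med),
                         skewness = as.vector(skew))
print(by_quality, digits = 3)
```

Applied to the real data, a table like this would sit next to the density plots as a numeric cross-check of the skew directions read off the curves.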
Reclassifying Quality Scores
Another way of visualizing the relationship between alcohol
and quality
is by consolidating the quality factor levels into two major groups: white wines that received quality scores of 8 or above will be classified as “Excellent” while everything else will be classified as “Inferior”.
# Create a column to reclassify the quality score
# Excellent : { 8, 9 }
# Inferior : { 3, 4, 5, 6, 7 }
wines$quality.q2 <- ifelse(wines$quality %in% 3:7, "Inferior", "Excellent")
# Convert the variable to a factor, with "Inferior" as the first level
wines$quality.q2 <- factor(wines$quality.q2, levels = c("Inferior", "Excellent"))
I introduced a new variable quality.q2
to hold the factored results of our groupings. The table below tallies the frequencies of the wines that belong to each group based on their quality
scores.
Group | Frequency |
---|---|
Inferior | 4718 |
Excellent | 180 |
Since we have just assigned each wine to one of two new classes, Excellent
or Inferior
, we can construct a couple of density line plots that make use of this new information.
The consolidation of several quality scores into a set of binary classification groups (Excellent
and Inferior
) simply reaffirms the results of our previous density plot.
It is now easier to see how inferior wines tend to have lower alcohol levels in general than wines whose quality scores were higher.
by(data = wines$alcohol, wines$quality.q2, mean)
## wines$quality.q2: Inferior
## [1] 10.47089
## --------------------------------------------------------
## wines$quality.q2: Excellent
## [1] 11.65111
The means of the alcohol content for the Inferior and Excellent wine groups are shown above by using the by()
function. Excellent wines have a higher mean alcohol content, and the difference is about 1.18 percentage points.
An interesting plot that I encountered is the one shown below, which displays the distribution of chlorides (saltiness) within each quality group.
The binned density plots above seem to indicate that an increased amount of chlorides or salt is associated with a rise in the quality of poor wines, but only up to a point. Starting from a quality score of 6, further improvement in quality scores is instead accompanied by a reduction in chloride levels.
One of the goals of this project is to discover those features that may be common among white wines that received commendable quality scores.
The bivariate relationships between alcohol and density, between residual sugar and density, and between alcohol and residual sugar were relegated to their own separate discussions earlier. By doing so, I was able to discover and infer how density seems to take on an intermediate role in what seems like a more primary association between alcohol and residual sugar.
Let us study this further. I would like to try and visualize the alcohol, density and residual sugar features of wines through a multivariate plot.
I will start by adding a new field called density_quantile
to the existing wines
dataset. The new field will hold the decile that each wine observation’s density
value falls into, according to the overall density
distribution. Below is the result of the density_quantile
classification.
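The code that builds this field is not shown, so the following is only a sketch of one way a decile field like this could be created, using `quantile()` for the cut points and `cut()` for the binning. The density values are simulated and the labels 1 to 10 are an assumption.

```r
# Simulated stand-in for the density column (NOT the real wine data).
set.seed(7)
toy <- data.frame(density = rnorm(1000, mean = 0.994, sd = 0.003))

# Decile cut points: the 0%, 10%, ..., 100% quantiles of density.
breaks <- quantile(toy$density, probs = seq(0, 1, by = 0.1))

# Assign each observation to its decile (1 = least dense, 10 = most dense).
# include.lowest = TRUE keeps the minimum value inside the first bin.
toy$density_quantile <- cut(toy$density, breaks = breaks,
                            labels = 1:10, include.lowest = TRUE)
print(table(toy$density_quantile))
```

Because the bins are defined by quantiles, each one ends up holding roughly the same number of wines, which makes the color scale in a scatterplot easier to compare across bins.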
The graph below shows the log-transformed alcohol
level and residual.sugar
data points for each white wine observation in the dataset. I used the observation’s density_quantile
classification to assign a color to each data point.
The inverse relationship between alcohol
and residual.sugar
is depicted by the negatively sloped fit line.
It also makes apparent the proportional relationship between density
and residual.sugar
. Wines that belong to lower density quantiles (yellow) also appear in the lower valued portion of the residual sugar scale on the left and wines that have higher density quantiles (blue) accompany the data points that have a higher residual sugar content towards the right.
The inverse relationship between alcohol
and density
is also evident from the graph since yellow data points indicating a lower density quantile value are prevalent at high alcohol levels and blue data points indicating a higher density quantile value are prevalent at low alcohol levels.
For the next visualizations, I will reclassify the quality
scores again into three levels: 3 for excellent wines (scores above 6), 2 for fair (scores of 5 and 6), and 1 for poor (scores of 3 and 4).
Group | Frequency |
---|---|
1 | 183 |
2 | 3655 |
3 | 1060 |
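A three-level regrouping like the one tabulated above can be written compactly with `cut()`. This is a minimal sketch on a handful of toy scores. The variable name `quality.q3` is hypothetical, and the original may well have used a different construction.

```r
# Toy quality scores covering the full 3-9 range (NOT the real wine data).
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)

# cut() uses right-closed intervals by default:
#   (2,4] -> 1 (poor), (4,6] -> 2 (fair), (6,9] -> 3 (excellent)
quality.q3 <- cut(quality, breaks = c(2, 4, 6, 9), labels = c(1, 2, 3))
print(table(quality.q3))
```

The break points sit between the integer scores, so every score from 3 to 9 falls into exactly one group.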
I noticed some interesting plots while examining different combinations of the supporting variables. residual.sugar
and density
have an apparent relationship with each other. Also, pH
and fixed.acidity
seem to have an interesting association as well.
It has been discussed earlier that residual.sugar
and density
are positively and relatively strongly correlated \((r = 0.84)\). How does this correlation come into play when we include the available information about each wine’s quality
?
It looks like there is a pattern among excellent and fair wines while I cannot discern any regular behavior among the data points that are described as poor quality wines.
Wines containing higher amounts of residual sugar conditioned on a particular level of density tend to have a higher quality rating. This can be seen on the plot where for a certain level of density
measurement, there are relatively more blue points (excellent wines with quality scores of 7, 8 or 9) at the higher (vertical) location. This observation is prevalent among wines with densities up to about 0.995 g/cm3. The same can also be said about the fair wines (yellow).
Assigning each wine according to the three-score quality specification allows me to examine the multivariate relationships between pH
, fixed.acidity
and quality
. Note that we have eliminated the outliers on the positive end by retaining only the data points at or below the 99th percentile of both participating variables.
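The percentile-based trimming mentioned above could look roughly like the sketch below. It runs on simulated values, and `ph_cap`, `fa_cap` and `trimmed` are names I made up for the illustration.

```r
# Simulated stand-ins for the two variables (NOT the real wine data).
set.seed(3)
toy <- data.frame(pH            = rnorm(2000, mean = 3.19, sd = 0.15),
                  fixed.acidity = 6.05 + rexp(2000, rate = 1.2))

# Upper cut-offs: the 99th percentile of each variable.
ph_cap <- quantile(toy$pH, probs = 0.99)
fa_cap <- quantile(toy$fixed.acidity, probs = 0.99)

# Keep only the rows at or below both cut-offs.
trimmed <- subset(toy, pH <= ph_cap & fixed.acidity <= fa_cap)
print(c(before = nrow(toy), after = nrow(trimmed)))
```

Since each cut-off removes about 1% of the rows, trimming on two variables drops at most about 2% of the observations.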
The same moderately negative association between pH
and fixed.acidity
can be seen in each of the quality
groups \((r = -0.43)\). The most apparent difference between the groups is how sparse and spread out the data points are among the excellent and poor wine groups. Fair wines (yellow) tend to cluster around the mean fixed.acidity
of 6.85 g/dm3 and the mean pH
of 3.19.
Something quite subtle is going on, however, when we look closely at the cluster of data points as the quality score increases.
The center of the fixed acidity seems to move to the left or towards the lower valued fixed acidity levels as the wine quality elevates from poor (quality scores of 3 or 4) to excellent (quality scores of 7, 8 or 9).
Quality Group | Quality Score | Mean Fixed Acidity (g/cu. dm.) |
---|---|---|
Poor | 3, 4 | 7.18 |
Fair | 5, 6 | 6.88 |
Excellent | 7, 8, 9 | 6.73 |
We can verify how the mean fixed acidity seems to decrease by querying the fixed.acidity
of wines conditioned on quality (1 = Poor, 2 = Fair, 3 = Excellent) as well as displaying the boxplot for fixed.acidity
binned by the three quality levels (above).
I did not set out to find relationships between the explanatory variables themselves, nor was I expecting them. My stated goal at the outset was to look for those features that make some wines more amenable to the tastes of experts.
Yet I was surprised to see several interactions among the predictors which the visualizations made more prominent. For instance, high density wines tend to have high sugar content, and as sugar content increases, the alcohol content wanes. Since we know from a prior finding that alcohol content is moderately and positively correlated with the quality of wine, we can thus infer that exceptional wines might have the tendency to be less dense.
Another interesting interaction that I saw is how the exceptional and fair wines prominently exhibited a strong relationship between their residual sugar and wine density properties.
I have extracted the fair and exceptional wine data points and show each group’s scatterplot to display the tight proportional relationship between residual sugar and density.
Quality Group | Quality Score | Correlation Coefficient |
---|---|---|
Poor | 3, 4 | 0.74 |
Fair | 5, 6 | 0.86 |
Excellent | 7, 8, 9 | 0.82 |
The exceptional and fair wine groups both show strong, positive correlations between residual sugar content and density, and the relationship is most prominent among the fair wines \((r = 0.86)\).
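Per-group correlation coefficients like those in the table can be computed by splitting the data on the quality group and applying `cor()` to each piece. The sketch below uses simulated data with noise levels I picked arbitrarily, so its numbers will not match the table.

```r
# Simulated stand-in data (NOT the real wine data).
set.seed(11)
n <- 900
group <- factor(sample(c("Poor", "Fair", "Excellent"), n, replace = TRUE),
                levels = c("Poor", "Fair", "Excellent"))
sugar <- rexp(n, rate = 1 / 6)
# Arbitrary per-group noise levels, chosen only for illustration.
noise_sd <- c(Poor = 0.0020, Fair = 0.0008, Excellent = 0.0012)[as.character(group)]
density <- 0.991 + 0.0004 * sugar + rnorm(n, sd = noise_sd)
toy <- data.frame(group = group, residual.sugar = sugar, density = density)

# Split by quality group, then compute Pearson's r within each group.
r_by_group <- sapply(split(toy, toy$group),
                     function(d) cor(d$residual.sugar, d$density))
print(round(r_by_group, 2))
```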
We then follow this with a similarly curious observation about pH
, fixed.acidity
and quality
where the inverse relationship between pH
and fixed.acidity
is examined along quality classification groups. Upon closer scrutiny, we find that mean fixed acidity within quality groups decreases as the quality of wine increases.
The presence of the input variables as well as the availability of a label in each of the examples in our white wine dataset certainly makes the idea of generating a model very appealing. So, I decided to pursue it using linear regression to model the relationship between the wines’ physicochemical properties and their assessed quality.
Preparing the Data
Part of the univariate analysis of each feature is the graphical presentation of its distribution using a boxplot. The boxplot lets us discover and identify outliers. Most of the extreme outliers are located on the positive end of the values and it has been shown that the variables’ distributions are positively skewed. Because we are dealing with multivariate data, we would consider only those data points among predictors that lie within the limits bounded by the boxplots. This means that we shall deem those predictor values that fall above \(Q3 + 1.5 \times IQR\) as outliers.
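The upper-fence rule stated above can be sketched as follows. This is an illustration on simulated predictors, not the code used to produce `wines.outdf`. The object names `upper_fence`, `keep` and `toy.outdf` are assumptions.

```r
# Simulated stand-in data with right-skewed predictors (NOT the real wine data).
set.seed(5)
toy <- data.frame(
  residual.sugar = rexp(1000, rate = 1 / 6),
  chlorides      = rlnorm(1000, meanlog = log(0.045), sdlog = 0.4),
  quality        = sample(3:9, 1000, replace = TRUE)
)
predictors <- setdiff(names(toy), "quality")

# Upper fence for each predictor: Q3 + 1.5 * IQR.
upper_fence <- sapply(toy[predictors], function(x) {
  q <- quantile(x, probs = c(0.25, 0.75))
  unname(q[2] + 1.5 * (q[2] - q[1]))
})

# Keep a row only if every predictor is at or below its own fence.
keep <- Reduce(`&`, Map(function(col, fence) toy[[col]] <= fence,
                        predictors, upper_fence))
toy.outdf <- toy[keep, ]
print(c(before = nrow(toy), after = nrow(toy.outdf)))
```

Note that a row is dropped as soon as any one predictor exceeds its fence, so with many predictors the share of excluded rows can add up quickly.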
The dataset will then be randomly split into training data and test data. The training data will consist of 80% of the main dataset while the remaining 20% will be set aside as test data.
After the data preparation, we end up with the following datasets.
White Wine Data Records | Name | Record Count |
---|---|---|
Original Dataset | wines.orig | 4898 |
Outliers Excluded | wines.outdf | 4074 |
Training Set (80% of wines.outdf) | trainSet | 3259 |
Test Set (20% of wines.outdf) | testSet | 815 |
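A random 80/20 split with base R could look like the sketch below. The splitting code itself is not shown in this write-up, so the use of `sample()`, the seed, and the stand-in data frame `toy.outdf` are all assumptions. Only the row count of 4,074 is taken from the table above.

```r
set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Stand-in data frame with the same number of rows as wines.outdf.
toy.outdf <- data.frame(id = 1:4074,
                        quality = sample(3:9, 4074, replace = TRUE))

# Draw 80% of the row indices at random for training; the rest is the test set.
n         <- nrow(toy.outdf)
train_idx <- sample(n, size = floor(0.8 * n))
trainSet  <- toy.outdf[train_idx, ]
testSet   <- toy.outdf[-train_idx, ]
print(c(train = nrow(trainSet), test = nrow(testSet)))
```

With 4,074 rows, `floor(0.8 * n)` gives 3,259 training rows and leaves 815 for testing, which agrees with the record counts in the table.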
So, we now fit a linear regression model to the features of the white wine dataset using the lm()
function.
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + density +
## pH + sulphates + alcohol, data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3992 -0.5197 -0.0497 0.4527 2.7696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.985e+02 2.995e+01 6.627 4.00e-11 ***
## fixed.acidity 1.340e-01 3.068e-02 4.367 1.30e-05 ***
## volatile.acidity -1.956e+00 1.790e-01 -10.927 < 2e-16 ***
## citric.acid 7.494e-02 1.549e-01 0.484 0.62845
## residual.sugar 9.574e-02 1.131e-02 8.467 < 2e-16 ***
## chlorides -3.632e+00 1.613e+00 -2.252 0.02439 *
## free.sulfur.dioxide 5.866e-03 9.540e-04 6.148 8.79e-10 ***
## density -1.995e+02 3.035e+01 -6.573 5.71e-11 ***
## pH 9.431e-01 1.463e-01 6.447 1.31e-10 ***
## sulphates 8.017e-01 1.408e-01 5.693 1.36e-08 ***
## alcohol 1.207e-01 3.846e-02 3.139 0.00171 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7489 on 3248 degrees of freedom
## Multiple R-squared: 0.2545, Adjusted R-squared: 0.2522
## F-statistic: 110.9 on 10 and 3248 DF, p-value: < 2.2e-16
The lm()
function generated a best fit model over the ten input variables comprising our training set of 3,259 white wine examples. All variables except citric.acid
had coefficient estimates that were significant at the 0.05 level, with most of them significant at the 0.001 level. The coefficient of determination is rather low at 0.2545.
Sum of Squared Errors (SSE)
SSE <- sum(model1$residuals^2)
SSE
## [1] 1821.442
The sum of the squares of the residuals from a model that used all available predictors is about 1,821. The histogram above shows the distribution of the residuals and it appears to be normally distributed.
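As a quick consistency check, the SSE can be tied back to the residual standard error reported in the model summary, since the latter is the square root of SSE divided by the residual degrees of freedom. The two input numbers below are copied from the outputs above. The variable names are mine.

```r
# Values copied from the model summary and the SSE output above.
sse         <- 1821.442
residual_df <- 3248   # 3259 training rows minus 11 estimated coefficients

# Residual standard error = sqrt(SSE / residual degrees of freedom).
rse <- sqrt(sse / residual_df)
print(round(rse, 4))
```

The result comes out at about 0.7489, which agrees with the residual standard error printed by `summary()`.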
Sample of Predicted Quality Scores
The multivariable linear regression model is applied on our test set and we show the first few predicted quality values in the table below.
# Make test set predictions
predictOnTest <- predict(model1, newdata = testSet)
Index | Quality |
---|---|
2 | 5 |
4 | 6 |
14 | 7 |
16 | 6 |
19 | 6 |
22 | 6 |
Compute R2
# Out-of-sample R^2: compare test-set squared errors against a mean-only baseline
SSE <- sum((testSet$quality - predictOnTest)^2)
SST <- sum((testSet$quality - mean(wines.orig$quality))^2)
1 - SSE/SST
## [1] 0.2477108
The coefficient of multiple determination on the test set is calculated to be 0.2477. This can be interpreted to mean that our model explains only about 25% of the variability of the wine quality around its mean.
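The test-set \(R^2\) depends on which mean is used as the baseline in SST. The calculation above uses the mean of the original dataset, while another common convention is the mean of the training set. The sketch below wraps the calculation in a hypothetical helper, `r2_out_of_sample`, so the baseline is explicit. It runs on simulated data and is not meant to reproduce the 0.2477 figure.

```r
# Hypothetical helper: 1 - SSE/SST on held-out data, with the baseline mean
# passed in explicitly so the choice is visible.
r2_out_of_sample <- function(actual, predicted, baseline_mean) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - baseline_mean)^2)
  1 - sse / sst
}

# Simulated stand-in data (NOT the real wine data).
set.seed(2024)
n <- 1000
x <- rnorm(n)
y <- 6 + 0.45 * x + rnorm(n, sd = 0.8)
toy   <- data.frame(x = x, y = y)
train <- toy[1:800, ]
test  <- toy[801:1000, ]

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

r2_train_baseline <- r2_out_of_sample(test$y, pred, mean(train$y))
r2_full_baseline  <- r2_out_of_sample(test$y, pred, mean(toy$y))
print(round(c(train_mean = r2_train_baseline, full_mean = r2_full_baseline), 4))
```

On a large, randomly split dataset the two baselines usually give nearly identical values, so the choice matters mostly for small or unevenly split samples.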
I began this exploratory data analysis of the white wine dataset with a cursory description of its dimensions, structure and data types. I also examined and plotted each feature’s distribution and found that most of the variables were positively skewed. Most of the variables have outliers in the high end of the distribution, and if we remove these outliers, most of the variables except density
and chlorides
will exhibit an approximately normal distribution.
Variable | Mean | Unit |
---|---|---|
Fixed acidity | 6.86 | g per cu. dm. |
Volatile acidity | 0.28 | g per cu. dm. |
Citric acid | 0.33 | g per cu. dm. |
Residual sugar | 6.39 | g per cu. dm. |
Chlorides | 0.05 | g per cu. dm. |
Free sulfur dioxide | 35.31 | mg per cu. dm. |
Total sulfur dioxide | 138.40 | mg per cu. dm. |
Density | 0.99 | g per cu. cm. |
pH | 3.19 | |
Sulphates | 0.49 | g per cu. dm. |
Alcohol | 10.51 | % by volume |
Quality | 6 | out of 10 |
From the summary of the attributes above, we can glean the profile of a typical white wine.
These descriptive measures give us useful yet superficial understanding of our data. And so I had to dig deeper to find clues as to which features influence our wine experts’ assessment of each wine’s overall desirability. Running a correlation matrix quickly shows that alcohol
surpasses the others as a predictor of quality
with a correlation coefficient of 0.44.
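Ranking the predictors by their correlation with quality takes only a few lines. Here is a sketch on simulated data. The coefficients were chosen by me so that alcohol comes out on top, and the printed values are not the real correlations.

```r
# Simulated stand-in data (NOT the real wine data).
set.seed(9)
n <- 1000
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 0.994 - 0.0015 * (alcohol - 10.5) + rnorm(n, sd = 0.002)
pH      <- rnorm(n, mean = 3.19, sd = 0.15)
quality <- round(5.9 + 0.33 * (alcohol - 10.5) + rnorm(n, sd = 0.8))
toy <- data.frame(alcohol = alcohol, density = density, pH = pH, quality = quality)

# Take the quality column of the correlation matrix, drop quality itself,
# and sort the predictors by absolute correlation.
r_quality <- cor(toy)[, "quality"]
r_quality <- r_quality[names(r_quality) != "quality"]
ranked    <- r_quality[order(abs(r_quality), decreasing = TRUE)]
print(round(ranked, 2))
```

Sorting by absolute value keeps strongly negative predictors, such as density in this toy example, near the top instead of at the bottom of the list.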
Since we know about the moderately positive relationship between alcohol and wine quality, we can then explore the interactions between alcohol and the other variables, because whatever relationships we uncover might reasonably point to their indirect effect on wine quality.
To remind us of these interesting relationships, I will recapitulate the notable findings from the multivariate analysis that we previously conducted.
The visualization above shows how we can augment our understanding of the role of alcohol as a somewhat loose determiner of wine quality. Good wines tend to be associated with high alcohol content. It is apparent from this plot that high alcohol levels tend to be associated with low residual sugar content and low density levels.
The results of this data exploration about the relationship between alcohol and residual sugar (and density) are very interesting because they provide a data-centric affirmation of what actually transpires in wine-making: alcohol is a by-product of the fermentation carried out by yeasts on the natural sugars of grapes. So we may naturally expect the alcohol level to increase with a decrease in residual sugar content.
Sugar chemistry helps to explain its role in wine. One of these is fermentation, where yeasts metabolize sugars for energy, yielding alcohol as a major byproduct. - calwineries.com
Again, we can go further and verify the indirect influence of residual sugar and density on wine quality. The scatterplot above manages to depict these subtle, indirect associations. Granted that the plot above shows an obvious direct and proportional relationship between wine density and sugar content, an interesting finding is that this relationship is a lot stronger among fair quality wines than among excellent or poor wines.
The positive relationship between density and residual sugar as shown in the scatter plot also supports the practice of merely using wine density as an alternative measure to gauge the alcohol level that is present in wines as well as an indicator of the conversion of sugar to alcohol.
Difficulties and Challenges I Encountered During the Analysis
One of the reasons I chose to tackle this study is because I like wine. It afforded me the opportunity to deepen my knowledge about this type of drink beyond the superficial appreciation of its taste, texture, smell, and how it enhances the enjoyment of the food it accompanies. Yet when I started digging into the physicochemical data, that is when I realized how little I knew about what is actually involved in the making of this fine beverage.
The first challenge for me was determining where to even start with the data. There was a total of 12 variables shared by 4,898 observations. The way I managed this hurdle was to just let the data speak for itself. It is convenient that R provides functions such as str()
and summary()
that allowed me to describe the dataset as a whole and get an idea about the structure of each feature.
Another challenge that I encountered was figuring out which variables were important enough to deserve more attention. For this, I had to rely on studying the individual distributions of the variables through their boxplots and histograms. I also had to create multiple preliminary bivariate plots to distinguish which pairs or sets of variables are worthy of consideration.
Recall how the \(R^2\) calculation returned a weak coefficient of determination value of 0.2477. Although I was expecting a higher result, I must admit that by this point I was no longer surprised. It somehow confirms my early struggles of trying to look for any relationship that would connect the features to the response variable. It turns out that almost none of them (save perhaps alcohol
), even taken collectively, can offer a close explanation of the quality score that each wine received.
The closest approximation that I can come up with is by approaching the problem indirectly. Since I knew that the alcohol level had a high enough correlation coefficient value when paired with wine quality, I used this knowledge to inform my efforts at discovering the supporting variables that influence the alcohol level.
Along the way, I also discovered some other interesting interactions among the explanatory variables. The data showed how residual sugar and total sulfur dioxide are proportionally related but this relationship is somewhat weak. I did some research and discovered that this relationship between sulfur dioxide and sugar in white wine happens because of the need to protect the wine in order to compensate for its low sugar levels. According to morethanorganic.com:
White wines and rosés do not contain natural anti-oxidants because they are not left in contact with their skins after crushing. For this reason they are more prone to oxidation and tend to be given larger doses of sulphur dioxide
On the other hand, by doing bivariate analysis, I found that free sulfur dioxide is negatively related with volatile acidity with a correlation coefficient of -0.10. Once again, it is very interesting to know how this finding mirrors what happens in wine-making. According to winemakermag.com, free sulfur dioxide is used to manage volatile acidity levels in wine.
The moderately positive correlation between total sulfur dioxide and wine density can in part be understood in hindsight. This situation is not farfetched if we consider the positive correlation between total sulfur dioxide and residual sugar as well as the strong and positive relationship between residual sugar and wine density.
In the middle of the project, I felt that I was running out of options about other ways to present the white wine data. This is when the idea about grouping the wines based on quality struck me. Reclassifying and refactoring wines based on different quality sets (Exceptional, Fair, Poor) introduced a variety of different ways of looking at the data. I was able to closely analyze the relationship between residual sugar and wine density conditioned on quality because of this.
Some Ideas For Future Work or Inquiry
The close exploratory analysis conducted on the white wine dataset yielded these interesting findings. It also brought forth some curious topics that may need further inquiry.
The quality of white wine observations rose with increasing chloride levels but only up to a point. Once the chloride content reached a little below 0.05 g/dm3, the relationship reversed. Higher quality wines began to be associated with lower saltiness or chloride levels. It would be interesting to determine if there are extraneous factors that might explain this behavior between chloride levels and wine quality. Finding these extraneous factors can be the subject of a future inquiry.
The low \(R^2\) value which coincided with low p values in our regression model is another topic which may need to be further examined. It is usual to see a high \(R^2\) value paired with low p values, which indicates that the variance in the predictor variables is associated with the variance in the response variable and that the model explains most of the variability in the response variable.
But in our case, our model does not explain much of the variability in the quality of wine despite the presence of significant input variables. We saw this in our scatterplot when we visualized the predicted and actual white wine quality
data values from our test set. The data points were substantially spread out around the reference line which indicates the large variance in the residuals. This also means that it is possible to have a viable set of input variable coefficients that coincide with a substantially imprecise set of predicted values.
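The combination of a highly significant coefficient and a low \(R^2\) is easy to reproduce in a small simulation, which may help show that the two are not in conflict. The sketch below uses made-up data where the true slope is real but the noise is large relative to the signal.

```r
# Simulated data (NOT the wine data): a genuine but noisy linear effect.
set.seed(100)
n <- 3000
x <- rnorm(n)
y <- 0.5 * x + rnorm(n, sd = 1)   # population R^2 = 0.25 / 1.25 = 0.2

fit <- lm(y ~ x)
s   <- summary(fit)

p_value   <- coef(s)["x", "Pr(>|t|)"]
r_squared <- s$r.squared
print(c(p_value = p_value, r_squared = r_squared))
```

With thousands of observations, even a modest effect is estimated precisely enough to yield a tiny p value, while most of the variance in the response remains unexplained noise. That is consistent with what the regression on the wine data shows.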
From the standpoint of the available data on white wines, a single variable (alcohol
) does not wholly determine the desirability of white wine. And a model utilizing a combination of various multivariate characteristics could not substantially explain the change in quality score despite the significance of each individual input variable. Thus, increasing the explanatory viability of a model might require additional predictors or perhaps the white wine data may just contain a substantial amount of intrinsic, unexplainable variability. I think that a further and closer study would have to be made at finding additional extraneous predictors or to analyze and manage the complex nature of wines to arrive at a better inferential model.