Title: Functions and Datasets for the Data Science Course at IBAW
Description: A collection of useful functions and datasets for the Data Science Course at IBAW.
Authors: Stefan Lanz [aut, cre]
Maintainer: Stefan Lanz <[email protected]>
License: MIT + file LICENSE
Version: 1.0.0.9005
Built: 2024-11-24 17:20:10 UTC
Source: https://github.com/stibu81/ibawds
Summary of data on restaurant bills from the dataset reshape2::tips. Labels are in German.
bills
A data frame with 8 rows and 4 variables:
- sex of the bill payer
- time of day
- whether there were smokers in the party
- mean of all the bills in dollars
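The following sketch shows how a comparable summary could be derived from the raw data; the column names total_bill, sex, time and smoker come from reshape2::tips, and the German labels used in bills are not reproduced here.

library(dplyr)

# aggregate the raw tips data to mean bills per group;
# bills itself uses German labels, which are not recreated in this sketch
reshape2::tips |>
  group_by(sex, time, smoker) |>
  summarise(mean_bill = mean(total_bill), .groups = "drop")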
Breast cancer database obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. The data were collected in 8 groups from 1989 to 1991 and are sorted in chronological order.
breast_cancer
A tibble with 699 rows and 11 variables. All numerical values are integers in the range 1 to 10.
- sample code number
- clump thickness
- uniformity of cell size
- uniformity of cell shape
- marginal adhesion
- single epithelial cell size
- bare nuclei
- bland chromatin
- normal nucleoli
- mitoses
- "benign" (458) or "malignant" (241)
The data is available on the UC Irvine Machine Learning Repository.
O. L. Mangasarian and W. H. Wolberg, Cancer diagnosis via linear programming, SIAM News, Volume 23(5) (1990) 1 & 18.
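A brief exploratory sketch; the column name class used below is an assumption, since the column names are not listed above.

library(ibawds)

# overview of all variables
str(breast_cancer)

# distribution of the diagnosis (the column name "class" is assumed)
table(breast_cancer$class)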
Check if the current system is ready for the course by verifying the following:
- R and RStudio are up to date
- the ibawds package is up to date
- all the required packages are installed
The function must be run from RStudio in order to work properly.
check_ibawds_setup()
a logical indicating whether the system is up to date (invisibly). Messages inform the user about the status of the system.
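A minimal usage sketch; the function is meant to be run interactively from the RStudio console.

## Not run:
# messages report whether R, RStudio, ibawds and the required packages are up to date
check_ibawds_setup()
## End(Not run)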
ibawds offers the function install_ibawds(), which installs all the packages that are required for the course. check_lecture_packages() finds all the packages that are used in the slides and exercise solutions inside a directory. It then checks whether they are all installed by install_ibawds() and returns a tibble of those that are not. This can help to identify whether additional packages need to be installed by install_ibawds().
check_lecture_packages(path = ".")
| Argument | Description |
|---|---|
| path | the path to a folder inside the directory with the slides and exercise solutions. The function automatically tries to identify the top level directory of the course material. |
a tibble with two columns:
- the file where the package is used
- the name of the package
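A minimal usage sketch, assuming the working directory lies inside the course material.

## Not run:
# find packages used in the course material that install_ibawds() does not cover
missing_pkgs <- check_lecture_packages()
missing_pkgs
## End(Not run)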
For a given dataset and given centres, cluster_with_centers() assigns each data point to its closest centre and then recomputes the centres as the mean of all points assigned to each class. An initial set of random cluster centres can be obtained with init_rand_centers(). These functions can be used to visualise the mechanism of k-means.
cluster_with_centers(data, centers)

init_rand_centers(data, n, seed = sample(1000:9999, 1))
| Argument | Description |
|---|---|
| data | a data.frame containing only the variables to be used for clustering. |
| centers | a data.frame giving the centres of the clusters. It must have the same number of columns as data. |
| n | the number of cluster centres to create |
| seed | a random seed for reproducibility |
a list containing two tibbles:
- centers: the new centres of the clusters computed after cluster assignment with the given centres
- cluster: the cluster assignment for each point in data using the centres that were passed to the function
# demonstrate k-means with iris data
# keep the relevant columns
iris2 <- iris[, c("Sepal.Length", "Petal.Length")]

# initialise the cluster centres
clust <- init_rand_centers(iris2, n = 3, seed = 2435)

# plot the data with the cluster centres
library(ggplot2)
ggplot(iris2, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(data = clust$centers, aes(colour = factor(1:3)),
             shape = 18, size = 6) +
  geom_point() +
  scale_colour_brewer(palette = "Set1")

# assign clusters and compute new centres
clust_new <- cluster_with_centers(iris2, clust$centers)

# plot the data with clustering
clust$cluster <- clust_new$cluster
voronoi_diagram(clust, x = "Sepal.Length", y = "Petal.Length", data = iris2)

# plot the data with new cluster centres
clust$centers <- clust_new$centers
voronoi_diagram(clust, x = "Sepal.Length", y = "Petal.Length", data = iris2,
                colour_data = FALSE)

# this procedure may be repeated until the algorithm converges
Table with the number of packages available on CRAN and the current R version for historic dates back to 21 June 2001.
cran_history
A data frame with 69 rows and 4 variables:
- date
- the number of available R packages on CRAN
- the then current version of R
- source of the data (see 'Details')
Data on the number of packages on CRAN between 2001-06-21 and 2014-04-13 is obtained from CRANpackages from the package Ecdat. This data was collected by John Fox and Spencer Graves. The intervals between data points are irregular. These data are marked with "John Fox" or "Spencer Graves" in the column source. They are licenced under GPL-2/GPL-3.

Data between 2014-10-01 and 2023-03-06 was collected by the package author from CRAN snapshots on Microsoft's MRAN, which was retired on 1 July 2023. Data was collected on the first day of each quarter. These data are marked with "MRAN" in the column source.

Newer data has been collected at irregular intervals using the functions n_available_packages() and available_r_version(). These data are marked with "CRAN" in the column source.
library(ggplot2)
ggplot(cran_history, aes(x = date, y = n_packages)) +
  geom_point()
Add the definitions for various useful LaTeX equation symbols for statistics to an RMarkdown document.
define_latex_stats()
Run this function from within a code chunk in an R Markdown document with options results = "asis" and echo = FALSE (see "Examples"). It only works for pdf output. It defines the following macros: \E, \P, \Var, \Cov, \Cor, \SD, \SE, \Xb, \Yb.
The function returns NULL invisibly. The command definitions are output as a side effect.
## Not run:
# add this code chunk to an R Markdown document
```{r results = "asis", echo = FALSE}
define_latex_stats()
```
## End(Not run)
Dental formulas for various mammals. The dental formula describes the number of incisors, canines, premolars and molars per quadrant. Upper and lower teeth may differ and are therefore shown separately. The total number of teeth is twice the number given.
dentition
Data frame with 66 rows and 9 variables:
- name of the mammal
- number of top incisors
- number of bottom incisors
- number of top canines
- number of bottom canines
- number of top premolars
- number of bottom premolars
- number of top molars
- number of bottom molars
The data have been downloaded from https://people.sc.fsu.edu/~jburkardt/datasets/hartigan/file19.txt
They come from the following textbook:
Hartigan, J. A. (1975). Clustering Algorithms, John Wiley, New York.
Table 9.1, page 170.
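Since the data come from a clustering textbook, a small hierarchical clustering sketch may be helpful; it assumes that the first column holds the mammal name and that the remaining eight columns are the tooth counts.

library(ibawds)

# hierarchical clustering of the mammals by their dental formula
# (assumes column 1 is the mammal name, columns 2 to 9 are tooth counts)
teeth <- as.data.frame(dentition[, -1])
rownames(teeth) <- dentition[[1]]
hc <- hclust(dist(teeth), method = "average")
plot(hc, cex = 0.6)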
Create plots of the density and distribution functions of a probability distribution. It is possible to mark points and shade the area under the curve.
distribution_plot(fun, range, ..., points = NULL, var = "x",
                  title = "Verteilungsfunktion", is_discrete = NULL)

density_plot(fun, range, ..., from = NULL, to = NULL, points = NULL,
             var = "x", title = "Dichte", is_discrete = NULL)
| Argument | Description |
|---|---|
| fun | a density or distribution function that takes quantiles as its first argument. |
| range | numeric vector of length two giving the range of quantiles to be plotted. |
| ... | further arguments that are passed to fun. |
| points | numeric vector giving quantiles where the function should be marked with a red dot (continuous) or a red bar (discrete). |
| var | character giving the name of the quantile variable. This is only used to label the axes. |
| title | character giving the title of the plot. |
| is_discrete | logical indicating whether this is a discrete distribution. For discrete distributions, a bar plot is created. If omitted, the function tries to automatically determine whether the distribution is discrete. If this fails, set this argument explicitly. |
| from, to | numeric values giving the start and end of a range where the area under the density will be shaded (continuous) or the bars will be drawn in red (discrete). If only one of the two values is given, the shading will start at negative infinity or go until positive infinity, respectively. |
a ggplot object
# plot density of the normal distribution
density_plot(dnorm, c(-5, 7), mean = 1, sd = 2, to = 3)

# plot distribution function of the Poisson distribution
distribution_plot(ppois, c(0, 12), lambda = 4, points = c(2, 6, 10), var = "y")
Downgrade packages to an older version available on CRAN. This can be useful when debugging problems that might have arisen due to a package update.
downgrade_packages(pkg, dec_version = c("any", "patch", "minor", "major"))
| Argument | Description |
|---|---|
| pkg | character with the names of the packages to be downgraded. |
| dec_version | character giving the version to decrease. Possible values are "any", "patch", "minor", and "major". See 'Details'. |
Using the argument dec_version, the user can control which version will be installed. The possible values are:
- "any": the previous available version will be installed.
- "patch": the newest available version with a smaller patch version number will be installed. For packages with three version numbers, this is the same as using "any".
- "minor": the newest available version with a smaller minor version number will be installed.
- "major": the newest available version with a smaller major version number will be installed.

Downgrading is only possible for packages that are currently installed. For packages that are not installed, a warning is issued.

The function uses remotes::install_version() to install a version of a package that is older than the currently installed version.
A character vector with the names of the downgraded packages, invisibly.
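A minimal sketch; the package name is purely illustrative.

## Not run:
# install the newest ggplot2 version with a smaller minor version number
downgrade_packages("ggplot2", dec_version = "minor")
## End(Not run)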
Find the named colour that is most similar to a given colour.
find_similar_colour(colour, distance = c("euclidean", "manhattan"), verbose = interactive())
| Argument | Description |
|---|---|
| colour | a colour specified in one of three forms, e.g. a hexadecimal string of the form "#rrggbb" or a numeric vector of length three giving the RGB values. |
| distance | character indicating the distance metric to be used. |
| verbose | should additional output be produced? This shows the RGB values for the input colour, the most similar named colour and the difference between the two. |
a character of length one with the name of the most similar named colour.
find_similar_colour("#d339da") find_similar_colour(c(124, 34, 201)) # suppress additional output find_similar_colour("#85d3a1", verbose = FALSE) # use Manhattan distance find_similar_colour(c(124, 34, 201), distance = "manhattan")
find_similar_colour("#d339da") find_similar_colour(c(124, 34, 201)) # suppress additional output find_similar_colour("#85d3a1", verbose = FALSE) # use Manhattan distance find_similar_colour(c(124, 34, 201), distance = "manhattan")
Two tables of fathers' heights with heights of one of their sons (galton_sons) or daughters (galton_daughters), respectively. All heights are given in centimetres. It is created from HistData::GaltonFamilies by randomly selecting one son or daughter per family. Since some families consist of only sons or only daughters, not all families are contained in both tables.
galton_sons

galton_daughters
Two data frames with 179 (galton_sons) or 176 (galton_daughters) rows, respectively, and 2 variables:
- height of the father in cm
- height of the son or daughter, respectively, in cm
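A hedged regression sketch in the spirit of Galton's analysis; the column names father and son are assumptions, since the column names are not listed above.

library(ibawds)

# regress sons' heights on fathers' heights
# (the column names "father" and "son" are assumed)
fit <- lm(son ~ father, data = galton_sons)
summary(fit)
plot(son ~ father, data = galton_sons)
abline(fit, col = "red")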
Copy the files for an exercise for reading files to a directory.
get_reading_exercise_files(path, unzip = TRUE)
| Argument | Description |
|---|---|
| path | path where the files should be copied to. |
| unzip | logical indicating whether the files should be unzipped. Set this to FALSE if the zip file should be copied without unzipping. |
There are 8 files in total. Apart from a few errors that were introduced for the purpose of the exercise, they all contain the same data: information about 100 randomly selected Swiss municipalities. The full file can be downloaded from https://www.bfs.admin.ch/bfsstatic/dam/assets/7786544/master.
Logical indicating the success of the copy operation.
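A minimal sketch that copies the files into a temporary directory.

# copy the exercise files into a temporary directory and list them
ex_dir <- file.path(tempdir(), "reading_exercise")
dir.create(ex_dir, showWarnings = FALSE)
get_reading_exercise_files(ex_dir)
list.files(ex_dir)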
These functions create two tables that can be used for the grading of the students' papers.
create_minreq_table(repro, n_tab, n_plot_kinds, n_plots, n_stat)

create_grading_table(p_text, p_tab, p_plot, p_code, p_stat)
| Argument | Description |
|---|---|
| repro | logical, is the paper reproducible? |
| n_tab | integer, number of tables |
| n_plot_kinds | integer, number of different kinds of plots |
| n_plots | integer, number of plots |
| n_stat | integer, number of statistical computations |
| p_text | numeric between 0 and 3, points given for the text |
| p_tab | numeric between 0 and 3, points given for the tables |
| p_plot | numeric between 0 and 5, points given for the plots |
| p_code | numeric between 0 and 5, points given for the code |
| p_stat | numeric between 0 and 5, points given for the statistical computations |
The tables are created using knitr::kable() and the kableExtra package is used for additional styling.

create_minreq_table() creates a table that checks that the minimal requirements are satisfied:
- the paper must be reproducible
- there must be at least one table and two kinds of plots
- there must be at least 5 plots and tables
- there must be at least two statistical computations
The table lists for each of those requirements whether it is satisfied or not.
create_grading_table() creates a table that gives grades in percent for each of five categories:
- Text
- Tables
- Plots
- Code
- Statistical computations
In each category, up to five points may be awarded. The last row of the table gives the percentage over all categories.
Both functions return an object of class kableExtra.
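A minimal sketch with purely illustrative values.

# minimal requirements table for a paper with example values
create_minreq_table(repro = TRUE, n_tab = 2, n_plot_kinds = 3,
                    n_plots = 6, n_stat = 2)

# grading table with example points per category
create_grading_table(p_text = 2.5, p_tab = 3, p_plot = 4,
                     p_code = 4.5, p_stat = 3)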
A number of R packages are used in the courses and the video lectures. They are also dependencies of this package. Use install_ibawds() to install the packages that are not yet installed.
install_ibawds()
This function checks whether all the packages that ibawds depends on, imports or suggests are installed. In interactive sessions, it either informs the user that all packages are installed or asks to install missing packages. The function relies on rlang::check_installed().
nothing or NULL invisibly
In the mtcars dataset, the names of the car models are stored as row names. However, when working with ggplot2 and other packages from the tidyverse, it is convenient to have all data in columns. mtcars2 is a variant of mtcars that contains the car models in a column instead of storing them as row names. mtcars2_na is the same dataset as mtcars2, but some of the columns contain missing values.
mtcars2

mtcars2_na
A data frame with 32 rows and 12 variables. The format is identical to mtcars and details can be found in its documentation. The only difference is that the car model names are stored in the column model instead of the row names.
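A short sketch of the intended use with ggplot2; the columns wt and mpg come from mtcars, and model is the added column described above.

library(ggplot2)

# fuel consumption against weight, labelling the points by car model
ggplot(mtcars2, aes(x = wt, y = mpg, label = model)) +
  geom_point() +
  geom_text(size = 2, vjust = -0.8)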
Obtain the number of available packages on CRAN and the current R version.
n_available_packages(cran = getOption("repos"))

available_r_version(cran = getOption("repos"))
| Argument | Description |
|---|---|
| cran | character vector giving the base URL of the CRAN server to use. |
The number of packages on CRAN and the R version can be obtained for selected dates in the past from the dataset cran_history.
Note: Previously, these functions could obtain the number of packages on CRAN and the then current R version also for past dates by using snapshots from Microsoft's MRAN. However, MRAN shut down on 1 July 2023 such that this functionality is no longer available.
the number of available packages as an integer or the R version number as a character
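A minimal sketch; the calls query the configured CRAN mirror and therefore need internet access.

## Not run:
n_available_packages()
available_r_version()
## End(Not run)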
Training and test data created from a tenth order polynomial with added noise. The noise follows a standard normal distribution. The data can be used to demonstrate overfitting. The polynomial and the setup are inspired by section II. B. of A high-bias, low-variance introduction to Machine Learning for physicists (see 'References').
noisy_data
a list of two tibbles with two columns each: x stands for the independent, y for the dependent variable. The training data (noisy_data$train) contains 1000 rows, the test data (noisy_data$test) 20 rows.
P. Mehta et al., A high-bias, low-variance introduction to Machine Learning for physicists Phys. Rep. 810 (2019), 1-124. arXiv:1803.08823 doi:10.1016/j.physrep.2019.03.001
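A hedged overfitting sketch; it assumes the two columns are named x and y, which is not confirmed above.

library(ibawds)

# fit polynomials of increasing degree on the training data and
# compare the prediction error on the test data
# (assumes the columns are named x and y)
degrees <- c(1, 3, 10)
rmse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = noisy_data$train)
  pred <- predict(fit, newdata = noisy_data$test)
  sqrt(mean((noisy_data$test$y - pred)^2))
})
data.frame(degree = degrees, test_rmse = rmse)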
Protein Consumption from various sources in European countries in unspecified units. The exact year of data collection is not known but the oldest known publication of the data is from 1973.
protein
Data frame with 25 rows and 10 variables:
- name of the country
- red meat
- white meat
- eggs
- milk
- fish
- cereals
- starchy foods
- pulses, nuts, oil-seeds
- fruits, vegetables
The data have been downloaded from https://raw.githubusercontent.com/jgscott/STA380/master/data/protein.csv
They come from the following book:
Hand, D. J. et al. (1994). A Handbook of Small Data Sets, Chapman and Hall, London.
Chapter 360, p. 297.
In the book, it is stated that the data have first been published in
Weber, A. (1973). Agrarpolitik im Spannungsfeld der internationalen Ernährungspolitik, Institut für Agrarpolitik und Marktlehre, Kiel.
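A small clustering sketch, since the dataset is a classic clustering example; it assumes that the first column holds the country name and all other columns are numeric.

library(ibawds)

# k-means clustering of the countries by their protein sources
# (assumes column 1 is the country name)
prot_scaled <- scale(protein[, -1])
set.seed(123)
cl <- kmeans(prot_scaled, centers = 4, nstart = 20)
split(protein[[1]], cl$cluster)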
rand_with_cor() creates a vector of random numbers that has correlation rho with a given vector y. Also the mean and standard deviation of the random vector can be fixed by the user. By default, they will be equal to the mean and standard deviation of y, respectively.
rand_with_cor(y, rho, mu = mean(y), sigma = sd(y))
| Argument | Description |
|---|---|
| y | a numeric vector |
| rho | numeric value between -1 and 1 giving the desired correlation. |
| mu | numeric value giving the desired mean |
| sigma | numeric value giving the desired standard deviation |
a vector of the same length as y that has correlation rho with y.
This solution is based on an answer by whuber on Cross Validated.
x <- runif(1000, 5, 8)

# create a random vector with positive correlation
y1 <- rand_with_cor(x, 0.8)
all.equal(cor(x, y1), 0.8)

# create a random vector with negative correlation
# and fixed mean and standard deviation
y2 <- rand_with_cor(x, -0.3, 2, 3)
all.equal(cor(x, y2), -0.3)
all.equal(mean(y2), 2)
all.equal(sd(y2), 3)
Rescale Mean And/Or Standard Deviation of a Vector
rescale(x, mu = mean(x), sigma = sd(x))
| Argument | Description |
|---|---|
| x | numeric vector |
| mu | numeric value giving the desired mean |
| sigma | numeric value giving the desired standard deviation |
By default, mean and standard deviation are not changed, i.e., rescale(x) is identical to x. Only if a value is specified for mu and/or sigma are the mean and/or the standard deviation rescaled.
a numeric vector with the same length as x with mean mu and standard deviation sigma.
x <- runif(1000, 5, 8)

# calling rescale without specifying mu and sigma doesn't change anything
all.equal(x, rescale(x))

# change the mean without changing the standard deviation
x1 <- rescale(x, mu = 3)
all.equal(mean(x1), 3)
all.equal(sd(x1), sd(x))

# rescale mean and standard deviation
x2 <- rescale(x, mu = 3, sigma = 2)
all.equal(mean(x2), 3)
all.equal(sd(x2), 2)
Extract of the data in the Seatbelts dataset as a data frame. The original dataset is a multiple time series (class mts). Labels are in German.
seatbelts
A data frame with 576 rows and 3 variables:
- date of the first day of the month for which the data was collected
- seat where the persons that were killed or seriously injured were seated. One of "Fahrer" (driver's seat), "Beifahrer" (front seat), "Rücksitz" (rear seat).
- number of persons that were killed or seriously injured
Set options for ggplot plots and tibble outputs for IBAW slides.
set_slide_options(ggplot_text_size = 22, ggplot_margin_pt = rep(10, 4),
                  tibble_print_max = 12, tibble_print_min = 8)
| Argument | Description |
|---|---|
| ggplot_text_size | Text size to be used in ggplot2 plots. This applies to all texts in the plots. |
| ggplot_margin_pt | numeric vector of length 4 giving the sizes of the top, right, bottom, and left margins in points. |
| tibble_print_max | Maximum number of rows printed for a tibble. Set to Inf to always print all rows. |
| tibble_print_min | Number of rows to be printed if a tibble has more than tibble_print_max rows. |
The function uses ggplot2::theme_update() to modify the default theme for ggplot and options() to set base R options that influence the printing of tibbles.

Note that if you make changes to these options in an R Markdown file, you may have to delete the knitr cache in order for the changes to apply.
a named list (invisibly) with two elements containing the old values of the options for the ggplot theme and the base R options, respectively. These can be used to reset the ggplot theme and the base R options to their previous values.
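A minimal sketch of typical use in the setup chunk of a slide deck; the argument values in the second call are purely illustrative.

# use the default slide settings
set_slide_options()

# or request smaller text and tighter margins
set_slide_options(ggplot_text_size = 18, ggplot_margin_pt = rep(5, 4))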
Evaluation of the student papers, lecture slides and some exercises are all done in the form of Rmd files. These functions find all the relevant Rmd files in a directory and check the spelling using the package spelling.
spell_check_evaluation(path = ".", students = NULL, use_wordlist = TRUE)

spell_check_slides(path = ".", use_wordlist = TRUE)
| Argument | Description |
|---|---|
| path | path to the top level directory of the evaluations (for spell_check_evaluation()) or of the slides and exercises (for spell_check_slides()). |
| students | an optional character vector with student names. If given, only the evaluation for these students will be checked. |
| use_wordlist | should a list of words be excluded from the spell check? The package contains separate word lists for evaluations and slides/exercises with words that have typically appeared in these documents in the past. When spell checking the paper evaluations, the names of the students will always be excluded from spell check, even if use_wordlist is set to FALSE. |
spell_check_evaluation() finds Rmd files with evaluations in subfolders starting from the current working directory or the directory given by path. The file names must be of the form "Beurteilung_Student.Rmd", where "Student" must be replaced by the student's name. By default, words contained in a wordlist that is part of the package as well as all the students' names are excluded from the spell check, but this can be turned off by setting use_wordlist = FALSE. (Note that the students' names will still be excluded.)
spell_check_slides() finds Rmd files with slides and exercises in subfolders starting from the current working directory or the directory given by path. In order to exclude a file from the spell check, make sure its first line contains the term "nospellcheck", typically in the form of an HTML comment:

<!-- nospellcheck -->

By default, words contained in a wordlist that is part of the package are excluded from the spell check, but this can be turned off by setting use_wordlist = FALSE.
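A minimal sketch; the student names are placeholders.

## Not run:
# check all evaluations below the current directory
spell_check_evaluation()

# check the evaluations of selected students only
# ("Muster" and "Beispiel" are placeholder names)
spell_check_evaluation(students = c("Muster", "Beispiel"))

# check the slides and exercises without using the packaged word list
spell_check_slides(use_wordlist = FALSE)
## End(Not run)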
Create a Voronoi diagram for a given clustering object.
voronoi_diagram(cluster, x, y, data = NULL, show_data = !is.null(data),
                colour_data = TRUE, legend = TRUE, point_size = 2,
                linewidth = 0.7)
| Argument | Description |
|---|---|
| cluster | an object containing the result of a clustering, e.g., created by kmeans() or cluster_with_centers(). |
| x, y | character giving the names of the variables to be plotted on the x- and y-axis. |
| data | The data that has been used to create the clustering. If this is provided, the extension of the plot is adapted to the data and the data points are plotted unless this is suppressed by specifying show_data = FALSE. |
| show_data | should the data points be plotted? By default, this is TRUE if data is given and FALSE otherwise. |
| colour_data | should the data points be coloured according to the assigned cluster? |
| legend | should a colour legend for the clusters be plotted? |
| point_size | numeric indicating the size of the data points and the cluster centres. |
| linewidth | numeric indicating the width of the lines that separate the areas for the clusters. Set to 0 to show no lines at all. |
The function uses the deldir package to create the polygons for the Voronoi diagram. The code has been inspired by ggvoronoi, which can handle more complex situations.
Garrett et al., ggvoronoi: Voronoi Diagrams and Heatmaps with ggplot2, Journal of Open Source Software 3(32) (2018) 1096, doi:10.21105/joss.01096
cluster <- kmeans(iris[, 1:4], centers = 3)
voronoi_diagram(cluster, "Sepal.Length", "Sepal.Width", iris)
Physicochemical data and quality ratings for red and white Portuguese Vinho Verde wines.
wine_quality
a tibble with 6497 rows and 13 variables:
- colour of the wine; "red" (1'599) or "white" (4'898)
- tartaric acid per volume
- acetic acid per volume
- citric acid per volume
- residual sugar per volume
- sodium chloride per volume
- free sulphur dioxide per volume
- total sulphur dioxide per volume
- density
- pH value
- potassium sulphate per volume
- alcohol content per volume in %
- quality score between 0 (worst) and 10 (best) determined by sensory analysis
The data is available on the UC Irvine Machine Learning Repository.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems 47(4) (2009), 547-553.
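A short exploratory sketch; the column names color and quality are assumptions, since the column names are not listed above.

library(ibawds)
library(ggplot2)

# distribution of the quality scores by wine colour
# (the column names "color" and "quality" are assumed)
ggplot(wine_quality, aes(x = factor(quality), fill = color)) +
  geom_bar(position = "dodge")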