Rating: 8.3/10.
All of Statistics: A Concise Course in Statistical Inference by Larry Wasserman
This textbook is an introduction to statistics for readers who have a solid foundation in mathematics but little background in statistics. It covers a wide range of topics rapidly within 400 pages, resulting in a brief treatment of each subject that focuses on the most important concepts while omitting many details. Overall, it is a useful text for quickly becoming acquainted with a large number of statistical topics.
Chapter 1. A probability distribution is a function that assigns to each event A a real number between 0 and 1, interpreted in frequentist statistics as the long-run proportion of trials in which A occurs, or in Bayesian statistics as the degree of belief that A is true. Definitions of independence of events and the properties of conditional probability, including Bayes’ theorem. In advanced probability, one must define which subsets of a continuous space can be assigned probabilities (said to be measurable), but this is out of scope for this book.
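As a quick illustration of Bayes’ theorem, here is a minimal Python sketch on a classic diagnostic-test setup; the numbers are hypothetical, chosen only for illustration:

```python
# Bayes' theorem on a diagnostic-test example (hypothetical numbers).
prior = 0.01          # P(disease)
sensitivity = 0.95    # P(positive | disease)
false_positive = 0.05 # P(positive | no disease)

# P(positive) via the law of total probability.
p_positive = sensitivity * prior + false_positive * (1 - prior)

# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
posterior = sensitivity * prior / p_positive
print(f"P(disease | positive) = {posterior:.3f}")  # ~0.161
```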
Chapter 2. A random variable can be defined by its cumulative distribution function (CDF), which is non-decreasing, with a limit of 0 at negative infinity and a limit of 1 at infinity. A probability density function (PDF) integrates to the CDF, and its integral over the whole domain must be 1. Some PDFs of common distributions are given. A PDF may be defined jointly across two variables; then the marginal distribution is obtained by integrating out one variable to get a distribution involving only the other, and the conditional distribution is obtained by fixing one variable to a specific value to get a PDF involving only the other. Some advanced measure theory is required to define conditional expectation properly (to avoid dividing by zero), but this is out of scope for this book. A transformation applied to a random variable gives a new random variable, whose CDF and PDF can be derived by manipulating those of the original.
Chapter 3. The expectation of a random variable has properties like linearity, so when a random variable like a binomial can be broken down as a sum, it’s easier to analyze the expectation of each part individually. Variance is the expectation of (X − μ)^2, and covariance between two random variables is the expectation of (X − μX)(Y − μY), where 0 means they’re uncorrelated. Correlation normalizes covariance to between -1 and 1. The variance of a sum produces covariance terms when the variables are not independent. Conditional expectation is the expectation of X when Y is fixed. The moment generating function (MGF) can be differentiated to get the expectation and variance, and it provides another way of representing a distribution, since distributions with the same MGF are equal; this is useful for some proofs, such as analyzing the sum of independent random variables.
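A small simulation sketch (my own, not from the book) showing the binomial decomposition: a Binomial(n, p) variable is a sum of n independent Bernoulli(p) variables, so linearity gives E[X] = np without any combinatorics:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 0.3

# Build binomial draws explicitly as sums of Bernoulli draws.
bernoullis = rng.random((100_000, n)) < p
x = bernoullis.sum(axis=1)

print(x.mean(), n * p)           # sample mean vs. n*p = 3.0
print(x.var(), n * p * (1 - p))  # variance n*p*(1-p) = 2.1 (independent terms)
```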
Chapter 4. Some probability inequalities: Markov’s inequality bounds how frequently large values can appear based solely on the expectation, assuming the random variable is nonnegative. Chebyshev’s inequality applies Markov’s inequality to (X − μ)^2 to bound how much a random variable can deviate from its mean, given the mean and variance. Hoeffding’s inequality offers tighter bounds than Chebyshev’s for the sum or mean of bounded random variables, and can also provide a confidence interval. The Cauchy-Schwarz and Jensen inequalities give bounds on expected values, for example when applied to a convex or concave function.
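A sketch comparing the Chebyshev and Hoeffding bounds against the empirical tail frequency for the mean of bounded variables; the Uniform(0,1) variables and the deviation threshold of 0.2 are my own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # number of bounded variables per sample mean

# Mean of n Uniform(0, 1) variables: mu = 0.5, Var(mean) = 1/(12 n).
means = rng.random((100_000, n)).mean(axis=1)
t = 0.2

empirical = (np.abs(means - 0.5) >= t).mean()
chebyshev = (1 / (12 * n)) / t**2       # Var / t^2
hoeffding = 2 * np.exp(-2 * n * t**2)   # for variables bounded in [0, 1]

# The empirical frequency is essentially 0; Hoeffding (~7e-4) is far
# tighter than Chebyshev (~0.021) for this bounded case.
print(empirical, chebyshev, hoeffding)
```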
Chapter 5. Several types of convergence of X1, X2, … to X. Quadratic mean (L2) convergence is a strong form where the expected squared difference between Xn and X approaches 0. Convergence in probability means that for every epsilon, the probability that Xn differs from X by more than epsilon approaches 0; this is weaker than quadratic mean convergence, since there is no bound on how large the difference may be, as long as the probability of a large difference tends to 0. Convergence in distribution is the weakest form: only the CDFs must converge (at continuity points). It is considered weak because the difference between Xn and X may still be large; e.g., if each Xn is independent N(0,1) and X is also N(0,1), then you have convergence in distribution but none of the other forms. When converging to a point mass distribution, convergence in probability and convergence in distribution are equivalent. Some forms of convergence are preserved under transformations.
The law of large numbers states that the mean of random variables converges to the expected value, a point mass. The weak form is convergence in probability, while the strong form is almost sure convergence, a more advanced topic beyond the scope of this book. The central limit theorem states that the distribution of the mean of random variables converges to a normal distribution determined entirely by the mean and variance of the random variables; it can be proven using moment generating functions. The delta method provides the limiting distribution of a function of random variables that converge to a normal distribution.
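A quick CLT simulation sketch: standardized means of skewed Exponential(1) variables approach N(0, 1) even though each term is far from normal (the sample sizes here are my own choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n = 200
means = rng.exponential(1.0, size=(50_000, n)).mean(axis=1)
z = (means - 1.0) / (1.0 / np.sqrt(n))  # mu = sigma = 1 for Exponential(1)

# Compare a few quantiles against the standard normal.
for q in (0.05, 0.5, 0.95):
    print(q, np.quantile(z, q), stats.norm.ppf(q))
```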
Chapter 6. A parametric model is denoted as f(x; theta), where the model is determined by a vector of parameters, theta, which is to be inferred from data. Models that cannot be determined by finitely many parameters are considered nonparametric. A point estimator is a function that estimates the unknown theta from data. It is unbiased if its expectation equals the true value, but usually consistency is more important than unbiasedness: consistency means the estimator converges to the true value as the sample size grows. The MSE is the sum of the squared bias and the variance, and it measures the quality of a point estimate. A confidence interval is interpreted as follows: when constructed from data many times, 95% of these intervals will capture the true value; it is not a probability statement about the true value of theta. Many estimators converge to a normal distribution, in which case it is easy to construct a confidence interval from z-scores.
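To make the coverage interpretation concrete, here is a small simulation sketch (my own, using the usual normal-approximation interval for a mean):

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, n, trials = 5.0, 50, 10_000

# About 95% of intervals constructed this way should contain the true mean.
covered = 0
for _ in range(trials):
    sample = rng.normal(true_mu, 2.0, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    if lo <= true_mu <= hi:
        covered += 1

print(covered / trials)  # close to 0.95
```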
Chapter 7. The empirical distribution function approximates the CDF from data, and the Glivenko-Cantelli theorem states that it converges to the actual CDF. The DKW inequality bounds the error of the empirical CDF, which can be used to construct confidence bands for the CDF. A statistical functional is any function of the CDF, such as the mean or variance. The plug-in estimator estimates a functional by applying it to the empirical CDF.
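A minimal sketch of a DKW confidence band around the empirical CDF, using the standard half-width sqrt(log(2/alpha) / (2n)) implied by the inequality:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=100)
alpha = 0.05

# DKW: with probability 1 - alpha, the true CDF lies within eps of the
# empirical CDF everywhere.
eps = np.sqrt(np.log(2 / alpha) / (2 * len(data)))

x = np.sort(data)
ecdf = np.arange(1, len(x) + 1) / len(x)
lower = np.clip(ecdf - eps, 0, 1)
upper = np.clip(ecdf + eps, 0, 1)
print(f"95% confidence band half-width: {eps:.3f}")
```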
Chapter 8. The bootstrap uses the empirical CDF to approximate a quantity that depends on the CDF, such as a variance. In practice, this involves drawing with replacement from the data set and using these resamples to estimate the statistic. The bootstrap can be used to compute the distribution and confidence interval of any statistic derived from the data, such as the median.
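A minimal bootstrap sketch for the standard error and percentile confidence interval of a median (synthetic data of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(2.0, size=200)

# Resample the data with replacement and recompute the statistic each time.
B = 5000
medians = np.array([
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(B)
])

se = medians.std()
lo, hi = np.percentile(medians, [2.5, 97.5])  # percentile interval
print(f"median = {np.median(data):.3f}, se = {se:.3f}, "
      f"95% CI = ({lo:.3f}, {hi:.3f})")
```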
Chapter 9. Parametric inference involves estimating the parameters of a model from data. The method of moments equates the sample moments computed from data with the model’s theoretical moments and solves the resulting system of equations for the parameters (given as many moments as parameters, the system can be solved); this estimator is consistent.
The maximum likelihood estimator (MLE) defines the likelihood function of the data given the parameters and then finds the parameters that maximize it. An analytical way of solving this is to differentiate with respect to the parameters; it is usually easier to work with the log-likelihood and drop constant terms. The MLE is consistent; to prove this, we can show that the KL divergence between the MLE-estimated model and the true model goes to zero. The distribution of the MLE is asymptotically normal with variance equal to the inverse of the Fisher information, which provides a confidence interval for the MLE. The delta method gives the variance and confidence interval for a quantity that is a function of the parameters in a multi-parameter model. Alternatively, we can estimate the parameter variance using the bootstrap, a computational method rather than a closed-form expression.
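A sketch of numerical maximum likelihood, assuming a Gamma model (my choice for illustration), where the shape parameter has no closed-form MLE; a generic optimizer maximizes the log-likelihood:

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
data = rng.gamma(3.0, 2.0, size=500)  # true shape 3.0, scale 2.0

# Negative log-likelihood of a Gamma(shape, scale) model.
def nll(params):
    shape, scale = params
    if shape <= 0 or scale <= 0:
        return np.inf
    return -stats.gamma.logpdf(data, a=shape, scale=scale).sum()

result = optimize.minimize(nll, x0=[1.0, 1.0], method="Nelder-Mead")
print(result.x)  # should be near the true (3.0, 2.0)
```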
Some intuition about Fisher information: for multi-parameter models, it generalizes to the Fisher information matrix, related to the (negative expected) Hessian of the log-likelihood at the true value, measuring how sharply the likelihood peaks there. The inverse of the Fisher information is the asymptotic variance of the MLE; i.e., the Fisher information measures how much the data inform you about the parameter.
A statistic is sufficient if the likelihood function can be recovered from it; e.g., for a normal distribution, knowing the sample mean and sample variance is sufficient without knowing the full data. The Rao-Blackwell theorem states that an estimator can be improved (or at least not made worse) by conditioning on a sufficient statistic. If the MLE is difficult to compute analytically, there are two ways of computing it iteratively: the Newton-Raphson method improves the parameter estimate using second-derivative information, and the EM algorithm alternates steps, holding one subset of quantities fixed while updating the others. The EM algorithm is often used for fitting a mixture of normal distributions, as in the sketch below.
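A minimal EM sketch for a two-component normal mixture; this is my own implementation of the standard E and M updates, not code from the book:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic data from a two-component normal mixture.
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 700)])

# Initial guesses for weights, means, and standard deviations.
w, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # E-step: posterior responsibility of each component for each point.
    dens = w * stats.norm.pdf(data[:, None], mu, sigma)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters from the responsibility-weighted data.
    nk = resp.sum(axis=0)
    w = nk / len(data)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk)

print(w, mu, sigma)  # should approach (0.3, 0.7), (-2, 3), (1, 1.5)
```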
Chapter 10. The power function (beta) of a hypothesis test is the probability of rejecting the null hypothesis when the data are generated from the model with parameter theta. The size or level (alpha) is the probability of incorrectly rejecting the null hypothesis. Given a fixed alpha, the most powerful test is the one with the best chance of correctly rejecting the null when it is false.
The Wald test (in this case, the same as the z-test) applies when the statistic is asymptotically normal: calculate the W statistic and reject the null if its absolute value is greater than the z-score derived from the size alpha. This is useful for testing whether two proportions or means are identical. The p-value is the smallest alpha at which we would reject the null hypothesis (notably, it is not the probability that the null hypothesis is true). If the null hypothesis is true, the p-value is uniformly distributed on [0,1]; if the null hypothesis is false, the p-value concentrates near zero. You can interpret it as the probability of observing a value at least as extreme if the null hypothesis were true.
The chi-squared test checks whether observed counts deviate from expected proportions. The permutation test checks whether two sets of data come from the same distribution: randomly sample permutations of the pooled data, recompute the test statistic for each, and count how many are larger than the actual test statistic (see the sketch below). It is most useful for small samples, as it is an exact test not based on large-sample approximations. The likelihood ratio test compares the maximized likelihood over the full parameter space to the maximized likelihood over the restricted set allowed by the null hypothesis; twice the log of this ratio asymptotically follows a chi-squared distribution with degrees of freedom equal to the difference between the degrees of freedom of the full and null models.
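A minimal permutation-test sketch using the difference of means as the test statistic (synthetic two-sample data of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=15)
y = rng.normal(0.8, 1.0, size=12)

observed = x.mean() - y.mean()
pooled = np.concatenate([x, y])

# Under the null (same distribution), any relabeling is equally likely.
count = 0
B = 10_000
for _ in range(B):
    perm = rng.permutation(pooled)
    stat = perm[:len(x)].mean() - perm[len(x):].mean()
    if abs(stat) >= abs(observed):
        count += 1

print(f"p-value = {count / B:.4f}")
```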
When testing multiple hypotheses, the Bonferroni correction lowers alpha by dividing it by the number of hypotheses tested. However, it is very conservative, since it guards against even a single false positive discovery. An alternative is the Benjamini-Hochberg method, which orders the p-values from smallest to largest and rejects the smallest few such that the expected false discovery rate is alpha. The Neyman-Pearson lemma states that if both the null and alternative hypotheses are simple (a single parameter value rather than a set of values), then the likelihood ratio test is the most powerful.
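A small sketch of the Benjamini-Hochberg step-up procedure as described above, run on hypothetical p-values:

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level alpha."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    # Find the largest k with the k-th smallest p-value <= (k/m) * alpha.
    below = p[order] <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # index of the cutoff p-value
        reject[order[:k + 1]] = True      # reject it and all smaller p-values
    return reject

# Hypothetical p-values: only the first two survive at alpha = 0.05.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.74]))
```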
Chapter 11. In Bayesian inference, probability is a degree of belief rather than a limiting frequency, so it’s possible to make probability statements about parameters that are fixed but unknown (which is impossible in frequentist statistics). The posterior is calculated from the prior and updated with data using Bayes’ theorem; we can drop the denominator, which is a normalizing constant, so the posterior is proportional to the likelihood times the prior.
For Bernoulli trials, the posterior distribution is a beta distribution, parameterized by the numbers of successes and failures. This depends on the prior, which is usually chosen as a uniform or another beta distribution. With a beta prior, the posterior is also a beta distribution; when this happens, the prior is called a conjugate prior. For a normal distribution, a normal prior gives a normal posterior, and the 95% Bayesian interval matches the 95% frequentist confidence interval. In general, for large samples, the posterior is approximately normal and matches the frequentist confidence interval.
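A minimal sketch of the conjugate beta-Bernoulli update with hypothetical counts: a Beta(a, b) prior and s successes in n trials give a Beta(a + s, b + n − s) posterior:

```python
from scipy import stats

a, b = 1, 1    # uniform prior Beta(1, 1)
n, s = 20, 14  # hypothetical data: 14 successes in 20 trials

posterior = stats.beta(a + s, b + n - s)
print(posterior.mean())          # posterior mean ~ 0.68
print(posterior.interval(0.95))  # 95% Bayesian interval
```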
The choice of prior is important, but we want to avoid subjective judgment, so usually we choose a non-informative prior. However, a flat prior like the uniform distribution is relative to a particular parameterization of the model and is not well-defined, because if the model is parameterized differently, the prior is no longer flat. Jeffreys’ prior is a rule for creating a non-informative prior based on the Fisher information. Bayesian inference is appealing because the interval expresses a degree of certainty about the true value rather than a long-run statement about trapping it. However, Bayesian methods can fail in certain setups, such as high-dimensional problems where the data cannot update the posterior effectively, leaving the posterior essentially identical to the prior even after seeing the data.
Chapter 12. The risk of an estimator is the expected value of its loss function (e.g., squared error). When comparing two estimators, it may be that neither has uniformly lower risk for all possible values of the parameter. In such cases, the maximum risk is the supremum of the risk over all parameter values, whereas the Bayes risk is the average risk, which requires choosing a prior distribution to average over. The Bayes estimator (or Bayes rule) is the one that minimizes the Bayes risk. When a Bayes estimator has constant risk (i.e., risk that does not depend on the true parameter), it is also a minimax estimator (one that minimizes the maximum risk). In large enough samples, the MLE is approximately minimax and approximately Bayes.
An admissible estimator is one that cannot be uniformly improved upon (i.e., no other estimator has risk at least as low everywhere and strictly lower somewhere). Stein’s paradox occurs when jointly estimating the means of three or more normally distributed variables: the sample mean estimator is inadmissible, and the James-Stein estimator (which shrinks estimates towards zero) has uniformly lower risk.
Chapter 13. Simple linear regression fits a linear model between y and x by minimizing the squared error. When the errors are normal, the least squares estimator is the MLE. The least squares estimators of the regression coefficients are consistent and asymptotically normal, allowing you to test the hypothesis of a non-zero slope. The training error is a downward-biased estimate of the prediction risk due to overfitting, and Mallows’ Cp statistic corrects this bias by essentially adding a complexity penalty. The Akaike Information Criterion (AIC) is equivalent to Mallows’ Cp in this setting.
K-fold cross-validation divides the data into k groups, fitting the model k times with one group omitted each time and testing on the omitted group. In the case of leave-one-out cross-validation, the cross-validation risk can be derived analytically without refitting the model many times. The Bayesian Information Criterion (BIC) is similar to AIC but has a Bayesian interpretation (a prior over models updated with the data) and penalizes complexity more heavily than AIC.
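A minimal k-fold cross-validation sketch for a least-squares line fit on synthetic data (refitting for each fold rather than using an analytic shortcut):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 2, 100)

# 5-fold cross-validation of a simple least-squares line fit.
k = 5
indices = rng.permutation(len(x))
folds = np.array_split(indices, k)

errors = []
for fold in folds:
    train = np.setdiff1d(indices, fold)
    # Fit y = b0 + b1*x on the training folds only.
    b1, b0 = np.polyfit(x[train], y[train], 1)
    errors.append(np.mean((y[fold] - (b0 + b1 * x[fold])) ** 2))

print(f"estimated prediction risk: {np.mean(errors):.3f}")
```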
To choose which covariates to use, two approaches are forward and backward stepwise regression, which are essentially greedy searches starting from the empty or the full model, adding or removing variables until improvement ceases. An alternative is the Zheng-Loh method, which first fits the full model and then keeps the coefficients with the largest Wald statistics, with the number kept chosen according to a formula. Logistic regression models a binary response using a logistic function applied to a linear model, and it is fit using an iterative algorithm.
Chapter 14. Various topics related to multivariate models. A confidence interval for the correlation is best derived by first applying the Fisher transform, which converts it into an approximately normally distributed variable. The multinomial distribution generalizes the binomial distribution to multiple choices at each step.
Chapter 15. Testing two binary variables for independence can be done using the likelihood ratio test or the chi-squared test; this generalizes to non-binary discrete variables. The odds ratio measures how strongly the variables occur together (association, not causation); if the variables are independent, the odds ratio is 1. A similar concept is the relative risk, which is very close to the odds ratio when the probability of the disease is small. The confidence interval for the odds ratio is best calculated on the log odds ratio and then exponentiated. Testing the independence of two continuous variables can be done by fitting a linear regression or testing the correlation coefficient (note this tests whether they are uncorrelated, not whether they are independent). Finally, to test the independence of a discrete variable and a continuous variable, you can check whether the continuous variable has the same distribution in each group using the two-sample Kolmogorov-Smirnov test.
Chapter 16. In causal analysis (assuming a binary treatment), there are two potential outcomes, C0 and C1, and only one of them is observed for any instance as the variable Y. The causal effect is the difference C1 − C0, while the association is P(Y | X=1) − P(Y | X=0); these are very different. If we randomly assign subjects to treatment versus non-treatment, then the causal effect equals the association. In an observational study, the treatment is not assigned randomly; instead, we generally try to measure confounders and assume the chance of treatment is effectively random after controlling for all confounding effects, so that the causal effect can be estimated. Simpson’s paradox is that the association may be positive in subgroups (e.g., positive within men and positive within women) but negative overall; this cannot happen with the causal effect.
Chapter 17. X and Y being conditionally independent given Z means that once you know Z, X and Y provide no further information about each other. We can represent independence relations as a DAG, where an arrow X → Y means dependence. If X → Y and Z → Y, then Y is called a collider. Assuming the Markov condition, every variable is independent of its past given its parents. The rules of d-separation and d-connectedness in a DAG determine whether two sets of variables are independent conditional on a third set.
Chapter 18. Undirected graphs are also useful for modeling conditional independence, but in a different way, without a direction of causation. Two variables X and Y are conditionally independent given a set Z of variables when every path from X to Y passes through Z.
Chapter 19. A log-linear model breaks the log of the joint probability function of discrete variables into a sum of phi-functions, each representing the interaction among some subset of the variables. The function phi0 involves no variables, so it represents a constant baseline, and each subsequent function adds to this baseline. A model is called graphical if its independence relationships can be represented by an undirected graph. Fitting log-linear models to data is similar to variable selection in linear regression and can be done using AIC or likelihood ratio tests.
Chapter 20. The risk of a density estimator is the integrated squared error between the estimated and true functions, and it decomposes into a bias term and a variance term; this is the bias-variance trade-off. For histogram estimation, it is possible to derive a formula for the optimal binwidth, but this is not directly usable in practice because it depends on the unknown true density. The risk of the histogram estimator also decreases at a slower rate than that of many other estimators. A more practical way to select the binwidth is cross-validation, and the cross-validation error can be computed without refitting the model.
Although the histogram estimator converges to the true density, the kernel density estimator converges faster. One needs to pick a kernel (e.g., Gaussian) and a bandwidth (the amount of smoothing); each data point contributes a lump to the density estimate, and the choice of bandwidth matters more than the choice of kernel (see the sketch below). The bandwidth can be chosen by cross-validation, as with the histogram estimator. Confidence bands can be constructed for both the histogram and kernel density estimators, as two functions giving the upper and lower bands.
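A minimal kernel density estimation sketch with a Gaussian kernel, showing the effect of the bandwidth (the data and bandwidths are my own choices):

```python
import numpy as np

def kde(x_grid, data, h):
    """Gaussian kernel density estimate with bandwidth h."""
    # Each data point contributes a normal "lump" centered on it.
    z = (x_grid[:, None] - data[None, :]) / h
    lumps = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return lumps.mean(axis=1) / h

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(1, 1.0, 300)])
grid = np.linspace(-5, 5, 200)

for h in (0.05, 0.3, 2.0):  # under-, moderately, and over-smoothed
    density = kde(grid, data, h)
    print(h, density.max())
```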
Density estimators generalize to multiple dimensions, but the curse of dimensionality means the data requirement grows exponentially with the number of regressors. The book recommends additive models to avoid the curse of dimensionality: sum functions that each take one variable (or possibly a few interaction effects) at a time.
Chapter 21. A set of functions can form a vector space where the inner product of two functions is the integral of their product over some interval, and the norm is the square root of the inner product of a function with itself. Two functions are orthogonal if their inner product is zero. An orthonormal basis can be defined in several ways, such as using cosine functions or Legendre polynomials. Density estimation can be done by fitting a linear combination of orthogonal basis functions to the data; it is recommended to use only up to about the square root of n functions to avoid overfitting. A similar method can be used for regression, fitting the regression function as a sum of orthogonal basis functions; this works best when x is relatively evenly spaced. Haar wavelets are a different set of orthogonal basis functions that are non-smooth and useful for fitting functions with local discontinuities.
Chapter 22. In classification (for the binary case), denote the regression function P(Y=1 | x). The Bayes rule classifies as 1 when the regression function exceeds 1/2 and 0 otherwise; if the true regression function is known, this rule gives the optimal decision boundary. Quadratic discriminant analysis (QDA) fits a Gaussian to each class and predicts the class whose Gaussian likelihood is highest. In the special case where the covariance matrices of all classes are equal, this becomes linear discriminant analysis (LDA), because the decision boundary is linear. Linear and logistic regression can also produce classifiers, and these are similar to LDA in that they also produce linear decision boundaries. Tree-based methods construct a decision tree by splitting the data points to minimize the Gini index, a measure of impurity.
To assess the error rate, k-fold cross-validation is a good way to estimate the true error. Empirical risk minimization chooses, from a class of classifiers, the one that minimizes the training error; as the sample size grows, the chosen classifier’s error rate converges to the best error rate achievable in the class. Hoeffding’s inequality bounds how far the training error can deviate from the true error, giving a rate of convergence. When the class of classifiers is continuous (e.g., the set of linear classifiers), the VC dimension generalizes this analysis by measuring complexity as the size of the largest set the class can shatter, i.e., for which it can realize all possible labelings.
A support vector machine (SVM) finds a hyperplane in X that maximizes the margin between positive and negative instances; the points on the margin are called support vectors. Kernelization, useful for many classifiers including SVMs, replaces the inner product in all calculations with a kernel, which is equivalent to transforming the data to a higher-dimensional space and finding a hyperplane there (without actually performing the transformation); this yields a nonlinear classifier.
Chapter 23. A stochastic process is a collection of random variables through time (time here is called the index set and can be discrete or continuous). In a Markov chain, time and states are discrete, and the transition probabilities depend only on the previous state; transitions can be written as matrix multiplication, and the state distribution after n steps can be calculated by multiplying the initial distribution vector by the n-th power of the transition matrix. A state is recurrent if, starting there, the chain is guaranteed to eventually return; otherwise, it is transient. An ergodic Markov chain converges to the same stationary distribution no matter the starting state (see the sketch below). When the chain is ergodic, transition probabilities can be estimated from data, and the estimator is consistent. A Poisson process (or counting process) has continuous time and increasing values, such that the number of arrivals in an interval follows a Poisson distribution governed by a time-dependent intensity function, and the inter-arrival times follow an exponential distribution.
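A small sketch of finding an ergodic chain’s stationary distribution, both by iterating the transition matrix and via the left eigenvector; the transition matrix here is hypothetical:

```python
import numpy as np

# A small Markov chain: transition matrix P, rows sum to 1.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.3, 0.7]])

# Distribution after n steps: row vector times P^n.
mu = np.array([1.0, 0.0, 0.0])  # start in state 0
for _ in range(200):
    mu = mu @ P
print(mu)  # converges to the stationary distribution for this ergodic chain

# The stationary distribution also solves pi = pi P (left eigenvector of P
# with eigenvalue 1, normalized to sum to 1).
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
print(pi / pi.sum())
```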
Chapter 24. Monte Carlo integration: instead of evaluating an integral analytically, randomly sample points, evaluate the function at them, and take an average. This is useful in many Bayesian inference problems, where complex integrals can be handled by simulation. Importance sampling applies when integrating over a distribution f that we cannot sample from: we pick another distribution g that we can sample from and reweight by a factor of f/g (see the sketch below). This can be inaccurate if there is a region where f is large but g is very small, so the ideal g has a shape similar to f but with heavier tails.
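A sketch of plain Monte Carlo versus importance sampling for a normal tail probability; the shifted-normal proposal g is my own choice for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 1_000_000

# Estimate P(X > 4) for X ~ N(0, 1), a rare event where plain Monte Carlo
# wastes almost all of its samples.
plain = (rng.normal(size=N) > 4).mean()

# Importance sampling: draw from g = N(4, 1), which covers the tail,
# and reweight each sample by f(x) / g(x).
x = rng.normal(4.0, 1.0, size=N)
weights = stats.norm.pdf(x) / stats.norm.pdf(x, loc=4.0)
importance = ((x > 4) * weights).mean()

print(plain, importance, 1 - stats.norm.cdf(4))  # true value ~3.17e-5
```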
Markov chain Monte Carlo (MCMC) constructs a Markov chain whose stationary distribution is the distribution we want to sample from. In the Metropolis-Hastings algorithm, we draw from a proposal distribution and then, according to the relative probability, either transition to the new value or stay at the current value. In the long run it converges to the intended distribution, but it may get stuck on the same values for many steps if it does not mix well, so there are variations of the algorithm that work better in different cases. One such variation is Gibbs sampling, which is effective in high-dimensional spaces and works by sampling one dimension at a time, conditioned on the others.
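A minimal random-walk Metropolis sketch targeting an unnormalized two-mode density of my own choosing (the proposal is symmetric, so the Hastings correction drops out):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target density (up to a constant): an unnormalized mixture of two normals
# with modes near -2 and 3 and mixture weights 2/3 and 1/3.
def target(x):
    return np.exp(-0.5 * (x + 2) ** 2) + 0.5 * np.exp(-0.5 * (x - 3) ** 2)

# Propose a nearby value; accept with probability
# min(1, target(proposal) / target(current)).
samples = np.empty(50_000)
x = 0.0
for i in range(len(samples)):
    proposal = x + rng.normal(0, 1.5)  # symmetric proposal distribution
    if rng.random() < target(proposal) / target(x):
        x = proposal                   # accept; otherwise stay put
    samples[i] = x

# Should be near the target mean of -1/3, though mixing between the two
# well-separated modes is slow, illustrating the "stuck chain" issue above.
print(samples.mean())
```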