Regression analysis
From Wikipedia, the free encyclopedia

In statistics, regression analysis examines the relation of a dependent variable (response variable) to specified independent variables (explanatory variables). The mathematical model of their relationship is the regression equation. The dependent variable is modeled as a random variable because of uncertainty as to its value, given only the value of each independent variable. A regression equation contains estimates of one or more hypothesized regression parameters ("constants"). These estimates are constructed using data for the variables, such as from a sample. The estimates measure the relationship between the dependent variable and each of the independent variables. They also allow estimating the value of the dependent variable for a given value of each respective independent variable.

Uses of regression include curve fitting, prediction (including forecasting of time-series data), modeling of causal relationships, and testing scientific hypotheses about relationships between variables.

History of regression

The term "regression" was used in the nineteenth century to describe a biological phenomenon, namely that the progeny of exceptional individuals tend on average to be less exceptional than their parents, and more like their more distant ancestors. Francis Galton studied this phenomenon and applied the slightly misleading term "regression towards mediocrity" to it. For Galton, regression had only this biological meaning, but his work[1] was later extended by Udny Yule and Karl Pearson to a more general statistical context.[2]

Simple linear regression

Figure: Illustration of linear regression on a data set (red points).

The general form of a simple linear regression is

y_i=\alpha+\beta x_i +\varepsilon_i

where α is the intercept, β is the slope and \varepsilon_i is the error term, which picks up the unpredictable part of the response variable y_i. The error term is usually taken to be normally distributed. The x's and y's are the data quantities from the sample or population in question, and α and β are the unknown parameters to be estimated from the data. Estimates for the values of α and β can be derived by the method of ordinary least squares. The method is called "least squares" because the estimates of α and β minimize the sum of squared residuals, \sum(y_i-\widehat{\alpha}-\widehat{\beta}x_i)^2, for the given data set. The estimates of α and β are often denoted by \widehat{\alpha} and \widehat{\beta} or by their corresponding Roman letters. It can be shown (see Draper and Smith, 1998, for details) that the least squares estimates are given by

\hat{\beta}=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}

and

\hat{\alpha}=\bar{y}-\hat{\beta}\bar{x}

where \bar{x} is the mean of the x values and \bar{y} is the mean of the y values.
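
As a minimal sketch of these two formulas, the following Python snippet computes both estimates for a small data set. The function name `simple_ols` and the data are illustrative assumptions, not part of the cited derivation.

```python
from statistics import mean

def simple_ols(xs, ys):
    """Return (alpha_hat, beta_hat) for the model y = alpha + beta * x + error."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    beta_hat = s_xy / s_xx
    # Intercept: makes the fitted line pass through the point (x_bar, y_bar).
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
alpha_hat, beta_hat = simple_ols(xs, ys)
print(alpha_hat, beta_hat)
```

Because the made-up points lie exactly on a line, the estimates recover its intercept and slope; with noisy data they would only approximate them.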

Generalizing simple linear regression

The simple model above can be generalized in different ways.

* The number of predictors can be increased from one to several. See linear regression.

* The relationship between the knowns (the xs and ys) and the unknowns (α and the βs) can be nonlinear. See non-linear regression.

* The response variable may be non-continuous. For binary (zero or one) variables, there are the probit and logit models. The multivariate probit model makes it possible to estimate jointly the relationship between several binary dependent variables and some independent variables. For categorical variables with more than two values there is the multinomial logit. For ordinal variables with more than two values, there are the ordered logit and ordered probit models. An alternative to such procedures is linear regression based on polychoric or polyserial correlations between the categorical variables. Such procedures differ in the assumptions made about the distribution of the variables in the population. If the variable is a non-negative count of how often an event occurs, typically with low values, count models such as Poisson regression or the negative binomial model may be used.

* The error term may follow a distribution other than the normal distribution. See generalized linear model.

* The form of the right hand side can be determined from the data. See Nonparametric regression. These approaches require a large number of observations, as the data are used to build the model structure as well as estimate the model parameters. They are usually computationally intensive.

Regression diagnostics

Once a regression model has been constructed it is important to confirm the goodness of fit of the model and the statistical significance of the estimated parameters. Commonly used checks of goodness of fit include R-squared, analyses of the pattern of residuals and construction of an ANOVA table. Statistical significance is checked by an F-test of the overall fit, followed by t-tests of individual parameters.
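
A minimal sketch of two of these checks, R-squared and the overall F statistic, is given below using NumPy. The function name `fit_diagnostics` and the data are hypothetical, and the design matrix is assumed to already contain a column of ones for the intercept.

```python
import numpy as np

def fit_diagnostics(X, y):
    """Least-squares fit plus residuals, R-squared and the overall F statistic.

    X must already contain a leading column of ones for the intercept.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    ss_res = float(residuals @ residuals)          # variation left unexplained
    ss_tot = float(((y - y.mean()) ** 2).sum())    # total variation around the mean
    r_squared = 1.0 - ss_res / ss_tot
    # F compares explained variation per slope parameter with
    # residual variation per residual degree of freedom.
    f_stat = ((ss_tot - ss_res) / (p - 1)) / (ss_res / (n - p))
    return beta, residuals, r_squared, f_stat

# Hypothetical noisy data scattered around a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.6, 2.9, 3.6, 3.9, 4.6, 4.9])
X = np.column_stack([np.ones_like(x), x])
beta, residuals, r_squared, f_stat = fit_diagnostics(X, y)
print(f"R^2 = {r_squared:.3f}, F = {f_stat:.1f}")
print("residuals:", np.round(residuals, 3))
```

The residuals are returned so that their pattern can be inspected, for instance by plotting them against the fitted values; the F statistic would then be compared with an F distribution on (p - 1, n - p) degrees of freedom.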

Estimation of model parameters

The parameters of a regression model can be estimated in many ways. The most common are

* the method of least squares
* the method of maximum likelihood and
* Bayesian methods

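For a model with normally distributed errors, the method of least squares and the method of maximum likelihood give the same parameter estimates. Without the normality assumption, the Gauss-Markov theorem still guarantees that least squares gives the best linear unbiased estimates, provided the errors are uncorrelated and have zero mean and constant variance.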
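
Why least squares and maximum likelihood agree under normal errors can be seen in a short derivation, sketched here for the simple linear model with independent errors \varepsilon_i \sim N(0,\sigma^2). The symbol \ell for the log-likelihood is notation introduced for this sketch.

```latex
\ell(\alpha,\beta,\sigma^2)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i-\alpha-\beta x_i\right)^2
```

For any fixed \sigma^2 the first term does not involve α or β, and the second term is a negative multiple of the sum of squared errors. Maximizing \ell over α and β is therefore the same as minimizing \sum(y_i-\alpha-\beta x_i)^2, which is the least-squares criterion.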

Interpolation and extrapolation

Regression models predict a value of the y variable given known values of the x variables. Prediction within the range of x values used to construct the model is known as interpolation. Prediction outside that range is known as extrapolation and is riskier, because the data give no check on whether the fitted relationship still holds there.
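
For a single predictor the distinction reduces to a range check, sketched below. The helper name `classify_prediction` and the data are hypothetical; with several predictors a point can lie inside each variable's range and still be far from the observed data, so this check is only a first approximation.

```python
def classify_prediction(x_new, x_observed):
    """Label a prediction point relative to the range of the data used for fitting."""
    lo, hi = min(x_observed), max(x_observed)
    return "interpolation" if lo <= x_new <= hi else "extrapolation"

observed = list(range(58, 73))             # hypothetical predictor values 58..72
print(classify_prediction(65, observed))   # lies inside the observed range
print(classify_prediction(80, observed))   # lies outside the observed range
```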

Assumptions underpinning regression

Regression analysis depends on certain assumptions:

1. The predictors must be linearly independent, i.e. it must not be possible to express any predictor as a linear combination of the others. See multicollinearity.

2. The error terms must be independent and, for the usual t-tests and F-tests to be exact, normally distributed.

3. The variance of the error terms must be constant.
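
The first assumption can be checked numerically by testing whether the matrix of predictors has full column rank. The sketch below uses NumPy's `matrix_rank`; the helper name `has_independent_columns` and the data are hypothetical. It detects only exact (or numerically exact) linear dependence, not predictors that are merely highly correlated.

```python
import numpy as np

def has_independent_columns(X):
    """True if no column of X is a linear combination of the other columns."""
    X = np.asarray(X, dtype=float)
    return bool(np.linalg.matrix_rank(X) == X.shape[1])

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
ones = np.ones_like(x1)

ok = np.column_stack([ones, x1, x2])
# Third column equals 2*x1 + 3*ones, so it adds no new information.
bad = np.column_stack([ones, x1, 2.0 * x1 + 3.0])

print(has_independent_columns(ok))    # full column rank
print(has_independent_columns(bad))   # rank deficient
```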

Examples

To illustrate one of the goals of regression, prediction, we give a worked example.

Prediction of future observations

The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).
Height (in) 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
Weight (lbs) 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164

We would like to see how the weight of these women depends on their height. We are therefore looking for a function η such that Y=\eta(X)+\varepsilon, where Y is the weight of the women and X their height. Intuitively, we can guess that if the women's proportions are constant and their density too, then the weight of the women must depend on the cube of their height.
Figure: A plot of the data set confirms this supposition.

\vec{X} will denote the vector containing all the measured heights (\vec{X}=(58,59,60,\dots)) and \vec{Y}=(115,117,120,\dots) the vector containing all the measured weights. We suppose that the errors \varepsilon_i are uncorrelated with each other and have zero mean and constant variance, which means the Gauss-Markov assumptions hold. We can therefore use the least-squares estimator, i.e. we are looking for coefficients \beta_0, \beta_1 and \beta_2 satisfying as well as possible (in the least-squares sense) the equation:

\vec{Y}=\beta_0 + \beta_1 \vec{X} + \beta_2 \vec{X}^3+\vec{\varepsilon}

Geometrically, what we will be doing is an orthogonal projection of Y on the subspace generated by the variables 1, X and X^3. The matrix \mathbf{X} is constructed simply by putting a first column of 1's (the constant term in the model), a column with the original values (the X in the model) and a third column with these values cubed (X^3). The realization of this matrix (i.e. for the data at hand) can be written:
1 x x^3
1 58 195112
1 59 205379
1 60 216000
1 61 226981
1 62 238328
1 63 250047
1 64 262144
1 65 274625
1 66 287496
1 67 300763
1 68 314432
1 69 328509
1 70 343000
1 71 357911
1 72 373248

The matrix (\mathbf{X}^t \mathbf{X})^{-1} (sometimes called the "dispersion matrix"; its inverse \mathbf{X}^t \mathbf{X} is sometimes called the "information matrix") is:

\left[\begin{matrix} 1.9\cdot10^3&-45&3.5\cdot 10^{-3}\\ -45&1.0&-8.1\cdot 10^{-5}\\ 3.5\cdot 10^{-3}&-8.1\cdot 10^{-5}&6.4\cdot 10^{-9} \end{matrix}\right]

Vector \widehat{\beta}_{LS} is therefore:

\widehat{\beta}_{LS}=(X^tX)^{-1}X^{t}y= (147, -2.0, 4.3\cdot 10^{-4})

hence \eta(X) = 147 - 2.0 X + 4.3\cdot 10^{-4} X^3
Figure: A plot of this function shows that it lies quite close to the data set.
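
The fit can be reproduced with a few lines of NumPy, sketched below under two choices that are not taken from the text: `lstsq` is used to solve the least-squares problem instead of evaluating (X^tX)^{-1}X^{t}y literally, and (X^tX)^{-1} is obtained from a QR factorization, because the columns 1, x and x^3 differ so much in scale that inverting X^tX directly loses accuracy in floating point. The variable names are illustrative.

```python
import numpy as np

heights = np.arange(58, 73, dtype=float)   # 58, 59, ..., 72 inches
weights = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                    139, 142, 146, 150, 154, 159, 164], dtype=float)

# Design matrix with the columns 1, x and x^3 of the model.
X = np.column_stack([np.ones_like(heights), heights, heights ** 3])

# Least-squares estimate; lstsq works on X itself rather than on X^t X.
beta_hat, *_ = np.linalg.lstsq(X, weights, rcond=None)

# (X^t X)^{-1} from the QR factorization X = QR:
# X^t X = R^t R, hence (X^t X)^{-1} = R^{-1} R^{-t}.
_, R = np.linalg.qr(X)
R_inv = np.linalg.inv(R)
XtX_inv = R_inv @ R_inv.T

print("beta_hat:", beta_hat)
print("(X^t X)^{-1}:")
print(XtX_inv)
```

Up to rounding, the printed values should agree with the coefficient vector and the matrix quoted in this example.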

The confidence intervals are computed using:

[\widehat{\beta_j}-\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}};\widehat{\beta_j}+\widehat{\sigma}\sqrt{s_j}t_{n-p;1-\frac{\alpha}{2}}]

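with:

\widehat{\sigma}=0.36 (the residual standard error; \widehat{\sigma}^2 is the residual sum of squares divided by n-p)
s_1=1.9\cdot 10^3, s_2=1.0, s_3=6.4\cdot 10^{-9}\; (the diagonal elements of (\mathbf{X}^t \mathbf{X})^{-1}, corresponding to \beta_0, \beta_1 and \beta_2)
\alpha=5\%
t_{n-p;1-\frac{\alpha}{2}}=2.2 (with n-p=15-3=12 degrees of freedom)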

Therefore, we can say that the 95% confidence intervals are:

\beta_0\in[112 , 181]

\beta_1\in[-2.8 , -1.2]

\beta_2\in[3.6\cdot 10^{-4} , 4.9\cdot 10^{-4}]
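
The interval formula can be evaluated numerically as sketched below. The sketch refits the model so that it is self-contained; the variable names are illustrative, the diagonal of (X^tX)^{-1} is taken from a QR factorization for numerical stability (a choice made here, not in the text), and the t quantile is hard-coded as an assumption (2.179 is the 0.975 quantile of Student's t distribution with 12 degrees of freedom, which the text rounds to 2.2).

```python
import numpy as np

heights = np.arange(58, 73, dtype=float)
weights = np.array([115, 117, 120, 123, 126, 129, 132, 135,
                    139, 142, 146, 150, 154, 159, 164], dtype=float)
X = np.column_stack([np.ones_like(heights), heights, heights ** 3])
n, p = X.shape                                  # 15 observations, 3 parameters

beta_hat, *_ = np.linalg.lstsq(X, weights, rcond=None)
residuals = weights - X @ beta_hat
# Residual standard error: sqrt(residual sum of squares / (n - p)).
sigma_hat = float(np.sqrt(residuals @ residuals / (n - p)))

# Diagonal of (X^t X)^{-1} = R^{-1} R^{-t}, i.e. row-wise sums of squares of R^{-1}.
_, R = np.linalg.qr(X)
R_inv = np.linalg.inv(R)
s = np.sum(R_inv ** 2, axis=1)

T_QUANTILE = 2.179   # assumed value: t quantile for n - p = 12, 1 - alpha/2 = 0.975
half_width = sigma_hat * np.sqrt(s) * T_QUANTILE
lower, upper = beta_hat - half_width, beta_hat + half_width

print(f"sigma_hat = {sigma_hat:.3f}")
for j in range(p):
    print(f"beta_{j} in [{lower[j]:.3g}, {upper[j]:.3g}]")
```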

Population regression function

The population regression function (PRF) describes the regression line of the population; the sample regression function (SRF) is its estimate computed from a sample. In the two-variable linear case the SRF can be written as

Y_i=\widehat{\beta}_1+\widehat{\beta}_2 X_i+\widehat{u}_i

where \widehat{\beta}_1 and \widehat{\beta}_2 are the estimated parameter values, X_i is the explanatory variable and \widehat{u}_i is the (sample) estimated residual. The corresponding PRF is

Y_i=\beta_1+\beta_2 X_i+u_i

where \beta_1 and \beta_2 are the parameter values and u_i is the stochastic error. Regression analysis uses these functions to determine how the "average value of the dependent variable (or regressand) varies with the given value of the explanatory variable (or regressor)." The stochastic version of the PRF is critical for empirical studies: stochastic means that a disturbance term is added to the function to capture the part of Y that X does not explain.

See also

* Segmented regression
* Confidence interval
* Extrapolation
* Kriging
* Forecasting
* Prediction interval
* Statistics
* Trend estimation
* Robust regression
* Multivariate normal distribution
* Important publications in regression analysis

Notes

1. ^ Francis Galton. "Typical laws of heredity", Nature 15 (1877), 492-495, 512-514, 532-533. (Galton uses the term "reversion" in this paper, which discusses the size of peas.); Francis Galton. Presidential address, Section H, Anthropology. (1885) (Galton uses the term "regression" in this paper, which discusses the height of humans.)
2. ^ G. Udny Yule. "On the Theory of Correlation", J. Royal Statist. Soc., 1897, p. 812-54. Karl Pearson, G. U. Yule, Norman Blanchard, and Alice Lee. "The Law of Ancestral Heredity", Biometrika (1903). In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was weakened by R.A. Fisher in his works of 1922 and 1925 (R.A. Fisher, "The goodness of fit of regression formulae, and the distribution of regression coefficients", J. Royal Statist. Soc., 85, 597-612 from 1922 and Statistical Methods for Research Workers from 1925). Fisher assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.

References

* Audi, R., Ed. (1996). "curve fitting problem," The Cambridge Dictionary of Philosophy. Cambridge, Cambridge University Press. pp. 172-173.
* William H. Kruskal and Judith M. Tanur, ed. (1978), "Linear Hypotheses," International Encyclopedia of Statistics. Free Press, v. 1: Evan J. Williams, "I. Regression," pp. 523-41; Julian C. Stanley, "II. Analysis of Variance," pp. 541-554.

* Lindley, D.V. (1987). "Regression and correlation analysis," New Palgrave: A Dictionary of Economics, v. 4, pp. 120-23.
* Birkes, David and Yadolah Dodge, Alternative Methods of Regression. ISBN 0-471-56881-3
* Chatfield, C. (1993) "Calculating Interval Forecasts," Journal of Business and Economic Statistics, 11. pp. 121-135.
* Draper, N.R. and Smith, H. (1998). Applied Regression Analysis. Wiley Series in Probability and Statistics.
* Fox, J. (1997). Applied Regression Analysis, Linear Models and Related Methods. Sage
* Hardle, W., Applied Nonparametric Regression (1990), ISBN 0-521-42950-1
* Meade, N. and T. Islam (1995) "Prediction Intervals for Growth Curve Forecasts," Journal of Forecasting, 14, pp. 413-430.
* Gujarati, Basic Econometrics, 4th edition
* S. Kotsiantis, D. Kanellopoulos, P. Pintelas, Local Additive Regression of Decision Stumps, Lecture Notes in Artificial Intelligence, Springer-Verlag, Vol. 3955, SETN 2006, pp. 148 – 157, 2006
* S. Kotsiantis, P. Pintelas, Selective Averaging of Regression Models, Annals of Mathematics, Computing & TeleInformatics, Vol 1, No 3, 2005, pp. 66-75

Software

* All major statistical software packages, e.g. SAS System, SPSS, Minitab, or Stata, perform various types of regression analysis.
* Simpler regression can be done in spreadsheets like MS Excel or OpenOffice.org Calc.
* Experts can run complex types of regression using special programming languages like Mathematica, R programming language or Matlab.
* There are a number of software programs that perform specialized forms of regression.
* There are a number of web sites that allow online linear and nonlinear regression.

External links

* Curvefit: A complete guide to nonlinear regression - Online textbook
* Exegeses on Linear Models - Some comments on linear regression models by Bill Venables.
* Mazoo's Learning Blog - Example of linear regression. Shows how to find the linear regression equation, variances, standard errors, coefficients of correlation and determination, and confidence interval.
* Regression of Weakly Correlated Data - How linear regression mistakes can appear when Y-range is much smaller than X-range


Retrieved from "http://en.wikipedia.org/wiki/Regression_analysis"

* This page was last modified 13:10, 28 August 2007.
* All text is available under the terms of the GNU Free Documentation License. (See Copyrights for details.)

Saturday, September 1, 2007