The purpose of this analysis was to develop a regression model to predict mortality. Data was collected by researchers at General Motors on 60 U.S. Standard Metropolitan Statistical Areas (SMSAs) in a study of whether air pollution contributes to mortality. This data was obtained and randomly sorted into two equal groups of 30 cities. A regression model to predict mortality was built from the first set of data and validated against the second set of data.
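The random split into a model-building set and a validation set can be sketched as follows. This is only an illustration: the city labels, the seed, and the function name split_in_half are hypothetical, and the report does not say how the randomization was actually carried out.

```python
import random


def split_in_half(items, seed=0):
    """Shuffle a copy of `items` and return two equal-sized halves."""
    shuffled = list(items)
    # A seeded generator makes the split reproducible (the seed is an assumption).
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical placeholder labels standing in for the 60 SMSAs.
cities = [f"SMSA_{i:02d}" for i in range(1, 61)]
build_set, validation_set = split_in_half(cities, seed=42)
print(len(build_set), len(validation_set))  # prints: 30 30
```

Fixing the seed means the same 30 cities land in the model-building set on every run, so the validation results can be reproduced.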


The following variables were found to be the key drivers in the model:
• Mean July temperature in the city (degrees F)
• Mean relative humidity of the city
• Median education
• Percent of white collar workers
• Median income
• Sulfur dioxide pollution potential
The objective in this analysis was to find the straight-line (linear) model, using the variables mentioned above, for which the sum of squared deviations between the observed and predicted values of mortality is smaller than for any other straight-line model, with the differences between the observed and predicted values of mortality summing to zero. Once found, this "Least Squares Line" can be used to estimate mean mortality, or to predict mortality, for given values of the above variables. Each of the key variables was checked for bell-shaped symmetry about the mean (normality), for a linear (straight-line) relationship when graphed, and for equal variance (equal squared deviations of measurements about the mean). After determining whether to exclude data points, the following model was determined to be the best model:
-3276.108 + 862.9355x1 - 25.37582x2 + 0.599213x3 + 0.0239648x4 + 0.01894907x5 - 41.16529x6 + 0.3147058x7 +
See list of independent variables on TAB #1. This model was validated against the second set of data where it was determined that, with 95% confidence, there is significant evidence to conclude that the model is useful for predicting mortality.
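The least squares criterion described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the study: the data values are made up, only two predictors are used, and the helper names solve, fit_least_squares and predict are hypothetical. It solves the normal equations directly, which is adequate for a small, well-conditioned example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back-substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, row):
    """Evaluate the fitted linear model for one observation."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], row))


# Made-up data: two predictors and a response with small deviations added.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0]
y = [10 + 2 * x1 - 0.5 * x2 + e for (x1, x2), e in zip(rows, noise)]

coefs = fit_least_squares(rows, y)
residuals = [yi - predict(coefs, r) for r, yi in zip(rows, y)]
print([round(c, 3) for c in coefs])
print(abs(sum(residuals)) < 1e-8)  # residuals sum to zero when an intercept is fitted
```

Because the model includes an intercept, the residuals of the fitted line sum to zero up to rounding error, which is the condition referred to in the description of the least squares line.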

Although this model, when validated, is deemed suitable for estimation and prediction, as noted by the 5% error ratio (TAB #2), there are significant concerns about the model. First, although the percent of sample variability that can be explained by the model, as noted by the R² value on TAB #3, is 53.1%, after adjusting this value for the number of parameters in the model, the percent of explained variability is reduced to 38.2% (TAB #3). The remaining variability is not explained by the model and is attributed to random error. Second, it appears that some of the independent variables are contributing redundant information because they are correlated with other independent variables, a condition known as multicollinearity. Third, it was determined that an outlying observation (a value lying more than three standard deviations from the mean) was influencing the estimated coefficients.
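The adjustment from 53.1% to 38.2% can be checked with the standard adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - k - 1). The sketch below assumes n = 30 observations (the model-building set) and k = 7 predictors (the seven x terms shown in the fitted equation); neither value is stated alongside the figures, so both are assumptions.

```python
def adjusted_r_squared(r2, n, k):
    """Adjust R-squared for the number of predictors k, given n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


r2 = 0.531  # R-squared reported on TAB #3
n = 30      # assumption: size of the model-building set
k = 7       # assumption: number of x terms in the fitted equation

adj = adjusted_r_squared(r2, n, k)
print(round(adj, 3))  # prints: 0.382
```

Under these assumptions the formula reproduces the reported 38.2%, which illustrates how heavily seven predictors are penalized when only 30 observations are available.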

In addition to the observed problems above, it is unknown how the sample data was obtained. It is assumed that the values of the independent variables were uncontrolled, indicating observational data. With observational data, a statistically significant relationship between a response y and a predictor variable x does not necessarily imply a cause and effect relationship. This is why a designed experiment would produce better results. With a designed experiment, we could, for instance, control the time period that the data corresponds to. Data covering a longer period of time would likely improve the consistency of the data, because it would dampen the effect of any extreme or unusual data in the current time period. Also, assuming that the percent of white collar workers is negatively correlated with pollution, we do not know how the cities were selected. The optimal selection of cities would include an equal number of white collar cities and non-white-collar cities.

Furthermore, assuming that high temperature is correlated with mortality, an optimal selection of cities would include an equal number of northern cities and southern cities.


The model has been tested and validated on a second set of data. Although there are some limitations to the model, it appears to provide good results at the 95% confidence level. If time had permitted, different combinations of independent variables could have been tested in order to increase the R² value and decrease the multicollinearity (mentioned above). However, until more time can be allocated to this project, the results obtained from this model can be deemed appropriate.
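One simple way to screen candidate variables for multicollinearity is to look at pairwise correlations between predictors. The sketch below is illustrative only: the data values are made up, the column names and the helper correlated_pairs are hypothetical, and the 0.8 cutoff is a common rule of thumb rather than a value taken from this analysis. A pairwise screen is also weaker than a full diagnostic such as variance inflation factors, since it cannot detect redundancy spread across three or more variables.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged


# Made-up values for three hypothetical predictor columns.
columns = {
    "median_education": [9.0, 10.5, 11.0, 12.0, 12.5, 13.0],
    "pct_white_collar": [35.0, 40.0, 43.0, 47.0, 50.0, 52.0],
    "july_temp_f": [78.0, 70.0, 82.0, 74.0, 80.0, 72.0],
}
flagged = correlated_pairs(columns)
print(flagged)
```

In this made-up example only the education and white collar columns are flagged, which suggests that one of the two could be dropped or combined before refitting.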



In order to select the best model, several exercises were implemented. Sometimes, data transformations are performed on y values to make them more nearly satisfy the important model assumptions listed