
Quantitative Research Methods

Quantitative Research Methods
Oliver Fris Bjørling: 202207312
Julius Jebjerg: 202208510
Exam prep notes:
Definitions
SRS:
Independence:
Trustworthiness:
Variance homogeneity:
Residuals:
Prediction interval:
Exam summary:
Variables:
Nominal variable:
Ordinal variable:
Continuous variable:
Exception:
Syllabus relevant tests:
Goodness of fit test / representative test
Independence test:
Anova
Two Factor Anova
Regression:
Assignment walkthroughs:
Introduction to the data:
Assignment 1: Simple linear regression
a) Describe the relationship between salary and age in a scatter plot and assess whether the assumptions for linear regression are met. Discuss how the model could be reformulated to avoid any assumption-related issues.
Assumptions (Base assumptions for “sample”)
Assumptions specifically concerning linear regression (Guideline)
1. There needs to be linearity between x and y.
2. Examine if the x variable has a uniform distribution.
3. Examine if the y variable is normally distributed.
→ Make residual analysis:
1. Residuals are normally distributed (make residual plot and discuss)
2. Expected value for residuals is equal to 0. Average of residuals is equal to 0.
3. Homoscedasticity (variance in residuals is constant)
4. Independence between residuals
b) Based on this analysis, develop a relevant linear regression model and explain which variable is the dependent variable.
c) Test whether age has a significant effect on salary (test the model).
d) Interpret the results of the model:
e) Estimate a 95% confidence interval for the effect that age has on salary.
The following confidence interval is set up by: $845,45 \pm t_{100-1-1;\,0,05/2} \cdot 76,22$
Assignment 2: Transformations in linear regression + PI + CI
a) Add the quadratic terms for age to the model from the previous task and assess whether they should be included.
Calculating t_obs
b) Calculate the expected salary and a corresponding 95% confidence interval for a 30-year-old employee in the model you find appropriate. Also, determine a prediction interval.
c) In the two variables Sam_løn and Sam_alder, a random selection of Danes is used to represent the salaries and ages of the population. Create the model $y = b_0 + b_1 \cdot x + b_2 \cdot x^2$ and determine at what age the maximum point (top point) is reached based on this data.
First we examine an uncentered polynomial:
Examining the centered polynomial
Assignment 3 - Multiple linear regression
a) Estimate the model below and briefly assess the assumptions for the analysis, including whether there might be an issue with multicollinearity.
Assumptions for the model:
X variables must be uniformly distributed
Y-variable needs to be normally distributed:
Residual analysis:
1) Homoscedasticity:
2) Normally distributed residuals
3) Independent residuals
4) Expected value of residuals = 0
Estimating the model:
b) Argue for the actual (reduced) model.
c) Compare the reduced model with the original model from 4.a. Which one is preferable?
d) In the final reduced model, please investigate if there are any interaction effects that should be included.
Assignment 4: Variance analyses (one-way ANOVA / 1-factor ANOVA)
a) Calculate the mean and variance of the salary for employees on each of the 3 machines and test for homogeneity of variances for the three samples, specifying the assumptions for the test in question b precisely.
Assumptions for one-way ANOVA:
b) Conduct a relevant test to determine if the salaries are the same for employees working on different machines.
Anova 1 way
Assignment 5: Variance analyses (Two-way/two-factor ANOVA)
a) Formulate a Two-factor ANOVA model and test whether there is an interaction between gender and marital status when explaining salary.
The following assumptions are tied to the analysis:
1. Normally distributed populations or n > 30 (look at the distributions)
2. Even population sizes (approximately)
3. Fmax F-test (Hartley): variance homogeneity between the groups
b) If an interaction is concluded, it should be commented upon. If it is concluded that there is no interaction, the model should be reduced and commented upon.
Assignment 6: Logistic regression
a) Develop a model that can explain whether an employee is satisfied or not.
Assumptions for logistic regression:
1. It is required that the dependent variable is categorically binary
2. Approximately the same number of observations in group 1 and 0.
3. There must not be multicollinearity between the x variables
b) Based on the final (reduced) model, interpret and assess the results.
c) What is the odds ratio for men being satisfied compared to women being satisfied?
d) What is the odds ratio for each additional unit of currency earned?
e) What is the probability that an average, married, female employee in machine 1 becomes satisfied? (Not all information may be necessary for the prediction.)
Assignment 7: Chi-squared and goodness of fit test
Following χ² assumptions will be discussed:
1. Mutually exclusive groups
2. Independence between groups
3. Rule of five
Assignment 8: Chi-squared independence test:
a) Conduct a relevant test to determine if the gender distribution across different machines differs from each other. Comment on the results you find.
Assignment 9: Forecasting
a) Estimate a Trend model that explains the development in average salary and use the model to predict the average salary in January 2019. Evaluate whether the model exhibits autocorrelation.
b) Assess whether there is a possibility to enhance the trend model by incorporating seasonality and use the model to predict the average salary for January 2016.
c) Estimate an autoregressive model where average salary is explained by previous periods' salaries. Use the model to predict the average salary for January 2019.
d) Which model provides the best estimate for January 2019?
JMP Guide:
Scatterplot
For regression line (after creating scatter plot):
For log(y) regression line (improving model if needed):
Checking for improvement in distribution (log-level model)
Can add a little extra (Normal Distribution and Quantile plot):
Parameter estimates
Checking variable distribution in JMP (Uniform/Normal distribution)
Residual plot in JMP
Residual analysis with multiple variables
Saving residuals to dataset (Test for distribution):
Confidence interval in JMP
Or for confidence intervals in the dataset:
Inserting quadratic terms for improving correlation:
Prediction interval:
Test for multicollinearity
One-way/One-factor Anova
Two-way/Two-factor ANOVA:
Logistic Regression
Chi squared:
Forecasting
Exam prep notes:
1) Theoretical contributions surrounding JMP calculations should only be included if you have extra
time.
Definitions
SRS:
Simple random sampling
Independence:
Trustworthiness:
Variance homogeneity:
Difference between t-test and F-test:
The t-test is used to compare the means of two populations. In contrast, the F-test is used to compare two population variances.
Residuals:
The distance between the observations and the linear regression line. If the residuals = 0, then all
our observations fall on the regression line (which is not a realistic scenario).
Prediction interval:
What happens if the dataset receives one more observation?
Exam summary:
Y = X + X + X
=> Dependent variable = independent variable + independent variable + independent variable
______________________________________________________________________________
Variables:
Nominal variable:
Categorical variable without rank order (e.g. sex) (0-1)
Ordinal variable:
Categorical variable with a clear rank order (for example first to third place) (1-3)
Continuous variable:
A variable where it makes sense to take the mean (average). Typically decimal numbers (for example weight or height).
Exception:
Sex (gender) is the only variable that can function both as a continuous variable and as a categorical variable. That's just how it is.
____________________________________________________________________________________
Syllabus relevant tests:
Goodness of fit test / representative test
Categorical = % or proportions. For example, we test a 50/50 split or given proportions, e.g. group 1 is 10%, group 2 is 30%, and so on.
Independence test:
Categorical variable = Categorical variable
For example: Opinion about the environment (1-5) vs. whether you have a MacBook (0-1)
Logistic regression
Categorical variable = interval + interval (minimum 1 x variable)
(with no x variable it is instead a goodness of fit test)
Example: Whether you continue directly onto an economics master's degree (0-1) = age + sex + weight
Anova
Interval = Categorical → One-way Anova
Interval = Categorical + Categorical → Two-way Anova
Two Factor Anova
Interval variable = Categorical + Categorical + Interaction → Two Factor ANOVA
Regression:
Continuous = Continuous → Simple linear regression
Continuous = Continuous + Continuous → Multiple regression
Continuous = Time → Forecasting / Autoregressive
______________________________________________________________________________
Assignment walkthroughs:
Introduction to the data:
The dataset "Arbejdsliv" contains information on 100 randomly selected employees from the company Sleepy, a manufacturing company that produces beds and bed accessories. The information on salary, age, machine, promotion, gender, marital status, and average salary was collected from contracts and other company records, whereas experience and satisfaction were collected through a questionnaire survey of the employees. The variables in the dataset used in this assignment are the following:
A: ID number (i = 1 to 100)
B: Løn (the employee's salary in DKK)
C: Alder (the employee's age in years)
D: Erfaring (the number of years of experience the employee has in the industry)
E: Maskine (indicates the machine the employee works on, 1 = machine 1..3)
F: Kvinde (dummy indicating whether a person is a woman)
G: Uddannelseskategori (education category: 1 = none, 2 = primary school, 3 = short further education, 4 = long further education)
H: Gift (dummy indicating whether an employee is married)
I: Forfremmet (dummy indicating whether a promotion has been given)
J: Tilfreds (dummy indicating whether the employee is satisfied with their job)
Assignment 1: Simple linear regression
a) Describe the relationship between salary and age in a scatter plot and assess whether the
assumptions for linear regression are met. Discuss how the model could be reformulated to avoid
any assumption-related issues.
The older you are, the more you typically earn, until you reach retirement age, after which income decreases again. Whether this shows up depends on the data set: if the sample only includes ages up to 25, for example, the decrease around age 60 will not be visible.
The scatterplot shows a linear trend between age and income. However, no decreasing trend is depicted yet; with a data set covering higher ages we might see this expected pattern (in which case the model could be described by a polynomial).
Assumptions (Base assumptions for “sample”)
Simple random sample (SRS): 100 randomly chosen workers were picked for this sample, as described in the introduction to the data above. This assumption is therefore met.
Independence: "Have the respondents affected each other?" The introduction to the data explicitly states that the variables age and salary are derived from registration systems, so the observations could not have influenced each other.
Trustworthiness: Since the data comes from a reliable source (registration systems), there is no reason to believe the data is not trustworthy.
Assumptions specifically concerning linear regression. (Guideline)
1. There needs to be linearity between x and y.
From our graph, we can see that there is clear linearity between salary and age. There are no immediate signs that any other transformation is needed. We can see a positive correlation between the two variables, as expected: the older you are, the more money you earn. This also matches the real world, where we would expect a 45-year-old to earn more than a 25-year-old.
2. Examine if the x variable has a uniform distribution.
From the distribution above, we can see that the minimum age in our sample is 29 and the maximum age is 63. This matters because it means our sample does not include people at retirement age; the proportion of observations near retirement (60+) is also relatively low. From the distribution we can also see that the model is best at describing observations around the age of ≈ 45. The x variable is not perfectly uniform, but the middle observations are somewhat uniformly distributed.
Furthermore, there are very few observations aged 29-31 and 60+, so the model will describe these people less well. The model is better at explaining observations aged 32-56, given that ≈ 80% of our observations fall within this range.
3. Examine if the y variable is normally distributed
The graph shows the income distribution. Income is not normally distributed here, and income rarely is: the majority of our observations earn roughly 45.000, while a few observations earn much more than the average person, making the distribution right-skewed. We have thus identified an assumption violation, which makes the model less valid; the model's overall ability to describe the observations is weaker than it would be without the violation.
Later in the assignment, we will see that this assumption problem can be fixed.
→ Make Residual analysis:
1. Residuals are normally distributed (make residual plot and discuss)
Observing the residuals, we can see that they are somewhat normally distributed, though slightly right-skewed; this reflects that income itself is skewed to the right. The residuals are calculated as the distance from each observation to the regression line.
2. Expected value for residuals is equal to 0. Average of residuals is equal to 0.
(Normally true) Assuming that the residuals are somewhat normally distributed and we have a reasonable linear regression, without any indications of a need for transformation (for example by adding quadratic terms, i.e. a polynomial), it is fair to assume that the expected value of the residuals = 0.
3. Homoscedasticity (variance in residuals is constant)
(Look for a "trumpet" shape in the residual plot.)
Homoscedasticity means that the residuals have constant variance, i.e. that there is no heteroscedasticity (non-constant variance).
The plot shows that the variance cannot be said to be constant: the residuals grow larger as income increases. The residuals form a "trumpet shape", so the assumption of homoscedasticity does not hold; this is a case of heteroscedasticity.
The assumption is therefore violated, which reduces the overall reliability of the model compared to a model with no violated assumptions. The conclusions drawn from this model are therefore of lower quality.
4. Independence between residuals
To check for independence between residuals, we can simply take a look at the scatterplot above,
and look for any patterns in the observations. From the plot we can see that the residuals look
randomly distributed, thus there is independence between the residuals.
Reformulation of the model to eliminate assumption problems:
Level-Level: $y = b_0 + b_1 \cdot x$. When x increases by 1 unit, y increases by $b_1$ units. (N/A)
Log-Level: $\log(y) = b_0 + b_1 \cdot x$. When x increases by 1 unit, y increases by approximately $b_1 \cdot 100\%$. Can help IF y is NOT normally distributed. In an exam, this is the only relevant transformation.
Level-Log: $y = b_0 + b_1 \cdot \log(x)$. When x increases by 1 %, y increases by $b_1/100$ units. Not usually used.
Log-Log: $\log(y) = b_0 + b_1 \cdot \log(x)$. When x increases by 1 %, y increases by $b_1$ %. Not usually used.
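A quick numerical illustration of the log-level row (a sketch, not from the notes themselves; the coefficients are made up): if $\log(y) = b_0 + b_1 \cdot x$, then a one-unit increase in x multiplies y by $\exp(b_1)$, which is roughly a $b_1 \cdot 100\%$ increase for small $b_1$.

```python
import numpy as np

# Log-level model: log(y) = b0 + b1 * x
# Illustrative (made-up) coefficients, not estimates from the Arbejdsliv data.
b0, b1 = 10.0, 0.02

x = np.array([30.0, 31.0])   # x increases by one unit
y = np.exp(b0 + b1 * x)      # back-transform to the level of y

print(y[1] / y[0])           # exp(b1) = 1.0202..., i.e. about a 2% increase
```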
The following problems arose with our assumptions:
1) The y variable is not normally distributed (right skew)
2) The variation in the residuals is not constant (heteroscedasticity)
If the model is to eliminate the problems with our assumptions, then instead of the level-level model $\hat{y} = b_0 + b_1 \cdot x$, the following reformulation is recommended: $\log(\hat{y}) = b_0 + b_1 \cdot x$
Here we evaluate whether using the logged value of y resolves the problems with our assumptions. We examine the assumptions that were breached:
Overall, we can see that the model is usable, although there isn't a big difference in the usability of the model. Furthermore, the correlation between log(salary) and age is strong and positive.
We now examine whether the violation of the normality assumption for salary has been improved:
From the histogram, we can see that the distribution is now a lot smoother with the log-level model than it was earlier, i.e. nearly normally distributed. This is supported by the normal quantile plot above the distribution, where the majority of our observations fall within the expected limits. From the boxplot, we can likewise see that the distribution is now somewhat normally distributed.
b) Based on this analysis, develop a relevant linear regression model and explain which
variable is the dependent variable.
(From the question formulation, we can choose either the level-level model or the log-level model, given that nothing is explicitly stated in the assignment. For ease of interpretation, we will continue with the level-level (original) model.)
Model formulation:
$y = \beta_0 + \beta_1 \cdot x_1 + \varepsilon$ → the "true" model.
The true model cannot be estimated, since it includes a residual term that we cannot estimate. We are therefore limited to the estimated model:
$\hat{y} = b_0 + b_1 \cdot x_1$ → the estimated model.
Inserting our variables, we get:
$\widehat{Løn} = b_0 + b_1 \cdot Alder$ (salary explained by age)
Salary (Løn) is the dependent variable, since it is the variable we are trying to explain; age is the independent variable. From here we examine whether salary can in fact be explained by age. Since we are working with a linear regression, the following section tests whether there is a linear relation.
c) Test whether age has a significant effect on salary (test the model).
To test whether age has a significant effect on salary, we set up the following hypotheses:
$H_0: \beta_1 = 0$ (slope of the regression line = 0)
$H_1: \beta_1 \neq 0$
The above hypotheses test whether there is a linear relation between x and y. Alpha is set to 0,05, because nothing else is stated in the assignment. The results of the test are as follows:
From the parameter estimates, we can see that the p-value for age is significant (<0,0001*). The p-value is below our significance level (0,05), so we reject the null hypothesis in favor of H1. In other words, we can conclude that age has a significant linear relationship with salary.
With calculations (only if explicitly asked for in the assignment, or if there is extra time):
$t_{obs} = \frac{b_1 - 0}{SE_{b_1}}$
Inserting values (from the parameter estimates):
$t_{obs} = \frac{845,45 - 0}{76,22} = 11,09$
The critical value is calculated:
$t_{n-k-1;\,\alpha/2} = t_{100-1-1;\,0,05/2} = t_{98;\,0,025} = 1,984$
Is $t_{obs}$ within the critical limits?
Our observed value of 11,09 is far larger than the critical value of 1,984, so we reject the null hypothesis in favor of H1. Thus we can conclude that there is a linear relation, given that our slope is not 0.
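The same t-test can be checked outside JMP. A minimal Python sketch using scipy, assuming only the estimate and standard error quoted above:

```python
from scipy import stats

b1, se_b1 = 845.45, 76.22    # slope estimate and standard error from the JMP output
n, k, alpha = 100, 1, 0.05   # sample size, number of x variables, significance level

t_obs = (b1 - 0) / se_b1                          # 11.09
t_crit = stats.t.ppf(1 - alpha / 2, n - k - 1)    # t(98; 0.025) = 1.984
p_value = 2 * stats.t.sf(abs(t_obs), n - k - 1)   # two-sided p-value, far below 0.05

print(t_obs, t_crit, p_value)   # t_obs well beyond t_crit -> reject H0
```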
d) Interpret the results of the model:
From the linear regression, we got the following results:
$\widehat{Løn} = b_0 + b_1 \cdot Alder$
Inserting values from our parameter estimates gives:
$\widehat{Løn} = 8182,75 + 845,45 \cdot Alder$
From the above, we can see that each time age increases by 1 year, salary increases by 845,45. This seems reasonable, since you earn more the older you are. Note the starting value (b0) of 8182,75: it means that when age = 0, you earn 8182,75. This makes no practical sense, as it is very rare for a 0-year-old to make any money.
The model only encompasses observations in the age range 29-63, so b0 has no explanatory power on its own. Instead we can examine how much a 29-year-old earns, as this is the youngest observation in our sample:
$Løn_{29} = 8182,75 + 845,45 \cdot 29 = 32.700$
This makes far more sense and is more useful for the interpreter. Furthermore, it seems fair to assume that a 29-year-old makes 32.700.
It is important to emphasize that our model had two large assumption violations: the distribution of the y variable and the absence of homoscedasticity (constant residual variance). The results above are therefore based on a model that does not fulfill the expected assumptions, which means they should not be the basis of any large decisions.
e) Estimate a 95% confidence interval for the effect that age has on salary.
The following confidence interval is set up by:
$b_j \pm t_{n-k-1;\,\alpha/2} \cdot SE_{b_j}$
In this case, we are asked to set up a confidence interval for the slope coefficient for age, which is $b_1$:
$b_1 \pm t_{n-k-1;\,\alpha/2} \cdot SE_{b_1}$
The assumptions for this confidence interval are the same ones we went through in the questions above; refer back to those.
$845,45 \pm t_{100-1-1;\,0,05/2} \cdot 76,22$
Working it out:
$\beta_1 \in (694,2;\ 996,7)$
We can then say with 95% confidence that the slope coefficient for age with respect to income is between 694 and 997. Since 0 is not included in this interval, this further indicates that age is a significant factor in explaining wage.
In JMP:
Go to Analyze → Fit model → Income in Y, x in construct model effects → Run → red triangle → Regression reports → Show all confidence intervals → then scroll to the bottom.
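A sketch of the interval computation in Python, again using only the reported estimate and standard error:

```python
from scipy import stats

b1, se_b1, n, k = 845.45, 76.22, 100, 1
t_crit = stats.t.ppf(0.975, n - k - 1)   # t(98; 0.025) = 1.984

lower = b1 - t_crit * se_b1   # about 694.2
upper = b1 + t_crit * se_b1   # about 996.7

print(lower, upper)   # 0 is not in the interval -> age is significant
```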
Assignment 2: Transformations in linear regression + PI + CI
a) Add the quadratic terms for age to the model from the previous task and assess whether they
should be included.
Based on the previous task, it is expected that adding the quadratic terms will not improve the link between income and age, because the data set only covers ages 29-63 (no downturn near retirement is visible). We examine this formally:
Model formulation:
$\hat{y} = b_0 + b_1 \cdot Alder + b_2 \cdot Alder^2$
Using the above equation, we can examine whether the quadratic term benefits the conclusion. (It is recommended to use Fit Model when working with multiple quadratic terms.)
To test whether the quadratic term contributes, we form the following hypotheses:
$H_0: \beta_j = 0$
$H_1: \beta_j \neq 0$
We are testing whether the slope coefficients are equal to 0, with the help of the parameter estimates.
According to the parameter estimates, the current model is not significant, so the model should not be extended with a polynomial term; there is no quadratic relation between income and age. The p-values for age and age² are above our alpha level (5%), so we fail to reject H0. This indicates that the slope coefficients are equal to 0.
Calculating $t_{obs}$:
$t_{obs} = \frac{b_1 - 0}{SE_{b_1}} = \frac{525,26 - 0}{775,32} = 0,68$
$t_{obs} = \frac{b_2 - 0}{SE_{b_2}} = \frac{3,60 - 0}{8,60} = 0,42$
The critical limit is known from the previous assignment:
$t_{n-k-1;\,\alpha/2} = t_{100-2-1;\,0,05/2} = 1,985$
From the above, we can see that both observed values fall within the critical limits, so we fail to reject H0.
This supports the conclusion in the JMP output. Thus we do not deem it necessary to include the quadratic term.
b) Calculate the expected salary and a corresponding 95% confidence interval for a 30-year-old
employee in the model you find appropriate. Also, determine a prediction interval.
The model that fits best is clearly the original model without the quadratic terms:
$\widehat{Løn} = b_0 + b_1 \cdot Alder$
First we find our prediction interval:
$\hat{y} \pm t_{n-2;\,\alpha/2} \cdot s_e \cdot \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{(n-1) \cdot s_x^2}}$
JMP is then used to calculate the above:
Analyze → Fit model → Add y → add x → Run → red triangle → Save columns → Indiv Confidence Limit Formula → adds prediction intervals to the dataset for each value.
From the above, we can set up the following interval:
95% prediction interval = [20.085; 47.007]
This means that if the company hires a NEW employee who is 30 years old, he/she would with 95% certainty have a salary between 20.085 and 47.007 kr.
Prediction interval: what happens if the data set gets one more observation?
Confidence interval: the confidence interval is set up for a current employee who is 30 years old:
$\hat{y} \pm t_{n-2;\,\alpha/2} \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{(n-1) \cdot s_x^2}}$
Use JMP to calculate the above confidence interval (red triangle → Save columns → Mean Confidence Limit Formula).
From the above we can set up the following confidence interval:
95% CI = [31.009; 36.084]
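For understanding, the two formulas can also be coded directly. The summary statistics below (the residual standard error and the mean and variance of age) are hypothetical placeholders, not values from the JMP output:

```python
import numpy as np
from scipy import stats

# Placeholder inputs -- replace with the values from your own JMP output.
n = 100
x_bar, s2_x = 45.0, 64.0       # mean and variance of age (hypothetical)
s_e = 6700.0                   # residual standard error (hypothetical)
b0, b1 = 8182.75, 845.45       # estimates from Assignment 1
x0 = 30                        # the new 30-year-old employee

y_hat = b0 + b1 * x0
t_crit = stats.t.ppf(0.975, n - 2)
core = 1 / n + (x0 - x_bar) ** 2 / ((n - 1) * s2_x)

# Prediction interval (a NEW employee) and confidence interval (mean salary at age 30):
pi = (y_hat - t_crit * s_e * np.sqrt(1 + core),
      y_hat + t_crit * s_e * np.sqrt(1 + core))
ci = (y_hat - t_crit * s_e * np.sqrt(core),
      y_hat + t_crit * s_e * np.sqrt(core))

print(pi, ci)   # the PI is always wider than the CI
```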
c) In the two variables Sam_løn and Sam_alder, a random selection of Danes is used to represent the salaries and ages of the population. Create the model $y = b_0 + b_1 \cdot x + b_2 \cdot x^2$ and determine at what age the maximum point (top point) is reached based on this data.
$\widehat{Sam\_løn} = b_0 + b_1 \cdot Sam\_alder + b_2 \cdot Sam\_alder^2$
A polynomial as above can also be written as:
$y = ax^2 + bx + c$
This is used when we have to calculate our top point. Below we show how to calculate the top point for a centered and an uncentered polynomial:
Top point (uncentered polynomial): $x_{top} = \frac{-b_1}{2 \cdot b_2}$
Top point (centered polynomial): $x_{top} = \bar{x} - \frac{b_1}{2 \cdot b_2}$
First we examine an uncentered polynomial:
In JMP: Analyze → Fit model → Add y variable → add x variable → right click x variable → Transform → Square → add squared x variable → Run
From the output above, we use the values to calculate the top point:
$x_{top} = \frac{-4816,2277}{2 \cdot (-45,81)} = 52,56$
From the above, the top point is 52,56, i.e. approximately 53 years old is the age at which one reaches the largest possible income (for this data set). The income for this person can be calculated by:
$\widehat{Sam\_løn} = b_0 + b_1 \cdot x_{top} + b_2 \cdot x_{top}^2$
$\widehat{Sam\_løn} = -73739,28 + 4816,2277 \cdot 52,56 - 45,81 \cdot 52,56^2 = 52.837$
Calculating the precise income gives 52.837, which is a realistic value for a person of 53 years of age.
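A sketch of the top-point arithmetic with the estimates quoted above:

```python
b0 = -73739.28                # intercept from the uncentered polynomial
b1, b2 = 4816.2277, -45.81    # linear and quadratic coefficients

top_age = -b1 / (2 * b2)                           # vertex of the parabola, about 52.6 years
top_income = b0 + b1 * top_age + b2 * top_age**2   # maximum salary, about 52,800 kr

print(top_age, top_income)
```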
Examining the centered polynomial
Top point (centered polynomial): $x_{top} = \bar{x} - \frac{b_1}{2 \cdot b_2}$
In JMP: Analyze → Fit Y by X → Add y variable → add x variable → red triangle → Fit polynomial → quadratic
Using JMP we get the above polynomial. From the plot, we can roughly estimate our top point to be around 50, so our earlier calculation of 52,56 appears correct and appropriate.
We can now calculate the top point for the centered polynomial (it should be the same):
$x_{top} = \bar{x} - \frac{b_1}{2 \cdot b_2} = 46,47 - \frac{558,13}{2 \cdot (-45,81549)} = 52,56$
Assignment 3 - Multiple linear regression
a) Estimate the model below and briefly assess the assumptions for the analysis, including
whether there might be an issue with multicollinearity.
$Løn = \beta_0 + \beta_1 \cdot Alder + \beta_2 \cdot Erfaring + \beta_3 \cdot Kvinde + \beta_4 \cdot Uddannelse + \varepsilon$
The model above is the true model. We can then do the analysis on the estimated model:
$\widehat{Løn} = b_0 + b_1 \cdot Alder + b_2 \cdot Erfaring + b_3 \cdot Kvinde + b_4 \cdot Uddannelse$
In this section we evaluate whether there are problems with multicollinearity. Furthermore, the assumptions surrounding multiple regression will be discussed.
We start by testing multicollinearity:
From the correlation table above, we can read the correlations between our variables. A correlation lies between -1 and 1, where 1 is a perfect positive correlation (when one variable increases, the other increases proportionally) and -1 is a perfect negative correlation (the opposite).
From the x variables, we can see that all of them (excluding gender) have a strong correlation with Løn, since all have a correlation above 0,60. This was expected, as age, experience, and education all plausibly explain salary; in reality, salary does depend on age, experience, and length of education.
Note: It is good if the x variables have a large correlation with the y variable.
Note: It is BAD if the x variables have a large internal correlation. That means the x variables are good at explaining each other, which is not what we want, since we want to explain the y variable. Internal x-variable correlation creates "noise" in the analysis.
We can see that age and experience have a high correlation of 0,85. The two variables describe each other very well, which can mean that one of them is not needed in the regression analysis.
Moreover, the correlation between experience and education is also relatively large. This doesn't make much sense: with a long education you would expect less experience, since you have spent more time on education rather than working.
Our woman variable has little to no correlation with the other x variables, but it also explains little of our income variable (the y variable).
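In JMP this is the Multivariate correlation table. A pandas sketch of the same check; the file name is hypothetical, while the column names follow the dataset description in the introduction:

```python
import pandas as pd

# Assumes the Arbejdsliv dataset has been exported, e.g. to a CSV file (hypothetical name).
df = pd.read_csv("arbejdsliv.csv")

cols = ["Løn", "Alder", "Erfaring", "Uddannelseskategori", "Kvinde"]
corr = df[cols].corr()
print(corr.round(2))
# Look for: high correlation of each x with Løn (good), and high correlation
# between x variables, such as Alder/Erfaring = 0.85 (multicollinearity risk).
```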
Assumptions for the model:
X variables must be uniformly distributed
Age:
Age was already checked for uniform distribution in the previous assignment (assumptions concerning linear regression). There we found that age was not perfectly uniformly distributed and that it ranges from 29 to 63.
Experience:
Experience does not follow a uniform distribution, because there are many observations with 5-10 years of experience. The model will therefore be skewed towards explaining these observations.
Women:
Sex can take a uniform distribution.
Education:
Education looks reasonably uniformly distributed. There are a few more observations with a lower education than with a higher one, but given such a slight difference, it is assumed not to have a significant effect on the results.
Y-variable needs to be normally distributed:
This was covered in the previous question, so please refer back to that.
Residual analysis:
1) Homoscedasticity:
From the residual plot above, we can evaluate whether the residuals have constant variance. There is no immediate "trumpet" shape or other sign of heteroscedasticity in the plot. The variation lies between -10.000 and +10.000, which seems alright.
2) Normally Distributed residuals
From the distribution, we can see that the residuals are almost normally distributed, with the exception of a single observation that can be classified as an outlier (the black dot in the box plot). This is good for our test, as we want normally distributed residuals.
3) Independent residuals
There is no clear indication of pattern in our residuals. We therefore assume the residuals to be
independent.
4) Expected value of residuals = 0
Since our residuals are (approximately) normally distributed, we can conclude that the expected value is also 0. To be exact it is 3,383e-12, i.e. not exactly 0 but close enough.
Estimating the model:
The model is now estimated using JMP:
Analyze → Fit model → add y variable → add x variables → Run
In the multiple regression, we first need to examine whether the "whole model test" is significant, i.e. we test the model as a whole (including all variables):
$H_0: \beta_1 = \beta_2 = \beta_3 = \ldots = \beta_k = 0$
$H_1: \text{at least one } \beta_j \neq 0$
The test thus asks whether the entire model is insignificant, or whether there is at least one significant variable.
The assumptions were gone through in sub-question a).
The following belongs to the hypothesis test above.
Theoretical contribution about the JMP calculations (only include if you have time):
$F_{obs} = \frac{MS_{model}}{MS_{error}} = 114,76$ (the mean squares are read from the JMP Analysis of Variance table)
The observed F value of 114,76 is compared to the critical limit:
$F_{k;\,n-k-1;\,\alpha} = F_{4;\,100-4-1;\,0,05} = F_{4;\,95;\,0,05} = 2,467$
k = number of x variables, n = sample size, α = significance level
With the above calculations, we can see that the F value is outside the critical limit, so we reject H0 in favor of H1. This means that at least one slope coefficient is significant. We now examine which variable(s) are significant:
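The critical value can be reproduced with scipy; a sketch using the degrees of freedom above:

```python
from scipy import stats

n, k, alpha = 100, 4, 0.05
f_obs = 114.76                                   # MS(model) / MS(error) from the JMP output
f_crit = stats.f.ppf(1 - alpha, k, n - k - 1)    # F(4, 95; 0.05) = 2.467

print(f_obs > f_crit)   # True -> reject H0: at least one slope differs from 0
```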
From the above parameter estimates, we can see that age and gender aren't significant. This means the variables "age" and "woman" cannot be used to explain the development in salary.
In assignment 1 we found that age was significant, yet now age is not significant: two contrasting conclusions. The explanation is that in our multicollinearity analysis, age and experience were correlated at 0,85. When we investigate income using age and experience at the same time, experience explains everything about wage that age does, which makes the variable "age" redundant and therefore insignificant. (Everything age could explain in assignment 1 can be explained by experience, plus a little more.)
An example of how to calculate the JMP conclusions (only for understanding):
$t_{obs} = \frac{b_j - 0}{SE_{b_j}} = \frac{-1097,50 - 0}{847,35} = -1,30$
Critical value = $t_{n-k-1;\,\alpha/2} = 1,985$ (from a previous assignment)
k = number of x variables, n = sample size, α = significance level
Based on the above, we fail to reject H0 for this coefficient, i.e. it is insignificant. This means that "woman" cannot explain salary.
b) Argue for the actual (reduced) model.
The model is as follows:
We can see that not all variables are significant, so we conclude it necessary to remove the insignificant variables. The model is reduced until only significant variables are left.
We removed age and women from the model: first age, due to its high p-value, and then women, due to it being insignificant. Above we see the results of the fully significant model. First, the "whole model test" (Analysis of Variance) is significant. Furthermore, all the individual x variables are significant (parameter estimates). We can thus conclude that salary can be explained by experience and education. The model has a degree of explanation of 0,82, meaning experience and education can explain 82% of the variation in salary, which is fairly high.
It is important to note that there was an assumption violation, with salary (the y variable) not being normally distributed (see assignment 1). The above conclusion rests on this violation.
With more than one x variable, we use the adjusted R² rather than the plain R².
We can then set up the following linear regression model:
$\widehat{Løn} = 28.851 + 912,82 \cdot Erfaring + 1373,96 \cdot Uddannelse$
From the above, when experience and education are 0, income equals 28.851, which is plausible for an uneducated individual without experience (at the age of 29). Furthermore, every extra year of experience increases salary by 912,82, and likewise an added year of education increases salary by 1373,96.
c) Compare the reduced model with the original model from 4.a. Which one is preferable?
d) In the final reduced model, please investigate if there are any interaction effects that should be
included.
Assignment 4: Variance analyses (one-way ANOVA / 1-factor ANOVA)
a) Calculate the mean and variance of the salary for employees on each of the 3 machines and
test for homogeneity of variances for the three samples, specifying the assumptions for the test in
question b precisely.
We start by examining the distributions. This is done through JMP:
Analyze → Distribution → Add y variable → Add x variable in "By" → OK
For the first group (Løn where Maskine = 1), there is not a normal distribution but a left-skewed one; the majority of the observations fall above the middle (50.000).
For machine 2, the mean is 42.459 and the distribution is relatively normal.
For machine 3, the mean is 40.026 and the distribution is right-skewed.
From the above, we have evidence indicating that workers on machine 1 earn more than those working on the other 2 machines. This is examined through a one-way ANOVA.
Assumptions for one-way ANOVA:
Refer back to assignment 1, where the base assumptions (trustworthiness, SRS, independence) have already been discussed.
There need to be approximately equally large groups. Machine 1 has 35 observations, machine 2 has 33, and machine 3 has 32, which are approximately even groups.
We now test for variance homogeneity (equal variances) between the groups. This is done through Hartley's F-test (the Fmax test):
$H_0: \text{equal variances}$
$H_1: \text{unequal variances}$
$F_{obs} = \frac{s^2_{max}}{s^2_{min}}$
When we divide the maximum by the minimum, the result is always above 1, so we only need to test against the upper critical limit.
We start by calculating the observed value:
$F_{obs} = \frac{9344^2}{6888^2} = 1,84$
Calculating the critical limit:
$F_{n_1-1;\,n_2-1;\,\alpha^*/2}$
where n1 is the sample size belonging to the max standard deviation and n2 the sample size belonging to the min standard deviation.
Alpha-star is calculated below (k = number of groups):
$\alpha^* = \frac{2 \cdot \alpha}{k(k-1)} = \frac{2 \cdot 0,05}{3(3-1)} = 0,01667 \Rightarrow \frac{\alpha^*}{2} = 0,00833$
$F_{36-1;\,32-1;\,0,00833} = 2,381$
We can see that $F_{obs}$ = 1,84 and the critical limit = 2,381. The observed value falls within the critical limit, so we fail to reject H0. Thus the variances are equal, and this assumption is fulfilled.
The significance level is corrected according to the Bonferroni principle: more tests increase the risk of making a type 1 error, so the per-test significance level is tightened. This correction is typically used in Hartley's F-test and in simultaneous confidence intervals.
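The Fmax statistic and its Bonferroni-corrected critical value can be checked with scipy; a sketch using the standard deviations and group sizes quoted above:

```python
from scipy import stats

s_max, s_min = 9344, 6888   # largest and smallest group standard deviations
n_max, n_min = 36, 32       # sizes of the corresponding groups (as used in the notes)
k, alpha = 3, 0.05          # number of groups, overall significance level

f_obs = s_max**2 / s_min**2                # 1.84
alpha_star = 2 * alpha / (k * (k - 1))     # 0.01667 (Bonferroni correction)
f_crit = stats.f.ppf(1 - alpha_star / 2, n_max - 1, n_min - 1)   # about 2.38

print(f_obs < f_crit)   # True -> fail to reject H0: variances are homogeneous
```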
b) Conduct a relevant test to determine if the salaries are the same for employees working on
different machines.
Anova 1 way
$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1: \text{at least 2 means are different}$
This can be written as a regression model:
$y = \mu + \alpha_i + \varepsilon_{ij}$
where
$H_0: \alpha_i = 0$
The assumptions have already been covered in previous assignments, so we only conduct the test.
Calculations that JMP does (for understanding):
The following observed value belongs to the hypotheses: $F_{obs} = \frac{MS_{machine}}{MS_{error}} = \frac{1,8 \cdot 10^9}{63.198.039} = 29,1$
This is then tested against the critical limit:
$F_{k-1;\,n-k;\,\alpha} = F_{3-1;\,100-3;\,0,05} = F_{2;\,97;\,0,05} \approx 3,09$
Our observed F value is larger than the critical limit, so we can conclude that at LEAST one of the machine groups has a different mean than the others. In other words, at least one of the machines has a salary different from the others.
The analysis of variance shows the same conclusion, through the significant p-value, which indicates that at least one group has a significantly different mean. We reject H0 in favor of H1.
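A sketch of the critical-value check with the corrected degrees of freedom (k = 3 machine groups, n = 100):

```python
from scipy import stats

f_obs = 29.1                        # MS(machine) / MS(error) from the JMP output
f_crit = stats.f.ppf(0.95, 2, 97)   # F(2, 97; 0.05), about 3.09

print(f_obs > f_crit)   # True -> at least one machine has a different mean salary
```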
We now examine WHERE the difference lies, i.e., where the mean is different from the others.
From the plot we can see that machine 1 has a significantly higher salary than the other machines. It is harder to conclude whether machines 2 and 3 share the same mean or are significantly different, so we use the following outputs to examine this:
From the LSMeans table: machine 1 has a mean income of 53.809, machine 2 has a mean of 42.459, and machine 3 a mean of 40.026.
The question is now whether machines 2 and 3 are significantly different or can be statistically viewed as equal:
In the ANOVA model: → machine red triangle → LSMeans Student's t-test
From the table we can see the following:
The difference between machines 2 and 1 is significant at 11.350, as expected.
The difference between machines 3 and 1 is 13.783.
The difference between machines 2 and 3 is 2.432, which is NOT significant. Thus the salaries of the employees working machines 2 and 3 can be (statistically) viewed as equal.
The conclusion is therefore that the workers on machine 1 earn significantly more than the others.
Assignment 5: Variance analyses (Two-way/two-factor ANOVA)
a) Formulate a Two-factor ANOVA model and test whether there is an interaction between
gender and marital status when explaining salary.
Alternative formulation: estimate the following model (a possible exam question format):
$Løn = \beta_0 + \beta_1 \cdot Kvinde + \beta_2 \cdot Gift + \beta_3 \cdot EG + \varepsilon$
where EG is the interaction variable between gender and marital status.
To test whether there is a difference in salary with respect to marital status and gender, we can set up the following model:
$y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$
where gamma ($\gamma_{ij}$) is the interaction between gender and marital status.
The following assumptions are tied to the analysis:
1. Normally distributed populations or n>30 (look at the distributions)
With the aid of JMP we can examine whether the above assumption is met. (In the exam, if you have a lot of graphs, merely screenshot the "ugly" ones for discussion.)
Note: with this many distributions, I have chosen to discuss the distributions that are furthest from the assumption of normally distributed populations.
The first distribution, for women who are not married, shows that the population is not normally distributed. Furthermore, there is a gap between 45.000 and 50.000, which in principle means that we cannot explain anything for an unmarried woman who earns between 45.000 and 50.000.
It gets even worse in the next distribution, for married women, where we observe multiple gaps around 60.000 and 70.000. The assumption of normally distributed populations is therefore not met, which will have negative effects on any later conclusion.
2. Even population sizes (approximately)
We can see that the groups are approximately equal: $n_1 = 23,\ n_2 = 26,\ n_3 = 24,\ n_4 = 27$
The above samples are approximately equal, so the assumption is met.
3. Fmax F-test (Hartley): variance homogeneity between the groups
We now test whether the variance between the groups is equal:
$H_0: \text{equal variances}$
$H_1: \text{unequal variances}$
$F_{obs} = \frac{s^2_{max}}{s^2_{min}} = \frac{10291^2}{8980^2} = 1,313$
Calculating the critical limit:
$F_{n_1-1;\,n_2-1;\,\alpha^*/2} = F_{27-1;\,24-1;\,0,00417} = 3,0678$ (from the Excel template)
Alpha-star calculation (k = number of groups):
$\alpha^* = \frac{2 \cdot \alpha}{k(k-1)} = \frac{2 \cdot 0,05}{4(4-1)} = 0,00833 \Rightarrow \frac{\alpha^*}{2} = 0,00417$
From the above, the observed value falls within the critical limit, so the variances can be assumed to be equal. This is positive for the test, and the assumption of variance homogeneity is met.
The test can now be formulated (two-factor ANOVA model), with the following hypotheses:
$H_0: \text{no effect of marital status}$ vs. $H_1: \text{effect of marital status}$
$H_0: \text{no effect of gender}$ vs. $H_1: \text{effect of gender}$
$H_0: \text{no effect of the interaction}$ vs. $H_1: \text{effect of the interaction}$
NOTE: We ALWAYS need to examine the interaction first, and potentially reduce it first if possible. If the interaction between the variables is significant, we cannot reduce the model further. If, on the contrary, the interaction term is insignificant, we must remove it first.
We use JMP to examine this:
From the parameter estimates, we can see that the interaction term isn't significant, and furthermore that "Kvinde" (gender) isn't significant either.
We always reduce the interaction term first, but before doing so we examine the interaction plot:
The graph shows that the lines are reasonably parallel, which means there is no interaction between marital status and gender when explaining income. It is not the case that you earn more as a single man compared to a single woman, or as a single man compared to a married woman. The apparent pattern is: if you are married (regardless of gender), you earn more than if you are single; if you are single (regardless of gender), you earn less than if you are married.
On account of the above, we can now remove (reduce away) the interaction term. (Take the interaction term out of the model and run it again.)
By running the model we get the following results:
Now we can see that gender is insignificant, meaning that no matter what gender you are, you earn the same. This is in accordance with the results from the previous questions, where we concluded that there was no linear correlation between gender and income.
The model is reduced until only significant terms remain, so the next step is to take gender out of the model.
Taking the new (reduced) parameter estimates into consideration, we can see that the model is now significant, and that being married has a significant effect on salary. The relationship is as follows:
From the table, we can clearly see that being married has a significant effect on salary: for civilstatus = 0 (not married) the mean salary is 42.352, and for civilstatus = 1 (married) it is 48.580. There is a large difference between the two, where you make more money if you're married.
This makes a lot of sense, since most people get married later in their lives, which means that the people who are married are often also older. In an earlier question, we found that age was significant in explaining salary, which pairs well with the conclusion above: married people are often older, and older people earn more money.
To conclude, the following model is formulated:
$y = b_0 + b_1 \cdot Civilstatus$
The above model is significant in explaining salary.
b) If an interaction is concluded, it should be commented upon. If it is concluded that there is no
interaction, the model should be reduced and commented upon.
Question b) has been answered throughout sub-question a), so please refer back to question a) for the conclusion on interaction and the reduced model.
Assignment 6: Logistic regression
(Y variable is categorical, x variables are continuous)
The following variables can be considered as explanatory variables in the chosen model: salary,
age, experience, gender, machine, and marital status. No transformed forms and/or interaction
effects should be used.
a) Develop a model that can explain whether an employee is satisfied or not.
Model formulation:
We set up an estimated model and refer back to a previous assignment for explanation:
$\widehat{Tilfredshed} = b_0 + b_1 \cdot Løn + b_2 \cdot Alder + b_3 \cdot Erfaring + b_4 \cdot Kvinde + b_5 \cdot Maskine + b_6 \cdot Gift$
We now examine whether we can predict the likelihood of being satisfied or not, based on the above-mentioned variables. The dependent variable (satisfaction) is categorical, so the model is fitted with MLE (maximum likelihood estimation). The model estimates the likelihood of being satisfied.
Assumptions for logistic regression:
1. It is required that the dependent variable is categorically binary
This assumption is met, given that the variable "satisfaction" is categorical with two outcomes, which is also required. Furthermore, there is approximately an equal number of observations in 0 and 1, distributed as 45% unsatisfied and 55% satisfied.
2. Approximately the same number of observations in group 1 and 0.
This assumption is met, as seen by the distribution above having approximately the same number of observations on either side.
3. There must not be multicollinearity between the x variables
We want a large correlation between the x variables and the y variable, and a low correlation between the x variables. There is a fairly high correlation between satisfaction and salary. This is as expected, given that most people work to earn money, so salary is a big factor.
Furthermore, we can see that age and experience can also explain satisfaction. A little more questionable is the negative correlation between machine and satisfaction: the higher the machine number (1-3), the lower the satisfaction. This actually makes sense, given that we concluded in an earlier assignment that machine 1 is the most attractive to work at, since you earn more there than at the two other machines. We can also see that if you are a woman (1 = woman, 0 = man), satisfaction tends to be lower.
Between the x variables: the relations age vs. experience and experience vs. income have already been discussed in a previous question, so we focus on the rest. There is a large correlation between machine and income, which links nicely back to the previous question, where machine 1 was the one where employees earned the most. Furthermore, there is a correlation between age and marital status, which makes sense as discussed in an earlier question.
We test the following hypotheses. To start, we make a whole-model test, to check whether the model as a whole is significant:
$H_0: \beta_1 = \beta_2 = \ldots = 0$
$H_1: \text{at least one } \beta_j \neq 0$
We are testing whether all slope coefficients are equal to 0, i.e. whether at least one of them differs from 0. The model's assumptions have already been covered above (refer back to regression).
For this we can use JMP:
From the whole-model test, we can see that the p-value is significant. This means that the model as a whole is significant: there is at least one slope coefficient that is NOT equal to 0. We can now examine which coefficients are 0 and thus need to be reduced away.
From the parameter estimates, we can see that a fair number of slope coefficients are not significant. At first glance, it seems that gender is significant. The model is reduced until only significant slope coefficients are left.
We remove slope coefficients one by one in the following order (always removing the most insignificant): 1) machine, 2) marital status, 3) age, 4) experience.
The result of the model is as follows:
From the above, we can now see that salary and gender are significant, as their p-values fall below the significance level. This means that their respective slope coefficients aren't equal to 0 and that they thus help predict the likelihood of being satisfied.
The final model is: $\widehat{Tilfredshed} = -6,8935 + 0,00018074 \cdot Løn - 1,831 \cdot Kvinde$
Alternative model formulation: $P(y = 1\ (\text{satisfied})) = \frac{\exp(-6,8935 + 0,00018074 \cdot Løn - 1,831 \cdot Kvinde)}{1 + \exp(-6,8935 + 0,00018074 \cdot Løn - 1,831 \cdot Kvinde)}$
The model has also removed the issues with multicollinearity, so the model can now be said to be relatively strong.
b) Based on the final (reduced) model, interpret and assess the results.
In order to give the best possible explanation of the model, we use JMP to look at the pseudo R², the estimates, and the hit rate.
From our confusion matrix, we can calculate the hit rate and the most likely outcomes:
$\text{Hit rate} = \frac{44 + 35}{100} = \frac{79}{100} = 79\%$
Thus the model has a prediction accuracy of 79%. We would like the model to predict 25% more outcomes correctly than if we simply guessed. The guessing baseline is:
$P(y = 1) = \frac{55}{100} = 55\%$ and $P(y = 0) = \frac{45}{100} = 45\%$
If we just guessed (always picking the majority class), we would be correct roughly 55% of the time. The benchmark is therefore:
$55\% \cdot (1 + 0,25) = 0,6875 = 68,75\%$
Our model is significantly better, because its prediction accuracy of 79% exceeds 68,75%. This means we have a good model.
Furthermore, from the hit ratios we can see:
$P(\text{correct} \mid y = 1) = \frac{44}{55} = 0,80 = 80\%$
$P(\text{correct} \mid y = 0) = \frac{35}{45} = 0,78 = 78\%$
The model is almost as good at predicting satisfaction (y = 1) as it is at predicting dissatisfaction (y = 0).
Another expression of the model's quality is the pseudo R². From the JMP output we can see that $R^2_U = 0,3505$, which means the model can explain roughly 35% of the variation in employee satisfaction.
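The hit-rate bookkeeping as a sketch, using the confusion-matrix counts quoted above:

```python
correct_satisfied, total_satisfied = 44, 55       # y = 1
correct_unsatisfied, total_unsatisfied = 35, 45   # y = 0
n = total_satisfied + total_unsatisfied

hit_rate = (correct_satisfied + correct_unsatisfied) / n   # 0.79
baseline = max(total_satisfied, total_unsatisfied) / n     # 0.55, guessing the majority class
benchmark = baseline * 1.25                                # 0.6875, "25% better than guessing"

print(hit_rate > benchmark)                     # True -> the model clears the benchmark
print(correct_satisfied / total_satisfied,      # 0.80, hit rate for y = 1
      correct_unsatisfied / total_unsatisfied)  # 0.78, hit rate for y = 0
```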
Lastly, we can analyse the coefficients individually. When income rises, satisfaction also rises; this can be seen from the positive sign on income. On the other hand, when gender goes from 0 to 1 (0 = man, 1 = woman), satisfaction falls. This means men report higher satisfaction than women.
c) What is the odds ratio for men being satisfied compared to women being satisfied?
From our odds ratios we get the following:
$\text{Odds ratio} = \frac{\text{odds}(y = 1)_{woman}}{\text{odds}(y = 1)_{man}} = \exp(-1,831) = 0,1601$
This should be interpreted as: the odds of a woman being satisfied are 0,16 times (16% of) the odds of a man being satisfied.
We can also calculate the opposite:
$\text{Odds ratio} = \frac{\text{odds}(y = 1)_{man}}{\text{odds}(y = 1)_{woman}} = \exp(1,831) = 6,24$
From this, the odds of a man being satisfied are about 6 times the odds of a woman being satisfied.
d) What is the odds ratio for each additional unit of currency earned?
The odds of being satisfied are 1,000181 times larger for each extra DKK earned (read from the odds ratio output above).
e) What is the probability that an average, married, female employee in machine 1 becomes
satisfied? (Not all information may be necessary for the prediction.)
JMP: Analyze → Distribution → add y variable → change gender (x) to nominal → add x variable → run → Stack → take the mean value.
The average woman earns 45.670. We can then calculate the probability with JMP: from our logistic regression (reduced model) → red triangle (Nominal Logistic Fit) → Profiler.
From the profiler, we can see that a woman with an average salary has a 61,6% chance of being dissatisfied and a 38,4% chance of being satisfied.
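The profiler result can be verified by plugging the average female salary into the logistic function, using the reduced-model coefficients from above:

```python
import math

b0, b_salary, b_woman = -6.8935, 0.00018074, -1.831   # reduced-model estimates
salary, woman = 45670, 1                              # average woman's salary, female dummy

z = b0 + b_salary * salary + b_woman * woman
p_satisfied = math.exp(z) / (1 + math.exp(z))

print(p_satisfied)   # about 0.384, i.e. a 38.4% chance of being satisfied
```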
Assignment 7: Chi-squared and goodness of fit test
a) To assess whether the employees in the study are representative of the gender distribution in the company, please conduct a relevant test to determine if the gender distribution in the sample aligns with the entire company, considering the fact that there are an equal number of men and women.
In this question we are asked to examine whether gender follows a uniform (50/50) distribution. To start, we construct the following hypotheses:
$H_0: p_{men} = 0,50 \text{ and } p_{women} = 0,50$
$H_1: H_0 \text{ does not hold}$
The following χ² assumptions will be discussed:
1. Mutually exclusive groups
Gender is assumed mutually exclusive, as you cannot be both a man and a woman in this data set; it is either one or the other.
2. Independence between groups
We assume independence between observations: one person's gender does not influence the gender of the next person in the sample.
3. Rule of five
The rule of five means that the expected count in every group must be over 5.
"The percentage multiplied by the smallest group size": 50% · 49 = 24,5. This is larger than 5, so the expected value in both groups will be over 5.
To examine the following, the observed value is calculated.
𝑦
2
𝜒 =∑
𝑦
(𝑦𝑦 − 𝑦𝑦 )^2
𝑦𝑦
(𝑦𝑦𝑦𝑦𝑦 𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦 − 𝑦ℎ𝑦 𝑦𝑦𝑦𝑦𝑦 𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦 𝑦𝑦𝑦𝑦𝑦)^2
(𝑦ℎ𝑦 𝑦𝑦𝑦𝑦 𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦
=
. . . +. . .
𝑦ℎ𝑦 𝑦𝑦𝑦𝑦𝑦 𝑦𝑦𝑦𝑦𝑦𝑦𝑦𝑦 𝑦𝑦𝑦𝑦𝑦
(𝑦ℎ𝑦 𝑦𝑦𝑦𝑦 𝑦𝑦𝑦𝑦𝑦𝑦𝑦
(The red highlights are for understanding. DON'T add these in an exam)
From JMP we can see that it calculates a test statistic of χ² = 0,04 and a p-value of 0,8415. This means that we fail to reject H0: the data are consistent with the proportion of men and women in the company each being 50%.
This can be backed up by the following confidence intervals:
From the table it is apparent that 0,50 lies within the confidence interval for both genders, which further supports the conclusion that the proportions of men and women at the company are equal.
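For comparison, the same goodness-of-fit test can be run in scipy. The observed counts below (51 men, 49 women) are an assumption chosen to reproduce the reported statistic, not numbers taken from the data set:

```python
# Sketch: chi-squared goodness-of-fit test against a 50/50 distribution.
from scipy.stats import chisquare

observed = [51, 49]              # assumed counts; expected under H0 = [50, 50]
stat, p = chisquare(observed)    # default: equal expected frequencies
print(stat, p)                   # chi^2 = 0.04, p = 0.8415, matching the JMP output
```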
Assignment 8: Chi-squared independence test:
a) Conduct a relevant test to determine if the gender distribution across different machines differs
from each other. Comment on the results you find.
Based on the question text, a chi-squared independence test must be conducted.
The following assumptions need further discussion (compared with Assignment 7):
Mutually exclusive:
The machines are assumed to be mutually exclusive. In reality, employees might work on several machines, which would violate this assumption.
For the discussion of mutual exclusivity regarding gender, we refer back to the discussion in the
previous section (Assignment 7).
Rule of 5:
The expected count in every cell must be larger than 5. JMP is used to check this assumption.
In JMP: Analyze → Fit y by x → add y variable (nominal) → add x variable (nominal) → OK → Red triangle on contingency table → remove Total% → remove Column% → remove Row% → add Expected → add Cell Chi Square
From the contingency table we can see that the expected count (the "Expected" entry) in every cell is larger than 5.
We therefore set the following hypotheses
$$H_0: \text{independence between gender and machines}$$
$$H_1: \text{dependence between gender and machines}$$
The following test statistic is used, summing over all cells of the table:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
We can see in the JMP output that it calculates a test statistic of χ² = 0,144 with a corresponding p-value of 0,9305. We therefore fail to reject H0. This means there is no evidence of a relationship between gender and machine; it cannot be concluded that there are more men on machine 1 compared to women.
Moreover, the individual cell χ² contributions are low. Had they been high, we would lean towards H1; that is not the case here, so the data are consistent with gender and machine being independent.
(The larger the χ² value, the stronger the evidence of a significant difference.)
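The equivalent test in scipy is chi2_contingency on the gender × machine table. The counts below are made up for illustration; the real table comes from the JMP contingency output:

```python
# Sketch: chi-squared independence test on a 2x3 gender-by-machine table.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[17, 17, 17],    # men on machines 1-3 (made-up counts)
                  [17, 16, 16]])   # women on machines 1-3 (made-up counts)
stat, p, dof, expected = chi2_contingency(table)
print(stat, p)       # compare with JMP: chi^2 = 0.144, p = 0.9305
print(expected)      # rule of five: every expected cell should exceed 5
```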
Assignment 9: Forecasting
The company's average salary for the period 2011-2018 is shown in the JMP file. The variable
ID can be used to indicate time (1=January 2011… 96=December 2018).
Assumptions that are not explicitly requested to be tested are assumed to be met in the tasks
below.
a) Estimate a Trend model that explains the development in average salary and use the model to
predict the average salary in January 2019. Evaluate whether the model exhibits autocorrelation.
Model formulation
We construct the following estimator model:

$$\hat{y} = \beta_0 + \beta_1 \cdot \text{ID}$$

ID is used as time, where 1 = January 2011 and 96 = December 2018.
The assumptions are assumed to hold, and JMP provides the following results.
From the model output we can immediately see that the R² value, i.e. the predictive power of the model, is 0,80. This means that time can explain 80% of the variation in average salary.
Furthermore, the slope coefficient for ID is 86,00, which means that for each month that passes, the average salary increases by 86 kr.
From the parameter estimates we can also see that β0 = 26.190, which means that in period 0 (December 2010) the average salary was 26.190 kr.
From the plot we can also see several systematic outliers, which could indicate seasonal fluctuations (these are analyzed in question 9.b).
We can then produce the following model:

$$\widehat{\text{Mean salary}} = 26.190 + 86 \cdot \text{ID}$$
The assignment asks us to predict the average salary in January 2019. Since ID = 96 corresponds to December 2018, ID = 97 must be January 2019. Inserting this, we calculate:
$$\widehat{\text{Mean salary}}_{\text{Jan19}} = 26.190 + 86 \cdot 97 = 34.532$$

Thus the predicted mean salary for January 2019 is 34.532 kr.
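A minimal sketch of the same trend fit outside JMP; the data below are synthetic stand-ins generated to mimic the reported model (β0 = 26.190, slope = 86), not the actual data set:

```python
# Sketch: linear trend model "salary ~ ID" and a January 2019 prediction.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
ids = np.arange(1, 97)                                     # 1 = Jan 2011 ... 96 = Dec 2018
salary = 26190 + 86 * ids + rng.normal(0, 800, ids.size)   # synthetic stand-in data
ts = pd.DataFrame({"ID": ids, "salary": salary})

trend = smf.ols("salary ~ ID", data=ts).fit()
print(trend.params, trend.rsquared)                  # compare: 26.190, 86, R^2 = 0.80

print(trend.predict(pd.DataFrame({"ID": [97]})))     # Jan 2019: 26.190 + 86*97 = 34.532
```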
We are furthermore asked to test for autocorrelation, which we do in JMP:
From the Durbin-Watson test, we can see that the p-value is above 5%, so we fail to reject the hypothesis of no autocorrelation. The trend model thus shows no sign of autocorrelation, which supports the model.
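Outside JMP, the Durbin-Watson statistic for the trend model could be computed as below. Note that statsmodels reports only the statistic (values near 2 meaning no first-order autocorrelation), whereas JMP also supplies a p-value:

```python
# Sketch: Durbin-Watson statistic on the residuals of the trend model above.
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(trend.resid)   # `trend` is the fitted model from the sketch above
print(dw)                         # ~2 suggests no first-order autocorrelation
```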
b) Assess whether there is a possibility to enhance the trend model by incorporating seasonality
and use the model to predict the average salary for January 2016.
We saw that the model left some systematic outliers unexplained. These could be due to seasonal fluctuations, such as a December bonus. We therefore insert a dummy variable to capture this pattern.
In JMP: Add a new column → Name it → right click new column → Formula → Conditional → If → Month → add "==" to Month for an exact match → insert "December" → Then clause = 1 → Else = 0
By expanding the model, we can examine the following model:

$$\hat{y} = \beta_0 + \beta_1 \cdot \text{ID} + \beta_2 \cdot \text{Season}$$
In JMP: Analyze → Fit model → Add y variable (mean salary)→ add x variables (season + ID)
→ Run
From the model we can see a very high R² of 0,99, i.e. the model explains 99% of the variation in average salary.
We furthermore see that the model as a whole is significant, with a p-value of approximately 0%, and that all the x-variables and their respective coefficients are significant. Season thus genuinely helps explain the average salary.
On this background the regression is set up:

$$\widehat{\text{Mean salary}} = 25.960 + 83,47 \cdot \text{ID} + 4.227 \cdot \text{Season} \quad (\text{Season} = 1 \text{ in December})$$

From this we can see that the December bonus is 4.227 kr.: in a December month the dummy raises the prediction by 4.227 kr. (e.g. 34.059 + 4.227 = 38.286). For January 2019 the dummy is 0:

$$\widehat{\text{Mean salary}}_{\text{Jan19}} = 25.960 + 83,47 \cdot 97 + 4.227 \cdot 0 = 25.960 + 8.097 \approx 34.057$$
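Continuing the sketch from above, the December dummy and the extended model could look like this; the month pattern ID % 12 == 0 encodes December, since ID = 1 is January 2011:

```python
# Sketch: trend model extended with a December dummy, reusing `ts` from above.
import pandas as pd
import statsmodels.formula.api as smf

ts["season"] = (ts["ID"] % 12 == 0).astype(int)   # ID 12, 24, ..., 96 are Decembers
season_model = smf.ols("salary ~ ID + season", data=ts).fit()
print(season_model.params)   # compare: 25.960 + 83,47*ID + 4.227*season in the notes

# January 2019: ID = 97, season = 0
print(season_model.predict(pd.DataFrame({"ID": [97], "season": [0]})))   # compare: ~34.057
```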
c) Estimate an autoregressive model where average salary is explained by previous periods'
salaries. Use the model to predict the average salary for January 2019.
We need to explain the mean salary based on earlier periods' average salary. We therefore set up
the following autoregressive model:
$$\hat{y}_t = \beta_0 + \beta_1 \cdot y_{t-1}$$

which can also be written as: $\text{Mean salary}_t = \beta_0 + \beta_1 \cdot \text{Mean salary}_{t-1}$
We create a new variable in JMP containing the previous period's average salary (the lagged salary).
We then obtain the results for the autoregressive model:
From the model, we can see an R² of 0,60 (60%), which is judged acceptable, and the model as a whole is significant.
The coefficient on the lagged salary is 0,80, i.e. each extra krone in last month's average salary raises this month's predicted salary by 0,80 kr. (Note that this is the slope, not the share of variation explained.)
Using the model results, we can then set up the following model:

$$\widehat{\text{Mean salary}}_t = 6.107 + 0,80 \cdot \text{Mean salary}_{t-1}$$

We are asked about January 2019, so we use the observed salary from December 2018 as the lagged value.
Predicting the salary of January 2019 with our model:

$$\widehat{\text{Mean salary}}_{\text{Jan19}} = 6.107 + 0,80 \cdot 38.196 = 36.664$$
Here the predicted mean salary for January 2019 is 36.664, whereas the trend models above gave 34.532 and 34.057. This is a relatively large difference, which is discussed in the next question.
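The autoregressive model follows the same pattern outside JMP: lag the salary one period and regress on it. A sketch, again reusing the synthetic `ts` frame from the trend-model sketch:

```python
# Sketch: lag-1 autoregressive model via a shifted column.
import statsmodels.formula.api as smf

ts["salary_lag1"] = ts["salary"].shift(1)        # previous month's salary
ar = smf.ols("salary ~ salary_lag1", data=ts.dropna()).fit()
print(ar.params, ar.rsquared)                    # compare: 6.107 + 0,80*lag, R^2 = 0.60

# Jan 2019 prediction uses the observed Dec 2018 salary as the lagged value
b0, b1 = ar.params["Intercept"], ar.params["salary_lag1"]
print(b0 + b1 * ts["salary"].iloc[-1])           # compare: 6.107 + 0,80*38.196 = 36.664
```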
d) Which model provides the best estimate for January 2019?
Model                           January 2019 salary    R²
9.a Trend model                 34.532                 0,80
9.b Trend model with season     34.057                 0,99
9.c Autoregressive model        36.664                 0,60
Using the table above, we conclude that the model with the best predictive fit is model 9.b (the trend model with season), due to its high R² and the fact that it accounts for the seasonal fluctuations.
JMP Guide:
Scatterplot
Analyze → Fit y by x → y response variable (income), x factor (explanatory variable, e.g. age). Check that the modeling types are correct (in this case both continuous).
For regression line (after creating scatter plot):
→ Red triangle → Fit line
For log(y) regression line (improving model if needed):
→ Analyze → Fit y by x → y response variable (income), x factor (describing variable) age → Red
triangle → Fit line → Red triangle → Fit special → Log(y)
Checking for improvement in distribution (log-level model)
→ Analyze → Distribution → right click y variable→ transform → Log → add log(y) into JMP y
variable → Check distribution.
Can add a little extra (Normal Distribution and Quantile plot):
Distribution: Red triangle (Log) → Continuous fit → Normal distribution
Normal Quantile plot: Red triangle (Log) → Normal Quantile Plot
Parameter estimates
→ make regression line → Screenshot parameter estimates and discuss
Checking variable distribution JMP (Uniform/Normal distribution)
Analyze → Distribution → Add either X or Y variable (NOT BOTH) → Okay → red triangle
(distribution)→ Stack→ copy plot into answer and discuss problems with model (if any).
Residual plot in JMP
Analyze → Fit y by x → check the modeling types → income as Y variable, age as X variable → OK → top red triangle → Fit line → red triangle (Linear Fit) → Plot Residuals.
Residual analysis with multiple variables
Analyze → Fit model → y variable (income) → add x variables to construct model effects → Run
Saving residuals to dataset (Test for distribution):
Analyze → Fit model → y variable (income) → add x variables to construct model effects → Run → red triangle → Save columns → Residuals
Then: Analyze → Distribution → add residuals as y → run → stack
Confidence interval in JMP
Go to Analyze → Fit model → income in Y, x variables in construct model effects → Run → red triangle → Regression reports → Show All Confidence Intervals → then scroll to the bottom
Or for confidence intervals in the dataset:
Go to Analyze → Fit model → income in Y, x variables in construct model effects → Run → red triangle → Save columns → Mean Confidence Limit Formula
Inserting quadratic terms for improving correlation:
Analyze → Fit model → add y variable → add x variable → right click x variable → Transform→
Square→ Add x squared to x variables → Run
*Always recommended to use Fit model when dealing with more than 1 variable*
Prediction interval:
Analyze → Fit model → Add y → add x → Run → red triangle → Save columns → Indiv Confidence Limit Formula → adds prediction limits to the dataset for each value. Find the relevant row to read off the interval.
Test for multicollinearity
This is for multiple regression:
Analyze→ Multivariate methods → multivariate→ y variable (income, age, experience, sex, Education)
One-way/One-factor Anova
(Checking distribution of variables) Analyze→ distribution→ y variable→ put x variable into “by” →
stack
Analyze→ fit model→ y variable (continuous)→ x variable (nominal - might need to convert) into
construct model effects→ red triangle on x variable plot→ lsmeans plot
The x variable goes into "By" because we examine income separately for each machine.
Two-way/Two-factor ANOVA:
(checking distribution of variables): Analyze → Distribution → add y variable → add x variables (make sure to convert to categorical variables)
Analyze → Fit model → add y variable → add x variables → add interaction term (highlight x variables
in column selection and press cross in model effects) → Run
Logistic Regression
Analyze → Fit model → add y variable (As nominal) → add x variables (as continuous) → Run
Check distribution: Analyze → Distribution → change y variable to nominal → OK → Stack → screenshot the result
Multicollinearity (all variables need to be continuous for this test): Analyze → Multivariate methods → Multivariate → add continuous variables → OK
Interpretation of model (pseudo R² and hit rate): red triangle on Nominal Logistic → Odds Ratio and Confusion Matrix
Chi squared:
For distribution: Analyze → Distribution → y variable (nominal) → red triangle → Test Probabilities
Distribution of the dependent variable in relation to the independent (y in relation to x): Analyze → Fit y by x
→ add y variable (nominal) → add x variable (nominal) → OK → Red triangle on contingency table →
remove total% → remove column% → remove row% → add Expected → add Cell Chi Square
To find confidence intervals: red triangle → Confidence Interval → 0,95
Forecasting
Analyze → Fit model → y variable (average income) → x variable (ID, the time index)
Test for autocorrelation: From forecasting model → Response red triangle → Row Diagnostics
→ Durbin Watson Test.
Adding dummy variable (to capture outliers such as seasonality): Add a new column → Name it → right click new column → Formula → Conditional → If → fill in the If formula with the relevant condition.
Autoregressive model:
Step 1) Lag the variable: double click on new column (old salary) → right click→ formula →
row→ lag → choose variable to lag (in this case old salary) → n=1 (given we only go back one
period)
analyze → fit model → choose y variable → choose x variables → run