G12SMM/MATH 2011 Statistical Models and Methods
Linear Models, Assessed Coursework — 2018/2019
Please submit your work on Moodle as a pdf file by 3.00pm on Friday 12 April 2019.
Your solutions should contain all relevant R output needed to justify your answers/arguments,
together with appropriate discussion, but please do not include pages of irrelevant plots/output
which you do not discuss. The easiest way to include R output is to use R Markdown to produce
your solutions, but you do not have to do so. You do not need to include your R code, though
you can include it if you wish. If you are using R Markdown, and do not wish to include your
R code, then you can suppress the R code using the echo = FALSE argument, i.e. enclose the
code in an {r, echo=FALSE} environment in the Markdown file.
There will be a Moodle forum specifically for answering queries about the coursework, so you
may post questions and I will answer them there so that everyone receives the same assistance.
Please be careful to not inadvertently give away parts of your answer if you do post a question.
Note that as this is assessed work, I can only answer queries relating to clarification, and I will
only answer queries via the forum so that everyone can see my responses. You can change
your settings so that you get email notifications of new posts if you wish (I do not think that this
is the default setting). Otherwise, please check the forum to see if your query has already been
asked.
Unauthorised late submission will be penalised by 5% of the full mark per day. Work submitted
more than one week late will receive zero marks. You are reminded to familiarise yourself with the
guidelines concerning plagiarism in assessed coursework (see the student handbook), and note
that this applies equally to computer code as it does to written work.
The work contributes 15% to the overall module mark.
The Data
The objective is to build a predictive model for body fat content using 10 body measurement
variables. Body fat is difficult to measure, but is important to help medical professionals determine
risk of certain conditions. To this end, the body fat content of 202 men was accurately measured
using an underwater weighing technique, but this is not practical for general use. Hence, it is
desirable to develop a model for predicting body fat content reasonably accurately using easily
obtainable measurements.
The data for the 202 individuals is contained in the file Train.txt on Moodle. The body fat
measurement is the variable brozek (which refers to Brozek’s equation for body fat content).
The remaining 10 variables give the circumference, in centimetres, of neck, chest, abdom, hip,
thigh, knee, ankle, biceps, forearm and wrist. This data set is the training data, to be
used for model development.
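As a minimal sketch of getting the training data into R: the layout of Train.txt is not described above, so the code below assumes a whitespace-delimited file with a header row, and the helper name read_bodyfat is made up for illustration. It checks that the 11 variables named above are present before returning them.

```r
# Variable names taken from the description of the data above.
bodyfat_vars <- c("brozek", "neck", "chest", "abdom", "hip", "thigh",
                  "knee", "ankle", "biceps", "forearm", "wrist")

# ASSUMPTION: the file is whitespace-delimited with a header row.
# Adjust the read.table() arguments if the actual layout differs.
read_bodyfat <- function(path) {
  dat <- read.table(path, header = TRUE)
  missing_vars <- setdiff(bodyfat_vars, names(dat))
  if (length(missing_vars) > 0) {
    stop("Missing expected columns: ", paste(missing_vars, collapse = ", "))
  }
  dat[, bodyfat_vars]
}

# Only runs if the file has been downloaded to the working directory.
if (file.exists("Train.txt")) {
  train <- read_bodyfat("Train.txt")
  str(train)         # variable types
  print(dim(train))  # the description above says 202 individuals
}
```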
Additionally, the file Test.txt contains the same variables for a further 50 individuals. This is to
be used for testing the predictive ability of models, and should not be used in any model
development.
The Task
(a) Using only the training data, develop a model, or models, for predicting the body
fat content (brozek) using the other 10 measurements. You may use whichever methods
covered in the module you see fit. However, for this part, you should not use the test
data in any way. [40]
(b) Use your chosen “best” model(s) from (a) to predict the body fat content of the individuals
in the test data set. Use appropriate numerical summaries/plots to evaluate the quality of
your predictions. How do the predictions compare to those of the model of the form
brozek = intercept + neck + chest + abdom + hip + thigh + knee + ankle +
biceps + forearm + wrist? [10]
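One possible way to set up the comparison in part (b) is sketched below. The helper prediction_summary and the error measures it reports (root mean squared error, mean absolute error, mean error) are illustrative choices, not ones prescribed by the coursework. The "chosen" model is an arbitrary placeholder, not a recommendation, and the file layout (header row, whitespace-delimited) is an assumption.

```r
# Numerical summaries of prediction error on held-out data.
# Errors are defined as observed minus predicted.
prediction_summary <- function(model, newdata, response = "brozek") {
  err <- newdata[[response]] - predict(model, newdata = newdata)
  c(RMSE = sqrt(mean(err^2)), MAE = mean(abs(err)), mean_error = mean(err))
}

# Only runs if both files are in the working directory.
if (file.exists("Train.txt") && file.exists("Test.txt")) {
  train <- read.table("Train.txt", header = TRUE)  # ASSUMPTION: header row
  test  <- read.table("Test.txt", header = TRUE)

  # The reference model with all 10 measurements, as written in part (b).
  full <- lm(brozek ~ neck + chest + abdom + hip + thigh + knee + ankle +
               biceps + forearm + wrist, data = train)

  # PLACEHOLDER: stands in for whatever model part (a) leads to.
  chosen <- lm(brozek ~ abdom, data = train)

  print(rbind(full   = prediction_summary(full, test),
              chosen = prediction_summary(chosen, test)))

  # Observed against predicted, with the line of perfect prediction.
  plot(test$brozek, predict(chosen, newdata = test),
       xlab = "Observed brozek", ylab = "Predicted brozek")
  abline(0, 1, lty = 2)
}
```

Both models are fitted to the training data only; the test data enter solely through predict().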
Notes
An approximate breakdown of marks for part (a) is: Exploratory analysis [10 marks], Model
selection [20 marks], Model checking and validation [10 marks]. About half the marks
for each are for doing technically correct and relevant things, and half for discussion and
interpretation of the output. However, this is only a guide, and the work does not have to
be rigidly set out in this manner. There is some natural overlap between these parts, and
overall level of presentation and focus of the analysis are also important in the assessment.
The above marks are also not indicative of the relative amount of output/discussion needed
for each part; it is the quality of what is produced/discussed which matters.
As always, the first step should be to do some exploratory analysis. However, you do not
need to go overboard on this. Explore the data yourself, but you only need to report the
general picture, plus any findings you think are particularly important.
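A first exploratory pass might look something like the sketch below. The helper response_correlations is a made-up name, and which summaries are worth reporting is a judgement call; the file layout is again assumed to be whitespace-delimited with a header row.

```r
# Correlation of each predictor with the response,
# ordered by absolute size (strongest first).
response_correlations <- function(dat, response = "brozek") {
  predictors <- setdiff(names(dat), response)
  r <- sapply(dat[predictors], cor, y = dat[[response]])
  r[order(abs(r), decreasing = TRUE)]
}

# Only runs if the file is in the working directory.
if (file.exists("Train.txt")) {
  train <- read.table("Train.txt", header = TRUE)  # ASSUMPTION: header row

  print(summary(train))                 # ranges, possible odd values
  print(round(response_correlations(train), 2))
  print(round(cor(train), 2))           # correlations among the predictors
  pairs(train, gap = 0.3)               # scatterplot matrix
}
```

Strong correlations among the circumference measurements themselves would be worth noting, since they affect how individual coefficients can be interpreted.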
For the model fitting/selection, you can use any of the techniques we have covered this
semester to investigate potential models — the automated methods of Chapter 6/Case
Study 9 can be used to narrow down the search, but you can still use hypothesis tests, e.g.
if two different automated methods/criteria suggest slightly different models.
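The sketch below shows one way two automated criteria could be compared. The methods of Chapter 6/Case Study 9 are not reproduced here, so stepwise selection with step() under the AIC and BIC penalties is an assumption about what they include, and select_models is a hypothetical helper name. If the two selected models happen to be nested, an F-test via anova() is one way to choose between them.

```r
# Stepwise selection starting from the model with all predictors,
# once with the AIC penalty (k = 2) and once with the BIC penalty (k = log n).
select_models <- function(dat) {
  full <- lm(brozek ~ ., data = dat)
  n <- nrow(dat)
  aic_fit <- step(full, direction = "both", trace = 0)
  bic_fit <- step(full, direction = "both", trace = 0, k = log(n))
  list(full = full, aic = aic_fit, bic = bic_fit)
}

# Only runs if the file is in the working directory.
if (file.exists("Train.txt")) {
  train <- read.table("Train.txt", header = TRUE)  # ASSUMPTION: header row
  fits <- select_models(train)
  print(formula(fits$aic))
  print(formula(fits$bic))

  # An F-test is only meaningful if one model is nested in the other.
  aic_terms <- attr(terms(fits$aic), "term.labels")
  bic_terms <- attr(terms(fits$bic), "term.labels")
  if (all(bic_terms %in% aic_terms) && length(bic_terms) < length(aic_terms)) {
    print(anova(fits$bic, fits$aic))
  }
}
```

Setting trace = 0 suppresses the step-by-step output; leaving it at its default shows each move, which can be useful when experimenting.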
Please make use of the help files for R commands. Some functions may require you to
change their arguments a little from examples in the notes, or behaviour/output can be
controlled by setting optional arguments.
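For instance, the arguments and defaults of a function can be inspected before calling it. The snippet uses step() purely as an example of a function with optional arguments.

```r
# Show the argument list and default values of step().
print(args(step))

# The default penalty k = 2 corresponds to AIC.
print(formals(step)$k)

# Open the full help page (equivalent to typing ?step at the console).
help("step")
```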
You should check the model assumptions and whether conclusions are materially affected
by any influential data points.
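One way to look at this is sketched below, under stated assumptions: the helper influence_check is hypothetical, the cut-off of 4/n for Cook's distance is a common rule of thumb rather than anything specified in the module, and the data are assumed to have no missing values. It refits the model without the flagged observations so that the two sets of coefficients can be compared side by side.

```r
# Flag observations with large Cook's distance and compare the
# coefficients with and without them.
# ASSUMPTION: 4/n is a rule-of-thumb cut-off; `data` has no missing values.
influence_check <- function(formula, data, cutoff = 4 / nrow(data)) {
  fit_all <- lm(formula, data = data)
  flagged <- unname(which(cooks.distance(fit_all) > cutoff))
  keep <- setdiff(seq_len(nrow(data)), flagged)
  fit_drop <- lm(formula, data = data[keep, , drop = FALSE])
  list(flagged = flagged,
       coefficients = cbind(all_rows = coef(fit_all),
                            flagged_removed = coef(fit_drop)))
}

# Only runs if the file is in the working directory.
if (file.exists("Train.txt")) {
  train <- read.table("Train.txt", header = TRUE)  # ASSUMPTION: header row
  fit <- lm(brozek ~ ., data = train)  # replace with the model under study

  # Standard diagnostic plots: residuals vs fitted, normal Q-Q,
  # scale-location, residuals vs leverage.
  op <- par(mfrow = c(2, 2))
  plot(fit)
  par(op)

  print(influence_check(formula(fit), train))
}
```

If the coefficients, or the predictions that follow from them, change noticeably once the flagged points are removed, that is worth discussing rather than silently deleting the points.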
The task is deliberately open-ended: as this is a realistic situation with real data, there is
not one single correct answer, and different selection methods may suggest different “best”
models — this is normal. Your job is to investigate potential models using the information
and techniques we have covered. The important point is that you correctly use some of the
relevant techniques in a logical and principled manner, and provide a concise but insightful
summary of your findings and reasoning. (Note however that you do not have to produce
a report in a formal “report” format.)
You do not need to include all your R output, as you will likely generate lots of output when
experimenting. You might try a few different things whilst experimenting, and you do not
need to give all the details of everything you do — this will detract from the analysis.