Final Coursework
Introduction to Quantitative Research Methods (PUBL0055)
Instructions
The coursework will be posted on Moodle on 14 December 2018 at 6pm, and is due on 7 January 2019
at 2pm. Please follow all designated SPP submission guidelines for online submission as detailed on the
PUBL0055 Moodle page. Standard late submission penalties apply.
This is an assessed piece of coursework (worth 75% of your final module mark) for the PUBL0055
module; collaboration and/or discussion of the coursework with anyone is strictly prohibited. The rules
for plagiarism apply and any cases of suspected plagiarism of published work or the work of classmates
will be taken seriously.
As this is an assessed piece of work, you may not email/ask the course tutors or teaching fellows
questions about the coursework.
Along with the coursework itself, the datasets for the coursework can be found in the PUBL0055 page
on Moodle.
Coursework should be submitted via the appropriate link on the course Moodle page. You will need to
click the ‘Submit Paper’ link at the bottom of the page. When presented with the ‘Submit Paper’ box,
the ‘Submission Title’ should be your candidate number, and you should upload your document into
the box provided.
– Please remember to state ONLY your candidate number on your coursework (your candidate
number is made up of four letters and one number e.g. ABCD5). Your name and/or student
number must not appear on your coursework.
Answers should be written in complete sentences. Be sure to answer all parts of the questions posed
and interpret the results.
The word count for this assessment is 3000 words. This does not include the appendix, or any words
(or numbers) contained within tables. Please note that any full sentences included in tables will form
part of the word count.
Please submit your type-written (numbered) answers in a single document. Create an appendix section
at the end which contains all the R code needed to reproduce your results (you do not need to include
the code that failed to run, but just the cleaned-up version. Your code has to work when we run it).
Failure to include the R code means that the coursework will be marked incomplete.
You may assume the methods you have used (e.g. t-test, linear regression, etc) are understood by the
reader and do not need definitions, but you do need to explain the intuition of these methods.
Round all numbers to two digits after the decimal point.
Do not copy and paste any raw R output (e.g. summary(lm(y ~ x))) into your answers. Create a
minimally formatted table, e.g. with the screenreg command as seen in class. If that does not work,
re-create by hand such a table.
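If screenreg is not available, a table can be assembled by hand from the fitted model object. The following is a minimal sketch in base R only. The data are simulated, and the names toy, fit and make_table are illustrative assumptions rather than part of the coursework materials.

```r
# Simulated data purely for illustration (not the coursework data).
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- 2 + 0.5 * toy$x + rnorm(100)

fit <- lm(y ~ x, data = toy)

# Hypothetical helper: one row per coefficient, with the estimate and its
# standard error (in parentheses), both rounded to two decimal places.
make_table <- function(model) {
  est <- coef(summary(model))
  data.frame(
    term      = rownames(est),
    estimate  = sprintf("%.2f", est[, "Estimate"]),
    std_error = sprintf("(%.2f)", est[, "Std. Error"]),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

tab <- make_table(fit)
print(tab)
```

The resulting data frame can then be copied into a formatted table in the write-up, with a title and number added.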
Assign every table and figure a title and a number and refer to the number in the text when discussing
a specific figure or table.
Datasets
Varieties of Democracy – vdem.csv
This data set includes several variables taken from the Varieties of Democracy project (https://www.vdem.net/en/).
The unit of analysis is the country-year. The data here cover 161 countries for the years 1993
to 2010. There are a total of 2898 observations in the data.
Table 1: Varieties of Democracy codebook
Variable                  Description
country_name              The name of the country
year                      The year of the observation
child_mortality           The number of deaths prior to age 1 per 1000 live births in a year
inequality_gini           A measure of income inequality, based on the GINI coefficient. Higher values indicate more inequality. The theoretical minimum is 0, where income is perfectly equal, and the theoretical maximum is 100, where one individual has all the income.
life_expectancy           Life expectancy at birth (in years)
radio_television_per_cap  Number of radio and television sets per capita
log_population            Logged population
civil_war                 1 if there was an intra-state war with at least 1,000 battle deaths in a given country-year, 0 otherwise
international_war         1 if the country participated in an international armed conflict in a given year, 0 otherwise
urban_population_pct      Percentage of population living in urban areas (in percentage points)
oil_production_per_cap    Value of petroleum produced per capita
gdp_per_cap               Gross domestic product, per capita
inflation                 Annual inflation rate
region_name               Geographic region in which the country is located (categorical)
education15               Average years of education among citizens older than 15
government_effectiveness  A continuous measure of government effectiveness based on the quality of public service provision amongst bureaucrats and government actors (higher values indicate more effective government)
political_stability       A continuous measure of political stability based on perceptions of the likelihood that the government in power will be destabilized or overthrown by possibly unconstitutional and/or violent means (higher values indicate higher levels of stability)
polity                    Score on the polity scale (higher values indicate more democratic countries, lower values indicate more autocratic countries)
healthcare                A continuous variable measuring the extent to which high quality basic healthcare is guaranteed to all (higher values indicate higher access to healthcare)
womens_civ_lib            A continuous variable indicating whether women have the ability to make meaningful decisions in key areas of their lives (higher values indicate higher levels of civil liberties for women)
media_censorship          A continuous variable indicating whether the government directly or indirectly attempts to censor the print or broadcast media (lower values indicate higher levels of censorship)
internet_access           1 if there is internet access in this country-year, 0 otherwise
You can access this data in two ways:
1. You can download the vdem.csv data file from Moodle, copy it to your working directory, and load it
into R as we have been doing in class.
– or –
2. You can run the following line of code in R and this will load the data directly from the course website:
vdem <- read.csv("https://uclspp.github.io/datasets/data/vdem.csv")
These two ways of loading the data will produce identical results.
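The two loading routes can be wrapped in one helper, and the result checked against the dataset description. This is a minimal sketch: load_course_data and panel_shape are hypothetical helper names, not functions from the course, and the default base_url is taken from the line of code above.

```r
# Hypothetical helper: read a local copy if it exists, otherwise fall back
# to the course website.
load_course_data <- function(file,
                             base_url = "https://uclspp.github.io/datasets/data/") {
  path <- if (file.exists(file)) file else paste0(base_url, basename(file))
  read.csv(path, stringsAsFactors = FALSE)
}

# Hypothetical helper: count rows, distinct units and distinct time periods,
# so the numbers can be compared with the description of the data.
panel_shape <- function(data, unit = "country_name", time = "year") {
  c(rows    = nrow(data),
    units   = length(unique(data[[unit]])),
    periods = length(unique(data[[time]])))
}
```

With the file in the working directory, or with a network connection, vdem <- load_course_data("vdem.csv") followed by panel_shape(vdem) gives the counts to compare with the figures stated for this dataset.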
European Social Survey – ess.csv
This dataset includes several variables taken from the 2016 European Social Survey (https://www.europeansocialsurvey.org).
The unit of analysis is individual respondents to a face-to-face survey. There are a total of 13075 observations
in the data, with respondents surveyed in 17 different European countries.
Table 2: ESS codebook
Variable            Description
country_code        The country of the respondent
leave               1 if the respondent would vote to leave the European Union in a referendum, 0 otherwise
gender              Whether the respondent is male or female
age                 The age of the respondent (in years)
years_education     The number of years of education the respondent has completed
unemployed          1 if the respondent is unemployed, 0 otherwise
income              1 if the respondent earns above the median income in their country, 0 otherwise
religion            Categorical variable of the religion of the respondent
trade_union         1 if the respondent is a member of a trade union, 0 otherwise
news_consumption    Amount of time the respondent spends reading newspapers/online news each week (in minutes)
trust_people        The degree to which the respondent trusts other people (0 = low trust, 10 = high trust)
trust_politicians   The degree to which the respondent trusts politicians (0 = low trust, 10 = high trust)
past_vote           1 if the respondent voted in the last general election in their country, 0 otherwise
immig_econ          The respondent’s view of the economic effects of immigration in their country (0 = Immigration is bad for the economy; 10 = Immigration is good for the economy)
immig_culture       The respondent’s view of the cultural effects of immigration in their country (0 = Immigration undermines the country’s culture; 10 = Immigration enriches the country’s culture)
country_attach      The respondent’s emotional attachment to their country (0 = Not at all emotionally attached; 10 = Very emotionally attached)
climate_change      How worried the respondent is about climate change (1 = Not at all worried; 5 = Very worried)
imp_tradition       How important the respondent feels it is to follow traditions and customs (1 = Very important; 6 = Not at all important)
imp_equality        How important the respondent feels it is that people are treated equally and have equal opportunities (1 = Very important; 6 = Not at all important)
eu_integration      The respondent’s views on European unification/integration (0 = “Unification has already gone too far”; 10 = “Unification should go much further”)
Again, you can access this data in two ways:
1. You can download the ess.csv data file from Moodle, copy it to your working directory, and load it
into R as we have been doing in class.
– or –
2. You can run the following line of code in R and this will load the data directly from the course website:
ess <- read.csv("https://uclspp.github.io/datasets/data/ess.csv")
Part 1: Does rain affect turnout? (25 points)
If voters are rational, then their behaviour should be responsive to changes in the costs and benefits they
face in casting votes on election day. Elections in the US are typically held in November, when the weather can be
very bad in some parts of the country. When the weather is bad – in particular, when it rains –
this imposes additional costs on potential voters, and may lead to declines in turnout on election day. In
this question, your task is to interpret regression models which analyse the relationship between inclement
(i.e. bad) weather and voter turnout at the county level in US presidential elections. The models also include
information about how competitive the presidential race is in different counties – where a county is considered
competitive if it is located in a state where the presidential race is close between the top two candidates.
A team of researchers decided to analyse the effects of rain on turnout by collecting data from 43124 county-level
elections in the US between 1948 and 2000. For each county-year observation in their data, they collect
information on the following variables:
Table 3: Part 1 variables
Variable Description
Turnout The turnout rate in a given county in the election (measured in percentage points)
Rain The rainfall in the county on election day (measured in inches)
Competitive 1 if the county is located in a competitive state, and 0 otherwise
Unemployment The unemployment rate in a given county (measured in percentage points)
High_School The high school graduation rate in the county (measured in percentage points)
To test the effect of rain on voter turnout, the researchers ran the following linear regression models, both of
which have turnout as the dependent variable.
Model 1
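$$\text{Turnout}_i = \alpha + \beta_1 \text{Rain}_i + \beta_2 \text{Competitive}_i + \beta_3 \text{Unemployment}_i + \beta_4 \text{High\_school}_i + \varepsilon_i$$
Model 2
$$\text{Turnout}_i = \alpha + \beta_1 \text{Rain}_i + \beta_2 \text{Competitive}_i + \beta_3 (\text{Rain}_i \cdot \text{Competitive}_i) + \beta_4 \text{Unemployment}_i + \beta_5 \text{High\_school}_i + \varepsilon_i$$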
The estimates produced by these two models are presented in table 4 below. You should use these estimates
to provide answers to the following questions.
Questions
1) What is the null hypothesis for the interaction term β3 in model 2?
2) Interpret the effects of rain on turnout using the coefficients from models 1 and 2.
3) Based on model 2, what is the expected level of turnout for a county with the following characteristics?
A) A county with 0 inches of rain, in an uncompetitive state, with an unemployment rate of 6% and a
high school graduation rate of 80%
B) A county with 3 inches of rain, in an uncompetitive state, with an unemployment rate of 6% and a
high school graduation rate of 80%
C) A county with 0 inches of rain, in a competitive state, with an unemployment rate of 6% and a high
school graduation rate of 80%
D) A county with 3 inches of rain, in a competitive state, with an unemployment rate of 6% and a high
school graduation rate of 80%
4) Based on your answers to the questions above, do you conclude that rain is an important determinant of
turnout in US elections?
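Computing an expected value from Model 2 amounts to plugging a county profile into the regression equation, and the interaction term means the effect of rain depends on whether the state is competitive. The following is a minimal sketch in R. The coefficient vector b is invented purely for illustration and does not contain the Table 4 estimates; the function names are also hypothetical.

```r
# Placeholder coefficients: NOT the estimates from Table 4. Replace them with
# the reported values before drawing any conclusions.
b <- c(intercept = 40, rain = -1, competitive = 3, rain_x_competitive = 0.5,
       unemployment = -0.2, high_school = 0.3)

# Expected turnout under Model 2 for a single county profile.
expected_turnout <- function(b, rain, competitive, unemployment, high_school) {
  unname(b["intercept"] +
           b["rain"] * rain +
           b["competitive"] * competitive +
           b["rain_x_competitive"] * rain * competitive +
           b["unemployment"] * unemployment +
           b["high_school"] * high_school)
}

# Effect of one additional inch of rain: beta1 + beta3 * Competitive.
rain_effect <- function(b, competitive) {
  unname(b["rain"] + b["rain_x_competitive"] * competitive)
}

expected_turnout(b, rain = 3, competitive = 1, unemployment = 6, high_school = 80)
rain_effect(b, competitive = 0)
rain_effect(b, competitive = 1)
```

Because Unemployment and High_School are measured in percentage points, a rate of 6% enters the calculation as 6, not 0.06.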
Table 4: Rain and turnout regressions
[Regression table not reproduced here. Dependent variable: Turnout. Note: Standard errors in parentheses.]
Part 2: Inequality and child mortality (40 points)
There is considerable academic debate regarding the relationship between socioeconomic inequalities and
various health outcomes for children. In particular, one influential theoretical argument suggests that if
income is redistributed from richer people to relatively poorer people, health outcomes for poor children
should be expected to improve, but we should not expect a similar decline in the health outcomes of rich
children.
At the aggregate level, and focussing on child mortality as a measure of children’s health outcomes, this
argument suggests that we should observe a positive association between income inequality and child mortality.
The plot below shows the bivariate association between these variables based on the data that you will use
for this question. The dataset for this question is the Varieties of Democracy dataset, which can be found in
the vdem.csv file described above.
[Figure: child mortality per 1000 births (Y-axis, 0 to 150) plotted against income inequality (X-axis, 0 to 100)]
Child mortality – here displayed on the Y-axis – measures the number of children who die before the age of
1 for every 1000 live births in a country each year. Income inequality – on the X-axis – is measured as a
country’s GINI coefficient, which is a measure of how concentrated the income of a country is. The GINI
measure can range in theory from 0 – where income is perfectly equally shared amongst all citizens – to 100 –
where all of a country’s income is concentrated in the hands of a single individual.
Questions
1) Your main task in this section is to develop theoretically-grounded models of child mortality using the
Varieties of Democracy data. In this subquestion, you should implement two linear regression models with
child_mortality as the dependent variable.
In the first model, the only explanatory variable should be the inequality_gini variable.
For the second model, you should build a model which – in addition to the inequality_gini variable –
includes six theoretically important explanatory variables that you think might be appropriate from the
supplied dataset. You should explain why you think these particular variables are important to include, given
that our main interest is in the relationship between inequality and child mortality. Please note that, for the
second model, you should not estimate several different models and present the results, but rather you should
argue theoretically why you chose certain variables. You should also consider whether any non-linear and/or
interactive specifications of the variables you include in your model would be appropriate.
You should write up the results of these models as if they were to be published in a political science journal
article with a focus on communicating the substantive meaning of your results. In your discussion of these
models, you should focus on communicating the substantive implications of the regression that you implement,
paying particular attention to the relationship between child mortality and income inequality. You may wish
to focus on the following:
– Provide descriptive statistics and/or plots to provide the reader with an overview of the dependent
variable and the important explanatory variables that you intend to use.
– Provide a well-formatted table of regression output which includes the key information about the two
models you have estimated.
– For the second model, you must use a model which includes 7 explanatory variables (inequality_gini
plus the 6 you choose) to explain child mortality. You should state an appropriate hypothesis/null
hypothesis for the variables in your model.
– Discuss the statistical significance of the coefficients in the models.
– Present quantities of interest from your models that help to describe the relationship between income
inequality and child mortality. You could also illustrate the relative importance of the different
explanatory factors that you include in the second model. Examine the effects for sensible values of the
independent variables, and focus your interpretation not just on the direction of the effects, but also
the magnitude of the effects.
– Discuss the fit of your two models using appropriate statistics.
– Evaluate the models with reference to the assumptions of linear regression and, if appropriate, implement
corrections when these assumptions appear to be violated.
You should not use fixed-effects in either of the models in this question.
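As one possible workflow for the points listed above, the sketch below fits a bivariate and a multivariate linear model, compares their adjusted R-squared, and reports a first difference across the interquartile range of the main explanatory variable. The data frame sim is simulated and its column names are assumptions; it is not the vdem data, and the use of log(income) is illustrative only, not a recommendation about which variables to choose.

```r
# Simulated data purely for illustration (not the vdem data).
set.seed(42)
n <- 200
sim <- data.frame(
  inequality = runif(n, 20, 60),
  income     = exp(rnorm(n, 8, 1))
)
sim$mortality <- 20 + 0.4 * sim$inequality -
  8 * log(sim$income / 1000) + rnorm(n, sd = 5)

m1 <- lm(mortality ~ inequality, data = sim)
m2 <- lm(mortality ~ inequality + log(income), data = sim)  # non-linear term

# Adjusted R-squared penalises additional regressors, which makes it a
# reasonable statistic for comparing models of different size.
fit_stats <- c(m1 = summary(m1)$adj.r.squared,
               m2 = summary(m2)$adj.r.squared)
round(fit_stats, 2)

# First difference: change in the predicted outcome when inequality moves from
# its 25th to its 75th percentile, holding income at its median.
q <- quantile(sim$inequality, c(0.25, 0.75))
profiles <- data.frame(inequality = q, income = median(sim$income))
iqr_effect <- unname(diff(predict(m2, newdata = profiles)))
round(iqr_effect, 2)
```

Reporting a difference over a realistic range of the variable, rather than a one-unit change, is one way to convey magnitude as well as direction.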
2) You present the results of your models to a friend who argues that, because you are dealing with panel
data, you should think about adding fixed-effects to your second model. Address this point now by focussing
on the following:
a) Test for the presence of unit and time fixed-effects in this data, and present results from an appropriate
model specification. Interpret any changes that have resulted from this change in model specification,
particularly with reference to the inequality variable.
b) Test for and correct any dependence in the error term. Do your substantive conclusions change at
all?
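One way to test for unit and time effects without additional packages is to compare nested lm models with F-tests. The following is a sketch on a simulated panel; the names panel, unit, year, x and y are assumptions. The course may have used a dedicated panel-data package instead, and this sketch does not address dependence in the error term.

```r
# Simulated panel purely for illustration: 20 units observed over 10 years.
set.seed(7)
units <- paste0("c", 1:20)
panel <- expand.grid(unit = units, year = 2001:2010, stringsAsFactors = FALSE)
unit_effect <- setNames(rnorm(20, sd = 5), units)

# x is correlated with the unit effect, so the pooled estimate is biased.
panel$x <- rnorm(nrow(panel)) + unname(unit_effect[panel$unit]) / 5
panel$y <- 10 + 1.5 * panel$x + unname(unit_effect[panel$unit]) +
  rnorm(nrow(panel))

pooled    <- lm(y ~ x, data = panel)
unit_fe   <- lm(y ~ x + factor(unit), data = panel)
twoway_fe <- lm(y ~ x + factor(unit) + factor(year), data = panel)

# F-tests of nested models: do the unit dummies, and then the year dummies,
# add explanatory power?
p_unit <- anova(pooled, unit_fe)[["Pr(>F)"]][2]
p_time <- anova(unit_fe, twoway_fe)[["Pr(>F)"]][2]

# Compare the coefficient of interest across specifications.
round(c(pooled = coef(pooled)[["x"]], twoway = coef(twoway_fe)[["x"]]), 2)
```

In this simulation the unit dummies matter by construction, and the coefficient on x moves once they are included; whether the same holds in real data is an empirical question.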
Part 3: Leave or Remain? (35 points)
What determines support for the European Union? In the aftermath of the UK public’s vote to leave the
EU in the 2016 referendum, much attention has been paid to whether support for the EU varies predictably
across different types of individuals. In this question, you will use an appropriate binary dependent variable
model to improve our understanding of which types of citizens are more or less likely to vote to leave the
European Union if a referendum on membership were to be held in their country.
The data for this question comes from the 2016 European Social Survey (ESS) and includes information on
the political attitudes and demographics of European citizens. The data can be found in the ess.csv file
described above. In 2016, the ESS included the following question:
Imagine there were a referendum in your country tomorrow about membership of the European
Union. Would you vote for your country to remain a member of the European Union or to leave
the European Union?
The dependent variable for this analysis is leave, which equals 1 if the respondent said that they would
vote to leave the EU, and 0 otherwise.
1) The primary task of this section is to implement a logistic regression model with 5 theoretically important
predictors from the dataset. You should explain why you have selected the variables that you include in the
model and explain – from a theoretical perspective – why you expect them to be important for determining
whether an individual would decide to vote to leave the EU or not. As with Part 2, you should think
carefully about your choice of variables, and consider whether it would be appropriate to include non-linear
or interactive specifications of these variables.
You should focus on the following:
– Fit a model which uses 5 explanatory variables to predict referendum vote choice. You should be
clear about the theoretical rationale for including each variable, and you should state the appropriate
hypothesis for each of the variables in your model. Do not include the immig_econ or immig_culture
variables in this model.
– Provide and discuss an appropriate fit statistic for your model.
– Interpret your model in both statistical and substantive terms. You should present predicted probabilities
from the model that help to illustrate the substantive importance of the variables in your model (i.e.
simply reporting estimated coefficients is not sufficient for full marks).
– Create at least one plot of predicted probabilities from your model for a continuous independent variable.
You should write up your results as if they were to be published in a political science journal article with a
focus on communicating the substantive meaning of your results.
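Predicted probabilities from a logistic regression can be obtained by predicting over a grid of values for one variable while holding the others fixed. The following is a minimal sketch on simulated data; svy, vote_leave, age and trust are assumed names, and the two predictors are illustrative rather than a suggested specification.

```r
# Simulated survey data purely for illustration (not the ESS data).
set.seed(3)
n <- 1000
svy <- data.frame(
  age   = round(runif(n, 18, 90)),
  trust = sample(0:10, n, replace = TRUE)
)
svy$vote_leave <- rbinom(n, 1, plogis(-1 + 0.02 * svy$age - 0.25 * svy$trust))

logit_fit <- glm(vote_leave ~ age + trust, data = svy,
                 family = binomial(link = "logit"))

# Predicted probabilities across the range of one predictor, holding the
# other at its median.
grid <- data.frame(trust = 0:10, age = median(svy$age))
grid$p_leave <- predict(logit_fit, newdata = grid, type = "response")

round(grid$p_leave, 2)
```

A call such as plot(grid$trust, grid$p_leave, type = "l") then draws the predicted probability curve; confidence bands would need the standard errors from predict(..., se.fit = TRUE) on the link scale.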
2) An ongoing academic discussion focusses on whether cultural or economic concerns about immigration are
more important as predictors of support for the European Union. To contribute to this debate, you will now
develop your model from the first part of this section by including some additional variables.
a) Estimate a new version of your model, this time including the immig_econ and immig_culture
variables.
b) Does this model provide a better fit to the data than your original model? Use a fit statistic that
you have learned on this course to check.
c) How – if at all – does your interpretation of the effects of the original variables change in this new
model? Why might this be the case?
d) Calculate some predicted probabilities to demonstrate the substantive effects of the immig_econ and
immig_culture variables. Do you conclude that economic concerns about immigration or cultural
concerns about immigration are more important in predicting opposition to the EU?
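Nested logistic regression models can be compared with AIC or a likelihood-ratio test, and the substantive weight of two predictors can be compared through first differences in predicted probability. The following is a sketch on simulated data; d, z1, z2, z3 and first_diff are hypothetical names, and which fit statistic was taught on the course is not assumed here.

```r
# Simulated data purely for illustration (not the ESS data).
set.seed(11)
n <- 1000
d <- data.frame(z1 = rnorm(n), z2 = rnorm(n), z3 = rnorm(n))
d$outcome <- rbinom(n, 1, plogis(-0.5 + 0.4 * d$z1 - 0.8 * d$z2 - 0.3 * d$z3))

base_fit <- glm(outcome ~ z1, data = d, family = binomial)
full_fit <- glm(outcome ~ z1 + z2 + z3, data = d, family = binomial)

# Lower AIC indicates better fit after penalising the extra parameters.
aic <- c(base = AIC(base_fit), full = AIC(full_fit))

# Likelihood-ratio test of the nested models.
p_lr <- anova(base_fit, full_fit, test = "Chisq")[["Pr(>Chi)"]][2]

# Hypothetical helper: change in predicted probability when one variable moves
# from its 10th to its 90th percentile, with the others held at their means.
first_diff <- function(model, data, var, vars = c("z1", "z2", "z3")) {
  lo <- hi <- as.data.frame(lapply(data[vars], mean))
  q <- quantile(data[[var]], c(0.1, 0.9))
  lo[[var]] <- q[[1]]
  hi[[var]] <- q[[2]]
  unname(predict(model, hi, type = "response") -
           predict(model, lo, type = "response"))
}

round(c(z2 = first_diff(full_fit, d, "z2"),
        z3 = first_diff(full_fit, d, "z3")), 2)
```

If the added variables contain missing values, both models need to be fitted to the same set of observations before their fit statistics are compared.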