I'm taking some time off right now to do a Master's degree through Harvard Extension, and I'm also taking multiple classes through Coursera, EdX, Kennedy School ExecEd, UC Irvine, etc. Everything from educational policy & leadership to quantitative research & data analysis to non-profit management & financial accounting. This blog is a place for me to collect my learnings from this adventure I'm on! Most of the time, I'll just be cutting and pasting from various assignments or papers to be able to easily reference them later, but sometimes I'll do specific blog posts knitting my thoughts together from the different coursework. :-)

Saturday, September 20, 2014

Project Proposal for Coursera Data Analysis

In the Data Analysis Coursera class, we get to do a project based on a research question and data that we choose!  So even though it's a math class where there's usually only one correct answer, I can post some of my work, since it's based on data that I chose rather than everyone using the same data.  (I still want to write a post about what I can and can't post here, depending on whether or not the homework has 'one' correct answer.)

RESEARCH QUESTION: In one sentence, what is your research question?
Is there a relationship between educational achievement, as measured by highest degree, and hours of TV watched per day?

DATA - Citation: Include a citation for your data, and if your data set is online, provide a link to the source
I will be using the General Social Survey dataset as provided for this class - I'm only using the data from after 2000 to try to make the results more generalizable to the present day.
http://bit.ly/dasi_gss_data
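
Here's a minimal sketch of the "only data from after 2000" step. It uses a tiny made-up data frame instead of the real GSS file, and the column name `year` is my assumption about how the survey year is stored; `gssafter2000` is the name I use for the subset later in this post.

```r
# Toy stand-in for the GSS data frame; these rows are made up, not real GSS data.
# The column name `year` is an assumption about how the survey year is stored.
gss <- data.frame(
  year    = c(1998, 2000, 2002, 2006, 2010, 2012),
  tvhours = c(3, 2, 4, NA, 1, 2)
)

# Keep only rows from after 2000 (switch to year >= 2000 to include 2000 itself)
gssafter2000 <- subset(gss, year > 2000)

nrow(gssafter2000)        # 4 rows remain in this toy example
range(gssafter2000$year)  # 2002 2012
```

Whether "after 2000" should include the year 2000 itself is a judgment call; the sketch excludes it.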

DATA - Collection: Describe how the data were collected.
The data for the General Social Survey has been collected through interviews every year or two since 1972.  From the intro pdf at http://publicdata.norc.org/GSS/DOCUMENTS/BOOK/GSS_Codebook_intro.pdf, "Each survey from 1972 to 2004 was an independently drawn sample of English-speaking persons 18 years of age or over, living in non-institutional arrangements within the United States.  Starting in 2006 Spanish-speakers were added to the target population."  The exact sampling procedure has varied over the years as described in http://publicdata.norc.org:41000/gss/documents//BOOK/GSS_Codebook_AppendixA.pdf

DATA - Cases (observational/experimental units): What are the cases? (Remember: case = units of observation or units of experiment)
The cases are individual adults - English-speaking, 18 and over, not living in institutions; with Spanish-speaking adults added for surveys done 2006 and later.

DATA - Variables: What are the two variables you will be studying? State the type of each variable.
I will be looking at...
1) educational achievement as defined by highest degree - DEGREE.  In the GSS data, it's an ordinal categorical variable with five levels (Lt High School, High School, Junior College, Bachelor, Graduate), so I may collapse it into two or three categories, such as High School vs College, depending on the statistical methods we learn and whether I can compare that many levels.
2) hours of TV watched per day - TVHOURS.  It's numeric with values from 0-24 (the GSS question asks about hours on an average day).
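
If I do end up collapsing the degree variable, one way to sketch it in base R is to merge factor levels with a named list. The vector below is a made-up example, and the two new labels ("High School or lower", "Any college") are just names I picked for illustration.

```r
# The five degree levels as they appear in the GSS data used for this class
degree_levels <- c("Lt High School", "High School", "Junior College",
                   "Bachelor", "Graduate")

# Made-up example values, including a missing answer
degree <- factor(
  c("Lt High School", "Bachelor", "High School", "Graduate", "Junior College", NA),
  levels = degree_levels
)

# Merge the five levels into two broader categories; NA stays NA
degree2 <- degree
levels(degree2) <- list(
  "High School or lower" = c("Lt High School", "High School"),
  "Any college"          = c("Junior College", "Bachelor", "Graduate")
)

table(degree2, useNA = "ifany")
```

Assigning a named list to `levels()` keeps missing values missing, which a quick `ifelse()` on `%in%` would not do without extra care.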

DATA - Type of study: What is the type of study? Is it an observational study or an experiment? Explain how you've arrived at your conclusion using information on the sampling and/or experimental design.
The GSS is an observational study - the interviewers 'observe' or take note of the participants' answers.  There are no treatment or control groups, and nothing is assigned to the participants, so it's not an experiment.

DATA - Scope of inference - generalizability: Identify the population of interest, and whether the findings from this analysis can be generalized to that population, or, if not, a subsection of that population. Explain why or why not. Also discuss any potential sources of bias that might prevent generalizability.
The GSS hopes to represent the whole US population - or at least English (and now Spanish) speaking adults who aren't living in institutions.  Because the respondents are chosen through random sampling (and not simply because it's an observational study), the results can be generalized to that population of interest.  The study tries to get a representative sample through its surveying method, although the method has changed over the years of the survey.  Possible sources of bias include people who can't be reached or who decline to be interviewed, and the fact that TV hours are self-reported.  One other thing to note is that the survey answers collected many years ago may not generalize to the current population (although that data would probably accurately reflect the population of that time), which is why I am only using the data from after 2000.

DATA - Scope of inference - causality: Can these data be used to establish causal links between the variables of interest? Explain why or why not.
No, these data cannot be used to establish any causal links between the variables.  Because it's an observational study, and not an experiment with random assignment, we can only tell if the two variables are associated - and correlation does not equal causation!  In this particular case of education and TV watched, I'm not even sure if causality is really something to think about.  The educational attainment came first, so does going to school longer cause someone to watch less TV?  Or is it more likely there's a confounding variable of motivation or intelligence or something else that causes both higher educational attainment and less TV watching?


EXPLORATORY DATA ANALYSIS:
Perform a brief exploratory data analysis - just one or two relevant descriptive statistics and visualizations of the data. Also address what the exploratory data analysis suggests about your research question.

Using the GSS data from after the year 2000, I found that the mean hours of TV watched differed quite a bit across the levels of educational attainment.  The overall average was 2.983 hours, but the averages for the different education levels ranged from 1.939 (Graduate) to 4.070 (Lt High School), dropping with each higher level, which suggests that there is a relationship between education and TV watched.

I also did a boxplot to see the distributions over the different educational levels (attached as a pdf).  The distributions seem to be fairly similar in shape, with long trailing tails of outliers going toward the maximum of 24 hours.  The boxplot shows that the median for less than high school and high school is the same (3 hours), and the median for junior college, bachelors, and graduate is the same (2 hours), so perhaps the greatest difference is between the two broader categories of 'high school or lower' and 'any college education'.
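
For reference, here's a rough sketch of how a boxplot like this and the per-group medians could be produced in base R. The numbers are made up (I picked them so the toy medians are easy to check by eye), and the file name and axis labels are just placeholders.

```r
degree_levels <- c("Lt High School", "High School", "Junior College",
                   "Bachelor", "Graduate")

# Made-up toy data, three respondents per degree level (one missing answer)
toy <- data.frame(
  degree  = factor(rep(degree_levels, each = 3), levels = degree_levels),
  tvhours = c(2, 3, 8,  1, 3, 5,  1, 2, 4,  0, 2, 3,  2, 2, NA)
)

# Draw the boxplot into a pdf in the session's temp directory
plot_file <- file.path(tempdir(), "tvhours_by_degree.pdf")
pdf(plot_file)
boxplot(tvhours ~ degree, data = toy,
        xlab = "Highest degree", ylab = "TV hours (tvhours)")
invisible(dev.off())

# Median TV hours for each degree level, ignoring missing answers
medians <- tapply(toy$tvhours, toy$degree, median, na.rm = TRUE)
medians
```

With the real data, `toy` would be replaced by the after-2000 subset of the GSS.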

by(gssafter2000$tvhours, gssafter2000$degree, mean, na.rm=TRUE)
gssafter2000$degree: Lt High School
[1] 4.069703
---------------------------------------------------------------------------- 
gssafter2000$degree: High School
[1] 3.169679
---------------------------------------------------------------------------- 
gssafter2000$degree: Junior College
[1] 2.641186
---------------------------------------------------------------------------- 
gssafter2000$degree: Bachelor
[1] 2.230769
---------------------------------------------------------------------------- 
gssafter2000$degree: Graduate
[1] 1.939353
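
The overall average isn't shown in the output above, so here's a small sketch of computing the overall mean next to the group means. It uses a made-up toy data frame rather than the GSS data; `tapply` is an alternative to `by` that returns the group means as one compact named vector.

```r
# Made-up toy data, not real GSS rows
toy <- data.frame(
  degree  = factor(c("High School", "High School", "Bachelor", "Bachelor", "Bachelor")),
  tvhours = c(4, 2, 1, NA, 3)
)

# Overall mean, dropping missing answers: (4 + 2 + 1 + 3) / 4
overall_mean <- mean(toy$tvhours, na.rm = TRUE)

# Mean per degree level, dropping missing answers within each group
group_means <- tapply(toy$tvhours, toy$degree, mean, na.rm = TRUE)

overall_mean
group_means
```

Note that the overall mean weights every respondent equally, so it generally won't equal the simple average of the group means when the groups have different sizes.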


