… we will be using R for that!
R is …
- together with Python, the most commonly used data science language
(see the KDnuggets survey)
- Free to use and open source, so you
can see what the code is doing to your data
- Extensible: over 18,000 user-contributed add-on
packages are currently on CRAN! Bioconductor has more than 1,300 packages,
and many researchers provide packages through GitHub.
- Powerful
- With the right tools, get more work done, faster.
- Flexible
- Not a question of can, but how.
At the end of the course you will …
- be able to acquire and read data in different formats and from
different sources
- know the basic programming principles of R
- be able to implement a basic data pipeline
- be able to do a data exploration
- visualize data in appropriate forms
- communicate your findings in a reproducible form as a report and/or
web app
Syllabus
Full syllabus is available on Canvas
Textbook (optional)
Course website:
Grades
Homework | 20%
Labs | 25%
Midterm | 25%
Final Project |
  report | 22.5%
  presentation | 7.5%
Labs
- during class time on every other Wednesday (starting with Wednesday,
Sep 18th)
- you will be partnered (randomly) in groups of 3 to 4
- lab assignments are designed to be finished during class time, but
‘finishing touches’ can be applied until the following Monday, 11:59
pm.
- if you cannot attend the lab, please let me know beforehand.
Nevertheless, you are expected to do the work!
Homework
- in weeks without a lab, a homework assignment is posted.
- homework assignments review what we covered and ask you to synthesize
some new information.
- plan to spend about 3-4 hours on each assignment.
Midterm
- in-class programming exam.
- open book, open note, open internet
- no direct help from anyone else
- tentatively scheduled for Oct 30.
- sample exams will be posted as we get closer to the date.
Final project
- no final exam.
- team-based project (3-4 members).
- several stages:
- identify topic and data set
- identify line of inquiry
- report findings in a report or Shiny app
- present your project in front of the class
Attendance
I expect you to attend class in some way (face-to-face or via WebEx): there
will be a substantial amount of time devoted to ‘hands-on’ examples on
the computers. Make use of that time!
If you have to miss class, please
- let me know ahead of time.
- make sure to catch up with the material (e.g. have a designated note
taker, talk to one of your team members, …)
What is exploratory data analysis?
Typical data science project:
- exploration goes hand in hand with understanding
- our understanding of the world must be based on data
An example: mind the gap!
Statistician Hans Rosling (1948-2017) presented Gapminder
at TED 2006
- preconceived notions are problematic
- up-to-date data helps us learn about the world
… let’s try this out …
Your Turn
- Follow the link to open Gapminder tools at https://www.gapminder.org/tools/#_chart-type=bubbles
- Recreate Hans Rosling’s chart of life expectancy
(y-axis) by number of children (x-axis) and move the slider over
time.
- Using this chart, can you find evidence for the
AIDS epidemic in Africa? The civil war in Nigeria? The earthquake in
Haiti?
- How is income generally related to life
expectancy?
- What else did you find? How much of this did you
know before?
TODO after today’s lecture
- (optional) watch the Gapminder TED talk
- sign up for Slack using your iastate.edu email if you haven’t
- sign up for GitHub if you haven’t