DS 202 - Syllabus discussion

Will Ju

Welcome to
DS 202

Data Acquisition and Exploratory Data Analysis

Topics of today!

Hello!

What is this course about?

What is this course about? (cont’d)

Data comes in a lot of different formats

… as sound

… as image

… in a monitoring device

library(tuneR)
# read the WAV file into a Wave object and inspect its structure
ilr_class <- readWave("data/i-like-r.wav")
str(ilr_class)
## Formal class 'Wave' [package "tuneR"] with 6 slots
##   ..@ left     : int [1:123904] 2 2 2 2 -1 -1 -1 0 1 3 ...
##   ..@ right    : int [1:123904] -2 -2 -2 -1 2 1 0 0 -2 -4 ...
##   ..@ stereo   : logi TRUE
##   ..@ samp.rate: int 44100
##   ..@ bit      : int 16
##   ..@ pcm      : logi TRUE

library(jpeg)
# read the JPEG into a numeric array (height x width x colour channel)
img <- readJPEG("data/imgres.jpg")
str(img)
##  num [1:193, 1:200, 1:3] 0.235 0.235 0.239 0.239 0.243 ...
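
To connect the two str() outputs above to something tangible, here is a small sketch in base R. It only reuses the numbers printed above (123904 samples, a sampling rate of 44100, and array dimensions 193 x 200 x 3); the variable names and the all-zero stand-in array are assumptions for illustration, not the real sound or image data.

```r
# Values copied from the str() output of the Wave object
n_samples <- 123904   # length of the left and right channels
samp_rate <- 44100    # samples per second

# Length of the recording in seconds
duration_sec <- n_samples / samp_rate
round(duration_sec, 2)

# Stand-in for the image: an array with the same dimensions as `img`
# (height x width x colour channel); zeros, not the real pixel values
img_dims <- c(193, 200, 3)
fake_img <- array(0, dim = img_dims)
dim(fake_img)
length(fake_img)   # total number of stored values
```

So the sound clip is a little under three seconds long, and the image is stored as three 193 x 200 matrices, one per colour channel.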

… what kind of birds are these?

… we will be using R for that!

R is …

  • alongside Python, one of the most commonly used data science languages (see the KDnuggets survey)
  • Free to use, open source so you can see what code is doing to your data
  • Extensible: over 18,000 user-contributed add-on packages are currently on CRAN! Bioconductor has more than 1,300 packages, and many researchers provide packages through GitHub.
  • Powerful
    • With the right tools, get more work done, faster.
  • Flexible
    • Not a question of can, but how.

at the end of the course you will …

  • be able to acquire and read data in different formats and from different sources
  • know the basic programming principles of R
  • be able to implement a basic data pipeline
  • be able to do a data exploration
  • visualize data in appropriate forms
  • communicate your findings in a reproducible form as a report and/or web app
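
To give a rough idea of what a "basic data pipeline" might look like, here is a minimal sketch in base R. It is only an illustration under stated assumptions: the built-in mtcars data set and a temporary CSV file stand in for data acquired from an external source, and the names `path`, `cars`, and `mpg_by_am` are made up for this example rather than taken from the course materials.

```r
# 1. Acquire: write a built-in data set to a temporary CSV and read it back,
#    standing in for a file obtained from an external source
path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = TRUE)
cars <- read.csv(path, row.names = 1)

# 2. Tidy: turn the numeric transmission code into a labelled factor
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

# 3. Explore: mean fuel efficiency for each transmission type
mpg_by_am <- aggregate(mpg ~ am, data = cars, FUN = mean)
mpg_by_am

# 4. Visualize: compare the two groups
boxplot(mpg ~ am, data = cars,
        xlab = "Transmission", ylab = "Miles per gallon")
```

Each step feeds the next, which is the point of a pipeline: rerunning the script from the top reproduces the summary and the plot.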

Syllabus

Full syllabus is available on Canvas

Textbook (optional)

Course website:

Grades

Component            Weight
Homework             20%
Labs                 25%
Midterm              25%
Final project
  report             22.5%
  presentation       7.5%
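
As a worked example of how these weights combine, here is a short R sketch. The weights come from the grading breakdown above; the component scores are hypothetical numbers invented for illustration, and the sketch assumes each component is scored out of 100.

```r
# Weights from the grading breakdown (in percent)
weights <- c(homework = 20, labs = 25, midterm = 25,
             report = 22.5, presentation = 7.5)
stopifnot(sum(weights) == 100)   # the components cover the whole grade

# Hypothetical component scores out of 100 (made up for illustration)
scores <- c(homework = 90, labs = 85, midterm = 80,
            report = 88, presentation = 92)

# Weighted average: multiply each score by its weight, then divide by 100
final_score <- sum(weights * scores[names(weights)]) / 100
final_score
```

With these made-up scores the weighted result is 85.95 out of 100.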

Labs

  • during class time on every other Wednesday (starting with Wednesday, Sep 18th)
  • you will be partnered (randomly) in groups of 3 to 4
  • lab assignments are designed to be finished during class time, but ‘finishing touches’ can be applied until the following Monday, 11:59 pm.
  • if you cannot attend the lab, please let me know beforehand. Nevertheless, you are expected to do the work!

Homework

  • in weeks without a lab, a homework assignment is posted.
  • homework assignments review what we covered and ask you to synthesize some new material.
  • plan to spend about 3-4 hours on each assignment.

Midterm

  • in-class programming exam.
  • open book, open note, open internet
  • no direct help from anyone else
  • tentatively scheduled for Oct 30.
  • sample exams will be posted as we get closer to the date.

Final project

  • no final exam.
  • team-based project (3-4 members).
  • several stages:
    • identify topic and data set
    • identify line of inquiry
    • report findings in report or shiny app
    • present your project in front of the class

Attendance

I expect you to attend class in some way (face-to-face or via WebEx): a substantial amount of time will be devoted to ‘hands-on’ examples on the computers. Make use of that time!

If you have to miss class, please

  1. let me know ahead of time.
  2. make sure to catch up with the material (e.g. have a designated note taker, talk to one of your team members, … )

Skills and tools

  • self-learning
    • identify the problem (a big problem -> a series of small questions) and ask a good question
    • gather and process related information/knowledge
    • apply what you learned to see if it works (expected vs unexpected)
    • seek the right help from the right person
  • GitHub
  • Slack
  • Large language models (LLMs)
    • ChatGPT

Skills and tools | Asking a good question

is a learned and valuable skill!

Have a look at:

Skills and tools | Gather information

  • Google
  • Medium
  • Stack Overflow
  • R help
  • LLM tools (with caution)

Skills and tools | Apply what you learned

  • projects
  • answer the questions that you are interested in

Skills and tools | Seek help

  • ask a team member
  • write an email to the instructor with your question

Skills and tools | Slack

  • Slack invite link
  • “messaging app for business that connects people to the information they need”
  • fancy features
  • the invitation link expires in 30 days
  • a Slack account registered with an iastate.edu email address is required to join

Skills and tools | Large Language Models

  • will we be replaced?
  • do not fully trust their output
  • use them wisely

What is exploratory data analysis?

Typical data science project:

[Figure: typical cycle of a data science project]

  • exploration goes hand in hand with understanding
  • our understanding of the world must be based on data

An example: mind the gap!

Statistician Hans Rosling (1948-2017) presented Gapminder at TED 2006

  • preconceived notions are problematic
  • up-to-date data helps us learn about the world

… let’s try this out …

Your Turn

  • Follow the link to open Gapminder tools at https://www.gapminder.org/tools/#_chart-type=bubbles
  • Recreate Hans Rosling’s chart of life expectancy (y-axis) by number of children (x-axis) and move the slider over time.
  • Using this chart, can you find evidence for the AIDS epidemic in Africa? The civil war in Nigeria? The earthquake in Haiti?
  • How is income generally related to life expectancy?
  • What else did you find? How much of this did you know before?

TODO after today’s lecture

  • (optional) watch the Gapminder TED talk
  • sign up for Slack using your iastate.edu email if you haven’t
  • sign up for GitHub if you haven’t