Course Syllabus
CS 30: Programming for Data Science
In this course we'll study basic and advanced topics of Python programming as it relates to Data Science and Data Analytics. We'll study how to use, construct, and properly engineer Python libraries, how to make code understandable, and how to engage in that oxymoron called "Python Software Engineering". We'll cover a smattering of topics relevant to data analytics, including but not limited to basic data handling, transformation, and visualization. Along the way, we'll expose you to the basics of thinking about data, and the research quandaries and difficulties that plague effective data analytics.
Background
The career of "Data Scientist" is very different from the career of "Software Engineer." While the "Software Engineer" typically writes programs for others to use, the "Data Scientist" typically writes programs to convince others to take a particular course of action or make a particular decision. Thus, there is a lot more "science" and a lot less "engineering" required to do the job of "Data Scientist" than is required of the typical "Software Engineer."
However, the rare person who is educated in both "science" and "engineering" has an edge in the market, and we will be attempting to turn each of you into experts in both aspects. We will learn both the science of using Python for data analysis and the engineering required to share solutions with others for reuse. We will learn how to use the latest Python libraries, but will not stop there. We will learn how Python libraries are built and build some ourselves. Along the way, we will learn several principles of programming to professional standards that hopefully will aid you during any career you might choose in the future.
The reality of the profession is that on average, the Data Scientist spends roughly 90% of the time involved in data preparation, including getting access, transforming into appropriate forms, and filtering. This course is specifically about that process. We stop short of explaining the actual analyses in depth; that is the job of CS 135, 136, 137, 138, 141, and 142. Instead of this, we concentrate specifically on what is commonly called "data wrangling": pre-processing data so that existing tools can be used.
CS 30 and CS 40
In the Bachelor of Science in Data Science, you may take both CS 30 and CS 40 but only one counts for the degree. The other one counts as a free elective. CS 30 covers the issue of programming to professional standards in Python, while CS 40 covers the same issues for C and C++. The considerations are so different for Python versus C/C++ that there is almost no repeated material between the two courses.
Prerequisites
- CS 11 Introduction to Computer Science
- CS 15 Data Structures
or their equivalents at another institution. I expect some fluency in use of basic data structures including stacks, queues, trees, and graphs. During the early parts of the course, we will be learning to do the same things in Python that you might have learned in C/C++. Later on, however, we'll concentrate on things Python can do that C/C++ cannot.
Class meetings:
- Mon, Wed: 3:00-3:50 pm
- Friday: 3:30-4:20 pm
in Barnum Hall, Room 104
Labs on Tuesdays at the following times:
- 9:00-10:15 am
- 12:00-1:15 pm
in JCC 240
Missed classes
I am aware that there are many situations this term that can cause you to miss one class or even a week of classes. Accordingly:
- All lecture notes from each class will be posted on Canvas.
- All classroom exercises will be posted to Canvas and can be completed outside class for those students who need that accommodation.
- All lab exercises will be posted on Canvas.
- People who miss a lab for an acceptable reason can complete the lab later at home.
Piazza
In this class, both the teaching assistants and I will be monitoring Piazza for questions and providing answers there. Please use Piazza instead of email. Private messages are enabled, but to reach an instructor, please use "Instructors" as the message target, because that reaches both me and the teaching assistants. That way, you'll get the fastest possible answer.
Textbooks
Requirements
- Weekly labs (30% of each weekly grade).
- Weekly homeworks (30% of each weekly grade).
- In-class exercises (10% of each weekly grade).
- Weekly online quizzes (30% of each weekly grade).
See below for a rather novel grading scheme we are trying out, on the advice of researchers who study grading methods.
Workload
This is a 4-credit-hour course and it is against your instructor's religion to make it more work than that! You can expect to spend four hours in class per week and roughly three to four times that outside of class. This is not CS 40 and its workload bears no resemblance to the latter.
Instructor: BJ Stubbs
- Office: JCC 3rd floor or my personal zoom channel:
https://tufts.zoom.us/my/bjstubbs
- Office hours: Monday 12:30-2:30 JCC 3rd floor kitchen area.
- And by appointment; email me or catch me after class, or send me a message on Piazza.
- Email: benjamin.stubbs@tufts.edu
Teaching assistants
- BJ Stubbs who will be your guide during the labs.
Your instructor
I am a PhD student here at Tufts in Computer Science. I have my master's in CS from Tufts, an undergraduate degree in math, and almost 20 years of experience in data science at Mass General Brigham/Channing Division of Network Medicine, focusing on clinical trials and genetics.
Assignments and Labs
In this course, all assignments and labs begin as worksheets in the Jupyter environment for data science, using the IPython kernel. As we will quickly learn, IPython is not quite Python. The reasons for this are complex and will be covered in class.
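As a small illustration of how the notebook kernel and plain Python can differ, the sketch below checks whether a line of text parses under the standard Python grammar. It uses one commonly cited difference (notebook "magic" commands that start with `%`), which is offered only as an example and is not necessarily the difference the course has in mind. The helper name `is_plain_python` is hypothetical.

```python
import ast


def is_plain_python(source: str) -> bool:
    """Return True if `source` parses as standard Python syntax."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# An ordinary assignment is valid under the standard Python grammar.
print(is_plain_python("total = sum(range(10))"))

# A notebook-style line magic is interpreted by the kernel itself;
# the standard Python parser rejects it.
print(is_plain_python("%timeit sum(range(10))"))
```

A line that works in a notebook cell may therefore fail when pasted unchanged into a `.py` file.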
In working on assignments, you may use any up-to-date instance of the Anaconda machine learning environment that contains Jupyter and JupyterLab. There are versions of this for Windows, macOS, Linux, and even ChromeOS. Thus, you may use your personal laptops to complete assignments. In addition, all stations in JCC 240 have been provided with the latest Anaconda version. In assignments we will be navigating various parts of Anaconda.
Collaboration policy
In this course, there are four kinds of deliverables:
- A weekly lab, in which you are encouraged to collaborate with other students on filling out parts of a Jupyter Notebook.
- A weekly assignment, in which you are encouraged to collaborate with other students in filling out parts of the notebook.
- Daily classroom exercises, to be completed in class. Collaboration is allowed and encouraged.
- A weekly quiz (after the first week), that must be your own work. Collaboration is not allowed.
In other words, you are encouraged to work with others on the homeworks and labs, but the quizzes must be completed by each of you individually without collaboration.
The quizzes are an interesting experiment in themselves.
- The quizzes are not just open book; they're open internet, in the sense that you can refer to any static material on the internet in completing them.
- However, you may not utilize human contact in doing so. This includes posting questions about the quiz questions on the internet, either during the quiz or afterward.
- The quizzes are timed but occur outside of class and you can take them at your leisure.
- Thus the quizzes do not test what you can remember, but rather, how good you are at looking things up!
The science of grading
This semester we are undertaking a bold experiment in grading suggested by cutting-edge research in learning and grading policy.
- You get a grade for each week; this is a weighted average of all work turned in for the week.
- Homework counts as 30%.
- Lab counts as 30%.
- In-class exercises count as 10%.
- Quiz on past work counts as 30%.
- Your final numerical score is a weighted median of weekly grades, in which weekly grades are not counted equally. The last week of the course is worth twice as much toward your grade as the first week. For a 13-week term, the weighted grade for a week n is (weekly average) * (1 + (n-1)/12). The final grades are based upon the median of these scores.
In other words, you can always improve your grade over past performance by better performance later.
- A bad week doesn't do much.
- A great week doesn't do much either.
- It's what you do consistently that matters.
- As assignments get more difficult, their weight in the median increases along with the difficulty.
There is serious academic evidence that this kind of grading is a better measure of attainment than taking any kind of average.
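The arithmetic described above can be sketched as follows. This is one literal reading of the stated formula (scale each weekly average by 1 + (n-1)/12, then take the median of the scaled scores), not the official grading code. A true weighted median would be another possible reading. The function and variable names are hypothetical.

```python
from statistics import median

# Component weights as listed in this syllabus.
COMPONENT_WEIGHTS = {"homework": 0.30, "lab": 0.30, "in_class": 0.10, "quiz": 0.30}


def weekly_grade(scores):
    """Weighted average of one week's component scores (each 0-100)."""
    return sum(COMPONENT_WEIGHTS[name] * scores[name] for name in COMPONENT_WEIGHTS)


def week_weight(n, total_weeks=13):
    """Weight for week n: 1 for the first week, rising to 2 for the last."""
    return 1 + (n - 1) / (total_weeks - 1)


def final_score(weekly_grades, total_weeks=13):
    """Median of the weekly grades after scaling each by its week weight.

    Assumption: this follows the formula as literally written in the text.
    """
    scaled = [
        grade * week_weight(n, total_weeks)
        for n, grade in enumerate(weekly_grades, start=1)
    ]
    return median(scaled)


# Example with made-up scores for a single week.
example_week = {"homework": 90, "lab": 80, "in_class": 100, "quiz": 70}
print(weekly_grade(example_week))
print(week_weight(1), week_weight(13))
```

Under this literal reading the scaled scores are no longer on a 0-100 scale. How the median maps to a final grade is not specified here.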
"Consistency is our most important product."
This is an internal motto of the McDonald's Corporation, and it represents the spirit of this form of grading. This is my rather blatant attempt to:
- Discourage concentrating on the grade you get for each thing.
- Encourage concentrating on learning rather than grades.
- Reward consistent behavior, but also
- Reduce penalties for inconsistent behavior.
Where did this come from? I've been a member of an ad-hoc committee on grading reform for a year in the Tufts AS&E Educational Policy Committee. We're looking for grading schemes that bring learning to the center and de-emphasize "earning points" in favor of "learning things". In general we are looking for grading schemes that are different than the "video game points culture" of traditional grading. This is one attempt to put some of what we've learned about grading into practice.
Grading and lateness
In general there will be one assignment per week, introduced in the lab for that week and due at midnight on Mondays, before the lab for the following week. Late assignments will be accepted for one week following the official deadline. There is no penalty for lateness other than the obvious penalty of not being ready for a quiz on the material in the same week. Other extensions must be requested from the professor and TAs. We strongly suggest the use of Piazza to request these extensions, using a private message to "Instructors".
Although there are no late penalties for late submissions, I and the TAs respectfully request -- for our own sanity -- that you keep up with the pace of the other students. It would drive us completely nuts to receive all work for the course in the last week!
The 80/20 rule
Although this has never been an issue in practice, to pass the course, you must attempt at least 80% of the work for the course. In practice, I've found that everyone attempts everything, but I just consider it fair to mention that one can fail the course by not attempting enough of the assignments and labs. In statistical terms, there must be enough data on your performance to make the median score of your performance meaningful.
The syllabus
In general, this course is a collaborative work in progress and the initial syllabus is subject to drastic modification as I determine -- together with you -- what works and what doesn't work. This is only the second time this course has ever been taught. Thus the main problem I will have is determining how difficult the course is "for you". Right this second, I will sketch out my best guess as to the topics for each week. This is subject to radical change as I learn what you can and cannot do.
A request
My request to you is to point out any errors you find. I will not be insulted. I will be thankful. And I'll fix them as soon as humanly possible.
Frequently asked questions
Q: Is there really no midterm and no final?
A: Yes! The whole grade is based upon your ability to interact with Jupyter. This includes untimed homework and timed quizzes.
Q: Will we be studying Jupyter kernels other than IPython?
A: In general, no. There are many kernels for Jupyter, including those for Python, R, Bash, and even Matlab. We will be sticking to IPython, though many principles we will discuss apply to all kernels.
Q: Is the syllabus etched in stone? Can we discuss something new?
A: I am quite flexible. This is a second offering, and I almost never keep things the same from offering to offering. If there is something you specifically want to learn about, please reach out.
Q: Can I use something other than native Anaconda to do exercises?
A: While there is some "chance" that other environments are compatible, I would not bet on this. If you don't have a laptop that can run Anaconda, you can always work in JCC 240 and/or I can provide you an environment on a remote server. In particular, assignments "might" work on Google Colab and similar environments, but I don't have the time to ensure that! I'm limiting to Anaconda more or less to assure my own sanity!
Tentative Schedule:
Week 1: Intro 9/5
9/5 Tuesday Lab: Lab01
├── 00 Introduction.pptx
├── 00 Policies.pptx
├── 00 Structure.pptx
├── 01 Python and Jupyter.pptx
├── 02 Patterns and rituals.pptx
9/6 Wed Class
├── 03 Mutability and hashability.pptx
├── 04 Dynamic types.pptx
9/8 Friday Class
├── 05 Semantics and substitution.pptx
├── 06 Learning Python.ipynb
├── 06 Learning Python.pptx
Week 2: Programming 2 9/11
Monday Class
├── 07 Mapping.pptx
├── 08 Filtering and reduction.pptx
Tuesday Lab: Lab02
Wed Class:
├── 09 Objects.pptx
├── 10 Inheritance.pptx
Friday Class:
├── 11 Super.pptx
├── 12 Exceptions.pptx
Week 3: Programming 3 9/18
Monday Class:
├── 13 Modules.pptx
├── 14 Packages.pptx
Tuesday Lab: Lab03
Wed Class:
├── 15 Numpy.pptx
├── 16 Broadcasting.pptx
Friday Class:
├── 17 Pandas.pptx
├── 18 Dataframes.pptx
Week 4: Programming 4 9/25
Monday Class:
├── 19 Grouping.pptx
├── 20 Workflows.pptx
Tuesday Lab: Lab04
Wed Class:
├── 21 Cleaning.pptx
├── 22 Noise.ipynb
Friday Class:
├── 22 Noise.pptx
├── 23 Models.pptx
Week 5: Models 1 10/2
Monday Class:
├── 24 Linear models.ipynb
├── 24 Linear models.pptx
├── 25 Metrics.pptx
Tuesday Lab: Lab05
Wed Class:
├── 26 Logistic regression.pptx
├── 27 SVMs.pptx
Friday Class:
├── 28 Parameters.pptx
├── 29 Kernels.pptx
Week 6: Models 2 10/9
Monday Class: NO CLASS
Tuesday Lab: Lab06
Wed Class:
├── 30 Trickery.pptx
├── 31 Overfitting.pptx
Friday Class:
├── 32 Pipelines.pptx
├── 33 Transformers.pptx
Week 7: Models 3 10/16
Monday Class:
├── 34 Dimension.ipynb
├── 34 Dimension.pptx
├── 35 Dependence.ipynb
├── 35 Dependence.pptx
Tuesday Lab: Lab07
Wed Class:
├── 36 Clustering.ipynb
├── 36 Clustering.pptx
├── 37 Topics.pptx
Friday Class:
├── 38 Gibbs.pptx
├── 39 LDA.ipynb
├── 39 LDA.pptx
Week 8: Models 4 10/23
Monday Class:
├── 40 Beyond sklearn.pptx
├── 41 Geodata.pptx
Tuesday Lab: Lab08
Wed Class:
├── 42 Geocodes.ipynb
├── 42 Geocodes.pptx
├── 43 Standards.ipynb
├── 43 Standards.pptx
Friday Class:
├── 44 Visualization.pptx
├── 45 D3.ipynb
├── 45 D3.pptx
Week 9: Interactions 10/30
Monday Class:
├── 46 Interactions.ipynb
├── 46 Interactions.pptx
├── 47 Callbacks.pptx
Tuesday Lab: Lab09
Wed Class:
├── 48 Cautions.pptx
├── 49 Social.pptx
Friday Class:
├── 50 Sentiment.ipynb
├── 50 Sentiment.pptx
├── 51 Graphs.ipynb
├── 51 Graphs.pptx
├── 51 Ties.pptx
Week 10: Data Science 11/6
Monday Class:
├── 52 Communities.ipynb
├── 52 Communities.pptx
├── 53 Influence.ipynb
├── 53 Influence.pptx
Tuesday Lab: Friday Schedule lab 10
Wed Class:
├── 54 Contagion.pptx
├── 55 Optimization.pptx
Friday Class:
├── 56 Neurons.pptx
├── 57 Networks.pptx
Week 11: Networks 11/13
Monday Class:
├── 58 Scale.pptx
├── 59 Networking.ipynb
├── 59 Networking.pptx
Tuesday Lab: Lab11
Wed Class:
├── 60 Guardrails.pptx
├── 61 Decisions.pptx
Friday Class:
├── 62 Questions.pptx
├── 63 Risk.pptx
Week 12: Fun 11/20
Monday Class:
Tuesday Lab: Lab12
Wed Class: NO CLASS
Friday Class: NO CLASS
Week 13: Economics 11/27
Monday Class:
├── 64 Communication.pptx
├── 65 Management.pptx
Tuesday Lab: Lab13
Wed Class:
├── 66 Engineering.pptx
├── 67 Economics.pptx
Friday Class:
├── 68 Services.pptx
├── 69 AI.pptx
Week 14: Programming 12/4
Monday Class:
├── 70 Availability.pptx
├── 71 Reusability.pptx
Tuesday Lab: Lab14
Wed Class:
├── 72 Reproducibility.pptx
├── 73 Ethics.pptx
Friday Class:
├── 74 Persuasion.pptx
├── 75 Politics.pptx
Week 15: Programming 12/11
Monday Class:
├── 76 Mistakes.pptx
├── 77 The future.pptx
└── 78 Epilogue.pptx