Course Syllabus
CS 30: Programming for Data Science
In this course we'll study basic and advanced topics of Python programming as they relate to Data Science and Data Analytics. We'll study how to use, construct, and properly engineer Python libraries, how to make code understandable, and how to engage in that oxymoron called "Python Software Engineering". We'll cover a smattering of topics relevant to data analytics, including but not limited to basic data handling, transformation, and visualization. Along the way, we'll expose you to the basics of thinking about data, and to the research quandaries and difficulties that plague effective data analytics.
Background
The career of "Data Scientist" is very different from the career of "Software Engineer." While the "Software Engineer" typically writes programs for others to use, the "Data Scientist" typically writes programs to convince others to take a particular course of action or make a particular decision. Thus, there is a lot more "science" and a lot less "engineering" required to do the job of "Data Scientist" than is required of the typical "Software Engineer."
However, the rare person who is educated in both "science" and "engineering" has an edge in the market, and we will be attempting to turn each of you into experts in both aspects. We will learn both the science of using Python for data analysis and the engineering required to share solutions with others for reuse. We will learn how to use the latest Python libraries, but will not stop there. We will learn how Python libraries are built and build some ourselves. Along the way, we will learn several principles of programming to professional standards that will hopefully aid you during any career you might choose in the future.
The reality of the profession is that, on average, the Data Scientist spends roughly 90% of the time on data preparation, including getting access to data, transforming it into appropriate forms, and filtering it. This course is specifically about that process. We stop short of explaining the actual analyses in depth; that is the job of CS 135, 136, 137, 138, 141, and 142. Instead, we concentrate specifically on what is commonly called "data wrangling": pre-processing data so that existing tools can be used.
CS 30 and CS 40
In the Bachelor of Science in Data Science, you may take both CS 30 and CS 40 but only one counts for the degree. The other one counts as a free elective. CS 30 covers the issue of programming to professional standards in Python, while CS 40 covers the same issues for C and C++. The considerations are so different for Python versus C/C++ that there is almost no repeated material between the two courses.
Prerequisites
- CS 11 Introduction to Computer Science
- CS 15 Data Structures
or their equivalents at another institution. I expect some fluency in the use of basic data structures including stacks, queues, trees, and graphs. During the early parts of the course, we will be learning to do the same things in Python that you might have learned in C/C++. Later on, however, we'll concentrate on things Python can do that C/C++ cannot.
Class meetings:
- Monday 9:30-10:20. Monday labs: 1:30-2:45 -or- 3:00-4:15.
- Tuesday and Thursday 10:30-11:20.
Missed classes
I am aware that there are many situations this term that can cause you to miss one class or even a week of classes. Accordingly:
- All lecture notes from each class will be posted on Canvas.
- All classroom exercises will be posted to Canvas and can be completed outside class for those students who need that accommodation.
- All lab exercises will be posted on Canvas. People who miss a lab for an acceptable reason can complete the lab later at home.
Piazza
In this class, both the teaching assistants and I will be monitoring Piazza for questions and providing answers there. Please use Piazza instead of email. Private messages are enabled, but to reach an instructor, please use "Instructors" as the message target because that reaches both me and the teaching assistants. That way, you'll get the fastest possible answer.
Textbooks
Requirements
- Weekly labs (30% of each weekly grade).
- Weekly homeworks (30% of each weekly grade).
- In-class exercises (10% of each weekly grade).
- Weekly in-class quizzes (30% of each weekly grade).
See below for the rather novel grading scheme we use, adopted on the advice of researchers who study grading methods.
Workload
This is a 4-credit-hour course and it is against your instructor's religion to make it more work than that! You can expect to spend four hours in class per week and roughly three to four times that outside of class. This is not CS 40, and its workload bears no resemblance to that of CS 40.
Instructor: Alva Couch
- Office: JCC 351 or https://tufts.zoom.us/myalvacouch
- Office hours: to be determined
- And by appointment; email me or catch me after class, or send me a message on Piazza.
- Email: Alva.Couch@tufts.edu
Teaching assistants
- To be announced.
Your instructor
I can take some getting used to as an instructor, because I do not function by the same rules as many other instructors. First, I do not teach from authority. I do not consider myself smarter than you. In fact, I am quite sure that you know more than I do about some things. What I have to offer, instead, is more than 31 years of experience in the discipline, and some proven approaches to developing understanding.
I do not teach from authority because I do not need to. Physical law is my authority, and I can demonstrate everything I claim either experimentally or mathematically. In a systemic and deep sense, I am fundamentally a scientist, dedicated to the truth and to teaching you how to find it for yourself.
This means that, among other things:
- I am a random-access device. I can be asked about anything at any time. Do not worry about putting me on the spot. If I don't know something, I will simply find out about it, and enjoy doing it!
- I expect and greatly enjoy challenges and learning new things. If there is something neat and cool (or perhaps "c00l" or "kewl" or "wicked") that you want to know more about, ask me! I hope to impress upon you the importance of adopting that attitude yourselves!
- You cannot bruise my ego by proving me wrong. My primary duty is to the truth, and you can please and impress me by pointing out where I am wrong! In fact, I will feel worse if you let me get something wrong without pointing it out!
- You cannot hurt my opinion of you by showing me that you don't know something. I will simply try to help you with it. You can increase my opinion of you by showing me that you understand the limits of your knowledge!
The last one of these is -- in fact -- the key to everything we will do.
I come to this course with 36 years of experience in Software Engineering and Data Analysis. I started programming in Python in 2009, and for the last 9 years I have been involved in a professional programming project, HydroShare (https://www.hydroshare.org), from which I recently retired as the lead software architect. I wrote the business engine for HydroShare, and it might accurately be said that I put the "share" into HydroShare. HydroShare is a bold experiment in "Python Software Engineering" for data analysis, and it incorporates all of the modern practices required to write Python to professional standards. As the course progresses, we will discuss each of these in turn.
A novel course format
This term, I will try something I've never tried before. Through considerable work from a lot of people, all of the lecture material is pre-recorded as lessons. Thus, the classes will be discussion-based and will lead you through some of the more difficult concepts expressed in the lessons. I will generally expect you to listen to the online lectures on a topic before the class on that topic. I will note in the modules list which lectures cover which material.
Assignments and Labs
In this course, all assignments and labs begin as worksheets in the Jupyter environment for data science, using the IPython kernel. As we will quickly learn, IPython is not quite Python. The reasons for this are complex and will be covered in class.
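One visible difference, offered here only as an illustration ahead of the class discussion: IPython accepts "magic" commands such as %timeit and shell escapes such as !ls, neither of which is valid Python syntax. The minimal sketch below uses the standard ast module to check this; the helper name is_plain_python is hypothetical and not part of any course material.

```python
import ast


def is_plain_python(source):
    """Return True if `source` parses as standard Python, False otherwise."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# An ordinary statement, an IPython magic, and an IPython shell escape.
for line in ["total = sum(range(10))", "%timeit sum(range(10))", "!ls"]:
    print(line, "->", is_plain_python(line))
```

Only the first line parses as plain Python; the other two are accepted by IPython but rejected by the standard parser.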
In working on assignments, you may use any up-to-date instance of the Anaconda machine learning environment that contains Jupyter and JupyterLab. There are versions of this for Windows, macOS, Linux, and even ChromeOS. Thus, you may use your personal laptops to complete assignments. In addition, all stations in JCC 240 have been provided with the latest Anaconda version. In assignments we will be navigating various parts of Anaconda.
Collaboration policy
In this course, there are four kinds of deliverables:
- A weekly lab, in which you are encouraged to collaborate with other students on filling out parts of a Jupyter Notebook.
- A weekly assignment, in which you are encouraged to collaborate with other students in filling out parts of the notebook.
- Daily classroom exercises, to be completed in class. Collaboration is allowed and encouraged.
- A weekly quiz (after the first week, every Thursday), that must be your own work. Collaboration is not allowed.
In other words, you are encouraged to work with others on the homework and labs, but the quizzes must be completed by each of you individually without collaboration.
This structure arises from consultation with educational experts. The best way to learn this material is to discuss it with others. Although you are allowed to complete work on your own, this is not recommended. The best way to reinforce the material is to explain it to someone else.
The science of grading
This semester we continue a bold experiment in grading suggested by cutting-edge research in learning and grading policy.
- You get a grade for each week; this is a weighted average of all work turned in for the week.
- Homework counts as 30%.
- Lab counts as 30%.
- In-class exercises count as 10%.
- Quiz on past work counts as 30%.
- Your final numerical score is a weighted median of weekly grades, in which weekly grades are not counted equally. The last week of the course is worth twice as much toward your grade as the first week. For a 13-week term, the weighted grade for week n is (weekly average) * (1 + (n-1)/12), so the factor is 1 for week 1 and 2 for week 13. The final grades are based upon the median of these scores.
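To make the arithmetic concrete, here is a minimal Python sketch of one reading of the scheme above. The function names and the sample scores are hypothetical, and treating the factor 1 + (n-1)/12 as the weight in a weighted median is an assumption about how the statements above combine; it is not the official grade computation.

```python
# Component weights stated in the syllabus.
COMPONENT_WEIGHTS = {"homework": 0.30, "lab": 0.30, "exercises": 0.10, "quiz": 0.30}


def weekly_grade(scores):
    """Weighted average of one week's component scores (each on a 0-100 scale)."""
    return sum(COMPONENT_WEIGHTS[name] * scores[name] for name in COMPONENT_WEIGHTS)


def week_weight(n, weeks=13):
    """Weight of week n (1-based): 1.0 for the first week, 2.0 for the last."""
    return 1 + (n - 1) / (weeks - 1)


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight.

    ASSUMPTION: this is one common definition of a weighted median; the
    syllabus does not say which definition the course uses.
    """
    half = sum(weights) / 2
    running = 0.0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value
    raise ValueError("values must not be empty")


# Hypothetical term: twelve steady weeks and one bad week (week 5).
grades = [85.0] * 13
grades[4] = 20.0
weights = [week_weight(n) for n in range(1, 14)]
final_score = weighted_median(grades, weights)
print(final_score)
```

In this sketch the single bad week leaves the result at 85, because a median ignores how far an outlier sits from the middle of the distribution.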
In other words, you can always improve your grade over past performance by better performance later.
- A bad week doesn't do much.
- A great week doesn't do much either.
- It's what you do consistently that matters.
- As assignments get more difficult, their weight in the median increases along with the difficulty.
There is considerable academic evidence that this kind of grading is a better measure of attainment than taking any kind of average.
"Consistency is our most important product."
This is an internal motto of the McDonald's Corporation, and it represents the spirit of this form of grading. This is my rather blatant attempt to:
- Discourage concentrating on the grade you get for each thing.
- Encourage concentrating on learning rather than grades.
- Reward consistent behavior, but also
- Reduce penalties for inconsistent behavior.
Where did this come from? I've been a member of an ad-hoc committee on grading reform in the Tufts AS&E Educational Policy Committee. We've studied grading schemes that bring learning to the center and de-emphasize "earning points" in favor of "learning things". In general, we are looking for grading schemes that differ from the "video game points culture" of traditional grading. This is one attempt to put some of what we've learned about grading into practice.
Grading and lateness
In general there will be one assignment per week, introduced in the lab for that week and due at midnight on Monday of the following week. Late assignments will be accepted for one week following the official deadline. There is no penalty for lateness other than the obvious penalty of not being ready for a quiz on the material in the same week. Other extensions must be requested from the professor and TAs. We strongly suggest using Piazza to request these extensions, via a private message to "Instructors".
Although there are no late penalties for late submissions, I and the TAs respectfully request -- for our own sanity -- that you keep up with the pace of the other students. It would drive us completely nuts to receive all work for the course in the last week!
The 80/20 rule
Although this has never been an issue in practice, to pass the course, you must attempt at least 80% of the work for the course. In practice, I've found that everyone attempts everything, but I just consider it fair to mention that one can fail the course by not attempting enough of the assignments and labs. In statistical terms, there must be enough data on your performance to make the median score of your performance meaningful.
The syllabus
In general, this course is a collaborative work in progress and the initial syllabus is subject to drastic modification as I determine -- together with you -- what works and what doesn't work. This is the first time this course has ever been taught in this way. Thus the main problem I will have is determining how difficult the course is "for you". Right this second, I will sketch out my best guess as to the topics for each week. This is subject to radical change as I learn what you can and cannot do.
A request
My request to you is to point out any errors you find. I will not be insulted. I will be thankful. And I'll fix them as soon as humanly possible.
Frequently asked questions
Q: Is there really no midterm and no final?
A: Correct; there is neither! The whole grade is based upon your ability to interact with Jupyter. This includes untimed homework and in-class quizzes.
Q: Will we be studying Jupyter kernels other than iPython?
A: In general, no. There are many kernels for Jupyter, including those for Python, R, Bash, and even Matlab. We will be sticking to IPython, though many principles we will discuss apply to all kernels.
Q: Is the syllabus etched in stone? Can we discuss something new?
A: I am quite flexible. I almost never keep things the same from offering to offering, and if there is something you specifically want to learn about, please reach out.
Q: Can I use something other than native Anaconda to do exercises?
A: While there is some "chance" that other environments are compatible, I would not bet on this. If you don't have a laptop that can run Anaconda, you can always work in JCC 240 and/or I can provide you an environment on a remote server. In particular, assignments "might" work on Google Colab and similar environments, but I don't have the time to ensure that! I'm limiting the course to Anaconda more or less to preserve my own sanity!
Q: Does it matter if I took Data Structures in Java?
A: No. There is no dependence upon C or C++ in this course. Equivalents of CS 11 and CS 15 in Java are fine as preparation for this course.
Student Resources:
Accommodations for Students with Disabilities: Tufts University values the diversity of our students, staff, and faculty and recognizes the important contribution each student makes to our unique community. Tufts is committed to providing equal access and support to all qualified students through the provision of reasonable accommodations so that each student may fully participate in the Tufts experience. If you have a disability that requires reasonable accommodations, please contact the StAAR Center (formerly Student Accessibility Services) at StaarCenter@tufts.edu or 617-627-4539 to make an appointment with an accessibility representative to determine appropriate accommodations. Please be aware that accommodations cannot be enacted retroactively, making timeliness a critical aspect for their provision.
Academic Support at the StAAR Center: The StAAR Center (formerly the Academic Resource Center and Student Accessibility Services) offers a variety of resources to all students (both undergraduate and graduate) in the Schools of Arts and Science, Engineering, the SMFA and Fletcher; services are free to all enrolled students. Students may make an appointment to work on any writing-related project or assignment, attend subject tutoring in a variety of disciplines, or meet with an academic coach to hone fundamental academic skills like time management or overcoming procrastination. Students can make an appointment for any of these services by visiting the StAAR Center website (go.tufts.edu/StAARCenter).
Mental Health Support: As a student, there may be times when personal stressors or emotional difficulties interfere with your academic performance or well-being. The Counseling and Mental Health Service (CMHS) provides confidential consultation, brief counseling, and urgent care at no cost for all Tufts undergraduates as well as for graduate students who have paid the student health fee. To make an appointment, call 617-627-3360. Please visit the CMHS website (go.tufts.edu/Counseling) to learn more about their services and resources.