Course Syllabus

CS 30: Programming for Data Science

In this course we'll study basic and advanced topics of Python programming as it relates to Data Science and Data Analytics. We'll study how to use, construct, and properly engineer Python libraries, how to make code understandable, and how to engage in that oxymoron called "Python Software Engineering". We'll cover a smattering of topics relevant to data analytics, including but not limited to basic data handling, transformation, and visualization. Along the way, we'll expose you to the basics of thinking about data, and to the research quandaries and difficulties that plague effective data analytics.

Background

The career of "Data Scientist" is very different from the career of "Software Engineer." While the "Software Engineer" typically writes programs for others to use, the "Data Scientist" typically writes programs to convince others to take a particular course of action or make a particular decision. Thus, there is a lot more "science" and a lot less "engineering" required to do the job of "Data Scientist" than is required of the typical "Software Engineer."

However, the rare person who is educated in both "science" and "engineering" has an edge in the market, and we will be attempting to turn each of you into an expert in both aspects. We will learn both the science of using Python for data analysis and the engineering required to share solutions with others for reuse. We will learn how to use the latest Python libraries, but will not stop there. We will learn how Python libraries are built and build some ourselves. Along the way, we will learn several principles of programming to professional standards that hopefully will aid you during any career you might choose in the future.

The reality of the profession is that, on average, the Data Scientist spends roughly 90% of the time on data preparation, including getting access to data, transforming it into appropriate forms, and filtering it. This course is specifically about that process. We stop short of explaining the actual analyses in depth; that is the job of CS 135, 136, 137, 138, 141, and 142. Instead, we concentrate specifically on what is commonly called "data wrangling": pre-processing data so that existing tools can be used.
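To make "data wrangling" a bit more concrete, here is a minimal sketch of the three steps named above (access, transform, filter). Everything in it is an assumption made for illustration: the data, the column names, and the wrangle function are hypothetical, and it uses only the Python standard library rather than the libraries we will actually study.

```python
import csv
import io

# Hypothetical raw data: stray spaces, a missing reading, and a junk value.
RAW = """station,date,temp_f
BOS, 2021-09-01 ,72.5
BOS,2021-09-02,
NYC,2021-09-01,75.2
NYC,2021-09-02,n/a
"""


def wrangle(text):
    """Parse CSV text, drop rows without a numeric reading, convert F to C."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):  # access: read the raw rows
        try:
            temp_f = float(row["temp_f"])
        except ValueError:
            continue  # filter: skip missing or non-numeric readings
        cleaned.append({
            "station": row["station"].strip(),   # transform: tidy the text
            "date": row["date"].strip(),
            "temp_c": round((temp_f - 32) * 5 / 9, 1),  # transform: change units
        })
    return cleaned


rows = wrangle(RAW)
for row in rows:
    print(row)
```

Only two of the four rows survive, and those two are now in a form that an existing analysis tool could consume directly.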

CS 30 and CS 40

In the Bachelor of Science in Data Science, you may take both CS 30 and CS 40 but only one counts for the degree. The other one counts as a free elective. CS 30 covers the issue of programming to professional standards in Python, while CS 40 covers the same issues for C and C++. The considerations are so different for Python versus C/C++ that there is almost no repeated material between the two courses. 

Prerequisites 

  • CS 11 Introduction to Computer Science
  • CS 15 Data Structures

or their equivalents at another institution. I expect some fluency in the use of basic data structures including stacks, queues, trees, and graphs. During the early parts of the course, we will be learning to do the same things in Python that you might have learned in C/C++. Later on, however, we'll concentrate on things Python can do that C/C++ cannot.

Class meetings: 

  • Tuesday, Wednesday, Friday 9:30-10:20 in Barnum 104.
  • Labs in Halligan 116 on Thursdays at the following times:
    • 12:00 pm -- 1:15 pm
    • 1:30 pm -- 2:45 pm
    • 4:30 pm -- 5:45 pm 
  • All class lectures will be recorded and posted to Canvas. I use a recording technology that only records the lecture screen. I will use a local microphone that cannot pick up your questions, and will repeat questions for the benefit of later viewers. Your voice will be very quiet on any recordings. Recordings will only be available to class members and will not be generally available on the Internet. 
  • I do not have the technology or patience to simulcast classes to Zoom in real time while lecturing in person. At best, this is a fragile strategy, because I do not have anyone to monitor Zoom for questions and you would be correct in assuming that I am ignoring remote viewers. As well, there are class activities that require in-person participation; doing these both remotely and in person at the same time is impractical. Thus, in general, unless we have to go completely virtual, I will not be trying to do this.

Health and Covid-19

In accordance with University policy, masks must be worn properly at all times during lectures and labs. Accordingly, you may not eat or drink during in-person lectures. To the extent that the class requires it, we will be doing things that require group conversations and group work. Thus it is particularly important to wear masks, because some activities will require that we not maintain social distance.

As well, I feel it fair to mention that I am the oldest Computer Science professor and in a vulnerable population with respect to Covid-19, and I am teaching in person this term somewhat against the advice of my closest friends, who consider it too much of a risk. I would appreciate strict compliance with mask wearing to protect me and yourselves. Let's show the world that we can learn safely in person.

Missed classes and remote participation

I am aware that there are many situations this term that can cause you to miss one class or even a week of classes. Accordingly:

  • All lecture notes from each class will be posted on Canvas. 
  • All lectures will be recorded and posted on Canvas. 
  • All classroom exercises will be posted to Canvas and can be completed outside class for those students who need that accommodation. 
  • All lab exercises will be posted on Canvas. 
  • People who miss a lab for an acceptable reason can complete the lab later at home.
  • Those students attending completely remotely for health reasons -- or due to visa difficulties -- can complete all classroom exercises and labs remotely. Students in this situation are invited to work with other remote students in completing any group assignments. 

Piazza

In this class, both the teaching assistants and I will be monitoring Piazza for questions and providing answers there. Please use Piazza instead of email. Private messages are enabled, but to get to an instructor, please use "Instructors" as a message target because that reaches both me and the teaching assistants. That way, you'll get the fastest possible answer.

Textbooks

  • Python for Data Analysis, 2nd Edition, by Wes McKinney, O'Reilly Media, 2017
  • Frequent web readings. 

The textbook covers the basics but is not at all comprehensive and we will be covering much more advanced topics in the context of the course. There is no textbook that currently covers all that we will study. The textbook -- in general -- provides a gentle introduction to topics that we will -- in general -- cover in much more depth. 

Requirements

  • Weekly labs (30% of each weekly grade).
  • Weekly homeworks (30% of each weekly grade).
  • In-class exercises (10% of each weekly grade). 
  • Weekly online quizzes (30% of each weekly grade).

See below for a rather novel grading scheme we are trying out, on the advice of researchers who study grading methods.

Workload

This is a 4-credit-hour course and it is against your instructor's religion to make it more work than that! You can expect to spend four hours in class per week and roughly three to four times that outside of class. This is not CS 40 and its workload bears no resemblance to the latter.

Instructor: Alva Couch

Teaching assistants

  • Nick Murphy, BSDS 2022. 
  • And others to be determined. 

Your instructor

I can take some getting used to as an instructor, because I do not function by the same rules as many other instructors. First, I do not teach from authority. I do not consider myself smarter than you. In fact, I am quite sure that you know more than I do about some things. What I have to offer, instead, is more than 31 years of experience in the discipline, and some proven approaches to developing understanding.

I do not teach from authority because I do not need to. Physical law is my authority, and I can demonstrate everything I claim either experimentally or mathematically. In a systemic and deep sense, I am fundamentally a scientist, dedicated to the truth and to teaching you how to find it for yourself.

This means that, among other things:

  • I am a random-access device. I can be asked about anything at any time. Do not worry about putting me on the spot. If I don't know something, I will simply find out about it, and enjoy doing it!
  • I expect and greatly enjoy challenges and learning new things. If there is something neat and cool (or perhaps "c00l" or "kewl" or "wicked") that you want to know more about, ask me! I hope to impress upon you the importance of adopting that attitude yourselves!
  • You cannot bruise my ego by proving me wrong. My primary duty is to the truth, and you can please and impress me by pointing out where I am wrong! In fact, I will feel worse if you let me get something wrong without pointing it out!
  • You cannot hurt my opinion of you by showing me that you don't know something. I will simply try to help you with it. You can increase my opinion of you by showing me that you understand the limits of your knowledge!

The last one of these is -- in fact -- the key to everything we will do.

I come to this course with 32 years of experience in Software Engineering and Data Analysis. I started programming in Python in 2009, and for the last 9 years I have been involved in a professional programming project, HydroShare (https://www.hydroshare.org), for which I remain the lead software architect. I wrote the business engine for HydroShare and it might be said accurately that I put the "share" into HydroShare. HydroShare is a bold experiment in "Python Software Engineering" for data analysis, and incorporates all of the modern practices required to write Python to professional standards. As the course progresses, we will discuss each of these in turn.

Assignments and Labs

In this course, all assignments and labs begin as worksheets in the Jupyter environment for data science, using the IPython kernel. As we will quickly learn, IPython is not quite Python. The reasons for this are complex and will be covered in class.

In working on assignments, you may use any up-to-date instance of the Anaconda machine learning environment that contains Jupyter and JupyterLab. There are versions of this for Windows, macOS, Linux, and even ChromeOS. Thus, you may use your personal laptops to complete assignments. In addition, all stations in Halligan 116 have been provided with the latest Anaconda version. In assignments we will be navigating various parts of Anaconda.

OtterGrader and GradeScope

This term, we will be using an automated grading system called OtterGrader developed first in the Berkeley Data Science curriculum. This system runs inside GradeScope, which you are expected to use to submit all assignments and labs for grading. Regular grading occurs in two phases. First the automated grader in OtterGrader/GradeScope checks your solution for correctness, and then the teaching assistants comment on style and readability.  

This means, in turn, that your assignments and labs are actually directory hierarchies, with a Jupyter page that you should edit at the top of the hierarchy. When your page is graded, this hierarchy is stripped away and replaced with my reference version. So you should not change anything except the page(s) containing the assignment. Any changes you make in other files will not be used during grading. This can lead to a zero grade on correctness. 

Collaboration policy

In this course, there are four kinds of deliverables: 

  • A weekly lab, in which you are encouraged to collaborate with other students on filling out parts of a Jupyter Notebook. 
  • A weekly assignment, in which you are encouraged to collaborate with other students in filling out parts of the notebook. 
  • Daily classroom exercises, to be completed in class. Collaboration is allowed and encouraged.  
  • A weekly quiz (after the first week), that must be your own work. Collaboration is not allowed. 

In other words, you are encouraged to work with others on the homeworks and labs, but the quizzes must be completed by each of you individually without collaboration. 

The quizzes are an interesting experiment in themselves.

  • The quizzes are not just open book; they're open internet, in the sense that you can refer to any static material on the internet in completing them.
  • However, you may not utilize human contact in doing so.  This includes posting questions about the quiz questions on the internet, either during the quiz or afterward. 
  • The quizzes are timed but occur outside of class and you can take them at your leisure.
  • Thus the quizzes do not test what you can remember, but rather, how good you are at looking things up!

The science of grading

This semester we are undertaking a bold experiment in grading suggested by cutting-edge research in learning and grading policy. 

  • You get a grade for each week; this is a weighted average of all work turned in for the week. 
    • Homework counts as 30%
    • Lab counts as 30%
    • In-class exercises count as 10%. 
    • Quiz on past work counts as 30%
  • Your final numerical score is a weighted median of weekly grades, in which weekly grades are not counted equally. The last week of the course is worth twice as much toward your grade as the first week. For a 13-week term, the weighted grade for week n is (weekly average) * (1 + (n-1)/12), so week 1 has weight 1 and week 13 has weight 2. The final grades are based upon the median of these scores.
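For those who would rather read code than prose, here is a minimal sketch of the arithmetic above. It is an illustration and not the actual grading script. The function names are hypothetical, and the definition of "weighted median" used here (the smallest grade at which the running total of weights reaches half of the total weight) is an assumption; it is one common definition, and the description above could also be read in other ways.

```python
# Component weights within one week, as listed above.
WEEKLY_WEIGHTS = {"homework": 0.30, "lab": 0.30, "exercises": 0.10, "quiz": 0.30}


def weekly_grade(homework, lab, exercises, quiz):
    """Weighted average of one week's work, each score on a 0-100 scale."""
    scores = {"homework": homework, "lab": lab, "exercises": exercises, "quiz": quiz}
    return sum(WEEKLY_WEIGHTS[name] * scores[name] for name in WEEKLY_WEIGHTS)


def week_weight(n, weeks=13):
    """Weight of week n (1-based): 1.0 for the first week, 2.0 for the last."""
    return 1 + (n - 1) / (weeks - 1)


def weighted_median(grades, weights):
    """Assumed definition: the smallest grade at which the running total of
    weights reaches at least half of the total weight."""
    half = sum(weights) / 2
    running = 0.0
    for grade, weight in sorted(zip(grades, weights)):
        running += weight
        if running >= half:
            return grade
    raise ValueError("no grades given")


# One disastrous week among twelve steady ones.
grades = [85.0] * 13
grades[4] = 0.0
weights = [week_weight(n) for n in range(1, 14)]
print(weighted_median(grades, weights))  # 85.0
```

Under this reading, the single zero carries a weight of about 1.33 out of a total of 19.5, far short of the half needed to move the median, so the final score stays at 85.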

In other words, you can always improve your grade over past performance by better performance later. 

  • A bad week doesn't do much. 
  • A great week doesn't do much either. 
  • It's what you do consistently that matters. 
  • As assignments get more difficult, their weight in the median increases along with the difficulty.

There is serious academic evidence that this kind of grading is a better measure of attainment than taking any kind of average.    

"Consistency is our most important product." 

This is an internal motto of the McDonald's Corporation, and it represents the spirit of this form of grading. This is my rather blatant attempt to:

  1. Discourage concentrating on the grade you get for each thing.
  2. Encourage concentrating on learning rather than grades.
  3. Reward consistent behavior, but also 
  4. Reduce penalties for inconsistent behavior.

Where did this come from? I've been a member of an ad-hoc committee on grading reform for a year in the Tufts AS&E Educational Policy Committee. We're looking for grading schemes that bring learning to the center and de-emphasize "earning points" in favor of "learning things". In general we are looking for grading schemes that are different from the "video game points culture" of traditional grading. This is one attempt to put some of what we've learned about grading into practice.

Grading and lateness

In general there will be one assignment per week, introduced in the lab for that week and due at midnight on Wednesdays, before the lab for the following week. Late assignments will be accepted for one week following the official deadline. There is no penalty for lateness other than the obvious penalty of not being ready for a quiz on the material in the same week. Other extensions must be requested from the professor and TAs. We strongly suggest the use of Piazza to request these extensions, using a private message to "Instructors".

Although there are no late penalties for late submissions, I and the TAs respectfully request -- for our own sanity -- that you keep up with the pace of the other students. It would drive us completely nuts to receive all work for the course in the last week! 

The syllabus

In general, this course is a collaborative work in progress and the initial syllabus is subject to drastic modification as I determine -- together with you -- what works and what doesn't work. This is the first time this course has ever been taught. Thus the main problem I will have is determining how difficult the course is "for you".  Right this second, I will sketch out my best guess as to the topics for each week. This is subject to radical change as I learn what you can and cannot do. 

A disclaimer and a request

This is the first time this particular course has been taught, and the first time we've used OtterGrader/GradeScope. I used OtterGrader's predecessor in a previous course but the current version diverges greatly from the version I used before. This is a complex system, and there is no chance that I will get through the term without making errors in labs and/or homework. 

My request to you is to point out any errors you find. I will not be insulted. I will be thankful. And I'll fix them as soon as humanly possible. 

Frequently asked questions

Q: Is there really no midterm and no final?

A: Yes! The whole grade is based upon your ability to interact with Jupyter. This includes untimed homework and timed quizzes. 

Q: Will we be studying Jupyter kernels other than IPython?

A: In general, no. There are many kernels for Jupyter, including those for Python, R, Bash, and even Matlab. We will be sticking to IPython, though many principles we will discuss apply to all kernels.

Q: Is the syllabus etched in stone? Can we discuss something new?

A: I am quite flexible this term. This is a first offering and if there is something you specifically want to learn about, please reach out. 

Q: Can I use something other than native Anaconda to do exercises? 

A: While there is some "chance" that other environments are compatible, I would not bet on this. If you don't have a laptop that can run Anaconda, you can always work in Halligan 116 and/or I can provide you an environment on a remote server. In particular, assignments "might" work on Google Colab and similar environments, but I don't have the time to ensure that! I'm limiting the course to Anaconda more or less to preserve my own sanity!

Q: Is an online version of the textbook sufficient? 

A: In general, yes. We won't be having open-book in-class quizzes, but we will be having "open book"/"open internet" quizzes (yes, you read that correctly) in which you have all possible books and the whole internet to help you, with the exception of personal communications. An open-internet quiz is a Jupyter notebook that you are expected to complete on your own using notes, books, and Internet resources. These include physical and online copies of any books whatsoever.