ChatGPT and You

(aka Congratulations/I’m so sorry)

This reading is primarily about the use of Large Language Models (LLMs) to write code. There are many similar arguments related to the use of LLMs to write essays and reports, but that is not the focus of this reading.

You are beginning your Data Science education at a remarkable time. Data Science has always been a rapidly evolving field, but even by its own standards, the changes that have taken place over the last year with the emergence of highly capable Large Language Models (LLMs) like ChatGPT, GitHub Copilot, Gemini, etc. are staggering. We are only just beginning to figure out how best to use these tools, and you are perfectly positioned to help shape that future (the “Congratulations!” in the subtitle above).

But because we are still learning how best to integrate these tools into our lives, I have to warn you that as students, you are also in a uniquely vulnerable position (the “I’m so sorry” subtitle). That’s because while it is clear that LLMs are here to stay, and will inevitably play a critical role in the daily life of the practicing data scientist, there is a significant danger that overreliance on LLMs early in your education may stunt your professional development.

The Catch-22

The problem with LLMs is that when it comes to writing code for real data science work (the kind of stuff you’ll be doing in your advanced courses and in your career), LLMs are error-prone and require substantial supervision. Moreover, there is good reason to think that this is a problem that our current approach to developing LLMs will never be able to overcome. As a result, most practicing data scientists use LLMs like student research assistants—they let the LLMs write the first draft of code, then review and refine the code based on their own expertise. This is extremely useful, but it does mean that to use LLMs effectively, a data scientist still has to learn to program on their own.

And that’s where the danger lies with LLMs: while LLMs are error-prone when it comes to real-world data science projects, they’re shockingly good at the types of basic programming exercises that are the staple of introductory data science and programming courses. And that can create a temptation to use LLMs extensively in introductory classes, precluding the development of a strong understanding of the principles of programming. That, in turn, means you are likely to be less effective at using LLMs in the longer run.

The Importance of Active Learning

If there is anything that researchers have discovered about learning, it is that to learn something effectively, one must actively engage with the material. Passive lectures—in which a professor stands at the front of a room and talks to students—may be the norm in schools, but empirically they are actually one of the worst ways to teach. It is only by doing activities in which students get to test their understanding of a topic that real learning happens.

(Ironically, while lectures are not very effective at actually getting students to learn, they are effective at providing students with the illusion of understanding—the false sense that learning has occurred! Here’s one nice illustrative study of this phenomenon.)

And that’s why LLMs are potentially so problematic for students in your position—it’s not enough to do the readings about programming and then turn to an LLM as soon as you feel stuck; real learning requires you to spend time in that frustrated, uncomfortable stuck place trying to figure out what about your understanding of the material is inadequate to allow you to move forward.

OK, but isn’t this just the future? Calculators obviated long division, after all

Well… no. No, they didn’t. We still teach kids how to do addition, multiplication, and division despite the fact that computers and calculators can do it better, after all. Why? Because it helps them to develop number sense, which is a critical foundation for more sophisticated types of quantitative reasoning, advanced mathematics, statistics, etc.

The same goes for programming—yes, an LLM can easily write a for-loop or a function to find prime numbers, but we aren’t asking you to practice those skills just so you can write a for-loop; we’re asking you to practice those skills to help you develop a comfort with solving problems through algorithmic thinking.
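
To make that concrete, here is a sketch of exactly the kind of exercise I mean: a function that finds prime numbers, of the sort an LLM will happily produce on demand. (The function name and details here are just mine, for illustration.)

```python
# A classic intro exercise: find all primes up to n by trial division.
# An LLM can write this instantly, but writing it yourself is what
# builds the algorithmic thinking the exercise is really about.
def primes_up_to(n):
    """Return a list of every prime number less than or equal to n."""
    primes = []
    for candidate in range(2, n + 1):
        is_prime = True
        # Only need to test divisors up to the square root of the candidate.
        for divisor in range(2, int(candidate**0.5) + 1):
            if candidate % divisor == 0:
                is_prime = False
                break
        if is_prime:
            primes.append(candidate)
    return primes

print(primes_up_to(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```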

Reliance on LLMs Is Particularly Dangerous in Data Science

As I mentioned before, part of the reason to not become overly dependent on LLMs early in your education is that it will prevent you from learning the skills you will later need to supervise them when doing more complex work. This is true in all domains of programming, but it’s especially true in Data Science.

In many contexts—like web design or app development—you can start to evaluate whether an LLM has successfully completed a task just by looking at what it created. If an LLM creates a website for you, and the website looks correct and does everything you want it to do, then, well… the LLM did a pretty good job! You wouldn’t want to use this approach to launch a website for a big company—there are lots of corner cases that would be hard for you to verify by guess-and-check, and it’s hard to confirm your site is secure—but it’s at least a start.

But the same cannot be said for data science. As data scientists, we are in the business of generating new knowledge, which means that we don’t know what the output of our models should look like in advance. Sure, we have some sense of what reasonable outputs look like—an analysis that suggests smoking prevents cancer is going to raise a lot of red flags—but the reason the FDA forces drug companies to run clinical trials before approving drugs, and Google runs A/B tests whenever they change how search results are shown, is that we don’t know in advance which drugs will work or what users will respond to.

As a result—unlike in web development or writing app widgets—we can’t evaluate whether an LLM has correctly written the analysis code we asked for by looking at the results. Rather, in data science, our confidence in our conclusions depends almost entirely on our confidence in how our results were generated, and that can only come from reading, testing, and understanding our own code.
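
To illustrate, here is a small, invented example. Both versions below run without error and print a number that could pass a casual gut check, and nothing in the output tells you which one is wrong; only reading the code does.

```python
# Hypothetical survey data, where None marks a nonresponse.
incomes = [42_000, 55_000, None, 61_000, None, 48_000]

# Buggy summary: silently treats every nonresponse as an income of $0.
buggy_mean = sum(x if x is not None else 0 for x in incomes) / len(incomes)

# Correct summary: drops missing values before averaging.
observed = [x for x in incomes if x is not None]
correct_mean = sum(observed) / len(observed)

print(f"buggy mean:   ${buggy_mean:,.0f}")    # buggy mean:   $34,333
print(f"correct mean: ${correct_mean:,.0f}")  # correct mean: $51,500
```

Both results are plausible incomes. If an LLM handed you the first version and you never read it, no amount of staring at the output would reveal that nonrespondents were quietly counted as earning nothing.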

So Where Does That Leave Us?

As I said at the top of this reading, we’re still learning how best to fit LLMs into our lives. They have clearly emerged as a critical tool in the data scientist’s toolbox, and one I am very confident you will be using regularly by the time you graduate from the MIDS program.

However…

For all the reasons laid out here, I hope you can see why overreliance on LLMs early in your education may be extremely detrimental. Indeed, I’ve had several recent MIDS graduates say to me how glad they are that they learned to program just before these tools became available, not because it means they don’t need LLMs, but because they feel it has set them up to use them effectively.

In your first year, you may find that different professors have different rules around whether the use of LLMs is allowed. For some classes—like Data Management, where a lot of the “programming” you are doing just entails calling APIs for cloud services—LLMs may prove really helpful (the sketch below gives a flavor of what I mean). But even where the use of LLMs is allowed by the professor and does not constitute an honor code violation, my strong suggestion is to err on the side of under-using these tools, at least for your first year at Duke.
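
A minimal sketch of that kind of cloud-API code, assuming the boto3 library and already-configured AWS credentials (the specific task here is purely illustrative, not a course requirement):

```python
# Listing the storage buckets in an AWS account: a task that is mostly
# about knowing which API call to make, which is exactly where an LLM's
# first draft tends to be useful. Assumes boto3 is installed and AWS
# credentials are already configured.
import boto3

s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```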

If you take Practical Data Science (IDS 720) with me in the fall, we will talk about LLMs and how to use them effectively. We will start the semester without them, then ease into their use later in the semester. And in doing so, we’ll talk about what I think are the best ways to bring them into your coding, such as using them as debugging aids or for generating template code.
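
To give one hypothetical example of what I mean by “template code”: boilerplate like the plotting skeleton below, which an LLM can draft and which you can read and verify line by line in a few seconds before adapting it to your data.

```python
# Typical plotting boilerplate: the kind of template an LLM drafts well,
# precisely because you can verify every line at a glance.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([1, 2, 3, 4], [10, 20, 15, 30], marker="o")  # placeholder data
ax.set_xlabel("Week")
ax.set_ylabel("Value")
ax.set_title("A template plot to adapt")
fig.tight_layout()
plt.show()
```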

Be especially cautious of proactive suggestions

One thing we’ll talk about in 720, but which I’ll also say here, is to be especially cautious of proactive AI suggestions. I absolutely hate the AI code completion tools that pop up a suggestion in gray text, because they hijack your train of thought without you having decided you want them involved. Please disable (or don’t activate) inline code completion tools anywhere you see them. AI shouldn’t be deciding when and how to be involved; it should work for you, when you decide it’s appropriate.