In today's Google-centric world, it's even more important to equip budding researchers with the right tools as they become citizen data scientists

Enabling the next generation of citizen data scientists



Google has ruined research.

Okay… I’m being hyperbolic. Google hasn’t ruined research. But as a senior data scientist, I do worry that search in the age of Google has outstripped our ability to gather, analyze and truly interpret data.

Thanks to algorithms, predictive text, billions and trillions of bytes of data, cookies, and the like, we’re used to searching for a needle in a haystack and Google returning the exact needle we’re looking for. We’ve been tricked by the fallacy that our answer will be in the top 10 results.

But that simply isn’t true. The rapidly increasing volume and complexity of data mean that when we pose a hypothesis, our answer can be anywhere, anything… not just what Google serves up.

This has led to a remarkable uptick in the need for data scientists. Data science has been the top career on Glassdoor for the last four years, and according to some estimates, the field will add 11.5 million new jobs by 2026. 

Citizen data scientists (those without formal training in data analytics, such as journalists, psychologists, academic researchers, sociologists, and historians) have stepped in to fill the void. So have student researchers and technical professionals hoping to revamp their careers. Heck, so has my 16-year-old daughter! Whether implicitly or explicitly, a new generation of data scientists is learning these skills.

It’s more important than ever to equip truth-seekers with the right tools and skills to perform real analysis and to uncover and deliver novel insights. That leaves those of us responsible for helping researchers sift through the noise and gather and extrapolate novel ideas to ask: How do you enable the next generation of citizen data scientists?

The democratization of data science

Access to government, financial, and medical research is steadily and rapidly increasing. Yet while data is more abundant than ever before, the tools and techniques for citizen data scientists—you know, mere mortals—are lagging far behind.

Enter the democratization of data science: leveling the playing field between readily available data and the know-how to interpret it. In a March 2021 Harvard Business Review article, Thomas C. Redman and Thomas H. Davenport note:

“If data science is to be truly transformational, everyone must get in on the fun. Restricting data science to only the experts is a limiting proposition [and ignores] the vast majority of people and business opportunities.”

I couldn’t have put it better myself. But where and how do you begin? For starters:

  • Better metadata, aka the bits of code, index terms and sentiment attached to content, will help users pinpoint relevant results more efficiently and accurately. Anything that helps streamline the analytical experience and reduce the time and complexity of curating a data set is a good thing.
  • Data visualization, such as word clouds and trend timelines, will help researchers identify patterns, pointing them toward potential insights. And, as they say, a picture is worth 1,000 words.
  • Traceable source data, meaning a manifest of how researchers found and manipulated their data and interpreted their results, will become essential for reproducing and defending data analysis. Just as in any good journalism, citing sources and ensuring reproducibility are a big deal.
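
To make the last two points more concrete, here is a minimal Python sketch of what a traceable manifest might look like alongside the word counts that feed a word cloud. Everything in it is an assumption for illustration: the three sample records, the field names (`source`, `year`, `text`) and the helper functions `fingerprint` and `record_step` are hypothetical and do not come from any particular product.

```python
import hashlib
import json
from collections import Counter

# Hypothetical mini-corpus: records and field names are made up for illustration.
documents = [
    {"source": "example-news-1", "year": 2020, "text": "Data science jobs are growing"},
    {"source": "example-news-2", "year": 2021, "text": "Citizen data scientists need better tools"},
    {"source": "example-blog-1", "year": 2021, "text": "Better tools make data analysis easier"},
]

manifest = []  # ordered log of every step applied to the data


def fingerprint(records):
    """Return a short, stable hash of the records so a later run can be compared."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def record_step(description, records):
    """Append one step, with the size and fingerprint of the data it worked on."""
    manifest.append(
        {"step": description, "count": len(records), "fingerprint": fingerprint(records)}
    )


record_step("loaded hypothetical corpus", documents)

# Step 1: filter on a metadata field (here, publication year).
recent = [doc for doc in documents if doc["year"] == 2021]
record_step("kept documents with year == 2021", recent)

# Step 2: count words; these counts are the raw material for a word cloud.
word_counts = Counter(
    word for doc in recent for word in doc["text"].lower().split()
)
record_step("counted lowercase, whitespace-separated words", recent)

print(word_counts.most_common(3))
print(json.dumps(manifest, indent=2))
```

Someone rerunning the analysis can compare fingerprints step by step to confirm they are working from the same data, which is the kind of reproducibility the bullet describes.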

Avoiding algorithm bias

When you think about good, solid social and behavioral research, one tried-and-true rule still applies: correlation doesn’t equal causation. Gather the evidence, sift out the noise, determine what stands out in the right and wrong ways, and then perform your analysis.

This begins and ends with the quality of information you start with, especially in the era of misinformation. That’s why it’s so tremendously important for citizen data scientists to understand the role bias plays.

Bias doesn’t change with machine learning; in fact, AI has the potential to amplify and accelerate it. AI is built by, well, humans. Increasingly, it embeds or replicates biases that already exist, even on a subconscious level, in the very people who construct it.

We may have better tools at our fingertips, but those tools can be misused. If citizen data scientists rely on the open internet to perform their research, they won’t begin with the balanced and stratified sample that’s so critical to data-driven inquiry. Remember that Google-for-the-exact-right-needle-in-the-haystack scenario I mentioned? Algorithms—driven as they are by clicks, keywords and, often, advertising dollars—are so precise, so sophisticated, that they’re trained to think for you.
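
For readers who have not met the idea of a stratified sample, the following Python sketch shows one simple variant: drawing the same number of records from each group (equal allocation) rather than taking whatever comes up first. The population, the `source_type` field and the `stratified_sample` function are hypothetical, chosen only to illustrate the idea.

```python
import random
from collections import defaultdict

# Hypothetical population: 90 items from one source type and 10 from another.
population = (
    [{"id": i, "source_type": "news"} for i in range(90)]
    + [{"id": i, "source_type": "journal"} for i in range(90, 100)]
)


def stratified_sample(records, key, per_group, seed=0):
    """Draw the same number of records from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        # Never ask for more records than the group contains.
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


sample = stratified_sample(population, key="source_type", per_group=5)

# Tally the sample by group to confirm that each group is equally represented.
counts = defaultdict(int)
for record in sample:
    counts[record["source_type"]] += 1
print(dict(counts))
```

A plain random draw of ten records from this population would usually be dominated by the larger group. Sampling within each group keeps the smaller one from being crowded out.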

Search engine results aren’t objective. It’s important to train future data analysts to cite sources and perform cross-sectional studies to ensure their results are unbiased.

The new literacy: Coding

I wasn’t kidding when I mentioned my 16-year-old daughter becoming a data scientist in her own right. Believe it or not, we sit around the dinner table talking about Python code. It has been fascinating to dig into what she is learning and realize how quickly the next generation has adopted these skills.

If they aren’t already, basic coding skills will be a core part of a citizen data scientist’s job. And we must provide less experienced researchers with the tools they need to do their jobs.

Schools and businesses are already making progress toward this end. They’re tapping into data science tools like the open-source Jupyter and the commercial Tableau to bridge the skills gap and inspire citizen data scientists.

Leveraging pre-packaged, open-sourced Python and R language notebook libraries—a set of “plug and play” analysis options, if you will—is something I’ve been working on, too, through Nexis® Data Lab. Users with more research experience can still modify their content libraries, embed their own code, or import third-party data, but our goal is to make data analysis easy for everyone. If my 16-year-old can curate a content set, identify patterns and trends, and turn that analysis into an interesting insight, anyone can.

Educators, academics and, yes, formally trained data scientists have an obligation to remove the veil from big data analysis and make it approachable, accessible, and productive. If we provide better tools, teach new digital literacy skills early, tell stories with pictures and graphs, and help researchers find a starting point and uncover novelty early on, we will enable the next generation of citizen data scientists.

And maybe we’ll even un-ruin research.

eSchool Media Contributors