5 cutting-edge privacy considerations for Big Data

White House, MIT host discussion on the new implications for privacy and learning with Big Data

Big Data is taking higher education institutions by storm. The discussion, however, has moved from whether Big Data is useful to whether institutions can actually manage the data they receive, not just in capacity but in privacy, a privacy that, according to leading experts, is just an illusion.

During a “Big Data Privacy Workshop: Advancing the State of the Art in Technology and Practice,” co-hosted by the White House Office of Science & Technology Policy (OSTP) and the MIT Big Data Initiative at CSAIL in Cambridge, MA, thought leaders from academia, government, industry and civil society came together to discuss the future role of technology in protecting and managing privacy.

The workshop, one of a series of events being held across the country in response to President Obama’s call to review privacy issues in the context of increased digital information and the computing power that processes it, offered cutting-edge considerations not only for higher education institutions like MIT, but also for business and the health care industry.

Big Data is not just a trend, explained Rafael Reif, president of MIT, but rather a “topic that includes a broad range. Big Data affects many people and the whole spectrum of society.”

However, as Big Data grows to cover most sectors of industry, privacy concerns are also growing as data analytics are changing in response to new algorithms and new applications.

Drawing on some of the leading experts in Big Data technology, from Microsoft to Harvard, the workshop surfaced five new considerations in Big Data privacy for universities and other industries, leading some to wonder whether the concept of privacy is in need of a new definition.


Five considerations for Big Data privacy


1. Redefining the student under FERPA

According to Reif, MIT is at the forefront of digital learning through its Online X programs, pilot programs developed through edX, the not-for-profit venture founded in partnership with Harvard that has drawn students from around the world since its 2012 launch. (Read: “MIT’s Big Data education set to get underway.”)

Reif noted that the X programs have over 760,000 unique registered learners from more than 90 countries, who have generated over 700 million records.

“We want to measure what works to improve learning, but we also want to share this data with other institutions so that they can learn from our data. However, we have to think about privacy and for us that means FERPA.”

Under FERPA (the Family Educational Rights and Privacy Act), universities must now decide who counts as a student under the law.

“Are you a student if you take a MOOC? What if you only take so many courses? Are you only considered a student if you receive a certificate? These are some of the new questions higher education institutions must discuss concerning Big Data and privacy,” explained Reif.

He also emphasized that online forums as part of online courses are proving the biggest challenge, since many students post personal information that can be aggregated.

“It’s about setting boundaries while balancing competing interests,” he said.

2. Non-predicated data

In 2012, 2.4 billion internet users shared enough information to surpass 2 zettabytes of digital information, leading to a jump in analytics technology reminiscent of something straight out of “Minority Report.”

With so much information available, said John Podesta, White House counselor, there has been a move from predicated data, or data individuals have knowingly given, to non-predicated data, or inferred data that can profile an individual based on information they themselves may not even know exists.

“The discussion has currently become: How do we inform and develop privacy policies on non-predicated data and what are the social implications of this?”

For higher-ed institutions, this could mean predictive analytics that, based on things like learning trends, financial aid statistics, and teaching costs, can determine which students will succeed at the college before they are ever admitted. (Read: “Higher Education’s Big (Data) Bang: Part One.”)

Is it fair to use Big Data in this way, and should incoming students be notified of these analytics? How do we set up policy to protect student learning data that’s non-predicated? These are the questions we are trying to answer, explained Podesta.

3. Personnel, not software

“There’s a misconception that many data breaches are caused by software malfunctions,” said Mike Stonebraker, adjunct professor at MIT CSAIL. “The truth is, the tools used to manage Big Data, like the Hadoop/Hive stack, don’t make security mistakes. It’s the human element.”

Stonebraker argues that if Big Data users, like universities—often targets of data breaches—want to manage privacy risks, the database should have a command log.

“It’s perfectly acceptable to have a command log to know who’s doing what on the database. I’d also recommend creating a detection system to measure suspicious behavior in personnel when accessing the database. Yes, someone may have to create a breach for the system to become predictive, but it’s a worthwhile investment,” he said.
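The article stops at the recommendation, but Stonebraker’s two suggestions, an append-only command log plus a detector for suspicious access patterns, can be sketched roughly as follows. This is a minimal illustration, not a real product; the `CommandLog` class, its rate threshold, and the 60-second window are all assumptions made for the example:

```python
import time
from collections import defaultdict


class CommandLog:
    """Append-only log of database commands, with a simple
    rate-based check for suspicious access behavior."""

    def __init__(self, max_per_minute=30):
        self.entries = []          # (timestamp, user, command) tuples
        self.max_per_minute = max_per_minute

    def record(self, user, command, now=None):
        """Log who ran what, and when."""
        now = time.time() if now is None else now
        self.entries.append((now, user, command))

    def suspicious_users(self, window=60, now=None):
        """Flag users whose command rate in the window exceeds the threshold."""
        now = time.time() if now is None else now
        counts = defaultdict(int)
        for ts, user, _ in self.entries:
            if now - ts <= window:
                counts[user] += 1
        return [u for u, c in counts.items() if c > self.max_per_minute]
```

A real deployment would persist the log and learn per-user baselines (Stonebraker’s point that someone “may have to create a breach for the system to become predictive”), but the shape is the same: record everything, then look for outliers.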


4. A culture of trust…and differentiated privacy

“The potential of big data hinges on one thing: trust,” explained Penny Pritzker, secretary of the U.S. Department of Commerce. “That trust can be promoted by informing users of how you plan to use their data and by setting training standards for employees; consumers, too, must share in that trust through their online behaviors and by becoming informed about how their online data is managed.”

Cynthia Dwork of Microsoft Research explained how a relatively new definition of privacy, called differential privacy, could be used to help counter some of the security risks of online data.

Over the past five years, a new approach to privacy-preserving data analysis has borne fruit, said Dwork, and this approach, differential privacy, includes a formally defined privacy guarantee. Data analysis techniques are then rigorously proved to satisfy that guarantee.

“Roughly speaking, this ensures that (almost, and quantifiably) no risk is incurred by joining a statistical database,” she said.

For more information on differential privacy, read Dwork’s paper.
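Dwork’s paper contains the formal treatment; as a rough illustration of the idea, the classic Laplace mechanism adds calibrated random noise to a count query, so that any one individual’s presence or absence changes the answer’s distribution only slightly. The function name `dp_count` and the choice of epsilon are illustrative, not from the workshop:

```python
import math
import random


def dp_count(values, predicate, epsilon):
    """Differentially private count. A count query has sensitivity 1
    (adding or removing one person changes it by at most 1), so Laplace
    noise with scale 1/epsilon yields epsilon-differential privacy."""
    true_count = sum(1 for v in values if predicate(v))
    # Sample Laplace(0, 1/epsilon) by inverting its CDF
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise
```

Smaller epsilon means more noise and stronger privacy at the cost of accuracy, which is Dwork’s point that (almost, and quantifiably) no risk is incurred by joining the database.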

One of the ways differential privacy is accomplished is through modern cryptography, which proposes a definition of what you would like the cryptosystem to achieve; an algorithm is then developed to satisfy that definition, the definition is probed until it is refined and thereby broken, another, stronger algorithm is developed for the new definition, and so the cycle continues.

Modern cryptography is a privacy-enhancing technology because it “uses techniques that extract answers from data without actually seeing it, keeping individual data sets private,” explained Shafi Goldwasser, professor at MIT CSAIL.

According to Goldwasser, there’s a lot of math involved, but higher education institutions can use modern cryptography to analyze information on students and programs mainly through the sharing of data sets with other institutions.

“Parties only learn the output of the function, but nothing else about the others’ inputs,” she said.
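Goldwasser is describing secure multiparty computation. A minimal sketch of the flavor (not her protocol, just the standard additive secret-sharing trick) shows how several institutions could compute a combined total without any one of them revealing its own number; the enrollment figures below are invented for the example:

```python
import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime


def share(secret, n=3):
    """Split a secret into n random additive shares that sum to it mod PRIME.
    Any n-1 shares together look uniformly random and reveal nothing."""
    shares = [random.randrange(PRIME) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares


def secure_sum(all_shares):
    """Party i sums the i-th share from every participant; combining the
    partial sums reveals only the grand total, never individual inputs."""
    partials = [sum(column) % PRIME for column in zip(*all_shares)]
    return sum(partials) % PRIME


# Example: three universities pool enrollment counts without disclosure.
all_shares = [share(x) for x in (120, 75, 300)]
```

Each party learns the output of the sum and nothing else, which is exactly the property Goldwasser states, though production protocols handle malicious parties and richer functions than addition.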

“By developing stronger cryptosystems, privacy loss can be managed,” said Dwork.

5. Privacy vs. the illusion of privacy

“Ultimately what needs to happen is an acceptance of the illusion of privacy,” said Manolis Kellis, associate professor at MIT CSAIL.

Every time you take a drink from a glass, explained Kellis, you leave behind your DNA, which reveals your personal information. In theory, if someone wanted to take that DNA from your glass they could.

“But you don’t hide your DNA by never touching things,” he continued. “Just like with Big Data and privacy, it’s out there and that’s the risk everyone takes by functioning in society. It’s the policies and laws that are implemented, security logs, and the information on how your data is used, that protects you.”

To think that a data breach won’t happen is an illusion; to plan for one, through knowledge of current policies and cutting-edge technology tools that help mitigate it, is one way to protect privacy in a real way, said Kellis.