The New World of Massive Data Mining

Guest Host: Tom Gjelten

Flickr user: Daremoshiranai http://www.flickr.com/photos/daremoshiranai/

Every time you go on the Internet, make a phone call, send an email, pass a traffic camera or pay a bill, you create data, electronic information. In all, 2.5 quintillion bytes of data are created each day. This massive pile of information from all sources is called “Big Data.” It gets stored somewhere, and every day the pile gets bigger. Government and industry are finding new ways to analyze it. Last week the administration announced an initiative to aid the development of Big Data computing. A panel of experts joins guest host Tom Gjelten to discuss the opportunities for business, science, medicine, education, and security, but also the privacy concerns.

Guests

John Villasenor Senior fellow at the Brookings Institution and professor of electrical engineering at UCLA.
Michael Leiter Senior counselor, Palantir Technologies, former director, National Counterterrorism Center.
Dr. Suzanne Iacono Co-chair, Big Data Senior Steering Group and senior science adviser, Directorate for Computer and Information Science and Engineering at the National Science Foundation.
Daphne Koller Professor, Stanford Artificial Intelligence Laboratory.

Program Highlights

The term big data refers to the massive amounts of digital information companies and governments collect about us and our surroundings every day: pictures, records, temperatures, conversations. Our guests discuss how government and private industry are using big data and the main concerns surrounding its collection and utility.

What Is “Big Data?”

Villasenor said that big data is “really big.” In 2011, the amount of data that’s estimated to have been created or replicated would fill 11 billion iPod classics, each holding about 160 gigabytes. “Remember that the world population is only 7 billion so that’s a truly incomprehensible amount of data,” Villasenor said.
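
As a rough check on the scale Villasenor describes, the iPod comparison can be worked out directly. The sketch below assumes decimal units (1 gigabyte = 10^9 bytes), which the program does not specify, and the variable names are purely illustrative.

```python
# Back-of-the-envelope check of the iPod comparison.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 ZB = 10**21 bytes).
IPOD_CAPACITY_GB = 160            # capacity quoted for one iPod classic
IPOD_COUNT = 11_000_000_000       # 11 billion devices
WORLD_POPULATION = 7_000_000_000  # population figure quoted on the program

total_bytes = IPOD_COUNT * IPOD_CAPACITY_GB * 10**9
total_zettabytes = total_bytes / 10**21
gb_per_person = IPOD_COUNT * IPOD_CAPACITY_GB / WORLD_POPULATION

print(f"Total: {total_zettabytes:.2f} zettabytes")
print(f"Per person: {gb_per_person:.0f} gigabytes")
```

Under those assumptions the total comes to about 1.76 zettabytes, or roughly 250 gigabytes for every person on Earth.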

Practical Uses

Every organization, whether it’s government or private sector, uses information in different ways, said Leiter. In the world of terrorism, data that was collected clandestinely could be cross-checked with information that was available publicly to try to identify people who were doing suspicious things. In the private sector, organizations like banks use data routinely to identify cyber fraud and organized crime activity. “There’s almost no application, either in government or the private sector, that can’t benefit from some of this big data,” Leiter said.

Privacy An “Enormous” Concern

Privacy is an enormous concern, but big data isn’t necessarily always directly correlated with privacy, Villasenor said. For instance, the total amount of data needed to represent all the websites an average person visits in one year is not that big – about one or two megabytes. But a lot of people would consider that information very private, Villasenor said. “That said, of course, the more data that’s out there, then the more opportunity there is that it could potentially be used in ways that were detrimental to privacy,” he said.

You can read the full transcript here.

Transcript

11:06:55
MR. TOM GJELTEN: Thanks for joining us. I'm Tom Gjelten filling in for Diane Rehm today. She's having a voice treatment. The term big data refers to the massive amounts of digital information companies and governments collect about us and our surroundings every day, pictures, records, temperatures, conversations. Joining me to talk about how these data can be used and by whom, John Villasenor, senior fellow at the Brookings Institution and a professor of electrical engineering at UCLA.
11:07:24
MR. TOM GJELTEN: Also, Dr. C. Suzanne Iacono, senior science advisor for computer and information science and engineering at the National Science Foundation, and Michael Leiter, senior counselor at Palantir Technologies and the former director of the National Counterterrorism Center. We'll be taking your comments and questions. You can call us at 1-800-433-8850, or you can send us your comments or questions via email at drshow@wamu.org. You can also, of course, join us on Facebook or Twitter. Good morning everyone.
11:08:00
GROUP: Good morning.
11:08:01
GJELTEN: Big data, let's start with you, John. I'm sure a lot of our listeners have no idea what we're talking about when we say big data. What does it mean to you?
11:08:12
DR. JOHN VILLASENOR: Big data is really big, and I'll give a very quick example. Many people are familiar with iPod classic, which can be used for holding music. That has about 160 gigabytes of data. It fits in the palm of your hand, and that can hold 40,000 songs. In 2011, the total amount of data that's estimated to have been created or replicated would fill 11 billion iPod classics so that's -- remember the world population is only 7 billion so that's a truly incomprehensible amount of data.
11:08:42
GJELTEN: And Michael Leiter, the data is so diverse, isn't it? I mean, data of all kinds from all sources. You used to work in counterterrorism. You must have just been overwhelmed at your institution by the amount of data coming in. I mean, was there a point when this was a barrier, and is it now, you know, sort of becoming something that you can take advantage of? What's the history of dealing with big data?
11:09:11
MR. MICHAEL LEITER: Well, Tom, several things. I think, first, it's not just the volume of the data. As you said, it's also the speed with which it's coming in, and also the variety of forms of the data. It can be text, it can be weblog records, it can be video, it can be pictures. All of that data becomes more and more overwhelming, and the difficulty, of course, is trying to stay in front of that, trying to make sure you know what you have and how different pieces within different data sets are correlated with one another, and that was a challenge we had really before 9/11, but it is one that has just accelerated over the past 10 years.
11:09:50
GJELTEN: What's the secret to being able to deal with this flood of data and analyze it?
11:09:56
LEITER: Well, I wish there were simply one secret, then I think this wouldn't be as big a challenge if there were. But from my perspective, what it requires is first of all, integrating that data. It's not just looking at one stovepipe of information. It's comparing one source of information with other sources and seeing where there are correlations that are meaningful. Second, it's being able to do so in a very flexible agile way so a human being can manipulate and play with that data, that you're not just relying on a set of algorithms that supposedly spit out an answer, that people can crawl through that data and identify what is meaningful, test hypotheses and then look in other areas.
11:10:38
GJELTEN: Suzy Iacono, you're at the National Science Foundation. What is the interest of the U.S. government here in trying to make this huge trove of data analyzable, manageable, what does the U.S. government want to do?
11:10:54
DR. SUZANNE IACONO: Well, several years ago, the science leaders in the administration recognized that big data really is the next big thing, and what I mean by that, it's high time that we make significant investments to do just what you're saying, really get our arms around the big data and really make a difference to the country so there is obvious impacts economically. We're seeing a huge transformation in science from small data science to big data science and we have opportunities to address national challenges like clean energy and cyber learning in completely new ways that we've never thought about before.
11:11:30
DR. SUZANNE IACONO: But the challenges are really great, just what you were talking about, Michael. We've got to be able to integrate these heterogeneous databases to really be able to make a difference. So about a year ago, under the auspices of the National Science and Technology Council, the Office of Science and Technology Policy chartered and charged a big data senior steering group to go about and get a research and education agenda in place.
11:11:55
GJELTEN: Okay. Let's try and not get too far away from the real world here. John, give us a few more examples of what is it that we're talking about when we talk about big data, the varieties of data that sort of accumulate on top of each other, and the diversity of that data, the data that we're now trying to connect. What are we talking about? What kind of data?
11:12:16
VILLASENOR: There's an almost endless list of types of data. There's all of the consumer information that all of us contribute to generating when we go online, when we shop online. There's climate data that's acquired by meteorological stations all over the world. There's communications data, used for example to deliver on-demand video, to communicate voice and video traffic across, you know, various places. There's IP Internet traffic, all of the emails and documents that we create and send and store and process, and you can go on and on for a half hour. There's just an endless variety of data out there.
11:12:53
GJELTEN: And how can this data be used, Michael, in a practical way?
11:12:57
LEITER: Well, every organization, whether it's government or private sector, uses that information in different ways. In the world of terrorism, of course we had lots of clandestinely collected information that was collected by say the CIA or the FBI. That information could then be correlated with information from open-source websites or other publicly available information to identify people who were doing suspicious things.
11:13:20
LEITER: In the private sector, of course, you might have a bank and a bank has tremendous amounts of information about its credit cards and where those credit cards are used, and it can use that information to identify cyber criminals or fraud activity by organized crime. So it really depends on where the business is. There's almost no application, either in government or the private sector that can't benefit from some of this big data.
11:13:42
GJELTEN: Now, the company you're now working for, Palantir, got its start with PayPal, and it's a very interesting story of how PayPal was having some serious problems with fraud and it used big data analysis to confront them. Can you very quickly tell that little story?
11:13:57
LEITER: Sure. You're exactly right. It turns out that sending money over the Internet isn't all that hard. What is hard is avoiding people doing that fraudulently. So when PayPal began, it was looking at almost a 20 percent fraud rate due to Russian organized crime, and the founders of PayPal looked at this. They knew the business would fail if they didn't get their arms around it, and they took a new approach which was looking at all of the data that they were collecting on different transactions, correlating that data in new and innovative ways, and identifying the fraudsters and shutting down accounts quite quickly, and because of that great leap forward in big data, they were able to cut the fraud rate from about 20 percent to under one percent thus making PayPal the incredible success that it was.
11:14:39
GJELTEN: And that same approach can be applied in many, many other areas. You first were talking about in counterterrorism and law enforcement. I know it's been used to sort of construct or diagram social networks that are being everything from the cells that make improvised explosive devices in war zones to gangs, and Suzy, there is such an incentive here for companies like Palantir to develop the software and for other companies to use that software and put it to application.
11:15:11
GJELTEN: What does the government need to do? What's the role here for the government, and why has this White House decided that it needs a research and development initiative to promote more big data computing?
11:15:23
IACONO: Well, there's research and development going on in private firms and also in academia, but in the private firms, the research and development is mostly the D part that's going on. They're trying to develop and engineer products that they're gonna put out in the market. The government, however, takes a much longer term view and really is doing the long-term, or investing in the long-term research that's going to enable discoveries that are going to matter years down the road and actually end up in the kinds of products and things that we can't even imagine today that are gonna make big data truly beneficial.
11:15:59
GJELTEN: For example?
11:16:00
IACONO: Well, for example, so emergency preparedness and public safety. These are kind of public sector issues, challenges, all right? So imagine being able to predict a plume, some kind of nuclear disaster, and be able to integrate that with census data and which kids are at school, and be able to develop an evacuation plan that actually allows for people to get out safely and reduce deaths because of that.
11:16:28
GJELTEN: So in this case, you have some environmental data that gives you a prediction of what's going to happen, and then you can use other sets of data, for example, school enrollments or whatever, ages of children, where they live, in order to put that all together and come up with an evacuation plan just in that one precise example.
11:16:47
IACONO: That's exactly correct. And we could not do that today. We do not have the underlying tools and techniques to make heterogeneous data seem more homogeneous so that there could be this kind of an evacuation plan done on the fly in real-time.
11:17:04
GJELTEN: John Villasenor, clearly there are some positive benefits here. What about the concerns about all that data about you and me out there in somebody's hands?
11:17:14
VILLASENOR: Well, of course, privacy is an enormous concern, and one that's growing. I think it's important to emphasize, however, that big data isn't necessarily directly correlated with privacy. I'll give a quick example. The total amount of data needed to represent all the websites a typical person visits over a whole year, the addresses of all those websites would probably be on the order of one or two megabytes. That's not much data at all, but that's for many people extremely private information.
11:17:42
VILLASENOR: That said, of course, the more data that's out there, then the more opportunity there is that it could potentially be used in ways that were detrimental to privacy. So it's a big concern.
11:17:49
GJELTEN: And what's needed, in your opinion, to make sure that sort of the downside here doesn't become more of an issue?
11:17:58
VILLASENOR: Well, I think we need to have stronger consumer data protections, and I think it's just last week the Federal Trade Commission, if I'm not mistaken, issued about a hundred-plus page report listing in great detail things like Do Not Track and other steps that I think are positive. But as everyone recognizes, we've got a lot of work to do, and there is, in some sense, a disincentive, you know, the interests of the advertisers aren't always aligned with the interest of the consumers. So it's a complex problem.
11:18:22
GJELTEN: John Villasenor is a senior fellow at the Brookings Institution and a professor of electrical engineering at UCLA. We're talking about something called big data computing. It's about trying to make sense of all the data that's out there, data about us, data about what we do, what we think, what we like, about our environment, how these pieces of data can be put together for good, and also about some of the concerns that this collection of data may raise with respect to privacy. You can join us, call 1-800-433-8850. We're gonna take a short break right now. We'll be right back.
11:20:04
GJELTEN: Welcome back. I'm Tom Gjelten sitting in for Diane Rehm. In this hour, we're talking about something called big data computing. My guests are John Villasenor. He's a senior fellow at the Brookings Institution and a professor of electrical engineering at UCLA. Also Dr. Suzanne Iacono. She's the senior science adviser for Computer and Information Science and Engineering at the National Science Foundation. She's also co-chair of the Big Data Senior Steering Group under the Networking Information Technology Research and Development Program. Okay, I'm not going to say that whole thing anymore.
11:20:38
GJELTEN: Michael Leiter also, a senior counselor at Palantir Technologies and former director of the National Counterterrorism Center. And we're talking here about the advantages and possibly the disadvantages that this huge trove of data presents. And joining us by phone right now from Palo Alto is Daphne Koller. She's a professor in the Stanford Artificial Intelligence Laboratory. Good morning, Daphne. Thanks for joining us.
11:21:03
MS. DAPHNE KOLLER: Good morning.
11:21:04
GJELTEN: And you're interested from the standpoint of an educator how big data can be put to use to promote better education. Can you briefly explain how that would work?
11:21:15
KOLLER: Absolutely. So at Stanford in the fall, we initiated a project of massive online education in which three of the Stanford classes, initially just in the computer science department, were provided to students anywhere around the world for free. And an important difference between this and previous efforts is that this was not just video modules that provide the students with content but were also integrated with a significant amount of online assessments that allows students to practice the material, achieve mastery and move on. And ultimately at the end obtain what's called a statement of accomplishment that indicates that they really did master the material in the course.
11:21:57
GJELTEN: Now, Daphne, how many students need to be involved for these data to become big data? I mean, do you really need a very large set of students in order for that data to be useful?
11:22:08
KOLLER: Well, perhaps not and admittedly one could make use of perhaps smaller datasets. But I think that the ability to track student behavior over very large numbers provides unique opportunities. So just to clarify, for each of those classes we had an enrollment of about 100,000 students or higher. And that gives you numbers that allow you to detect patterns that otherwise you would never have been able to find. So for example, in one of the assignments in one of those courses, there was a case where 5,000 students submitted the exact same wrong answer.
11:22:43
KOLLER: And so the teaching assistants looked at what the answer was and they realized that the students had inverted the order of two steps that they needed to use in one of the algorithms that they needed to implement. And that allowed us to detect this misconception, provide the students with a targeted error message, as well as realize that perhaps one could've explained that material a little bit better the first time to avoid that misconception in the first place.
11:23:09
KOLLER: Now, if you were teaching a class to 200 people and 5 of them would've gotten the exact same wrong answer, nobody would ever have noticed. And so the availability of these really large amounts of data provides us with insights into how people learn, what they understand, what they don't understand, what are the factors that cause some students to get it and others not that is unprecedented, I think, in the realm of education.
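
The pattern Koller describes, many students converging on one identical wrong answer, can be surfaced by simply counting submissions. The following is a minimal sketch and not the Stanford system: the function name `common_wrong_answers`, the `min_count` threshold and the sample answers are all hypothetical, chosen only to mirror the class sizes mentioned on the program.

```python
from collections import Counter

def common_wrong_answers(submissions, correct_answer, min_count=100):
    """Return (answer, count) pairs for wrong answers submitted at least
    `min_count` times, most frequent first.

    Hypothetical helper: the threshold and data layout are assumptions.
    """
    wrong = Counter(a for a in submissions if a != correct_answer)
    return [(a, n) for a, n in wrong.most_common() if n >= min_count]

# Illustrative data: "42" stands in for the right answer, "24" for the
# answer produced by inverting two steps of the algorithm.
large_class = ["42"] * 95_000 + ["24"] * 5_000   # 100,000 students online
small_class = ["42"] * 195 + ["24"] * 5          # 200 students in a room

print(common_wrong_answers(large_class, "42"))   # the shared mistake stands out
print(common_wrong_answers(small_class, "42"))   # too few repeats to flag
```

With an absolute threshold like this, five matching answers in a class of 200 pass unnoticed while 5,000 in a class of 100,000 are flagged at once, which is the contrast Koller draws.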
11:23:36
GJELTEN: Now in this conversation this morning I'm going to keep coming back to this point of really making clear to our listeners what it is that we're talking about. So the different types of data that you are integrating there at Stanford in order to sort of carry out your experiment, you already mentioned test scores. Again, what are the other pieces of data that you are using -- the other sets of data that you are combining here?
11:24:02
KOLLER: So let me clarify that this is, at this point, a little bit speculative because this is a brand new experiment we only just started. But in terms of the data that we're collecting and will be collecting, it's things like when a student watches the video, where did they pause and rewind? Which parts did they watch quickly, like at 1.2 or 1.5 speed, and which ones did they watch more slowly? Which parts did they watch several times? What order did they do things in?
11:24:28
KOLLER: Now you could also integrate, of course, to the extent that the student wants to provide it, demographic information about the students and their backgrounds from prior to taking the class. And so all of these different pieces of information combine with the assignment scores and the test scores to provide unique insight on how certain people learn and how do you guide each learner towards a trajectory that is most suitable to improve their own learning outcomes?
11:24:55
GJELTEN: Now as a former teacher, you know, I always believed in kind of rapport with my students and other kind of subjective factors. I mean, it almost sounds like this could lead to sort of an automated approach to teaching. If you are able to break down the learning process so precisely, doesn't that sort of lead to a kind of robotic approach to education almost?
11:25:18
KOLLER: No, I don't think so. I think providing teachers with more information about what's likely to be better for certain students and others is only a good starting point. But the best teachers will take that and integrate that with the subjective judgment that they have from understanding the students or the person and end up with something that is better, I think, than either of, you know, a fully automated approach or just the purely subjective approach.
11:25:43
KOLLER: I mean, in the same way that I think one could argue that a good doctor has an understanding of their patients and can diagnose things by touch and by feel and by eye. But you wouldn't argue that they shouldn't have access to the highest quality test results just because that would robot-acize the treatment of medicine.
11:26:03
GJELTEN: And Daphne Koller, are your students and faculty members there at Stanford -- how are they responding to this initiative? Do they welcome it? Do they find a lot that's valuable in it so far?
11:26:15
KOLLER: I think a lot of faculty and a lot of students think this is a great idea because among other things for the Stanford students that are registered Stanford students, this opens up time in the classroom for that more personalized engagement with the instructor, whereas previously the time in the classroom was devoted mostly to having professors lecture to their students. And there wasn't very much interaction and engagement in that process. And now, by having the actual content (word?) be in a separate outside the classroom, it opens up much more interaction within the classroom time.
11:26:53
KOLLER: So I think there's a lot of potential there and there's a lot of people who see this as a great opportunity. Now, of course, just like for any disruptive change there's some people who prefer the old fashioned way of doing things. And I think time will tell how many people adopt this and the extent to which it takes off.
11:27:15
GJELTEN: Okay. Thank you so much. Daphne Koller is a professor in the Stanford Artificial Intelligence Laboratory and she is a real pacesetter in the application of big data computing in the education field. John Villasenor, is this being done at other educational institutions or any sort of similar experiments that have come across your attention?
11:27:40
VILLASENOR: I'm most familiar with the Stanford example and the numbers. And Daphne would be in a better position than I to give the numbers, but they were truly massive. I think there were maybe hundreds of thousands of people participating. Certainly many institutions are aware of the potential to reach enormous numbers of people and have put their classes online with varying results. But I'm not aware of anything that's been as carefully done as the Stanford effort. But I'm sure there are other things out there.
11:28:09
GJELTEN: And, Suzy, does the Stanford effort have some U.S. government support behind it?
11:28:12
IACONO: It absolutely does, yes. This is an important project because it's really going to lead to breakthrough science about how people can learn. Having the cyber learning component online allows us to collect all those data.
11:28:26
GJELTEN: Okay. And Michael Leiter, we've heard now about this interesting application of big data computing in the education field. Let's quickly run down some ways that your company Palantir is using big data computing, you know, outside of law enforcement and counterterrorism. I know Palantir is involved in many areas of activity, right?
11:28:43
LEITER: Indeed. In the defense field we support the Marines in Afghanistan and they obviously have a huge amount of data there. But we also allow Marines back in the United States as the troops were drawn down to support in real time that intelligence effort in Afghanistan. Getting farther away from national security, with the Centers for Disease Control we support the foodborne illness program at the CDC, which has helped them much, much more quickly and more effectively track things like the cholera outbreak in Haiti or a foodborne illness throughout the United States.
11:29:15
LEITER: And again, this involves huge amounts of data coming from consumers and restaurants and health information and allowing them to correlate that quickly has allowed them to get on top of things (unintelligible).
11:29:24
GJELTEN: Okay, be precise, Michael. What are the data that you are combining here? What are the precise points of data -- the sources of data?
11:29:31
LEITER: It begins with public health information so there's a reporting requirement when people get foodborne diseases. Then there's also additional information that can be collected by hospitals and doctors and healthcare providers. In addition, there are simply public records, things like climatology information, which might give some indication about why disease is or is not moving. And of course there's open source information, potentially what people are searching on when they get ill on Google.
11:29:57
LEITER: And finally there might be private information from those individuals who got sick, say, the information of what food they bought at a grocery store. So that might be information that the CDC could also leverage to identify the source of that foodborne illness.
11:30:11
GJELTEN: Well, let's just take that example. So people are searching on the web for answers to some kind of symptoms that they're experiencing. Who is in a position to make use, to know about that data, to see that data, to make use of it and connect it to some other source of data?
11:30:29
LEITER: Well, that information could be used by an organization like the Centers for Disease Control.
11:30:34
GJELTEN: They would have access to that, to what I'm searching on Google?
11:30:37
LEITER: You can work with Google and other providers and find out what are the most common searches during a certain period of time. And that might give you some early indications that if people are all searching for something to solve flu-like symptoms that you could have an early outbreak of flu, that could be an indicator. And then someone like at CDC could use that information, combine it with the information they receive from healthcare providers to try to trace where that outbreak is and try to get ahead of that by delivering healthcare more quickly.
11:31:07
GJELTEN: Now, John Villasenor, big data computing, data analysis, is not only happening in the United States. You've written about how other governments are using this big data analysis not always to such positive ends.
11:31:20
VILLASENOR: Yeah, that's right. One of the most remarkable statistics among many in the technology world is that storage costs have declined by over a million in the last three decades -- one million times. It now costs less than 17 cents to store everything a person says on a telephone in a year.
11:31:37
VILLASENOR: So if you look at countries like Syria and China that have a record of, you know, monitoring their citizens extremely closely, it would be naive of us not to expect that these countries are going to employ all of the technology tools available in the big data world here in the United States to ends that we here in the United States might not think of as nearly as beneficial. And there's plenty of evidence, in fact, that they're already doing that.
11:32:00
LEITER: And, Tom, can I just add to that?
11:32:02
GJELTEN: Yeah, Michael.
11:32:03
LEITER: 'Cause I don't want to be the corporate guy who says we need to just use all this data.
11:32:06
GJELTEN: Yeah.
11:32:06
LEITER: I think John's point is exactly right and we have to make sure that the same technology that is used to leverage this data for very good purposes can also and is also used to protect privacy and civil liberties. You can audit the information that's looked at, you can put controls on how it's used. And that you have a conversation with privacy and civil liberties communities to make sure that there's trust in building these systems in the first instance.
11:32:29
GJELTEN: Michael Leiter. He's at Palantir Technologies. Before that he was director of the National Counterterrorism Center. I'm Tom Gjelten. You're listening to "The Diane Rehm Show." And, Michael, at Palantir do you have those -- are you reaching out in those ways or is that really for someone else to do?
11:32:47
LEITER: No, we do. We actually partner with a number of groups including the Center for Democracy and Technology. We have a series of privacy and civil liberties officers. And again we built it in so any action that is done with any of this big data is embedded in a permanent audit trail. People know what information and what the government has done, or what a company has done with that information and others can review that. And you can also limit access to information based on an individual's need to know.
11:33:11
GJELTEN: Suzy Iacono, is there any way to introduce, you know, whenever you have upsides and downsides it's sometimes cause for regulation and rules. Is there any area here where you think maybe some rules and regulation might be called for?
11:33:26
IACONO: Well, certainly I think in the early days, that's where we thought all of the help was going to come in helping people to maintain their information privacy and their civil liberties. But today there's a whole new field called privacy by design. And what we're interested in as computer scientists is trying to figure out what it is that people want and building it into our systems right from the get go, right.
11:33:48
IACONO: So we're not waiting until after the system hits the streets and people are using it to figure out, oh no, my privacy has been invaded. Let's figure that out when we start to design our systems. Figure out all during the production of these systems and the programming languages that are used and the policies that are embedded. Let's figure out what those systems should look like so that they actually help people maintain their information privacy and make choices.
11:34:16
GJELTEN: Okay. Clearly we have a number of callers who I think it's fair to say are concerned about what the implications of this big data analysis capability in the hands of the government or in private industry might be. Steve is on the line calling us this morning from Aurora, Ill. Good morning, Steve. Thanks for your call.
11:34:37
STEVE: Good morning. Yeah, I'm not concerned about my individual freedom as much as I am the freedom of the nation. We're dealing with a system that could supply an immense amount of possible control to our general thought patterns by filtration and isolation and then spoon feeding us particular information for our particular desires.
11:35:01
GJELTEN: Who, who?
11:35:02
STEVE: We could get to the point of actual thought crime, "1984" style, where a deviant is punished for not being part of the pack. But I'm worried about the entire herd of us being carefully taken toward the edge of the cliff.
11:35:20
GJELTENAll right, John Villasenor, extreme, but, you know, he's making sense, I guess, in a way here, right?
11:35:26
VILLASENORWell, I'm not so cynical as to believe there's some sort of a grand plot deep in the U.S. government to take us to the edge of any cliff. I think the bigger risks are really that the sheer amount of data creates a temptation that's going to be too tempting to avoid for a lot of folks in the advertising industry. And there's always the possibility the more data there is out there there's always the possibility that it could be abused. But of the many concerns I have, all of us being herded to some sort of cliff through control of our thoughts is not one that I've particularly given a lot of concern about.
11:36:00
GJELTENWell, you mentioned advertising and that's, of course, obviously people are very interested in -- people from the advertising side of things are very interested in information about this sort of behavior.
11:36:09
VILLASENORAnd I think that is a concern because, you know, advertisers will talk the talk in terms of, you know, respecting consumer privacy when it suits their interest. But I think there's a fundamental underlying financial incentive for advertisers to know as much as they can about you in order to give you targeted advertisements and other information. And as long as that incentive exists, which it always will, they're always going to push any boundaries they can to get as much information as possible.
11:36:36
GJELTENMichael Leiter, do you have your location noted on your iPhone? Do you have sent out location alerts, whatever they're called, so that Starbucks knows when you're walking by?
11:36:47
LEITERI don't, Tom, and I think an important piece of this is to make it as transparent as possible to consumers and users and the people who are ultimately providing this information about how it might be used. They have to have the ability to either opt in or opt out of how that information is used. And targeted advertising which has been going on for decades can clearly get a lot better in this case.
11:37:08
LEITERYou know, the McKinsey Global Institute estimates that in some areas retailers could have a 60 percent increase in their effectiveness through big data.
11:37:17
GJELTENMichael Leiter. He's senior counselor at Palantir Technologies, former director of the National Counterterrorism Center. There are many sides to this issue of big data computing, rewards, risks, opportunities. We're anxious to hear your thoughts on it. Call us at 1-800-433-8850. We'll be right back.
11:40:02
GJELTENWelcome back, I'm Tom Gjelten, sitting in today for Diane Rehm. And our subject this hour is big data computing. How data about us and what we do is being gathered and analyzed and what use is being made of it. My guests here in the studio are John Villasenor. He's a senior fellow at the Brookings Institution and a professor of electrical engineering at UCLA. Also Suzie Iacono, she's the senior science advisor for computer and information science and engineering at the National Science Foundation. And Michael Leiter, a senior counselor at Palantir Technologies and former director of the National Counterterrorism Center.
11:40:36
GJELTENAnd I'm going to be going to callers and tweeters and Facebookers here in our remaining minutes. I have a note here from Pat who wonders about how much control the consumer has over his or her computer and what happens on his computer. And actually, in a sense, an answer to Pat comes from Roger, who has written in and says, "I don't really need to keep the batteries in my mobile devices while traveling."
11:41:08
GJELTEN"I don't need to use Google Maps. I don't need a vehicle with the manufacturer installed GPS. There is a very secret society out there that knows all about us, but we can fight back." John Villasenor, what are the options that consumers have here to sort of monitor and control what data are collected about them?
11:41:30
VILLASENORYeah, and again, this is one of the topics that's addressed in last week's FTC report. Everyone's onboard now that there needs to be much more power given to the consumer in terms of exactly that issue, knowing exactly what is and isn't being collected and how it's being used. Now, the unfortunate thing is that, sure enough, whenever this has happened or whenever these types of sentiments are expressed, companies have found a way or advertisers have found a way to essentially get around it. I'll give a very quick example. Many privacy policies provide that, what's called personally identifiable information, isn't collected.
11:42:03
VILLASENORBut what is collected is the unique device identifier for your phone or your iPad or your smart phone. And that information, it's really hard to argue that that really isn't personally identifiable since it's tied only to you. So I think the challenge is, as we move forward, is to sort of play devil's advocate even as these new policies get put into place to say how could they be abused by people trying to circumvent them and to try to plug those holes. And that's the kind of arms race which goes on in this field.
11:42:31
GJELTENMichael Leiter, we all remember a few years ago a fellow by the name of Admiral Joanne (sic) Poindexter who was very interested in pursuing big data analysis and applying it to the fight against terrorism. He came up with something called Total Information Awareness. And it involved, basically, big data analysis, taking information derived from driver's license applications, et cetera, and using them to identify and track possible terrorists. That had to be scaled back. There was a huge reaction against it. What was the lesson of that episode?
11:43:05
LEITERI think at least two lessons, Tom. First, in a really Total Information Awareness, or TIA as it was known, involved a really big fishing net which basically sucked up everything and then tried to find connections without knowing what you were really looking for to try to find terrorists. Second, it was really built without any input from the privacy and civil liberties community and no controls from the privacy and civil liberties community.
11:43:30
LEITERSo what we've tried to do instead of the failed TIA effort, I think appropriately failed, is to first make sure when you're looking at data you have a very specific goal and you have a very sufficient specific predicate about why you're looking for someone, whether a terrorist or a cyber criminal and second, as I've said earlier, building in that privacy and civil liberties upfront. So people only see what they're allowed to see and that can be audited and confirmed by inside and outside supervision.
11:43:59
GJELTENNow, I guess, you could argue that, you know, we're talking here about what rights consumers have to stop people from collecting data about them. But if, you know, when it comes to possible terrorists, you don't necessarily want them to know what data is being collected about them.
11:44:14
LEITERWell, you certainly don't want them to know what's being collected about them but you still want a very rigorous set of controls about what type of information is collected about them. And we've had many debates about that, whether it's the Foreign Intelligence Surveillance Act controlling electronic surveillance or human intelligence operations here in the United States. All of those, in my view, need sufficient controls, sufficient oversight by Congress and by the legislative -- by the judiciary to make sure that that's being done in a way which is appropriate. And then analyzed within the executive branch in a way that, again, protects privacy and civil liberties of the people who are innocent.
11:44:49
GJELTENWell, as you can imagine, our listeners do have strong feelings about this. I want to go back to the phones now. First to Beth who's calling us from Charlotte, N.C. Good morning, thanks for the call Beth, you're on "The Diane Rehm Show."
11:45:00
BETHGood morning. I'm concerned about many of these data centers that store information, coming to North Carolina and using huge amounts of so-called cheap coal-fired electricity and water. You know, we could collect unlimited data, but is anyone thinking about the downside of all the electricity and water these data centers use?
11:45:32
GJELTENWell, Suzie, John mentioned earlier, a very important point which is how much more cheaply and efficiently data can be stored now. Beth, nevertheless, wants to know, "is there a limit to how many resources have to be expended in order to really get into this data computing in a big way?"
11:45:53
IACONOSo this is really a great question. And it's really something that society has to deal with. But there's a whole new area called Green IT and so there's a lot of computer scientists that really want to make sure that these huge data centers and server farms are not using up all this electricity. So we're developing new algorithms, we're developing new systems that are going to conserve energy and power down, for example, when transactions are not happening. And so we're, right now, grappling with these issues and we think it's a really important area of research.
11:46:29
GJELTENJohn, do you have any thoughts about that? I mean, because you were the one that brought up this issue of how -- and I think this is really important, you know, how cheap it is to store vast amounts of data now.
11:46:40
VILLASENORYeah, I think the caller's question is a very important one and as was noted, it's something that the government and people in the private sector are looking at. I'd add that there's another aspect of big data in all the machines that store and process it which is hugely important, which is electronic waste. We now, you know, we basically throw away and refresh computer hardware at a stunning rate. And all that waste piles up.
11:47:04
VILLASENORAnd so, I think, in the broader picture, I think we all have the responsibility to consider, not only the, for example, the privacy concerns of analyzing big data, but the environmental impacts of acquiring it and storing it and the environmental impacts of the machines that we use to do that. And all of that needs to be considered under one umbrella.
11:47:21
GJELTENOkay. And let's go now to Elizabeth who's in Richmond, Va., and calling us this morning. Good morning, Elizabeth, you're on "The Diane Rehm Show."
11:47:29
ELIZABETHGood morning. One of the things that years ago triggered my interest in this was when I read the grocery list of our governor, Governor Baliles, from the mansion, that would be...
11:47:41
GJELTENThere in Virginia, the Virginia governor.
11:47:43
ELIZABETHIt was, right, the Virginia governor's mansion's grocery list from a local national grocery store in our paper. And there were huge articles about that. And it was the first time, even though I had friends that were talking about it already, first time I realized how invasive all of this could be. And recently, there's been so much pushing by our state again for getting broadband plus our President's talking about it all the time. And it strikes me that rather than being able to just listen to a radio or turn on a regular television, we now have equipment that is monitoring everything we do. And I really believe that there was a film that came out on public television, again, recently called "The Last Enemy."
11:48:34
ELIZABETHAnd in the intermissions, the host of the program was talking about the dangers that were reflected in this particular story. And he said, if you think this is fiction, you need to know that this is going on in England right now. And you need to be scared. And I've never heard such a warning before. You hear people thinking out loud about it all the time. But this was a new serious thing that has really dire consequences.
11:49:08
ELIZABETHAnd, you know, every time people talk about the coulds and ifs, what we might do to protect you or what we are doing or what we're thinking about doing and then when you keep talking about how advertising is what's driving all this, I think, that is really -- that's meant to divert us from what the real issue is. There is nothing in our world that is private anymore.
11:49:31
GJELTENLet me ask...
11:49:31
ELIZABETHAnd our (unintelligible) ...
11:49:31
GJELTEN...let me ask our panel about that. You've raised a lot of issues here, Elizabeth. Certainly the advent of big data computing produces fertile territory for our conspiracy theorists who see the government sort of doing a "1984" kind of thing. What about this connection? What about this issue, Suzie, about broadband? Is there any relation between the extension of broadband throughout the country and the capability of the United States and telecommunications companies to gather data?
11:50:01
IACONOOh, absolutely. So broadband enables us to shoot big data, you know, from one site to the next really easily and sometimes quite cheaply. And so, in my previous example about emergency preparedness, we can only do that kind of on the fly kind of evacuation planning if there was broadband in that vicinity. We would not be able to be doing the modeling and simulations and the predictions and all of that and getting all that information out to the first responders if we didn't have broadband. There's huge societal benefits.
11:50:36
GJELTENOf course, Michael, we don't have broadband in many parts of the country. Are we just seeing the tip of the iceberg here as far as big data analysis and computing is concerned?
11:50:45
LEITERI think we are, Tom. And most of what we've talked about, today, quite appropriately involves consumer information and the like. But there are -- there's really an application of big data in almost every sector we deal with in the private sector. For example, understanding where natural resources are. This is an incredibly data intensive process. And understanding all of that geological data and then mapping that to actually determine where energy might be found is an incredible big data problem.
11:51:13
LEITERSimilarly, how we test pharmaceuticals. This is a big data problem, understanding how different drugs affect different people. Similarly, something like missing and exploited children, we do work with the National Center for Missing and Exploited Children and understanding how different cases might relate to another, where people are located geographically. All of this involves big data and you can just go on and on. And I think every sector of our society will face some of these issues in the coming years.
11:51:41
GJELTENNow, you mentioned earlier, something I just want you to summarize real quickly. And that's how big data computing capability is becoming important in the defense department and the U.S. military.
11:51:52
LEITERAbsolutely. The defense department and the intelligence community that supports the defense department obviously collects an enormous amount of information that in a place like Afghanistan, it's not just traditional information, it's newer forms of intelligence, such as overhead reconnaissance aircraft that are recording an enormous amount of video and signals intelligence.
11:52:10
LEITERAnd then you have to overlay that with different demographic information in a place like Afghanistan where you're doing USAID, international aid programs for rebuilding towns and cities and aid. And seeing how all that relates to trying to stop insurgency and aid the Afghan people. So enormous data problems and different types of data, all of which has to be supported by troops in the field, state department officials in the field and again people back in the United States supporting them.
11:52:39
GJELTENMichael Leiter, previously from the National Counterterrorism Center, now from Palantir Technologies. I'm Tom Gjelten, you're listening to "The Diane Rehm Show." John Villasenor, it's really fascinating to me how many practical applications there are to this, in this field, and how many potential problems can really be addressed in an efficient way through big data analysis. Is there a bottom line here? You know, in your judgment so far, do the privacy concerns outweigh the practical benefits or vice versa?
11:53:16
VILLASENORNo, I think, well, I don't want to say that the privacy concerns aren't important. But I think that very few people would argue that we're not better off for having broadband access and for having the ability, for example, as Michael was saying, to find disease outbreaks that would've otherwise potentially been undiscovered. And that the access that we all enjoy to information about important things is so much better.
11:53:42
VILLASENORThose are all very good things. But, like with any technology, they come with downsides and risks. And so the privacy and civil liberties community is properly on the job, as it were, to make sure that as we gain all these benefits, we do so with balancing those with some of the concerns, the new concerns that are raised. But on balance, I think these kinds of technologies are beneficial.
11:54:01
GJELTENWell, there's no question that the preponderance of our listeners are concerned about the privacy issues here. I'm going to go now to Michael who's calling us from Ferndale, Mich. Good morning, Michael.
11:54:11
MICHAELHi, how are you?
11:54:11
GJELTENGood, how are you?
11:54:13
MICHAELVery good. My question has to do with, is how the people are talking about this information is being kind of targeted for consumer and also for gathering information for our security, maybe whether it be terrorism or for health safety. A while back now we had the conversation with the people from Google when they implemented their new privacy policy. And specifically with diseases and personal information about your medical history, that kind of data or those kind of data would not be collected is what Google was ensuring us.
11:54:56
MICHAELNow, here we have these people saying that this information will be collected. Where is that distinction made? Is there a sense of anonymity when it's collected and, you know, and for government as far as for the CDC or, you know, etcetera? We're talking about bacteria here and viruses in both cases, whether it be flu or food borne illness (unintelligible) ...
11:55:20
GJELTENI'm going to put that question to Michael Leiter.
11:55:23
LEITERWell, Michael, thanks for the question because it allows me to correct a misimpression that I'm sorry if I left it. One of the great things about big data, one of the challenges of big data, one of the benefits also, is it comes from many different sources. So in a case like working with the Centers for Disease Control, they already have information that they are collecting through public health networks. That is information provided under law by doctors or hospitals to them. What big data can allow them to do is also integrate that data with open source information which is not particularly sensitive information and use that to find things that otherwise would be lost to them.
11:56:00
LEITERThat with that one data source, they couldn't identify the path of a disease at the speed of the disease, but with new big data techniques, you can integrate that data in a much more meaningful way and attack the problem more quickly. And of course, you have to do that all with civil liberties and privacy in mind.
11:56:15
GJELTENSuzie Iacono, very quickly, is there any government agency that will hold users of big data accountable for privacy breaches?
11:56:24
IACONOWell, I believe that we have agencies that are very interested in developing frameworks that citizens can come to understand, and in getting those frameworks implemented so that we ensure information privacy.
11:56:40
GJELTENOkay. Very good, Suzie Iacono is the senior science advisor for computer and information science and engineering at the National Science Foundation. I've also been joined this morning by Michael Leiter from Palantir Technologies, previously at the Counterterrorism Center and John Villasenor from Brookings Institution and UCLA. We've been talking about big data computing. I'm Tom Gjelten, thanks for listening.

Monday, Apr 02 2012 • 11 a.m. (ET)
