We’re six weeks into 2014, and the concept of “big data” continues to make headlines nearly daily. The New York Times recently reported that over the past three years, there’s been a hundredfold increase in Google searches for the term. Last fall, Stanford enrolled more than 700 students in a course on machine learning and statistical algorithms. It was the largest on-campus class all semester.
On the one hand, I’m ecstatic about the promise of using big data to do amazing things:
- Create predictive models for personalized medicine
- Develop brain-like computers that can learn from experience
- Transform workplace strategies
And the massive amounts of data now being collected and stored cheaply require armies of quantitatively minded individuals to make sense of them all.
Some “quants” have even achieved rock-star status. Nate Silver is one example, and more are on the way: two female statisticians are on this year’s Forbes list of 30 most influential scientists under the age of 30, and both work in the big-data fields of imaging and genomics. But is bigger always better when it comes to data?
Big Data: One Term, Many Meanings
“Big data” probably has as many definitions as there are users, much as “bioinformatics” did when that term became popular over a decade ago. The phrase means different things to different people, depending on their fields.
Statisticians working in the field of genomics have long had to face a big-data problem of their own; they just called it something else: “high-dimensional data,” or the “large p, small n” problem. They had lots of information, for example expression levels of thousands of genes or other features (the large p), but the high cost of the assays meant they could obtain all that genetic information on only a few experimental units at a time (the small n). The challenge was to come up with ways to reduce the dimensionality of the genomic data using clustering, classification and other pattern-recognition approaches.
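As a rough illustration (not from the original post), here is a minimal Python sketch of the “large p, small n” setting, assuming NumPy and scikit-learn and using simulated expression values rather than real genomic data: thousands of features on a dozen samples are reduced to a few principal components before clustering.

```python
# Minimal sketch of "large p, small n": p = 5,000 simulated expression
# features measured on only n = 12 samples, reduced to a few principal
# components before clustering the samples.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, p = 12, 5000                      # few samples, thousands of features
X = rng.normal(size=(n, p))          # stand-in for a gene-expression matrix

# With n much smaller than p, at most n - 1 components can carry variance,
# which is exactly the dimensionality bottleneck described above.
pca = PCA(n_components=3)
scores = pca.fit_transform(X)        # shape (12, 3)

# Cluster samples in the reduced space rather than in all 5,000 dimensions.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
print(pca.explained_variance_ratio_, labels)
```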
Information captured digitally and relatively cheaply with modern technologies, often as a by-product of other primary activities such as Internet searches, online shopping or visits to the doctor, yields a different type of big data: a ton of information on a ton of subjects. Electronic health-record databases, for example, hold a lot of clinical and laboratory data on thousands of patients. From an analytic perspective, this means you now have “huge p, huge n”: the best of both worlds, right? Not necessarily.
Growing Bigger and Less Confident
As Malcolm Gladwell reminds us in his latest book, David and Goliath, sheer size can be a source of tremendous power, but it can also obscure major weaknesses. Similarly, simply having a lot of data, even if they are messy and unreliable, can give people a false sense of confidence in the accuracy of their results. The caveat with big data is that they often do not come from carefully designed experiments aimed at producing reliable and generalizable results. As a consequence, the data are not well understood, their quality can be variable, and they may come from a biased population. For example, users of social-networking sites such as Twitter tend to be younger than non-users, and patients in some electronic health systems have different socioeconomic characteristics from those of the general U.S. population.
Don’t Let Machines Do All the Work
Powerful computers and analytic tools are available to automate the data processing and analysis; however, it’s dangerous simply to sit back and let the machines do all the work. Machines don’t understand context; algorithms are often based on unverifiable assumptions, and informed decisions need to be made at each step: which method to choose, how to set tuning parameters, what to do with missing or messy data fields. And in the world of clinical research, it is more obvious than ever that interdisciplinary teams of information technology experts, clinicians, epidemiologists, computer scientists and statisticians are needed to integrate diverse knowledge bases and skill sets in tackling big-data issues.
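To make the point concrete, here is a minimal, hypothetical Python sketch (scikit-learn assumed, with simulated data) of the kinds of decisions listed above: what to do with missing fields, which method to use, and how to set a tuning parameter by cross-validation rather than accepting a machine default.

```python
# Hypothetical sketch of analyst decisions in an automated pipeline:
# handling missing fields, choosing a model, and tuning a parameter.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
X[rng.random(X.shape) < 0.1] = np.nan    # simulate messy, missing fields
y = rng.integers(0, 2, size=200)         # simulated outcome labels

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # a deliberate choice, not a given
    ("model", LogisticRegression(penalty="l2", max_iter=1000)),
])

# The regularization strength C is a tuning parameter; it is chosen by
# cross-validation instead of being left to a default value.
search = GridSearchCV(pipe, {"model__C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

Every line of this sketch encodes a judgment call that the machine cannot make on its own: the imputation strategy, the model family, and the grid of tuning values all reflect the analyst's understanding of the data and its context.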
No doubt there is tremendous potential for big data to transform what questions we ask and how we conduct research, and ultimately to advance scientific knowledge. But whether we are dealing with petabytes of data or measurements from a few lab animals, the bottom line is this: sound statistical thinking—good study design, generalizability, reliable measurements, robust algorithms—is the key to valid results. And we shouldn’t forget the value of the most low-tech and cheapest tools of all, common sense and human judgment, when dealing with all kinds of data, big and small.
Comments
Thanks, Mimi! Great post.
Well stated that this latest “buzz word” has many different meanings for different users, but the most widely accepted definition of big data (according to Gartner) is based on the three Vs: Volume (the amount of data), Velocity (the speed of data), and Variety (the range of data types and sources).
HELLO, folks,
I left a rather lengthy comment here about years and years of field study that elicited huge volumes of data. I thought real stories from the field about how, why, and what data are discovered and fed into large-scale data systems would prove interesting and insightful to readers whose data experience comes mainly from closed labs, licensed clinical-care settings, or large-scale reference databases. Since my comment does not seem to appear here, I presume it was too lengthy and too grounded in field work to interest primarily academic data researchers and users.
The point of relating these several stories is that, in my experience, most data systems handling human health-care data are less than 50% compliant with common, sound data standards. Data are entered partially or incorrectly, and are therefore simply wrong. Those entering the data, when questioned time and time again, know nearly nothing about their data: not its measurement, its definition, nor its substance. Can you imagine asking someone whose job and livelihood consist of applying nuclear medical equipment to human subjects, to test whether or not they have some diagnosable condition, whether they know anything about science? And their answer is: this is NOT nuclear science, it's simply and only pushing buttons.
I visited an extremely well-known and prestigious Manhattan hospital last night, where a chest X-ray was ordered for me, and what I just described is exactly what happened and what I heard. I immediately walked out of that hospital, stating that I understood I was contravening the ER physicians' orders, and recorded my plan to go instead to my own PCP's hospital, if needed, for competent, reliable care.
Thanks. I hope someday to share the several stories that may or may not be posted here in the future. They show how careful field observation and diligent data recording can help us understand the many, many problems facing and plaguing humanity everywhere.
Yours, Dr. Elyas Fraenkel Isaacs, PHD, PHD, DPH, DDiv, MD in the City of New York