Explained: What is big data?

by Pieter Arntz on August 3, 2018

If the pile of manure is big enough, you will find a gold coin in it eventually. This saying is often used to explain why anyone would use big data. Needless to say, in this day and age, the piles of data are so big that you might end up finding a pirate’s treasure.

How big is the pile?

But when is the pile big enough to consider it big data? Per Wikipedia:

“Big data is data sets that are so big and complex that traditional data-processing application software are inadequate to deal with them.”

As a consequence, we can say that it’s not just the size that matters, but also the complexity of a dataset. The draw of big data to researchers and scientists, however, is not in its size or complexity, but in how it may be computationally analyzed to reveal patterns, trends, and associations.

When it comes to big data, no mountain is too high or too difficult to climb. The more data we have to analyze, the more relevant conclusions we may be able to derive. If a dataset is large enough, we can start making predictions about how certain relationships will develop in the future and even find relationships we never suspected existed.

The treasure

We mentioned predicting the future or finding advantageous correlations as possible reasons for using big data analysis. Just to name a few examples, big data could be used to set up profiles and processes for the following:

Stop terrorist attacks by creating profiles of likely attackers and their methods.
More accurately target customers for marketing initiatives using individual personas.
Calculate insurance rates by building risk profiles.
Optimize website user experiences by creating and monitoring visitor behavior profiles.
Analyze work flow charts and processes to improve business efficiency.
Improve city planning by analyzing and understanding traffic patterns.

Beware of apophenia

Apophenia is the tendency to perceive connections and meaning between unrelated things. What statistical analysis might show to be a correlation between two facts or data streams could simply be a coincidence. There could be a third factor at play that was missed, or the data set might be skewed. This can lead to false conclusions and to actions being undertaken for the wrong reasons.

For example, analysis of data collected about medical patients could lead to the conclusion that those with arthritis also tend to have high blood pressure, when in reality the most popular medication to treat arthritis lists high blood pressure as a side effect. Remember the old research edict: correlation does not equal causation.
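The arthritis example can be made concrete with a small simulation. This is only a sketch: the patient population, the 30 percent arthritis rate, the 80 percent medication rate, and the blood pressure risks are all invented for illustration, and the function names are hypothetical. In this made-up world, arthritis never affects blood pressure; only the medication does.

```python
import random

def simulate_patients(n=100_000, seed=7):
    """Return (has_arthritis, takes_medication, high_bp) tuples.

    Every rate below is an invented assumption, not real medical data.
    """
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        arthritis = rng.random() < 0.30
        # Only arthritis patients take the (hypothetical) medication.
        medicated = arthritis and rng.random() < 0.80
        # Blood pressure depends on the medication, never on arthritis itself.
        bp_risk = 0.50 if medicated else 0.20
        high_bp = rng.random() < bp_risk
        patients.append((arthritis, medicated, high_bp))
    return patients

def high_bp_rate(patients, arthritis, medicated=None):
    """Share of matching patients who have high blood pressure."""
    group = [p for p in patients
             if p[0] == arthritis and (medicated is None or p[1] == medicated)]
    return sum(p[2] for p in group) / len(group)

patients = simulate_patients()

# Naive comparison: arthritis patients versus everyone else.
naive_gap = high_bp_rate(patients, True) - high_bp_rate(patients, False)

# Same comparison, restricted to patients who do not take the medication.
adjusted_gap = (high_bp_rate(patients, True, medicated=False)
                - high_bp_rate(patients, False, medicated=False))

print(f"gap ignoring medication: {naive_gap:.2f}")
print(f"gap among unmedicated patients: {adjusted_gap:.2f}")
```

Under these assumptions, the naive comparison shows a clear gap in high blood pressure between the two groups, while the gap all but vanishes once the third factor, the medication, is taken into account.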

In statistics, mistaking a chance pattern for a real effect is called a type I error, or false positive, and it’s the feeding ground for many myths, superstitions, and fallacies.
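To see how easily coincidence can pass for a pattern, consider the following sketch. It compares one random data stream against 200 other random streams that are unrelated by construction, and counts how many appear correlated anyway. The stream counts, the seed, and the function names are assumptions made for this illustration; the 0.36 threshold is roughly the conventional 5 percent significance cutoff for 30 data points.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def count_chance_correlations(n_streams=200, n_points=30,
                              threshold=0.36, seed=1):
    """Count unrelated noise streams that still look correlated."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    hits = 0
    for _ in range(n_streams):
        # Each stream is independent noise: any correlation is coincidence.
        noise = [rng.gauss(0, 1) for _ in range(n_points)]
        if abs(pearson(target, noise)) > threshold:
            hits += 1
    return hits

hits = count_chance_correlations()
print(f"{hits} of 200 unrelated streams look correlated with the target")
```

With a 5 percent cutoff, around one in twenty unrelated streams can be expected to clear the bar by luck alone. The more data streams we compare, the more of these false positives we should expect to find.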

The researchers

As more and more data becomes digitized and stored, the need for big data analysts grows. A recent study showed that 53 percent of the companies interviewed were using big data in one way or another. Some examples of use cases for big data include:

Data warehouse optimization (considered the top use case for big data)
Analyzing patterns in employee satisfaction; for example, in multinational companies, a 0.1 percent increase in turnover is considered too high
Sports statistics and analysis; sometimes the difference between being the champion or coming in second comes down to the tiniest detail
Prognosis statistics and success rates of particular medications, which can influence a doctor’s recommended course of treatment; an accurate assessment could be the difference between life and death
Selecting stocks for purchase and trade; quick decision-making based on analytical algorithms gives traders the edge

At Malwarebytes, we use big data in the form of anonymous telemetry gathered from our users (those who allow it) to monitor active threats. Viewing these data sets allows us to see trends in malware development, from the types of malware that are being used in the wild to the geographic locations of attacks.

From these data, we’re able to draw conclusions and share valuable information on the blog, in reports, such as our quarterly Cybercrime Tactics and Techniques report, and even in heat maps like the one we created for WannaCry. (As our product detected WannaCry even before we added definitions, this gave us some valuable information about where it might have originated.)

The tools

Technologically, the tools you will need to analyze big data depend on a few variables:

How is the data organized?
How big is big?
How complex is the data?

When we are looking at the organization of data, we are not just focusing on the structure and uniformity of the data, but also on the location of the data. Are they spread over several servers, completely or partially in the cloud, or are they all in one place?

Obviously, uniformity makes data easier to compare and manipulate, but we don’t always have that luxury. And it takes powerful and smart statistical tools to make sense out of polymorphous or differently-structured datasets.

As we have seen before, the complexity of the data can be another reason why we need special big data tools, even if the sheer number is not that large.

Although more big data tools are becoming available, many are still in the early stages of development, and not all of them are ready for intuitive use. It takes knowledge and familiarity to use them effectively. That is where personal preference comes in. Using a tool you have experience with is always easier, at least at first.

Our personal data

When we go online, we leave a trail of data behind that can be used by marketers (and criminals) to profile us and our environment. This makes us predictable to a certain extent. Marketers love this type of predictability, as it enables them to figure out what they can sell us, how much of it, and at which price. If you’ve ever wondered how you saw an ad for vintage sunglasses on Facebook when you were only searching on Google, the answer is big data.

Imagine a virtual assistant that retrieves travel arrangement information the moment you first consider a vacation. Hotels, flights, activities, and food and drink could all be listed to your liking, in your favorite locations, and in your price range in the blink of an eye. Some may find this scary; others would consider it convenient. However you feel, the virtual assistant is able to do this because of the big data it collects on you and your behavior online.

The data-driven society

One of the major contributions of big data to our society will be through the Internet of Things (IoT). IoT represents the most direct link between the physical world and the cyber world we’ve experienced yet. These cyber-physical systems will of course be shaped by the objects and software we create for them, but their biggest influence will be the result of algorithms applied to the data they collect.

With the evolution of these systems, we can expect to evolve into a data-driven society, where big data plays a major role in adjusting the production to meet our expected needs. This is an area where we will need safeguards in the future to prevent big data from turning into Big Brother.

Big data, big breaches

The obvious warning here is that gathering and manipulating big data require extra attention paid to security and privacy, especially when the data are worth stealing. While raw datasets may seem like a low-risk asset, those who know how to find the gold (cyber)coin in the pile of manure will see otherwise. With the advent of GDPR in May, any form of personally identifiable information (PII) will be sought after with great urgency, as the safeguards put in place to protect PII endanger the lively black market trade set up around it.

The lesson, then, is to take seriously big data’s impact, both for good and for evil. Perhaps the whole pile should be considered the treasure.

*** This is a Security Bloggers Network syndicated blog from Malwarebytes Labs authored by Pieter Arntz. Read the original post at: https://blog.malwarebytes.com/security-world/technology/2018/08/explained-what-is-big-data/