We live in an age in which big data and data science are used to predict everything from what I might want to buy on Amazon to the outcome of an election. The results of the Brexit referendum caught many by surprise because pollsters suggested that a “remain” vote would prevail. And we all know how that turned out.
History repeated itself on Nov. 8, 2016, when U.S. president-elect Donald Trump won his bid for the White House. Most polls and pundits predicted a Democratic victory, and few questioned those predictions.
The Wall Street Journal article “Election Day Forecasts Deal Blow to Data Science” made three important points about big data and data science:
– Dark data, data that is unknown, can result in misleading predictions.
– Asking simplistic questions yields a limited data set that produces ineffective conclusions.
– “Without comprehensive data, you tend to get non-comprehensive predictions.”
Keep the baby, drain the bath water
A powerful new application of data science uses data to detect and stop cyber attacks in real time. Think of it as stopping the next Target, Anthem or Sony Pictures data breach. Data science has produced critical discoveries like the Higgs boson, a scientific breakthrough to which I am proud to have contributed. Now, my team and I apply our data science minds to detecting hidden threats and cyber attacks on the businesses you trust.
So, from a data science perspective, what are the lessons learned from the big data blunders in election predictions? The lesson is all about using the right data for the problem at hand, and not about questioning whether the data is right. The same applies to cybersecurity.
Using the wrong data
Cybersecurity that relies on logs as its data source suffers the same fate as election predictions undermined by dark data. Logs provide detailed information about user identities and computers. For example, a log can tell us that Kevin accessed a database at 10:03 p.m. or that Emily visited a Russian website at 5:32 a.m. The belief is that logs are the fingerprints that reveal a cyber attacker’s presence. However, the victims of data breaches never knew the attacker was there, which shows how little their logs revealed. Sophisticated attackers are experts at hiding in plain sight and leaving no evidence behind.
Asking simplistic questions
Cybersecurity that relies on flow data like NetFlow is similar to relying on a pollster who asks simplistic questions. Attackers who perpetrate the most sophisticated cyber heists, like the Carbanak banking theft, use remote access Trojans (RATs) to control their attacks remotely. Flow data reveals that an internal computer communicated with an external one, when the communication started and ended, and how much data was sent and received. But flow data can’t distinguish between Web browsing and a RAT.
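To make that limitation concrete, here is a minimal Python sketch. It is not a real NetFlow parser: the Session and FlowRecord classes, the IP addresses and the byte counts are all hypothetical, chosen only for illustration. It assumes a flow record keeps just the fields described above (who talked to whom, when, and how much) and shows that a browsing session and a RAT session with similar timing and volume look the same once they are reduced to flow data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Session:
    """What actually happened on the wire (hypothetical, for illustration)."""
    internal_host: str
    external_host: str
    start: datetime
    end: datetime
    bytes_sent: int
    bytes_received: int
    activity: str  # ground truth; only the traffic itself carries this


@dataclass(frozen=True)
class FlowRecord:
    """A simplified flow summary: who, when and how much. Nothing else."""
    internal_host: str
    external_host: str
    start: datetime
    end: datetime
    bytes_sent: int
    bytes_received: int


def to_flow_record(session: Session) -> FlowRecord:
    # The activity is dropped here: the flow summary has no field for it.
    return FlowRecord(
        session.internal_host,
        session.external_host,
        session.start,
        session.end,
        session.bytes_sent,
        session.bytes_received,
    )


def flow_shape(record: FlowRecord) -> tuple:
    """Duration and volume, the only behavioral signals left in the record."""
    duration = (record.end - record.start).total_seconds()
    return (duration, record.bytes_sent, record.bytes_received)


# Two made-up sessions from the same internal computer (documentation IP ranges).
start = datetime(2016, 11, 8, 22, 3)
browsing = Session("10.0.0.15", "203.0.113.10", start,
                   start + timedelta(minutes=5), 48_000, 1_200_000,
                   "web browsing")
rat = Session("10.0.0.15", "198.51.100.7", start,
              start + timedelta(minutes=5), 48_000, 1_200_000,
              "remote access Trojan")

browsing_flow = to_flow_record(browsing)
rat_flow = to_flow_record(rat)

# Apart from the external address, the two summaries are identical.
same_shape = flow_shape(browsing_flow) == flow_shape(rat_flow)
print(same_shape)  # True
```

Real attacks will not match benign traffic byte for byte, of course. The point of the sketch is only that whatever separates the two sessions lives in the traffic itself, which the flow summary has already thrown away.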
Using the right data to make comprehensive decisions
If you want to find a cyber threat in your computer network, then the most truthful source of big data is your computer network traffic. Data science enables you to make very rapid decisions based on extremely large data sets. In fact, data science recently enabled a robot to set a new record by solving a Rubik’s Cube in less than a second. Data science likewise enables cybersecurity to listen to all the computer traffic on a network to find cyber attackers in the act and stop them before they steal personal, health or financial information. The key is using the right “big data” – in this case, network traffic – for data science to make the right decisions. Let’s hope that pollsters in the next election learn from the past and use the right data source to predict outcomes.