Big Data has always interested me, but lately I’ve been devoting more of my time to learning about it and digging deeper. Last month I attended the Wired Next Fest, where people from many different backgrounds spoke about the main theme, “Future”, including Wikipedia founder Jimmy Wales and Edward Snowden, the security expert and former CIA employee. It was really inspiring.
What motivated me to write this post, though, was the talk by Seth Stephens-Davidowitz, author of the book “Everybody Lies”, who worked as a data scientist at Google and now writes for the New York Times. He raised some questions that go unnoticed when we are dazzled by the immense amount of information we can extract from Big Data.
He opened the talk with this statement:
We are embarrassed to ask other people certain things; however, this doesn’t happen when we ask Google the same questions.
That is, subjects such as financial problems, abortion, infidelity or homosexuality are not always discussed openly with friends and family. On the internet the scenario changes: in the privacy of our computers, we can search for any subject without fear of being judged.
Now think of the same topics being published on social networks. I bet you have never read a post saying “I’m mired in debt, I don’t know what to do.” On the contrary, what we mostly see are photos of parties and trips, even though we know that in real life many people take out loans to pay off other loans.
Or take relationships: you are far more likely to see romantic and affectionate messages between couples on social networks than a post “denigrating” a partner, even though we know that many relationships are no longer in the honeymoon phase.
One comparison presented by Seth that showed this contrast well was the top 5 completions of the phrase “My husband is …” found in social network posts versus Google searches.
Judging by the relationships you know, which of the two sources seems more realistic?
And this is precisely the point about data quality when we want to analyze data, extract patterns, or identify correlation and causality.
If we had to identify a possible terrorist, would we base the analysis on social network posts or on Google searches? What about a person who is planning suicide?
It was the curiosity sparked by the talk that led me to read Seth’s book, “Everybody Lies”, and I can say that it is essential reading for anyone thinking about becoming a data scientist.
The book shows that anything can become quantifiable data, from texts to facial expressions, and that, unlike petroleum, data doesn’t run out; on the contrary, it keeps growing exponentially.
When Google was created, the goal was to help people get to know the world, not to help researchers get to know people. But the potential of the data collected from searches lies precisely in its truthfulness, compared with traditional surveys, where people don’t feel comfortable telling the truth to others. With this data, we can challenge common sense and see whether it really matches reality.
For example, the analysis cited in the book on the probability of becoming an NBA player points out that children of married parents with better socioeconomic conditions are much more likely to make it than children of single mothers living in poorer areas. So the story that people who faced great hardship in childhood are more likely to become great athletes did not prove true; quite the opposite.
Another example: if we test the link between dreaming of a banana, a fruit whose peculiar shape Freud would have attributed to repressed desires, against the data found on Google, we conclude that dreaming of a banana or an apple actually has more to do with the kind of fruit you like than with hidden sexual desires.
Seth listed four major powers of Big Data:
Providing new types of data;
Offering honest data;
Allowing us to break the data down into small groups of people, and
Allowing us to run numerous randomized experiments with minimal investment and time.
However, it is useless to have these powers in hand if we use the wrong source, that is:
The Big Data revolution is not gathering more and more data, but collecting the right data.
Another very interesting example describes how racehorses were selected. In the beginning, this selection took into account the breed and the pedigree of the horses, but an expert in the subject began to collect morphological data on the animals until he found the correlation that mattered: the horses with the greatest chances of becoming winners were those whose main organs, such as the heart, were larger than average, and this had no relation to pedigree.
But as I said before, the data used in an analysis can be of many different types. Another study used yearbook photographs of American high school students taken between 1905 and 2013. With the help of algorithms, it was possible to establish an “average” face for each decade. The curious thing is that over the years people started to smile. Can you guess the reason for this change? When photography was invented, people compared photos to paintings, and whoever posed for a painter couldn’t keep smiling for so many hours, so they adopted a more serious expression. Consequently, people adopted the same expression in the older photos.
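To give a rough idea of what an “average” face means, here is a minimal Python sketch. It assumes the photos are grayscale, already aligned and of the same size, and it simply averages each pixel position across the photos. The function name `average_face` and the tiny 2x2 “photos” are hypothetical, invented only for illustration; the actual study may well have used a different and more elaborate method.

```python
def average_face(images):
    """Return the pixel-wise mean of equally sized grayscale images.

    Each image is a list of rows; each row is a list of pixel intensities.
    Assumption: the faces are already aligned, so the same position in
    every image corresponds to the same part of the face.
    """
    if not images:
        raise ValueError("need at least one image")
    n = len(images)
    height, width = len(images[0]), len(images[0][0])
    return [
        [sum(img[r][c] for img in images) / n for c in range(width)]
        for r in range(height)
    ]


# Hypothetical stand-in data: three tiny 2x2 "portraits" from one decade.
decade_photos = [
    [[0, 100], [200, 50]],
    [[20, 120], [220, 70]],
    [[40, 140], [240, 90]],
]

avg = average_face(decade_photos)
print(avg)
```

Repeating this for each decade’s photos would give one averaged image per decade, which can then be compared side by side.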
Another question raised by this research is the comparison between official government data and the data obtained from Google searches. When we see violence statistics falling, many celebrate the results. But we may be facing a data collection problem.
For example, in Brazil, it is very common for people to know which areas of a city are the most dangerous, yet we don’t see enough policing in those same areas. Where is the fault? Do the police not “see” the problem, or do the victims of assault not report it? The policing strategy in a city is defined according to the number of complaints registered, that is: if we don’t have a complaint, then we have no problem. That is why it is extremely important that victims report to the official department, no matter how bureaucratic it may be. Do you see how serious it is to rely on faulty data?
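The effect of under-reporting can be sketched with a few lines of Python. All the numbers below are hypothetical and invented purely for illustration (they are not real statistics), and the helper `allocate_patrols` is just an assumed, simplified rule that splits patrols in proportion to the counts it receives.

```python
def allocate_patrols(counts, total_patrols):
    """Split patrols across areas in proportion to the given counts."""
    total = sum(counts.values())
    return {area: total_patrols * n / total for area, n in counts.items()}


# Hypothetical numbers, invented only for this illustration.
true_incidents = {"Area A": 100, "Area B": 300}
# Percentage of victims who actually register a complaint in each area.
reporting_pct = {"Area A": 80, "Area B": 10}

# What the official statistics "see": only the registered complaints.
reported = {
    area: true_incidents[area] * reporting_pct[area] / 100
    for area in true_incidents
}

by_reports = allocate_patrols(reported, total_patrols=20)
by_truth = allocate_patrols(true_incidents, total_patrols=20)

for area in true_incidents:
    print(
        area,
        "| patrols based on complaints:", round(by_reports[area], 1),
        "| patrols based on real incidents:", round(by_truth[area], 1),
    )
```

In this made-up scenario, the area with three times as many real incidents ends up with fewer patrols, simply because its victims rarely file complaints.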
Another point of discussion about Big Data analysis is that what we can’t quantify is often what really interests us. For example, we can quantify how students answer multiple-choice questions, but we can’t easily quantify critical thinking, curiosity or personal development.
On top of this, there are ethical issues: with the same statistical methods, we can identify the most effective treatment for a particular disease, or decide whether an athlete should be dismissed because he has reached his peak and his results will tend to worsen.
As technology professionals, with big data analysis spreading everywhere, we need to question whether the data we have is really the best for the analysis we intend to do. If we don’t consider this, we may reach false conclusions that can be very costly, either for those who make decisions based on the results or for the people directly affected by them.
The conclusion I can reach is that:
Having access to a large amount of data demands an equally large, or even larger, responsibility.
And this responsibility must be present from the beginning, when we choose the data source we will use, and especially when we decide what will be done with the results.
And you, what do you think about it?
If you are interested in the talk I mentioned, this is the video.