Google provides us with one of the most effective free tools for spotting trends in any field, including the one I spend most of my time in: Data Science. This tool is Google Trends, which makes it easy to compare search terms over time, understand geographical patterns, and so on. Thanks to Google Trends, I can see and share with you evidence of an important industry trend:
The excitement about Big Data technologies and “data infrastructure” is leveling off. The market is now focusing on technologies that can truly extract information and value from data.
For someone who has been helping clients understand the Art and Science of data analytics for twenty years, it has been a bit puzzling to witness that for at least 18 months the world of Data Analytics seemed to be synonymous with data infrastructure technologies such as Hadoop, MapReduce, NoSQL, Pig and the plethora of open source platforms that make up the “Hadoop Ecosystem”. I am not implying that these technologies were not critical to the expansion of Data Analytics in the business world; indeed they were. However, at some point it appeared that the most important task was to find better and cheaper ways to store a variety of data assets, from unstructured data to images, videos and streaming data. The focus was on storage and processing, though, not on the analytics.
And, I wondered, why did the world all of a sudden seem to have such a sense of urgency about squeezing information out of unstructured data and social network chatter, doing real-time analysis of streaming data, and other “Big Data” tasks? In the world I was living in just a few months earlier, even large organizations were still learning how to truly leverage all the data they had in traditional relational repositories. All of a sudden the world seemed to be “analytically savvy” enough to venture into the most advanced types of data sources, as if everyone had already mastered the capabilities needed to develop and deploy effective analytics in their organization.
I also found the trend toward highly programmatic tools and technologies concerning. It seemed that the Data Scientist of 2014 had to look much more like a programmer (and a damn good one), able to put together Java code, Python scripts and complex distributed machine learning algorithms running directly against Hadoop clusters using Mahout. In our experience, the essence of the Art and Science of Data Analytics is being able to perform possibly complex data preparation steps, visualize and understand data, interpret model results, and so on. A programming-oriented environment, in my personal opinion, is generally not the most efficient way to accomplish these tasks.
Thus, when I first read that Big Data was suddenly no longer mentioned in the well-known Gartner “Hype Cycle” curve for 2015 (it was actually already on the descent in 2014), and that Machine Learning had basically taken its place at the peak of the curve, things started to make more sense. The focus has been shifting to disciplines and technologies that “do something” with all that data, rather than on the data infrastructure itself. Whether Big Data and its related technologies are no longer the buzzword of the day because they have met their promises and truly gone mainstream, or because they have failed to live up to those promises, is hard to say. Many organizations have indeed implemented, or are in the process of implementing, Hadoop repositories because of Hadoop’s capabilities and its very low cost of ownership. But even on the adoption side, the picture at the end of 2015 seems very different from the forecasts of just a couple of years ago.
In a 2013 study, IDC reported that 36% of the surveyed companies had implemented Hadoop in their environment, and an additional 31% had plans to implement it in the following 12 months. In 2014, Gartner reported that 73% of the organizations they surveyed had invested in Big Data technologies or had plans to invest in the next 24 months. However, in their 2015 study on the same topic, Gartner found that 54% of the respondents had “no plans to invest at this time” in Hadoop and Big Data technologies, and only 18% had plans to invest in the next 24 months. Clearly the excitement about implementing massive non-relational repositories for unstructured data leveled off quite quickly. My simple explanation is that very few organizations were ready to exploit such data assets with advanced analytics.
Now to Google Trends and what it reveals about the “pulse” of the industry simply by measuring which topics people are searching for on the Internet. The chart below compares searches for the terms “Big Data”, “Data Science” and “Machine Learning” over time, between the beginning of 2010 and the end of 2015.
The chart, which represents Google searches originating from the United States only, clearly shows that the interest in Big Data, at least as a search term, has reached a plateau. On the other hand, searches on topics such as “Data Science” and “Machine Learning” are still on the rise. It is also interesting to note that, according to Google Trends, the search most closely related to Big Data is “Big Data analytics”.
Thus, it seems that while the interest in “Big Data” in general has been leveling off, the interest in specific topics such as Machine Learning and Data Science continues to rise. Notice that at some point the term Big Data was broadly used to cover everything that had to do with data analysis on a large scale, including the quantitative disciplines that today we refer to as Data Science.
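For readers who would rather not eyeball a chart, the “plateau versus still rising” reading can be made a little more explicit. The sketch below is only an illustration under stated assumptions: it fits a straight line to the most recent points of an interest series and labels the recent trend by its slope. The function names, the 12-period window and the 0.5 tolerance are my own hypothetical choices, and the two series are synthetic stand-ins, not actual Google Trends data.

```python
from statistics import mean


def trend_slope(values):
    """Least-squares slope of the values against their position (units per period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to estimate a slope")
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def classify_recent_trend(values, window=12, tolerance=0.5):
    """Label the last `window` points as 'rising', 'falling' or 'plateau'.

    `window` and `tolerance` are illustrative assumptions, not calibrated values.
    """
    slope = trend_slope(values[-window:])
    if slope > tolerance:
        return "rising"
    if slope < -tolerance:
        return "falling"
    return "plateau"


# Synthetic series (NOT real search data): one grows and then flattens,
# the other keeps growing steadily.
flattening_series = [min(5 * i, 80) for i in range(36)]
growing_series = [2 * i for i in range(36)]

print(classify_recent_trend(flattening_series))
print(classify_recent_trend(growing_series))
```

With a real series exported from Google Trends, the same idea applies, although the window and tolerance would need to be chosen with the scale and seasonality of the data in mind.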
Similar trends appear when we use Google Trends to look at the interest in the technologies that represent “Big Data” and “Data Science” respectively. When we compare the search terms for Big Data related technologies, for example “Apache Hadoop”, “Apache Hive” and “Apache Spark” (which is related to machine learning through MLlib), we again see a leveling off of the interest in “Hadoop ecosystem” technologies, and a growing interest in technologies that are more focused on the processing and analysis of non-relational data repositories.
So, what is all this telling us? Well, first of all that Data Analytics is still very relevant to businesses, and that the focus is shifting from data infrastructure to advanced analytics. Whether the “Big Data” hype was just hype or not, only time will tell; however, it has brought the attention back to data-driven decision making.
Many of us have been here through the waves of keywords describing the Art and Science of understanding, transforming and modeling large data sets: Knowledge Discovery, Data Mining, Analytics, Big Data, Machine Learning, and so on. Not much has substantially changed under the sun, except for the scale of the data available and the refinement of modeling tools. The relevance of driving business decisions through quantitative data patterns has not changed either.
My conclusion is this: the practice of data analysis has gone through a series of hype-and-bust cycles, but every cycle takes us to a noticeably higher level of adoption, competence and actual change. It is anyone’s guess which new data-related technology is going to drive the next cycle, but no matter what, it will be welcomed.