Re-reading a textbook - Data Mining: Concepts and Techniques
Back in the days when I was studying Computer Science at UIUC, data mining and data science never really crossed my mind. Nowadays they have become a total fad, and I began wondering: what has changed? Has it changed all that much? I remember that the popular subjects at the time (2006) were mobile/embedded devices and parallel (GPU) computing. Back then I knew machine learning was cool (AI was cool, as it has always been), but with all the required coursework on my plate, there was no room left to diversify.
I started getting serious about data after working on business analytics software at IBM. I was focused on globalization (i18n) engineering, but to internationalize analytics software I had to learn a lot of domain knowledge in analytics. My Data Mining 101 was learned by flipping through the SPSS manual, which did teach a lot of practical mining knowledge: how to configure and tune models, and how to solve typical business problems. Working on enterprise software also meant I had access to a lot of practical sample data to practice on.
Now, at the startup where I work, I am fortunate enough to work on data again. Working on data does not make you feel intelligent per se. As the internet joke goes, “In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.” The first half, in fact, matches my experience. In a startup there are not enough people to consult on how to build a data warehouse, let alone a big data pipeline using the latest Hadoop technologies. The internet, software manuals, and conferences became my textbook for getting that done. I really felt that 80% of the time was spent building data pipes to feed the data. The remaining 20% went to querying, aggregating, and creating visualizations.
I felt that my inadequacies in theory had hampered my progress. Reading a textbook like Data Mining: Concepts and Techniques bridged the gaps in my knowledge. If I put myself in a college student’s shoes, this book would be a terrible read. Who would be interested in cuboids in OLAP without any practical experience operating BI software? Cuboids are far from reality if you haven’t actually sliced and diced one. Fortunately, reading this textbook right now makes perfect sense to me.
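For readers who have never sliced and diced a cube, here is a minimal sketch of what a cuboid is, in plain Python. The sales rows, the dimension names, and the helper functions `cuboid` and `all_cuboids` are all made up for illustration; this is not code from the book or from any BI tool. The idea it tries to show: each subset of dimensions you group by gives one cuboid, so three dimensions give 2^3 = 8 of them.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, quarter, amount)
FACTS = [
    ("East", "Widget", "Q1", 100),
    ("East", "Gadget", "Q1", 150),
    ("West", "Widget", "Q1", 200),
    ("West", "Widget", "Q2", 50),
]
DIMENSIONS = ("region", "product", "quarter")


def cuboid(facts, dims):
    """Sum the amount grouped by the chosen dimensions (one cuboid)."""
    idx = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]
    return dict(totals)


def all_cuboids(facts):
    """Precompute one cuboid per subset of dimensions (2**n in total)."""
    return {
        dims: cuboid(facts, dims)
        for r in range(len(DIMENSIONS) + 1)
        for dims in combinations(DIMENSIONS, r)
    }


cube = all_cuboids(FACTS)
by_region = cube[("region",)]   # roll-up to region only
apex = cube[()]                 # grand total, no dimensions
west_q1 = cube[("region", "quarter")][("West", "Q1")]  # one cell
```

Precomputing every cuboid makes lookups like `west_q1` instant, but the number of cuboids doubles with each added dimension, which is the kind of trade-off that only feels real once you have waited on a slow BI query.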
Even in theory, you spend a lot of time on data preparation. The first five chapters of this book cover data understanding, data preparation, data warehousing, and BI technologies. That’s roughly 40% of the book covering the basics of getting data into your database and running some standard queries.
The intelligent part, modeling, comes right after and takes up eight chapters. In practice, I would love to do those things too, but there are too many warehouse-building tasks on my plate, and I have not had time to explore them. In fact, some BI tools alone are enough to find many insights. As the startup matures, mining data will certainly move the business to the next level.
Reading a textbook like this is quite dry, but organized knowledge clarifies a lot of your practical understanding. Comparing the advantages and disadvantages of each approach is especially insightful, e.g. when to do precomputation and when not to. I highly recommend reading a textbook after gaining experience in the data field.
As to my question: how did data science become a fad? What has changed in the last 10 years? In theory, the art of data science hasn’t really changed. Whether you read textbooks or tech manuals, the fundamental technical knowledge is much the same. The underlying principles of data modeling and data warehousing are still valid. What has changed is the number of new technologies, e.g. big data frameworks, logging, and visualization tools. The amount of hype, salary, media exposure, intriguing blogs, and educational material on data science has also increased. For a person working on data, the volume and flow of data have increased, and the database technologies are different. But the fundamental modeling approaches and techniques are the same. E.g. you still can’t get away from data cleaning, but new technologies help.
Many companies are aware that data is at the core of their operations and that it reflects their operational results. I feel that data science got its attention because data has increasingly become the core of many businesses.