Welcome to the Shape of Data blog. Over the next few months, I plan to write a number of posts illustrating how understanding the geometry behind data analysis can lead to deeper insights and a more intuitive understanding of the data. But before we delve into geometry, it might be a good idea to discuss the many different names that are often used to describe the analysis of large data sets.
Most of these names have become essentially synonyms of each other, but they all have different origins and nuances. In academia, at least, a researcher’s intellectual background has a big impact on who they know and work with, what conferences they go to and what terminology they use. I don’t know how much of a difference it makes outside the university setting, but regardless, I think it’s a good idea to know the subtle implications of these different terms.
Machine Learning – This term comes from computer science, particularly artificial intelligence. Originally, the idea was to have a computer “learn” a pattern from a data set, then make decisions about new situations based on what it had learned. This is now called supervised learning (classification is the most familiar example), and machine learning has since expanded to cover a much wider range of data analysis.
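To make the “learn a pattern, then decide about new situations” idea concrete, here is a minimal sketch of one very simple classifier: it learns the average point (centroid) for each label, then assigns a new point to the label with the nearest centroid. The function names `fit_centroids` and `predict` and the toy data are my own assumptions for illustration, not a standard library API.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available in Python 3.8+


def fit_centroids(points, labels):
    """Learn one centroid (mean point) per label from labeled examples."""
    sums = {}
    counts = defaultdict(int)
    for point, label in zip(points, labels):
        if label not in sums:
            sums[label] = [0.0] * len(point)
        for i, x in enumerate(point):
            sums[label][i] += x
        counts[label] += 1
    return {
        label: tuple(s / counts[label] for s in total)
        for label, total in sums.items()
    }


def predict(centroids, point):
    """Assign a new point to the label whose centroid is closest."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


# Hypothetical toy data: two clusters of 2-D points with made-up labels.
training_points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
training_labels = ["small", "small", "large", "large"]

centroids = fit_centroids(training_points, training_labels)
prediction = predict(centroids, (2.0, 1.5))
```

The “learning” step is the call to `fit_centroids`, and the “decision about a new situation” is the call to `predict`. Real machine learning methods are far more sophisticated, but most supervised methods follow this same two-step shape.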
Data Mining – This term originally referred to a subfield of statistics. In some sense, the goal of all statistics is to analyze and summarize data, but data mining is (or at least was originally) the field of statistics that focused on large, high-dimensional data sets.
Signal Processing – As the name implies, the engineering field of signal processing is the study of how to encode and decode signals. The decoding part is where the data analysis comes in, since you can think of a data set as a (possibly noisy) signal. Many of the techniques that have become standard in data analysis have their roots in signal processing.
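As a rough illustration of treating a data set as a noisy signal, here is a minimal sketch of a moving average, one of the simplest smoothing filters. The function name `moving_average` and the sample values are assumptions made up for this example.

```python
def moving_average(signal, window=3):
    """Smooth a sequence by averaging each run of `window` consecutive values."""
    if window < 1 or window > len(signal):
        raise ValueError("window must be between 1 and len(signal)")
    return [
        sum(signal[i:i + window]) / window
        for i in range(len(signal) - window + 1)
    ]


# Hypothetical noisy readings that trend upward with some jitter.
noisy = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
smoothed = moving_average(noisy, window=3)
```

The smoothed sequence is shorter than the input because only full windows are averaged. A larger window removes more noise, but it also blurs more of the underlying signal.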
Knowledge Discovery – Short for Knowledge Discovery in Databases (KDD), this term refers to a multi-step process in which data is accumulated in a database, analyzed, then interpreted. Technically, data mining/machine learning is just one step in this process. KDD is often associated with relational databases (such as MySQL) as opposed to the newer and less structured storage systems (such as NoSQL databases and Hadoop) usually associated with “Big Data” (see below).
(Business) Analysis/Analytics/Intelligence – As suggested by what’s in the parentheses, these terms refer specifically to the use of data analysis in business. There’s plenty of confusion about what each term means specifically, but the general consensus seems to be that analytics refers to the computational part (the processing of data) and analysis is the human part (interpreting the data and making decisions based on it). Intelligence refers mostly to accumulating and organizing the data, though some sources suggest that business intelligence (BI) can also refer to the whole process – gather, process, interpret – similar to knowledge discovery.
Data Science – Recently, “Data Scientist” has become a popular job title at companies looking for technical experts with interdisciplinary backgrounds. A number of people have pointed out that this is something of a misnomer, since by definition, all scientists study data. If it were up to me, I would have used a term more like “business science”, since data science usually means the application of data analysis techniques to business problems. But I’ll admit that “business science” doesn’t sound as cool as “data science”.
Big Data – This term generally refers to the challenges and promises associated with larger and larger data sets. But it was also (at least originally) meant to allude to the term “Big Oil”, suggesting the massive corporations that have exploited and will continue to exploit this new resource. In practice, much of the technology associated specifically with “Big Data” (such as Hadoop) is designed for accumulating, storing and dispensing unstructured data. To gain insights from this data, it is often necessary to extract a structured form of data via the Big Data machinery, then analyze it using “small data” techniques.
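The “extract a structured form, then analyze it” pattern can be sketched at toy scale. The example below assumes a made-up log format of `key=value` pairs; the sample lines, the field names and the helper `extract_records` are all hypothetical, and at real scale the extraction step would run on something like Hadoop rather than in a single Python process.

```python
import re
from statistics import mean

# Hypothetical unstructured input: free-form log lines, one of them malformed.
raw_lines = [
    "2013-01-05 user=alice action=purchase amount=19.99",
    "malformed line with no fields",
    "2013-01-06 user=bob action=purchase amount=5.00",
    "2013-01-07 user=alice action=refund amount=19.99",
]

FIELD = re.compile(r"(\w+)=(\S+)")
REQUIRED = {"user", "action", "amount"}


def extract_records(lines):
    """Turn raw lines into structured records, skipping lines missing fields."""
    records = []
    for line in lines:
        fields = dict(FIELD.findall(line))
        if fields.keys() >= REQUIRED:
            fields["amount"] = float(fields["amount"])
            records.append(fields)
    return records


# Step 1: extraction (the "Big Data machinery" in miniature).
records = extract_records(raw_lines)

# Step 2: an ordinary "small data" analysis on the structured result.
purchases = [r["amount"] for r in records if r["action"] == "purchase"]
average_purchase = mean(purchases)
```

Once the records are structured, the analysis step does not care how large or messy the original source was. That is why the two stages tend to use such different tools.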
Informatics – Since bioinformatics is the application of data analysis to biology (particularly molecular biology), it seems like informatics should also be a synonym of data science. However, Informatik is the German word for computer science, so it would get confusing if we tried to use it in the specific sense of data analysis. Even so, it does occasionally get used in that sense.
Are there any others that I missed? These were all the words I could think of, but I’m sure there are more. If you know of any, please let me know in the comments.