A Simple Vocabulary of Data Science Concepts
This is a short summary of data science concepts I wrote for undergraduate students at Temple University.
Data are records that contain facts about some phenomenon. The data can be unstructured, such as natural-language text and images, or structured, such as weather readings and stock prices. The structure of the data is stored as metadata, that is, data about data. Metadata often take the form of a data dictionary that describes a dataset by defining a label, a data type and a short description for each column (variable) in the dataset. By dataset we usually (but not always) refer to a two-dimensional table that has variables as columns and observations as rows.
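To make the relationship between a dataset and its data dictionary concrete, here is a minimal Python sketch. The weather readings, the variable names and the conforms helper are all made up for illustration; real data dictionaries are usually kept in a document or spreadsheet rather than in code.

```python
# Hypothetical dataset: each dict is one observation (row),
# each key is one variable (column).
dataset = [
    {"station": "PHL", "date": "2024-01-01", "temp_c": 3.5},
    {"station": "PHL", "date": "2024-01-02", "temp_c": 1.0},
    {"station": "NYC", "date": "2024-01-01", "temp_c": 2.2},
]

# Hypothetical data dictionary (metadata): a label, a data type and
# a short description for each variable in the dataset.
data_dictionary = {
    "station": {"label": "Station", "type": str,
                "description": "Weather station code"},
    "date": {"label": "Date", "type": str,
             "description": "Date of the reading, YYYY-MM-DD"},
    "temp_c": {"label": "Temperature", "type": float,
               "description": "Air temperature in degrees Celsius"},
}


def conforms(rows, dictionary):
    """Return True if every row has exactly the documented variables
    and every value has the documented data type."""
    for row in rows:
        if set(row) != set(dictionary):
            return False
        for name, meta in dictionary.items():
            if not isinstance(row[name], meta["type"]):
                return False
    return True


print(conforms(dataset, data_dictionary))
```

The check passes only when the data and the metadata agree, which is the practical value of keeping a data dictionary next to a dataset.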
Data science is the study of the extraction of generalizable knowledge from data. Analytics is the application of data science methods for a purpose in a particular context. Information is data that have been processed, often through analytics, to be useful in some real-life context. By contrast, knowledge is expressed in the skillful application, or potential application, of information to practice.
To turn data into information and actionable knowledge, we test hypotheses, that is, falsifiable predictions based on an idea of how the world works. These ideas are often called theories or models. The data can come from many different sources. For instance, many organizations provide open data that are freely available for anyone to analyze. Data from different sources often require data cleansing and integration before we can perform analytics with them.
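Data cleansing and integration can be illustrated with a small Python sketch. The two sources, their field names and the cleaning rules (trim whitespace, normalize case, drop rows with a missing value) are assumptions chosen for illustration; real projects need rules that fit their own data.

```python
# Hypothetical raw records from two different sources.
raw_sales = [
    {"store_id": " a1 ", "revenue": "1200.50"},  # stray spaces, lower case
    {"store_id": "B2", "revenue": ""},           # missing value
    {"store_id": "b2", "revenue": "830"},
]
raw_stores = [
    {"store_id": "A1", "city": "Philadelphia"},
    {"store_id": "B2", "city": "Pittsburgh"},
]


def clean_sales(rows):
    """Cleansing: normalize identifiers, convert types, drop incomplete rows."""
    cleaned = []
    for row in rows:
        store_id = row["store_id"].strip().upper()
        revenue = row["revenue"].strip()
        if not revenue:
            continue  # skip observations with a missing revenue
        cleaned.append({"store_id": store_id, "revenue": float(revenue)})
    return cleaned


def integrate(sales, stores):
    """Integration: attach each store's city to its sales records."""
    city_by_store = {store["store_id"]: store["city"] for store in stores}
    return [
        {**sale, "city": city_by_store[sale["store_id"]]}
        for sale in sales
        if sale["store_id"] in city_by_store
    ]


integrated = integrate(clean_sales(raw_sales), raw_stores)
for row in integrated:
    print(row)
```

Only after these two steps do the records from both sources line up in one table that is ready for analytics.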
The data are usually stored as flat files (e.g. an Excel spreadsheet) or in a relational database. Very large datasets are often processed using technologies that distribute the data over a network to multiple computers, which process them in parallel. Such data are often called big data and are characterized by high velocity, variety and volume (the 3Vs).
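The idea of splitting data into parts, processing the parts in parallel and combining the partial results can be sketched on a single machine. The example below is only an analogy: it uses Python threads as stand-ins for the networked computers, and the function names are hypothetical. Real big data systems add storage, scheduling and fault tolerance on top of this pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values, n_chunks):
    """Split a list into at most n_chunks contiguous parts."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(chunk):
    """Work done by one worker: the sum and count of its own chunk."""
    return sum(chunk), len(chunk)


def distributed_mean(values, n_workers=4):
    """Compute a mean by combining partial results from several workers."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = split(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(chunk_sum for chunk_sum, _ in partials)
    count = sum(chunk_len for _, chunk_len in partials)
    return total / count


readings = list(range(1, 101))  # hypothetical readings 1..100
print(distributed_mean(readings))  # 50.5
```

No worker ever sees the whole dataset, yet the combined result equals the mean computed over all the data at once.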
There are many ways to represent data. These include data visualizations, infographics and dashboards, some of which may be interactive. Data visualizations are powerful ways to reveal patterns in the data but they can also be misleading if constructed incorrectly. Infographics combine data visualizations and other audiovisual elements to tell a data-based story. Another important way to turn data into information is to create indicators or metrics. For instance, a key performance indicator (KPI) measures performance against a stated goal. Scorecards are data visualizations that combine several KPIs and entities to give an overview of their performance.
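A KPI and a scorecard can be sketched in a few lines of Python. Here I assume one simple definition of a KPI, actual performance divided by the stated goal; the stores, the KPI names and the numbers are invented for illustration.

```python
def kpi_attainment(actual, goal):
    """Return actual performance as a fraction of the stated goal."""
    return actual / goal


# Hypothetical entities, each with (actual, goal) pairs for two KPIs.
scorecard_input = {
    "Store A": {"revenue": (120_000, 100_000), "new_customers": (45, 60)},
    "Store B": {"revenue": (80_000, 100_000), "new_customers": (66, 60)},
}


def build_scorecard(data):
    """Combine several KPIs and entities into one overview table."""
    return {
        entity: {
            name: round(kpi_attainment(actual, goal), 2)
            for name, (actual, goal) in kpis.items()
        }
        for entity, kpis in data.items()
    }


scorecard = build_scorecard(scorecard_input)
for entity, kpis in scorecard.items():
    print(entity, kpis)
```

A value of 1.0 or more means the goal was met. A real scorecard would present the same numbers visually, for example with colors or gauges.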
Data analytics is often divided into three main types according to the nature of the analytical output. Descriptive analytics reveals patterns and indicators in the data and thus describes the phenomenon of interest. Predictive analytics attempts to predict values for data we don’t have using the data we have. When the data we don’t have are about the future, we call this forecasting. Prescriptive analytics prescribes a course of action based on its predicted (forecasted) outcomes.
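The three types can be shown side by side on one small, made-up example. The sketch below assumes monthly demand figures, a straight-line trend fitted by least squares as the predictive model, and a simple ordering rule as the prescription. All of these are illustrative choices, not the only way to do each kind of analytics.

```python
from statistics import mean

# Hypothetical monthly demand (units) for months 1 to 6.
months = [1, 2, 3, 4, 5, 6]
demand = [100, 110, 120, 130, 140, 150]

# Descriptive: summarize what the data we have look like.
average_demand = mean(demand)


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Predictive (forecasting): estimate demand for month 7, which we don't have.
slope, intercept = fit_line(months, demand)
forecast_month_7 = slope * 7 + intercept


def recommend_order(forecast, stock_on_hand):
    """Prescriptive: order enough units to cover the forecasted demand."""
    return max(0, round(forecast) - stock_on_hand)


print(average_demand, forecast_month_7, recommend_order(forecast_month_7, 40))
```

Each step builds on the previous one: the description summarizes the past, the forecast extends it to the future, and the recommendation turns the forecast into an action.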