How machine learning from data works
Many definitions of ML revolve around the automated detection of meaningful patterns in data. Two prominent examples include:
- AI pioneer Arthur Samuel defined ML in 1959 as a subfield of computer science that gives computers the ability to learn without being explicitly programmed.
- Tom Mitchell, one of the current leaders in the field, pinned down a well-posed learning problem more specifically: a computer program is said to learn from experience with respect to a task and a performance measure if its performance at the task, as measured by that performance measure, improves with experience (Mitchell 1997).
Experience is presented to an algorithm in the form of training data. The principal difference from previous attempts at building machines that solve problems is that the rules an algorithm uses to make decisions are learned from the data, rather than programmed by humans, as was the case, for example, for the expert systems prominent in the 1980s.
Recommended textbooks that cover a wide range of algorithms and general applications include James et al. (2013), Hastie, Tibshirani, and Friedman (2009), Bishop (2006), and Mitchell (1997).
The challenge – matching the algorithm to the task
The key challenge of automated learning is to identify patterns in the training data that are meaningful when generalizing the model's learning to new data. There are a large number of potential patterns that a model could identify, while the training data only constitutes a sample of the larger set of phenomena that the algorithm may encounter when performing the task in the future.
Without restricting the set of eligible candidates, the infinite number of functions that could have generated the observed outputs from the given inputs makes the search for the true function impossible. The types of patterns that an algorithm is capable of learning are limited by the size of its hypothesis space, which contains the functions it can possibly represent, and by the amount of information provided by the sample data.
The size of the hypothesis space varies significantly between algorithms, as we will see in the following chapters. On the one hand, this limitation enables a successful search, and on the other hand, it implies an inductive bias that may lead to poor performance when the algorithm generalizes from the training sample to new data.
Hence, the key challenge becomes how to choose a model with a hypothesis space large enough to contain a solution to the learning problem, yet small enough to ensure reliable learning and generalization given the size of the training data. With more informative data, a model with a larger hypothesis space has a better chance of being successful.
The no-free-lunch theorem states that there is no universal learning algorithm. Instead, a learner's hypothesis space has to be tailored to a specific task using prior knowledge about the task domain in order for the search for meaningful patterns that generalize well to succeed (Gómez and Rojas 2015).
We will pay close attention to the assumptions that a model makes about data relationships for a specific task throughout this chapter and emphasize the importance of matching these assumptions with empirical evidence gleaned from data exploration.
There are several categories of machine learning tasks that differ by purpose, available information, and, consequently, the learning process itself. The main categories are supervised, unsupervised, and reinforcement learning, and we will review their key differences next.
Supervised learning – teaching by example
Supervised learning is the most commonly used type of ML. We will dedicate most of the chapters in this book to applications in this category. The term supervised implies the presence of an outcome variable that guides the learning process; that is, it teaches the algorithm the correct solution to the task at hand. Supervised learning aims to capture a functional input-output relationship from individual samples that reflect this relationship and to apply its learning by making valid statements about new data.
Depending on the field, the output variable is also interchangeably called the label, target, or outcome, as well as the endogenous or left-hand side variable. We will use y_i for the outcome of observation i = 1, ..., N, or y for a (column) vector of outcomes. Some tasks come with several outcomes and are called multilabel problems.
The input data for a supervised learning problem is also known as features, as well as exogenous or right-hand side variables. We use x_i for the feature vector of observation i = 1, ..., N, or X in matrix notation, where each column contains a feature and each row an observation.
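To make this notation concrete, here is a minimal sketch with a small hypothetical dataset: the feature matrix X holds one column per feature and one row per observation, and y collects the corresponding outcomes.

```python
import pandas as pd

# Hypothetical sample of N = 4 observations with two features
X = pd.DataFrame({
    'feature_1': [0.2, 1.5, -0.3, 0.7],   # each column is a feature
    'feature_2': [1.1, 0.4, 2.2, -0.6],
})                                         # each row is an observation x_i
y = pd.Series([1.3, 2.4, 0.9, 1.8], name='outcome')  # outcome vector with entries y_i

print(X.shape)  # (4, 2): N observations by number of features
print(y.shape)  # (4,): one outcome per observation
```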
The solution to a supervised learning problem is a function f̂ that represents what the model learned about the input-output relationship from the sample and approximates the true relationship f. This function can potentially be used to infer statistical associations or even causal relationships among variables of interest beyond the sample, or it can be used to predict outputs for new input data.
The task of learning an input-outcome relationship from data that permits accurate predictions of outcomes for new inputs faces important trade-offs. More complex models have more moving parts that are capable of representing more nuanced relationships. However, they are also more likely to learn random noise particular to the training sample, as opposed to a systematic signal that represents a general pattern. When this happens, we say the model is overfitting to the training data. In addition, complex models may also be more difficult to inspect, making it more difficult to understand the nature of the learned relationship or the drivers of specific predictions.
Overly simple models, on the other hand, will miss complex signals and deliver biased results. This trade-off is known as the bias-variance trade-off in supervised learning, but conceptually, this also applies to the other forms of ML where too simple or too complex models may perform poorly beyond the training data.
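To make the bias-variance trade-off concrete, the following sketch (using simulated data, not an example from later chapters) fits polynomials of increasing degree to noisy observations of a sine curve. The low-degree model underfits, while the high-degree model chases noise in the training sample and performs markedly worse on held-out data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=100).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=100)  # systematic signal plus noise

x_train, y_train = x[:70], y[:70]   # training sample
x_test, y_test = x[70:], y[70:]     # held-out data standing in for "new" inputs

for degree in (1, 3, 15):           # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    print(f'degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}')
```

The polynomial degree plays the role of the hypothesis space here: raising it always lowers the training error, but past some point the test error starts to rise again.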
Unsupervised learning – uncovering useful patterns
When solving an unsupervised learning problem, we only observe the features and have no measurements of the outcome. Instead of predicting future outcomes or inferring relationships among variables, unsupervised algorithms aim to identify structure in the input that permits a new representation of the information contained in the data.
Frequently, the measure of success is the contribution of the result to the solution of some other problem. This includes identifying commonalities, or clusters, among observations, or transforming features to obtain a compressed summary that captures relevant information.
The key challenge is that unsupervised algorithms have to accomplish their mission without the guidance provided by outcome information. As a consequence, we are often unable to evaluate the result against a ground truth as in the supervised case, and its quality may be in the eye of the beholder. However, sometimes, we can evaluate its contribution to a downstream task, for example when dimensionality reduction enables better predictions.
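As a minimal illustration of such a downstream evaluation (with synthetic data and an arbitrary choice of PCA plus logistic regression), we can compare a classifier's cross-validated score with and without a dimensionality-reduction step and treat the difference as an indirect quality measure:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic classification task with many noisy, correlated features
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           n_redundant=20, random_state=0)

baseline = LogisticRegression(max_iter=1000)
with_pca = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))

print('raw features:', cross_val_score(baseline, X, y, cv=5).mean().round(3))
print('after PCA   :', cross_val_score(with_pca, X, y, cv=5).mean().round(3))
```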
There are numerous approaches, from well-established cluster algorithms to cutting-edge deep learning models, and several relevant use cases for our purposes.
Use cases – from risk management to text processing
There are numerous trading use cases for unsupervised learning that we will cover in later chapters:
- Grouping securities with similar risk and return characteristics (see hierarchical risk parity in Chapter 13, Data-Driven Risk Factors and Asset Allocation with Unsupervised Learning)
- Finding a small number of risk factors driving the performance of a much larger number of securities using principal component analysis (Chapter 13, Data-Driven Risk Factors and Asset Allocation with Unsupervised Learning) or autoencoders (Chapter 19, RNN for Multivariate Time Series and Sentiment Analysis)
- Identifying latent topics in a body of documents (for example, earnings call transcripts) that comprise the most important aspects of those documents (Chapter 14, Text Data for Trading – Sentiment Analysis)
At a high level, these applications rely on methods to identify clusters and methods to reduce the dimensionality of the data.
Cluster algorithms – seeking similar observations
Cluster algorithms apply a concept of similarity to identify observations or data attributes that contain comparable information. They summarize a dataset by assigning a large number of data points to a smaller number of clusters so that the cluster members are more closely related to each other than to members of other clusters.
Cluster algorithms differ in what they assume about how the various groupings were generated and what makes them alike. As a result, they tend to produce alternative types of clusters and should thus be selected based on the characteristics of the data. Some prominent examples, illustrated in the code sketch that follows this list, are:
- K-means clustering: Data points belong to one of k clusters of similar size that take a roughly spherical form.
- Gaussian mixture models: Data points have been generated by one of several multivariate normal distributions.
- Density-based clusters: Clusters are of arbitrary shape and defined only by the existence of a minimum number of nearby data points.
- Hierarchical clusters: Data points belong to various supersets of groups that are formed by successively merging smaller clusters.
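The following sketch runs these four approaches on synthetic data with scikit-learn; the parameter choices (number of clusters, DBSCAN's eps and min_samples) are illustrative assumptions rather than recommendations:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Synthetic data: three well-separated Gaussian blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
gmm = GaussianMixture(n_components=3, random_state=0).fit_predict(X)
dbscan = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)      # density-based: no k required
hierarchical = AgglomerativeClustering(n_clusters=3).fit_predict(X)

for name, labels in [('k-means', kmeans), ('GMM', gmm),
                     ('DBSCAN', dbscan), ('hierarchical', hierarchical)]:
    print(name, np.unique(labels))   # cluster labels found by each algorithm
```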
Dimensionality reduction – compressing information
Dimensionality reduction produces new data that captures the most important information contained in the source data. Rather than grouping data into clusters while retaining the original data, these algorithms transform the data with the goal of using fewer features to represent the original information.
Algorithms differ with respect to how they transform the data and, thus, the nature of the resulting compressed dataset, as shown in the following list and in the brief PCA sketch that follows it:
- Principal component analysis (PCA): Finds the linear transformation that captures most of the variance in the existing dataset
- Manifold learning: Identifies a nonlinear transformation that yields a lower-dimensional representation of the data
- Autoencoders: Use a neural network to compress data nonlinearly with minimal loss of information
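As a brief PCA sketch (with randomly generated data standing in for, say, a panel of security returns), the following code compresses ten correlated features into three components and reports how much variance each captures:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 250 observations of 10 correlated features driven by two latent factors
common = rng.normal(size=(250, 2))
loadings = rng.normal(size=(2, 10))
X = common @ loadings + rng.normal(scale=0.1, size=(250, 10))

X_std = StandardScaler().fit_transform(X)     # PCA is sensitive to feature scale
pca = PCA(n_components=3).fit(X_std)
X_compressed = pca.transform(X_std)           # 10 features -> 3 components

print(pca.explained_variance_ratio_)          # share of variance captured per component
print(X_compressed.shape)                     # (250, 3)
```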
We will dive deeper into these unsupervised learning models in several of the following chapters, including important applications to natural language processing (NLP) in the form of topic modeling and Word2vec feature extraction.
Reinforcement learning – learning by trial and error
Reinforcement learning (RL) is the third type of ML. It centers on an agent that needs to pick an action at each time step, based on information provided by the environment. The agent could be a self-driving car, a program playing a board game or a video game, or a trading strategy operating in a certain security market. You can find an excellent introduction in Sutton and Barto (2018).
The agent aims to choose the action that yields the highest reward over time, based on a set of observations that describes the current state of the environment. It is both dynamic and interactive: the stream of positive and negative rewards impacts the algorithm's learning, and actions taken now may influence both the environment and future rewards.
The agent needs to take action right from the start and learns in an "online" fashion, one example at a time as it goes along. The learning process follows a trial-and-error approach because the agent needs to manage the trade-off between exploiting a course of action that has yielded a certain reward in the past and exploring new actions that may increase the reward in the future. RL algorithms optimize the agent's learning using dynamical systems theory and, in particular, the optimal control of Markov decision processes with incomplete information.
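The exploration-exploitation trade-off can be illustrated with an epsilon-greedy rule on a simple multi-armed bandit. This is a toy sketch with made-up reward values, not one of the trading agents developed later: with probability epsilon the agent explores a random action, and otherwise it exploits the action with the highest estimated reward so far.

```python
import numpy as np

rng = np.random.default_rng(1)
true_rewards = np.array([0.2, 0.5, 0.8])   # hypothetical mean reward of each action
n_actions = len(true_rewards)
value_estimates = np.zeros(n_actions)      # running estimate of each action's reward
action_counts = np.zeros(n_actions)
epsilon = 0.1                              # probability of exploring

for step in range(1000):
    if rng.random() < epsilon:
        action = rng.integers(n_actions)           # explore: try a random action
    else:
        action = int(np.argmax(value_estimates))   # exploit: best action so far
    reward = rng.normal(loc=true_rewards[action], scale=0.1)
    action_counts[action] += 1
    # Incremental update of the running average reward for the chosen action
    value_estimates[action] += (reward - value_estimates[action]) / action_counts[action]

print(np.round(value_estimates, 2))  # estimates should approach the true mean rewards
```

With epsilon set to zero, the agent can lock onto whichever action happens to pay off first and never discover the better alternative; a small amount of exploration guards against that.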
RL differs from supervised learning, where the training data lays out both the context and the correct decision for the algorithm. It is tailored to interactive settings where the outcomes only become available over time and learning must proceed in a continuous fashion as the agent acquires new experience. It also differs from unsupervised learning because feedback on the actions taken becomes available, albeit with a delay. Some of the most notable recent progress in artificial intelligence (AI) involves RL algorithms that use deep learning to approximate functional relationships between actions, environments, and future rewards.
RL is particularly suitable for algorithmic trading because the model of a return-maximizing agent in an uncertain, dynamic environment has much in common with an investor or a trading strategy that interacts with financial markets. We will introduce RL approaches to building an algorithmic trading strategy in Chapter 22, Deep Reinforcement Learning – Building a Trading Agent.