Innovation and Alternative Data Archives

Tuning Machine Learning Models

Tuning is the process of maximizing a model’s performance without overfitting or introducing excessive variance. In machine learning, this is accomplished by selecting appropriate “hyperparameters.”

Hyperparameters can be thought of as the “dials” or “knobs” of a machine learning model. Choosing an appropriate set of hyperparameters is crucial for model accuracy, but can be computationally challenging. Hyperparameters differ from other model parameters in that they are not learned by the model automatically through training methods. Instead, these parameters must be set manually. Many methods exist for selecting appropriate hyperparameters. This post focuses on three:

Grid Search
Random Search
Bayesian Optimization

Grid Search

Grid Search, also known as parameter sweeping, is one of the most basic and traditional methods of hyperparameter optimization. This method involves manually defining a subset of the hyperparameter space and exhaustively evaluating every combination of the specified values. Each combination’s performance is then evaluated, typically using cross-validation, and the best-performing combination is chosen.

For example, say you have two continuous parameters α and β, where manually selected values for the parameters are the following:

Then the pairing of the selected hyperparametric values, H, can take on any of the following:

Grid search will examine each pairing of α and β to determine the best performing combination. The resulting pairs, H, are simply the elements of the Cartesian product of the candidate values for α and β. While straightforward, this “brute force” approach for hyperparameter optimization has some drawbacks. Higher-dimensional hyperparameter spaces are far more time consuming to test than the simple two-dimensional problem presented here, because the number of combinations grows multiplicatively with each added hyperparameter. Also, because there will always be a fixed number of training samples for any given model, the model’s predictive power can decrease as the number of dimensions increases. This is known as the Hughes phenomenon.
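As a minimal sketch of the Cartesian-product idea described above: the candidate values and the scoring function below are hypothetical stand-ins chosen for illustration (they are not the values from the original example), and in practice the score would come from cross-validation.

```python
from itertools import product

# Hypothetical candidate values for the two hyperparameters (assumed for illustration).
alphas = [0.1, 0.5, 1.0]
betas = [0.01, 0.1]

def score(alpha, beta):
    # Stand-in for a cross-validated score; higher is better.
    return -((alpha - 0.5) ** 2 + (beta - 0.1) ** 2)

# H is the Cartesian product of the candidate values: every (alpha, beta) pairing.
grid = list(product(alphas, betas))

# Grid search evaluates every pairing and keeps the best one.
best = max(grid, key=lambda h: score(*h))
print(len(grid), best)
```

With three values for α and two for β the grid has six pairings; adding a third hyperparameter with four values would raise that to 24.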

Random Search

Random search methods resemble grid search methods but tend to be less expensive and time consuming because they do not examine every possible combination of parameters. Instead of testing on a predetermined subset of hyperparameters, random search, as its name implies, randomly selects a chosen number of hyperparametric pairs from a given domain and tests only those. This greatly simplifies the analysis without significantly sacrificing optimization. For example, if the region of hyperparameters that are near optimal occupies at least 5% of the grid, then random search with 60 trials will find that region with high probability (95%).

To illustrate, imagine a 15 x 30 grid of two hyperparameter values and their resulting scores ranging from 0-10, where 10 is the best possible hyperparameter pairing (Table 1).

Table 1 – Grid of Hyperparameter Values and Scores

Highlighted in green are the 21 pairings with the highest scores out of the 450 total combinations. Let’s take these 21 pairings to be our desired target range. What if we were to sample points from this grid to see if any lands within the target? Each random draw has a 21/450 ≈ 4.67% chance of doing so. If we randomly select 60 points, all independent of one another, then the probability that none of them land in the target, or in other words all of them miss, is

$$\left(1 - \frac{21}{450}\right)^{60} \approx 0.0568$$

Therefore, the probability that at least one of them succeeds in hitting the desired interval is 1 minus that quantity:

$$1 - \left(1 - \frac{21}{450}\right)^{60} \approx 0.9432$$

In this particular example, sampling just 60 points from our hyperparameter space yields over a 94% chance of selecting a hyperparameter value within our desired interval near the maximum value. In other words, in a scenario with a 5% desired interval around the true maximum, sampling just 60 points will yield a sufficient hyperparameter pairing 95% of the time.
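The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch of the calculation, and it uses the same simplifying assumption as the text: the 60 draws are independent.

```python
# Figures taken from the example: 21 target cells in a 450-cell grid, 60 draws.
target_cells, total_cells, trials = 21, 450, 60

p_hit = target_cells / total_cells        # chance a single draw lands in the target
p_all_miss = (1 - p_hit) ** trials        # chance that all 60 draws miss
p_at_least_one = 1 - p_all_miss           # chance that at least one draw hits

# The general 5% rule quoted above, using the same formula.
p_rule = 1 - (1 - 0.05) ** 60

print(round(p_hit, 4), round(p_all_miss, 4), round(p_at_least_one, 4), round(p_rule, 4))
```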

There are two main benefits to using the random search method. The first is that a budget can be chosen independent of the number of parameters and possible values. Based on how much time and computing resources you have available, random search allows you to choose a sample size that conforms to a budget but still allows for a representative sample of the hyperparameter space. The second benefit is that adding parameters that do not influence performance does not decrease efficiency.
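Both benefits can be seen in a small sketch. The function name `random_search`, the toy objective, and the parameter names are assumptions made for illustration; the budget (`n_trials`) is set independently of how many hyperparameters there are, and the parameter called `irrelevant` has no effect on the score.

```python
import random

def random_search(objective, bounds, n_trials, seed=0):
    """Sample n_trials random points within bounds and return the best one found."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw each hyperparameter uniformly from its own range.
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        current = objective(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

def toy_objective(params):
    # Hypothetical score: only "alpha" matters, with its optimum at 0.3.
    return -((params["alpha"] - 0.3) ** 2)

bounds = {"alpha": (0.0, 1.0), "irrelevant": (0.0, 1.0)}
best_params, best_score = random_search(toy_objective, bounds, n_trials=60)
print(best_params, best_score)
```

Because each trial draws a fresh value of `alpha`, all 60 trials explore 60 distinct values of the parameter that matters. A grid of similar size would spend most of its budget repeating the same few `alpha` values across different settings of the irrelevant parameter.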

Bayesian Optimization

The idea behind Bayesian Optimization is fundamentally different from grid and random search. This process builds a probabilistic model for a given function and analyzes this model to make decisions about where to next evaluate the function. There are two main components under the Bayesian optimization framework.

A prior function that captures the behavior of the unknown objective function, together with an observation model that describes the data generation mechanism.
A loss function, or an acquisition function, that describes how optimal a sequence of queries is, usually taking the form of regret.

The most common selection for a prior function in Bayesian Optimization is the Gaussian process (GP) prior. This is a particular kind of statistical model where observations occur in a continuous domain. In a Gaussian process, every point in the defined continuous input space is associated with a normally distributed random variable. Additionally, every finite collection of those random variables has a multivariate normal distribution (equivalently, every finite linear combination of them is normally distributed).

There are a number of options when choosing an acquisition function. Choosing an acquisition function requires choosing a trade-off in exploration of the entire search space vs. exploitation of current promising areas.

Probability of Improvement

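One approach is to choose an improvement-based acquisition function, which favors points that are likely to improve upon an incumbent target. This strategy involves maximizing the probability of improving (PI) over the best current value. If using a Gaussian posterior distribution, this can be calculated as follows (written here for a minimization problem):

$$a_{PI}(x) = \Phi(\gamma(x))$$

Where,

$$\gamma(x) = \frac{f(x_{best}) - \mu(x)}{\sigma(x)}$$

Here μ(x) and σ(x) are the posterior mean and standard deviation at x, f(x_best) is the best value observed so far, and Φ is the standard normal cumulative distribution function.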

In each iteration, the probability of improving is maximized to select the next query point. Although the probability of improvement can perform very well when the target is known, using this method for an unknown target causes the PI to lose reliability.

Expected Improvement

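Another strategy is to maximize the expected improvement (EI) over the current best. Unlike the probability of improvement function, the expected improvement also incorporates the amount of improvement. Assuming a Gaussian process, and writing μ(x) and σ(x) for the posterior mean and standard deviation at x, this can be calculated as follows (for a minimization problem):

$$a_{EI}(x) = \sigma(x)\left[\gamma(x)\,\Phi(\gamma(x)) + \phi(\gamma(x))\right], \qquad \gamma(x) = \frac{f(x_{best}) - \mu(x)}{\sigma(x)}$$

where Φ and φ are the standard normal cumulative distribution and density functions.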

Gaussian Process Upper Confidence Bound

Another method takes the idea of exploiting lower confidence bounds (upper when considering maximization) to construct acquisition functions that minimize regret over the course of their optimization. This requires the user to define an additional tuning value, κ, that balances exploitation against exploration. This lower confidence bound (LCB) for a Gaussian process is defined as follows:

$$a_{LCB}(x) = \mu(x) - \kappa\,\sigma(x)$$

where μ(x) and σ(x) are the posterior mean and standard deviation at x.
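The three acquisition functions discussed in this section can be sketched in a few lines of Python. This is an illustrative sketch in the minimization convention, not a full Bayesian optimization loop: the posterior mean `mu` and standard deviation `sigma` at a candidate point are assumed to come from an already-fitted Gaussian process, and the function names and the default `kappa` are choices made here for illustration.

```python
import math

def norm_pdf(z):
    # Standard normal density.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    # Standard normal cumulative distribution function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def _gamma(mu, sigma, f_best):
    # Standardized improvement of the posterior mean over the incumbent (minimization).
    return (f_best - mu) / sigma

def probability_of_improvement(mu, sigma, f_best):
    return norm_cdf(_gamma(mu, sigma, f_best))

def expected_improvement(mu, sigma, f_best):
    g = _gamma(mu, sigma, f_best)
    return sigma * (g * norm_cdf(g) + norm_pdf(g))

def lower_confidence_bound(mu, sigma, kappa=2.0):
    # Larger kappa favors exploration of uncertain regions.
    return mu - kappa * sigma

# Two hypothetical candidates with the same mean but different uncertainty.
f_best = 1.0
certain = dict(mu=1.1, sigma=0.05)
uncertain = dict(mu=1.1, sigma=0.50)
print(expected_improvement(f_best=f_best, **certain),
      expected_improvement(f_best=f_best, **uncertain))
```

In this toy comparison the more uncertain candidate receives the larger expected improvement, which illustrates the exploration side of the trade-off.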

There are a few limitations to consider when choosing Bayesian Optimization over other hyperparameter optimization methods. The power of the Gaussian process depends highly on the covariance function, and it is not always clear what the appropriate covariance function choice should be. Another factor to consider is that the function evaluation itself may involve a time-consuming optimization procedure. It’s important to find the best hyperparameters for your model, but in many cases, the complexity associated with finding the best hyperparameters using Bayesian Optimization may exceed the project’s established budget. If possible, one should always consider utilizing parallel computing when performing this technique to maximize computing resources and cut back on time.

Conclusion

Choosing an appropriate set of hyperparameters is crucial for machine learning model accuracy. We have discussed three different approaches for selecting hyperparameter values and the trade-offs associated with choosing one optimization method over another. Time, budget, and computing abilities are all factors to consider when choosing a method. Small hyperparameter spaces and lax restraints for budget and computing resources may make Grid Search the best option. For larger hyperparameter spaces or more computing constraints, a simple random search with a sufficient sample size or a Bayesian optimization technique may be more appropriate.


Evaluating Supervised and Unsupervised Learning Models

Model evaluation is the process of objectively measuring how well machine learning models perform the specific tasks they were designed to do, such as predicting a stock price or appropriately flagging credit card transactions as fraud. Because each machine learning model is unique, optimal methods of evaluation vary depending on whether the model in question is “supervised” or “unsupervised.” Supervised machine learning models make specific predictions or classifications based on labeled training data, while unsupervised machine learning models seek to cluster or otherwise find patterns in unlabeled data.

Unsupervised Learning

Common unsupervised learning techniques include clustering, anomaly detection, and neural networks. Each technique calls for a different method of evaluating performance. We’ll focus on clustering models as an example. Clustering is the task of grouping a set of objects in such a way that objects in the same cluster are more similar to each other than they are to objects in other clusters. Various algorithms are capable of clustering, including k-means and hierarchical clustering, which differ in their definitions of a cluster and how to find one.

Evaluating Unsupervised Learning Models

Let’s assume that we need to cluster banking customers together into groups based on the amount and magnitude of risk they pose. After the clustering algorithm has grouped the customers into distinct clusters, we need to evaluate how well those clusters were formed. The lack of labels on an unsupervised learning model’s training data makes evaluation problematic because there is nothing to which the model’s results can be meaningfully compared. If we were to manually group these customers, we could then compare our manual groupings with the algorithm’s, but often this is not an option due to time or labor constraints, so we need a more efficient way to determine how well the algorithm performed.

One way would be to determine 1) how close each customer within each cluster is to every other customer in its cluster (the “intra-cluster” distance) and 2) how close each cluster of customers is to other clusters (the “inter-cluster” distance), and then to compare the two distances. Models that produce relatively small intra-cluster distances and relatively large inter-cluster distances evaluate favorably because they appear to be doing a good job of grouping similar customers into well-separated clusters.
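A minimal sketch of this comparison follows. The two clusters and their two-dimensional risk features are made-up data, and measuring the inter-cluster distance between centroids is just one of several reasonable choices.

```python
import math
from itertools import combinations

def mean_intra_cluster_distance(cluster):
    # Average Euclidean distance over all pairs of points within one cluster.
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def centroid(cluster):
    # Coordinate-wise mean of the points in a cluster.
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

# Hypothetical customers described by two risk features.
low_risk = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
high_risk = [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

intra = (mean_intra_cluster_distance(low_risk)
         + mean_intra_cluster_distance(high_risk)) / 2
inter = math.dist(centroid(low_risk), centroid(high_risk))

# A small ratio suggests tight, well-separated clusters.
print(intra, inter, intra / inter)
```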

Supervised Learning

Within supervised learning there are techniques for both regression and classification tasks. While some techniques are suited only to regression or only to classification, others can be used for both. For example, linear regression can only be used for regression while support vector machines and random forests can be used for either. While each of these is a different technique, the metrics that we use to evaluate them are the same, so we can even compare these models to one another. In our examples, we’ll focus on flagging credit card purchases as fraud, a classification task, and predicting housing prices, a regression task.

Evaluating Supervised Models

The task of evaluating how well a supervised learning model performs is more straightforward. Because supervised learning models learn from labeled training data, once they have been fitted using training data, they can be tested against data drawn from the same population, which therefore carries the same kind of labels.

For example, let’s say we need to classify whether a credit card transaction is fraudulent and we have a dataset of transactions with labels of either “fraud” or “not fraud.” We can (and sometimes do¹) train our model on all the available data, but this prevents us from fairly evaluating it because no “independent” data remains for testing and overfitting² becomes difficult to detect. This problem can be avoided by splitting the available data into training and testing sets.

This can be accomplished in various ways. For simplicity, we’ll first talk about splitting our dataset into two sets: a training set (typically 70% of the whole dataset) from which the model learns and a test set (the other 30%). Because the test set is withheld from the model during training, it can contribute to an unbiased evaluation of how well a model performs on previously unseen data. This helps expose overfitting and allows us to estimate how our model would perform “in the wild” on new data as it emerges.
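A 70/30 split can be sketched with the standard library alone. The function name, the seed, and the synthetic transaction list below are assumptions made for illustration.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and hold out test_fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = records[:]            # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Synthetic labeled transactions: (id, label), with roughly 5% fraud.
transactions = [(i, "fraud" if i % 20 == 0 else "not fraud") for i in range(100)]
train, test = train_test_split(transactions)
print(len(train), len(test))
```

Shuffling before splitting matters when the records are stored in a meaningful order (for example by date), since otherwise the test set may not resemble the training set.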

Cross-validation is another antidote for overfitting. Cross-validation involves partitioning data into multiple groups and then training and testing models on different group combinations. For example, in a 5-fold cross-validation we would split our transaction data set into five partitions of equal sizes. We would then train our model on four of those five partitions and test our model on the remaining partition. We would then repeat the process, selecting a different partition to be the test group and training a new model on the remaining set of four partitions. We would repeat three more times, for a total of five rounds of cross-validation, one for each fold. We will then have five different models, each having been trained and tested on a different subset of data and each having its own weights and prediction accuracy. At the end, we average the five accuracy scores to estimate how well the modeling approach generalizes; the final predictive model is then typically refit on all of the available data.
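The five-round procedure can be sketched as follows. The helper names, the trivial majority-class "model", and the toy data are all assumptions for illustration; the sketch averages the per-fold scores, which is the usual way to summarize a cross-validation run.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]

def cross_val_score(fit, evaluate, X, y, k=5):
    """Train and test once per fold; return the mean score and the per-fold scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(evaluate(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return sum(scores) / len(scores), scores

def fit_majority(X_train, y_train):
    # Toy "model": always predict the most common training label.
    return max(set(y_train), key=y_train.count)

def accuracy(model, X_test, y_test):
    return sum(label == model for label in y_test) / len(y_test)

X = list(range(20))
y = [1 if i % 5 == 0 else 0 for i in range(20)]   # 1 = fraud, 0 = not fraud
mean_score, fold_scores = cross_val_score(fit_majority, accuracy, X, y, k=5)
print(mean_score, fold_scores)
```

Real data would normally be shuffled (or stratified by label) before the folds are formed; the contiguous folds here keep the sketch short.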

Classification metrics are the measures against which models are evaluated. The simplest and most common such metric is accuracy. Accuracy is computed by dividing the number of correct predictions by the total number of predictions. In our supervised transaction classification model example, if we tested our model on one hundred transactions and correctly predicted their label (fraud/not fraud) for ninety-five of them, then the accuracy of our model is 95%.

Accuracy is the simplest, most understandable metric we can use, but we wouldn’t want to rely on accuracy alone because it doesn’t distinguish between false positives, transactions incorrectly classified as fraud, and false negatives, transactions incorrectly classified as non-fraud. For this we need a confusion matrix.

A confusion matrix is a 2-by-2 table that sorts predictions into one of four classifications: true positive, true negative, false positive, and false negative. Our transaction classification model might generate a confusion matrix like this one:

                      Predicted fraud    Predicted not fraud
Actual fraud                 4                    3
Actual not fraud             2                   91

The confusion matrix indicates that, out of 100 total transactions, our model correctly predicted fraud four times and correctly predicted not fraud 91 times, yielding an overall accuracy of 95%. The confusion matrix, however, also enables us to see the number of times the model incorrectly predicted that a transaction was fraud (a false positive), which occurred on two out of the 100 transactions. We can also see the number of times the model predicted a transaction was not fraud when it was (a false negative), which occurred on three out of the 100 transactions.

While the model appears to boast a fairly strong “true negative” rate, the percentage of non-fraud transactions correctly classified as such (91/(91+2)=97.8%), the model’s “true positive” rate, the percentage of fraud transactions correctly flagged as such (4/(4+3)=57.1%), is far less attractive. Breaking down the model’s performance in this way paints a different and more complete picture than the 95% accuracy rate alone.
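The rates quoted above follow directly from the four counts in the example. A short sketch of the arithmetic (the variable names are our own):

```python
# Counts from the example: 100 transactions in total.
tp, tn, fp, fn = 4, 91, 2, 3

total = tp + tn + fp + fn
accuracy = (tp + tn) / total            # share of all predictions that were correct
true_positive_rate = tp / (tp + fn)     # share of actual fraud that was flagged
true_negative_rate = tn / (tn + fp)     # share of actual non-fraud that was cleared

print(f"{accuracy:.1%} {true_positive_rate:.1%} {true_negative_rate:.1%}")
```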

Evaluation methods apply to regression models as well. Let’s assume we have a regression model that’s been trained to predict housing prices. The model’s predicted prices can be compared with actual prices using the mean squared error (MSE), which is the average of the squared differences between the actual and predicted prices. The lower the MSE, the better the model.
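As a small worked example, with made-up prices in thousands of dollars:

```python
def mean_squared_error(actual, predicted):
    # Average of the squared differences between actual and predicted values.
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical housing prices, in thousands of dollars.
actual_prices = [200.0, 300.0, 250.0]
predicted_prices = [210.0, 290.0, 250.0]

mse_value = mean_squared_error(actual_prices, predicted_prices)
print(mse_value)
```

Because the errors are squared, the result is expressed in squared units (here, thousands of dollars squared), and large misses are penalized far more heavily than small ones.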

All models need to be subjected to evaluation—when they are built and throughout their lives. Supervised and unsupervised learning models pose different sorts of evaluation challenges, and selecting the right type of metrics is key.

[1] Many fraud detection models are also built using neural networks and other unsupervised learning techniques.

[2] Overfitting occurs when a model makes generalizations about coincidental data elements that in reality are not germane to the analysis. Continuing the example of fraud detection, overfitting may occur if model training detects a correlation between the length of a customer’s name (or whether the customer’s name begins with a vowel) and the likelihood that a transaction is fraudulent. Testing is likely to expose random, spurious correlations of this type for what they are, as they are not likely to be replicated in the test data set that has been held out from the training data. A model that has been “overfit” to its training data is likely to return a considerably lower accuracy ratio on the test data.


Feature Selection – Machine Learning Methods

Feature selection in machine learning refers to the process of isolating only those variables (or “features”) in a dataset that are pertinent to the analysis. Failure to do this effectively has many drawbacks, including: 1) unnecessarily complex models with difficult-to-interpret outcomes, 2) longer computing time, and 3) collinearity and overfitting. Effective feature selection eliminates redundant variables and keeps only the best subset of predictors in the model, thus making it possible to represent the data in the simplest way.

This post begins by identifying steps that must be taken to prepare datasets for meaningful analysis and shows how machine learning can help. We then introduce and discuss some commonly used machine learning techniques for variable selection.

Data Cleansing

Real world data contains a wide range of holes, noise, and inconsistencies. Before doing any statistical analysis, it is crucial to ensure that the data can be meaningfully analyzed. In practice, data cleansing is often the most time-consuming part of data analysis. This upfront investment is necessary, however, because the quality of data has a direct bearing on the reliability of model outputs.

Various machine learning projects require different sorts of data cleansing steps, but in general, when people speak of data cleansing, they are referring to the following specific tasks.

Cleaning Missing Values

Many machine learning techniques do not support data with missing values. To address this, we first need to understand why data are missing. Missing values usually occur simply because no information is provided, but other circumstances can lead to data holes as well. For instance, setting incorrect data types for attributes when data is extracted and integrated from multiple sources can cause data loss.

One way to investigate missing values is to identify patterns for missing data. For example, missing answers for certain questions from female respondents in a survey may indicate that those questions are only asked of male respondents. Another example might involve two loan records that share the same ID. If the second record contains blank values for every attribute except ‘Market Price,’ then the second record is likely simply updating the market price of the first record.

Once the early-stage evaluation of missing data is complete, we can set about determining how to address the problem. The easiest way to handle missing values is simply to ignore the records that contain them. However, this solution is not always practical. If a relatively large portion of the dataset contains missing values, then removing all of them could result in remaining data that may not be a good representation of the initial population. In that case, rather than filtering out relevant rows or attributes, a more proper approach is to impute missing values with sensible values.

A typical imputing method for categorical variables involves replacing the missing values with the most frequent value or with a newly created “unknown” category. For numeric variables, missing values might be replaced with mean or median values. Other, more advanced methods for dealing with missing values, e.g., listwise deletion for deleting rows with missing data and multiple imputation for substituting missing values, exist as well.
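The simple imputation rules described above might be sketched as follows. Missing values are represented by `None`, and the function names and sample data are assumptions made for illustration.

```python
from collections import Counter
from statistics import median

def impute_numeric(values):
    # Replace missing numeric values with the median of the observed values.
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def impute_categorical(values, use_unknown=False):
    # Replace missing categories with the most frequent value, or with "unknown".
    observed = [v for v in values if v is not None]
    fill = "unknown" if use_unknown else Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

balances = [100.0, None, 300.0, 200.0, None]
occupancy = ["owner", None, "owner", "investor"]

print(impute_numeric(balances))
print(impute_categorical(occupancy))
print(impute_categorical(occupancy, use_unknown=True))
```

The median is used here rather than the mean because it is less sensitive to outliers; either choice matches the approach described in the text.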

Reducing Noise in Data

“Noise” in data refers to erroneous values and outliers. Noise is an unavoidable problem which can be caused by human mistakes in data entry, technical problems, and many other factors. Noisy data adversely influences model performance, so its detection and removal has a key role to play in the data cleaning process.

There are two major noise types in data: class noise and attribute noise. Class noise often occurs in categorical variables and can include: 1) non-standardized class labels, 2) duplicate records mapping to different class labels, and 3) mislabeled records. Attribute noise refers to corruptive values and outliers, such as percentages inappropriately greater than 100% and placeholders (e.g., 999,000).¹

There are many ways to deal with noisy data. Certain types of noise can be easily identified by sorting the data, which isolates text input where numeric input is expected as well as other placeholders. Other noise can be addressed only using statistical methods. Clustering analysis groups the data by similarity and can help with detecting irrelevant objects and outliers. Data binning is used to reduce the impact of observation errors by combining ‘neighborhood’ data into a small number of bins. Advanced smoothing algorithms, including moving average and loess, fit the data into regression functions to eliminate the effect due to random variation and allow important patterns to stand out.

Data Normalization

Data normalization converts numerical values into specific ranges to meet the needs of a model. Performing data normalization makes it possible to combine data measured on different scales. Several algorithms require normalized data. For example, it is necessary to normalize data before feeding it into principal component analysis (PCA) so that all variables have zero mean and unit variance and therefore the same weight. This also applies to support vector machines (SVM), which generally work best when the input data is scaled to a range such as [0,1] or [-1,1]. Unnormalized data can slow model convergence and skew results.

The most common way of normalizing data involves the Z-score. Also known as standard-score normalization, this approach normalizes each value by subtracting the mean and dividing the difference by the standard deviation. Z-score normalization is often used when the min and max are unknown. Another common method is feature scaling (min-max scaling), which brings all values into the range [0,1] by dividing the difference between each value and the min by the difference between the max and min. Other normalization methods include studentized residual, t-statistics, and coefficient of variation.
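Both methods are short enough to sketch directly. The sample data is made up, and the population standard deviation is used here as one possible convention.

```python
from statistics import mean, pstdev

def z_score(values):
    # (value - mean) / standard deviation: result has zero mean and unit variance.
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def min_max(values):
    # (value - min) / (max - min): result lies in [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

data = [2.0, 4.0, 6.0, 8.0]
standardized = z_score(data)
scaled = min_max(data)
print(standardized)
print(scaled)
```

Both functions divide by a spread measure, so a constant column (zero standard deviation, or max equal to min) would need to be handled separately.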

Feature Selection Methods²

Stepwise Procedures

A stepwise procedure adds or subtracts individual features from a model until the optimal mix is identified. Stepwise procedures take three forms: backward elimination, forward selection, and stepwise regression.

Backward elimination is the simplest method. It fits the model using all available features and then systematically removes features one at a time, beginning with the feature with the highest p-value (provided the p-value exceeds a given threshold, usually 5%). The model is refit after each elimination, and the process repeats until a model is identified in which each feature’s p-value falls below the threshold.

Forward selection is the opposite of backward elimination. It includes no variables in the model at first and then systematically adds features one at a time, beginning with the lowest p-value (provided the p-value falls below a threshold). The model is refit after each addition, and the process repeats until additional features no longer improve model performance.

Stepwise regression combines backward elimination and forward selection by allowing a feature to be added or dropped at each iteration. Using this method, a newly added variable in an early stage may be removed later, and vice versa.

Criterion-Based Procedures

A variable’s p-value is not the only statistic that can be used for feature selection. Penalized-likelihood criteria, such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), are also valuable. Lower AIC and BIC values indicate a better trade-off between goodness of fit and model complexity. Both are given by $n \log(\mathrm{RSS}/n) + kp$, where RSS is the residual sum of squares (which decreases as the model complexity increases), n is the sample size, p is the number of predictors, and k is 2 for AIC and log(n) for BIC. Both criteria penalize larger models as p goes up, and BIC penalizes model complexity more heavily whenever log(n) exceeds 2 (that is, for n of 8 or more), which explains why BIC tends to favor smaller models in comparison to AIC. Other criteria are 1) adjusted R², which increases only if a new feature improves model performance more than would be expected by chance, 2) PRESS, the sum of squared prediction residuals, and 3) Mallows’s C_p statistic, which estimates the average MSE of prediction.
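The formula above translates directly into code. In the sketch below, the two candidate models and their RSS values are hypothetical numbers chosen to show a case where the two criteria disagree.

```python
import math

def penalized_criterion(rss, n, p, k):
    # n * log(RSS / n) + k * p, as given in the text.
    return n * math.log(rss / n) + k * p

def aic(rss, n, p):
    return penalized_criterion(rss, n, p, k=2)

def bic(rss, n, p):
    return penalized_criterion(rss, n, p, k=math.log(n))

n = 100
small = dict(rss=120.0, p=3)   # hypothetical 3-predictor model
large = dict(rss=110.0, p=6)   # hypothetical 6-predictor model with a slightly better fit

print(aic(n=n, **small), aic(n=n, **large))
print(bic(n=n, **small), bic(n=n, **large))
```

With these numbers AIC prefers the larger model while BIC prefers the smaller one, because BIC charges log(100) ≈ 4.6 per predictor instead of 2.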

Lasso and Ridge Regression

Lasso and ridge regressions are powerful techniques for keeping feature coefficients from growing too large. Both approaches reduce overfitting by adding a penalty on large coefficients to the usual objective of minimizing the difference between predicted and observed values, but they differ in the form of the penalty. Lasso adds a penalty proportional to the sum of the absolute values of the coefficients, which can shrink some coefficients exactly to zero and thereby eliminate the corresponding features from the model. Ridge adds a penalty proportional to the sum of the squared coefficients. Even though it does not shrink coefficients exactly to zero, it regularizes and constrains them to control variance.
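For reference, the two penalized objectives are commonly written as follows for a linear model. The notation is an assumption introduced here: y_i is the observed value, x_i the feature vector, β the coefficient vector with p entries, and λ ≥ 0 a tuning value that controls the strength of the penalty.

```latex
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \sum_{i=1}^{n} \left(y_i - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|

\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \sum_{i=1}^{n} \left(y_i - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
```

Setting λ to zero recovers ordinary least squares in both cases; λ is itself a hyperparameter that is usually chosen by cross-validation.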

Lasso and ridge regression models have been widely used in finance since their introduction. A recent example used both these methods in predicting corporate bankruptcy.³ In this study, the authors discovered that these regression methods are optimal as they handle multicollinearity and minimize the numerical instability that may occur due to overfitting.

Dimensionality Reduction

“Dimensionality reduction” is a process of transforming an extraordinarily complex, “high-dimensional” dataset (i.e., one with thousands of variables or more) into a dataset that can tell the story using a significantly smaller number of variables.

The most popular linear technique for dimensionality reduction is principal component analysis (PCA). It converts complex dataset features into a new set of coordinates named principal components (PCs). PCs are created in such a way that each succeeding PC preserves the largest possible variance under the condition that it is uncorrelated with the preceding PCs. Keeping only the first several PCs in the model reduces data dimensionality and eliminates multi-collinearity among features.

PCA has a couple of potential pitfalls: 1) PCA is sensitive to the scale of the original variables (data normalization is required before performing PCA), and 2) applying PCA to the data makes it harder to interpret the influence of individual features, since the PCs are no longer the original variables. For these reasons, PCA is not a good choice for feature selection if interpretation of results is important.

Dimensionality reduction and specifically PCA have practical applications to fixed income analysis, particularly in explaining term-structure variation in interest rates. Dimensionality reduction has also been applied to portfolio construction and analytics. It is well known that the first eigenvector identified by PCA maximally captures the systematic risk (variation of returns) of a portfolio.⁴ Quantifying and understanding this risk is essential when balancing a portfolio.

[1] http://sci2s.ugr.es/noisydata
[2] http://www.biostat.jhsph.edu/~iruczins/teaching/jf/ch10.pdf
[3] Pereira, J. M., Basto, M., & da Silva, A. F. (2016). The Logistic Lasso and Ridge Regression in Predicting Corporate Failure. Procedia Economics and Finance, v.39, pp.634-641.
[4] Alexander, C. (2001). Market models: A guide to financial data analysis. John Wiley & Sons.

Machine Learning and Portfolio Performance Analysis

Attribution analysis of portfolios typically aims to discover the impact that a portfolio manager’s investment choices and strategies had on overall profitability. It can help determine whether success was the result of an educated choice or simply good luck. Usually a benchmark is chosen and the portfolio’s performance is assessed relative to it.

This post, however, considers the question of whether a non-referential assessment is possible. That is, can we deconstruct and assess a portfolio’s performance without employing a benchmark? Such an analysis would require access to historical return as well as the portfolio’s weights and perhaps the volatility of interest rates, if some of the components exhibit a dependence on them. This list of required variables is by no means exhaustive.

There are two prevalent approaches to attribution analysis—one based on factor models and the other on return decomposition. The factor model approach considers the equities in a portfolio at a single point in time and attributes performance to various macro- and micro-economic factors prevalent at that time. The effects of these factors are aggregated at the portfolio level and a qualitative assessment is done. Return decomposition, on the other hand, explores the manner in which positive portfolio returns are achieved across time. The principal drivers of performance are separated and further analyzed. In addition to a year’s worth of time series data for the variables listed in the previous paragraph, covariance, correlation, and cluster analyses and other mathematical methods would likely be required.

Normality Assumption

Is the normality assumption for stock returns fully justified? Are sample means and variances good proxies for population means and variances? This assumption is worth testing because Normality and the Central Limit Theorem are widely assumed when dealing with financial data. The Delta-Normal Value at Risk (VaR) method, which is widely used to compute portfolio VaR, assumes that stock returns and allied risk factors are normally distributed. Normality is also implicitly assumed in financial literature. Consider the distribution of S&P returns from May 1980 to May 2017 displayed in Figure 1.

Figure One: Distribution of S&P Returns

Panel (a) is a histogram of S&P daily returns from January 2001 to January 2017. The red curve is a Gaussian fit. Panel (b) shows the same data on a semi-log plot (logarithmic Y axis). The semi-log plot emphasizes the tail events.

The returns displayed in the left panel of Figure 1 have a higher central peak and the “shoulders” are somewhat wider than what is predicted by the Gaussian fit. This mismatch in the tails is more visible in the semi-log plot shown in panel (b). This demonstrates that a normal distribution is probably not a very accurate assumption. Sigma, the standard deviation, is typically used as a measure of the relative magnitude of market moves and as a rough proxy for the occurrence of such events. The normal distribution places the odds of a minus-5 sigma swing at only 2.86×10⁻⁵%. In other words, assuming 252 trading days per year, a drop of this magnitude should occur once in every 13,000 years! However, an examination of S&P returns over the 37-year period cited shows drops of 5 standard deviations or greater on 15 occasions. Assuming a normal distribution would consistently underestimate the occurrence of tail events.
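The tail odds quoted above can be checked with a few lines of standard-library Python. This is a minimal sketch: the function name is ours, and the only inputs are the 5-sigma threshold and the 252 trading days per year assumed in the text.

```python
from statistics import NormalDist

TRADING_DAYS_PER_YEAR = 252  # assumption stated in the text


def years_between_drops(sigmas, days_per_year=TRADING_DAYS_PER_YEAR):
    """Expected years between daily drops of at least `sigmas` standard
    deviations, if daily returns were exactly normally distributed."""
    tail_probability = NormalDist().cdf(-sigmas)  # one-sided lower tail
    return 1.0 / (tail_probability * days_per_year)


p5 = NormalDist().cdf(-5.0)
print(f"P(Z <= -5) = {p5:.3e} ({p5 * 100:.3e} %)")
print(f"Expected wait: roughly {years_between_drops(5.0):,.0f} years")
```

The result is on the order of thirteen to fourteen thousand years, which is what makes 15 such drops in 37 years so hard to reconcile with normality.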

We conducted a subsequent analysis focusing on the daily returns of SPY, a popular exchange-traded fund (ETF). This ETF tracks 503 component instruments. Using returns from July 1, 2016 through June 30, 2017, we tested each component instrument’s return vector for normality using the Chi-Square Test, the Kurtosis estimate, and a visual inspection of the Q-Q plot. Brief explanations of these methods are provided below.

Chi-Square Test

This is a goodness-of-fit test that assumes a specific data distribution (Null hypothesis) and then tests that assumption. The test evaluates the deviations of the model predictions (Normal distribution, in this instance) from empirical values. If the resulting computed test statistic is large, then the observed and expected values are not close and the model is deemed a poor fit to the data. Thus, the Null hypothesis assumption of a specific distribution is rejected.
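The sketch below shows one way such a test could be set up using only the standard library. It is not the code used for the results in this post. The choice of 9 equal-probability bins is an assumption made so that the degrees of freedom (9 − 1 − 2 = 6) are even, which allows a closed-form p-value. Because the mean and standard deviation are estimated from the raw data, the p-value is approximate.

```python
import bisect
import math
from statistics import NormalDist, fmean, stdev


def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variable (even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2.0
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def chi_square_normality(data, bins=9):
    """Return (statistic, p_value) for H0: data are normally distributed."""
    fitted = NormalDist(fmean(data), stdev(data))
    # Interior bin edges at equal-probability quantiles of the fitted normal.
    edges = [fitted.inv_cdf(k / bins) for k in range(1, bins)]
    observed = [0] * bins
    for x in data:
        observed[bisect.bisect_right(edges, x)] += 1
    expected = len(data) / bins  # same expected count in every bin
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    df = bins - 1 - 2  # two parameters (mean, std dev) were estimated
    return statistic, chi2_sf_even(statistic, df)


# Two synthetic samples: evenly spaced normal quantiles, and two spikes.
n = 450
normal_like = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
two_spikes = [-1.0] * 225 + [1.0] * 225

for name, sample in [("normal-like", normal_like), ("two spikes", two_spikes)]:
    stat, p = chi_square_normality(sample)
    print(f"{name}: statistic={stat:.2f}, p-value={p:.4f}")
```

A large statistic (small p-value) leads to rejecting the Null hypothesis of normality, as described above.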

Kurtosis

The kurtosis of any univariate Normal distribution is 3. A sample kurtosis that deviates substantially from this value suggests that the data distribution is non-Normal. An example is illustrated in Figures 2, 3, and 4, below.
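For reference, the kurtosis used here is the (non-excess) Pearson kurtosis: the fourth central moment divided by the squared variance. A minimal sketch follows; the function name and the synthetic samples are ours, not the data behind the figures.

```python
from statistics import NormalDist, fmean


def pearson_kurtosis(data):
    """Fourth central moment divided by the squared second central moment.
    Equals 3 for a normal distribution (this is not 'excess' kurtosis)."""
    mean = fmean(data)
    deviations = [x - mean for x in data]
    m2 = fmean([d ** 2 for d in deviations])
    m4 = fmean([d ** 4 for d in deviations])
    return m4 / (m2 * m2)


# Evenly spaced normal quantiles stand in for a well-behaved normal sample.
n = 10_000
normal_like = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Mostly small moves plus a few large ones: a crude fat-tailed sample.
fat_tailed = [0.01] * 48 + [-0.01] * 48 + [0.08, -0.08, 0.10, -0.10]

print(f"normal-like: {pearson_kurtosis(normal_like):.2f}")
print(f"fat-tailed:  {pearson_kurtosis(fat_tailed):.2f}")
```

Some libraries report excess kurtosis (kurtosis minus 3) by default, so it is worth checking which convention a given function uses before comparing against 3.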

Q-Q Plot

Quantile-quantile (Q-Q) plots are graphs on which quantiles from two distributions are plotted relative to each other. If the distributions correspond, then the plot appears linear. This is a visual assessment rather than a quantitative estimation. A sample set of results is shown in Figures 2, 3, and 4, below.
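The points on a normal Q-Q plot can be computed as follows. This sketch assumes the common plotting positions (i + 0.5)/n for the theoretical quantiles; other conventions exist. It only produces the coordinates, which could then be handed to any plotting library.

```python
from statistics import NormalDist


def qq_points(data):
    """Return (theoretical quantile, sample quantile) pairs for a
    normal Q-Q plot, using plotting positions (i + 0.5) / n."""
    n = len(data)
    ordered = sorted(data)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Hypothetical daily returns, for illustration only.
sample_returns = [0.004, -0.012, 0.001, 0.007, -0.003, 0.015, -0.001, 0.002]

for theoretical_q, sample_q in qq_points(sample_returns):
    print(f"{theoretical_q:+.3f}  {sample_q:+.4f}")
```

If the sample is normal with mean μ and standard deviation σ, the points fall close to the line y = μ + σx. Systematic curvature at the ends indicates tails that are heavier or lighter than normal.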

Figure Two: Year’s Returns for Exxon

Figure 2. The left panel shows the histogram of a year’s returns for Exxon (XOM). The null hypothesis was rejected with the conclusion that the data is not normally distributed. The kurtosis was 6, which implies a deviation from normality. The Q-Q plot in the right panel reinforces these conclusions.

Figure Three: Year’s Returns for Boeing

Figure 3. The left panel shows the histogram of a year’s returns for Boeing (BA). The data is not normally distributed and also shows significant skewness. The kurtosis was 12.83, which implies a significant deviation from normality. The Q-Q plot in the right panel confirms this.

For the sake of comparison, we also show returns that exhibit normality in the next figure.

Figure Four: Year’s Returns for Xerox

Figure 4. The left panel shows the histogram of a year’s returns for Xerox (XRX). The data is normally distributed, which is apparent from a visual inspection of both panels. The kurtosis was 3.23, which is very close to the value for a theoretical normal distribution.

Machine learning literature has several suggestions for addressing this problem, including Kernel Density Estimation and Mixture Density Networks. If the data exhibits multi-modal behavior, learning a multi-modal mixture model is a possible approach.

Stationarity Assumption

In addition to normality, we also make untested assumptions regarding stationarity. This critical assumption is implicit when computing covariances and correlations. We also tend to overlook insufficient sample sizes. As observed earlier, the SPY dataset we had at our disposal consisted of 503 instruments, with around 250 returns per instrument. The number of observations is much lower than the dimensionality of the data. This will produce a covariance matrix which is not full-rank and, consequently, its inverse will not exist. Singular covariance matrices are highly problematic when computing the risk-return efficiency loci in the analysis of portfolios. We tested the returns of all instruments for stationarity using the Augmented Dickey-Fuller (ADF) test. Several return vectors were non-stationary. Non-stationarity and sample size issues can’t be wished away because the financial markets are fluid, with new firms coming into existence and existing firms disappearing due to bankruptcies or acquisitions. Consequently, limited financial histories will be encountered and must be dealt with.
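The rank problem is easy to reproduce. The sketch below uses simulated returns with the same shape as the dataset described above (250 observations of 503 instruments); it is not the SPY data. NumPy is assumed to be available.

```python
import numpy as np

# Simulated daily returns: 250 observations (rows) of 503 instruments (columns).
n_obs, n_assets = 250, 503
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0, scale=0.01, size=(n_obs, n_assets))

# Sample covariance across instruments (columns are the variables).
cov = np.cov(returns, rowvar=False)

# After removing the mean, at most n_obs - 1 directions carry variance,
# so the 503 x 503 matrix cannot be full rank.
rank = np.linalg.matrix_rank(cov)
print(f"covariance shape: {cov.shape}, rank: {rank}")
```

Because the rank cannot exceed the number of observations minus one, the matrix is singular whenever there are more instruments than observations, whatever the data look like.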

This is a problem where machine learning can be profitably employed. Shrinkage methods, latent factor models, empirical Bayes estimators and models based on random matrix theory are widely published techniques that are applicable here.
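To give a flavor of the shrinkage idea, the sketch below blends the sample covariance with a scaled identity matrix. The fixed intensity `delta=0.2` is an arbitrary assumption for illustration. Published estimators such as Ledoit-Wolf choose the intensity from the data, which this sketch does not attempt. The returns are simulated and NumPy is assumed to be available.

```python
import numpy as np


def shrink_covariance(sample_cov, delta=0.2):
    """Linear shrinkage toward a scaled identity target.

    `delta` in (0, 1] is a hand-picked shrinkage intensity (an assumption
    of this sketch, not an optimal data-driven value)."""
    p = sample_cov.shape[0]
    average_variance = np.trace(sample_cov) / p
    target = average_variance * np.eye(p)
    return (1.0 - delta) * sample_cov + delta * target


# Fewer observations (20) than instruments (40): singular sample covariance.
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0, scale=0.01, size=(20, 40))
sample_cov = np.cov(returns, rowvar=False)
shrunk = shrink_covariance(sample_cov)

print("rank before:", np.linalg.matrix_rank(sample_cov))
print("rank after: ", np.linalg.matrix_rank(shrunk))
print("smallest eigenvalue after:", np.linalg.eigvalsh(shrunk).min())
```

Every eigenvalue of the shrunk matrix is at least `delta` times the average variance, so the result is invertible even when the sample covariance is not.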

Portfolio Performance Analysis

Once issues surrounding untested assumptions have been addressed, we can focus on portfolio performance analysis, a subject with a vast collection of books and papers devoted to it. We limit our attention here to one aspect of portfolio performance analysis: an inquiry into the clustering behavior of stocks in a portfolio.

Books on portfolio theory devote substantial space to the discussion of asset diversification to achieve an optimum balance of risk and return. To properly diversify assets, we need to know if resources have been over-allocated to a specific sector and, consequently, under-allocated to others. Cluster analysis can help to answer this. A pertinent question is how to best measure the difference or similarity between stocks. One way would be to estimate correlations between stocks. This approach has its own weaknesses, some of which have been discussed in earlier sections. Even if we had a statistically significant set of observations, we are faced with the problem of changing correlations during the course of a year due to structural and regime shifts caused by intermittent periods of stress. Even in the absence of stress, correlations can break down or change due to factors that are endogenous to individual stocks.

We can estimate similarity and visualize clusters using histogram analysis. However, histograms eliminate temporal information. To overcome this constraint, we used Spectral Clustering, which is a machine learning technique that explores cluster formation without neglecting temporal information.
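The sketch below illustrates the core of spectral clustering in its simplest form, a two-way split based on the second eigenvector (the Fiedler vector) of the normalized graph Laplacian. Using absolute return correlations as the affinity matrix is our assumption for this example. The post does not specify the affinity used for the results that follow, and the series here are simulated. NumPy is assumed to be available.

```python
import numpy as np


def spectral_bipartition(returns):
    """Split instruments (columns of `returns`) into two clusters using the
    Fiedler vector of the symmetric normalized graph Laplacian."""
    corr = np.corrcoef(returns, rowvar=False)
    affinity = np.abs(corr)          # assumed similarity measure
    np.fill_diagonal(affinity, 0.0)  # no self-loops
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    laplacian = np.eye(len(degree)) - (
        d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    )
    # eigh returns eigenvalues in ascending order; column 1 is the Fiedler vector.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]
    return (fiedler > 0).astype(int)


# Simulated example: three series driven by one factor, three by another.
rng = np.random.default_rng(0)
n_days = 250
factor_a = rng.normal(size=n_days)
factor_b = rng.normal(size=n_days)
columns = [factor_a + 0.3 * rng.normal(size=n_days) for _ in range(3)]
columns += [factor_b + 0.3 * rng.normal(size=n_days) for _ in range(3)]
simulated_returns = np.column_stack(columns)

labels = spectral_bipartition(simulated_returns)
print(labels)
```

For more than two clusters, the usual extension is to take several of the leading eigenvectors and run a standard clustering step such as k-means on the rows.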

Figures 5 to 7 display preliminary results from our cluster analysis. Analyses like this will enable portfolio managers to identify clustering patterns in their portfolios and gauge how strong those clusters are. They will also help guide decisions on reweighting portfolio components and diversification.

Figures 5-7: Cluster Analyses

Figure 5. Cluster analysis of a limited set of stocks is shown here. The labels indicate the names of the firms. Clusters are illustrated by various colored bullets, and increasing distances indicate decreasing similarities. Within clusters, stronger affinities are indicated by greater connecting line weights.

The following figures display magnified views of individual clusters.

Figure 6. We can see that Procter & Gamble, Kimberly Clark and Colgate Palmolive form a cluster (top left, dark green bullets). Likewise, Bank of America, Wells Fargo and Goldman Sachs form a cluster (top right, light green bullets). This is not surprising as these two clusters represent two sectors: consumer products and banking. Line weights are correlated to affinities within sectors.

Figure 7. The cluster on the left displays stocks in the technology sector, while the clusters on the right represent firms in the defense industry (top) and the energy sector (bottom).

In this post, we raised questions about standard assumptions that are made when analyzing portfolios. We also suggested possible solutions from machine learning literature. We subsequently analyzed one year’s worth of returns of SPY to identify clusters and their strengths and discussed the value of such an analysis to portfolio managers in evaluating risk and reweighting or diversifying their portfolios.


Table 1 – Grid of Hyperparameter Values and Scores