Big data quant techniques, combined with advances in statistical programming and visualization, have created opportunities to view the economic landscape holistically. These techniques allow analysts to process, analyze, summarize, and visualize hundreds of thousands of economic variables and to make faster, better-informed decisions. They are particularly useful when developing classification models for economic forecasting.

Chart 1 is a graphic summary of the economic topography with the S&P 500 (orange line) overlaid on the economic landscape from 1950 to present. Chart 1 was generated by collecting, analyzing, and summarizing over 200,000 international economic variables.

**Chart 1:** Economic topography, 1950 to present

### The challenges of big data

Working with high-dimensional data is challenging. Making sense of hundreds of thousands of economic indicators is not intuitive, but dimensionality can often be reduced. Working with fewer dimensions can lead to faster and more accurate decision making.

Dimensionality reduction occurs in the natural world; the perfect example is human sensory data. When deciding whether the piece of food you are consuming tastes good or bad, and whether you should consume more or not, the brain will receive data from millions of sensors, extract the necessary features, and make a binary decision: eat more or stop eating. The problem can be represented as shown in figure 1.

The remainder of this post briefly explains how one can go from 200,000+ economic indicators to a go/no-go investment decision.

**Figure 1:** Should I consume more food?

### Dimensionality reduction and feature extraction

After framing your problem, deciding what data to collect, and collecting and processing that data, the next critical step is often dimensionality reduction. Many dimensionality reduction techniques exist. One popular linear technique is principal component analysis (“PCA”), which transforms the original data set to a new coordinate system whose axes are ordered by how much of the variance in the data they capture. PCA achieves this by finding the eigenvalues and eigenvectors of the covariance matrix of the data set; the eigenvector with the largest eigenvalue is chosen as the first principal component. Then the eigenvector with the next-highest eigenvalue, which is orthogonal to the first, is chosen. The process can be repeated until the number of principal components equals the number of variables in the original data set. By dropping principal components with low eigenvalues, dimensionality is reduced while most of the variance in the data is retained.
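
The eigen-decomposition described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline used to produce the charts in this post: the function name `pca`, the synthetic data, and the choice of two components are all assumptions made for the example.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components."""
    # Center each column so the covariance matrix describes spread around the mean
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)

    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    components = eigvecs[:, :n_components]
    explained = eigvals[:n_components] / eigvals.sum()
    return X_centered @ components, components, explained

# Synthetic stand-in for economic indicators (an assumption for this sketch):
# six noisy series driven by two hidden factors
rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 2))
loadings = rng.normal(size=(2, 6))
X = factors @ loadings + 0.05 * rng.normal(size=(500, 6))

scores, components, explained = pca(X, n_components=2)
print(scores.shape)     # reduced data: one row per observation, two columns
print(explained.sum())  # share of total variance kept by the two components
```

Because the six synthetic series are built from two factors, two components retain almost all of the variance. On real indicators, the share of variance explained is one common guide for deciding how many components to keep.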

Linear dimensionality reduction is very useful; however, some data sets do not lend themselves to being summarized linearly. In such cases, a nonlinear dimensionality reduction technique may be employed. As with the linear approach, many nonlinear techniques exist. While each approach varies, the general idea is to map points from a high-dimensional manifold to a lower-dimensional space while preserving, as far as possible, the distances between points.
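
As one concrete example of the distance-preserving idea, the sketch below follows the Isomap recipe: build a nearest-neighbor graph, approximate distances along the manifold with shortest paths, then apply classical multidimensional scaling. Isomap is chosen here purely for illustration; the post does not say which nonlinear technique was used for the charts, and the spiral data and parameter values are assumptions.

```python
import numpy as np

def isomap(X, n_neighbors, n_components):
    """Isomap-style embedding: preserve shortest-path distances along the data."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances

    # Keep only the edges from each point to its nearest neighbors
    graph = np.full((n, n), np.inf)
    nearest = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]  # column 0 is the point itself
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = nearest.ravel()
    graph[rows, cols] = dist[rows, cols]
    graph = np.minimum(graph, graph.T)  # make the graph symmetric
    np.fill_diagonal(graph, 0.0)

    # Floyd-Warshall: shortest paths through the graph approximate manifold distances
    for k in range(n):
        graph = np.minimum(graph, graph[:, [k]] + graph[[k], :])
    if np.isinf(graph).any():
        raise ValueError("neighbor graph is disconnected; increase n_neighbors")

    # Classical MDS: double-center the squared distances and take the top eigenvectors
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (graph ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Points along a spiral: two coordinates, but only one underlying degree of freedom
t = np.linspace(0.0, 3.0 * np.pi, 120)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])

embedding = isomap(X, n_neighbors=4, n_components=1)
print(embedding.shape)
```

The single output coordinate tracks position along the spiral, which a straight-line projection such as PCA cannot recover. The all-pairs shortest-path step is cubic in the number of points, so this form only suits small data sets.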

### Visualization

Once dimensionality is reduced, the data lends itself more easily to graphing and visualization. Visualizing data is one of the easiest ways for a quant to hypothesize relationships, which can then be mathematically developed and tested. Looking at chart 1, it is difficult to see any relationship between the topography of the economic landscape and the S&P 500 index; however, when the view is zoomed in (chart 2), possible relationships begin to appear.

**Chart 2:** Economic topography, 2000 to present

On closer inspection, chart 2 suggests that a relationship between the economic topography and the S&P 500 index may exist. One dimension of the economic topography appears to precede the peaks and valleys of the S&P 500 index. While 3D plots are a useful tool for quickly viewing all of the data in one window, a simple 2D plot of each variable against the response can also provide a lot of information.
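
A visual impression that one dimension leads the index can be checked numerically with a lagged correlation. The sketch below is a hypothetical illustration on synthetic data: the helper name `lagged_correlation`, the three-period delay, and the noise level are assumptions, not values taken from the charts.

```python
import numpy as np

def lagged_correlation(feature, response, max_lag):
    """Correlation between feature[t] and response[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = feature[:len(feature) - lag]  # drop the last `lag` feature values
        y = response[lag:]                # drop the first `lag` response values
        result[lag] = float(np.corrcoef(x, y)[0, 1])
    return result

# Synthetic data (an assumption): the response echoes the feature three periods later
rng = np.random.default_rng(1)
n, true_lag = 300, 3
feature = rng.normal(size=n)
response = 0.5 * rng.normal(size=n)
response[true_lag:] += feature[:-true_lag]

correlations = lagged_correlation(feature, response, max_lag=6)
best_lag = max(correlations, key=correlations.get)
print(best_lag, correlations[best_lag])
```

A clear peak at a positive lag is consistent with the feature leading the response, although a relationship found this way still needs out-of-sample testing before it is relied upon.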

### Feature extraction and classification

After visualizing the data, the dimensions that appear most likely to have a relationship with the response can be isolated, and classification and model building can begin. Remember, the goal is to generate a possibility matrix that lays out investment decisions, as we did for eating decisions.

A couple of quick ways to help classify data and extract features are using the built-in functions in statistical software that split data into groups, and building tree classification models. When a tree-building algorithm learns the data, the rules it produces are excellent starting points for writing further feature extraction and classification algorithms. To maximize flexibility and testing of the program, we find that writing our own feature extraction and classification algorithms works best. This way we can target and test specific features in the data.
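
Both quick approaches can be sketched with NumPy alone: equal-frequency binning stands in for the built-in grouping functions, and a single-split search stands in for the first rule a classification tree would learn. The function names, the Gini impurity criterion, and the synthetic data are assumptions made for this example, not the algorithms used in our own program.

```python
import numpy as np

def quantile_classes(values, n_classes):
    """Assign each value to one of n_classes equal-frequency bins (0 = lowest)."""
    edges = np.quantile(values, np.linspace(0, 1, n_classes + 1)[1:-1])
    return np.digitize(values, edges)

def gini(labels):
    """Gini impurity of an array of 0/1 labels."""
    if labels.size == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)

def best_split(feature, labels):
    """Threshold on one feature that minimizes weighted Gini impurity (a depth-1 tree)."""
    best_threshold, best_score = None, np.inf
    for threshold in np.unique(feature)[:-1]:
        left = labels[feature <= threshold]
        right = labels[feature > threshold]
        score = (left.size * gini(left) + right.size * gini(right)) / labels.size
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Synthetic data (an assumption): the label is mostly determined by feature > 0.5
rng = np.random.default_rng(2)
feature = rng.normal(size=400)
labels = (feature + 0.3 * rng.normal(size=400) > 0.5).astype(int)

classes = quantile_classes(feature, n_classes=3)
threshold, impurity = best_split(feature, labels)
print(np.bincount(classes))  # number of observations in each class
print(threshold, impurity)   # learned rule: "feature > threshold"
```

The learned threshold is exactly the kind of rule that can seed a hand-written feature extraction routine, for example by replacing the equal-frequency bin edges with thresholds the tree found informative.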

### Model building

After the specific features have been extracted and classified, the problem can be framed per figure 2, below, which should be familiar to the reader (see figure 1). If there are n features with m classes per feature, the possibility matrix has n × m cells. Now a quant can build a classification model with the features as variables and the S&P 500 index as the response, and a probability distribution can be generated for each possibility. Bayesian networks are a useful tool for this type of analysis. An example result of this type of model, a probability distribution table, is shown in table 1.

**Figure 2:** Example possibility matrix using economic topography data

**Table 1:** Example probability distribution table
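
A table of this kind can be approximated by simple counting: for each feature and each of its classes, tally how often each response class followed. The sketch below is a simplified stand-in for a Bayesian network, which would also model dependencies between features; the feature names, class labels, and observations are made up for illustration.

```python
from collections import Counter, defaultdict

def probability_table(features, response):
    """Estimate P(response | feature = class) for every (feature, class) pair."""
    counts = defaultdict(Counter)
    for name, classes in features.items():
        for cls, outcome in zip(classes, response):
            counts[(name, cls)][outcome] += 1

    outcomes = sorted(set(response))
    return {
        key: {o: counter[o] / sum(counter.values()) for o in outcomes}
        for key, counter in counts.items()
    }

# Hypothetical classified features and response classes, one entry per period
features = {
    "feature_a": ["low", "low", "low", "low", "high", "high", "high", "high"],
    "feature_b": ["low", "low", "high", "high", "low", "low", "high", "high"],
}
response = ["down", "down", "down", "up", "down", "up", "up", "up"]

table = probability_table(features, response)
for key in sorted(table):
    print(key, table[key])
```

With two features and two classes each, the table has 2 × 2 = 4 rows, one per cell of the possibility matrix. Rows whose distribution is far from even are the ones of practical interest.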

### Applications

Armed with a probability distribution table, ideally with some skewed distributions, as shown by the far right and far left columns in table 1, an analyst can then write an algorithm that implements table 1 in a trading program. The node set created from the economic topography can also be written into a larger, more complex model incorporating many other variables, such as investor sentiment, economic fundamentals, and technicals. Finally, the model and its results are easily incorporated into an analyst’s larger toolkit, which includes fundamental and bottom-up analysis.
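
As a minimal sketch of turning such a table into a go/no-go rule, the function below acts only when the estimated probability of a rise is sufficiently skewed. The state names, the probabilities, and the 0.65 threshold are hypothetical values chosen for the example; they are not taken from table 1 and are not a recommendation.

```python
def go_no_go(table, current_state, go_threshold=0.65):
    """Map the current (feature, class) state to a go/no-go decision."""
    distribution = table.get(current_state)
    if distribution is None:
        return "no-go"  # state never observed: stay out rather than guess
    return "go" if distribution.get("up", 0.0) >= go_threshold else "no-go"

# Hypothetical probability table; every number here is made up for illustration
table = {
    ("leading_dimension", "rising"): {"down": 0.25, "up": 0.75},
    ("leading_dimension", "flat"): {"down": 0.50, "up": 0.50},
    ("leading_dimension", "falling"): {"down": 0.80, "up": 0.20},
}

for state in table:
    print(state, go_no_go(table, state))
```

Only the skewed "rising" row clears the threshold; the evenly split row carries no usable signal, which is why skewed distributions are the ones worth acting on.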

### DISCLAIMER

Any material provided in this blog is for general information use only. You should not act based solely upon the materials provided herein. Vital Data Science Inc. advises you to obtain professional advice before making investment decisions. Your use of these materials is entirely at your own risk. In no event shall Vital Data Science Inc., its officers, directors, or employees be liable for any loss, costs, or damages whatsoever.