Using Factor Analysis to reduce the number of attributes

In my last post on using machine learning for everyday use cases, I mentioned factor analysis as a way to reduce a large number of items (e.g., news articles’ attributes) to a smaller set of variables. Some people asked me for examples, so this post is an attempt to explain how factor analysis can be used for what is known as dimensionality reduction.

Issues with a large number of attributes

Let’s say you have a list of customers and you want to analyse some aspect of it. It’s quite easy to analyse your list if the customers have a relatively small number of attributes, say 10. What if the number of attributes increases to 20? 100? Sure, manageable. What about 1,000 or 10,000 or more? Or what about attributes that are not obvious (e.g., intention to watch a movie)?

Recall that in a typical machine learning algorithm, these attributes form the input matrix from which you predict an outcome. So as the number of attributes increases, your algorithm gets computationally more expensive and harder to program and debug. There is the additional issue of overfitting: your machine learning model fits your training set extremely well but may still predict poorly on data it has not seen.

One way to address this is to group related attributes together and run your algorithm with the “grouped” attribute as input. In some cases the grouping is easy because the relationship is obvious.

For example, let’s say you have attributes that describe a customer’s height and weight. Are they directly proportional to each other? Probably not. But are they correlated? Probably yes. Many correlations are not that obvious, though, and there could be underlying patterns that are hidden.
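To make the height and weight point concrete, here is a minimal sketch on made-up data. The numbers (500 customers, the slope of 0.9, the noise levels) are assumptions chosen purely for illustration, not real measurements.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Made-up data: weight loosely depends on height, plus individual noise.
heights_cm = [random.gauss(170, 10) for _ in range(500)]
weights_kg = [0.9 * h - 85 + random.gauss(0, 8) for h in heights_cm]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson(heights_cm, weights_kg)

# If the two were directly proportional, this ratio would be constant.
ratios = [w / h for h, w in zip(heights_cm, weights_kg)]

print(f"correlation between height and weight: {r:.2f}")
print(f"weight/height ratio ranges from {min(ratios):.2f} to {max(ratios):.2f}")
```

The ratio varies from customer to customer, so the two attributes are not proportional, yet the correlation is clearly positive. That overlap is what makes it reasonable to treat them as one grouped attribute.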

Factor Analysis to reduce number of variables

Factor analysis is a technique to reduce the number of attributes when the relationships between those attributes are not that obvious. Essentially, factor analysis analyses interrelationships (or correlations) among a large number of items and reduces them to a smaller set of factors. This smaller set of factors can then be used in further analysis, e.g., in a logistic regression or a neural network that predicts your outcome.
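As a rough illustration of the idea, the sketch below uses scikit-learn's FactorAnalysis to turn six observed attributes into two factor scores per customer. The data is synthetic, and the two hidden factors, the loading values and the choice of two factors are all assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_customers = 1000

# Two hidden factors (hypothetical); in real data you never observe these.
hidden = rng.normal(size=(n_customers, 2))

# Six observed attributes: the first three are driven by factor 0,
# the last three by factor 1, each with its own independent noise.
loadings = np.array([
    [0.9, 0.8, 0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9, 0.8, 0.7],
])
observed = hidden @ loadings + 0.4 * rng.normal(size=(n_customers, 6))

# Standardise the attributes, then ask for two factors.
X = StandardScaler().fit_transform(observed)
fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)

print("input shape: ", X.shape)       # six attributes per customer
print("output shape:", scores.shape)  # two factor scores per customer
```

The array scores is the smaller input you would pass on to the next model. In practice the number of factors is not known in advance and has to be chosen, for example by looking at how much variance each factor explains.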

Here is another concrete example. This study analysed how social media is used within organizations and came up with a list of 31 activities. These are examples of organisational processes that can benefit from the use of social media. Of course, there could be many more activities depending on the scenario. The linked post has a chart that shows these activities. Now, any analysis would have meant creating a model and analysing the impact on all 31 variables. A factor analysis (Principal Component Analysis, to be precise) was carried out on these 31 variables, and it grouped them into 8 variables. For example, the factor analysis suggested that the following of those 31 variables be grouped together:

Fig: Multiple attributes grouped together by factor analysis

You will probably agree that all these activities appear to be correlated, as all of them relate to sales and marketing. So instead of analysing all these variables separately, you can think of “Sales and Marketing” as one factor that encompasses all 7 of these activities (variables). The other groupings followed a similar pattern, and I ended up with 8 high-level variables in place of 31.
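The grouping step can be sketched with plain NumPy. This is not the analysis from the study: the seven item names and the data are hypothetical, keeping only components with an eigenvalue above 1 is an assumed retention rule, and no rotation is applied. Each variable is simply placed in the group of the component it loads on most strongly.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000

# Hypothetical survey items: four driven by one hidden theme, three by another.
names = ["sales_1", "sales_2", "sales_3", "sales_4", "hr_1", "hr_2", "hr_3"]
theme_a = rng.normal(size=n_rows)
theme_b = rng.normal(size=n_rows)
columns = [theme_a + 0.6 * rng.normal(size=n_rows) for _ in range(4)]
columns += [theme_b + 0.6 * rng.normal(size=n_rows) for _ in range(3)]
data = np.column_stack(columns)

# Principal components of the correlation matrix.
corr = np.corrcoef(data, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)  # eigh returns ascending order
order = np.argsort(eigenvalues)[::-1]             # re-sort, largest first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Assumed retention rule: keep components whose eigenvalue exceeds 1.
keep = eigenvalues > 1.0
loadings = eigenvectors[:, keep] * np.sqrt(eigenvalues[keep])

# Assign each variable to the component with its largest absolute loading.
assigned = np.abs(loadings).argmax(axis=1)
groups = {}
for name, component in zip(names, assigned):
    groups.setdefault(int(component), []).append(name)

for component, members in sorted(groups.items()):
    print(f"component {component}: {members}")
```

With real survey data the loadings are rarely this clean, which is why naming each group (as with “Sales and Marketing”) remains a judgment call for the analyst.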

Okay, so once you have a smaller, more manageable set of attributes, you can use the grouped variables in your machine learning algorithms for further analysis. This can reduce the computational cost and the risk of overfitting, and in many cases leads to better predictions. In this study, I eventually used these 8 variables for further analysis using Confirmatory Factor Analysis and SEM. But more about that later.
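Here is a hedged sketch of what feeding the reduced variables into a model might look like with scikit-learn, using a logistic regression as the downstream model. The data, the outcome and the choice of two factors are assumptions for illustration; the study itself used Confirmatory Factor Analysis and SEM, which this does not reproduce.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_rows = 2000

# Synthetic data: six observed attributes generated from two hidden factors.
hidden = rng.normal(size=(n_rows, 2))
weights = np.array([
    [0.9, 0.8, 0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9, 0.8, 0.7],
])
X = hidden @ weights + 0.4 * rng.normal(size=(n_rows, 6))

# Hypothetical binary outcome that depends on the first hidden factor.
y = (hidden[:, 0] + 0.5 * rng.normal(size=n_rows) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale, reduce six attributes to two factors, then classify.
model = make_pipeline(
    StandardScaler(),
    FactorAnalysis(n_components=2, random_state=0),
    LogisticRegression(),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy using 2 factors: {accuracy:.2f}")
```

Putting the reduction inside the pipeline means the factors are estimated from the training rows only, so the held-out score is not flattered by information from the test rows.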