Factor Analysis: A Process to Build Quality Representative Indices

Building a representative index for a set of measurements of a given phenomenon – for instance, attitudes toward a government policy on refugees – is a central challenge in research work. Researchers usually present the simple unweighted average across the relevant items and use Cronbach's alpha or the intra-class correlation coefficient to support the quality of the index. That is, if these coefficients indicate strong inter-item correlation, the index is considered representative. In fact, if we want to produce a stable and valid index, the process is considerably more complex and consists of several steps (see the figure below).
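The conventional reliability check just mentioned is easy to reproduce. The snippet below is a minimal sketch in Python that computes Cronbach's alpha for a block of item columns alongside the simple unweighted index; the data frame and item names (q1–q5) are hypothetical and stand in for real survey responses.

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of items (one column per item), listwise-deleting missing rows."""
    items = items.dropna()
    k = items.shape[1]                                # number of items
    item_variances = items.var(axis=0, ddof=1)        # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 1-5 Likert responses to five items on the same attitude
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(1, 6, size=(200, 5)),
                  columns=[f"q{i}" for i in range(1, 6)])

print("alpha:", round(cronbach_alpha(df), 3))   # reliability of the unweighted scale
print(df.mean(axis=1).head())                   # the simple unweighted index per respondent
```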

Firstly, we need to divide the process between two central analytical approaches: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) – where a factor is an overall construct composed of its building items, usually answers to survey questions, and factor analysis is the process by which items are combined into a representative index. We need to randomly split the dataset into (at least) two subsets, one for each analytical step. The first, exploratory analysis estimates the number of dimensions that best represent the phenomenon. In other words, we first try to determine how many indices would best express the phenomenon.
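A minimal sketch of the splitting and dimension-search step, assuming a recent scikit-learn is available; the item names (q1–q9) are hypothetical, and the eigenvalue-greater-than-one (Kaiser) rule is an illustrative first estimate rather than a prescribed procedure.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.decomposition import FactorAnalysis

# Hypothetical item responses; in practice, load the survey data here
rng = np.random.default_rng(7)
items = pd.DataFrame(rng.normal(size=(400, 9)),
                     columns=[f"q{i}" for i in range(1, 10)])

# Random split: one subset for the exploratory step, one held out for the confirmatory step
efa_half, cfa_half = train_test_split(items, test_size=0.5, random_state=42)

# Eigenvalues of the item correlation matrix: a scree/Kaiser-style first look at
# how many dimensions (indices) the data support
eigenvalues = np.linalg.eigvalsh(efa_half.corr().to_numpy())[::-1]
n_factors = max(1, int((eigenvalues > 1).sum()))   # Kaiser criterion as a rough estimate
print("eigenvalues:", np.round(eigenvalues, 2), "-> suggested factors:", n_factors)

# Exploratory factor model with that many dimensions; inspect the rotated loadings
efa = FactorAnalysis(n_components=n_factors, rotation="varimax").fit(efa_half)
loadings = pd.DataFrame(efa.components_.T, index=items.columns)
print(loadings.round(2))
```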

For instance, if we work with a questionnaire that surveys dissatisfaction at work, it would be reasonable to assume that more than one index underlies the responses. Indices such as emotional dissatisfaction, physical dissatisfaction, and organizational dissatisfaction may exist and may be measurable with the questionnaire.

Next, the confirmatory analysis assesses the goodness of fit of these latent indices or constructs. Here we emphasize the difference between unobserved latent variables – variables that presumably exist but cannot be measured directly – and the items already measured directly via the survey questions. The objective of this second step is to neutralize the effect of measurement errors attached to the observed items, such that the final index represents the phenomenon we analyze without these errors. By the confirmatory process, we mean that a theoretical framework has already provided the grounding for the construct, and our aim is to validate our measurements against it. When there is a clear and firm theoretical background, we can integrate this confirmatory model, also termed the measurement model, into the final structural equation model without needing another subset of the data in order to estimate the regression weights.
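For the confirmatory step, one option – my assumption here, not a tool named in the text – is the third-party semopy package, which accepts lavaan-style model syntax. The factor names and item assignments below are hypothetical, and cfa_half refers to the held-out subset from the splitting sketch above.

```python
import semopy  # third-party SEM package (pip install semopy)

# lavaan-style measurement model: two hypothetical latent constructs,
# each measured by three observed survey items
model_desc = """
emotional =~ q1 + q2 + q3
physical  =~ q4 + q5 + q6
"""

cfa = semopy.Model(model_desc)
cfa.fit(cfa_half)                     # held-out subset from the exploratory split
print(cfa.inspect())                  # factor loadings and error variances
print(semopy.calc_stats(cfa).T)       # goodness-of-fit indices such as CFI, TLI and RMSEA
```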

Note that theory is highly important for saving statistical resources (observations, repeats, trials). In greater detail, as can be seen in the figure, the exploratory process starts with the partial data and produces the dimensions – the number of indices – that are then used when replacing missing values (itself a statistical topic that deserves careful attention). The second exploratory analysis uses the full dataset after missing values are imputed with a multiple imputation procedure. This step is vital because it preserves all cases/observations and allows us to use the full dataset in further analyses.
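The text calls for multiple imputation; as one illustration (not necessarily the exact procedure used here), scikit-learn's IterativeImputer performs chained-equations imputation, and drawing from the posterior predictive distribution is the building block on which a full multiple-imputation scheme rests. The data and missingness pattern below are simulated.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the estimator)
from sklearn.impute import IterativeImputer

# Hypothetical items with scattered missing answers (~10% missing at random)
rng = np.random.default_rng(3)
raw = pd.DataFrame(rng.normal(size=(300, 9)), columns=[f"q{i}" for i in range(1, 10)])
raw = raw.mask(rng.random(raw.shape) < 0.10)

# Chained-equations imputation; sample_posterior=True draws each imputation from the
# predictive distribution, which is what repeated runs of a multiple-imputation
# scheme are built on
imputer = IterativeImputer(sample_posterior=True, random_state=0)
completed = pd.DataFrame(imputer.fit_transform(raw), columns=raw.columns)

print(raw.isna().sum().sum(), "missing cells before;", completed.isna().sum().sum(), "after")
```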

We now return to the exploratory analysis on the complete data, which provides the final dimensions for the confirmatory analysis. Although the exploratory analysis indicates a certain number of dimensions at the outset, the confirmatory model may not fit that expected number of dimensions well. We resolve this discrepancy by parceling the set of items into (usually) three independent sets. This intermediate step often improves the goodness of fit of the measurement model dramatically and produces more stable indices.
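A minimal sketch of the parceling step, continuing with the imputed data from the previous snippet; the assignment of the nine hypothetical items to three parcels is arbitrary here, whereas in practice the parcels are balanced, for example by skewness as discussed next.

```python
import pandas as pd

# Hypothetical assignment of the nine items to three parcels
parcels = {
    "parcel_1": ["q1", "q4", "q7"],
    "parcel_2": ["q2", "q5", "q8"],
    "parcel_3": ["q3", "q6", "q9"],
}

# Each parcel is the mean of its items; the measurement model is then fit on
# the three parcels instead of the nine individual items
parceled = pd.DataFrame({name: completed[cols].mean(axis=1)
                         for name, cols in parcels.items()})
print(parceled.describe().round(2))
```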

For instance, by combining right-skewed items with left-skewed items we obtain an approximately symmetric distribution for the latent index. Several iterations are usually necessary to locate uncorrelated items and other problematic structures and thereby improve the goodness of fit. The final outcome should be a stable and valid measurement model that can be integrated into the full structural equation model.
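One way to operationalize the skewness balancing just described – a sketch, not the author's exact rule – is to rank the items by skewness and place items from opposite ends of the ranking in the same parcel.

```python
from scipy.stats import skew

# Item skewness on the imputed data: negative = left-skewed, positive = right-skewed.
# Pairing items from opposite ends of this ranking within a parcel pushes the
# parcel mean toward a symmetric distribution.
item_skew = completed.apply(skew).sort_values()
print(item_skew.round(2))
```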

One more point deserves attention. We start with the assumption that all items in the model are normally distributed, but this is not always the case. Many variables are ordinal or counts, following approximately a Poisson or similar count distribution; others may be categorical, and so on. Ignoring these distributional patterns may produce biased estimates, whereas correcting for them lets us expect the estimates to be unbiased.
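A rough diagnostic along these lines – purely illustrative, with an arbitrary cut-off of seven distinct values – flags items that look ordinal or count-like rather than continuous, so that an appropriate estimator or correction can be chosen before fitting the measurement model.

```python
import pandas as pd

def distribution_hint(item: pd.Series) -> str:
    """Very rough heuristic for an item's measurement level (illustrative only)."""
    values = item.dropna()
    if not pd.api.types.is_numeric_dtype(values):
        return "categorical"
    if (values % 1 == 0).all() and values.nunique() <= 7:
        return "ordinal / count-like"
    return "approximately continuous"

print(completed.apply(distribution_hint))
```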

Dr. Gabriel Liberman – Data-Graph Statistical Consulting 
