AB-44 - Paper
Immediate assessment of batch classification quality.
Francois J. LESCREVE *
Center for Recruitment and Selection
Belgian Armed Forces
1. Introductory section
Batch classification is used in selection settings where the data from a number of applicants are processed in order to decide which applicants will be assigned to a number of different vacant jobs. In contrast to sequential systems, batch classification processes the data of a whole group of applicants simultaneously. This is appropriate in settings where enlistment is organized in groups, such as annual recruitments. Modern batch classification systems are generally composed of two major elements.
The first element attempts to quantify the value of assigning a specific person to a specific job or to a certain type of job. In the military, similar jobs are often grouped as Military Occupation Specialties (MOS) or trades. The quantified values are called payoff-values and can be computed in several ways. Multiple linear regression (MLR) is widely used. In MLR models, the payoffs are usually predicted performance scores on the external criterion that served as the dependent variable when the model was designed. Another method for producing payoff-values is the Subject Matter Experts (SME) method, in which subject matter experts are asked to assign a specific weight to each selection variable for each MOS or trade. The payoffs can then be calculated as weighted sums. Artificial Neural Networks are also promising tools for generating payoff-values. The payoffs are computed for all person-job combinations and usually arranged in a payoff-matrix with the applicants as rows and the jobs as columns. The matrix is then made square by adding dummy-jobs.
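The SME weighted-sum approach and the padding with dummy-jobs can be sketched as follows. This is a minimal illustration and not the Psychometric Model itself: the variable names, weights, scores and the dummy payoff of 0.0 are all hypothetical, and the sketch assumes one vacancy per trade and at least as many applicants as jobs.

```python
# Hypothetical SME weights per trade and applicant scores (illustrative only).
weights = {
    "MOS_A": {"intelligence": 0.6, "fitness": 0.4},
    "MOS_B": {"intelligence": 0.2, "fitness": 0.8},
}
applicants = {
    "app1": {"intelligence": 70.0, "fitness": 50.0},
    "app2": {"intelligence": 40.0, "fitness": 80.0},
    "app3": {"intelligence": 55.0, "fitness": 60.0},
}


def payoff(scores, trade_weights):
    """Weighted sum of the selection variables for one person-trade pair."""
    return sum(w * scores[var] for var, w in trade_weights.items())


def build_square_payoff_matrix(applicants, weights, dummy_payoff=0.0):
    """Rows are applicants, columns are jobs; pad with dummy jobs until square."""
    names = list(applicants)
    jobs = list(weights)
    if len(names) < len(jobs):
        raise ValueError("sketch assumes at least as many applicants as jobs")
    matrix = [[payoff(applicants[a], weights[j]) for j in jobs] for a in names]
    # Add one dummy-job column per surplus applicant.
    for k in range(len(names) - len(jobs)):
        jobs.append(f"DUMMY_{k + 1}")
        for row in matrix:
            row.append(dummy_payoff)
    return names, jobs, matrix


names, jobs, matrix = build_square_payoff_matrix(applicants, weights)
```

With three applicants and two trades, one dummy column is appended, so the matrix becomes 3 x 3.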
When the payoff-matrix is ready, the second major element of the classification model is used. Since the matrix is square, it is possible to link each applicant to a job (a real one or a dummy) and each job to an applicant. That can be done by means of an algorithm that maximizes the sum of the payoff-values identified by linking a person to a job. This classifies the applicants and also identifies the ones who are selected (linked to a real job) versus the ones who are rejected (linked to a dummy).
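The maximization step can be illustrated on a tiny square matrix. The paper does not name the algorithm used, so the sketch below simply tries every permutation; this exhaustive search is an assumption made for readability and is only feasible for a handful of applicants. An operational system would need a polynomial-time assignment method instead. The payoff values and job labels are hypothetical.

```python
from itertools import permutations


def best_assignment(matrix):
    """Return (total, assignment) maximizing the summed payoff.

    assignment[i] is the column (job) linked to row (applicant) i.
    Exhaustive search: suitable only for very small illustrative matrices.
    """
    n = len(matrix)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(matrix[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


# Hypothetical square payoff matrix: 3 applicants, 2 real jobs, 1 dummy job.
jobs = ["MOS_A", "MOS_B", "DUMMY_1"]
payoffs = [
    [62.0, 54.0, 0.0],
    [56.0, 72.0, 0.0],
    [57.0, 59.0, 0.0],
]

total, assignment = best_assignment(payoffs)
# Applicants linked to a dummy job are the rejected ones.
selected = {i: jobs[j] for i, j in enumerate(assignment)
            if not jobs[j].startswith("DUMMY")}
```

In this example the first two applicants receive the real jobs and the third is linked to the dummy, that is, rejected.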
2. How to assess the quality of a batch classification model?
Any organization considering or using a batch classification system will undoubtedly want to assess its quality. But how should we express this quality? To begin with, it is important to note that the outcome of such a classification model depends on quite a number of aspects. Let us briefly review some of them.
The outcome is related to the applicant group. The selection ratio, together with the level and distribution of relevant aptitudes and characteristics in the group, is obviously of paramount importance.
The outcome is also related to the vacant jobs. These do not only affect the selection ratio but also have a certain level of differentiation as to their attractiveness and the level and profile of aptitudes and characteristics they require. In general, the more differentiated the jobs are, the more powerful the effect of the classification algorithm will be.
The outcome also depends heavily on the payoff computation. The quality of the payoffs depends on factors such as the measurement quality of the variables used in the model and their differential validity, the judicious setting of the weights, and the integration of metric data, categorical data and preferences.
Finally, the classification outcome is conditioned by the chosen objective function and the algorithm used.
The complexity inherent to a batch classification system makes it rather inappropriate to summarize its quality by a single overall number. In many cases the practitioner will be better off with a series of indicators each focusing on a specific aspect of the classification quality. Such indicators are indeed available and can be grouped according to the moment at which they can be obtained.
Some indicators depend on data that are not available at the time the classification algorithm is run. These criterion data typically comprise attrition rates and performance measurements. Quality indicators based on such data include predictive validity coefficients of the payoff-values, differential validity of predictors, logistic regression models against pass-fail criteria, cross checks of the linear models used, etc. Such quality indicators can be called delayed or a posteriori indicators.
Other quality indicators rely only on data that are available immediately after the classification algorithm has run. These can be labeled a priori or immediate quality indicators. Given the title of this paper, we will concentrate our attention on these. These indicators are less powerful than the ones relying on criterion data and cannot provide the practitioner with final statements concerning the quality of the system used, but they offer one tremendous advantage: they allow him or her to modify certain parameters used in the classification model before the assignment decisions are carried out. In other words, these indicators make it possible to detect problems in the classification outcome and to rectify them by altering the parameters of the classification system. The classification model can subsequently be rerun until the classification quality is acceptable. It is only at that time that the applicants are informed of the outcome.
We'll now review some immediate quality indicators. To illustrate them, we'll also present some screen views originating from the Measures of Merit module of the Psychometric Model, which is the batch classification model currently used in the Belgian Armed Forces. The examples come from the classification for the annual Flemish non-commissioned officer recruitment in July 1998.
2.1. The fill rate.
The first indicator is the fill rate. An important issue is whether or not the vacant jobs will be filled. If the classification model doesn't find suitable applicants for all jobs, how many and which jobs are then left vacant? Did the algorithm have a lot of choice to fill a certain MOS? Are there applicants who didn't get a job but remain available in the event that another candidate declines the job he or she was assigned to? These questions can be answered easily, for instance by a table like the one presented in the following figure.
[Figure: fill-rate table per job]
The first three columns in this table identify the jobs. The next ones give the number of vacant jobs (NUM_JOBS) and the number of persons assigned to them (NUM_Assign). The column Shortfall indicates the number of positions which couldn't be filled. The last two columns give the number of applicants who were eligible for the job (that is, who met all criteria and therefore got a payoff-value for that job) and the number of applicants still Available after the assignment. Those are the ones who have an acceptable payoff but weren't selected in the first place. If the user wants to remedy a shortfall, he or she can lower some thresholds that reject a large number of applicants for that trade, or artificially increase the payoffs for the trade so that the algorithm will direct applicants preferentially to it. A large number of available persons, on the other hand, offers the possibility to increase certain minimum thresholds when that is believed to be desirable. One should note, however, that there is usually a lot of overlap between the groups of available persons for different trades.
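The columns described above can be derived directly from the assignment outcome. The sketch below is a hedged illustration: the function name, the input layout and the example data are hypothetical, and "available" is taken here to mean eligible for the trade but not assigned to any job, which is one reading of the description.

```python
def fill_rate_table(vacancies, eligible, assigned):
    """Build one row per trade with the fill-rate columns.

    vacancies: {trade: number of vacant jobs}
    eligible:  {trade: set of applicant ids with a payoff for that trade}
    assigned:  {applicant id: trade the applicant was assigned to}
    """
    assigned_ids = set(assigned)
    rows = []
    for trade, n_jobs in vacancies.items():
        n_assigned = sum(1 for t in assigned.values() if t == trade)
        rows.append({
            "trade": trade,
            "num_jobs": n_jobs,
            "num_assign": n_assigned,
            "shortfall": n_jobs - n_assigned,
            "eligible": len(eligible[trade]),
            # Eligible for this trade but not assigned anywhere.
            "available": len(eligible[trade] - assigned_ids),
        })
    return rows


# Hypothetical outcome for two trades.
vacancies = {"240": 2, "250": 3}
eligible = {"240": {"a", "b", "c"}, "250": {"b", "d"}}
assigned = {"a": "240", "b": "240", "d": "250"}

rows = fill_rate_table(vacancies, eligible, assigned)
```

Here trade 250 shows a shortfall of two positions with nobody left available, which is the kind of situation where lowering thresholds or raising payoffs for that trade would be considered.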
2.2. The Mean Predicted Performance.
The second quality indicator is the Mean Predicted Performance (MPP). Given that the payoff-values are computed using a model based on the relationship between predictors and performance (such as the multiple linear regression model), it becomes possible to estimate the later performance of an individual in a specific trade. After the classification model has run, one can compute the MPP for each trade and compare it with the known average performance in the same trade. This quality indicator requires stable prediction models, and those are not always available. Its diagnostic power tends to be low as well.
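Computing the MPP per trade amounts to averaging the predicted performance of each assigned applicant in his or her assigned trade. The sketch below assumes the payoffs are predicted performance scores; the data, the function name and the "known" averages are hypothetical.

```python
def mean_predicted_performance(assigned, payoffs):
    """Average predicted performance per trade over the assigned applicants.

    assigned: {applicant id: trade}
    payoffs:  {(applicant id, trade): predicted performance score}
    """
    totals, counts = {}, {}
    for applicant, trade in assigned.items():
        totals[trade] = totals.get(trade, 0.0) + payoffs[(applicant, trade)]
        counts[trade] = counts.get(trade, 0) + 1
    return {trade: totals[trade] / counts[trade] for trade in totals}


# Hypothetical assignment and predicted performance scores.
assigned = {"a": "240", "b": "240", "d": "250"}
payoffs = {("a", "240"): 62.0, ("b", "240"): 58.0, ("d", "250"): 71.0}
known_average = {"240": 57.0, "250": 73.0}  # hypothetical historical averages

mpp = mean_predicted_performance(assigned, payoffs)
# Positive values: the assigned group is predicted to do better than usual.
difference = {trade: mpp[trade] - known_average[trade] for trade in mpp}
```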
2.3. Descriptive statistics for the groups assigned to trades.
Another approach to assessing the classification quality is based on the descriptive statistics of the groups of applicants that are assigned to the different trades. Aptitudes and other characteristics measured at the interval or ratio level can be summarized by their average, whereas categorical data can be shown in contingency tables.
[Figure: table of variable averages per group]
The three columns on the left side present the name of the variable and its theoretical minimum and maximum values. The next three columns show the averages of the row variable for all applicants in the model (ALL), all assigned applicants (ALL_ASS) and all applicants who were not assigned to a job (ALL_NOT). The remaining columns show the average of the row variable for the applicants assigned to the jobs identified by the column header. When examining the variable ST_PINP for instance (standardized intelligence measurement), one can see that the group of assigned persons has an average of 68.4 whereas the not-assigned group has only 44.7. The persons assigned to job 2 even have an average of 75.9.
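One row of such a table can be computed as sketched below. The column labels ALL, ALL_ASS and ALL_NOT follow the description above; the function name and the scores are hypothetical and do not reproduce the 1998 data.

```python
from statistics import mean


def group_means(scores, assigned):
    """Averages of one metric variable for the reference groups and per job.

    scores:   {applicant id: value of the variable}
    assigned: {applicant id: job id}; applicants not listed were not assigned
    """
    result = {"ALL": mean(list(scores.values()))}
    in_group = [v for a, v in scores.items() if a in assigned]
    out_group = [v for a, v in scores.items() if a not in assigned]
    if in_group:
        result["ALL_ASS"] = mean(in_group)
    if out_group:
        result["ALL_NOT"] = mean(out_group)
    by_job = {}
    for applicant, job in assigned.items():
        by_job.setdefault(job, []).append(scores[applicant])
    for job, values in by_job.items():
        result[job] = mean(values)
    return result


# Hypothetical standardized intelligence scores.
st_pinp = {"a": 80.0, "b": 70.0, "c": 40.0, "d": 60.0, "e": 50.0}
assigned = {"a": "JOB_1", "b": "JOB_2", "d": "JOB_2"}

row = group_means(st_pinp, assigned)
```

A large gap between ALL_ASS and ALL_NOT for a variable indicates a strong selection effect of the model on that variable.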
This table is very useful for comparing the assigned group with the not-assigned group to see the selection effect of the classification model on each variable. It also contains the data needed to compare the averaged aptitude profiles of different groups. Such a table, however, is not very user-friendly for that purpose. That is the reason why another, graphical, instrument was developed. The next figure presents it.
[Figure: profile comparison graph screen]
This screen makes it very easy to generate graphs. The user can choose any metric variables he or she wants and then select certain profiles. These profiles can include any individual applicant, groups assigned to a specified Job-ID or MOS, and the three reference groups: all assigned, all not-assigned and all applicants in the model.
In this example, some average aptitudes are compared for the groups assigned to the MOS Air Traffic Control (Profile 1, MOS 240) and Airfield Defense (Profile 2, MOS 250). On average, the Air Traffic Control group scores higher on General Intelligence (ST_PINP) and Technical English (ST_ENG_T) and lower on Physical Fitness (ST_PHYS). The personality score (ST_KAHO) of both groups is similar. Since this is in accordance with what was desired, no corrective action is required.
Both previous screens focused on metric data. For categorical data, one can check the frequencies of the different variable-classes for several relevant groups.
[Figure: frequency table for categorical variables]
The left column in this table exhibits the categorical variable name and the second column shows the different categories or classes of that variable. The remaining columns contain the observed frequencies of the variable-class in the row for different groups: the three reference groups and the groups assigned to the jobs in the column header.
When looking at the variable FAC_P for instance, which describes the general medical fitness with three classes (1-2-3) that do not exclude the candidate, we notice that no applicant got a FAC_P of 1, 305 applicants got a 2 and 61 of them got a 3. When we look further and use some elementary statistics, we can say that the odds of being assigned rather than not assigned are at least 2.5 times higher for the FAC_P 2 candidates than for the FAC_P 3 applicants (lower bound of the 95% exact confidence interval). This can be related to the coefficients used for the classes of the variable to check whether the outcome is desirable.
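The odds comparison can be sketched from a 2 x 2 contingency table. Two caveats apply. First, the paper does not give the assigned versus not-assigned split per FAC_P class, so the counts below are hypothetical. Second, the paper reports an exact confidence interval, whereas this sketch swaps in the simpler large-sample approximation on the log odds ratio (Woolf's method), which will not reproduce the exact bounds.

```python
import math


def odds_ratio_with_interval(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% interval for a 2 x 2 table.

    a, b: assigned / not assigned in the first class
    c, d: assigned / not assigned in the second class
    Uses the normal approximation on the log odds ratio, not an exact
    method; all four cells must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: class 2 has 120 assigned and 80 not assigned,
# class 3 has 10 assigned and 30 not assigned.
ratio, lower, upper = odds_ratio_with_interval(120, 80, 10, 30)
```

A lower bound clearly above 1 suggests that the difference in assignment odds between the two classes is unlikely to be due to chance alone.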
2.4. Respect of the applicants' preferences.
A modern classification system shouldn't be based on aptitudes only but needs to include the expressed preferences of the applicants as well. When this is the case, it will be of interest to see to what extent the classification model respected the preferences of the applicants. In the Psychometric Model, the applicants are requested to express their preference for each trade on a 1 to 99 scale. As a quality indicator for the classification model, we'll compute the average preference for a specific trade within the group of applicants assigned to that same trade.
[Figure: table of average preferences per MOS and group]
In this table, the cell values represent the average preference of the group indicated in the column header for the MOS in the left column. The column ALL indicates the popularity of a MOS. The most relevant cells are highlighted. They represent the preference for a MOS as expressed by the group assigned to that MOS. Low values indicate, to a certain extent, that the applicants assigned to that trade didn't really want it. Very high values could result from giving too much weight to the choices of the applicants, perhaps at the expense of taking their aptitudes sufficiently into consideration. Problems discovered through this table can be corrected by adapting the weight given to the preferences of the applicants.
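The ALL column and the highlighted cells can be computed as sketched below. The function name, the OWN_GROUP label and the preference data are hypothetical; the sketch assumes every applicant rated every trade on the 1 to 99 scale.

```python
from statistics import mean


def preference_summary(preferences, assigned):
    """Average preference per trade, overall and within the assigned group.

    preferences: {applicant id: {trade: preference on the 1..99 scale}}
    assigned:    {applicant id: trade}
    """
    trades = sorted({t for prefs in preferences.values() for t in prefs})
    summary = {}
    for trade in trades:
        own = [preferences[a][trade] for a, t in assigned.items() if t == trade]
        summary[trade] = {
            # Popularity of the trade among all applicants.
            "ALL": mean([p[trade] for p in preferences.values()]),
            # Preference for the trade among those assigned to it.
            "OWN_GROUP": mean(own) if own else None,
        }
    return summary


# Hypothetical preferences of three applicants for two trades.
preferences = {
    "a": {"240": 90, "250": 20},
    "b": {"240": 60, "250": 70},
    "c": {"240": 30, "250": 90},
}
assigned = {"a": "240", "b": "250"}

summary = preference_summary(preferences, assigned)
```

An OWN_GROUP value well below the ALL value for a trade would signal that the model is sending people to a trade they like less than the average applicant does.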
2.5. Respect of set profiles.
The following quality indicator attempts to check whether the profiles defined by the weights used to compute the payoffs for a trade correspond to the aptitude profiles of the applicants assigned to that trade. To do so, one needs to consider the variables used to calculate the payoffs for a specific trade. If one standardizes these over all the acceptable applicants to a common mean and variance, and then takes the average of these standardized variables for the group of applicants assigned to that trade, the departure from the overall mean can be seen as an indicator of the weight actually given to the variable in the model. It is further possible to express these trade-averages and the weights used to compute the payoffs on the same scale and to compare them pairwise. This can be done graphically or by means of correlations.
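A minimal sketch of this check, under stated assumptions: variables are standardized to z-scores over all acceptable applicants, the z-scores are averaged over the group assigned to the trade, and the resulting profile is compared with the trade's weights through a Pearson correlation. The helper names, weights and scores are hypothetical.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den


def mean_standardized_profile(scores, group, variables):
    """Mean z-score per variable for `group`, standardized over all applicants."""
    profile = {}
    for var in variables:
        values = [s[var] for s in scores.values()]
        mu, sigma = mean(values), pstdev(values)
        profile[var] = mean([(scores[a][var] - mu) / sigma for a in group])
    return profile


# Hypothetical weights for one trade and scores of all acceptable applicants.
trade_weights = {"intelligence": 0.5, "english": 0.3, "fitness": 0.2}
scores = {
    "a": {"intelligence": 80.0, "english": 70.0, "fitness": 50.0},
    "b": {"intelligence": 60.0, "english": 60.0, "fitness": 50.0},
    "c": {"intelligence": 40.0, "english": 50.0, "fitness": 60.0},
    "d": {"intelligence": 60.0, "english": 60.0, "fitness": 40.0},
}
assigned_to_trade = ["a", "b"]

variables = list(trade_weights)
profile = mean_standardized_profile(scores, assigned_to_trade, variables)
r = pearson([trade_weights[v] for v in variables],
            [profile[v] for v in variables])
```

A clearly positive correlation suggests that the assigned group departs from the overall mean mostly on the variables that were weighted most heavily, which is what the set profile intends.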
2.6. Specificity of set profiles.
The last proposed quality indicator consists of the correlation matrix of the payoffs. Highly positively correlated payoffs indicate a possible lack of differentiation between the requested aptitude profiles. If the concerned trades are not considered to be very similar, one should try to identify means to discriminate between them and to incorporate these in the classification model.
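The correlation matrix of the payoffs can be obtained by correlating the columns of the payoff-matrix pairwise over the applicants. In the sketch below the payoff values are hypothetical, and the 0.9 threshold used to flag a pair of trades is an arbitrary illustrative choice, not a value from the paper.

```python
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den


def payoff_correlations(payoffs_by_trade):
    """Pairwise correlations between the payoff columns of the trades."""
    return {
        (t1, t2): pearson(payoffs_by_trade[t1], payoffs_by_trade[t2])
        for t1, t2 in combinations(sorted(payoffs_by_trade), 2)
    }


# Hypothetical payoffs of four applicants for three trades.
payoffs_by_trade = {
    "MOS_A": [62.0, 56.0, 57.0, 40.0],
    "MOS_B": [54.0, 72.0, 59.0, 65.0],
    "MOS_C": [61.0, 57.0, 58.0, 41.0],
}

correlations = payoff_correlations(payoffs_by_trade)
threshold = 0.9  # arbitrary cut-off for "highly correlated"
flagged = [pair for pair, r in correlations.items() if r > threshold]
```

In this example only the pair MOS_A and MOS_C is flagged, so those are the two trades whose requested profiles would deserve a closer look.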
3. Future directions
When using the immediate quality indicators as described, a practitioner can get a good idea of the quality of the batch classification system used. Such a quality assessment, however, still requires a good amount of expertise. It is therefore recommended to develop expert systems that detect problems and suggest ways to correct them, in order to assist the users of such classification systems.
_________________________
References
DARBY, M., SKINNER, J. and ALLEY, W. (1996) A Methodology for Evaluating the Classification Potential of Experimental Tests. Proceedings of the 38th annual conference of the International Military Testing Association, p. 285-289.
DARBY, M., ALLEY, W. and CHENG, C. (1996) The Practical Benefits of Personnel Testing: An extension of the Taylor-Russell Tables to Multiple Job Categories. Ibidem, p. 268-273.
DARBY, M., GROBMAN, J. et al. (1996) The Generic Assignment Test and Evaluation Simulator (GATES), user manual. Human Resources Directorate, Manpower and Personnel Research Division, USAF.
LESCREVE, F. (1993) A Psychometric Model for Selection and Assignment of Belgian NCOs. Proceedings of the 35th Military Testing Association. US Coast Guard, p. 527-533.
LESCREVE, F. (1995) The Selection of Belgian NCOs: The Psychometric Model goes operational. Proceedings of the 37th annual conference of the International Military Testing Association. Canadian Forces Personnel Applied Research Unit, p. 497-502.
LESCREVE, F. (1997) Data modeling and processing for batch classification systems. Proceedings of the 39th annual conference of the International Military Testing Association (in press)