Ordering of Skills: A/B testing

As the ordering of skills is likely to strongly influence users' selections, testing different orderings is essential to identify the options that best support users' experience.

This page summarises a proposal for A/B testing of how skills are elicited for SAyouth.mobi users’ inclusive CVs. The research is motivated by concerns that users may struggle to choose between different skills, and by an interest in comparing different potential metrics for skill priority. The proposed research is an on-platform randomised evaluation of how users are asked to capture the skills relating to their work experiences.

Background

To highlight and surface the skills South African youth have developed through the activities they have performed and are currently involved in, we conducted a survey asking microentrepreneurs to rate the importance of different skills associated with their occupation. Specifically, we asked approximately 1,500 microentrepreneurs: “On a 1 to 5 scale, where 1 is ‘Not important at all’ and 5 is ‘Very important’, how important is [skill tag] to your work?”. The results showed a tendency for participants to assign high importance to all skills presented to them.

This poses a potential challenge for the Harambee platform. If job seekers prioritize all skills equally, they might be overly reliant on the initial set of skills they encounter during their platform search. This could lead them to select skills that aren't necessarily the most relevant for their desired careers, hindering their job search effectiveness. To address this concern, our research aims to investigate how the order of skill presentation on the Harambee platform influences job seeker choices.

Thus we seek to answer the following questions:

  1. What is the impact of presenting skills to platform users in different orders on the skills selected?

  2. Is there an order that offers users the best support in their job search journey?

  3. Do different skill priorities have downstream impacts on job search patterns?

  4. Do different skill priorities have an impact on employer interest in the candidate?

  5. How often should we ask users to update their skills?

Proposal

We propose implementing on-platform A/B testing of how to surface skills to users. This testing should help us to evaluate how users respond to different presentations of skills and consequently refine how best to elicit skill reporting from platform users.

The ideal method for testing skill surfacing is to evaluate actual user skill selections on the platform. This would be achieved by comparing how similar (statistically comparable) users select skills when presented with different list orders and with lists of varying numbers of skills. We therefore wish to randomly assign users to be presented with different lists when they are asked to select the skills they wish to include on their CV.

We propose a number of potential skill priority metrics:

  1. Random: This serves as a neutral benchmark. It is also the most theoretically “hands off”.

  2. Transferability score: the number of appearances of this skill in ESCO divided by the number of occupations in ESCO (see the sketch after this list). This may nudge users towards listing skills that are commonly used in many other opportunities, or in jobs commonly listed on the platform. If users are encouraged to select skills that are more transferable, we may be able to improve the visibility of skills that firms are interested in knowing about.

  3. Inverse scarcity of skills: the inverse of the number of other users reporting that skill, conditional on that number being above 10 (to rule out idiosyncratic skills). Rarer skills may help job seekers stand out from the rest of the field of applicants, and help firms differentiate between job seekers more easily.

  4. Some weighted average of these goals: It is likely that some mixture of these priorities is appropriate. It may also be that different jobs or different fields require different weightings of the set of priorities. Developing a preferred weighted average may require some desk work to gather input from field experts and the literature on the relative importance of different measures for particular industries or worker experience profiles.

  5. “Other” field to search for skills: Offering users the chance to describe their own skills provides another, more extreme, hands-off benchmark. This arm can also be used to validate the accessibility of the language used by the taxonomy to describe the skills surfaced to users.

  6. More skill options: For each of the above options we propose surfacing five skills. In this option we propose extending the list to surface more skill options for users to choose from.
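
As a rough illustration of the two formula-based metrics above, the sketch below computes them from a skill-to-occupation mapping and from existing user skill selections. The data structures, example values, and field names are assumptions for illustration only; the actual ESCO extract and platform data model may differ.

```python
# Illustrative inputs (assumed structures, not the platform's actual data model):
# - esco_skill_occupations maps each skill to the ESCO occupations that list it
# - user_skills maps each platform user to the skills they currently report
esco_skill_occupations = {
    "customer service": {"shop assistant", "call centre agent", "waiter"},
    "stock taking": {"shop assistant", "warehouse worker"},
    "welding": {"welder"},
}
n_esco_occupations = 3000  # assumed size of the occupation taxonomy

user_skills = {
    "user_1": {"customer service", "stock taking"},
    "user_2": {"customer service"},
    "user_3": {"welding"},
}

def transferability_score(skill: str) -> float:
    """Appearances of this skill across ESCO divided by the number of occupations."""
    return len(esco_skill_occupations.get(skill, set())) / n_esco_occupations

def inverse_scarcity(skill: str, min_users: int = 10) -> float | None:
    """Inverse of the number of other users reporting the skill,
    only defined once at least `min_users` users report it."""
    n_users = sum(skill in skills for skills in user_skills.values())
    if n_users < min_users:
        return None  # too idiosyncratic to rank on scarcity
    return 1.0 / n_users

# Rank a candidate skill list by transferability, most transferable first.
candidates = ["customer service", "stock taking", "welding"]
print(sorted(candidates, key=transferability_score, reverse=True))
```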

We will follow a stratified random sampling strategy based on gender and the type of work the individual has performed (i.e. currently unemployed, microentrepreneur, or employed). The sample will be randomly assigned into treatment groups and each group will be presented with a pre-ranked set of skills. The testing could run for one to three weeks depending on logistical constraints for Harambee and the target sample size for each treatment cell. A/B testing in principle does not require detailed power calculations, because patterns are informative even if they do not cross the threshold for academic statistical certainty. Samples as small as 20 users per arm could already offer indicative patterns, although on-platform delivery may make it easy to reach far more users. A larger sample would naturally offer more precise estimates of effect and the opportunity to account for the role demographic characteristics may play in mediating responses to different skill list orderings. We propose two levels of randomisation.

Across Users Effect

Randomise at the day or week level. While we prefer day-level randomisation, we are happy to work within the logistical constraints of bringing the various lists live on the platform; we could switch every two days, or every week, if one of these is easier. In each new randomisation block we will apply a different skill priority ranking, cycling through each of the treatment arms repeatedly over the three weeks. This will generate a treatment arm for each list priority order and therefore enable the central comparison of different potential skill priorities.
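
A minimal sketch of how this day-level rotation could be implemented follows, assuming a fixed launch date and a configurable block length in days; the arm labels, dates, and names are illustrative rather than the platform's actual configuration.

```python
from datetime import date

# Treatment arms correspond to the skill priority orderings above (illustrative labels).
TREATMENT_ARMS = [
    "random",
    "transferability",
    "inverse_scarcity",
    "weighted_average",
    "other_free_text",
    "more_options",
]

START_DATE = date(2024, 1, 1)   # assumed launch date of the test
BLOCK_LENGTH_DAYS = 1           # could be 2 or 7 if daily switching is impractical

def arm_for_date(today: date) -> str:
    """Every user arriving in the same block sees the same priority ordering;
    blocks cycle through all arms repeatedly for the duration of the test."""
    days_elapsed = (today - START_DATE).days
    block_index = days_elapsed // BLOCK_LENGTH_DAYS
    return TREATMENT_ARMS[block_index % len(TREATMENT_ARMS)]

print(arm_for_date(date(2024, 1, 4)))  # -> "weighted_average"
```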

Within Users Effect

Once a month, prompt users with an opportunity to review the skills they reported previously, surfacing skills in a randomly chosen priority order (which may be the same as or different from the priority they were previously shown). This will enable us to examine whether the same user reacts differently to receiving skills in different orders and, for users who are shown the same list twice, how stable elicitation is. A measure of skill reporting stability is useful because it can inform how often the platform should prompt users to update their skills or CV. There are a number of reasons to believe this will be less useful than the primary randomisation; however, it will be important that the platform has the technical capacity to encourage users to modify which skills they capture in their inclusive CV, so that we can implement the findings of the A/B testing for platform users in the control group.
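
One simple way to quantify this stability is the overlap between the skill sets a user reports at two consecutive elicitations, for example a Jaccard similarity. The sketch below assumes skill selections are stored as plain sets of skill labels; the example values are hypothetical. If stability is low even for users shown the same ordering twice, that could argue for prompting skill reviews more frequently.

```python
def jaccard_similarity(first: set[str], second: set[str]) -> float:
    """Overlap between two skill selections: 1.0 means identical, 0.0 means disjoint."""
    if not first and not second:
        return 1.0
    return len(first & second) / len(first | second)

# Example: a user's selections at the initial capture and at the monthly review.
initial = {"customer service", "stock taking", "cash handling"}
review = {"customer service", "cash handling", "record keeping"}
print(jaccard_similarity(initial, review))  # 0.5
```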

Measurement

Many of the indicators of which list to prefer can be collected from administrative data already on the platform. The primary measure of impact is the simple comparison of skills chosen under different treatment conditions. Statistical properties of the skills chosen offer information on a number of dimensions of choice quality:

  1. Correlation between skills chosen under different list orders can tell us the degree to which the different orders make a difference to the skills ultimately chosen.

  2. Variation in skills chosen within a single priority type can tell us whether all individuals are behaving the same way, which would be indicative of mechanical clicking and limited engagement with how individual skills match their experience, rather than thoughtful engagement.

  3. Measures of how often users choose the first three skills, the first skill, the last skill, or the skill most centrally displayed on the screen can serve as indicators of mechanical clicking or reduced engagement (see the sketch after this list).

  4. We could manually review a subset of respondents to evaluate skill choice against their profile.

  5. UX measures such as time taken to complete fields and the number of users who start but abandon part way through.
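
The measures in points 1 and 3 could be computed directly from logged selections. The sketch below assumes each selection event records the ordered list a user saw, the arm they were in, and the skills they picked; the event structure and field names are illustrative, not the platform's actual schema.

```python
from collections import Counter

# Assumed log format: one record per skill-selection event.
events = [
    {"arm": "random", "shown": ["a", "b", "c", "d", "e"], "chosen": ["a", "b"]},
    {"arm": "random", "shown": ["c", "a", "e", "b", "d"], "chosen": ["c", "a"]},
    {"arm": "transferability", "shown": ["b", "a", "d", "c", "e"], "chosen": ["b", "d"]},
]

def share_chosen_in_top_k(events, k: int = 3) -> float:
    """Share of all selections falling in the first k display positions;
    values near 1 in every arm would suggest mechanical top-of-list clicking."""
    top, total = 0, 0
    for e in events:
        top_positions = set(e["shown"][:k])
        top += sum(skill in top_positions for skill in e["chosen"])
        total += len(e["chosen"])
    return top / total if total else 0.0

def skill_frequencies(events, arm: str) -> Counter:
    """How often each skill is chosen under a given arm; comparing these counts
    across arms speaks to how much list order changes the skills ultimately chosen."""
    counts = Counter()
    for e in events:
        if e["arm"] == arm:
            counts.update(e["chosen"])
    return counts

print(share_chosen_in_top_k(events))
print(skill_frequencies(events, "random"))
```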

Demographics which may mediate how users interact with skill lists can also be drawn from user profiles. We would likely focus on:

  • Gender

  • Date of birth (Month and Year)

  • Occupation/activity

  • Education level

  • Geographic location

  • Experience history (previous years employed)

  • Disability status

All of this information is available for users on the platform, so it would simply be a question of pulling the relevant administrative data.

If there is interest in using downstream outcomes to adjudicate between list priorities, we could also draw data on which jobs individuals who have been exposed to different lists click on and apply for. We could also ask firms to select skills they would like to see prioritized among platform applicants from similarly ordered lists to obtain a measure of firm preferences.

We could also collect additional data using on-platform prompts, off-platform follow-up surveys, or a few small focus groups. Additional data would enable us to measure user satisfaction with the different lists, elicit users’ criteria for selecting between skills, and evaluate how confident or committed users are about the skills they selected.
