Project 3

Variability in the wild: Assessing the influence of random and intentional variability differences in a massive online dataset

Overview

Differences in the amount of variability experienced during learning are not limited to the context of controlled psychology experiments. In the real world, individuals performing an identical task may experience differing levels of variability due to internal factors (e.g., self-selection; personal preference), externally manipulated factors (e.g., the decisions of a teacher or coach), or simple random chance. The dataset utilized for Project 3 features a large set of users experiencing different amounts of variability via both external manipulation and randomization.

Data generated from the online brain training games of Lumos Labs have been used in many cognitive science investigations in recent years (Gershman 2020; Jaffe et al. 2022; Steyvers and Schafer 2020; Steyvers et al. 2019). A subset of these prior works (Donner and Hardy 2015; Steyvers and Benjamin 2019) examined data from Lost in Migration (LIM), a gamified version of the Eriksen flanker task (Eriksen and Eriksen 1974), which is also the source of data for the current project. These prior works have provided strong evidence that users do tend to improve their performance in LIM over the course of both short-term and long-term practice, though none have examined the influence of variation on such improvements.

Unlike previous investigations using data from LIM, the dataset under consideration for the proposed project includes logs from users who trained with the standard/baseline version of LIM, and users who were assigned to a “split-test” version. Split-tests are internal experiments conducted by Lumos Labs, wherein a subset of users are assigned to use a modified version of the game. Split-test users completed anywhere from 1-15 sessions with the modified game before being permanently switched to the baseline version. In the present case, each of the split-test versions of LIM exposed users to a wider range of possible values of the game stimuli (e.g., rotation, size, orientation). The users assigned to split-tests were found to perform worse than baseline users (personal communication with Lumos Labs). However, the analyses conducted by the Lumos Labs researchers only compared baseline users against split-test users while the split-test users remained in the modified version of the game. Such comparisons are most analogous to those made between varied and constant participants in the learning stages of standard psychology experiments. A common finding from such experiments is for higher-variation participants to perform worse during an initial learning phase, but to then outperform lower-variation participants in a subsequent generalization or retention phase (Berniker, Mirzaei, and Kording 2014; Catalano and Kleiner 1984; Gonzalez and Madhavan 2011; Lee, Lovibond, and Hayes 2019; Sabah et al. 2019; Wrisberg, Winter, and Kuhlman 1987). It may thus be interesting to also compare split-test users to baseline users AFTER the split-test users have switched into the baseline version of the game, analogous to the design of previous studies that trained varied and constant subjects in distinct conditions, and then tested all subjects under the training conditions of the constant subjects (Goode, Geraci, and Roediger 2008; Green, Whitehead, and Sugden 1995; Kerr and Booth 1978). Unlike in-lab studies, which typically only examine the effect of the presence/absence of higher training variation, the fact that the split-test users in our LIM dataset also varied in the amount of varied training they received (1-15 split-test games before switching to baseline; see Figures 5 and 6) will enable us to parametrically examine the influence of the amount of variation.

Methods

Dataset and Game Description

Link to YouTube video of task demo


Figure 1: The Lost In Migration game interface. The appearance of a single trial of the game. The user responds by indicating the direction of the central target bird (in this case, up).

The dataset we received had already undergone some filtering, retaining only users who completed at least 5 game sessions, and only users who completed at least 99% of their game sessions on the browser-based web version of the game (as opposed to the mobile version). The total dataset consists of 8,551 individual users, 168,307 separate game sessions, and 8,207,980 total trials of the game. The demographic data included with each user consist of age, gender, and highest education level achieved. The dataset also includes a global timestamp for each game session, which will allow us to control for the influence of temporal spacing between game sessions. Such considerations may be important for reducing noise in our model predictions, given that users have full control over the timing of each game session (e.g., the first 10 gameplays could occur on the same day, or be spread out over a month).
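As a concrete illustration, the minimal pandas sketch below shows how inter-session spacing could be derived from the global timestamps. The column names (user_id, timestamp) are hypothetical placeholders, not the actual field names in the Lumos Labs export.

```python
# A sketch of computing temporal spacing between consecutive game sessions
# for each user; column names are illustrative assumptions.
import pandas as pd

sessions = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2020-01-01", "2020-01-01", "2020-02-01",
        "2020-01-05", "2020-01-06",
    ]),
})

sessions = sessions.sort_values(["user_id", "timestamp"])
# Hours elapsed since the user's previous session (NaN for the first session).
sessions["gap_hours"] = (
    sessions.groupby("user_id")["timestamp"].diff().dt.total_seconds() / 3600
)
print(sessions)
```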


Figure 2: The seven spatial layouts in which the target and flankers can appear. The layouts vary trial to trial, and all seven layouts are present in both the baseline and split-test versions of LIM. In all cases, the task is to respond with the direction of the central bird.

Each game of LIM has a set duration of 45 seconds, during which users complete as many trials as can fit within that time frame. Figure 1 illustrates the screen view for a trial of the game. On each trial, users are presented with a configuration of 5 “birds”: a central “target” bird and 4 “flanker” birds. The task is to indicate the direction of the central bird as quickly as possible using the up/down/right/left arrow keys on their device. The task parameters that define each trial include the direction of the target bird, the direction of the flanker birds, the XY coordinates of the target bird on the screen, and the spatial layout of the birds (seven layouts, shown in Figure 2). On 50% of trials, the flanker birds point in the same direction as the target bird (congruent), while on the other half of trials the flanker birds point in one of the 3 incongruent directions. The dependent measures recorded on each trial are the reaction time, accuracy (correct or incorrect response), response direction (up, down, right, left), and the gamified performance score shown to the users (a combination of reaction time and accuracy). Figure 3 shows the rapid improvements that occur in the early stages of the task, and also indicates the fairly large effect of user age on performance.


Figure 3: Learning curves over the first 200 trials. Users are separated into bins according to their age. Note that these smooth, aggregated curves are not necessarily representative of the typical individual user.
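To make the trial-level structure concrete, the sketch below encodes the task parameters and dependent measures described above as a simple record. The field names are illustrative assumptions, not the actual column names in the dataset.

```python
# A sketch of the trial-level schema implied by the task description;
# all field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Trial:
    target_direction: str    # "up", "down", "left", "right"
    flanker_direction: str   # same four values; equals target on congruent trials
    target_x: float          # X coordinate of the target bird on screen
    target_y: float          # Y coordinate of the target bird on screen
    layout: int              # one of the seven spatial layouts (1-7)
    reaction_time_ms: float  # dependent measure
    correct: bool            # accuracy (correct or incorrect response)
    response: str            # direction actually pressed
    score: float             # gamified score shown to the user

def is_congruent(t: Trial) -> bool:
    """Congruent trials have flankers pointing the same way as the target."""
    return t.flanker_direction == t.target_direction
```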

Split-Test Data

A subset of the users in the dataset participated in an internal experiment, or “split-test”, conducted by Lumos Labs, wherein new users could be randomly assigned to either the standard version of the game (described above) or to one of several modified game versions. The split-test versions of LIM each expanded the range of possible values for a dimension of the task (see Figure 4). Note that users could be assigned to a version of the split-test that included one, two, or all three of the expanded dimensions.

Expanded dimensions of the split-test
  • Rotation and orientation version - a continuous range of discrepancies between target and flanker birds on incongruent trials (compared to values fixed at 90, 180, or 270 degrees in the base version)
  • Distance between birds version - a continuous 0-20px distance between target and flankers (constant value of 41px in the base version)
  • Size of birds version - the sizes of the target bird and flanker birds (same size for all 4 flankers) are independently drawn from a range of 35-60px (both fixed at 41px in the base version)

Figure 4: Illustration of the expanded dimensions in the split-test versions of the game. Users could be assigned to a version of the split-test that included one, two, or all three of the expanded dimensions (note that orientation and rotation are manipulated together). The distance-between-flankers manipulation is not shown.

At the start of each new game, split-test users had a small probability of being permanently switched to the standard version of the game. Figure 5 shows the number of users in the baseline or split-test version of the game over the first 10 games. Figure 6 separates users who only experienced the baseline version from users who started with the split-test and eventually switched to the baseline version. A small number of users who began with the baseline version were later assigned to a single game of a split-test version; however, the frequency of such users may be too low to be useful for the present purposes.

Figure 5: Frequency of users playing in the baseline and split-test versions of LIM over the first 10 games.

Figure 6: Frequency of users in each possible sequence of experiencing the baseline and split-test versions. Base users are those who experienced purely the baseline version. Split indicates the number of users who have thus far only completed the split-test version, and split_base indicates users who started out in the split-test but have switched into the baseline version. Base_split users are those who started with the baseline version, were then randomly assigned to a single game of the split-test, and then switched back to the baseline version (thus each of the green bars reflects a distinct set of users).

Trial-by-trial influence of variability

We can now turn to the nature of random variability in the game.

Evaluating the influence of trial-level variability in LIM is possible due to: 1) the existence of many different types of trials, or trial-states, and 2) the randomization of trial presentation sequences, which gives rise to between-user differences in the variability of trial types experienced.

Randomization

As mentioned above, the trial selection process is not entirely random, due to the constraint that congruent and incongruent trials occur in approximately equal proportions. However, all other aspects of the trial generation process are random (i.e., no dependence on previous states, user performance, or number of gameplays). A simple consequence of such randomization is that some users will experience a wider range of trial-states, particularly in the early stages of the game. Figure 7 illustrates a toy case wherein two users receive four trials with discrepant levels of variation in spatial layout, the XY coordinates of the birds on the screen, and bird direction.


Figure 7: A) An example sequence of four initial trials with relatively high variation in spatial layouts, bird directions, and position on the screen. B) An example sequence of four initial trials with lower variation in the aforementioned dimensions.

Measuring Trial-by-trial variability

To quantify the amount of variability experienced by a user, we may start by simply taking the number of unique trial-states that the user has encountered after a given number of trials of the game. Each trial of the game can be defined along many different dimensions, both discrete (e.g., bird direction, layout) and continuous (X and Y coordinates on the screen). For simplicity, consider only the 3 categorical/ordinal trial dimensions, which consist of 4 values for target direction, 4 for flanker direction, and 7 spatial layouts, resulting in 112 (\(4*4*7\)) distinct trial-states. Figure 8 demonstrates how users who have completed the same total number of trials will still differ in the number of unique trial-states experienced.

Figure 8: Distribution of the number of unique trial-states experienced over the course of early training. The X axis indicates the total number of game trials completed, and the Y axis reflects the number of unique trial-states experienced up to that point.
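A minimal sketch of the unique-trial-state count over the three categorical dimensions (4 target directions x 4 flanker directions x 7 layouts = 112 possible states); the trial sequences below are toy values for illustration.

```python
# Count the number of unique (target, flanker, layout) states a user has
# encountered after each trial of a presentation sequence.
def unique_states_curve(trials):
    """trials: iterable of (target_dir, flanker_dir, layout) tuples in
    presentation order. Returns the cumulative count of unique states."""
    seen = set()
    curve = []
    for state in trials:
        seen.add(state)
        curve.append(len(seen))
    return curve

# Two toy users matched on trial count but differing in variability.
user_a = [("up", "down", 1), ("left", "left", 3), ("down", "up", 5)]
user_b = [("up", "down", 1), ("up", "down", 1), ("up", "down", 1)]
print(unique_states_curve(user_a))  # [1, 2, 3]
print(unique_states_curve(user_b))  # [1, 1, 1]
```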

Although appealing for its simplicity, quantifying variability as the number of unique states experienced is a fairly coarse and limited metric. One limitation comes from the lack of spatial relations between trial-states; for example, in Figure 7 the lower-variation example in panel B consists of four unique trial-states, but nevertheless covers a far narrower region of the full state-space compared to the high-variation example in panel A. Another issue, reflected in Figure 8, is that given enough trials, the differences between users will eventually become negligible, as each user approaches the maximum of 112 unique states encountered. A more suitable measure may therefore be the uniformity of the frequency distribution of trial types (e.g., a measure of entropy).
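One way such an entropy measure could be computed is sketched below: the Shannon entropy of the frequency distribution over trial-states. Unlike the unique-state count, this continues to discriminate users even after all 112 states have been visited; the example sequences are toy values.

```python
# Shannon entropy of the trial-state frequency distribution; higher values
# indicate a more uniform (i.e., more variable) trial history.
from collections import Counter
import math

def state_entropy(trials):
    """trials: iterable of hashable trial-states. Returns entropy in bits."""
    counts = Counter(trials)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

high_var = [("up", "down", 1), ("left", "left", 3), ("down", "up", 5)]
low_var = [("up", "down", 1), ("up", "down", 1), ("up", "down", 1)]
print(state_entropy(high_var))  # ~1.58 bits (uniform over 3 states)
print(state_entropy(low_var))   # 0.0 bits (a single repeated state)
```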

Computational Modelling

Similarity Between Trials

A central challenge of Project 3 will be to establish an appropriate measure of distance between the current trial \(trial_n\) and prior experience \(trial_{1:n-1}\). Similarity could be defined as the simple dimensional/featural overlap between two trials, or by modelling a trial-state as a point in a multidimensional space, using some distance metric (e.g., Euclidean distance) to compute the distance between trials, and then transforming that distance into psychological space with an exponential or Gaussian function.
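A sketch of the second approach follows, with trials coded as numeric feature vectors and a weighted Euclidean distance passed through an exponential or Gaussian generalization gradient. The per-dimension weights and the sensitivity parameter c are free parameters to be fit, not known values.

```python
# Distance-based similarity between two trials, under an assumed numeric
# feature coding (e.g., coded directions, layout, X/Y position).
import numpy as np

def similarity(trial_a, trial_b, weights, c=1.0, gaussian=True):
    """weights: per-dimension attention weights; c: sensitivity parameter."""
    diff = np.asarray(trial_a, float) - np.asarray(trial_b, float)
    d = np.sqrt(np.sum(weights * diff ** 2))  # weighted Euclidean distance
    # Gaussian or exponential mapping from distance to psychological similarity.
    return np.exp(-c * d ** 2) if gaussian else np.exp(-c * d)
```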

As described above, each trial can be defined along a number of dimensions. However, the dimensions are unlikely to be equal in their influence, and may thus need to be differentially weighted. Determining the identities of the dimensions themselves may also be nontrivial, for instance whether the full set of 16 combinations of target direction (up, down, right, left) and flanker direction (up, down, right, left) can be simplified into a single metric of flanker rotation relative to target direction, e.g., 0, 90, 180, and 270 degrees, as sketched below.
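For illustration, collapsing the 16 direction combinations into a single relative-rotation feature could look like the following; the angular encoding is an assumption for the sketch, not the game's internal representation.

```python
# Map each (target, flanker) direction pair to a relative rotation in degrees.
ANGLE = {"up": 0, "right": 90, "down": 180, "left": 270}

def relative_rotation(target_dir, flanker_dir):
    """Rotation of the flankers relative to the target; 0 = congruent."""
    return (ANGLE[flanker_dir] - ANGLE[target_dir]) % 360

assert relative_rotation("up", "up") == 0        # congruent trial
assert relative_rotation("up", "right") == 90
assert relative_rotation("left", "right") == 180
```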

Once a similarity measure between trials has been established, we can use the individual trial histories for each user to compute how similar a given trial is to the totality of their prior experience in the game (i.e., the summed similarity of all previous trials), or to a subset of their more recent trials. We can then begin to perform inferential statistics assessing the extent to which users matched in total experience with the game may perform differently on individual trials as a function of the similarity between their prior experience and that particular trial. Additionally, our similarity metric can be used to attempt to explain differences in performance between baseline, split-test, and split-test -> baseline users.
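Building on the similarity() sketch above, summed similarity to the full prior history, and a variant restricted to recent trials, might be computed as follows; the window size k is an arbitrary placeholder.

```python
# Summed similarity between the current trial and a user's prior history,
# reusing the similarity() function sketched earlier.
def summed_similarity(current, history, weights, c=1.0):
    """history: list of feature vectors for trials 1..n-1."""
    return sum(similarity(current, past, weights, c) for past in history)

def recent_similarity(current, history, weights, c=1.0, k=50):
    """Same measure restricted to the k most recent trials."""
    return summed_similarity(current, history[-k:], weights, c)
```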

Measurement model of learning and performance

A prerequisite for assessing the influence of variability/similarity at the trial-to-trial level is to have an adequate measure of trial-level performance. Individual reaction times, the primary performance measure of the task, are likely far too noisy to be suitable for this purpose. Instead of modelling raw reaction times, we can attempt to model the latent level of skill the user has for the task as some function of the number of trials completed. Two of the most widely used functions for relating number of trials completed to performance are the exponential and power functions (see Equation 1 and Equation 2). Both functions are commonly fit with three free parameters, \(u\), \(a\) and \(c\), which reflect asymptotic performance, the amount of change from initial to asymptotic performance, and learning rate, respectively. The debate between these two learning functions is longstanding and continues to this day (Evans et al. 2018; Heathcote, Brown, and Mewhort 2000). Previous research using data from the base version of LIM did find a slight advantage for the power model (Steyvers and Benjamin 2019). However, that investigation included data from a far larger number of games per user than what the proposed project will consider (due to the constraint of when sufficient between-user trial variability occurs; ref/figure). We thus intend to test the suitability of both learning functions. With an appropriate model of learning performance obtained for each user, we will then be able to evaluate the ability of our measures of variability and similarity experienced up to \(trial_{n-1}\) to predict performance on \(trial_n\).

Exponential Model: \(y_t = u - ae^{-ct}\)

Power Model: \(y_t = u - at^{-c}\)
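As an illustration of the intended model comparison, the sketch below fits both functions to simulated data with scipy.optimize.curve_fit; in practice the dependent variable would be a per-trial performance measure, and model comparison would use criteria such as AIC/BIC rather than raw error.

```python
# Fit the exponential and power learning functions to one (simulated) user's
# trial-level performance; parameter values here are arbitrary assumptions.
import numpy as np
from scipy.optimize import curve_fit

def exponential(t, u, a, c):
    return u - a * np.exp(-c * t)

def power(t, u, a, c):
    return u - a * t ** (-c)

t = np.arange(1, 201)  # trial index (1-based so the power function is defined)
y = exponential(t, 100, 60, 0.05) + np.random.normal(0, 5, t.size)  # fake data

exp_params, _ = curve_fit(exponential, t, y, p0=[100, 60, 0.05])
pow_params, _ = curve_fit(power, t, y, p0=[100, 60, 0.5])

# Compare fits by sum of squared error for this toy example.
sse = lambda f, p: np.sum((y - f(t, *p)) ** 2)
print(sse(exponential, exp_params), sse(power, pow_params))
```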

References

Berniker, Max, Hamid Mirzaei, and Konrad P. Kording. 2014. “The Effects of Training Breadth on Motor Generalization.” Journal of Neurophysiology 112 (11): 2791–98. https://doi.org/10.1152/jn.00615.2013.
Catalano, John F., and Brian M. Kleiner. 1984. “Distant Transfer in Coincident Timing as a Function of Variability of Practice.” Perceptual and Motor Skills 58 (3): 851–56. https://doi.org/10.2466/pms.1984.58.3.851.
Donner, Yoni, and Joseph L. Hardy. 2015. “Piecewise Power Laws in Individual Learning Curves.” Psychonomic Bulletin & Review 22 (5): 1308–19. https://doi.org/10.3758/s13423-015-0811-x.
Eriksen, B. A., and C. W. Eriksen. 1974. “Effects of Noise Letters Upon the Identification of a Target Letter in a Nonsearch Task.” Perception & Psychophysics 16: 143–49.
Evans, Nathan J., Scott D. Brown, Douglas J. K. Mewhort, and Andrew Heathcote. 2018. “Refining the Law of Practice.” Psychological Review 125 (4): 592–605. https://doi.org/10.1037/rev0000105.
Gershman, Samuel J. 2020. “Origin of Perseveration in the Trade-Off Between Reward and Complexity.” Cognition 204: 104394.
Gonzalez, Cleotilde, and Poornima Madhavan. 2011. “Diversity During Training Enhances Detection of Novel Stimuli.” Journal of Cognitive Psychology 23 (3): 342–50. https://doi.org/10.1080/20445911.2011.507187.
Goode, M. K., L. Geraci, and H. L. Roediger. 2008. “Superiority of Variable to Repeated Practice in Transfer on Anagram Solution.” Psychonomic Bulletin & Review 15 (3): 662–66. https://doi.org/10.3758/PBR.15.3.662.
Green, D. Penelope, Jean Whitehead, and David A. Sugden. 1995. “Practice Variability and Transfer of a Racket Skill.” Perceptual and Motor Skills 81 (December): 1275–81. https://doi.org/10.2466/pms.1995.81.3f.1275.
Heathcote, A., S. Brown, and D. J. Mewhort. 2000. “The Power Law Repealed: The Case for an Exponential Law of Practice.” Psychonomic Bulletin & Review 7 (2): 185–207.
Jaffe, Paul I, Russell A Poldrack, Robert J Schafer, and Patrick G Bissett. 2022. “Discovering Dynamical Models of Human Behavior.” bioRxiv, 36.
Kerr, R, and B Booth. 1978. “Specific and Varied Practice of Motor Skill.” Perceptual and Motor Skills 46 (2): 395–401.
Lee, Jessica C, Peter F Lovibond, and Brett K Hayes. 2019. “Evidential Diversity Increases Generalisation in Predictive Learning.” Quarterly Journal of Experimental Psychology 72 (11): 2647–57. https://doi.org/10.1177/1747021819857065.
Sabah, Katrina, Thomas Dolk, Nachshon Meiran, and Gesine Dreisbach. 2019. “When Less Is More: Costs and Benefits of Varied Vs. Fixed Content and Structure in Short-Term Task Switching Training.” Psychological Research 83 (7): 1531–42. https://doi.org/10.1007/s00426-018-1006-7.
Steyvers, Mark, and Aaron S. Benjamin. 2019. “The Joint Contribution of Participation and Performance to Learning Functions: Exploring the Effects of Age in Large-Scale Data Sets.” Behavior Research Methods 51 (4): 1531–43. https://doi.org/10.3758/s13428-018-1128-2.
Steyvers, Mark, Guy E Hawkins, Frini Karayanidis, and Scott D Brown. 2019. “A Large-Scale Analysis of Task Switching Practice Effects Across the Lifespan.” Proceedings of the National Academy of Sciences 116 (36): 17735–40.
Steyvers, Mark, and Robert J. Schafer. 2020. “Inferring Latent Learning Factors in Large-Scale Cognitive Training Data.” Nature Human Behaviour 4 (11): 1145–55. https://doi.org/10.1038/s41562-020-00935-3.
Wrisberg, Craig A., Timothy P. Winter, and Jolynn S. Kuhlman. 1987. “The Variability of Practice Hypothesis: Further Tests and Methodological Discussion.” Research Quarterly for Exercise and Sport 58 (4): 369–74. https://doi.org/10.1080/02701367.1987.10608114.