LEARNING AND ACTING WITH PREDICTIVE COGNITIVE MAPS

by

ARTHUR WILLIAM JULIANI

A DISSERTATION

Presented to the Department of Psychology and the Graduate School of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

December 2020

DISSERTATION APPROVAL PAGE

Student: Arthur William Juliani
Title: Learning and Acting with Predictive Cognitive Maps

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Psychology by:

Margaret Sereno, Chairperson
Dasa Zeithamova, Core Member
Thien Nguyen, Core Member
Richard Taylor, Institutional Representative

and Kate Mondloch, Interim Vice Provost and Dean of the Graduate School

Original approval signatures are on file with the University of Oregon Graduate School.

Degree awarded December 2020

© Arthur William Juliani 2020

DISSERTATION ABSTRACT

Arthur William Juliani
Doctor of Philosophy
Department of Psychology
December 2020
Title: Learning and Acting with Predictive Cognitive Maps

Humans and other mammals possess two remarkable abilities: the capacity to store and retrieve a seemingly boundless series of episodic memories, and the capacity to quickly make sense of and navigate their changing environments. The latter has been described as a cognitive map, and along with the capacity to store and retrieve narrative memories, has been largely localized to the medial temporal lobe. Recent theorists have suggested that these two capacities are both aspects of a single unified system of 'experience construction.' In such a system, complex high-dimensional sensory experiences represented in the cortex are indexed by a low-dimensional representation within the medial temporal lobe. The dynamics of this representation then allow for the generation of coherent sequences of activation which correspond to coherent narrative experiences, as well as coherent trajectories through the environment, supporting both memory and navigation.

Such a theoretical perspective bears a strong resemblance to a recent class of deep neural networks called generative temporal models. In this work we explore this connection by introducing a series of increasingly complex generative temporal models, and analyzing each of their properties. We find that these models are able to learn representations which bear a strong resemblance to known representations within the medial temporal lobe, such as place and time cells. Furthermore, we demonstrate that these representations are useful for rapidly learning to perform downstream goal-directed navigation tasks using biologically plausible reinforcement learning rules. We also examine the ways in which these models can be extended to display adaptation to changes in the structure or content of the environment, a key property of the cognitive map. Finally, we compare the behavior of artificial agents utilizing these learned representations to those of humans in a complex virtual navigation task. In doing so, we find evidence that humans utilize a hybrid behavioral strategy, and that such a strategy can be modeled by artificial agents utilizing a learned place-cell-like representation.
CURRICULUM VITAE

NAME OF AUTHOR: Arthur William Juliani

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene OR
North Carolina State University, Raleigh NC

DEGREES AWARDED:
Doctor of Philosophy, Psychology, 2020, University of Oregon
Master of Science, Psychology, 2015, University of Oregon
Bachelor of Arts, Psychology, 2013, North Carolina State University

AREAS OF SPECIAL INTEREST:
Cognitive Neuroscience
Machine Learning

PROFESSIONAL EXPERIENCE:
Senior Research Engineer, Unity Technologies, 2017-2020
Graduate Teaching Fellow, University of Oregon, 2014-2016
Data Science Intern, Duke University, 2013

GRANTS, AWARDS, AND HONORS:
Nvidia GPU Grant, Nvidia Corporation, 2016

PUBLICATIONS:

Juliani, A., Khalifa, A., Berges, V. P., Harper, J., Teng, E., Henry, H., Crespi, A., Togelius, J., & Lange, D. (2019). Obstacle tower: A generalization challenge in vision, control, and planning. International Joint Conferences on Artificial Intelligence 2019.

Juliani, A. W., Yaconelli, J. P., & Sereno, M. E. (2019). Learning to Integrate Egocentric and Allocentric Information using a Goal-directed Reward Signal. Journal of Vision, 19(10), 162-162.

Taylor, R. P., Juliani, A. W., Bies, A. J., Boydston, C., Spehar, B., & Sereno, M. E. (2018). The implications of fractal fluency for biophilic architecture. Journal of Biourbanism, 6, 23-40.

Juliani, A. W., Bies, A. J., Boydston, C. R., Taylor, R. P., & Sereno, M. E. (2016). Navigation performance in virtual environments varies with fractal dimension of landscape. Journal of Environmental Psychology, 47, 155-165.

Juliani, A., Bies, A., Boydston, C., Taylor, R., & Sereno, M. (2016). Spatial localization accuracy varies with the fractal dimension of the environment. Journal of Vision, 16(12), 1370-1370.

Juliani, A., Leidheiser, W., McLaughlin, A., Allaire, J., & Gandy, M. (2013, September). Cognitive Ability Predicts Older Adult Performance in a Complex Task but is Moderated by Social Interaction. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (Vol. 57, No. 1, pp. 1740-1744).

ACKNOWLEDGMENTS

I want to acknowledge the support of my entire committee, and the faculty of the Department of Psychology as a whole. In particular, I wish to express gratitude to Margaret Sereno for her role as an advisor and source of constant support as my interests and career goals continued to develop throughout my time as a graduate student. I also wish to acknowledge the support of my professional colleagues, whose support made it possible for me to complete this work.

TABLE OF CONTENTS

I. INTRODUCTION
I.1 Neuroscientific Evidence for Cognitive Maps
I.1.1 Place, Grid, and Other Spatial Cells
I.1.2 Time, Event, and Other Non-spatial Cells
I.1.3 Replay, Preplay, and Structured Temporal Sequences
I.2 Computational Theories of Mammalian Navigation
I.2.1 Path Integration, Attractors, and Other Early Models
I.2.2 Vector Navigation, Neural Networks, and Other Later Theories
I.2.3 Prospective and Successor Models
I.2.4 Goal Signals and the Hippocampus
I.2.5 Policy Learning from Real and Imagined Experience
I.3 Generative Temporal Models
I.3.1 Basics of Generative Temporal Models
I.3.2 Extending GTMs with Memory and Multiple Latent States
I.3.3 Hippocampal Index Theory and a Language Metaphor
II. THE HIPPOCAMPUS AS A GENERATIVE TEMPORAL MODEL
II.1 Place and Time Cells in a GTM Latent State
II.1.1 Evaluation Methods
II.1.2 Modeling Methods
II.1.3 Results
II.2 Place-like Cells are Distributed based on Underlying Agent Behavior
II.2.1 Evaluation Methods
II.2.2 Results
II.3 Internally Generated Sequences and Auto-regressive Models
II.3.1 Evaluation Methods
II.3.2 Results
II.4 Generative Temporal Models Learn Temporal Community Structure
II.4.1 Evaluation Methods
II.4.2 Results
II.5 Discussion
III. LATENT STATES AND GOAL-DIRECTED NAVIGATION
III.1 State Cells for Actor-Critic Learning
III.1.1 Methods
III.1.2 Results
III.2 State Cells for Successor Feature Learning
III.2.1 Evaluation Methods
III.2.2 Modeling Methods
III.2.3 Results
III.3 Fast Convergence with Successor Similarity Learning
III.3.1 Evaluation Methods
III.3.2 Modeling Methods
III.3.3 Results
III.4 Rollouts, Replay, and Dyna Learning
III.4.1 Evaluation Methods
III.4.2 Modeling Methods
III.4.3 Results
III.5 Discussion
IV. CONTENT GENERALIZATION AND DUAL STREAM WORLD MODELS
IV.1 Learning Content Agnostic Latent Representations
IV.1.1 Evaluation Methods
IV.1.2 Modeling Methods
IV.1.3 Results
IV.2 Goal-directed Navigation in Environments with Novel Content
IV.2.1 Evaluation Methods
IV.2.2 Results
IV.3 Learning from Egocentric Observations
IV.3.1 Evaluation Methods
IV.3.2 Results
IV.4 Goal-directed Navigation from Egocentric Observations
IV.4.1 Evaluation Methods
IV.4.2 Results
IV.5 Discussion
V. STRUCTURAL GENERALIZATION AND CONTEXT MODELS
V.1 Learning an Index-based Context Representation
V.1.1 Modeling Methods
V.1.2 Evaluation Methods
V.1.3 Results
V.2 Learning a Map-based Context Representation
V.2.1 Evaluation Methods
V.2.2 Modeling Methods
V.2.3 Results
V.3 Learning Implicit Context Representations
V.3.1 Modeling Methods
V.3.2 Evaluation Methods
V.3.3 Results
V.4 Adapting to Changes in Context and Content
V.4.1 Modeling Methods
V.4.2 Evaluation Methods
V.4.3 Results
V.5 Discussion
VI. HUMAN AND AGENT BEHAVIOR IN COMPLEX ENVIRONMENTS
VI.1 Human Experimental Methods
VI.2 Environmental Complexity and Human Navigation
VI.2.1 Results
VI.3 Evidence for a Hybrid Behavioral Strategy in Humans
VI.3.1 Results
VI.4 Artificial Agent Behavior Varies with State Space Type
VI.4.1 Modeling Methods
VI.4.2 Evaluation Methods
VI.4.3 Results
VI.5 Discussion
VII. GENERAL DISCUSSION AND CONCLUSION
VII.1 Maps, Memories, and Models
VII.2 Connections to Contemporary Modeling Research
VII.3 Biological Implications and Open Questions
VII.4 Conclusion
REFERENCES CITED

LIST OF FIGURES

1 Diagram of a variational auto-encoder
2 Diagram of a World Model
3 Diagram of a Recurrent State Space Model
4 Diagram of the Generative Temporal Model with Spatial Memory
5 Explanation of Gumbel-Softmax distribution
6 The simple two-dimensional "gridworld" environment
7 Representative activation patterns of the first 18 units in the latent variable z in world models trained using gumbel-softmax, gaussian, and deterministic latent distributions
8 Reconstruction errors of three model types trained to auto-encode spatial observations
9 Example activation patterns for nine units of GTM with GS latent space models trained using different values of β for regularization loss
10 Representative activation patterns of the 64 units in the latent variable z by time-step in world models trained using gumbel-softmax, gaussian, and deterministic latent distributions
11 Reconstruction errors of three model types trained to auto-encode temporal observations
12 Action probability distributions for each of the five biased policies
13 Activation patterns of latent units trained with a biased behavioral policy
14 Inferred and generated latent variables during a single trajectory
15 Comparison between ground-truth observations, their reconstructions from the inferred latent variable, and their reconstruction from the rollout of the generative model using a gumbel-softmax latent space
16 Diagram of a graph environment
17 Fractal Rollout Examples
18 Latent space activations for each of the 16 units in the network
19 Multi-dimensional scaling of latent representations of learned model compared to true underlying topography of environment
20 Diagram of two-dimensional reinforcement learning environment with single goal and single agent
21 Actor-Critic agent mean time-steps per-episode for each basis function
22 Example value estimate maps
23 Diagram of experimental design for successor learning experiment
24 Mean time-steps per-episode for the two state space representations using either a successor representation or actor-critic learning algorithm
25 Mean time-steps per-episode for SSL and SR based learning algorithms with different basis functions
26 Mean time-steps per-episode for SSL based learning algorithms with different basis functions
27 A large circular gridworld environment used to compare performance of purely online and Dyna-assisted learning
28 Mean time-steps per-episode for a fully online learning algorithm, and an online algorithm augmented with various rollout lengths of Dyna
29 Diagram of the Dual Stream World Model
30 Four variable content environments each with a different topography
31 Reconstruction errors from rollouts of both World and DSWM models in four different topographical environments
32 Examples of reconstructed observations from rollouts of both World and DSWM models in four different topographical environments
33 Examples of activations of first four units of inferred and generated s from DSWM model in each of the four different environment topographies
34 Four different environment topographies, each showing the initial goal location for the first 50 episodes (top) and the second goal location for the following 50 episodes (bottom)
35 Learning curves in goal-directed navigation task for each of the four unique environmental topographies
36 Three dimensional gridworld environment rendered using Unity
37 Reconstruction errors from rollouts of both World and DSWM models in four different topographical environments
38 Examples of reconstructed observations from rollouts of both World and DSWM models in four different topographical environments
39 Examples of activations of selected four units of inferred and generated s from DSWM model in each of the four different environment topographies
40 Starting agent and goal positions for each of the four topographies in the 3D environment
41 Learning curves in goal-directed navigation task for each of the four unique environmental topographies
42 Diagram of a Contextual World Model
43 Examples of sixteen environments with fractal topographies
44 Classification accuracy of index-based contextual world model
45 Reconstruction error for predicted trajectories of future observations for both WORLD and CWORLD models
46 True environment topography alongside predictions from the CWORLD-M model at test-time for environment topographies A-E
47 Reconstruction error for predicted trajectories of future observations for both WORLD and CWORLD models
48 Diagram of CWORLD-U model
49 The nine test environments with hand-crafted Euclidean geometries
50 Reconstruction error for predicted trajectories of future observations for both WORLD and contextual variants
51 Reconstruction error for predicted trajectories of future observations for both WORLD and contextual variants
52 Diagram of a Tri-Stream World Model
53 Reconstruction error for predicted trajectories of future observations for both WORLD and contextual variants
54 Examples of units from the c latent space of a TSWM model
55 Example first-person perspective of participant performing navigation task
56 Visual representation of the four possible conditions within each block of trials
57 Examples of different seeds used to generate three environment topographies, each with different complexity levels
58 Mean human performance by fractal dimension
59 Mean human performance by fractal dimension in four stages of a single block
60 Mean human performance per trial by fractal height threshold
61 Mean human performance over time within a single block
62 Mean human performance by block change condition
63 Mean human performance by block change condition
64 Activation profiles of first sixteen units of inferred latent s and z spaces in the TSWM model trained on a single fractal island topography
65 Mean agent performance with three different state spaces
66 Mean agent performance within each change condition, and utilizing one of three different state spaces

LIST OF TABLES

1 Statistics from final 20 episodes of each training session for goal-directed agents
2 Statistics from final 20 episodes of each training session for goal-directed agents in 3D environment

CHAPTER I
INTRODUCTION

The "above" is what is "on the ceiling," the "below" is what is "on the floor," the "behind" is what is "at the door." All these wheres are discovered and circumspectly interpreted on the paths and ways of everyday associations, they are not ascertained and catalogued by the observational measurement of space.
-Martin Heidegger, Being and Time

Humans and other mammals can quickly become familiar with and skillfully navigate new spaces. This is thought to be possible thanks to the existence of a mental representation of the space which we quickly generate and update unconsciously. Over half a century ago this idea was made more concrete with the proposal of a cognitive map of space in mammals (Tolman, 1948). This 'map' was demonstrated in rodents as one which is quickly learned from experience, conforms to the unique structure of a space, and is used by the animal to navigate that space. The following decades saw the discovery of place cells in the hippocampus, leading researchers to focus on this area as the site of the cognitive map (O'Keefe, 1976; O'Keefe & Nadel, 1978; Morris, Garrud, Rawlins, & O'Keefe, 1982).

Subsequent to the discovery of place cells was the discovery of a series of other spatially selective cells in the nearby regions of the hippocampus, collectively part of the medial temporal lobe. Most notable among these was the discovery of grid cells (Hafting, Fyhn, Molden, Moser, & Moser, 2005), which explicitly encode spatial information that corresponds to the position of an animal within an environment. Since then, there has been the discovery of a variety of spatial-information-encoding cells and sub-regions within the hippocampal formation. These have been shown to encode a variety of different signals ranging from animal head orientation (Taube, Muller, & Ranck, 1990) to environment boundaries (Lever, Burton, Jeewajee, O'Keefe, & Burgess, 2009).
This spatial role of the hippocampus can be contrasted with the alternative perspective that the hippocampus is primarily involved in the formation, consolidation, and recall of episodic memories in animals, particularly in humans (Tulving & Markowitsch, 1998). Early lesion studies confirmed the essential role the hippocampus plays in ensuring that narrative experience enters long-term memory. Patients with hippocampal damage show a severely degraded ability to create new memories of personal experiences, a condition referred to as anterograde amnesia (Scoville & Milner, 1957; Aggleton & Brown, 1999). It has also been shown that patients with similar hippocampal damage are unable to imagine new experiences with the same level of coherency as individuals without such damage (Hassabis, Kumaran, Vann, & Maguire, 2007), suggesting that the region is more generally involved in the construction of coherent narrative experiences (Hassabis & Maguire, 2009).

This encoding and decoding of coherent experiences in the hippocampus has been studied in much greater depth in rodents than in humans. This work has led to the discovery of replay, a phenomenon characterized by trajectories of place cells corresponding to an environment spontaneously reactivating when the animal is at rest after having experienced that environment (Louie & Wilson, 2001; Foster & Wilson, 2006). These replay events have been shown to take place both during sleep and waking states, as well as to proceed in the "forward" and "reverse" directions. Studies have also found the existence of so-called preplay events, which take place prior to the animal experiencing a certain environment (Dragoi & Tonegawa, 2011, 2013). The value of these events to the animal has been theorized to lie in their ability both to aid in the consolidation of memories and to support the planning of future behavior (Pezzulo, van der Meer, Lansink, & Pennartz, 2014).

Functional accounts of replay and preplay in the hippocampus point to a unified interpretation of the role of the medial temporal lobe. Rather than performing spatial navigation and memory storage and retrieval independently, the hippocampus can be interpreted as an experience construction system, as proposed by Hassabis and Maguire (2009). In this theory, the role of the hippocampus is to generate coherent sequences of activation which correspond to extended narrative experiences, or episodic memories. The fundamental building block in this system is the neural representation contained within the hippocampus. This representation has been proposed to serve as an index into a cortical state in the related hippocampal index theory (Teyler & DiScenna, 1986).

These indices take the form of place cell representations when the state space of interest is spatial (O'Keefe, 1976), and take other forms such as time cell representations (Eichenbaum, 2014) or event cell representations (Sun, Yang, Martin, & Tonegawa, 2020) when there are other relevant aspects of the environment required to form meaningful indices of experience. This hippocampal representation can then serve as the basis for a state space upon which behaviorally motivated learning can take place. In many cases this learning involves spatial navigation toward physical locations, and as such, the entire system appears to be a spatial navigation one.
In cases where the behaviorally salient environment representation is non-spatial, the state space induced within the hippocampus bears non-spatial properties (Behrens et al., 2018).

This interpretation of the medial temporal lobe as an experience construction system, one which indexes cortical experiences and learns their temporal dynamics, bears a strong resemblance to a recent class of neural networks referred to as generative temporal models. Like the proposed role of the MTL, these models also infer latent states from high-dimensional sensory streams, and learn to spontaneously generate coherent sequences of these latent states, which can then be decoded into high-dimensional sensory information. These models are also often used to guide goal-directed behavioral learning in artificial agents (Ha & Schmidhuber, 2018). The goal of this dissertation is to further clarify this connection, and explore its limits through the description and empirical evaluation of a series of increasingly complex generative temporal models.
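To make this family of models concrete, the sketch below shows a minimal, deterministic generative temporal model in PyTorch: an encoder infers a low-dimensional latent state from an observation, a recurrent transition model advances that latent state given an action, and a decoder maps latents back to observations. The architecture, layer sizes, and names here are illustrative assumptions rather than the specific models introduced in later chapters, which use variational (e.g., Gaussian or Gumbel-Softmax) latent distributions.

import torch
import torch.nn as nn

class MinimalGTM(nn.Module):
    """Illustrative generative temporal model: encode, transition, decode."""
    def __init__(self, obs_dim=64, act_dim=4, latent_dim=16, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, latent_dim))
        self.transition = nn.GRUCell(latent_dim + act_dim, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, obs_dim))

    def infer(self, obs):
        # Infer a low-dimensional latent state from a high-dimensional observation.
        return self.encoder(obs)

    def rollout(self, z, actions):
        # Generate ("imagine") a sequence of future latent states and decoded
        # observations from an initial latent state and a sequence of actions.
        latents, recons = [], []
        for a in actions:
            z = self.transition(torch.cat([z, a], dim=-1), z)
            latents.append(z)
            recons.append(self.decoder(z))
        return torch.stack(latents), torch.stack(recons)

# Training (not shown) would minimize reconstruction error between decoded and
# observed frames, optionally with a regularization term on the latent state.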
In the following text of the introduction, each of the findings discussed above will be expanded upon to provide a fuller picture of the current neurobiological and computational understanding of the hippocampal formation, and its role in the creation and support of cognitive maps. We will then introduce the main theme of this work, a class of neural networks referred to as generative temporal models, and discuss their connection with the medial temporal lobe and its cognitive mapping abilities.

The body of this text will then turn to the introduction and analysis of a series of increasingly complex generative temporal models which capture various aspects of the construction system of the medial temporal lobe. The second chapter will introduce a simple generative temporal model, and demonstrate the ability of this model to develop place- and time-like cells within its latent representation. We will also demonstrate the ability of such a model to perform replay, and analyze the hidden representations of the model, showing that they display temporal community structure, a key aspect of the hippocampal representation (Schapiro, Turk-Browne, Norman, & Botvinick, 2016).

The third chapter of this work will then turn to utilizing the learned latent states of a generative temporal model for the purpose of goal-driven navigation. We will explicitly utilize known reinforcement learning algorithms which have been connected with reward learning in the brain, specifically actor-critic and successor representations (Niv, 2009; Stachenfeld, Botvinick, & Gershman, 2017). Building on these methods, we introduce a novel reinforcement learning algorithm which learns more rapidly than previous related methods. Here we will also demonstrate the usefulness of the replay capabilities of a generative temporal model in guiding goal-directed learning.

In the fourth chapter we turn to the problem of content generalization, the ability to learn representations which are invariant to non-structural changes in sensory stimuli within an environment. Here we introduce a more complex generative temporal model which utilizes multiple latent states, as well as a storage and lookup mechanism for enabling episodic memory. We demonstrate that this model is able to learn allocentric representations, in the form of place-like cells, directly from egocentric observations. We then validate this method on both allocentric 2D environments as well as egocentric 3D environments with various topographies.

In chapter five we then turn to the question of context generalization, the ability to learn representations which adapt to changes in the structure of the environment. Here we explore a number of approaches for augmenting a generative temporal model with a contextual representation. Along with two latent representations learned using a supervised learning signal, we introduce an additional model which learns an implicit contextual representation in an entirely unsupervised fashion. We draw a connection between this representation and the parahippocampal gyrus.

Finally, in chapter six we present a set of experiments in a novel realistic 3D virtual environment conducted both with human participants and with artificial agents. In both cases, the entity interacting with the environment is tasked with performing a goal-directed navigation task toward a hidden goal location within the environment. We use this task to test for the effect of environment complexity on human performance. In addition, a set of environment-change conditions is used to examine where human behavior in this task can be classified on the spectrum between model-based and model-free decision-making strategies. We find evidence for a hybrid strategy. We then demonstrate that an artificial agent using a latent state space from a generative temporal model learns a policy with a similar set of adaptation characteristics to that of humans performing the task.

I.1 Neuroscientific Evidence for Cognitive Maps

The hippocampal formation is a system of brain regions within the medial temporal lobe, containing the hippocampus, entorhinal cortex, and subiculum, among other connected regions. It has historically been implicated in two broad categories of cognitive function: the development of and access to episodic memories (Tulving & Markowitsch, 1998; Aggleton & Brown, 1999), and the representation of a spatial cognitive map (O'Keefe & Nadel, 1978; Behrens et al., 2018). These functions were discovered in independent contexts, and originally existed as distinct lines of research. Part of the limbic system, the formation is also densely connected to other important areas such as the prefrontal cortex (Preston & Eichenbaum, 2013), implicating it in the process of high-level decision making (Tanji & Hoshi, 2001). This section will describe the lines of research around these two broad interpretations, and the empirical evidence for each, from both behavioral and neural data.

Early research into spatial learning in rodents suggested that rather than simply learning stimulus-response mappings, some animals are able to develop abstract representations of their environments, and use them for navigation (Tolman, 1948). In a set of classic experiments, Tolman showed that rodents were able to quickly take never-before-visited paths to regions of space associated with a known reward, suggesting that the rodents had developed an abstract representation of the environment which they were able to utilize in the task. These abstract representations were referred to as a "cognitive map," because of their apparently spatial nature, and their specific application to navigation in the case of rodents. Early research into the role of the hippocampal formation in mammalian cognition made clear the potential contribution of the brain region to the formation of this cognitive map (O'Keefe & Nadel, 1978).
This was supported by the discovery of cells within the hippocampus which were robustly selective to an animal occupying a specific position in space (O'Keefe, 1976). The idea that this selectivity could be used to support a general-purpose map of an animal's location within the world, and thus be used for the selection of intelligent behavior, has been built upon and continuously developed throughout the decades since (for a review, see Behrens et al., 2018).

This development has been grounded in the gradual discovery of populations of cells within the hippocampal formation which are selective to different aspects of the environment within which an animal finds itself. Most studied among these have been the place and grid cells (O'Keefe, 1976; Hafting et al., 2005), with a wealth of additional cell types having been discovered as well (Solstad, Boccara, Kropff, Moser, & Moser, 2008; Behrens et al., 2018). Evidence from lesion studies suggests that the spatial information represented in these regions is critical for performing navigation in animals (Morris et al., 1982). This has led to a large amount of theoretical work attempting to provide computational models of both how these representations are learned, as well as how they could be used to aid in active navigation and memory for animals (Hasselmo, 2009; Erdem & Hasselmo, 2012; Bush et al., 2015).

I.1.1 Place, Grid, and Other Spatial Cells

Early evidence for the existence of a cognitive map in mammals came from experiments conducted in the 1970s by O'Keefe and collaborators (O'Keefe, 1976; O'Keefe & Nadel, 1978). This early work was conducted on rodents as they moved around in a small enclosed maze. During this movement, recordings of cellular activation were collected via electrodes from the CA1 region of the hippocampus. Hundreds of cells in this region were monitored, and it was discovered that a large number of them preferentially responded to specific spatial locations within the maze. Further experimentation suggested that while some of these activation patterns were the result of incidental features of the environment, a non-trivial number of them displayed robust activation despite various manipulations of the sensory and motor experience of the animal.

This suggested that these cells in some way coded for an abstract notion of the "place" the animal found itself within. This sense of place was semi-invariant to incidental features of the environment such as lighting conditions. It was also found that there was no direct connection between the position of the animal within space and the position within the CA1 region of the cell which preferentially fired for that region of space. To the researchers at the time, the responsiveness of a given place cell seemed arbitrarily related to the spatial properties of its preferred region.

In subsequent decades, follow-up work was conducted to more rigorously determine the firing properties of place cells (Muller, Kubie, & Ranck, 1987; Muller & Kubie, 1987). Using a video monitoring system, Muller and Kubie were able to characterize the statistical properties of the place cells, and how they were affected by changes in the environment. Most compelling was the discovery that the place fields were able to quickly remap their location of preference when a cue card serving as the primary landmark in the environment was rotated.
The complete removal of the cue card resulted in only minor shifts in place field firing, suggesting that the place fields were supported by more complex perceptual anchors than the cue card alone. More recently, research has been conducted which provides evidence for the existence of similar place-specific cell populations in the human hippocampus as well (Ekstrom et al., 2003).

In the years following the discovery of place cells, there remained an open question regarding how the semi-invariant and non-uniform representation of the place cells was generated and sustained during navigation. It was hypothesized that there must be a more consistent underlying representation of space (possibly developed from pure ego-motion cues) that serves as a foundation for the more environment-specific place cell representation (O'Keefe & Nadel, 1978). This representation was finally discovered in rodents in the mid-2000s, in a region not of the hippocampus proper but of the entorhinal cortex, specifically the medial entorhinal cortex (MEC) (Hafting et al., 2005).

This population of cells, with its highly uniform spatial firing pattern, became known as the grid cells, named for their triangular tiled pattern of activation. Unlike the place cell populations, within which each cell responded preferentially to just a single region of space, grid cells respond with periodic firing that depends on the spatial position of the animal. As such, each cell displays a "grid" of activation for a given environment, each with a uniquely offset phase. The specific periodicity and scale of these activation patterns were found to vary in a predictable manner across the entorhinal cortex, with the spatial scale increasing with distance from the dorsal end of the region. Most importantly, this grid representation develops after the animal has been placed into a new environment, and then remains invariant to manipulations of sensory features of the environment such as lighting or other visual cues. This fast and stable representation is critical for supporting a useful navigation-oriented representation of space. It was subsequently found that there are additional populations of grid cells deeper in the entorhinal cortex whose firing patterns also depend on head direction, supporting a bridge between purely head-direction-selective cells and purely position-selective grid cells (Sargolini et al., 2006).

More recent fMRI work has provided evidence that humans possess an analogous region of grid cells in their entorhinal cortex as well (Doeller, Barry, & Burgess, 2010), suggesting that the region may be shared by most mammals, and not specific to rodents. In the study by Doeller and collaborators, participants were placed into an fMRI machine and asked to perform a foraging task in a virtual environment. The BOLD signal was then measured and analyzed in the entorhinal region as a function of the direction and speed of the participant's movement through the virtual space. The finding that this signal corresponded to the firing pattern expected from grid cell recordings in rodents, one which synced with the six-fold symmetry found in grid cell firing patterns, provided the evidence for a similar system. A similar activation pattern was found in later work in which human participants were given an imagined navigation task (Horner, Bisby, Zotow, Bush, & Burgess, 2016).
Due to the lack of spatial resolution of fMRI, it is difficult to determine the exact structure of activation at the cellular level in this region in humans, making it unclear whether there is simply an analogically similar pattern of activation or whether humans indeed possess individual cells with grid-like firing profiles as rodents do.

In addition to place and grid cells, an array of other spatially selective cells have been discovered within the hippocampal formation, including border and head-direction cells (for a review of additional spatially selective cell types, see Behrens et al., 2018). Evidence for these was gathered using largely similar methods to those originally used by O'Keefe years earlier (O'Keefe, 1976), with single-unit recording from rodents within an artificially constructed environment primarily being the method of choice. Head-direction cells were identified in the early 1990s, and as their name suggests, they consist of a population of cells in the subiculum of rodents which preferentially respond to the animal's head facing a specific direction in space (Taube et al., 1990). Similar to place cells, these cells were found to be robust to other environmental stimuli which were non-essential for determining the primary feature of activation: the direction of the animal's head. Each head-direction cell fires rapidly when the head is oriented in a specific direction, and maintains a low baseline level of firing otherwise. Also similar to place cells, but unlike grid cells, there is no topographical organization of the cells within the brain region that corresponds to firing preference in head direction space.

Another cell population with specific spatial firing features in the hippocampal formation are the border cells, which preferentially respond to the animal's proximity to a boundary in the environment, with greater activation as the animal gets closer to the preferred boundary for the cell (Solstad et al., 2008). This cell type was later generalized into a "Boundary Vector Cell," found in the subiculum (Lever et al., 2009). These boundary vector cells responded to proximity to a boundary regardless of the animal's orientation or head direction, and maintained firing even when the animal was not in immediate proximity to the boundary. This suggests that the cells could be used to compute distance to a given boundary, rather than simply providing a binary signal reflecting the presence or absence of a proximal boundary. It has been hypothesized that these boundary cells may be used to determine the limits of a given environment for the animal for the purpose of aligning grid cell responses.

Taken together, the cell types discussed above seem sufficient for an understanding of the hippocampal formation as a purely spatial mapping system. Indeed, as will be discussed in Section I.2, a large amount of theoretical work has been done to demonstrate the sufficiency of these cell types for navigation. In more recent years however, more sophisticated recording techniques and experimental designs have shed doubt on the concept of the hippocampus exclusively as a representation system for space.
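To summarize the response properties described in this subsection, the sketch below gives simple, idealized tuning-curve models of the kind often used to simulate these cell types: a Gaussian place field, a grid field built from three plane waves oriented 60 degrees apart, von Mises head-direction tuning, and a simplified boundary-distance tuning curve. These are illustrative approximations with assumed parameter values, not fits to any of the recordings discussed above.

import numpy as np

def place_cell(pos, center, sigma=0.1):
    # Gaussian place field: firing peaks when the animal is at a preferred location.
    return np.exp(-np.sum((np.asarray(pos) - np.asarray(center)) ** 2) / (2 * sigma ** 2))

def grid_cell(pos, phase=(0.0, 0.0), spacing=0.3, orientation=0.0):
    # Idealized grid field: a rectified sum of three plane waves 60 degrees apart,
    # producing the triangular (hexagonal) tiling characteristic of grid cells.
    x, y = np.asarray(pos) - np.asarray(phase)
    k = 4 * np.pi / (np.sqrt(3) * spacing)
    angles = orientation + np.array([0.0, np.pi / 3, 2 * np.pi / 3])
    waves = np.cos(k * (x * np.cos(angles) + y * np.sin(angles)))
    return max(0.0, float(np.sum(waves)) / 3.0)

def head_direction_cell(heading, preferred, kappa=5.0):
    # Von Mises tuning: rapid firing near a preferred allocentric head direction,
    # low baseline firing otherwise.
    return np.exp(kappa * (np.cos(heading - preferred) - 1.0))

def boundary_cell(dist_to_boundary, preferred_dist=0.05, sigma=0.05):
    # Simplified boundary(-vector) tuning: firing peaks at a preferred distance
    # from a boundary, independent of the animal's heading.
    return np.exp(-(dist_to_boundary - preferred_dist) ** 2 / (2 * sigma ** 2))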
I.1.2 Time, Event, and Other Non-spatial Cells

The picture of the hippocampus as a spatial cognitive map has been complicated in recent years by a variety of findings showing that in addition to cells which fire based on spatial features of the environment (such as place, grid, and border cells), there are additional cells which fire regularly according to non-spatial aspects of the environment. One of the more prominent of these is a class of cells which fire based on the elapsed time within a specific task, referred to as "time cells" (for a review, see Eichenbaum, 2014). Early evidence for this was put forward by Pastalkova and colleagues, who showed that activation patterns in the rodent hippocampus reflect internally generated sequences which corresponded to delay in the task rather than spatial position or other physical stimuli (Pastalkova, Itskov, Amarasingham, & Buzsáki, 2008). The existence of cells with this firing profile suggests that the activation patterns of the hippocampus reflect more than just spatial selectivity, and points to a more general organizing principle behind these representations.

Subsequent research showed similar results in the case where the animal was stationary as well, with the firing patterns specifically anchored around task-specific temporal delays (MacDonald, Lepage, Eden, & Eichenbaum, 2011). Their work consisted of examining activation patterns in CA1 of the rodent hippocampus. The task involved the animal moving through a circular line maze. At the beginning of the maze, the animal was presented with one of two colored objects. In the next phase the animal remained in a fixed position in the maze for ten seconds. In the final phase the animal was then presented with an odor at the end of the maze. As expected, the researchers found place cells which were sensitive to the animal's spatial location within the maze. In addition, they also found cells which were sensitive to the temporal delay in the second phase of the task. Importantly, these patterns could not be explained by the animal's position, rotation, or velocity. When the delay was extended in the second phase, the cells "remapped," with a different set of cells now corresponding to time cells for the task.

Surprisingly, the same cells which display an activation profile consistent with time cells sometimes also display place cell like activation in the work of MacDonald et al. (2011), suggesting that the simplistic narrative of place cells supporting spatial representation only is at best missing critical aspects related to temporal coding. If the hippocampus does not provide a spatial cognitive map, then what could be a more appropriate alternative? The evidence provided above suggests that the hippocampus represents experiences in both a spatial and temporal manner, but specifically one which is environment specific, in which neither a spatial nor a temporal component is dominant, and rather the needs of the task and environment are captured in the place cells. Indeed, recent work looking at human hippocampal activation using fMRI shows that a spatio-temporal signal, rather than a spatial or temporal one, provided the best fit for the representation in the region (Deuker, Bellmund, Schröder, & Doeller, 2016). This was done using a task where participants navigated a virtual environment in which the spatial and temporal distances between objects in the environment were manipulated.
The researchers referred to the joint representation learned as an "event map" to capture its more abstracted nature. More recently, work in rodents has demonstrated the existence of specific cells which do not respond to either temporal or spatial properties of a task per se, but rather to a more complex and general relationship between the animal and its environment (Sun et al., 2020). In this work, Sun and colleagues recorded from the rodent hippocampus while the animals performed a navigational task around a series of circular tracks. The shape and length of these tracks differed, such that there was no specific temporal or spatial correspondence between turning the left corner of one track and turning the left corner of another. Despite this, cells within the hippocampus reliably responded to such events as turning a specific corner, suggesting that these cells encoded a notion of "event" rather than time or place.

Taken together, this evidence suggests that cells in the hippocampus are sensitive to the task and environmental structure rather than exclusively to the spatial or temporal properties of the environment itself. As such, cells in the hippocampus are remapped as necessary to provide a state space which contains a coherent representation of the structure of any given task. We can then interpret the development of spatially and temporally specific cells as just one instantiation of a more general representation of the abstract state structure of a task or environment. The development and utilization of these abstracted state representations will be a major focus of the subsequent chapters of this work.

I.1.3 Replay, Preplay, and Structured Temporal Sequences

The construction system hypothesis states that the hippocampus serves to support not only the consolidation and recall of past memories, but also the imagination of novel experiences (Hassabis & Maguire, 2009). One key element of both cases is their temporally extended and coherent nature. As such, rather than studying place cell activity in isolation at a given moment, we would expect it to follow a temporally extended pattern of activation, or at least a representation which is amenable to temporal extension.

Such evidence has indeed been discovered. Early evidence suggested that concurrent pairs of place cells that were activated beforehand in a reward-driven task were reactivated together at a frequency greater than chance during slow wave sleep (Wilson & McNaughton, 1994). Due to a lack of technical sophistication in the recording equipment at the time, only correlations between pairs of cells could be shown in the rodents studied. Subsequent research was able to not only verify this early finding, but extend it to long sequences of place cell firing patterns which lasted over dozens of seconds (A. K. Lee & Wilson, 2002). In their work, Lee and colleagues showed that place cell firing corresponding to a full trajectory of the animal through a linear maze was reactivated during slow wave sleep. Significantly, these patterns of activation were not present during sleep before the animal had experienced the linear maze, meaning that they were the result of experience running through the maze. Concurrent work also demonstrated that these "replay" events also took place during REM sleep in rodents (Louie & Wilson, 2001).
The explanation typically provided for these replay events during sleep is one of memory consolidation, with one prominent theory suggesting that replay of the event allows it to be bound to neocortical areas for long-term retention (Marr, Willshaw, & McNaughton, 1991; Nyberg, Habib, McIntosh, & Tulving, 2000).

In addition to being found in sleeping animals, replay has also been demonstrated in awake animals after some level of exposure to an environment (Foster & Wilson, 2006). Unlike earlier studies on sleep-based replay, in which the sequence of place cell activation was in the same order both during the experience and during replay, Foster and Wilson found that awake replay took place in the reverse order. Because of this, the phenomenon was aptly named reverse replay. Similar to earlier experiments done on sleeping rodents, they utilized a linear maze in which the animal ran back and forth. Reverse-replay activation took place while the animal was at rest at an end of the maze, with the place cells firing in order from the unit corresponding to the animal's current location backwards. The phenomenon of replay and reverse replay has been further generalized by studies showing that awake animals experience both types of replay, dependent on their position within an environment (Diba & Buzsáki, 2007). Diba and Buzsáki found that once an animal had experienced a linear maze, the characteristic reverse replay was detectable while the animal was at rest at the end of the maze after running through it. In addition to this, a forward replay of the place cell sequence was found when the animal was at the beginning of the maze. These findings suggest that replay activity is related to the animal's context within the environment, and that replay activity during the awake state emanates outward from the animal along the previously experienced trajectory.

The results presented above all described the replay of events which span on the order of a few seconds when originally experienced, and correspond to only one- to two-meter-long trajectories. The experiences of animals (including humans) in the wild however typically involve much longer sequences of movement, often spanning up to miles in the case of long-distance runners. Davidson and colleagues were able to show that replay occurs in rodents at scales an order of magnitude larger than earlier work (Davidson, Kloosterman, & Wilson, 2009). While replay trajectories had typically been observed to be confined to the duration of a sharp wave ripple event in the hippocampus, they found that these extended replay events took place over the course of multiple such ripples, with each ripple corresponding to a sub-section of the full trajectory being replayed. This decomposition of long sequences into shorter sub-sequences provides a mechanism for the potential temporal abstraction of movements into a coarser temporal and spatial scale.

The picture of awake replay has been made richer by results showing that waking replay events need not be linked to the environment the animal is currently within (Karlsson & Frank, 2009). In a set of experiments conducted by Karlsson and Frank, it was shown that replay events corresponding to a previously experienced environment took place while an animal was resting in a second environment. These replay events were referred to as "remote replay," because the animal is no longer in spatial (or temporal) proximity to the original environment to which they corresponded.
They found that the replay events for the two environments were independent of one another, with a separate association process responsible for the local activity in the current environment enabling the remote replay.

Everyday subjective experience of recalling memories suggests that a kind of replay should be evident in humans as well. Similar to the difficulties in showing the existence of grid cells however, the lack of spatial (EEG) or temporal (fMRI) resolution makes it difficult to isolate individual replay events in the human hippocampus. Nevertheless, research has been able to demonstrate that it is possible to decode the identity of individual experience trajectories from the human hippocampus during replay at a frequency greater than chance using fMRI (Chadwick, Hassabis, Weiskopf, & Maguire, 2010). Similar work has also isolated the specific contribution of the hippocampal region to the reconstruction of episodic memories, specifically when a temporally structured experience is being recalled (Lehn et al., 2009).

There has been debate over the functional role of each of these kinds of replay in their various contexts (for a review, see Foster, 2017), with the predominant understanding revolving around memory consolidation. This is particularly true for the role of replay during sleep, where there is a history of evidence for brain-wide synaptic changes that would support such a mechanism (Bliss & Collingridge, 1993). Replay during the awake state is more complicated, as it could be seen to potentially interfere with the present experience, unless it is significantly delayed from the animal taking any action. One proposed theory is that the reverse replays experienced after movement through a maze serve to quickly propagate state information backward. In the case where the final state is rewarding for the animal, a kind of value iteration process may take place, where the value at the final state is propagated backward to earlier states (Foster, Morris, & Dayan, 2000). In the case of forward replay prior to animal action, the activation has been hypothesized to serve a planning-like function. Additional evidence for this comes from the fact that when there is more than one possible path available to the animal, multiple forward replays will correspond to different possible paths (Pfeiffer & Foster, 2013). Combined with a mechanism for evaluating future states, this would serve as a simple tree-search-like planning method. This predictive perspective on replay will be expanded upon in the following section.
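As an illustration of the kind of computation such a reverse replay could support, the sketch below applies temporal-difference value updates backward along a just-experienced trajectory over a tabular state space. This is a schematic assumption for illustration, not the specific model proposed by Foster, Morris, and Dayan (2000); the function and variable names are invented here.

def reverse_replay_update(values, trajectory, rewards, alpha=0.5, gamma=0.95):
    """Propagate value backward along a stored trajectory, as a reverse replay might.

    values: dict mapping state -> current value estimate
    trajectory: list of states visited, in the order they were experienced
    rewards: rewards[t] is the reward received on the transition from
             trajectory[t] to trajectory[t + 1]
    """
    for t in reversed(range(len(trajectory) - 1)):
        s, s_next = trajectory[t], trajectory[t + 1]
        td_error = rewards[t] + gamma * values[s_next] - values[s]
        values[s] += alpha * td_error  # value of the rewarded end state flows backward
    return values

# Example: a single rewarded traversal of a short linear track.
values = {s: 0.0 for s in ["A", "B", "C", "D"]}
reverse_replay_update(values, ["A", "B", "C", "D"], [0.0, 0.0, 1.0])
# After one backward sweep, value has already propagated from the rewarded
# state D back toward the start state A, something a purely online forward
# update would require many traversals to achieve.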
There have been unsuc- cessful attempts to replicate the findings of Dragoi and Tonegawa using highly-sensitive recording equipment and more rigorous statistical techniques, calling into question the original finding (Silva, Feng, & Foster, 2015). The argument has also been made that the existence of coherent activation sequences of place cells before experience would render their activation as replays afterwards incomprehensible as reflection of any sort of mem- ory or learning (Foster, 2017). The argument for experience-dependence in place cell se- quence activation opens up the possibility for a kind of spectrum of reactivation between experience-less preplay and the replay of only trajectories explicitly experienced by the an- imal. Indeed, evidence for this comes from experiments showing that rodents experienced replay events for never-taken trajectories along a maze (Gupta, van der Meer, Touretzky, & Redish, 2010). More recent work has shown that preplay events take place not only sequentially, but often in a cyclic fashion (Kay et al., 2020), with theta wave activity rapidly transitioning between encoding multiple sets of possible future trajectories in rodents. Such activation patterns would make possible much more efficient exploration of future possible behaviors in any given environment, since rather than serially imagining trajectories, they could be explored in near-parallel, in a method with similarities to how modern implementations of the monte-carlo tree search algorithm functions (Silver et al., 2016). The structure of these temporal sequences has also been the object of study for many. Given the apparently arbitrary nature of remapping, the question might naturally arise as to why specific sequences of place cells seem to activate together at all, specifically within conditions of seemingly pure preplay. A review of some of these ideas was recently pre- sented by Dragoi (2020). One recent insightful finding has been the discovery that as a rule place cell sequences seem to be made up of repeating motifs, consisting mostly of three 17 units (Liu, Sibille, & Dragoi, 2018). These motifs form a kind of grammar regarding the activation of place cell sequences, with larger sequences being made up of these motifs, but the structure within the motif being largely immutable. With all of these computational possibilities opened up by the wealth of cell types and their temporal dynamics, we can now turn to the attempts made thus far to derive and validate concrete computational principles by which all of these make possible a cognitive map. I.2 Computational Theories of Mammalian Navigation Taken together, it would seem that boundary vector, head-direction, and grid cells provide a relatively extensive representation of the spatial situation of an animal at a given time. Unlike place cells, whose firing patterns are environment-specific, these three populations are all relatively invariant to disruptions in the environment, suggesting that when com- bined with environment specific perceptual cues, they could serve to provide a foundation for the computation of the local-specific activation patterns found in place cells. 
Since the discovery of these cell types, a large body of theoretical work has been undertaken to attempt to provide biologically valid computational models for how the development and maintenance of these representations could take place (for examples, see Samsonovich & McNaughton, 1997; Burgess, Barry, & O’keefe, 2007; Hasselmo, 2009; Erdem & Has- selmo, 2012; Bush et al., 2015). While not exhaustive, this section will provide a survey of some of the more influential of such theories and models, and how they have been used to reason about the mechanisms by which animals with these representations would be able to efficiently navigate the environments they find themselves within. I.2.1 Path Integration, Attractors, and Other Early Models One potential solution to the problem of spatial navigation in animals is path integra- tion (Mittelstaedt & Mittelstaedt, 1980) (for a review, see McNaughton, Battaglia, Jensen, 18 Moser, & Moser, 2006). In path integration an animal utilizes self-motion cues to keep track of its location relative to a global starting position. This ability has also been re- ferred to as “dead reckoning,” with the implication being the ability of an animal or person to “reckon” about their location in the absence of external sensory cues. Beyond being a skill that appears intuitive for humans familiar with navigating their worlds, the system of representation provided by the hippocampal formation provides all of the components nec- essary for such a path integration system. More specifically, internal vestibular information generated by the animal’s movement can drive the firing of head-direction cells, providing an oriented path signal with direction and velocity of movement (McNaughton, Chen, & Markus, 1991). Given a starting position, this incremental signal from head-direction cells can then be integrated together to produce a representation of the animal’s updated location after movement. The idea that such a system could be used for the maintenance of a cognitive map with place cell like activity was expanded upon in the late 1990s by Samsonovich and McNaughton, in a model which was able to mimic the recurrent structure of the hippocam- pus using an attractor network to generate units with place cell like activation patterns (Samsonovich & McNaughton, 1997). This model was referred to as a map-based path integrator, which consisted of an attractor network with activation layers imposed onto a 2D plane, thus providing an activation pattern with similar periodic firing as that of the grid cells. This representation was fed by a set of sensory inputs, which activated head and motion detectors, then feeding into the series of attractor maps, referred to as “charts,” This process ultimately produces activation patterns in the final layer of cells which are reminis- cent of place cell firing profiles. While this model predated the discovery of grid cells, the use of multiple 2D charts with semi-periodic firing foreshadowed the eventual discovery and incorporation of grid cells into subsequent models. With the discovery of grid cells, models of navigation and path integration followed which explicitly incorporated this population of cells into the model. 
These fell into two 19 broad categories: models which utilized the attractor dynamics as described in (Samsonovich & McNaughton, 1997), such as (McNaughton et al., 2006), and models which instead utilized an oscillatory interference pattern to generate grid and place cell representations (Burgess et al., 2007; Hasselmo, 2009; Erdem & Hasselmo, 2012; Bush et al., 2015). In all cases the spatial structure of the representation chosen for these models was of importance. In order for the finite capacity of a fixed population of cells to represent a large and varied amount of space, a looping representational space has typically been employed. In the case of a set of cells with a 1D representation such as the head-direction cells, a ring is the rep- resentation of choice, with the torus being the 2D extension used to model populations of grid cells. This latter set of models use as a foundation the theta-wave oscillations within the hippocampal formation. The offset phases of these signals in different populations of cells can then be used to generate an interference pattern. If projected onto a 2D plane, this interference pattern reflects the periodic firing pattern found in grid cells (Burgess et al., 2007). Place cell firing then corresponds to the conjunction of specific spatial information with a grid cell firing pattern. This basic interference model has been extended to simulate the storage and retrieval of experienced trajectories by an animal (Hasselmo, 2009), as well as goal-directed navigation in rodents (Erdem & Hasselmo, 2012). In the work of Erdem and Hasselmo (2012), the interference model of place cell devel- opment and firing is combined with a basic model of the prefrontal cortex in which specific place cells, assumed to correspond to specific spatial states of the animal, are correlated with a goal provided by the prefrontal cortex. This model then uses a basic form of lin- ear lookahead to plan out a path to the desired goal. While not explicitly mentioned in the original work, this lookahead corresponds to a form of model-based reinforcement learning (Sutton & Barto, 2018) in which the animal is able to scan its model of the environment for rewarding states by probing neighboring place cells adjacent to the currently active cell. 20 I.2.2 Vector Navigation, Neural Networks, and Other Later Theories Aside from being used for path integration, the system of spatial representation cell pop- ulations in the hippocampal formation also have the potential to enable the inverse func- tionality: vector navigation. In contrast to path integration in which a starting location and movement information is used to predict final location, in vector navigation a desired path vector is computed between a starting point and a goal location (Bush et al., 2015). Once computed, an animal could then follow this vector within this space in order to navigate to a desired location in physical space. Bush and colleagues show in their set of simulations that the problem of vector navigation can be reduced to finding a straight line (or plane in the 2D case) between the starting and goal position within the grid representation when the phases of all different scales of grid cell representation are aligned. 
They discuss a variety of possible biologically grounded models which might accomplish this, including mod- els which introduce intermediate cell populations such as distance cells (Fiete, Burak, & Brookings, 2008) and vector cells (Climer, Newman, & Hasselmo, 2013), and the model of (Erdem & Hasselmo, 2012), providing a consistent unified framework for the consideration of grid cell models of spatial navigation. Most recently the development and downstream utilization of grid cells has been mod- eled in more ecologically valid conditions using Deep Neural Networks (DNNs) and more realistic virtual environment (Banino et al., 2018; Cueva & Wei, 2018). Concurrent mod- eling studies both showed that a neural network containing a Recurrent Neural Network (RNN) layer (Williams & Peng, 1990; Hochreiter & Schmidhuber, 1997), when trained to perform a dead-reckoning task develops representations in the RNN layer that are strikingly similar to those of rodent grid cells. In both cases the information provided to these net- works consists of the starting position and angular velocity at a series of time-steps during the simulation of rodent movement within an enclosed environment. The networks were trained to predict the absolute position of the animal within the environment in x and y coordinates, as well as the head direction of the animal. Unlike previous studies, there was 21 no special structure imposed on the networks which a-priori biased them toward grid-like representations. Despite this, consistent grid-cell and border-cell like activation patterns were found in a large number of the neurons within the RNN in both studies. The RNN network was chosen because of its recurrent connections, which can be thought of as modeling the highly recurrent structure within the hippocampus and entorhi- nal cortex. One crucial element in both sets of experiments is that unlike traditional neural networks which are trained as deterministic function approximators, the addition of noise to either the inputs or the activation of the hidden units was required for the formation of the grid cell representation. This is hypothesized to mirror the stochastic noise inherent in biological neural systems. While Cueva and Wei (2018) only showed that a supervised learning procedure could produce grid-like cells in a neural network, Banino and colleagues went further and demon- strated that this representation could then be used in a goal-directed navigation task, sug- gesting that the networks learned in an unsupervised way to perform vector navigation similar to the more formal system and simulations described by (Bush et al., 2015). Banino et al. showed that an artificial agent trained with the addition of the learned grid-cell repre- sentation performed significantly better on a series of navigational tasks designed to mimic those found in the traditional rodent navigation literature, such as the Morris Water Maze (Morris et al., 1982). This work provided the first end-to-end model of learned grid-cell activity as well as ecologically valid application of this representation in a virtual 3D environment. One main point of interest is that these spatial representations and vector-navigation ability came about without strong explicit priors on the structure of the system, or the loss function used. This suggests that the system within the hippocampus may be an instance of a more general mechanism for representing state spaces and producing efficient means of navigating them. 
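As a concrete illustration of the dead-reckoning setup just described, the sketch below generates the kind of training data such networks receive: self-motion signals as inputs, and path-integrated position and heading as supervised targets. The arena size, trajectory length, and motion statistics are illustrative assumptions, not the values used by Banino et al. (2018) or Cueva and Wei (2018), and the code is not drawn from either study.

```python
import numpy as np

# Sketch of a "dead reckoning" dataset: inputs at each time-step are
# self-motion signals (speed and angular velocity), and the supervised
# targets are the path-integrated position and head direction.

def simulate_trajectory(n_steps=100, arena_size=2.2, rng=None):
    rng = rng or np.random.default_rng()
    pos = rng.uniform(0.1, arena_size - 0.1, size=2)   # start location
    heading = rng.uniform(-np.pi, np.pi)               # start orientation
    inputs, targets = [], []
    for _ in range(n_steps):
        speed = rng.rayleigh(scale=0.1)                # forward velocity
        ang_vel = rng.normal(0.0, 0.5)                 # turning velocity
        heading = (heading + ang_vel + np.pi) % (2 * np.pi) - np.pi
        step = speed * np.array([np.cos(heading), np.sin(heading)])
        pos = np.clip(pos + step, 0.0, arena_size)     # stay inside the walls
        inputs.append([speed, np.sin(ang_vel), np.cos(ang_vel)])
        targets.append([pos[0], pos[1], heading])
    return np.array(inputs), np.array(targets)

# A recurrent network trained to map `inputs` onto `targets` is, in
# effect, being asked to perform path integration; grid-like units were
# reported to emerge in the hidden layers of such networks.
X, Y = simulate_trajectory()
print(X.shape, Y.shape)   # (100, 3) (100, 3)
```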
22 I.2.3 Prospective and Successor Models The picture of the spatial representation in the hippocampus presented in the preceding section suggests that space is represented with respect to the physical make-up of the en- vironment via bottom-up information from sensory systems, and the animal then uses this goal agnostic representation elsewhere in the brain in order to determine the optimal path to take. While the existence of time and event cells described in the preceding section has disrupted this notion, one could still imagine that when the hippocampus represents space, it does so in a straightforward Euclidean fashion. The earliest evidence from place cell firing however disagrees with this picture. Rather than providing an even representation of space, place cells are known to fire in biased ways with respect to the structure of the envi- ronment (O’Keefe, 1976). This biased firing represents a warping of the represented space that is specific to the bodily possibilities of movement available to the animal. At the end of (Muller & Kubie, 1987), the authors mention that their recordings show that the place cell system may be used to encode a forward-looking and action-oriented representation of space for the animal, which they refer to as “Kinematics.” These findings and others have led some theorists to propose that place cells are better modeled by a successor representation of the environment rather than a purely geometric one (Stachenfeld et al., 2017). In this model the activation of any particular place cell would be reflective of the exponentially discounted sum of future states reachable by the animal from its current state. Using a temporal difference update rule, this representation could be developed quickly both online and offline as an animal moves around the space or is at rest. This would also mean that the place cell firing patterns are inherently predictive of future animal behavior, rather than simply descriptive of the space itself. Such a representation also easily allows for the incorporation of time cells, whose firing patterns are naturally reflective of the temporal structure of the task at hand. In that case the intervals of animal immobility within the experiments described above correspond to distinct durations of time, interpretable as unique states. 23 Interpreting place field sensitivity from the perspective of a successor representation allows for a new perspective on a number of findings in the literature. Early findings that in an open 2D circular maze place fields are uniform Gaussian (Muller et al., 1987) nat- urally falls out of a successor representation, since the animal is equally likely to move in any adjacent position when in a given position. Findings that in 1D linear mazes sep- arate sets of place cells fire for each direction along the maze (Foster & Wilson, 2006) are explainable when the place cells encode for a discounted sum of future states, and the animal only turns around at the end of the maze. Furthermore, the skewed nature of the receptive fields of place cells as described by (Mehta, Quirk, & Wilson, 2000) can be ex- plained as corresponding to a prospective representation of the animal’s position, rather than a geometric one. Similarly, the successor representation also helps explain the role that boundaries play in shaping place cell firing patterns, which naturally skew away from boundaries (Stachenfeld et al., 2017). 
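The successor representation described above can be made concrete with a small sketch. Assuming a random-walk policy on a ring of discrete states, the temporal-difference update below incrementally learns a matrix whose columns resemble place-field-like tuning curves. The state space, policy, and parameters are illustrative choices rather than those of Stachenfeld et al. (2017).

```python
import numpy as np

# Minimal sketch of a successor representation (SR) learned with a TD
# rule on a small ring of states.  M[s, s'] approximates the expected
# discounted future occupancy of s' when starting from s under the
# current behavioural policy.

n_states, gamma, alpha = 8, 0.9, 0.1
M = np.zeros((n_states, n_states))        # successor matrix
rng = np.random.default_rng(0)

state = 0
for _ in range(20000):
    # Random-walk policy on a ring: step left or right with equal chance.
    next_state = (state + rng.choice([-1, 1])) % n_states
    onehot = np.eye(n_states)[state]
    # TD update: move M[state] toward (occupancy now + discounted M[next]).
    td_target = onehot + gamma * M[next_state]
    M[state] += alpha * (td_target - M[state])
    state = next_state

# Under the SR account, the "place field" of a cell tuned to state s'
# corresponds to the column M[:, s']: it is largest at s' and falls off
# with policy-dependent distance along the ring.
print(np.round(M[:, 3], 2))
```

Because the learned matrix depends on the policy that generated the transitions, the resulting "fields" skew toward states the animal actually tends to reach, which is the property used above to explain directional and boundary-related distortions of place fields.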
This move away from a strictly Euclidean metric representation of physical space to a representation of an abstracted state space based on future reachability opens the door to interpreting the representational nature of the hippocampus in a manner divorced from notions of physical space and time as well. If what is being represented is a non-physical state space which the animal can simulate itself “moving through,” then it should be the case that this representation is utilized for tasks which are not physical in nature, but involve only abstract relations. Evidence for this kind of representation has been presented in humans (Schapiro et al., 2016; Garvert, Dolan, & Behrens, 2017). Schapiro et al. showed participants a series of images during fMRI scanning. Unbeknownst to the participants, the ordering of the image presentation was based on a predetermined graph structure, where each image corresponded to a node in the graph, and the presentation order was based on a random walk through the graph. The graph was explicitly broken into separate sub- graphs, with only sparse connections between the sub-graphs, and dense connections within them. The learned representation in the hippocampus after being exposed to sequences of 24 images from the graph was closer for images drawn from the same sub-region of the graph, suggesting that their temporal proximity helped shape the nature of the representation. The authors suggested the possibility of a successor-like representation being at play, but did not explicitly test this hypothesis (Schapiro et al., 2016). Garvert et al. (2017) conducted similar work nearly concurrently, but used a graph structure that was not explicitly broken into sub-regions. During scanning, participants were asked to provide orientation judgments of the images presented. The researchers found that the representational similarity of the different images reflected their proximity on the graph. In particular, the representation was found to be best modeled by a successor representation as described in (Stachenfeld et al., 2017), in which images were represented as similar to those that were likely to be presented in the future based on the structure of the graph. Importantly, a Euclidean space representation based purely on the actual distance between items on the graph was found to be a worse fit for the data than a successor representation that considered the structure of the graph, and the policy used to walk it. This suggested that the representation encoded in the hippocampus for the task was explicitly future, and “action” oriented. The successor representation bears a strong similarity to the Temporal Context Model (TCM) introduced to explain recency and contiguity in the domain of episodic memory (M. W. Howard & Kahana, 2002). In that model, a distributed vector representation is used to describe all the items presented to a participant, along with a separate vector being used to represent the temporal context. During learning the context vector is updated based on each newly presented item along with the previous context. The model has been applied to describe both canonical findings in the memory literature as well as to model the de- velopment of place cells in animals (M. W. Howard, Fotedar, Datey, & Hasselmo, 2005), with similar predictions as those of (Stachenfeld et al., 2017). 
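A stripped-down sketch of the context-updating idea behind the TCM may help make the comparison concrete: a slowly drifting context vector is nudged toward each newly presented item, so that items studied close together in time become bound to similar contexts. The update rule and parameter values below are simplifications for illustration, not the published model.

```python
import numpy as np

# Stripped-down sketch of temporal-context drift: the context vector is
# a mixture of its previous value and the current item representation,
# so nearby items in a study list share similar contexts.

n_items, dim, rho = 6, 16, 0.8          # rho controls the context drift rate
rng = np.random.default_rng(1)
item_vectors = rng.normal(size=(n_items, dim))
item_vectors /= np.linalg.norm(item_vectors, axis=1, keepdims=True)

context = np.zeros(dim)
contexts = []
for i in range(n_items):
    # New context is a mixture of the old context and the current item.
    context = rho * context + (1.0 - rho) * item_vectors[i]
    context /= np.linalg.norm(context)
    contexts.append(context.copy())

# Items presented nearby in time end up bound to more similar contexts,
# which is the source of the model's recency and contiguity effects.
sim = np.array(contexts) @ np.array(contexts).T
print(np.round(sim, 2))
```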
It is perhaps unsurprising that these models show similar predictions, as it has been shown by Gershman and col- laborators that not only are these models similar, but can be shown to be equivalent under 25 certain circumstances (Gershman, Moore, Todd, Norman, & Sederberg, 2012). This con- nection opens the possibility of bridging results from decades of research into both episodic memory encoding and retrieval as well as spatial learning and navigation, and both concern representing trajectories of experience in a way that can be generalized to new situations, and ultimately used to guide future action. I.2.4 Goal Signals and the Hippocampus The world that humans and other animals find themselves in is not a neutral space. It is filled with salient locations associated with goals and rewards that impact the nature of our behavioral choices. These locations need not be physical, as anyone who has received a pleasant surprising email, phone call, or text message can attest to. There is evidence that this inherent salience of certain states of the world is reflected in the nature of the representations present within the hippocampus. Consistent with evidence presented in the preceding section that the place cell repre- sentation of space is not uniform and Euclidean, research conducted by (Hollup, Molden, Donnett, Moser, & Moser, 2001) found that there were significantly more place cells with activation fields near the goal location in a water maze task as compared to any other re- gion in the maze. This biased preference in activation occurred regardless of the actual position of the goal platform within the maze. The researchers suggested that this was because of a bias toward a behaviorally salient part of the environment. This biasing of activation suggests not only a representation which is geared toward an abstract state space representation, but also one which privileges the goal location in that state space. Similar to the bias of place cell firing around a goal location is the bias of firing pat- terns in the replay of trajectories during rest in rodents. Pfeiffer and Foster conducted experiments showing that the trajectories in replay events are biased towards the goal lo- cation in a 2D foraging task (Pfeiffer & Foster, 2013). They showed that this bias in the kinds of trajectories replayed could not be explained by either frequency of visitation, or 26 by a simple function of the heading direction of the animal. Instead they reflected the likely future behavior the animal would engage in after rest, and preferentially ended at the goal location significantly more often than any other location in the maze. They furthermore demonstrated that the replay trajectories in many cases corresponded to non-experienced combinations of both start and end positions, suggesting the ability for the animal to gen- eralize to novel goal locations and start positions. The ability to replay novel trajectories was further demonstrated in work showing that rodents were able to preplay trajectories to a novel goal even when that goal location was never visited before by the animal (Ólafsdóttir, Barry, Saleem, Hassabis, & Spiers, 2015). Using a T-maze, researchers examined spontaneous trajectory activation patterns in CA1 of the rodent hippocampus. The animal was placed at the end of the maze, with a barrier preventing it from reaching the decision making fork in the maze. From there a rewarding object was placed in the right wing of the maze, and made visible to the animal. 
During a rest period following this presentation, recordings were made from CA1. These recordings showed significant preplay-like events for the wing of the maze containing the reward, but not for the wing without the reward. This suggests that the animal was able not only to integrate the presentation of the goal into a general representation of the maze, but also then perform a kind of planning corresponding to the future behavior of the animal, taking a path to that rewarding location. These results, along with those of (Pfeiffer & Foster, 2013) further complicate the distinction between the notions of replay and preplay, suggesting that both exist as instances of a more general phenomena of “playing” or “simulating” possible experiences within an abstracted state space. The extent to which these simulated trajectories do or do not correspond to actually experienced trajectories is the extent to which they might be referred to as either replay or preplay. Pezzulo and colleagues propose to generalize this notion into phenomena which they refer to as internally generated sequences (IGS) (Pezzulo et al., 2014), which corresponds to any activation of cell populations in the hippocampus which reflect trajectories through 27 the represented state space that are not reflective of the actual position of the animal. They propose that these IGS events are goal-driven, and correspond to a model-based learning system. In this view, IGS events can either correspond to the updating of an environmen- tal model, or the application of that model for forward-planning. The model they propose bears a strong resemblance to the MCTS algorithm (Silver et al., 2016), in which an explicit search procedure is combined with a learned value estimator. Such a view finds support- ing evidence in research showing that these IGS events in the hippocampus are correlated with similar sequence events in the ventral striatum, which are known to relate to value estimation (Lansink, Goltstein, Lankelma, McNaughton, & Pennartz, 2009). The existence of a bias toward goal locations and trajectories in the CA1 region of the hippocampus suggests that goal signals are indeed influential on the hippocampal for- mation as a whole. Consistent with this hypothesis have been results from a number of recent studies showing the ability of researchers to decode goal signals in the entorhinal cortex and subiculum (Spiers & Maguire, 2007; L. Howard et al., 2014; Chadwick, Jolly, Amos, Hassabis, & Spiers, 2015). Spiers and Maguire had human participants engage in a goal-directed navigation task in a virtual environment (Spiers & Maguire, 2007). While conducting this task, fMRI was used to explore whether goal proximity could be decoded from the brain. They found that both mPFC and entorhinal cortex signals were significantly correlated with goal distance. Similar work using a set of snapshots from a video of city navigation was used to demonstrate that a human’s Euclidean distance from a goal could be decoded from the entorhinal cortex (L. Howard et al., 2014). Lastly, Chadwick and colleagues demonstrated that goal direction could be decoded from the entorhinal cortex in humans during a goal direction judgment task in a virtual environment (Chadwick et al., 2015). Interestingly the goal direction signal was allocentric in the entorhinal cortex, while a separate egocentric goal direction signal was found in the precuneus. 
Taken to- gether these experiments provide evidence for the modulation of hippocampal formation activity by a goal signal. Chadwick and colleagues hypothesized that it was a population 28 of head direction cells responsible for both goal direction and head direction representation being decoded in their experiments. This suggests that there is a kind of negotiation be- tween bottom-up sensory information and top-down goal or recall driven signals guiding the active representation in the hippocampus at any given time. The “top-down” nature of goal-selective activation described above must correspond to some other brain region or regions. It has been hypothesized that the medial prefrontal cortex (mPFC) in particular is a potential generator of such a goal signal or specification (Poucet & Hok, 2017; Erdem & Hasselmo, 2012; Pezzulo et al., 2014). Such a theory states that the mPFC would generate relevant goal states based on the current state of the animal and relevant incoming sensory information, as mediated by the hippocampus. It would then engage in a goal specification, which would influence the hippocampal representation in the ways described above, thus inducing trajectory to or from the goal location within the abstract representation space provided by the hippocampal formation. This activation would then be passed to the ventral striatum, where explicit value estimations would be produced. Indeed, recent evidence suggests that the hippocampal formation serves a critical role in the stable functioning of such a model-based planning system in humans (Vikbladh et al., 2019). What remains to be explained is what computational principles may serve as the basis for the learning and application of such goal, state-space, and value estimation representations. I.2.5 Policy Learning from Real and Imagined Experience Thus far we have described a system of representations which allow for the generation of a state representation and goal signals, the simulation of an abstracted future-oriented state space. There are a number of methods which can then be used to obtain value estimates and optimal actions using these building blocks. These include TD-learning (temporal dif- ference learning) methods which update a policy or value function every time step (Sutton & Barto, 1990), Dyna, which enables additional offline learning using a model of the en- 29 vironment (Russek, Momennejad, Botvinick, Gershman, & Daw, 2017), and tree-search planning methods (Daw, Niv, & Dayan, 2005). These have all been proposed at the higher level of behavior, rather than as specific suggestions concerning the nature of how the hippocampal state space representation is learned. In this specific domain there have been similar suggestions, such as the TD-update rule proposed as a means of potentially learning place cell activation patterns (Foster et al., 2000; Stachenfeld et al., 2017). Evidence for the existence of internally generated hippocampal sequences during sleep and rest have opened the door to more sophisticated models of learning in these systems. In particular, the Dyna algorithm has been seen as a means of providing a unified explanation for replay during sleep and awake states (Johnson & Redish, 2005; Russek et al., 2017). At the simplest level, the reactivation of trajectory sequences can be interpreted as the brain performing learning on these sequences. 
In models where we are directly learning a value estimate, this reactivation would correspond to updating the value estimates for the states which are part of the reactivated trajectory. If we assume that the state representation is based on a successor representation as proposed above, then the update would be not to the value estimation, but rather to the state representation itself. In either case, learning from whole trajectories opens the possibility of applying more sophisticated learning rules than TD. This model of learning from internally generated sequences can be further extended if we assume a non-uniform activation of the internally generated sequences. Indeed, there is a evidence for this, as described in Section I.2.4. As described above, sequences which lead to goals, either visited or known through observation are played more frequently than random (Pfeiffer & Foster, 2013; Ólafsdóttir et al., 2015). This has been hypothesized to correspond to a prioritized replay mechanism (Mattar & Daw, 2018), similar to what has been proposed in the artificial intelligence literature as a mechanism for increased efficiency and stability during learning (Moore & Atkeson, 1993; Mnih et al., 2015). In this system, experiences are selected for activation (and learning) based on the strength of an error signal 30 (referred to as gain), combined with an expected visitation signal (referred to as need). By biasing the activation of events and sequences using these factors, the canonical forward and reverse replay events naturally fall out. Take for example the well-studied phenomenon of reverse replay which takes place when an animal reached a goal location. If this location is novel, then the value estimation error signal will be large for that state, and lead to replay, and subsequent updating using a TD-learning rule, thus decreasing the discrepancy between the estimated value of that state and the experienced value of the state. From this point the state with the next greatest discrepancy would then be the preceding state, then the state preceding that one. In this way a reverse replay backwards all the way along the trajectory would take place until the sequence reaches the starting position. Mattar and Daw (2018) also include a need term in their model to explain the forward play of sequences from the animal’s current position when the animal is placed at the beginning of a maze. This need term gives high priority to states likely to be reached from the current state, such as those directly in front of the animal. Beyond fitting behavioral data to computational models, these theories around offline prioritized replay as a mechanism for learning have begun to be tested empirically in hu- mans (Momennejad, Otto, Daw, & Norman, 2018). Momennejad and collaborators have recently shown that replay events in humans during rest not only correspond to what would be expected by a prioritized replay mechanism, but that their reactivation is predictive of subsequent performance improvements on a two-step decision task. In those experiments, participants were exposed to a two-step decision task with certain transition and reward values. The structure of the task was then changed for participants in one condition, those participants had to re-learn the new optimal policy and value estimates for the task. The researchers used multi-voxel pattern analysis to decode the replay events of the participants during rest, finding that in the reevaluation condition there was significantly more replay. 
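Before returning to these results, the gain-and-need prioritization described above can be sketched schematically: each candidate experience is scored by the product of a gain term (how much replaying it would change the value estimate) and a need term (how likely its state is to be visited from the current position), and experiences are reactivated in order of this priority. The quantities below are simplified stand-ins for those in Mattar and Daw (2018), chosen only for illustration.

```python
import numpy as np

# Schematic sketch of gain x need prioritisation.  "Gain" is read off the
# magnitude of the TD error that replaying each state would correct, and
# "need" is read off a successor representation row for the current state
# (expected future occupancy of each candidate state).

n_states = 5
td_error = np.array([0.0, 0.1, 0.0, 0.9, 0.2])   # error if each state were replayed
M = np.array([                                    # toy successor matrix (rows = current state)
    [1.0, 0.8, 0.6, 0.4, 0.2],
    [0.2, 1.0, 0.8, 0.6, 0.4],
    [0.1, 0.2, 1.0, 0.8, 0.6],
    [0.1, 0.1, 0.2, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.2, 1.0],
])

def replay_priority(current_state, td_error, M):
    gain = np.abs(td_error)          # expected improvement from the update
    need = M[current_state]          # expected future occupancy of each state
    return gain * need

current_state = 0
priority = replay_priority(current_state, td_error, M)
print(np.argsort(priority)[::-1])    # states in the order they would be replayed
```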
Momennejad and colleagues also show that the replay that takes place is consistent with a prioritized reactivation scheme similar to the one described above. Importantly, this reactivation is predictive of the extent of adaptation of the participants to the new task structure. While preliminary, these results give initial support to the computational theories of prioritized replay, offline learning, and the hippocampus as a site of a successor representation.

I.3 Generative Temporal Models

The general framework of mapping sensory information into an abstracted state, representing that state in a predictive manner, and combining the predictive representation with a goal signal to produce value estimates and candidate actions has been proposed as a possible model of high-level decision making in mammals (Russek et al., 2017; Pezzulo, Kemere, & Van Der Meer, 2017). In one such model, each of these processes can be roughly mapped onto the sensory cortices, medial temporal lobe, and ventral and dorsal striatum, respectively (Pezzulo et al., 2017). Indeed, there is evidence for sensory cortices learning compressed state representations (Van Essen & Maunsell, 1983), corresponding to both the 'what' and the 'where' or 'how' of the visual stream (Kravitz, Saleem, Baker, & Mishkin, 2011). There is evidence of the hippocampus performing pattern separation (Yassa & Stark, 2011), prospective representation learning (Stachenfeld et al., 2017; Garvert et al., 2017), and preplay of future movement (Pfeiffer & Foster, 2013). Finally, there is evidence of the ventral striatum performing value estimation (Kable & Glimcher, 2007; Peters & Büchel, 2010), and guiding behavioral learning in the dorsal striatum (O'Doherty et al., 2004). In addition to evidence of each of these distinct brain regions acting separately, there is evidence for the relevant sets of connections to support the hypothesized goal-oriented model of Pezzulo et al. (2017). These connections include those between sensory cortex and the MTL (Ji & Wilson, 2007; Kravitz et al., 2011), and between the hippocampus and the striatum (Lansink et al., 2009; Pennartz, Ito, Verschure, Battaglia, & Robbins, 2011). Taken together, these interconnected systems provide one possible working framework for the goal-directed learning and action necessary to support and utilize a cognitive map.

The theoretical model of a predictive agent described above shares a strong resemblance to a class of neural network models called generative temporal models (GTMs). These models are also more informally referred to as world models, and in this work we will make use of both terms. They are trained to predict future sequences of observations within an environment, given a past sequence of observations and actions. In doing so, these models learn the underlying structure of the environment they are trained in. These GTMs can then be used to guide reinforcement learning and goal-driven decision making in the environment in which they were trained. A central thesis of this work is that the connection between GTMs and the "experience construction system" of the medial temporal lobe in humans is more than just a superficial one. Here we propose that a certain class of GTMs can serve as a useful model for an array of neural and behavioral findings associated with the hippocampus. The purpose of this work is to demonstrate this with a series of informative examples, and to provide insight into how this theoretical model can be further extended in future work.
A main tenet of the construction systems hypothesis is that the hippocampus serves to support the generation of coherent narrative experiences, whether they be of remembered events in the past, or imagined events in the future. A more computationally specific way of framing this hypothesis is that the hippocampus is responsible for maintaining a generative model of semantically coherent trajectories through an abstract state space which is correlated with cortex states, and their accompanying experiential properties. In this sense, the hippocampal formation is indeed a very specific kind of generative temporal model, one capable of storing, making sense of, and allowing for the recall of, millions of experiences that an animal may encounter in their life.

We believe that GTMs can generalize a number of the theoretical findings discussed above in this introduction. Rather than starting from pre-existing computational building blocks, this work seeks to demonstrate that cells with firing properties such as those of place and time cells, as well as phenomena such as replay and preplay, can naturally derive from unsupervised learning of GTMs for the purpose of goal-driven navigation. In the course of this dissertation, we will demonstrate this with increasingly realistic stimuli and environments.

This section of the introduction seeks to further make clear this connection, as well as to provide preliminary context and definitions for the generative temporal models and components that will make up the majority of the work discussed in the subsequent chapters.

I.3.1 Basics of Generative Temporal Models

At the core of many recent instantiations of the generative temporal model framework is the variational autoencoder (VAE) (Kingma & Welling, 2013). VAEs are a class of stochastic generative models which learn compact latent representations of data using a log-likelihood learning objective. Due to their theoretical grounding and connection with Bayesian learning, VAEs have recently emerged as a computationally viable means of performing predictive coding (Friston & Kiebel, 2009), whereby the probability distribution over high-dimensional observation spaces can be tractably computed.

A VAE is composed of two tightly coupled neural networks, one which performs inference of the latent variable from an observation, and the other which performs generation of a predicted observation using a latent variable. These networks are sometimes also referred to as the encoder and decoder networks, respectively. In a VAE, the sensory stream of observations o is sent through the inference network to be encoded into a latent distribution, from which a latent variable z is sampled. The specific nature of this distribution can vary, with the most common instantiations using a gaussian distribution (Kingma & Welling, 2013; Higgins et al., 2016). As will be discussed below for its connection with the hippocampus, the gumbel-softmax distribution (Jang, Gu, & Poole, 2016), which induces sparsity in the representation, is another choice of latent distribution within a VAE. Once a latent variable is sampled, the generative process takes place, whereby a predicted sensory perception o* is decoded from z using the decoder network. See Figure 1 for a diagram of the variational autoencoder network flow.

Figure 1: Diagram of a variational auto-encoder. Boxes correspond to deterministic variables. Circles correspond to stochastic variables. Grey corresponds to input variables. Green corresponds to output variables. Blue corresponds to network layers responsible for context information.
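A minimal sketch of the encoder/decoder structure just described, with a gaussian latent and the reparameterization trick, is given below. The layer sizes and architecture are arbitrary placeholders, not the networks used later in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of a VAE with a gaussian latent: an inference (encoder)
# network maps an observation to a latent distribution, a sample z is
# drawn via the reparameterisation trick, and a generative (decoder)
# network reconstructs a predicted observation o* from z.

class VAE(nn.Module):
    def __init__(self, obs_dim=3, latent_dim=8, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent_dim)       # mean of q(z|o)
        self.to_logvar = nn.Linear(hidden, latent_dim)   # log-variance of q(z|o)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, obs_dim))

    def forward(self, o):
        h = self.encoder(o)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterisation trick: sample z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        o_hat = self.decoder(z)                           # predicted observation o*
        return o_hat, mu, logvar

def vae_loss(o, o_hat, mu, logvar, beta=1.0):
    recon = F.mse_loss(o_hat, o, reduction="mean")
    # KL divergence between the gaussian posterior and a standard normal prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

model = VAE()
o = torch.rand(32, 3)                 # a batch of toy observations
loss = vae_loss(o, *model(o))
loss.backward()
```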
A VAE is trained end-to-end both to maximize prediction accuracy and to optimize a regularization term used to induce a smooth manifold in the latent space (Kingma & Welling, 2013). This regularization term often takes the form of a KL (Kullback-Leibler) divergence loss between the current distribution and some target prior distribution. In the case of a gaussian latent distribution, the prior is a normal distribution. In the case of a gumbel-softmax distribution, the prior is a uniform distribution. The weighting of these two loss terms results in a trade-off between accuracy and generality. The regularization term has been demonstrated to induce a disentangled latent representation in the gaussian case (Higgins et al., 2016). In Chapter II, we demonstrate that a similar phenomenon takes place when using a gumbel-softmax distribution.

By itself, a VAE is not sufficient to meet the criteria of a GTM, as there is no temporal component allowing for the generation of future states from a current state. By extending a VAE with a forward dynamics model, however, it gains the ability to model the temporal dynamics of an environment. A forward model is a function $s_{t+1} = f(s_t, a_t)$, where $s$ is a state, $a$ is an action taken by an agent in the environment, and $t$ is the current time-step of the environment simulation. A recent example of this simple but powerful idea was the "World Model" (Ha & Schmidhuber, 2018). This model combined a latent state representation from a VAE with a recurrent neural network, specifically implemented as an LSTM (Hochreiter & Schmidhuber, 1997). Rather than modeling the temporal dependencies between the high-dimensional sensory observations o, the World Model learns to model only the dependencies between the low-dimensional latent states z produced by the VAE, making the learning problem significantly more tractable. See Figure 2 for a diagram of the World Model. Furthermore, it makes planning within the latent space possible, since the learned low-dimensional latent states can be used as the basis for performing reinforcement learning.

Figure 2: Diagram of a World Model. White corresponds to model input. Green corresponds to model output. Blue corresponds to content information. Purple corresponds to joint context and content information. Nodes marked with a * correspond to values at the next time-step, or predictions of those values.

This basic formulation has been extended in the "Recurrent State Space Model" (RSSM), which augments the stochastic latent state z with an additional deterministic latent state h kept within the RNN (Hafner et al., 2018). This results in both greater model representational capacity and the ability for the model to partition that capacity into representing stochastic and deterministic aspects of the environment independently. By utilizing both stochastic and deterministic latent states, RSSMs have been able to model the dynamics of complex control tasks from high-dimensional visual observations, and to use the model to perform efficient reinforcement learning (Hafner, Lillicrap, Ba, & Norouzi, 2019). See Figure 3 for a diagram of the network flow of an RSSM.
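The latent dynamics component of such a model can be sketched as follows: a recurrent network receives the current latent state and action and predicts the latent state at the next time-step, which can then be rolled forward to "imagine" trajectories entirely in latent space. This is a schematic reading of the architectures described above with placeholder dimensions, not the implementation of Ha and Schmidhuber (2018) or Hafner et al. (2018).

```python
import torch
import torch.nn as nn

# Sketch of the latent dynamics component of a World Model: an LSTM cell
# takes the current latent state z and action a, and a linear head
# predicts the latent state at the next time-step.

class LatentDynamics(nn.Module):
    def __init__(self, latent_dim=8, action_dim=4, hidden_dim=64):
        super().__init__()
        self.rnn = nn.LSTMCell(latent_dim + action_dim, hidden_dim)
        self.to_next_z = nn.Linear(hidden_dim, latent_dim)

    def forward(self, z, a, state):
        h, c = self.rnn(torch.cat([z, a], dim=-1), state)
        z_next_pred = self.to_next_z(h)     # prediction of z at t+1
        return z_next_pred, (h, c)

batch, latent_dim, action_dim, hidden_dim = 16, 8, 4, 64
dynamics = LatentDynamics(latent_dim, action_dim, hidden_dim)
state = (torch.zeros(batch, hidden_dim), torch.zeros(batch, hidden_dim))

# Roll the model forward over a short imagined trajectory in latent space.
z = torch.randn(batch, latent_dim)          # e.g. a latent from the VAE encoder
for t in range(5):
    a = torch.zeros(batch, action_dim)
    a[:, t % action_dim] = 1.0              # an arbitrary one-hot action
    z, state = dynamics(z, a, state)
print(z.shape)                               # torch.Size([16, 8])
```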
More recent models have extended this formulation in a hierarchical manner, modeling environment dynamics at multiple temporal scales and enabling planning in environments with large state spaces (Kim, Ahn, & Bengio, 2019).

Figure 3: Diagram of a Recurrent State Space Model (RSSM). White corresponds to model input. Green corresponds to model output. Purple corresponds to joint context and content information.

I.3.2 Extending GTMs with Memory and Multiple Latent States

The generative temporal models described above are powerful methods for learning the dynamics of an environment and using them to plan goal-directed behaviors. They fall short of one key aspect of the capabilities of mammals with cognitive maps, however, and that is the ability to quickly learn from and make use of novel experiences. Humans and other mammals are able to remember a series of events that only needs to take place once, learning to navigate novel environments in a so-called "one-shot" manner. This is made possible due in part to the highly plastic nature of the recurrent connections within the hippocampus (Frank, Stanley, & Brown, 2004).

In addition to this plasticity, there is a critical separation between the content (objects within a scene) and context (location) of the incoming sensory stream of information. Representations in the upstream LEC and MEC have been demonstrated to contain content and context information, respectively (Hafting et al., 2005; Deshmukh & Knierim, 2011). This structured information allows for more intelligent storage and retrieval of latent states than is possible in a World Model or RSSM, where this information is entangled together.

Attempts to use more structured and fast-adapting methods have resulted in a new class of GTMs which are indeed able to capture many of the additional capabilities of the cognitive map that the simpler models were lacking. Key to these innovations has been the addition of various kinds of differentiable neural dictionaries (DND) used for additional storage within the network beyond what a recurrent neural network is capable of (Pritzel et al., 2017). These differentiable dictionaries are initialized at the beginning of an episode of experience for a virtual agent, and are then used to store and recall information during the episode. A simple example of this, designed for 2D environments, is the Generative Temporal Model with Spatial Memory (GTM-SM), which uses a VAE along with a hand-crafted DND (Fraccaro et al., 2018). In a GTM with a DND, a new memory is written at each time step in the form of a key-value pair, with the context variable serving as the key, and the content variable serving as the value. During recall, stored keys are compared to a query key, and used to determine which value to recall. See Figure 4 for a visual representation of the network flow of a GTM-SM.

Figure 4: Diagram of the Generative Temporal Model with Spatial Memory (GTM-SM). Blue corresponds to content information. Yellow corresponds to a differentiable memory store. Red corresponds to context information. White corresponds to model inputs. Green corresponds to model predictions.

The recently proposed "Model-Based Predictor" (MBP) model utilized a recurrent VAE,
but additionally augmented with a differentiable memory module similar to a DND, but which uses a multi-headed query system in order to enable more complex learned storage and retrieval mechanisms (Graves et al., 2016). This model was furthermore trained end- to-end to not only perform memory recall, but also to perform goal-directed navigation tasks in a few-shot manner (Wayne et al., 2018). As such, the memory module acted as a learn-able dictionary look-up, where new experiences could be stored and retrieved as was demanded by the task. Even more recently the “Tolman-Eichenbaum Machine” (TEM) has been proposed as a model of entorhinal and hippocampal representation learning (Whittington et al., 2019). This model similarly utilizes a VAE framework, but explicitly accounts for separate ‘con- tent’ and ‘context’ input streams from the lateral and medial entorhinal cortices, respec- tively. Like MBP, TEM also uses a differentiable memory module to store and retrieve the bound representations. The resulting model demonstrates many predicted properties and 39 representations in the medial temporal lobe such as grid, border, and place cells, along with neurally consistent remapping. We propose and examine a novel variant of the dictionary-based GTM called a Dual Stream World Model in Chapter IV of this work. I.3.3 Hippocampal Index Theory and a Language Metaphor Due to the success of dictionary-based generative temporal models, it is perhaps of value to examine the dictionary metaphor more closely, as it pertains to the medial temporal lobe. If the medial temporal lobe is a kind of dictionary, with keys and values, then it is for a language of narrative experiences, or episodic memories. This notion of a dictionary is closely related to that of the hippocampal index theory (Teyler & DiScenna, 1986). According to this theory, the hippocampus quickly forms a low-dimensional representation corresponding to the higher-dimensional cortex states. This low-dimensional representation being an “index” for the higher-dimensional one. This index can be interpreted in the simplest context as the latent state of a variational auto- encoder, as discussed above. It can also be interpreted as a key of an entry in a dictionary, with the value of that entry corresponding to the higher-dimensional state. Humans use and deploy a verbal and written language composed of words which we string together using a system of syntax and grammar. Each of these words has a corre- sponding meaning, and a specific place within any given sentence that the word must go in order to be semantically meaningful. Given a series of words in a sentence, there are only so many words that might end the sentence, for example. Consider a sentence like ‘The cat sat on the .’ Most people who have undergone traditional English education would implicitly want to end that sentence with ‘mat.’ Furthermore, English speakers also know what a ‘mat’ refers to in this context. Likewise, when we walk around our homes, and walk into a kitchen, we know to expect an oven, a refrigerator, and cabinets. This concept of language consisting of meanings, words, and a grammar can be a useful 40 metaphor for understanding the role that the medial temporal lobe, and the hippocampus in particular plays in the mammalian brain. In this metaphor, we can think of place, time, and event cells as being “state cells,” with each corresponding to the words of a language. These word tokens can be seen as equivalent to the indices of the hippocampal index theory. 
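The dictionary mechanics underlying this framing, writing context keys paired with content values and recalling stored values by key similarity, can be sketched in a toy form before the metaphor is elaborated further. This illustrates only the general key-value mechanism; it is not the specific differentiable dictionaries used in the GTM-SM, MBP, or TEM models.

```python
import numpy as np

# Toy sketch of a key-value episodic memory: at each step a "context"
# key and a "content" value are written to a store, and recall returns a
# similarity-weighted mixture of stored values given a query key.

class KeyValueMemory:
    def __init__(self):
        self.keys, self.values = [], []

    def write(self, key, value):
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(np.asarray(value, dtype=float))

    def read(self, query, temperature=0.1):
        keys = np.stack(self.keys)
        sims = keys @ np.asarray(query, dtype=float)     # dot-product similarity
        weights = np.exp(sims / temperature)
        weights /= weights.sum()                         # soft attention over entries
        return weights @ np.stack(self.values)           # weighted recall of contents

rng = np.random.default_rng(2)
memory = KeyValueMemory()
for _ in range(20):                                      # one episode of experience
    context = rng.normal(size=4)                         # e.g. a location-like key
    content = rng.normal(size=8)                         # e.g. a sensory-like value
    memory.write(context, content)

recalled = memory.read(query=memory.keys[5])             # cue with a stored context
print(np.round(recalled, 2))
```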
The temporal dynamics of the hippocampus, specifically of the CA1 and CA3 regions then correspond to the syntactic structure within which the language unfolds, and how one index follows another. The connection between the hippocampus and the cortex, mediated by the lateral and medial entorhinal cortex acts as the process of storing and looking up words in a dictionary, and associating words with their definitions. The use of such a symbolic language is convenient for many reasons. It allows us to swap in simple tokens consisting of a few syllables for complex ideas and objects within the world. We then simply need to make use of a mapping between these high-dimensional meanings and the words. A similar problem arises in the domain of memory and goal- directed navigation. Our narrative experiences are filled with extremely high-dimensional perceptual, cognitive, and affective information. Rather than storing and learning the tran- sitions between these high-dimensional variables which exist as states of the cortex, the hippocampus generates low-dimensional tokens in the form of sparsely-firing “state” cells. These then serve as an index or placeholder for the cortical activation, and corresponding phenomenal experience in the animal. Because they are abstracted away from the cortex state itself, these hippocampal states can also serve to enable generalization when the content of the sensory perception changes, but the structural aspects of the environment remains the same. Such is the case if a room needs to be navigated after a new coat of paint on the walls, or a new pattern on the rugs. Such superficial changes should not change the state itself. Indeed, such changes to envi- ronments do not result in hippocampal remapping of place cells in experiments with rodents (Muller & Kubie, 1987). In addition to the greater capacity for storing these low dimensional “state” cells comes 41 the related benefit of easier composability. Because these tokens are low-dimensional and re-usable, it is possible to generate sequences or motifs of them with comparative ease, compared to their high-dimensional cortical counterparts. These sequences can be seen as being akin to stock phrases in languages. These would map to sequences of known experi- ences, such as the experience of walking down a hallway, where a sequence of half a dozen place cells might always activate in the same order every time the hallway is traversed. In the same way, longer narrative experiences such as one’s trip across town to run errands can be composed of sequences of these motifs without recourse to tying all of the underly- ing cortical states together. In this way memories can be quickly formed and stored in the hippocampus before the much longer-term process of transfer to long-term memory takes place. The use of these simple sequenced tokens also allows for the creation of novel se- quences of “state” cell activations. In the same way that humans learn to play with language to explore the linguistic possibilities, the processes of imagination and planning engage the ‘language‘ of the medial temporal lobe and can allow for the exploration of novel sequences of events. Importantly however, these sequences are not, and cannot be arbitrary, as there is a syntax and grammar to this language. In the same way that some sentences don’t make grammatical sense, some sequences of experiences don’t make navigational sense. 
This ties directly into empirical research which has shown that spontaneous place cell activity follows motifs of groups of two or three units (Liu et al., 2018). These can be thought of as the basic phrases by which the language of the hippocampus is composed. The breakdown of this capacity is related to breakdowns in narrative coherence in patients with hippocam- pal damage (Hassabis et al., 2007). While not perfect, we believe that this metaphor has the value of providing an inter- pretation for the success of recent GTMs such as MBP and TEM. Furthermore, it can help guide the development of novel models, such as those we will present here. First however, it will be of benefit to point out the properties of simpler generative temporal models, and 42 how even basic models can support the development of place-like cells. 43 CHAPTER II THE HIPPOCAMPUS AS A GENERATIVE TEMPORAL MODEL The preceding chapter surveyed the current state of our understanding of the hippocampus and its ability to support a flexible system of navigation which has been referred to as a cognitive map. It also introduced a powerful class of computational models referred to as GTMs which can match a number of the empirical findings of the cognitive maps in hu- mans and other mammals with respect to navigational ability in both familiar and novel environments. We now turn to a concrete demonstration of the properties of GTMs and their relationship to the representations found in the hippocampal formation. Rather than starting with a complex GTM, we will begin our analysis from first principles, demonstrat- ing basic properties of a simple GTM, and only later moving on to a more complex model. In this chapter, we will demonstrate that cells with firing patterns similar to those found in hippocampal place and time cells, which we have referred to as ”state cells” can arise from a basic form of a GTM. The learned representations will then be shown to display proper- ties of hippocampal representations in humans and other mammals, namely the temporal community structure (Schapiro et al., 2016). 44 II.1 Place and Time Cells in a GTM Latent State As discussed above, the place cell was the first major spatially selective cell to be dis- covered in the hippocampus (O’Keefe, 1976), and provided the initial evidence that the hippocampus is an important brain region for those interested in understanding cognitive maps in mammals (O’Keefe & Nadel, 1978). We likewise begin our analysis of the connec- tion between GTMs and cognitive maps in the same way, by looking at the conditions under which a simple GTM can be shown to develop units with place-cell like firing properties. Key to the development of place-like cells in our model will be the use of the gumbel- softmax (GS) distribution to represent the latent space of the variational auto-encoder (Jang et al., 2016). This distribution was developed to allow for sampling from categorical dis- tributions while maintaining differentiability, which is essential for solving certain tasks with neural networks trained using backpropagation. This representation has the effect of inducing sparsity on the representation being learned, due to the “softmax” operation. See Figure 5 for a visual representation of the distribution, and example samples from it. 
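A minimal sketch of sampling from the gumbel-softmax distribution, mirroring the procedure illustrated in Figure 5, is given below. The raw values and temperature are hand-picked for illustration only.

```python
import numpy as np

# Minimal numpy sketch of a gumbel-softmax sample: Gumbel noise is added
# to the log of the class probabilities, and a low-temperature softmax
# then yields a nearly one-hot (i.e. sparse) sample.

def gumbel_softmax_sample(logits, temperature=0.5, rng=None):
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    g = -np.log(-np.log(u))                    # standard Gumbel(0, 1) noise
    y = (logits + g) / temperature
    y = np.exp(y - y.max())                    # numerically stable softmax
    return y / y.sum()

raw_values = np.array([1.0, 3.0, 0.5, 2.0, 0.2])      # unnormalised preferences
logits = np.log(raw_values / raw_values.sum())        # log of the softmax probabilities
for _ in range(4):                                     # four samples, as in Figure 5
    print(np.round(gumbel_softmax_sample(logits), 2))
```

Lower temperatures push each sample closer to a one-hot vector, which is the source of the sparsity discussed below.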
In the context of a model of the medial temporal lobe, this sparsity can be seen as being induced by the dentate gyrus within the hippocampus, a region through which a significant amount of incoming information passes, and which contains sparse connections to downstream hippocampal regions (Leutgeb, Leutgeb, Moser, & Moser, 2007).

The use of a gumbel-softmax latent distribution also has an important connection to clustering algorithms, where the size of the GS distribution determines the upper bound on the number of possible clusters, and each cluster emerges in a "soft" and probabilistic sense. This directly relates to a recently proposed theoretical model of hippocampal dynamics (Mok & Love, 2019). In the model proposed by Mok and Love, the hippocampus performs clustering on the sensory stream of inputs, and place cells develop as a special case of this in strictly spatial environmental contexts. Likewise, time cells emerge as the temporal case, where specific durations of time are clustered into groups, and these are used as downstream variables.

Figure 5: Explanation of the Gumbel-Softmax distribution. Top left: hand-generated underlying values used to produce the distribution. Top right: softmax distribution created from the raw values. Bottom: four random samples from the gumbel-softmax distribution.

The model proposed below can be seen as taking similar computational inspiration to Mok and Love (2019), but demonstrates this principle within the context of an end-to-end differentiable neural network, which has greater biological plausibility than the k-means clustering algorithm used in that work, as well as greater representational capacity.

As discussed in the introduction, there is not a clear delineation between place cells and other cells in the hippocampus which also display limited selectivity, such as time cells. Indeed, there is evidence for cells taking on either place-like or time-like properties as the task and environment contingencies demand (MacDonald et al., 2011). Here we also demonstrate that the same computational principle which allows for the development of place cells also allows for the development of time cells in the case of the incoming information providing a temporal signal, as has been found in the LEC (Tsao et al., 2018).

II.1.1 Evaluation Methods

We begin by defining a simple two-dimensional environment within which an artificial agent might move and act. The observations available to the agent within this environment will be the x and y coordinates of the agent's position, as well as the time that has passed since the beginning of the episode. While neither quantity is available as raw sensory information to an animal directly, both are known to be represented in the entorhinal cortex as a result of integrating sensory information over time. Specifically, Euclidean position is decodable from the spatial information represented in MEC in the form of grid cells (Hafting et al., 2005), and time information can be decoded from the LEC in the form of ramping cells (Tsao et al., 2018). The actions available to the agent will be movement in the northern, eastern, southern, and western directions by one unit per time step. While a simplification of actual animal action, this can be seen as corresponding to a simplified version of the animal's head-direction system, which exists in the subiculum and provides a global orientation input to the hippocampus (Taube et al., 1990).
In this environment, the positions the agent can occupy are discrete (as such it falls into the category of virtual environments typically referred to as a "gridworld," a term we will use throughout this work), and the size of the environment is 12 × 12 units, with walls the agent cannot move onto taking up the outer rim of units. As a result, there are 10 × 10, or 100, movable positions the agent can occupy. The observation the agent receives is then a vector of length 3, corresponding to <x, y, t>. Likewise, the agent will produce actions as a one-hot vector <n, n, n, n>, where n is 1 in the position corresponding to the current movement direction, and 0 elsewhere. See Figure 6 for a visual representation of the gridworld environment.

The model is trained in an offline fashion, with the data first collected from a series of random walk trajectories, each initializing the position of the agent in a randomized location in the environment. The random walk is based on a policy whereby either a new action is taken with uniform random probability, or with some probability the previous action is repeated. Each random walk lasts 50 time-steps, and 1000 of these trajectories, each referred to as an "episode," were collected.

Figure 6: The simple two-dimensional "gridworld" environment. The agent is represented by a red square. Walls are represented by blue squares. The agent's observation of x and y coordinates is represented as white arrows.

II.1.2 Modeling Methods

We then used the collected dataset to train a World Model (see Figure 2), as described by Ha and Schmidhuber (2018), with minor modifications. In the original implementation of the World Model, a gaussian distribution was used to represent the latent variable z. Here we compare this approach to two other candidate latent space types: a gumbel-softmax distribution, and a deterministic linear layer. The World Model can be broken into an inference and a generation phase, which alternate throughout each time step of an episode of training. Below are the explicit equations describing these phases. The inference phase is governed by the following equations:

z_t ∼ p(z_t | o_t)    (II.1)
h_{t+1} = f(h_t, a_t, z_t)    (II.2)

Where h_t corresponds to the hidden state of the recurrent neural network, and z_t refers to the inferred latent state. The sampling of z_t differs based on the distribution being used. In the gaussian case, it is sampled as follows:

z_t = μ(x_t) + σ(x_t) * ε    (II.3)

Where μ and σ are outputs from the encoder network, and ε is sampled from a normal distribution. In the case of a gumbel-softmax (GS) distribution, z is sampled as follows:

z_t = exp(log(x_t) + g) / Σ exp(log(x_t) + g)    (II.4)

Where g is sampled from the gumbel distribution, which consists of a transformation of a uniform random sample u between 0 and 1 as follows: g = −log(−log(u)). Once a latent variable z_t has been sampled, the generation phase then proceeds as follows:

z_{t+1} ∼ q(z_{t+1} | h_{t+1})    (II.5)
o^q_t = f(z_t)    (II.6)
o^q_{t+1} = f(z_{t+1})    (II.7)

We train the model using the same loss functions used in the original World Models paper, which include a reconstruction loss, a forward model loss, and a regularization loss:

L_O = (1/N) Σ_{n=1}^{N} |o^q_t − o_t|^2    (II.8)
L_Z = D_KL( p(z_t | o_t) || q(z_t | h_t) )    (II.9)
L_Total = L_O + L_Z − β H_s    (II.10)

Where H_s is the regularization term, which varies based on the distribution used, and β is the strength of the regularization.
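As a rough illustration of how these three losses combine into the training objective, consider the following sketch of the total loss for a single batch; the use of plain numpy, the helper names, and the batched array shapes are simplifying assumptions rather than the training code used for the experiments reported below. The two candidate forms of H_s correspond to the cases described in the next paragraph.

import numpy as np

def reconstruction_loss(o_pred, o_true):
    # L_O (eq. II.8): mean squared error between predicted and true observations.
    return np.mean(np.sum((o_pred - o_true) ** 2, axis=-1))

def kl_to_unit_gaussian(mu, log_var):
    # Candidate H_s for the gaussian case: KL divergence to a standard normal.
    return np.mean(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1))

def entropy(probs, eps=1e-8):
    # Candidate H_s for the gumbel-softmax case: entropy of the latent distribution.
    return np.mean(-np.sum(probs * np.log(probs + eps), axis=-1))

def total_loss(l_o, l_z, h_s, beta=0.01):
    # L_Total = L_O + L_Z - beta * H_s (eq. II.10); L_Z (eq. II.9) is assumed to be
    # computed elsewhere as the KL between the inferred and predicted latent distributions.
    return l_o + l_z - beta * h_s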
This regularization term is essential to the training of variational auto-encoders, as it enforces non-deterministic latent spaces, and as a result has the effect of inducing disentangled representations in the latent space (Higgins et al., 2016). In the case of the gaussian distribution, this term is the KL divergence between the current distribution and a normal distribution. In the case of the gumbel-softmax distribution, it is the entropy of the distribution.

II.1.3 Results

First, we trained a model using only the <x, y> components of the observation space. Using this dataset, we trained three separate models, each with a latent space size of 64, but each containing a different latent distribution type: gaussian, gumbel-softmax, and a deterministic linear layer. When comparing the learned latent spaces of these models, we find that only the GTM trained with the GS latent space learned a representation with place-like cells. This can be seen clearly in Figure 7, where the activation profiles of units in the GS model show extremely high spatial selectivity and little redundancy between units. In contrast, the spatial selectivity of the other models is non-coherent and highly redundant.

Figure 7: Representative activation patterns of the first 18 units in the latent variable z in world models trained using gumbel-softmax, gaussian, and deterministic latent distributions. Around each box are the walls of the environment which were not accessible to the agent.

When we then compared the reconstruction error of the three models trained using different latent distributions, we found a significant difference between all three (ANOVA, F(2, 3897) = 158.668, p < 0.0001), with the gumbel-softmax model (Mean = 0.010, Std = 0.017) resulting in the lowest reconstruction error, followed by the gaussian model (Mean = 0.021, Std = 0.024), followed by the model with a deterministic linear layer (Mean = 0.027, Std = 0.030). Pairwise comparisons result in highly significant differences between the three (all p < 0.001). These results are presented in Figure 8. These results suggest that, in addition to supporting the development of structured place-like cells, the gumbel-softmax distribution also results in an auto-encoder with better reconstruction accuracy for spatial information than a gaussian distribution or deterministic linear layer.

Figure 8: Reconstruction errors of three model types trained to auto-encode spatial observations. Error bars represent standard error.

In order to better understand the effect of the regularization term in the optimization process of the GTM with a gumbel-softmax latent space, we trained a set of four additional models, each using a different value for β. We chose the following set of values to provide a range with which to examine the effect: β ∈ <0.0, 0.01, 0.05, 0.1>. As described in Higgins et al., in the case of a VAE with a gaussian latent space there is a trade-off between reconstruction accuracy and disentanglement, which is governed by the magnitude of β. Here we seek to understand whether this holds for a VAE with a gumbel-softmax distribution as well. Specifically, we are interested in the extent to which the development of cells with place-like coverage of an environment can be connected to the principle of disentanglement described in (Higgins et al., 2016).
As can be seen in Figure 9, there is indeed a large impact of the strength of the regularization term on the resulting latent space. In the case of a large value of β, each variable in the latent space learns to represent a large portion of the environment. As the regularization strength is decreased, each unit represents a smaller part of the space. However, when no regularization is applied, the units learn indistinct and largely redundant activation patterns, suggesting that the regularization term does indeed induce disentanglement. In this case, β = 0.01 corresponds to the most place-cell-like latent space.

Figure 9: Example activation patterns for nine units of GTM with GS latent space models trained using different values of β (0.1, 0.05, 0.01, 0.0) for the regularization loss.

We then trained another set of three models using the same latent distributions, but taking as input only the <t> component of the observation space, to examine whether time-like cells would emerge from each of the three models. We find that cells with an affinity for specific offsets from the start of the episode emerge in the latent space of the gumbel-softmax model, but not the other two. Instead, in the gaussian and deterministic cases, a single cell learns to represent duration as a scalar value, and the rest are not sensitive to the input. Example latent space activations are presented in Figure 10. We can understand this as the 1D case of the place-cell development described above. This process also generalizes to high-dimensional observation spaces, as will be demonstrated in subsequent chapters.

Figure 10: Representative activation patterns of the 64 units in the latent variable z by time-step in world models trained using gumbel-softmax, gaussian, and deterministic latent distributions.

We can furthermore compare the quality of the reconstructions in the temporal observation case. We find that, as in the case with spatial observation data, there is a significant difference between the three models (F(2, 3897) = 70.885, p < 0.001), with the gumbel-softmax model (Mean = 0.004, Std = 0.009) significantly outperforming both the gaussian (Mean = 0.010, Std = 0.016) and the deterministic (Mean = 0.010, Std = 0.015) models in terms of reconstruction quality (p < 0.001). See Figure 11 for a graphic representation of these results. These results might be surprising, since in all three cases the model simply needs to learn to return the same original input value. Due to the complex non-linear transformations that are part of the variational auto-encoder architecture however, this task is not entirely trivial.

Figure 11: Reconstruction errors of three model types trained to auto-encode temporal observations. Error bars represent standard error.

Subsequent sections of this chapter will further explore the properties of a GTM trained using a gumbel-softmax distribution, with the next chapter exploring the usefulness of this representation for goal-driven navigation tasks.

II.2 Place-like Cells are Distributed based on Underlying Agent Behavior

So far, we have demonstrated that both place-like and time-like cells can come about within the latent space of a variational auto-encoder with a gumbel-softmax distribution.
To do so, we used a semi-random walk policy to collect the dataset used to train the model. While such a policy is a reasonable proxy for animal foraging behavior (Viswanathan, Da Luz, Raposo, & Stanley, 2011), it does not capture the just as prevalent behavior of goal-directed navigation, or any biased movement through the space. It is known, for example, that in rodents performing a goal-directed navigation task, the place cells in the hippocampus cluster near the goal location (Hollup et al., 2001). This suggests a behavioral impact on the structure and placement of place cells. Here we explore the extent to which different behavioral policies induce different place cell biases within the same environment in the generative temporal model introduced in the previous section. We find that the behavioral bias of the agent corresponds to a bias in activation preference for the induced latent units as well, consistent with what is found in animals.

II.2.1 Evaluation Methods

In order to test for the influence of the behavioral policy on the distribution of place-like cells, we developed five separate behavioral policies, four of which each have a movement bias for the north, east, south, and west directions, respectively. The fifth policy was the same as described in the previous section. In each of the biased policies there is a 50% probability that the biased action will be taken at each time-step, and a 50% probability that an action will instead be selected with uniform random probability. Figure 12 shows the action probability distributions for each of the five policies. For each policy, we collect 1000 episodes of 50 time-steps each, and train each model using the same hyperparameters described in the previous section.

Figure 12: Action distributions for each of the five policies (north biased, south biased, east biased, west biased, and unbiased).

II.2.2 Results

We find significant differences in the biases of the latent spaces induced by sets of observations generated from agents with different biased behavioral policies. These results can be seen clearly in Figure 13. In the case of each of the biased policies, there is a greater number of place cells in the region more likely to be visited by the behavioral policy than anywhere else. Statistical analysis reveals that there are indeed significant differences in the induced latent spaces of each of the different models. We analyze both the x and y bias in the models by taking the point in the environment of maximal sensitivity for each unit in the latent space, and performing an ANOVA to determine distributional differences. We find that there is a significant difference between the data types in both the x (F(4, 495) = 65.93, p < 0.001) and y directions (F(4, 495) = 69.04, p < 0.001). As would be expected from the qualitative results presented in the figure, we find that in the case of biases with respect to the y axis, there are significant differences between the south and north policies (p < 0.001), as well as between these policies and the other three (p < 0.001). There are no significant differences between the other three policies (p > 0.5). Likewise, when looking at biases with respect to the x axis, we find significant differences between the east and west policies (p < 0.001), as well as significant differences between each of these policies and the other three (p < 0.001), but no differences between the other three (p > 0.5).
These results suggest that there is indeed a clear bias in the preferences of the units in the learned latent space of the generative temporal model. This preference is biased towards parts of the state space of the environment which are more frequently visited, and which thus can benefit from greater representational capacity. Greater representational capacity then corresponds to finer sensitivity to small (potentially behaviorally relevant) changes in that region of the environment. This provides one potential explanation for the similar biases seen in the formation of place cells within the hippocampus of rodents (Hollup et al., 2001).

Figure 13: Activation patterns of latent units trained with a biased behavioral policy. A. Activation patterns of the first six units of the latent space of models trained with one of five different biased behavioral policies. B. Contour map of firing affinity for each of the five models.

II.3 Internally Generated Sequences and Auto-regressive Models

In the previous sections we described how place and time cell representations can come about in generative temporal models trained to perform a simple prediction task with spatial and temporal observations. An additional property of this class of models is their ability to be used in an auto-regressive manner once trained. Concretely, this means that rather than providing the z_t which was inferred from the current observation to the forward model, the z_t which the model generated at the previous time-step is used instead. If this process is continued, an entire trajectory of "imagined" observations can then be decoded from the sequence of latent states z. This process is sometimes referred to as performing a "rollout" or "unrolling" the model, because of the recursive nature of the procedure, and these terms will be used below to refer to this process.

In this section, we will demonstrate that this unrolling procedure, when performed on a fully-trained GTM, reliably produces coherent trajectories which can match the original sequences of observations fed into the model. This capability bears a strong resemblance to the phenomena of replay and preplay in the hippocampus (Foster, 2017). In both cases, sequences of latent states are spontaneously generated in a coherent trajectory in the absence of additional sensory input. Here we show that action sequences generated from the same policy used to infer the latent state can be used to "unroll" the model and generate coherent sequences of place-like cell activations that match those which would come about from exposing the model to the actual sequence of observations.

II.3.1 Evaluation Methods

In this section, we will use the same trained models from the previous section: three GTMs, each with a different latent space distribution. Instead of examining the representational quality of the z inferred from the observations, we will examine the quality of the predictions of the z generated by the forward model. We will utilize the same 2D gridworld environment described above, but examine solely the <x, y> component of the observation space. We will examine both the quantitative accuracy of each model when used in an auto-regressive manner to generate a trajectory of predicted observations, as well as perform a qualitative examination of the latent space representation during this rollout.
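For concreteness, the unrolling procedure itself can be sketched as follows; the callables encode, forward, prior, and decode stand in for the trained components of the GTM and are named here purely for illustration, not as the exact interfaces used in this work.

import numpy as np

def rollout(o_0, actions, encode, forward, prior, decode, h_0):
    # Auto-regressive rollout: only the first latent state is inferred from a real
    # observation; every later z is produced by the forward model alone.
    z = encode(o_0)                      # z_0 ~ p(z_0 | o_0)
    h = h_0
    imagined = []
    for a in actions:                    # action sequence from the same behavioral policy
        h = forward(h, a, z)             # h_{t+1} = f(h_t, a_t, z_t)
        z = prior(h)                     # z_{t+1} ~ q(z_{t+1} | h_{t+1})
        imagined.append(decode(z))       # decoded "imagined" observation
    return np.array(imagined)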
II.3.2 Results

We examine both the latent space representation during the rollout (z), as well as the resulting predicted observations (o*). In both cases, we find that they track their targets, suggesting that the model is indeed capable of learning the transition dynamics of the environment. Figure 14 displays the unit activations for the inferred z and the z generated via the auto-regressive rollout. We find that these two largely match one another, suggesting that the same sequence of actions results in the same activation pattern, regardless of whether observations are being inferred directly, or the activation is the result solely of the learned recurrent dynamics of the neural network.

Figure 14: Inferred and generated latent variables (latent unit by time-step) during a single trajectory.

While a correspondence between the latent representations is useful to know, the value of main interest is the reconstruction quality of the observations of the trajectory from the latent space. In Figure 15 we compare the original observations in the trajectory to their reconstructions from the inferred z variables, as well as to the predictions of the auto-regressive rollout. While there is some representational drift, we find that it is not catastrophic. When we quantitatively compare the reconstruction errors of the three models, we find that there is a significant difference in their capacity to reconstruct the observations from the latent space induced during the auto-regressive rollout (F(2, 3897) = 126.197, p < 0.001). We furthermore find that, as was the case in reconstruction from the inferred latent space, the model trained using a gumbel-softmax latent space shows the lowest level of reconstruction error of the three when reconstructing from the rollout latent space as well (p < 0.001).

Figure 15: Comparison between ground-truth observations (x and y position), their reconstructions from the inferred latent variable, and their reconstruction from the rollout of the generative model using a gumbel-softmax latent space.

Altogether, this suggests that GTMs are capable of both learning a meaningful latent space, as well as learning a coherent forward model of the environment dynamics, which retains this coherence even when unrolled in an entirely auto-regressive manner. Later, in Chapter III, we will demonstrate the application of this model unrolling in improving the learning process during a goal-directed navigation task using the Dyna algorithm (Sutton, 1991).

II.4 Generative Temporal Models Learn Temporal Community Structure

Thus far we have demonstrated that a GTM using a gumbel-softmax latent space is capable of learning representations which bear similarities to both place and time cells in the hippocampus. Furthermore, we have demonstrated that models with these cell types are useful for creating coherent trajectories of experience entirely in the latent space, similar to the phenomena of replay and preplay in the hippocampus (Foster, 2017). Beyond the place-like appearance of these units, and the ability to generate trajectories of them, it is of interest to know whether these learned representations in and of themselves display other known properties of hippocampal representations.
One property of interest is the temporal community structure which has been demonstrated in human hippocampal representations (Schapiro, Rogers, Cordova, Turk-Browne, & Botvinick, 2013). This structure results in sensory perceptions which are temporally more likely to co-occur being represented more similarly within the hippocampus, regardless of the underlying sensory similarity of the observations themselves. This can be thought of as a process of sensory decorrelation followed by temporal correlation.

Schapiro et al. demonstrated this phenomenon in humans exposed to a series of fractal images drawn from a random walk on a graph. They demonstrated that the hippocampal representations of these stimuli were best captured by their temporal structure, rather than by the properties of their visual appearance. This capability has been modeled using the successor representation (Stachenfeld et al., 2017), as well as simple feed-forward neural networks trained to perform a prediction task (Schapiro et al., 2013). Here we show that community structure comes about within a predictive model in the absence of any explicit successor learning, and in a model that is trained end-to-end to perform prediction from raw visual observations.

II.4.1 Evaluation Methods

In order to demonstrate learned community structure in the latent representations of GTMs, we utilize the same generative temporal model with a gumbel-softmax latent space described in the above section. We change, however, the environment being used. In the work of Schapiro et al. (2016) a series of fractal images were used as the stimuli, and rendered to the human participants according to a random walk along a graph structure. Here we use a similar series of fractal observation vectors as model input, and arrange them on a 2D graph structure similar to the environment described earlier in this chapter. Instead of an open field however, the states in this environment are arranged along a ring structure. Fractal images were generated using the inverse-Fourier method, using a β = 2.5 (for details on this method, see Bies, Boydston, Taylor, & Sereno, 2016). See Figure 16 for an image of the graph structure along with examples of the fractal images used as stimuli to train the model. As done in previous sections, we collect the dataset using 1000 semi-random walks through the environment of 50 steps each, and then separately train the model with the collected dataset.

Figure 16: Diagram of the graph environment. A. Graph structure used for the environment in the community experiments. Nodes indicate states, and edges indicate connections between states made possible by agent action. B. Examples of fractal images used as observations in each node of the graph.

II.4.2 Results

We trained a GTM with a gumbel-softmax latent distribution for 5000 iterations, and find that it is able to perform the reconstruction and prediction tasks highly accurately, with low reconstruction and rollout losses (Mean = 3.181, SE = 0.186; Mean = 7.999, SE = 0.475). See Figure 17 for example reconstruction images of a random trajectory through the graph environment, and note that the reconstructed observations and predicted observations match those of the true observations in structure.
Figure 17: Fractal rollout examples. Comparison between ground-truth fractal observations in a trajectory, their reconstructions from the inferred latent variable, and their reconstructions from the latent variable generated as part of the auto-regressive rollout.

Satisfied that our model is capable of generating coherent trajectories through this fractal graph state space, we can then turn our attention to the learned representations within this model. The first question of interest is what kinds of latent space representations have been learned from these non-visual observations. We find that in most cases the inferred z representation has learned to assign a single unit to each of the individual fractal images, resulting in a kind of 'place cell' representation, where each place is a single image. This kind of extremely sparse representation has a connection to the so-called "grandmother" cells found in the MTL (Quiroga, Reddy, Kreiman, Koch, & Fried, 2005). We can also interpret this process as pattern separation (Yassa & Stark, 2011), where each observation is encoded in a way orthogonal to the visual properties of the image. See Figure 18 for the activations of each of the 16 units in the latent space.

Figure 18: Latent space activations for each of the 16 units in the network.

We then turn our analysis to the learned latent representations. We perform multi-dimensional scaling on the inferred z, the generated z, and the hidden state of the recurrent network h. We find that while the two z representations are uncorrelated with the transition structure of the environment, the recurrent network hidden state h displays temporal community structure (Procrustes transformation results: Error(z) = 0.862, Error(h) = 0.109). The results of the multi-dimensional scaling procedure and Procrustes transformations are presented in Figure 19.

Figure 19: Multi-dimensional scaling of the latent representations of the learned model (MDS of the latent space z and of the RNN activations h) compared to the true underlying topography of the environment. The inferred latent space z shows no temporal community structure, while the hidden state of the forward model h does.

These results suggest a two-fold process in the generative temporal model. The first is that the inference from o to z involves a kind of pattern separation, where each stimulus is represented by a mutually exclusive set of representations, as seen in the activation profiles of the units presented in Figure 18. Secondly, the process of computing the forward function z_{t+1} ∼ p(z_{t+1} | a_t, h_t, z_t) involves a pattern completion process, whereby nearby states are represented more similarly in the h representation. While this was demonstrated in (Schapiro et al., 2013) using a simple feed-forward artificial neural network, the latent space z was pre-discretized in their experiments, and only z_{t+1} = f(z_t) was learned. Here we have modeled the same principle of learned temporal community, but in an end-to-end fashion, where the model receives as input the raw fractal stimuli.
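The comparison between the learned representations and the environment's true topography can be sketched as follows; the use of scikit-learn's MDS on a precomputed dissimilarity matrix together with scipy's Procrustes routine is an assumption about tooling, not necessarily the exact analysis pipeline used here.

import numpy as np
from sklearn.manifold import MDS
from scipy.spatial import procrustes

def embed_and_compare(reps, true_coords):
    # reps: (n_states, n_features) mean representation (z or h) for each node of the graph.
    # true_coords: (n_states, 2) positions of the nodes in the ring-shaped environment.
    dissim = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=-1)
    embedding = MDS(n_components=2, dissimilarity="precomputed",
                    random_state=0).fit_transform(dissim)
    # Procrustes aligns the 2D embedding to the true topography; the returned disparity
    # plays the role of the Error(.) values reported above (lower = closer match).
    _, _, disparity = procrustes(true_coords, embedding)
    return embedding, disparity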
II.5 Discussion

In this chapter we have demonstrated how a simple generative temporal model with a biologically-inspired latent distribution can capture a number of important properties of the hippocampal formation. These include the development of place-like and time-like cells with a behaviorally guided bias in their distribution, the ability to generate long coherent latent trajectories in the absence of ongoing observational input, the presence of learned representations which reflect environmental structure, and both pattern separation and completion in the inference and generation processes, respectively. Taken together, these findings suggest that GTMs with gumbel-softmax latent layers are a strong candidate model for some basic properties of hippocampal representation learning.

Our proposed model can be thought of as a kind of soft clustering whereby observations are probabilistically grouped into states (which we refer to here and elsewhere as "state cells"), reflective of the number of units in the latent space. This bears a similarity to the recently proposed model of hippocampal representation by Mok and Love (2019). In both cases, the hippocampus can be thought of as learning a low-dimensional latent space for abstract representations of the environment an animal is in. In many cases this information is temporal or spatial in nature (O'Keefe & Nadel, 1978), but it need not be, and can instead be information regarding other quantities.

There are of course many other models of place cell formation which have been proposed (Samsonovich & McNaughton, 1997; Erdem & Hasselmo, 2012; Whittington et al., 2019), but each of these makes specific assumptions regarding the structure of the observation space, or of the environment itself. While the model proposed in this chapter is simple in comparison to previous ones, its simplicity reflects a lack of strong assumptions about the nature of the observations, or about the structure of the environment from which they are drawn, required for the model to operate.

The soft clustering of the gumbel-softmax latent space used in the model has the additional effect of inducing a semi-discrete state space. By discretizing the high-dimensional observations in an environment, they can then be used downstream for performing goal-driven navigation using reinforcement learning. In the following chapter, we will explore the efficacy of using the latent space z of a GTM as the state space when performing goal-directed navigation tasks using reinforcement learning.

CHAPTER III
LATENT STATES AND GOAL-DIRECTED NAVIGATION

In the previous chapter we demonstrated that generative temporal models which utilize a gumbel-softmax latent distribution can reproduce the existence of place and time cells and display temporal community structure in those representations. Since they are generative models, GTMs can also be used to generate "imagined" trajectories of experience, thus drawing a useful connection to the replay and preplay phenomena found in the place cells of the hippocampus (Foster, 2017), and an even more specific connection to the "internally generated sequences" model of Pezzulo et al. (2014). Once general structured representations like the ones described above are learned, the natural next question is to ask what downstream tasks these representations might be useful for.

In this chapter, we demonstrate that these learned latent representations are a strong candidate for providing the state space basis functions upon which value functions and policies for goal-directed action can be built.
Reinforcement learning algorithms are a prime candidate for modeling such learning (Niv, 2009), and there are a number of reinforcement-learning based models of hippocampal-striatal learning. Here we focus on two specific algorithms of interest in the literature: the classic Actor-Critic algorithm, and the more recent Successor models of learning (O'Doherty et al., 2004; Stachenfeld et al., 2017). In both cases, we demonstrate that the learned latent space provided by the GTM can support the learning of optimal behavioral policies in a goal-driven navigation task more efficiently than other state spaces.

In addition to demonstrating the efficacy of the learned representations for basic reinforcement learning, we also demonstrate how this state space can be used in the context of fast-adaptation learning algorithms, where the goal location changes during the learning process. Rather than learning entirely from online experience, it is also possible to take advantage of our learned forward dynamics model to perform additional reinforcement learning updates using the Dyna algorithm (Sutton, 1991), one of the proposed models of hippocampal learning (Russek et al., 2017). We demonstrate that this results in faster learning compared to a fully online algorithm, and connect it to the replay phenomena using the model of internally generated sequence learning (Pezzulo et al., 2014).

III.1 State Cells for Actor-Critic Learning

In the previous chapter we demonstrated how place- and time-like cells, here referred to as "state cells," can naturally emerge from a specific kind of generative temporal model utilizing a gumbel-softmax latent distribution (GTM-GS). While the properties of these units are of interest in and of themselves, they are also of interest for their applicability to downstream tasks such as goal-directed navigation.

One key area to consider with respect to potential downstream tasks is the hippocampal-striatal axis, which is thought to be involved in memory-based decision making tasks (van der Meer, Johnson, Schmitzer-Torbert, & Redish, 2010). It is known, for example, that the hippocampus provides input to the striatum, and that during replay sequences place cell activations precede corresponding cell activations in the ventral striatum (Lansink et al., 2009). Furthermore, there is evidence that different sub-regions of the striatum are specialized for different aspects of conditional learning, with the ventral striatum involved in value estimation and the dorsal striatum in policy learning (O'Doherty et al., 2004). These two functions have been proposed to work together as part of an Actor-Critic learning system, a method derived from the reinforcement learning literature (Sutton & Barto, 2018). While the exact relationship between the dorsal and ventral striatum has been the topic of some debate, resulting in the actor-critic formulation being made more nuanced in recent years (Atallah, Lopez-Paniagua, Rudy, & O'Reilly, 2007; van der Meer et al., 2010), the underlying division, and its usefulness for capturing the main empirical findings, remains (Tessereau, O'Dea, Coombes, & Bast, 2020).

Here we demonstrate that the learned latent representations from the GTM model introduced in the previous chapter serve as a useful basis function for performing reinforcement learning using an actor-critic algorithm.
We compare these to a set of alternative basis functions, which we will demonstrate either are hand-generated using additional knowledge of the state space and perform well, or result in slower or failed learning. In contrast, the learned state space from the GTM-GS model is generated in a task-agnostic fashion, and still results in good performance on downstream navigation tasks.

III.1.1 Methods

We utilize the same simple two-dimensional environment described in Chapter II, with the same observation space of <x, y> spatial coordinates. We now introduce the additional concept of a goal within the environment. This goal can be located in any free location in the environment, and provides the agent a reward signal of r = 1 when the agent enters the same position as the goal. At this point, the episode is terminated, and the agent is returned to its starting position for the next episode. See Figure 20 for a diagram of this simple environment.

Figure 20: Diagram of the two-dimensional reinforcement learning environment with a single goal and a single agent (agent start position and goal position marked).

We are interested in understanding to what extent the learned representations of a GTM can be useful as a basis function for performing reinforcement learning. Rather than utilizing a complex Deep Neural Network (DNN) for our policy and value networks, we use simple linear functions which take the basis functions as input and compute π(a|s) (the policy) and V(s) (the value function), respectively. In these experiments we are interested in the quality of the learned representations for supporting reinforcement learning. The ability of a linear transformation to be sufficient to calculate an optimal policy and value function serves as a useful measure of this quality (Bellemare et al., 2019). As such, we avoid using any deep or multi-layer neural networks in these experiments, to prevent the models from simply learning sufficient intermediate representations from a poor basis function.

Given the generative temporal models discussed above, we have a number of choices for potential basis functions which could enable reinforcement learning in an actor-critic context. While there are many options, given that we are here concerned only with linear function approximation, we focus only on basis functions which appear relevant to this context, and compare four such functions. The first is the raw observation space itself, <x, y>. While this representation is simple, it is not clear that it is amenable to linear function approximation. We include it here for completeness. The second is the canonical "one-hot" state encoding (i.e. <1, 0, 0, ...> for the first state). Deriving this basis function requires knowledge of the total number of states in the environment, as well as a function for converting a given observation o into a state s. We know however that in the tabular case, which is what linear function learning reduces to with one-hot observations, algorithms such as actor-critic and Q-learning are guaranteed to converge (Sutton & Barto, 2018). We derive the third and fourth state spaces from the GTM-GS model, using the distribution z, and a discretized sample from the distribution, respectively.

For each of these basis functions, we utilize a simple linear actor-critic model, and train it using data collected in an online fashion. At each time-step, the agent receives an observation o from the environment, and uses it to compute a state s = f(o).
With this state, a value function V(s) and a sampled action a ∼ π(a|s) are computed using linear transformations from a set of learned weight matrices. The sampled action is then used to act in the environment, producing a new observation o* as well as a reward r. We train the model to maximize the discounted expected return R = Σ_{t=0}^{T} γ^t r_t using the following temporal difference update rules:

δ_t = r_t + γ V(s_{t+1}) − V(s_t)    (III.1)
V(s_t)′ = V(s_t) + α δ_t s_t    (III.2)
Q(s_t, a_t)′ = Q(s_t, a_t) + α δ_t s_t    (III.3)

Where α is the learning rate and γ is the discount factor. We set these to 0.25 and 0.99, respectively. Actions are sampled by transforming the Q(s, a) function into a categorical probability distribution using the softmax function, exp(log(x_t)/τ) / Σ exp(log(x_t)/τ), with the weighting adjusted using a temperature parameter τ, which we set to 0.01. We train each model for 200 episodes of either 150 time-steps, or the number of steps it takes to reach the goal, whichever comes first within the episode. We train all models with five different randomly selected initialization seeds.

III.1.2 Results

To assess performance, we examined the mean and median number of time-steps to reach the goal over the last 20 episodes in each of the five training runs per agent. The optimal policy in this task can reach the goal in 11 time-steps. We find that, as expected, the one-hot basis function results in an agent which consistently learns an optimal policy for navigating to the goal (Mean = 11.4, Median = 11). In contrast, the basis function consisting of the raw observations from the environment results in an agent which is never able to arrive at the goal in any of the five random initializations (Mean = 148.98, Median = 149). These two results provide the extremes of a canonically good and bad basis function.

Unlike the observation basis function, the two basis functions based on the learned latent space from the generative temporal model are in general able to support learning optimal policies, though not with the same level of consistency or performance as the optimal one-hot basis function. While the resulting agent learns an optimal policy in all trials (Mean = 11.06, Median = 11), the "GS-Dist" model, which utilizes the z softmax distribution, took significantly longer to converge than the one-hot encoding. Additionally, the "GS-Sample" model, which utilizes a discrete sample from the z latent space of the model, is able to learn as quickly as the one-hot basis function, but failed to converge in one of the five runs (Mean = 25, Median = 11). See Figure 21 for the learning curves associated with these results.

We also recorded the estimated value in each state of the environment from each model during learning. These value estimates are presented in Figure 22. As expected from the performance results presented above, the observation basis function fails to learn a coherent value map. In contrast, the value maps for the three successful basis functions all assign value to both the goal location, as well as the path leading from the agent start location to the goal.

Figure 21: Actor-Critic agent mean time-steps per episode for each basis function. Error bars represent standard error over five random initialization seeds.

Figure 22: Example value maps for agents trained using the different basis functions (Observation, One-hot, GS-Dist, and GS-Sample), shown for five random initialization seeds.
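A minimal sketch of this linear actor-critic update for a single transition is given below; the array layouts, parameter names, and the simplified form of the tempered softmax are illustrative assumptions rather than the exact implementation used in these experiments.

import numpy as np

def actor_critic_update(v_w, q_w, s, a, r, s_next, alpha=0.25, gamma=0.99):
    # v_w: (state_dim,) weights for V(s); q_w: (n_actions, state_dim) weights for Q(s, a).
    # s and s_next are basis-function vectors (raw observation, one-hot, or GS latent state).
    delta = r + gamma * (v_w @ s_next) - (v_w @ s)   # eq. III.1
    v_w = v_w + alpha * delta * s                    # eq. III.2
    q_w[a] = q_w[a] + alpha * delta * s              # eq. III.3
    return v_w, q_w

def sample_action(q_w, s, tau=0.01, rng=None):
    # Tempered softmax over Q(s, .); a simplified stand-in for the form given above.
    rng = rng or np.random.default_rng()
    logits = (q_w @ s) / tau
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    return rng.choice(len(probs), p=probs)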
III.2 State Cells for Successor Feature Learning

In the previous section we demonstrated that the learned latent space of a generative temporal model with a gumbel-softmax distribution can serve as a useful basis function for performing actor-critic learning. This was of interest due to the actor-critic model being a popular means of theoretically understanding the function of the ventral and dorsal striatum (O'Doherty et al., 2004), and the induced latent space in a GTM with a gumbel-softmax distribution bearing a strong similarity to hippocampal place cells.

Another model of interest for hippocampal-striatal learning is the successor feature algorithm (Barreto et al., 2017). In this case, rather than dividing the learning problem into one with an actor and a critic, the representation of the reward w(s) is dissociated from the representation of the environment dynamics ψ(s). This dissociation is useful because it allows for a decoupling of the learning process between the two quantities, with the result being that a model can be trained to quickly adapt to changes in either goal location (a change in w(s)), or to changes in environment structure or policy (a change in ψ(s)).

In terms of the biological realizability of this formulation, there is evidence that the "value" signal in the ventral striatum is relatively sparse, and as such could be better thought of as a reward representation r(s) rather than a value estimate V(s) in the traditional sense (van der Meer et al., 2010). In this case, the hippocampus would provide both the basis function s as well as the successor representation ψ(s). This would correspond to CA1 output from the hippocampus (Stachenfeld et al., 2017). The ventral striatum would provide w(s), and the dorsal striatum would take input from both and calculate the policy π(a|s).

III.2.1 Evaluation Methods

We utilized a slightly modified environment compared to the previous section in order to assess the ability of agents using successor models to adapt to goal position changes during learning. Rather than an open-field square environment, we utilize a T-shaped maze, in which the agent start location is at the far south end, and the goal is either in the north-west or north-east arm of the maze. At the beginning of a set of episodes, the goal is located at the north-east arm. The agent interacts with the environment for 150 time-steps per episode, and a total of 200 episodes.
At episode 100, the goal position is moved from the east arm to the west arm for the remaining episodes. See Figure 23 for a schematic of this simple experimental design.

Figure 23: Diagram of the experimental design for the successor learning experiment (left: episodes 0–99; right: episodes 100–199). The position of the goal changes halfway through the training process. Blue corresponds to walls which the agent cannot pass through. Red corresponds to the agent starting location. Green corresponds to the goal location.

III.2.2 Modeling Methods

We expect the actor-critic algorithm to perform poorly in a context in which the goal location rapidly changes during the training process. As such, agents in this experiment are trained using both the actor-critic algorithm as a baseline, and an algorithm based on the successor representation (Dayan, 1993). In this case, the quantities being learned are w(s′) and ψ(s, a), with the former corresponding to the learned reward function, and the latter corresponding to the learned successor representation. In both cases, the outputs are a linear function of the basis function s, which serves as input to the model. The reward function is updated using a simple learning rule as follows:

δ_w = r_t − w(s)    (III.4)
w(s)′ = w(s) + α_w δ_w    (III.5)

Where α_w corresponds to the reward learning rate. We set this to α_w = 1 in our experiments to encourage fast adaptation to changing reward locations. The update rule for the successor representation follows a familiar temporal-difference learning rule, with the state representation rather than the value being the propagated quantity:

δ_ψ = s_t + γ ψ(s_{t+1}, a_max) − ψ(s_t, a_t)    (III.6)
ψ(s_t, a_t)′ = ψ(s_t, a_t) + α_ψ δ_ψ    (III.7)

Where α_ψ corresponds to the successor learning rate. We set this to α_ψ = 0.2. Here a_max corresponds to the action with the highest expected value, derived from the value function Q(s, a) = ψ(s, a) * w(s)^T. This equation is also used to arrive at the policy, where we convert the Q function into a categorical distribution using a softmax function, with a temperature of τ = 0.01.

Due to the nature of the successor representation, only certain state representations are useful as the basis function for computing ψ. In particular, state representations with continuous values (such as a gaussian latent space) cannot be accumulated using the above equations without changing their underlying meaning. As such, we only compare the one-hot state representation to the "GS-Sample" representation, which is also discrete. Unlike the open-field environment which we used in previous experiments, the T-Maze contains additional structure in the space of the environment. In order to allow our model to learn from this structure, we utilize a more complex observation space for the GTM-GS model. In addition to the <x, y> components of the vector, we also include <n, s, e, w> components, which each provide a normalized distance of the agent from the nearest wall in each of the four cardinal directions. These can be thought of as corresponding roughly to the activation properties of boundary cells (Lever et al., 2009). Together, the observation space is a vector of length 6, and we train a GTM-GS with a latent space of size 100.
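To make the successor updates concrete, the following is a minimal tabular sketch of equations III.4–III.7, assuming discrete states (as with the one-hot or "GS-Sample" basis functions) so that a state can be used directly as an index; the array layout and names are illustrative assumptions rather than the exact implementation used here.

import numpy as np

def sr_update(w, psi, s, a, r, s_next, a_max, alpha_w=1.0, alpha_psi=0.2, gamma=0.99):
    # w:   (n_states,)                      learned reward representation w(s)
    # psi: (n_states, n_actions, n_states)  successor representation psi(s, a)
    # s, s_next, a, a_max are integer indices into those arrays.
    phi = np.eye(w.shape[0])[s]                                  # one-hot feature vector for s
    delta_w = r - w[s]                                           # eq. III.4
    w[s] = w[s] + alpha_w * delta_w                              # eq. III.5
    delta_psi = phi + gamma * psi[s_next, a_max] - psi[s, a]     # eq. III.6
    psi[s, a] = psi[s, a] + alpha_psi * delta_psi                # eq. III.7
    return w, psi

def q_values(w, psi, s):
    # Q(s, a) = psi(s, a) . w, used both to select a_max and to derive the softmax policy.
    return psi[s] @ w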
III.2.3 Results

As expected, we find that in general the actor-critic models fail to learn an optimal policy after the goal change at episode 100, while the successor models are able to adapt to the change. In the case of the successor models, we find that both the learned latent state space and the one-hot state space are able to serve as a basis function for an agent which learns an optimal policy for the task. As in the previous environment, an optimal policy requires 11 time-steps to reach the goal location. Both the one-hot (Mean = 12.07, Median = 11) and "GS-Sample" (Mean = 11, Median = 11) agents learn policies which reach this level by the final 20 episodes of the learning session. See Figure 24 for a visual presentation of these results.

Furthermore, we find that the learned state space results in agents which are able to adapt even more quickly to the change in goal location than the baseline agents. One potential reason for this is the distribution of states in the learned space. Whereas the one-hot encoding results in a completely uniform covering, the learned representation is biased by the states the agent encounters, with more representational resources devoted to certain parts of the space than others. This results in a propagation of value information in a potentially more efficient manner.

Figure 24: Mean time-steps per episode for the two state space representations using either a successor representation or actor-critic learning algorithm. The goal location changes at episode 100. Error bars represent standard error over five random initialization seeds.

III.3 Fast Convergence with Successor Similarity Learning

In the previous section, we demonstrated that a successor-based agent using the latent space of a GTM-GS model can quickly adapt to changes in goal location during the learning process. A limitation of this model however is the need for a specific kind of state space in order for successor learning as described in (Dayan, 1993) and (Barreto et al., 2017) to perform well. This limitation comes from the fact that the reward and value functions must be linear functions in the state s and successor ψ(s) spaces. This excludes the use of the gumbel-softmax distribution itself as a basis function, since it violates this requirement. As such, in the previous experiments we used a discretized sample from the latent space, "GS-Sample," which effectively removes much of the useful information about the spread of a given state. This additional information can be interpreted as the model's probabilistic belief state about the agent's true position in the world. We hypothesize that utilizing this extra information when computing and updating the value function would lead to faster convergence than learning exclusively from samples from the distribution.

Here we propose a modified version of the successor learning algorithm which allows for the use of successor features without the strict linear function requirement, thus expanding the class of usable state space representations. We demonstrate that this algorithm enables much more rapid learning in a goal-directed navigation task by taking advantage of the full state information present in the latent space of the GTM-GS model. We do this by replacing the linear functions with cosine similarity computations, and as such refer to this new algorithm as Successor Similarity Learning (SSL).
III.3.1 Evaluation Methods

In order to examine the efficacy of the proposed SSL algorithm, we use the same T-Maze environment presented in the previous section, but restrict the number of episodes from 200 to 100, and the number of time-steps per episode from 150 to 100. Both of these changes were made in order to provide a more challenging test of learning performance for the agents. We compare the traditional SR algorithm to our proposed SSL algorithm, using both the learned basis functions and the pre-computed one-hot basis functions.

III.3.2 Modeling Methods

In order to arrive at the SSL algorithm, we make a few important changes to the traditional successor representation learning algorithm. First, in order to enable continuous-valued probabilistic basis functions, we replace the dot product with a cosine similarity metric to compute the reward function, r = cos(w(s), s), and the value function, V(s) = cos(w(s), ψ(s)). The cosine similarity between two vectors is defined as follows: cos(A, B) = (A · B) / (‖A‖ ‖B‖). This has the property of ensuring that the reward and value functions are always bounded between 0 and 1, as long as the two vectors are positively valued. In addition, this allows us to bypass the requirement that these functions be linear combinations of the underlying quantities being compared. As such, we can take advantage of the additional information in the "GS-Dist" state space for learning. We also use a modified update rule for the reward function, which sets w = s if r = 1, and w = 0 if w = s and r = 0. This effectively acts to cache the most recent rewarding state, and to use it as the reference against which incoming successor states ψ(s) are compared when determining the value V(s).

The result of these changes is that the reward and value functions now take on slightly different semantic meanings than in the case of classic successor learning. The reward function becomes a measure of how similar the current state is to the last known rewarding state. The value function becomes a measure of how likely the current state is to lead to a state like the last known rewarding state. We refer to this algorithm as Successor Similarity Learning (SSL).
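A minimal sketch of a single SSL step, combining the cosine-similarity value computation with the caching update rule for w, is given below; the variable names and the exact handling of the reward signal are illustrative assumptions rather than the precise implementation used here.

import numpy as np

def cosine(a, b, eps=1e-8):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def ssl_step(w, s, psi_s, r):
    # s:     continuous "GS-Dist" state vector for the current step.
    # psi_s: successor features psi(s) for that state under the current policy.
    reward_sim = cosine(w, s)        # similarity of the current state to the last rewarding state
    value = cosine(w, psi_s)         # V(s) = cos(w(s), psi(s))
    if r == 1:                       # cache the most recent rewarding state: w = s
        w = s.copy()
    elif np.allclose(w, s):          # a previously rewarding state is no longer rewarded: w = 0
        w = np.zeros_like(w)
    return w, reward_sim, value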
In fact, we find that the performance curves for the SSL variants of the “GS-Sample” (Mean = 34.75, Median = 16.95) and “Onehot” (Mean = 16.19, Median = 14.75) state spaces are extremely similar to those of the agents trained using the SR algorithm, supporting our intuition that the benefit of SSL comes from its support for a more expressive state space. See Figure 26 for the respective learning curves.

Figure 26: Mean time-steps per-episode for SSL based learning algorithms with different basis functions. Error bars represent standard error over five random initialization seeds.

III.4 Rollouts, Replay, and Dyna Learning

Thus far, we have demonstrated that the inferred latent space of a generative temporal model serves as a useful state space for performing various kinds of reinforcement learning. By utilizing only the inferred latent space, however, we are in effect throwing away half of the trained GTM, since in doing so we are ignoring the forward model and the trajectories through the latent space which it can generate. A learned forward model has the potential to serve an additional purpose in the context of reinforcement learning, since it provides a model of the world which can be used to more rapidly train our value function and policy. The utilization of a learned model for this purpose in reinforcement learning is referred to as Dyna (Sutton, 1991). It has been shown to speed up learning in a number of contexts (Peng & Williams, 1993), including in biologically plausible learning using successor representations (Russek et al., 2017).

The natural analog to Dyna in the mammalian brain is the phenomenon of hippocampal replay. In both cases, sequences of experiences are “replayed” for the purpose of learning. In the case of hippocampal replay, this has traditionally been interpreted as serving largely a memory consolidation function (Foster, 2017). However, replay events are not random, and often involve trajectories to known goals (Pfeiffer & Foster, 2013). Additionally, the presence of replay events during rest has been shown to correlate with better navigational task performance (Momennejad et al., 2018). These empirical results suggest that, in addition to supporting memory consolidation, there is also a significant behavior-learning component involved in replay, consistent with the role of Dyna in reinforcement learning algorithms.

In the following experiment, we build on the results in the earlier chapter showing that auto-regressively unrolling the forward model of a GTM results in the generation of a coherent trajectory of experiences. Here we show that periodically unrolling the model auto-regressively and using the resulting pairs of latent states to update a successor representation can lead to more rapid goal-directed navigation learning than learning purely online from real experiences.

III.4.1 Evaluation Methods

In order to demonstrate the effectiveness of augmenting the online learning process with Dyna, we build on the previous navigation experiments in which successor-based agents navigated a gridworld. Given the efficacy of the SSL algorithm introduced earlier in this chapter, we conduct these experiments using this algorithm. Here we compare multiple SSL agents, each utilizing the “GS-Dist” state space.
One of the trained models updates in an online fashion, as described above, and the others update using both online experiences as well as different length trajectories (5, 10, and 20 steps) of “imagined” experiences produced by unrolling the GTM. To better test the usefulness of Dyna learning, and to make use of our more efficient successor learning algorithm, we also introduce a larger circular environment of size 21 × 21 which requires greater exploration on the part of the agent in order to arrive at the goal location than the previous environments. See Figure 27 for a diagram of this circular environment.

Figure 27: A large circular gridworld environment used to compare performance of purely online and Dyna-assisted learning.

III.4.2 Modeling Methods

In order to analyze the effectiveness of the Dyna procedure, we vary the length of the trajectories unrolled by the model. We hypothesize that the maximum benefit from Dyna will occur at an intermediate trajectory length, since shorter trajectories may not provide much additional information, while longer trajectories may provide a corrupted learning signal due to the accumulation of errors in the unrolling process. For the agent which uses Dyna, at each time-step there is some probability that an imagined trajectory will be initiated. Once initiated, the trajectory unrolls for a fixed number of time-steps, or for as long as it takes the agent to imagine it has reached the goal location, whichever comes first. During updates within the unrolling, only ψ(s, a) is updated; w(s) is fixed, and is used to determine the presence of an imagined goal, to enable the computation of Q(s, a), and to guide the policy used during the imagined trajectory. We compare agents utilizing Dyna trajectories with a 20% probability of being activated at each time-step, and unrolling the trajectory for either 5, 10, or 20 imagined time-steps.
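The sketch below illustrates this imagined-rollout procedure in simplified form. The successor-feature table, forward model, and policy interfaces are illustrative assumptions; in the experiments the forward model is the trained GTM and the state features are its “GS-Dist” latent vectors.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    # Cosine similarity between two non-negative vectors, bounded in [0, 1].
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def maybe_dyna_rollout(i, phi, psi, w, forward_model, policy,
                       p_start=0.2, max_len=10, alpha=0.1, gamma=0.95, goal_sim=0.99):
    """Possibly perform one imagined (Dyna) rollout starting from state index i.

    phi[i]        : feature vector of state i (e.g. its "GS-Dist" belief vector)
    psi[(i, a)]   : successor features for the state-action pair, updated in place
    w             : cached rewarding-state features, held fixed during the rollout
    forward_model : callable (i, a) -> imagined next state index (stands in for the GTM)
    policy        : callable (i) -> action, e.g. greedy over cosine-similarity Q-values
    """
    if np.random.random() > p_start:          # a rollout starts with probability p_start
        return
    for _ in range(max_len):
        a = policy(i)
        j = forward_model(i, a)
        # Only the successor features are updated from imagined transitions.
        psi[(i, a)] += alpha * (phi[i] + gamma * psi[(j, policy(j))] - psi[(i, a)])
        if cosine(w, phi[j]) >= goal_sim:     # stop if the agent imagines reaching the goal
            break
        i = j
```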
III.4.3 Results

We find that in all cases agents using the SSL algorithm with the GS-Dist state space are able to learn the navigation task within 100 episodes. Furthermore, we find that augmenting the online successor representation learning algorithm with an offline Dyna component, enabled by unrolling the GTM, does indeed lead to consistently faster learning on the task, producing a nearly threefold decrease in learning time. Optimal performance in this task involves 17 time-steps from the agent start position to the goal. On all five runs, the agents using Dyna were able to learn to solve the navigation task optimally by the final episodes (Dyna-5: Mean = 17.7, Median = 17.05; Dyna-10: Mean = 17.04, Median = 17; Dyna-20: Mean = 17.02, Median = 17). In contrast, the agents without Dyna learned more slowly, and less consistently (Mean = 18.6, Median = 17.3). See Figure 28 for a visual presentation of these results.

Comparing the number of episodes required to learn an optimal policy, we find that the Dyna-10 model, which used trajectories of length 10 when performing Dyna, resulted in the fastest learning, with all five seeds converging to an optimal policy in less than 30 episodes each. In contrast, the Dyna-5 and Dyna-20 agents took over 40 episodes to converge, while the agents without any Dyna updates took over 70 episodes before all five agents converged to the optimal policy. These results confirm that Dyna can indeed greatly accelerate learning in a navigation task. It offers one strong possibility for explaining how animals are able to learn navigation tasks in only a small number of exposures to the environment or goal (few-shot learning).

Figure 28: Mean time-steps per-episode for a fully online learning algorithm, and an online algorithm augmented with various rollout lengths of Dyna. Error bars represent standard error over five random initialization seeds.

III.5 Discussion

One important question about the learned representations of the hippocampus concerns their application to downstream tasks such as spatial navigation. In this chapter we have demonstrated that the latent space learned by a GTM-GS model can serve as a powerful state space basis function for performing different kinds of biologically plausible reinforcement learning. We demonstrated the efficacy of these models using two canonical reinforcement learning algorithms thought to be biologically plausible: the actor-critic algorithm of striatal learning (O’Doherty et al., 2004) and successor representation algorithms (Dayan, 1993; Stachenfeld et al., 2017). In both cases, we demonstrated that the learned latent space is competitive with a pre-computed discretized latent space in terms of algorithm performance when training an agent to perform goal-driven navigation tasks.

Beyond online model-free reinforcement learning, the forward model of the GTM provides a means of performing additional “imagined” learning using the Dyna algorithm, which we have demonstrated decreases convergence time. In addition to providing empirical benefits, Dyna is closely related to the internally generated sequences of place cell activations observed in the hippocampus of animals at various times (Foster, 2017; Pezzulo et al., 2017). It has been hypothesized and theoretically demonstrated that this replay behavior serves to aid learning (Russek et al., 2017), and here we provide additional theoretical evidence that this is indeed the case. The interpretation of replay and preplay within a Dyna framework is also just one of many possibilities. Replay has also been theoretically modeled as part of an explicit model-based planning scheme (Erdem & Hasselmo, 2012), rather than as an augmentation to model-free learning as is done in Dyna.

Our work in this chapter can be seen to complement that of Russek et al. (2017). However, like the results presented in Chapter II related to Schapiro et al. (2013), here we present results which build on previous work but extend it to an end-to-end model. Whereas Russek et al. used exclusively a “one-hot” encoded state space, here we demonstrate that a state space learned from raw observations can be used for successor and actor-critic learning. In subsequent chapters, we will further extend this principle of demonstrating our findings in more ecologically valid settings, as we move from simple observation spaces and Euclidean environments to high-dimensional, visually realistic observations drawn from naturalistic fractal environments.
CHAPTER IV
CONTENT GENERALIZATION AND DUAL STREAM WORLD MODELS

In the previous chapters, we demonstrated that a simple generative temporal model can be used to learn a structured latent space which both displays a number of properties of hippocampal cells and serves as a useful basis function for performing downstream navigation tasks. Despite the demonstrated capabilities of this model, it is limited as a convincing model of the medial temporal lobe in a number of important ways. Firstly, all of the observations used were relatively low-dimensional, and in many experiments already contained explicitly provided spatial information. Secondly, we trained a single model per environment, and demonstrated no capacity for generalization between environments. Thirdly, the perspective of the agent’s observations and actions was allocentric, as opposed to egocentric, which is the reference frame that embodied mammals actually utilize.

In this chapter, we seek to extend our generative temporal model in a number of important ways in order to achieve content generalization: the ability to adapt to changes in the content of an environment while the structure remains the same. In order to address this important capacity, we turn back to our original intention set out in the introduction, which was to provide a full model of the medial temporal lobe, taking inspiration from what we have referred to as the “language metaphor.” If we were to interpret the previous model from the perspective of this metaphor, we would say that the simple generative temporal model described in the previous chapters learns something akin to a highly pictographic language, in which the signifier and the signified are intermingled. From the perspective of memory and navigation, this corresponds to the what (content) and where (context) information being effectively fused into a single z representation. In the case where there is only low-dimensional spatial or temporal information in a signal, this is not an issue, since the fused representation reduces to a mostly where-based representation. Likewise, in cases where there is only a single environment with a fixed structure and set of objects of interest, a “fused” model such as the one described above could be considered sufficient.

Of course, animals skillfully navigate not just one fixed environment, but any number of environments, which might vary in content and structure over time. They also sense the world through a series of sensory organs which provide a high-dimensional information signal. Issues for a simple generative temporal model arise when the underlying environment and observations we are attempting to learn are higher-dimensional, contain both non-spatial and spatial information, or vary over time. A canonical example of this situation is everyday egocentric narrative experience. In such cases, we take a series of actions and experience a series of things in different places at different times. Modeling each moment of this stream using a single z, and then attempting to learn a forward model of these dynamics, becomes an extremely daunting task, especially when we would like the same model to make sense both of the experience of making breakfast at home and of the experience of making breakfast at a friend’s house. In this chapter we will introduce and validate a novel generative temporal model which we refer to as a Dual Stream World Model (DSWM).
The main contribution of this model is that, like other recent biologically inspired GTMs such as the GTM-SM (Fraccaro et al., 2018), MBP (Wayne et al., 2018), and TEM (Whittington et al., 2019), it utilizes both a differentiable memory store and a separation of what and where variables. Unlike each of these other models, it does so using general-purpose neural network building blocks, which allow it to learn the dynamics of a variety of different environments with observation spaces ranging from simple vectors to high-dimensional, visually realistic egocentric observations.

Concretely, this involves splitting the formerly “fused” latent state space into separate “definition” z and “word” s representations. These two representations are then used together in a differentiable neural dictionary to enable storage and retrieval of experiences within an episode of learning. Instead of learning a dynamics model over both representations, we only learn the dynamics over the “words” s, which are inherently lower-dimensional and simpler to model. As we will show, this also enables generalization between environments with the same structure, but different objects or content within them. Taken together, this model can be seen as a complete implementation of the “language metaphor” and of the experience construction system described by Hassabis and Maguire (2009).

In this chapter, we will introduce the Dual Stream World Model (DSWM) and demonstrate how the separation of the latent space into a learned ‘what’ component z and a learned ‘where’ component s allows the model to learn the dynamics of complex environments with high-dimensional observations, and how this enables generalization between environments with similar structure. We will then demonstrate how the learned s latent space is a useful low-dimensional state space for performing goal-directed navigation. Next, we demonstrate how this also allows for learning in egocentric observation spaces, and how the learned state space s in egocentric environments can also be used for performing goal-directed navigation in a visually complex 3D environment.

IV.1 Learning Content Agnostic Latent Representations

In order to extend the generative temporal model introduced previously, we make two main additions. The first is to split the single encoding stream into two separate streams, each encoding the incoming observations. As such, instead of a single latent space z, DSWM uses two latent spaces, z and s, with the former representing ‘what’ information and the latter representing ‘where’ information. This has a direct connection to the LEC and MEC regions of the MTL, which are hypothesized to convey content and context information downstream into the hippocampus (Deshmukh & Knierim, 2011; Hafting et al., 2005). The second addition to the generative model is a mechanism by which this content and context information can be bound together and later separated in order to enable storage and retrieval of experiences within an episode for an agent. Here we use a simple differentiable neural dictionary (DND) module within the neural network (Pritzel et al., 2017). This DND is used to store and retrieve ‘what’ variables z using the ‘where’ variables s as the lookup keys. The DND consists of a list of these s, z pairs. The DSWM also includes a forward model which is trained to learn the transition dynamics of only the ‘where’ variable: s_{t+1} ∼ p(s_{t+1} | s_t, h_t, a_t).
Doing so allows us to use different distributions for z and s, which can vary in both size and kind of distribution. We can also use different loss functions to train these two kinds of latent spaces. See Figure 29 for a diagram of the complete DSWM and its three main components: a content and context encoder, a context forward model, and an associative look-up dictionary.

Figure 29: Diagram of the Dual Stream World Model. Blue represents content information. Red represents context information. Purple represents joint content and context information. White represents model inputs. Green represents model outputs. Nodes marked with a ∗ indicate information at the next time step of the simulation.

Key to the success of this model is that we can allocate representational capacity differently between the two latent variables. In the case of high-dimensional observations such as visual information, it is desirable to allocate a larger representational capacity to z. We can do so while maintaining a lower-dimensional latent space s which reflects the lower-dimensional transition dynamics of the environment. Consider, for example, a human walking around a one-block park containing a few trees, sidewalks, and benches. While the sensory experience at any given time might be extremely rich, and require a complex latent space to represent, encoding one’s location within the park is relatively straightforward and requires significantly less representational capacity. Furthermore, the transition dynamics governing one’s abstract location are much simpler than those governing exactly what one might see next after turning 90 degrees to the right, for example.

Key to this dissociation are the loss functions used to train each of the latent spaces of the DSWM. Here we use the same loss function for the z space as before: a simple reconstruction loss paired with a regularization term to promote the disentanglement of representations. There are multiple candidate losses which can be used to train s. Here we choose the ability to decode the position and orientation of the agent within the environment as the loss function used to train the s representation. This model can be seen as an instantiation of the memory indexing theory of Teyler and DiScenna (1986). In this case, z represents the state of the cortex, and s serves as an index for that state. Rather than hand-designing an index, we learn the index using a latent space which contains sufficient statistics from which spatial information about the agent’s location within an environment can be derived. Importantly, while we use spatial information as a training signal to the model, this information is not available at test time.

In this section we will demonstrate that the learned index s shows similarity to place cells when trained in a series of maze environments with higher-dimensional visual observations. We will demonstrate that a DSWM outperforms a single stream world model in a trajectory prediction task when the agent is exposed to environments with novel visual properties (content information). In addition, we will demonstrate that all of the relevant properties of the generative temporal models discussed in Chapter II have been retained in the DSWM.
IV.1.1 Evaluation Methods

In order to better test the capabilities of the DSWM, we use a new set of environments with more complex topographic structure, higher-dimensional observations, and greater variability in appearance. As in previous chapters, each environment is instantiated as a 2D gridworld in which the agent can move in the four cardinal directions, but cannot move through walls. Each environment is composed of 11 × 11 units. Instead of a simple observation space of spatial coordinates or distances from walls, here we use images drawn from a sliding window over a larger visual pattern map juxtaposed on the environment. These “pattern maps” are generated by randomly selecting either a green or red pixel to be placed in each unit of the environment that does not contain a wall. This can be thought of as akin to changing the wallpaper or carpets within the same floor of a building: the content changes, but the structure remains the same. In order to derive an observation, the agent is provided with a 5 × 5 unit window around its current location, which displays the content of the pattern map as well as the location of any walls within the environment, represented as black squares. Each observation is presented to the agent as a 5 × 5 × 3 image.

We use environments with four different topographies. These consist of an open area (OpenMaze), an environment with four connected rooms (RoomsMaze), an environment with a symmetrical obstacle in the middle (RingMaze), and an environment with four symmetrical obstacles (HallwayMaze). For each of these topographies, we generate 1000 different pattern maps to provide a variety of different objects for the agent to observe. See Figure 30 for examples of these environment topographies, the pattern maps, and the derived observations.

Figure 30: Four variable content environments with different topographies. Left: environment topography; blue corresponds to walls, red corresponds to agent position. Middle: randomly generated pattern image used to derive observations based on agent location. Right: agent observations provide a 5 × 5 window around the agent position.

The datasets used to train each model were collected by running a semi-random behavioral policy for 1000 episodes of 50 steps each. We create four different datasets, one for each unique topography, and randomly select one of the 1000 pattern maps to use for each episode.

IV.1.2 Modeling Methods

The DSWM consists of four main components: a content auto-encoder, a context encoder, a forward model, and a differentiable neural dictionary. Concretely, we utilize a variational encoder with a gumbel-softmax distribution for both the context and content components (Jang et al., 2016). For the forward model, we utilize the same gated recurrent unit (GRU) from the simpler GTM, and use as input both the latent ‘where’ state s and the current action a. The differentiable neural dictionary (DND) is similar to that used by Pritzel et al., and uses the latent context variables as keys and the latent content variables as values. The lookup process uses cosine similarity between a query key and the stored keys to determine a similarity score. The top five stored values are then weighted by their similarity scores using a softmax function to derive the retrieved z.
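A minimal sketch of this kind of key-value store is given below. The top-5 read and cosine-similarity scoring follow the description above; the class structure and the softmax temperature are illustrative assumptions, not the exact implementation used in the DSWM.

```python
import numpy as np

class DND:
    """Minimal differentiable-neural-dictionary-style key-value store.

    Keys are 'where' vectors s and values are 'what' vectors z; a read returns a
    softmax-weighted blend of the values whose keys are most similar to the query.
    """

    def __init__(self, k=5, temperature=1.0):
        self.keys, self.values = [], []
        self.k, self.temperature = k, temperature

    def write(self, s, z):
        # Store one (key, value) pair for the current time-step.
        self.keys.append(np.asarray(s, dtype=float))
        self.values.append(np.asarray(z, dtype=float))

    def read(self, query):
        keys = np.stack(self.keys)
        values = np.stack(self.values)
        # Cosine similarity between the query key and every stored key.
        sims = keys @ query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-8)
        top = np.argsort(sims)[-self.k:]
        # Softmax-weight the top-k stored values by their similarity scores.
        w = np.exp(sims[top] / self.temperature)
        w /= w.sum()
        return (w[:, None] * values[top]).sum(axis=0)
```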
For any given time-step of the simulation, the following series of steps takes place. First, a new observation o_t is received from the environment. That observation is used to infer the latent ‘where’ variable s_t and ‘what’ variable z_t. The inferred s_t and z_t are then stored together as a key-value pair in the DND M_t. The forward model is then unrolled, using both the next action a_t the agent takes and the current inferred ‘where’ variable s_t, to produce a new ‘where’ variable s_{t+1}, which is used to query the memory and read a new ‘what’ variable z_{t+1}, which is in turn decoded into a predicted observation o_{t+1}. This process is described in Figure 29. Concretely, this corresponds to an inference phase and a generation phase, described below.

Inference phase:

z_t \sim p_{enc}(z_t \mid o_t) (IV.1)
s_t \sim p_{enc}(s_t \mid o_t) (IV.2)
M_t = f_{write}(M_{t-1}, s_t, z_t) (IV.3)
h_t = f_{forward}(s_t, a_t, h_{t-1}) (IV.4)

Generation phase:

s_{t+1} \sim q_{forward}(s_{t+1} \mid s_t, a_t, h_t) (IV.5)
z_{t+1} \sim q_{read}(z_{t+1} \mid M_t, s_{t+1}) (IV.6)
o_{t+1} = f_{decode}(z_{t+1}) (IV.7)

The model is then trained to minimize four objectives. Content reconstruction error: mean squared error between original and predicted observations. Spatial information decoding: mean squared error between true and predicted position, along with KL divergence between predicted and true orientation, where applicable. Sequence coherence: KL divergence between inferred and generated ‘where’ variables. Latent variable regularization: the negative entropy of the ‘what’ and ‘where’ variable distributions, which acts as a regularization term.

L_{Obs} = \frac{1}{N} \sum_{n=1}^{N} \left| o_t^{q} - o_t^{p} \right|^2 (IV.8)
L_{Pos} = \frac{1}{N} \sum_{n=1}^{N} \left| pos_t^{q} - pos_t^{p} \right|^2 (IV.9)
L_{Ori} = D_{KL}\left( p(ori_t \mid o_t) \,\|\, q(ori_t \mid s_t) \right) (IV.10)
L_{S} = D_{KL}\left( p(s_t \mid o_t, s_{t-1}) \,\|\, q(s_{t+1} \mid s_t, a_t) \right) (IV.11)
L_{Total} = L_{Obs} + L_{Pos} + L_{Ori} + L_{S} - \beta_s H(s) - \beta_z H(z) (IV.12)

In the DSWM, we compose the z latent space using eight gumbel-softmax distributions of size 16 each, for a total of 128 units. We compose the s latent space using a single gumbel-softmax distribution of size 49. In the WORLD baseline models (referred to as GTM-SM in previous chapters), we use the same size latent space for z. In both model types we use 256 units for the GRU hidden layer. We train each model using mini-batches of three trajectories, each of length 50, for 10000 training iterations, using a learning rate of α = 5e−4 and regularization terms β_s = 0.01 and β_z = 0.0001.

IV.1.3 Results

The most immediate quantity to compare between the WORLD model and the DSWM is the reconstruction accuracy of each model’s auto-regressive rollouts in a novel environment. It is here that we expect the additional complexity of the DSWM over the WORLD model to allow for better predictions. We use a separate set of five held-out pattern maps to create five novel environments for each of the four different topographies to use as a test set. We collect predictions by first allowing the agent to run for 30 time-steps within an environment, and then auto-regressively predicting the next 20 observations.
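A minimal sketch of this evaluation protocol is given below. The `observe` and `imagine` methods are illustrative stand-ins for the models’ actual interfaces, and are meant only to make the burn-in and rollout structure concrete.

```python
import numpy as np

def rollout_error(model, episode_obs, episode_actions, burn_in=30, horizon=20):
    """Score a model's auto-regressive rollout against real observations."""
    # Burn-in: condition the model on real experience so its latent state
    # (and memory, in the DSWM case) reflects the current environment.
    for t in range(burn_in):
        model.observe(episode_obs[t], episode_actions[t])

    # Auto-regressive rollout: predict the next `horizon` observations using
    # only the agent's actions, then score them against the real observations.
    errors = []
    for t in range(burn_in, burn_in + horizon):
        predicted = model.imagine(episode_actions[t])
        errors.append(np.mean((predicted - episode_obs[t]) ** 2))
    return float(np.sum(errors))
```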
We find that for all tested environments the DSWM is able to more accurately predict sequences of observations in these novel environments, which were not part of the dataset used for training (DSWM Mean = 6.025, Std = 6.573; WORLD Mean = 8.752, Std = 4.594; p < 0.001). See Figure 31 for the individual losses within each environment. These results suggest that the DSWM does indeed have additional generalization capacity compared to the WORLD model.

Figure 31: Reconstruction errors from rollouts of both WORLD and DSWM models in four different topographical environments. Error bars represent standard error. In all environments, DSWM is able to significantly better predict trajectories of future observations than the WORLD model.

We can also inspect the predictions produced by each model qualitatively. Example auto-regressive rollouts from the two models are presented in Figure 32. We can see that while both models are reasonably accurate at predicting the structure of the environment, the WORLD model fails to predict the correct content in novel environments, whereas the DSWM is able to predict both the content and the structure. This provides evidence that the DSWM is able to adapt to an environment’s novel visual content as long as the environment retains a familiar topographical structure.

Figure 32: Examples of reconstructed observations from rollouts of both WORLD and DSWM models in four different topographical environments. Environments use pattern maps reserved for testing, and not seen during training. In all environments, DSWM is able to better predict the true trajectory of future observations within the novel environment.

We next examined the learned latent representations within the DSWM, asking whether the learned representation of the s latent space reflects place-like firing properties. Given the loss function, which induces a representation from which the agent position can be decoded, we would expect such a representation to arise. This is not guaranteed, however, since the observations being encoded into s contain both spatial and non-spatial information, and in some cases the non-spatial information dominates the observation.

To answer this question, we can qualitatively examine the learned representations of s mapped onto the environment topography. The firing affinity of cells within the learned representation is presented in Figure 33. We find that the representations can best be described as indeed being place-like in their firing affinities. In particular, we find that the inferred s_t units are highly spatially local, whereas the s_{t+1} units generated by the forward model have wider spatial selectivity. This can be seen as connected to the dentate gyrus / CA3 and CA1 regions of the hippocampus, with the two regions being involved in either latent state inference (pattern separation) or generation (pattern completion).

Figure 33: Examples of activations of the first four units of inferred and generated s from the DSWM model in each of the four different environment topographies.
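One straightforward way to produce firing-affinity maps like those in Figure 33 is to average each latent unit’s activation over the grid positions the agent visits. The sketch below assumes logged position and activation arrays and is illustrative of this kind of analysis rather than the exact code used here.

```python
import numpy as np

def firing_maps(positions, activations, grid_size=11):
    """Average activation of each latent unit at each grid location.

    positions   : int array of shape (T, 2) with the agent's (x, y) at each step
    activations : float array of shape (T, D) with the latent vector at each step
    Returns an array of shape (D, grid_size, grid_size) of mean activations.
    """
    T, D = activations.shape
    maps = np.zeros((D, grid_size, grid_size))
    counts = np.zeros((grid_size, grid_size))
    for (x, y), act in zip(positions, activations):
        maps[:, y, x] += act
        counts[y, x] += 1
    return maps / np.maximum(counts, 1)   # avoid dividing by zero for unvisited cells
```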
IV.2 Goal-directed Navigation in Environments with Novel Content

Given the evidence that the DSWM is able to adapt to novel environment content when used to generate imagined trajectories, the next question we can ask is whether it can do the same when serving as a state space for performing goal-directed navigation. In this section, we use the learned latent spaces from the trained models in the previous section as basis functions for performing reinforcement learning, as done in Chapter III.

Instead of performing navigation within the same environment used for training, we use a set of environments with the same structural topographies but different pattern maps, providing different ‘content’ information within each observation of the environment. Here we compare the DSWM context latent space s to the WORLD model latent space z, as well as to a onehot-encoding baseline. We find that the DSWM latent space provides a basis function for learning which results in both faster learning and overall better performance than either the latent space from the WORLD model or the pre-computed onehot encoding. Furthermore, we find that the DSWM can be used to perform additional offline learning using the DYNA algorithm to further improve learning performance.

IV.2.1 Evaluation Methods

In order to examine the goal-directed navigational abilities of agents using the learned state spaces, we use the same test environments from the previous section. We employ a goal-directed navigation task which involves the agent finding a hidden goal in one of the states of the environment. Halfway through a given training session, in this case 50 episodes into training, the goal changes to a new location. We use the same set of goal locations for all topographies in order to allow for consistent comparison between results. As such, in all environments except for the “Rooms Maze,” there exists the same optimal policy for each goal. Due to the nature of the topography of the “Rooms Maze” environments, this optimal policy is slightly different, and involves dealing with the bottleneck between rooms. See Figure 34 for a visual representation of the goal locations before and after the change for each environment topography.

Figure 34: Four different environment topographies, each showing the initial goal location for the first 50 episodes (top) and the second goal location for the following 50 episodes (bottom). Red corresponds to agent start location. Blue corresponds to wall/obstacle location. Green corresponds to goal location.

All agents are trained using the Successor Similarity Learning (SSL) algorithm introduced in Chapter III, with a learning rate of α = 0.1. Agents are trained for 100 episodes each, with a maximum of 100 steps per episode, using an environment from the test set of pattern maps. Each training session is repeated with five separate agent initialization seeds in order to better understand the learning dynamics.

IV.2.2 Results

We find that for all four environments, the state space derived from the DSWM latent space s is able to match or outperform both the state space derived from the WORLD model latent space z and the one-hot state space encoding. See Figure 35 for the relevant learning curves for each agent. See also Table 1 for the reported mean and median time-to-goal of the final 20 episodes of training for each agent.

Figure 35: Learning curves in goal-directed navigation task for each of the four unique environmental topographies. Each curve represents the average of five separate initialization seeds for the agent. Error bars represent standard error.
Topography (Optimal)   Statistic   WORLD   DSWM    DSWM+DYNA   ONEHOT
Open (5)               Mean        32.1    5.81    5.0         7.76
                       Median      7.45    5.0     5.0         7.1
Rooms (7)              Mean        99.0    23.93   7.04        8.64
                       Median      99.0    7.6     7.0         7.55
Ring (5)               Mean        99.0    23.8    5.0         5.0
                       Median      99.0    5.0     5.0         5.0
Hallway (5)            Mean        79.22   5.0     5.0         5.0
                       Median      99.0    5.0     5.0         5.0

Table 1: Statistics from the final 20 episodes of each training session for goal-directed agents. DSWM+DYNA results in the most consistent learning, with near optimal performance in all four topographies.

We furthermore find that in all environment topographies, the addition of the Dyna algorithm improves the performance of the DSWM state space-based agents, and results in optimal performance for three of the four environments, with the “Rooms Maze” performance being slightly below optimal. We can interpret these results as a clear sign that the learned latent space in the DSWM model is useful not only for predicting trajectories of experience in novel environments, but also for subserving goal-directed navigation in novel environments. Additionally, the fact that the DSWM+DYNA model performs best suggests that the DSWM has learned a coherent model of the dynamics of the environment which is able to abstract away the specific content of the environment.

IV.3 Learning from Egocentric Observations

Thus far we have demonstrated the properties of generative temporal models using exclusively environments with observation and action spaces defined with respect to allocentric coordinate systems. As such, we have missed a critical aspect of animal learning and acting: the fact that animals learn and act from a limited egocentric perspective. In this section we introduce a new three-dimensional environment from which high-dimensional egocentric visual observations can be derived. We then show that an agent using a DSWM model can learn to predict trajectories through this more complex environment. Crucially, in animals this ability involves the transformation of purely egocentric sensory observations and actions into an allocentric reference frame, and then a reverse transformation back into an egocentric coordinate space for prediction and goal-directed action (Zaehle et al., 2007). We furthermore demonstrate that the DSWM is able to accomplish this in a largely unsupervised manner.

IV.3.1 Evaluation Methods

In order to examine the properties of various generative temporal models within environments with an egocentric reference frame, we use a novel three-dimensional environment built using Unity, a 3D rendering and physics engine, taking advantage of the ML-Agents toolkit in order to enable the agents to interface with this environment (Juliani et al., 2018). The environment can be thought of as a three-dimensional version of the gridworld environment presented above. The environment consists of a set of nodes, each of which the agent or a wall can occupy. At a given time-step, the agent is presented with an observation derived from the agent’s current position and orientation within the environment. This observation consists of a 64 × 64 × 3 color image presenting a 120-degree field of view. See Figure 36 for renderings of the environment from multiple different angles, including the agent’s perspective. Within the environment, the agent can take one of five actions: move forward, move left, move right, rotate 90 degrees to the left, or rotate 90 degrees to the right. There are four possible orientations which the agent can take, consisting of facing each of the four cardinal directions.
As such, in a 7 × 7 environment there are 196 possible states the agent can occupy, assuming there are no wall obstacles within the environment.

In order to test the generalization ability of the agent, we use a similar set of topographies and pattern maps as in the previous section. Instead of the pattern map being overlaid on the open spaces of the environment, as was done in the 2D case, here we overlay it on the wall obstacles. Additionally, we use three possible colors, red, green, and blue, when defining the map in order to introduce greater visual variety to the “content” of a given environment. Figure 36 contains an example of this pattern map in the 3D version of the “Rooms Maze” environment.

Figure 36: Three-dimensional gridworld environment rendered using Unity. Two example topographies shown. Top: open maze. Bottom: rooms maze. Left: top-down perspective of environment. Middle: side perspective of environment. Right: egocentric observations provided to agent within environment.

As was the case in the 2D environments, in order to generate a training set we generate 100 pattern maps for each topography, and use four different topographies: “Open Maze,” “Ring Maze,” “Rooms Maze,” and “Hallway Maze.” We then collect 100 episodes of 50 time-steps each for each topography, and train a separate WORLD and DSWM model on each dataset. Due to the larger state space, we use a vector of length 128 for the latent context space s in the DSWM. We also modify both the WORLD and DSWM models to use a three-layer convolutional neural network (CNN) (LeCun, Bengio, & Hinton, 2015) to encode the image-based observations, and likewise use a de-convolutional network to decode the predicted observations from these models. We use the same convolutional architecture described by Ha and Schmidhuber (2018). Otherwise, we use the same hyper-parameters defined in the 2D experiments, including training for 5000 iterations.

IV.3.2 Results

Examining the reconstruction error on a test set of pattern maps for each environment, we find that in all cases the DSWM (Mean = 223.35, Std = 213.39) is able to significantly better predict future trajectories of observations than the WORLD model (Mean = 304.17, Std = 215.23; p < 0.001). Figure 37 presents the reconstruction error for each model and topography visually.

Figure 37: Reconstruction errors from rollouts of both WORLD and DSWM models in four different topographical environments. Error bars represent standard error. In all environments, DSWM is able to significantly better predict trajectories of future observations than the WORLD model.

We can also examine the quality and coherence of the predictions of the two models. In Figure 38 we present example rollouts in each of the four topographies on test-set pattern maps. In all cases, the WORLD model produces trajectories which diverge more severely than those of the DSWM. In particular, the DSWM is better able to track the correct colors of the wall obstacles, whereas the WORLD model often predicts incorrect colors. Lastly, we can also examine the learned representations of the DSWM latent space s.
Critically, here we are learning the latent representation from egocentric observations, which by definition do not a priori contain the allocentric information from which a place code could be derived. Given our loss function and the existence of a recurrent neural network processing these observations, there is reason to believe that the model could learn to integrate the observation stream into an allocentric place code. In Figure 39 we present example activation patterns of the latent code s from a trained DSWM in each of the four topographies. We find that a place-like code does indeed develop within the model, providing evidence for a learned translation from an egocentric to an allocentric representation.

Figure 38: Examples of reconstructed observations from rollouts of both WORLD and DSWM models in four different topographical environments. Environments use pattern maps reserved for testing, and not seen during training. In all environments, DSWM is able to better predict the true trajectory of future observations within the novel environment.

Figure 39: Examples of activations of four selected units of inferred and generated s from the DSWM model in each of the four different environment topographies.

IV.4 Goal-directed Navigation from Egocentric Observations

Generating coherent trajectories and possessing a structured latent space are useful to the extent that they support useful goal-directed behavior for the animal. While we have previously demonstrated that the DSWM latent space supports goal-directed navigation in the allocentric case, here we demonstrate that it also does so in environments with high-dimensional egocentric observations.

IV.4.1 Evaluation Methods

We use the same environments described in the previous section to test for goal-directed navigational abilities. We compare a DSWM state space, a DSWM model augmented with Dyna, and a one-hot encoded state space. The WORLD model state space is excluded here due to its poor navigational performance in the 2D environments, which suggests that learning would not be possible in the 3D environments either. We use the SSL algorithm to train each agent.

Due to the expanded state and action spaces of the environment, we use a fixed-goal navigation task in which the goal location remains constant throughout training. See Figure 40 for the starting agent and goal positions for each of the four topographies. For each training session, we train the agent for 100 episodes with a maximum of 200 time-steps each. Training sessions are repeated five times, each with a different initialization seed for the agent.

Figure 40: Starting agent and goal positions for each of the four topographies in the 3D environment. Red: agent position. Green: goal position. Blue: wall positions.

IV.4.2 Results

We find that learning is more difficult for all agents in the 3D environments than in the 2D variants. This is true both for the time to convergence and for the ability of a given algorithm to converge at all. This can be seen in the learning curves presented in Figure 41 and the full table of results presented in Table 2.
Despite this generally increased difficulty in learning, we find that the DSWM state space augmented with DYNA is either competitive with or outperforms the pre-computed onehot state space. In both the Open Maze and Ring Maze, we find that agents with the DSWM state space by itself fail to converge to an optimal policy in some of the five training sessions, while they converge in others. Taken together, these results suggest that the DSWM is able to learn a useful state space which is competitive with a pre-computed onehot encoding of the environment state. The quality of this state space varies, however, based on the topography. The difficulty of the learning problem also varies by topography: we find that learning in the Open and Ring Maze environments is easier, whereas the Rooms and Hallway Mazes are more difficult.

Figure 41: Learning curves in goal-directed navigation task for each of the four unique environmental topographies. Each curve represents the average of five separate initialization seeds for the agent. Error bars represent standard error.

Topography (Optimal)   Statistic   DSWM    DSWM+DYNA   ONEHOT
Open (12)              Mean        86.9    18.0        45.31
                       Median      12.5    12.0        35.85
Rooms (12)             Mean        65.51   30.06       103.58
                       Median      25.1    26.9        118.7
Ring (8)               Mean        87.18   9.96        8.17
                       Median      16.3    8.0         8.0
Hallway (12)           Mean        38.28   96.5        94.3
                       Median      12.0    113.8       91.05

Table 2: Statistics from the final 20 episodes of each training session for goal-directed agents in the 3D environment. DSWM+DYNA results in the most consistent learning, with near optimal performance in all four topographies.

IV.5 Discussion

In this chapter, we introduced the Dual Stream World Model and analyzed its properties with respect to both the coherent generation of trajectories of experience in environments with novel content, and the ability to provide support for goal-directed navigation. The proposed model takes inspiration from recent generative temporal models which include differentiable memory stores, such as the Model-Based Predictor (MBP) (Wayne et al., 2018), the Generative Temporal Model with Spatial Memory (GTM-SM) (Fraccaro et al., 2018), and the Tolman-Eichenbaum Machine (TEM) (Whittington et al., 2019). While related to each of these models, there are important differences which set the DSWM apart.

While both the GTM-SM and DSWM use a similar DND as a storage and look-up mechanism (Pritzel et al., 2017), the DSWM uses a more general-purpose representation for the ‘where’ variable s. Furthermore, we demonstrate the usefulness of this representation in both a trajectory generation task and a navigation task, whereas the GTM-SM is used only for a trajectory generation task. We believe that the learned representation described by Fraccaro et al. (2018) is not suitable for reinforcement learning, as it corresponds to continuous-valued x and y coordinates, which we demonstrated in Chapter III do not allow for convergence during learning. Likewise, whereas the TEM takes more specific inspiration from hippocampal anatomy, and thus could be seen as more biologically plausible, the authors do not demonstrate the usefulness of the learned representations for any navigation tasks. The TEM is also only demonstrated with hand-crafted low-dimensional observations, whereas we have demonstrated the efficacy of the DSWM on high-dimensional egocentric observations similar to those an animal would encounter during navigation. Lastly, we can compare the DSWM to the MBP.
Both of these models are validated using high-dimensional observations on tasks of both trajectory generation and goal-directed navigation. The MBP, however, utilizes a single latent state, and was not tested in a setting where the environment content changes significantly, so the capability of the model to adapt to such changes is not clear. As such, the DSWM can be seen as a meaningful addition to this growing ensemble of dictionary-based models of hippocampal learning, with clearly demonstrated adaptability to changes in environmental content, while maintaining the ability to generate coherent trajectories of experience and to support goal-directed navigation.

One potential weakness of the DSWM compared to the other models described is the need for an auxiliary loss function based on spatial information in order to train the latent representation s. In the other models described, this learning signal is either not used or is built more explicitly into the model, as in the case of the GTM-SM. Given the evidence for representations of both spatial position and head direction in regions adjacent to the hippocampus, namely the MEC (Hafting et al., 2005) and subiculum (Taube et al., 1990), we believe it is not implausible for such a signal to help guide the place cell representations within the hippocampus proper. Still, we acknowledge that the ability to induce a similar representation in an unsupervised manner would amount to a significant improvement of the model presented here.

The DSWM can be seen as providing a means of largely actualizing the cognitive mapping system described in the introduction of this work. We have presented a model which can adapt to changes in both goal location and the content of the environment in a rapid manner consistent with experimental evidence in mammals. One aspect missing from the proposed model so far, however, is the ability to adapt to changes in the structure of the environment itself. It is this capability to which we turn in the next chapter.

CHAPTER V
STRUCTURAL GENERALIZATION AND CONTEXT MODELS

In the previous chapter, we introduced the Dual Stream World Model and demonstrated its capacity to model a number of known properties of the medial temporal lobe. In particular, we demonstrated the ability of the model to learn to adapt to changes in environment content, and to learn an allocentric representation useful for navigation from an egocentric observation signal. So far, however, we have focused on environments with fixed topographic structure. In doing so, we have ignored a key property of the cognitive map in animals: the ability to adapt to structural changes in the environment (Tolman, 1948). For a living animal, such structural changes can take place either in the form of arriving in a novel environment, or in the introduction of shortcuts or roadblocks into a familiar environmental structure. Both of these capabilities are based on the ability of the animal to adapt to changes in environment topography, and to take advantage of a non-reactive, generalized representation of space. In this chapter we explore a series of extensions to generative temporal models which provide them with some capacity for structural generalization.

In addition to the ability to adapt to changes in the structure of an environment, cognitive maps in animals also support the ability to make sense of their surroundings in many different environments.
Making sense of an environment in this way involves the generation of coherent imagined trajectories of experience, as well as the ability to perform goal-directed navigation. In the case of imagined trajectories of experience, these can come from environments which the animal does not currently inhabit (Karlsson & Frank, 2009). This can be seen as the ability to store and retrieve not only the content of a specific map, but multiple such maps simultaneously. The maintenance of multiple cognitive maps, and the ability to adapt to changes in environment structure within a single map, are both instances of a more general property of the medial temporal lobe and the cognitive maps it supports. In both cases, what is additionally represented alongside ‘what’ and ‘where’ information is an additional ‘how’ variable. In cases where environmental changes are large and discrete, this ‘how’ simply represents the different maps. In cases where changes within the environment are more subtle, this ‘how’ represents specifics about the nature of the task. Returning to the language metaphor presented earlier, this ‘how’ information can be thought of as the specific grammar rules which apply at a given time and in a given context.

Building on the previously demonstrated generative temporal models, and on our understanding of the medial temporal lobe, we propose two new models: a context-augmented generative temporal model, or Contextual World Model (CWORLD), and a Tri-Stream World Model (TSWM), which learns separate representations for ‘what,’ ‘where,’ and ‘how’ (or context). In this chapter we will formally define the CWORLD and TSWM models, and define a few possible loss functions which can be used to train the contextual representation c. We will demonstrate the efficacy of each with respect to the quality of the model’s trajectory predictions in novel environments, and explore the nature of the learned representation c, which we connect with the contextual scene representation found within the parahippocampal area in humans (R. A. Epstein, 2008).

V.1 Learning an Index-based Context Representation

Thus far we have examined the capabilities of generative temporal models in environments with a single, fixed, structural topography. Animals in the wild, however, are able to adapt their behavior to multiple different environments. In this chapter, we explore the ability of a context-augmented generative temporal model (CWORLD) to learn to model the dynamics of more than one environment structure. In doing so, the question arises as to what the ideal loss function is for training such a representation.

In this section, we explore one of the simplest possible objective functions for training a contextual representation c. When there is a known, fixed set of environment topographies, we can train the context representation simply to be useful for predicting the identity of the environment topography. We find that the CWORLD model is indeed able to be trained to perform this identity prediction task for the current topography the agent is exploring. Furthermore, we show that the latent representation which supports this identification allows the generative model to make more accurate predictions of trajectories of future observations than a WORLD model without any explicit contextual representation.

V.1.1 Modeling Methods

We compare a WORLD model, as described in Chapter II, to a Contextual World Model (CWORLD), described below.
Similar to the DSWM, the CWORLD model uses two separate encoding streams and two latent variables, z and c. Once encoded, both variables, along with the action selected by the agent, are used as input to the RNN layer of the network to produce a generated z, from which a predicted observation is decoded. Here we train the inferred z representation using the traditional variational autoencoder loss function, as done in all previous models. We train the contextual representation c to predict the map identity using a cross entropy loss function applied after a series of decoding layers, in addition to the variational regularization loss for the gumbel-softmax distribution. The equations governing the inference and generation process are provided below. For a graphical representation, see Figure 42.

Figure 42: Diagram of a Contextual World Model. Blue represents content information. Red represents context information. Purple represents joint content and context information. White represents model inputs. Green represents model outputs. Nodes marked with a ∗ indicate information at the next time step of the simulation.

Inference phase:

z_t \sim p_{enc}(z_t \mid o_t) (V.1)
c_t \sim p_{enc}(c_t \mid o_t, h^c_t) (V.2)
h^c_t = f^c_{forward}(o_t, a_{t-1}, h_{t-1}) (V.3)
h^z_t = f^z_{forward}(z_t, a_t, h_{t-1}) (V.4)

Generation phase:

z_{t+1} \sim q_{forward}(z_{t+1} \mid z_t, c_t, a_t, h^z_t) (V.5)
o_{t+1} = f_{decode}(z_{t+1}) (V.6)

V.1.2 Evaluation Methods

In order to examine the capacity for structural adaptation, we use a set of two-dimensional environments with novel topographies. We generate these topographies using the inverse Fourier fractal generation method (Bies, Boydston, et al., 2016). We use a value of β = 2.0 to define the fractal complexity, and threshold each generated fractal map at 0.65, setting all values less than the threshold to 0 and all values above it to 1. We then use the values set to 1 to represent walls and obstacles within the environment, and the values set to 0 to represent the navigable ground. We generate each topography by providing a unique seed to the random number generator used in the fractal generation process. We use this process to generate sixteen unique topographies within environments of 13 × 13 units, including two units on each edge for observation padding (see below). Figure 43 provides a visual representation of these environments.

We collect 1000 trajectories of 50 time-steps each. For each trajectory, we randomly select one of the sixteen environments, and a random starting position for the agent within the environment. The agent follows a semi-random allocentric movement policy as described above. The observation provided to the agent at each time-step is a 5 × 5 window around the agent’s current position, providing information about the presence or absence of a wall in each location. As such, the total size of the observation vector is 25.

We train both models using a latent z representation composed of four gumbel-softmax distributions of size 16 each, for a total representation size of 64. We likewise use the same number and size of distributions for the c representation. We train the entire model end-to-end using a learning rate of α = 5e−4. We train both models for 5000 iterations on batches of entire trajectories, using a batch size of 3.
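The sketch below illustrates one way the topography generation procedure described above could be implemented: 2D noise with a 1/f^β power spectrum is produced via an inverse Fourier transform, normalized, and thresholded into walls and navigable ground. The exact spectral and normalization conventions of Bies, Boydston, et al. (2016) may differ, so this is an illustrative sketch rather than the code used to generate Figure 43.

```python
import numpy as np

def fractal_topography(size=13, beta=2.0, threshold=0.65, seed=0):
    """Generate a binary wall map from thresholded 1/f^beta noise."""
    rng = np.random.default_rng(seed)
    # Radial frequency magnitude for each Fourier coefficient.
    fx = np.fft.fftfreq(size)
    fy = np.fft.fftfreq(size)
    f = np.sqrt(fx[None, :] ** 2 + fy[:, None] ** 2)
    f[0, 0] = np.inf                          # suppress the DC component
    # Random phases with 1/f^(beta/2) amplitudes give a 1/f^beta power spectrum.
    phases = np.exp(2j * np.pi * rng.random((size, size)))
    spectrum = phases / f ** (beta / 2.0)
    field = np.real(np.fft.ifft2(spectrum))
    # Normalize to [0, 1] and threshold into walls (1) and navigable ground (0).
    field = (field - field.min()) / (field.max() - field.min())
    return (field > threshold).astype(int)
```

Calling the function with sixteen different seeds would yield sixteen distinct topographies of the kind shown in Figure 43.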
Presented in Figure 44 is the correlation matrix for the predicted and actual environment topography identities. Predictions are taken from the model after a "burn-in period" of an initial 30 time-steps within the environment, which provides the agent an opportunity to develop the contextual representation. Predictions are averaged over 100 episodes, and over the final 20 time-steps of each episode.

We find that in 11 of the 16 environments the model assigned a 50% or greater probability to the correct environment identity. In an additional two environments, the model assigned a 40% probability (a plurality) to the correct environment identity. This leaves only three environments, 'I,' 'J,' and 'O,' which the agent had difficulty identifying. If we examine these environments, we find that they all share largely the same topographic features, with a prominent protrusion in the northwest corner, and largely open space otherwise.

Figure 44: Classification accuracy of the index-based contextual world model. Left: Correlation matrix of predicted and actual environment topography identities. Right: Three environment topographies which were misclassified by the CWORLD model.

Overall, these results suggest that the model is able to successfully classify environment identity based on partial information about the topography, obtained via the observations available to the agent.

When examining the reconstruction errors of the WORLD and CWORLD models, we find that there is a significant difference in quality between the two models. The WORLD model produces poorer reconstructions (Mean = 2.98, Std = 3.04) than the CWORLD model (Mean = 2.85, Std = 3.03), t(67198) = 5.37, p < 0.001. These results are presented visually in Figure 45.

Figure 45: Reconstruction error for predicted trajectories of future observations for both the WORLD and CWORLD models. The CWORLD model is able to predict observations with significantly less error than the WORLD model.

We can interpret these results as providing evidence that the addition of the contextual variable c does indeed allow for greater reconstruction accuracy when predicting trajectories of observations in different environments. Put more simply, the model having a sense of which environment it is in allows it to better predict what it will observe in that environment.

V.2 Learning a Map-based Context Representation

In the previous section we demonstrated that a generative temporal model could learn a useful context representation c by training the model to predict the identity of the environment topography the agent was in. While the efficacy of this approach has been demonstrated, it has a number of drawbacks. Firstly, it is not very biologically plausible, due to the lack of an explicit numbered representation of each environment an animal experiences. Furthermore, using a fixed index results in a model only capable of learning a context for a fixed number of environments. We know that animals learn to make sense of a large number of environments.
More importantly, using fixed indices for each environment imposes a strict representational boundary between environments, and prevents generalized knowledge gained in one environment from being applied to another. Finally, while this context representation resulted in a statistically significant decrease in reconstruction error when predicting trajectories of observations, this decrease was relatively modest. In this section, we propose a second loss function to train the context representation c which addresses these issues.

We propose a new objective, which consists of training the context representation c to be useful for predicting the structure of the topography of the environment the agent is currently within. This training objective has the benefit of not being bounded by the number of training or testing environments, since each environment can be defined uniquely by its topography. Secondly, since we are not training the model to predict a fixed set of information, as was the case when predicting the index of the environment, this new objective allows for generalization to novel environmental structures. In this section we demonstrate the efficacy of this approach compared to both the context-less WORLD model and the contextual world model using index learning (CWORLD-I). We refer to the approach we propose and compare here as a Contextual World Model with Map-Prediction (CWORLD-M).

V.2.1 Evaluation Methods

We use the same process to generate training environments as in the previous section, using inverse Fourier fractal generation. In addition to collecting the 5×5 observations for each trajectory, we additionally collect a vector representing the environment topography. In the case of a 13×13 environment, this corresponds to a binary vector of size 169.

V.2.2 Modeling Methods

We compare a WORLD model to CWORLD-I and the proposed CWORLD-M. The latent representation c of the CWORLD-M model is trained using a binary cross entropy loss function. This loss function compares the true map topography vector to the predicted vector, deriving a gradient used to improve the representation of c during training.

L(c) = −Σ [ p log(q) + (1 − p) log(1 − q) ] (V.7)

We use the same set of hyperparameters from the previous experiment with the CWORLD-I model.

V.2.3 Results

We can first examine whether the prediction loss used by the CWORLD-M model was able to produce a representation c which is indeed able to predict the environment topography. While there is no baseline to measure this model's prediction error against, we can qualitatively evaluate the predicted topographies.

Figure 46 presents example topography predictions from the CWORLD-M model alongside the true map topography. In most cases, the model is initially unsure of the correct topography, and assigns a medium probability of wall location to most of the units in the environment. As the agent moves around and collects more evidence via observations, the prediction becomes more certain, and in most cases eventually reflects the true underlying environmental topography.

Figure 46: True environment topography alongside predictions from the CWORLD-M model at test-time for environment topographies A-E. White corresponds to regions of high certainty that there is a wall, while black corresponds to regions of low certainty. Eighteen predicted topographies from the model are shown, consisting of the first and last nine of the "burn-in" period.
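For concreteness, the map-prediction objective of Eq. V.7 amounts to a standard binary cross entropy between a decoded occupancy estimate and the true 169-dimensional topography vector. The sketch below shows one minimal way this head could be written; the decoder architecture and layer sizes are illustrative assumptions, not the exact configuration used in these experiments.

```python
# Sketch of the CWORLD-M map-prediction head (Eq. V.7): a small decoder maps the
# contextual latent c to a 13 x 13 = 169-dimensional wall-occupancy estimate,
# scored against the true topography with binary cross entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MapHead(nn.Module):
    def __init__(self, c_dim=64, map_dim=13 * 13):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Linear(c_dim, 256), nn.ReLU(),
            nn.Linear(256, map_dim),          # logits over wall / open for each cell
        )

    def loss(self, c, true_map):
        # c: (batch, c_dim); true_map: (batch, 169) float tensor of 0/1 wall indicators
        logits = self.decode(c)
        return F.binary_cross_entropy_with_logits(logits, true_map)
```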
We can next ask whether this map prediction task, while able to produce qualitatively convincing predictions of the environment, is actually useful for predicting future observation trajectories. We do so by comparing the reconstruction error produced by each model when unrolling a trajectory of predicted future observations. In each case, we allow the model a 30 time-step "burn-in" period, followed by a 20 time-step auto-regressive unroll of the model. We find that there is indeed a significant difference in performance between the three models in this prediction task (F(2,100797) = 44.32, p < 0.001), with the CWORLD-M model (Mean = 2.75, Std = 2.99) producing predicted observations with significantly less deviation from the true observations than the other two models (p < 0.001). Figure 47 presents these results visually.

Figure 47: Reconstruction error for predicted trajectories of future observations for the WORLD, CWORLD-I, and CWORLD-M models. The CWORLD-M model is able to predict observations with significantly less error than the other two models.

V.3 Learning Implicit Context Representations

In the previous sections of this chapter we have presented two different objective functions which can be used to train a useful context representation c within a generative temporal model. In particular, the second function presented, the topography prediction task, can be used in novel environments, since it does not rely on predicting a quantity limited by the training dataset. One remaining issue with this objective function is that there is no biologically plausible complete topographical representation of the environment which an animal might use to train such a context representation. Indeed, it seems unlikely that animals would maintain literal topographic representations of space, unless explicitly trained to do so.

In this section, we propose an unsupervised loss function which shapes the context space c to be useful for prediction tasks. The insight we build on is that what we would like our model to optimize is the predictive quality of the dynamics model over z∗. Instead of developing a surrogate loss function for this optimization problem, we can directly learn a c which helps to optimize this quantity. The unsupervised loss function which we propose to train c simply consists of allowing the gradient from the forward model to pass back through c and its observation encoder. We augment this with an additional forward model over the context c, so that both z and c can evolve independently during auto-regressive trajectory predictions. We refer to this model as CWORLD-U, due to the unsupervised nature of the learned context. We demonstrate that this new model variant results in significantly greater predictive accuracy than either of the previously proposed contextual world models.

This separate contextual representation, which evolves on its own and guides the learning process of the z forward model, can be seen as a kind of hierarchical system, where more abstract information about the dynamics of the environment is encoded into c, while only relevant local information is encoded into z. This has a connection to the interplay between the hippocampus and the parahippocampal area, which is known to respond preferentially to stimuli containing structural information (R. Epstein & Kanwisher, 1998), and contains a more general contextual representation of the current scene (R. A. Epstein, 2008).
This region is known to tightly interface with the hippocampus to pass this information onward (Van Hoesen, 1982). Here we propose that a potential purpose for this interplay is to aid the trajectory generation which takes place within the hippocampus, by providing it the correct context with which to generate coherent sequences of activation.

V.3.1 Modeling Methods

The CWORLD-U model consists of a modified version of the previous CWORLD models. In this case, we augment the model with an additional forward model over the c latent state. This is needed in order to ensure that when unrolling the model to predict trajectories of observations, both the z and c states are kept up-to-date. This was not necessary in CWORLD-I and CWORLD-M, where c could be interpreted as a fixed quantity (either an environment index or a map representation). In the case where c is learned in an unsupervised fashion, we expect that the representation will evolve over time, and therefore a forward dynamics model is needed. Critically for training this model, the c and z dynamics models take as input both c and z from the current time-step, but the gradient is only allowed to flow backwards from the z dynamics model into c. This ensures that the latent state c is formed to aid the development of z, and not the other way around. See Figure 48 for a visual representation of the network flow.

Figure 48: Diagram of the CWORLD-U model. Red represents context information. Purple represents joint content and context information. White represents model inputs. Green represents model outputs. Nodes marked with a ∗ indicate information at the next time step of the simulation.

c_{t+1} ∼ q(c_{t+1} | c_t, z_t, a_t, h^c_t) (V.8)
z_{t+1} ∼ q(z_{t+1} | c_t, z_t, a_t, h^z_t) (V.9)
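One way this gradient gating could be implemented is with a stop-gradient (detach) on the z input to the c forward model, while leaving c attached as an input to the z forward model, so that the z prediction loss shapes c but not the reverse. The sketch below illustrates a single deterministic generation step in this spirit; module names, layer sizes, and the exact placement of the stop-gradient are illustrative assumptions, and the original model uses gumbel-softmax latent distributions rather than the simple linear outputs shown here.

```python
# Sketch of one CWORLD-U generation step (Eqs. V.8-V.9) with gradient gating:
# the c forward model receives a detached copy of z, so no gradient flows from
# the context dynamics back into the content stream, while the z forward model
# receives c directly, allowing the z prediction loss to shape c.
import torch
import torch.nn as nn

class ContextUnrollStep(nn.Module):
    def __init__(self, z_dim=64, c_dim=64, a_dim=4, h_dim=256):
        super().__init__()
        self.c_rnn = nn.GRUCell(c_dim + z_dim + a_dim, h_dim)   # forward model over c
        self.z_rnn = nn.GRUCell(z_dim + c_dim + a_dim, h_dim)   # forward model over z
        self.c_out = nn.Linear(h_dim, c_dim)
        self.z_out = nn.Linear(h_dim, z_dim)

    def forward(self, z_t, c_t, a_t, h_c, h_z):
        # Context dynamics: the z input is detached, blocking gradients into z.
        h_c = self.c_rnn(torch.cat([c_t, z_t.detach(), a_t], dim=-1), h_c)
        c_next = self.c_out(h_c)
        # Content dynamics: c is *not* detached, so the z prediction loss shapes c.
        h_z = self.z_rnn(torch.cat([z_t, c_t, a_t], dim=-1), h_z)
        z_next = self.z_out(h_z)
        return z_next, c_next, h_c, h_z
```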
V.3.2 Evaluation Methods

In order to evaluate this novel model variant, we utilize both the same dataset of 16 fractal topographies used previously and an additional larger dataset containing environments with 100 different fractal topographies. As done previously, this new dataset consists of 1000 episodes of 50 time-steps each. Each trajectory is sampled from a random fractal topography, and the agent starting position is randomized within the open space in the environment.

In the smaller dataset, we compare the WORLD model to the three contextual variants, CWORLD-I, CWORLD-M, and CWORLD-U, in their ability to generate coherent trajectories of imagined observations in each of the sixteen environments. In the second experiment, using the larger dataset of 100 topographies, we then test on a set of hand-crafted environment topographies. See Figure 49 for these test environment topographies. We use this separate set of environments in order to examine the generalization ability of these models with respect to their ability to predict trajectories of observations in these unseen environments. The CWORLD-I model is excluded from this analysis, as we do not expect the loss function used to induce a representation which improves generalization. We train all models for 5000 iterations, using a learning rate of α = 5e−3 and a batch size of 3.

Figure 49: The nine test environments with hand-crafted Euclidean geometries (Rooms Maze, Open Maze, U Maze, Ring Maze, C Maze, T Maze, Hallway Maze, S Maze, and I Maze). Blue represents walls. Black represents navigable space for the agent.

V.3.3 Results

We first evaluate the reconstruction accuracy of imagined trajectories from all four model variants when trained and tested using the smaller dataset of 16 environment topographies. Figure 50 presents these results graphically. We find that there is a significant difference between the models' predictive accuracy, with CWORLD-U (Mean = 2.572, Std = 2.912) being able to predict future observation trajectories with significantly less error than all other models (p < 0.0001).

Figure 50: Reconstruction error for predicted trajectories of future observations for the WORLD model and contextual variants. The CWORLD-U model is able to predict observations with significantly less error than the WORLD or other CWORLD models when evaluated on the same sixteen environment topographies the models were trained on.

We can interpret this result as providing clear evidence that a learned contextual representation optimized to improve the dynamics model does indeed provide a better context than either an index-based or map-based representation. This is likely because of the adaptive nature of the learned c in CWORLD-U, which can change based on the current needs of the prediction problem, whereas the c in CWORLD-I and CWORLD-M is fixed.

We next turn to examining the predictive ability of the models trained using the larger dataset of 100 fractal environments. As mentioned above, we evaluate these models on a held-out set of nine hand-crafted topographies. See Figure 51 for a graphical presentation of reconstruction errors. We find that the CWORLD-U model (Mean = 3.171, Std = 3.256) is able to predict trajectories of future observations significantly better than the WORLD (Mean = 3.582, Std = 3.296) or CWORLD-M (Mean = 3.443, Std = 3.127) models (p < 0.001). These results validate our intuition that the CWORLD-U model is able to learn an evolving context representation c which allows for better generalization to unseen environments than a model without a context, or one with a fixed context.

Figure 51: Reconstruction error for predicted trajectories of future observations for the WORLD model and contextual variants. The CWORLD-U model is able to predict observations with significantly less error than the WORLD or other CWORLD models when evaluated in a set of nine hand-crafted environment topographies.

V.4 Adapting to Changes in Context and Content

Having demonstrated that context-enhanced generative temporal models can adapt their understanding of the transition dynamics to novel environment structures, we turn our attention to combining this ability with the content generalization explored in the previous chapter, and made possible by the Dual Stream World Model.

V.4.1 Modeling Methods

In the previous chapter we introduced the Dual-Stream World Model, which encoded incoming observations into separate z and s streams. Just as we augmented the WORLD model to produce the CWORLD model, we can likewise augment the DSWM with an additional context stream c to produce a model which we refer to as a Tri-Stream World Model (TSWM). These three streams can be thought of as roughly corresponding to transforming the incoming series of observations into 'what' z, 'where' s, and 'how' c representations.
In this model, z and s function largely as they did in the DSWM, but the context variable c is used as an additional input to the forward model over the s variables. Likewise, s and c are additionally provided as input to the c forward model to generate c∗. See Figure 52 for a visual representation of this network flow. By augmenting the DSWM in this way, we gain a generative model which is capable of both content generalization (DSWM) and structural generalization (context variable).

Figure 52: Diagram of the Tri-Stream World Model. Red represents context information. Blue represents content information. Purple represents joint content and context information. White represents model inputs. Green represents model outputs. Nodes marked with a ∗ indicate information at the next time step of the simulation.

V.4.2 Evaluation Methods

In order to evaluate the adaptation ability of the TSWM, we again utilize a set of 100 fractal environments of size 13×13, which are generated using the inverse-Fourier method (Bies, Boydston, et al., 2016). Because we will be evaluating models capable of content and context generalization, we use the same 2D environment content found in Chapter IV, where each open space in the environment is filled with either a green or red pixel. We use this dataset to evaluate the generative modeling capabilities of the TSWM compared to other baseline models.

V.4.3 Results

We first evaluate the generative modeling performance of the TSWM compared to other baseline models introduced previously. We find a significant difference between the performance of all models (F(6336,32) = 192.26, p < 0.001). We find that the CWORLD (Mean = 10.832, Std = 4.887), DSWM (Mean = 10.741, Std = 6.507), and TSWM (Mean = 10.335, Std = 6.260) models all outperform a baseline WORLD model (Mean = 11.711, Std = 5.138) (p < 0.001). Among these more complex models, there is no significant difference between the CWORLD and DSWM models (p = 0.123). We furthermore find that the TSWM is able to predict observation trajectories with significantly less reconstruction error than either the WORLD, CWORLD, or DSWM models (p < 0.001), suggesting that the contributions of the CWORLD and DSWM models are independent and complementary. We present these results in Figure 53.

Figure 53: Reconstruction error for predicted trajectories of future observations for the WORLD model and contextual variants. The TSWM model is able to predict observations with significantly less error than the WORLD, CWORLD, or DSWM models when evaluated on the set of 100 fractal environment topographies.

We next turn our attention to the kinds of representations being learned within the c latent space. An examination of the activation patterns of units within the latent space suggests that there is a general affinity for structural motifs within an environment. See Figure 54 for examples of cells and their firing properties. We see that certain cells respond to corners of the environment, while others consistently respond to open spaces. Likewise, some respond to long walls, while others respond to dead-ends.
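These firing properties can be visualized by binning each unit's activation by the agent's position and averaging within each location, producing spatial "rate maps" of the kind shown in Figures 54 and 64. The sketch below illustrates this analysis; the input arrays of per-time-step positions and latent activations are assumed to have been collected during random walks through the environment.

```python
# Sketch: computing spatial rate maps for individual latent units by averaging a
# unit's activation within each discretized (x, y) position visited by the agent.
import numpy as np

def rate_maps(positions, activations, grid_size=13):
    # positions: (T, 2) integer grid coordinates; activations: (T, n_units)
    total = np.zeros((grid_size, grid_size, activations.shape[1]))
    visits = np.zeros((grid_size, grid_size, 1))
    for (x, y), act in zip(positions, activations):
        total[x, y] += act
        visits[x, y] += 1
    return total / np.maximum(visits, 1)    # mean activation per location
```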
Collectively, this set of contextual units provides a full picture of the nature of the environment topography, and thus provides the necessary information for the generative model to predict trajectories of latent state representations s, and ultimately to decode imagined observations.

Figure 54: Examples of units from the c latent space of a TSWM model. Columns correspond to the Open Maze, I Maze, S Maze, and C Maze environments. Each row consists of hand-selected units chosen to demonstrate the structural selectivity of the cells. Each cell responds to a specific structural motif in the environment.

V.5 Discussion

The ability to skillfully imagine and navigate novel spaces involves the capacity to adapt not only to changes in the content within an environment, but also to changes in the structure of the environment itself. In this chapter, we introduced a class of generative temporal models augmented with a contextual representation meant to enable this second class of generalization. We demonstrated that there are a number of viable objective functions which can be used to learn such a contextual representation, with both supervised and unsupervised learning methods resulting in a working context representation.

Among supervised learning objective functions, we demonstrated that environment classification and map prediction were both viable for inducing a context representation useful for observation trajectory prediction. Given their limitations and lack of biological plausibility, we then demonstrated that an unsupervised learning signal was a more powerful and biologically plausible option, outperforming the supervised learning alternatives. Finally, we combined insights from these contextual models with the advances introduced in the previous chapter regarding content generalization to propose a Tri-Stream World Model, capable of both content and context generalization.

While the Tri-Stream World Model learns to generalize to unseen environments with both novel content and structure better than the models we compared it to, it is far from perfect in its predictions. We believe that there are a number of promising future approaches which can be taken to enable stronger conditioning of the dynamics model on the contextual representation, such as the use of hypernetworks (Ha, Dai, & Le, 2016).

The ability to navigate and imagine sequences of observations in novel environments depends on the brain's ability to form a representation of what is being observed, where it is being observed, and how a given observation relates spatially to others. Within humans, this final contextual representation can potentially be localized to a number of brain regions, depending on the nature of the task and the level at which 'context' is defined. One meaningful candidate with which to draw a comparison, however, is the parahippocampal gyrus, specifically the parahippocampal place area (PPA). While early research connected the PPA to the identification of places and scenes in the brain (R. Epstein & Kanwisher, 1998), more contemporary work has suggested that the PPA forms a contextual representation of the local scene, useful for navigation (R. A. Epstein, 2008).

CHAPTER VI
HUMAN AND AGENT BEHAVIOR IN COMPLEX ENVIRONMENTS

Throughout this work, we have taken continuous inspiration from biological findings, both behavioral and neural. These findings have guided the classes of models considered, the objective functions used to train them, and the environments and tasks within which they are evaluated.
This has led us to a class of generative temporal models which can demonstrate a number of known properties of the medial temporal lobe. What has been absent from this analysis is a contribution of novel biological results to help validate the models proposed and evaluated in purely theoretical contexts. In this chapter, we turn directly to this issue, and seek to understand the relationship between human goal-directed navigational behavior and that of the class of models we have discussed thus-far. In particular, there is a wealth of research exploring the specific kinds of navigational strategies which humans and other mammals employ. We reviewed much of this in Chapter I, pointing out that these behaviors have largely been grouped into categories of model-free strategies, model-based strategies, and hybrid strategies (Daw et al., 2005; Momennejad & Haynes, 2012). The hybrid strategies being of particular interest for their typical in- stantiation in the successor representation, and successor-based learning (Momennejad et al., 2017). We have utilized both a model-free strategy (Actor-critic) and a hybrid strategy (Successor learning) when modeling the policy learning behaviors demonstrated in previ- ous chapters. Here we seek to compare these two strategies to empirical data from humans 133 performing a novel navigational task. In this chapter, we will demonstrate that humans are able to adapt to changes in environ- ment content and goal location in a rapid fashion, but adapt slower to changes in environ- ment structure, suggestive of a hybrid decision making strategy. Similar results have been shown in relatively artificial contexts such as the two-step task (Momennejad et al., 2017), and in simplified euclidean virtual environments (de Cothi, 2020). Here we present what we believe to be the first work demonstrating a hybrid decision making strategy in complex 3D environments involving surface, goal, and structural changes in the environment during learning. We utilize a visually realistic set of virtual fractal island environments, building on earlier work exploring human navigational performance with respect to varying levels of environment complexity (Juliani, Bies, Boydston, Taylor, & Sereno, 2016). Such virtual environments allow for programmatically varying the environmental appearance, structure, and goal location. Furthermore, fractal topographies are found in various aspects of natural environments (Mandelbrot, 1983), such as coastlines and mountain ranges, and thus are suited to ecologically valid simulation of navigation in the natural world. As a first step, we provide a replication of the main findings of an earlier work utilizing fractal environments, demonstrating that humans are better able to navigate fractal environ- ments with a low-to-mid range value for the dimension, or complexity of the environment. This finding provides further evidence for the fractal fluency theory, which proposes that various aspects of the human visual and cognitive systems are most adapted to this range of complexities (Juliani et al., 2016; Bies, Blanc-Goldhammer, Boydston, Taylor, & Sereno, 2016). Next, we examine the effect various changes on the environment have on the learning process. We do this in order to find evidence for either a model-based, model-free, or hybrid decision making strategy. Much previous work has been dedicated to determining when and how humans utilize different kinds of decision making strategies. 
Some recent work has suggested that humans navigate using a hybrid strategy (Daw et al., 2005; Momennejad & Haynes, 2012), which manifests as selective disruption from various kinds of environmental changes. This hybrid strategy has been modeled in the past using a successor representation learning algorithm (Momennejad et al., 2017; de Cothi, 2020). Here we present partial evidence for a hybrid strategy, with humans showing no disruption from visual changes, apparent disruption from changes to the terrain and goal location, and, importantly, a consistent recovery from changes in goal location but a less consistent recovery from changes in terrain.

We also train a set of artificial agents to perform a modified version of this task using Deep Reinforcement Learning. We propose three different state spaces to use as input to these agents: one based on the pre-computed location and orientation of the agent, one based on the inferred state space s from a TSWM model, and one based on the inferred state space z from the same model. We find that all agents were able to perform the task well, but that only the agents trained using the inferred s latent state showed adaptation to changes in goal consistent with human behavior. Because of the nature of this representation, and because it shares the property of geodesic representation with the successor representation, which has previously been used to demonstrate hybrid behavioral strategies, we can interpret these results as providing an additional approach to the question of human behavioral strategy. Rather than focusing exclusively on the learning algorithm, we demonstrate the value of examining the impact of the underlying representations utilized in learning on the induced behavior.

VI.1 Human Experimental Methods

We recruited subjects for this study from the University of Oregon Human Subject Pool. Participants were granted class credit for participating. Due to restrictions in place as a result of the COVID-19 pandemic, all experiments were conducted online, at the participants' convenience. Each participant was given an hour to complete the study, and most participants reported completing it in less time.

From the perspective of the participant, the study consisted of a website in which a 3D virtual environment was rendered from a first-person perspective. This environment consisted of an island surrounded by water. The participants were instructed that they could control their virtual avatar by moving it in the forward or backward directions, or by rotating the perspective of the avatar to the left or right. These controls were provided via keyboard buttons. Participants were instructed that their task was to follow prompts presented on the screen, and to find a goal location hidden on the island. Figure 55 provides a series of example screenshots of the perspective of the participant while exploring the island.

Figure 55: Example first-person perspective of a participant performing the navigation task. A: Participant begins the trial in a random location on the island. B: Participant navigates around the island looking for the goal location. C: Participant finds the goal location indicator, which grows in size as the participant approaches. D: Participant touches the goal indicator, ending the trial.

When a participant navigates their avatar within a 10-meter radius of the hidden goal, a sphere begins to be rendered, and its size increases the closer the avatar is to the goal location. The trial ends successfully when the avatar makes contact with this sphere.
Alternatively, the trial ends unsuccessfully if the participant goes 30 seconds without contacting the sphere. In either case, at the start of a new trial the location and orientation of the avatar are randomized, and the participant is instructed to find the sphere again.

The experiment consists of six blocks of 20 trials each. In each block, the first ten trials keep all environment properties fixed. After the 10th trial, depending on the condition of the block, one of five changes can take place. Note that at this point participants are notified by a message on-screen that a change may have taken place.

The five possible changes consist of the following: the superficial appearance of the island and water changes, the goal location changes, the terrain changes by adjusting the fractal ground threshold up, the terrain changes by adjusting the fractal ground threshold down, or no change takes place. In order to acquaint the participants with the fact that the environment can change, the first block is always a color change condition, and the next five consist of a random permutation of all five conditions, such that each participant experiences all conditions at least once during the experiment, and the color-change condition twice. See Figure 56 for an example of each of these change conditions.

Figure 56: Visual representation of the four possible change conditions within each block of trials. The green circle represents the goal location. At the beginning of each block, a random topography, goal location, and environment appearance are selected. After 10 trials a change takes place. A: no change. B: visual change. C: goal location change. D: topography change.

In addition to a different change condition, each block of trials contains a randomly selected seed and fractal dimension (D) with which to generate the terrain of the virtual island. We used a set of 30 random seeds, and three different values of D: 1.2, 1.4, and 1.6. These were chosen based on previous research which demonstrated that humans were exceptionally poor at navigating environments with D > 1.6 (Juliani et al., 2016), and as such we would expect that they would be equally poor at this task. We also excluded environments with D = 1, as these would consist of flat ground, and would not be amenable to the manipulations required to impose the terrain change conditions. See Figure 57 for examples of the effect of varying the fractal dimension across three different random seeds.

Figure 57: Examples of three different seeds used to generate environment topographies, each at three different complexity levels. Rows: different random initialization seeds. Columns: different values of D (1.2, 1.4, and 1.6) used to generate topographies.

Lastly, in addition to the fractal dimension and seed, the terrain is generated using a specific threshold value to determine the point at which there is flat ground as opposed to unnavigable terrain. This threshold is either 0.4 or 0.6, corresponding to more terrain and more ground, respectively. The terrain threshold for a given block of trials is selected such that in terrain-less condition blocks, it is 0.4 in the first half of trials and 0.6 in the second half. Likewise, in terrain-more condition blocks, it is 0.6 in the first half and 0.4 in the second half. In other condition blocks the threshold is randomly selected at the beginning of the block and held constant.
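The inverse-Fourier (spectral synthesis) terrain generation and ground-level thresholding used throughout this chapter can be sketched as follows. The spectral exponent, normalization, and threshold handling below are illustrative assumptions; the experiments used a dedicated implementation parameterized by fractal dimension values of D = 1.2 to 1.6.

```python
# Sketch: fractal heightmap via inverse FFT with a power-law spectrum, followed by
# a ground-level threshold. Higher thresholds leave more navigable open ground.
import numpy as np

def fractal_heightmap(size=64, beta=2.0, seed=0):
    rng = np.random.default_rng(seed)
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    f = np.sqrt(fx ** 2 + fy ** 2)
    f[0, 0] = np.inf                               # suppress the DC component
    amplitude = f ** (-beta / 2.0)                 # power spectrum ~ 1 / f^beta
    phases = np.exp(2j * np.pi * rng.random((size, size)))
    height = np.real(np.fft.ifft2(amplitude * phases))
    height -= height.min()
    return height / height.max()                   # normalize to [0, 1]

def navigable_mask(height, threshold=0.6):
    # Cells above the threshold are terrain/walls; cells below are open ground,
    # so raising the threshold (0.4 -> 0.6) yields more open space.
    return height < threshold
```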
VI.2 Environmental Complexity and Human Navigation The complexity of the environment has a meaningful impact on how humans are able to skillfully navigate. This disparity has both been demonstrated in Euclidean environments (O’Neill, 1992), and those composed of fractal topographies (Juliani et al., 2016). Specifi- cally, in the case of fractal topographies, humans demonstrate relative optimal performance in environments with low-to-mid fractal dimension (D = 1.2 to D = 1.4), or complexity. One interpretation of these results is part of a fractal fluency theory whereby the human visual system shows improved information processing for patterns within this range. In addition to navigational performance, this preference has been demonstrated in aesthetic judgments (Taylor, Spehar, Hagerhall, & Van Donkelaar, 2011; Bies, Blanc-Goldhammer, et al., 2016), and discrimination and sensitivity (Spehar et al., 2015). As part of a larger study exploring human navigational strategies during environmental change, we attempt to replicate the finding that humans are able to best navigate environ- ments with a low-to-mid complexity. In the original work from Juliani et al. (2016), two navigation tasks were used, an object finding task and a map reading task. In the first case humans were able to most quickly find the goal object in the low-to-mid complexity envi- ronments, and in the second case they were able to make the most accurate judgments of goal location within the same range. Here we use a slightly different task than map reading or object discovery. We employ a task inspired by the canonical Morris Water Maze (Morris et al., 1982). In this task, the participant must find a hidden goal location within the environment. Once they do so, they then are moved to a random location within the environment, and must return to 139 the location. The speed at which they return in subsequent trials determines the naviga- tional performance. Participants complete a number of these sets of trials on environments with different fractal topographies consisting of varying fractal dimension. We find that participants are able to learn and remember the goal location best in environments with low-to-mid fractal complexity, thus providing additional evidence for the fractal fluency theory. VI.2.1 Results Overall, sixty-six participants completed the study. We removed five participants results from the analyzed data due to insufficient successful completion rates of the task, resulting in a total of sixty-one participants data being analyzed to compile results. We defined insufficient task completion as a failure to locate the goal in 25% or more trials. We believe that such participants were likely distracted or failed to properly attend to the task in the absence of a controlled experimental environment, as the median failure rate was 6%, and 90% of participants had a failure rate of 20% or less. Among the remaining participants, we first turn to the question of understanding their ability to find and remember a goal location in the environment as a function of that en- vironment itself. We find that there is a significant difference between participant perfor- mance in each of the three fractal dimensions (F(2,6097) = 55.263, p < 0.001). Mea- suring performance in time-to-goal (lower is better), we find that performance is best in the D = 1.2 environments (Mean = 12.763,Std = 6.208), followed by D = 1.4 environ- ments (Mean = 13.670,Std = 6.861), followed by the D = 1.6 environments (Mean = 14.998,Std = 7.182). 
Figure 58 presents these results graphically.

Figure 58: Mean human performance by fractal dimension. Lower time-to-goal corresponds to better navigation performance. Error bars correspond to standard error.

To better understand the impact of the fractal dimension on the learning process over time, we further compare performance by fractal dimension at various stages of a given block of trials. We divide each block of 20 trials into four evenly distributed stages (1-5, 6-10, 11-15, 16-20). This allows us to examine how performance changes over time in the environment, by comparing early in a block to later in a block. See Figure 59 for these results.

Figure 59: Mean human performance by fractal dimension in four stages of a single block. Lower time-to-goal corresponds to better navigation performance. Stage 1: Trials 1 - 5. Stage 2: Trials 6 - 10. Stage 3: Trials 11 - 15. Stage 4: Trials 16 - 20. Error bars correspond to standard error.

We find that the main effect of relative performance with respect to fractal dimension holds true. However, we additionally find that the difference between the two lower fractal dimensions (D = 1.2 and D = 1.4) appears to diminish throughout the block, and by Stage 4 is in fact no longer significant (p = 0.339). We expect that by Stage 4 the participant will have the most experience with a given fractal topography, and with this experience participants are able to navigate both D = 1.2 and D = 1.4 environments with similar proficiency. This aligns with the findings of Juliani et al. (2016) and conforms to the prediction of the fractal fluency theory that a value of D = 1.3 would correspond to optimal performance. In contrast, participants show additional difficulties with environments of D = 1.6 even at the end of a full block of trials.

We next examine an additional property of the fractal topographies: the threshold used to determine the ground level. As mentioned above, this level was either set to 0.4 or 0.6, depending on the condition of the block as well as a randomization process which ensured equal exposure to both levels for participants. We ask whether this value has an impact on participant performance in the task, and find that participants are significantly faster at completing a given trial in the 0.6 threshold level trials (Mean = 13.438, Std = 6.805) compared to the 0.4 level trials (Mean = 14.198, Std = 6.836) (t(6098) = −4.352, p < 0.001). We present these results graphically in Figure 60. This result suggests that the additional open space afforded by the higher threshold allowed participants to better localize themselves and the goal location, and to navigate between the two.

Figure 60: Mean human performance per trial by fractal height threshold. Lower time-to-goal corresponds to better navigation performance. Error bars correspond to standard error.

VI.3 Evidence for a Hybrid Behavioral Strategy in Humans

One approach to understanding human decision making has been to classify the 'algorithm' humans use to make decisions as being either a model-free or model-based strategy (Daw et al., 2005).
A model-free strategy is one which conditions the current action on only the current state, whereas a model-based strategy takes additional information into account, typically information present in predicted future states, or explicit memory of past states (Niv, 2009; Sutton & Barto, 2018).

In recent years a third strategy has been proposed, a so-called hybrid decision making strategy, in which key information about future states is cached and re-used, but a model of the entire environment need not be learned (Momennejad & Haynes, 2012). A popular instantiation of this hybrid approach has been the successor representation, and its applicability has been demonstrated both in a simple two-step decision making task (Momennejad et al., 2017), as well as in a more complex navigational task (de Cothi, 2020).

The mark of such a successor-based decision making strategy is the dissociation between adaptation to changes in the goal state versus changes to the structure of the environment. In a successor learning paradigm, the reward function and successor representation are learned separately, and as a result an agent utilizing such a representation can adapt to changes to goal and structure separately. In comparison, a model-free agent would learn a joint value function, from which it is not possible to dissociate these two things. On the other end of the spectrum, a model-based learning agent would dissociate reward and structure, and would be able to adapt to both very rapidly, whereas a successor-based agent would adapt more quickly to changes in goal than to changes in structure. This is due to the underlying successor representation being a statistical estimate, rather than a complete model as in the model-based case.

Here we utilize the experimental design consisting of blocks of trials with different change conditions to determine what kind of decision making strategy best matches human behavior in a visually rich virtual navigational task. We find that humans are able to near-instantly adapt to changes in superficial visual content, but adapt to both changes in goal location and environment structure over a longer time course. Critically, we find that adaptation to goal location takes place faster, and with better final performance, than adaptation to changes in environmental structure, suggesting that a successor-like representation may be guiding human behavior in this task.

VI.3.1 Results

We first compare the overall learning trend to validate that participants are indeed able to find the goal location, remember it, and deploy a successful navigation strategy for returning to it from multiple different locations. This trend is presented visually in Figure 61. We find that participants indeed show signs of learning over the course of each block of trials.

Figure 61: Mean human performance over time within a single block. Error bars correspond to standard error. Trial 11 corresponds to the change trial.

We find that when distributing the trials into four stages (1-5, 6-10, 11-15, and 16-20), there is a significant decrease (p < 0.001) in time-to-goal between the first (Mean = 15.34, Std = 7.52) and second stages (Mean = 13.0, Std = 6.36).
We furthermore find that the change in the environment halfway through the block disrupts performance, with the third stage (Mean = 14.01, Std = 6.96) performance being significantly worse than the second stage (p < 0.001), but not as bad as the first stage (p < 0.001), suggesting that environmental knowledge is retained. Finally, we find that in the fourth stage (Mean = 12.93, Std = 6.10) performance is not significantly different from that of the second stage (p = 0.75), suggesting that participants are able to adapt to the changes.

We next turn our attention to the individual block conditions. We find that the participants' performance was significantly impacted by the condition of the trial block (F(4,6095) = 5.29, p < 0.001). See Figure 62 for a graphical presentation of relative performance in each condition. Between these conditions, we find only four significant differences. The first is between the goal-change and no-change conditions (p = 0.041). The second is between the terrain-less and no-change conditions (p = 0.015). The last two are between terrain-more and visual-change (p = 0.023) and no-change (p = 0.001) conditions.

Figure 62: Mean human performance by block change condition. Lower time-to-goal corresponds to better navigation performance. Error bars correspond to standard error.

These results provide the following insights into the initial question regarding human decision making strategies. Due to the lack of difference between the visual-change and no-change conditions, we see that at the very least humans are not using an entirely reactive model-free policy, since they are on the whole able to ignore the superficial visual changes in the environment. Secondly, we find that the goal-change and terrain-change conditions do indeed disrupt performance compared to the no-change condition. This analysis alone, however, is not enough to provide evidence for either a hybrid or model-based strategy. In order to determine that, we next turn to a more fine-grained analysis of the impact of condition on each stage of a block of trials.

Figure 63: Mean human performance over time by block change condition. Lower time-to-goal corresponds to better navigation performance. Stage 1: Trials 1 - 5. Stage 2: Trials 6 - 10. Stage 3: Trials 11 - 15. Stage 4: Trials 16 - 20. Error bars correspond to standard error.

Figure 63 presents the participant performance over time for each of the block conditions. We find that in the first stage, there is no significant difference between conditions (F(4,1520) = 0.50, p = 0.73). In the second stage, we indeed find a significant difference between conditions (F(4,1520) = 3.142, p = 0.013), with the participants in the terrain-more condition being significantly better at finding the goal location than those in the goal-change (p = 0.04), visual-change (p = 0.001), or terrain-less (p = 0.003) conditions. These results are perhaps not surprising, given that in the previous section we found that participants performed better in environments where the fractal threshold was set higher. As such, participants are better able to learn the task in the condition where all environments contain a high terrain threshold.

We next turn our attention to the second half of the block, and the second two stages.
It is in this set of trials in which the environment change has taken place, that we expect greater effects. Indeed, we find a significant difference between conditions in stage three (F(4,1520) = 9.49, p < 0.001), with two distinct groups of conditions emerging. The first group consists of the no-change and visual-change conditions. The second consists of the goal-change and terrain-less and terrain-more conditions. All p-values between groups are 146 Time-to-goal (seconds) less than p = 0.01 and all p-values within groups are greater than p = 0.1. This suggests that changes in both the terrain and goal location disrupt the participants performance, whereas a change to the visual appearance of the environment does not. Next, we examine the relative performance within each condition in the fourth stage of a given set of trials. We find that there is a significant difference between conditions (F(4,1520) = 4.97, p < 0.001). Comparing the conditions, this difference comes primarily from the terrain-more condition, which participants performed significantly worse on this stage than all other conditions (p < 0.05). This result is again not surprising, since in the second half of trials in a terrain-more condition block, the environment terrain will have a lower threshold, which participants performed worse on overall. We finally turn our attention to the original question regarding evidence for different behavioral strategies. Given the lack of impact from the visual change condition, we can rule out a purely model-free decision making strategy. This leaves two possibilities, a hybrid or model-based strategy. A hybrid strategy based on a successor representation would predict differences in learning between changes in the goal location and changes in the environment structure, with environment structure being more disruptive. We find some evidence for this, with the terrain-more condition resulting in a degraded performance which continues beyond the initial change (stage 3) and persists through the end of the block (stage 4). While we do not find a significant difference between the terrain-less and goal-change conditions in the final two stages, terrain-less should result in an easier environment to navigate, but instead is just as disruptive as the goal change. Taken together, we believe that these results provide some additional evidence for a hybrid decision making strategy based on a successor-like representation which dissociated goal representation from environment representation. In the next section, we again turn to neural network modeling to provide a set of artificial agents with which to compare with the human decision makers presented here. 147 VI.4 Artificial Agent Behavior Varies with State Space Type With an understanding of human behavior within the fractal island environments, we next turn to an examination of the behavior of artificial agents learning the same task. In previous chapters we demonstrated the ability of a series of generative temporal models to adapt to changes in both environmental appearance, content, and structure. Here we evaluate for all these of these together within a single environment, using a goal-directed navigational task as the metric of this performance. As demonstrated in the previous section, humans can rapidly adapt to changes in envi- ronmental appearance, quickly adapt to changes in goal location, and more slowly adapt to changes in environmental structure. 
We demonstrate similar capabilities in artificial agents trained using the latent space of a TSWM on a task similar to that completed by the human participants. We find that the inferred state space s results in agents with the most human-like adaptation to environmental changes. In contrast, agents with a z state space, or agents using the ground-truth agent position and orientation information, show greater disruption from goal changes, and fail to fully adapt to such a change in goal location.

VI.4.1 Modeling Methods

In order to derive the candidate state spaces, we utilize the Tri-Stream World Model as described in Chapter V. Observations from the environment are rendered as 64×64×3 color images, and the model utilizes a CNN encoder to infer the z, s, and c latent representations. The s and z state spaces from this model are then used as the input into separate reinforcement learning models. In addition, we define a third state space using the ground-truth spatial information concerning the agent's position and orientation within the island, and refer to this as the 'Spatial Info' state space.

All configurations of the 2D and 3D gridworld environments contained relatively small state spaces, on the order of tens or hundreds of states. For example, in the case of a 3D gridworld of size 7×7, there is a total of 196 states, counting each orientation and position combination. In contrast, even after discretizing the fractal island environment into 64×64 square meters, and discretizing the orientation into eight directions, there are a total of 32768 possible states. In order to address this state space, which is orders of magnitude larger than that of earlier environments, we turn from linear reinforcement learning to deep reinforcement learning, which has been shown to be successful in learning policies even in environments with state spaces many orders of magnitude larger than those studied here (Silver et al., 2016). Specifically, we utilize a two-layer neural network for the policy and value function, and train this model using the popular Proximal Policy Optimization (PPO) algorithm (Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017). The size of the intermediate layer of this network is set to 128 units. While not biologically inspired, PPO is an actor-critic method, which remains a popular class of method for understanding the hippocampal-striatal axis (O'Doherty et al., 2004; Tessereau et al., 2020).

VI.4.2 Evaluation Methods

In order to tailor the behavioral task to an artificial agent, we implement a series of adjustments to the fractal island environment and task. The observations presented to the agent consist of 64×64×3 color images representing a 90-degree field of view. The agent's action space is simplified compared to that utilized by the human participants. The action space consists of six possible actions: move forward, rotate left, rotate right, move forward and rotate left, move forward and rotate right, and move backward. This simplification is designed to make the learning problem easier, and to avoid the issue of representing the action space as a set of joint probability distributions. We also increase the effect of each of these actions relative to the result of the human participants pressing the keyboard keys, such that each agent action is equivalent to two consecutive button presses by the human. This is similar to the "action repeat" used frequently in agent simulations of ATARI games (Mnih et al., 2015).
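A minimal sketch of this simplified action space and the action-repeat scheme is given below. The specific action indices, the (move, turn) encoding, and the simulator interface (`sim.apply`, `sim.render`) are illustrative assumptions, not the exact implementation used for these agents.

```python
# Sketch: six composite discrete actions, each applied to the simulator twice per
# agent decision so that one agent step matches two human key presses.
ACTIONS = {
    0: (+1,  0),   # move forward
    1: ( 0, -1),   # rotate left
    2: ( 0, +1),   # rotate right
    3: (+1, -1),   # forward + rotate left
    4: (+1, +1),   # forward + rotate right
    5: (-1,  0),   # move backward
}
ACTION_REPEAT = 2

def apply_action(sim, action_id):
    move, turn = ACTIONS[action_id]
    for _ in range(ACTION_REPEAT):            # two consecutive "button presses"
        sim.apply(move=move, turn=turn)
    return sim.render(width=64, height=64)    # 64 x 64 x 3 observation for the agent
```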
We furthermore modify the task itself in order to accommodate the artificial agents. While retaining the hidden-goal aspect of the task, we remove the visible goal indicator used in the human version of the task, and simply provide a +1 reward when the agent reaches within 4 meters of the goal location, at which point the episode ends. This is done to ensure better consistency with all previous modeling experiments where the goal location was hidden. Furthermore, because the model is initially trained in an environment without any goals, the introduction of a visual goal during policy-learning time would result in disturbed latent representations due to out-of-distribution goal object observations.

Because of the extended time required to train a deep reinforcement learning policy compared to a linear policy, we evaluate on an environment derived from a single initialization seed (seed 0 in this case), and a fractal dimension of D = 1.2. We retain the policy of training each agent using five random initialization seeds in order to collect information about the distribution of learned behavior. Due to the agent being initialized with a random behavioral policy, we also provide the agent with the equivalent of double the amount of time each human received per trial, corresponding to 300 agent time-steps. Finally, due to the inherently less efficient learning in the agents compared to humans, we provide the agents with 500 learning trials, with the change condition taking place after trial 250. Furthermore, a unique agent is trained per change condition. With three agent state spaces, five change conditions, and five repetitions per condition-state-space pair, a total of 75 agents are trained in all.

The z and s state spaces were derived from a TSWM trained for 7500 iterations on a dataset of 250 episodes of 50 time-steps each, collected using a semi-random behavioral policy. The model's s and z latent states each consisted of eight gumbel-softmax distributions of size 16 each. As such, both latent state vectors were 128 units each in total. The TSWM was trained using a learning rate of α = 5e−4.

All agents were trained using a learning rate of α = 0.005, an entropy bonus of β = 0.02, which prevents premature convergence to sub-optimal policies during the learning process, and a discount factor of γ = 0.99 to encourage long-term credit assignment. Each agent was trained for 500 episodes of a maximum of 300 time-steps each.

VI.4.3 Results

Before examining the behavior of the agents, it is worthwhile to analyze the learned representations z and s within the virtual fractal environment. Figure 64 presents example activations for these two sets of latent states, gathered from an agent performing a random walk around the environment for 100 episodes of 50 time-steps each. We find that there is no local coherence in activation for units within the z space. In contrast, we find that there is high coherence for units in the s space, many with activation profiles consistent with those of place cells. The nature of these response profiles will be of relevance for interpreting the behavioral results presented below.

Figure 64: Activation profiles of the first sixteen units of the inferred latent s and z spaces in the TSWM model trained on a single fractal island topography.

Turning to the behavior of the trained agents, we first examine the impact of the state space type on agent performance over time during the learning process.
Turning to the behavior of the trained agents, we first examine the impact of the state space type on agent performance over time during the learning process. We find that agents utilizing the ‘Spatial Info,’ ‘Inferred z,’ and ‘Inferred s’ state space types all support learning the task, with each showing a significant decrease in time-to-goal between Stages 1 (Trials 1-125) and 2 (Trials 126-250) of the learning process (all p < 0.001). See Figure 65 for the relative performance of each set of agents during learning.

Having verified that learning does indeed take place for all agents in this task, we turn our attention to the second question, which is whether there are significant differences between agents with different state space types in the extent to which they learn to perform the hidden-goal task. We find that in both pre-change stages, the agents using an ‘Inferred z’ state space (Stage 1: Mean = 111.26, Std = 45.47; Stage 2: Mean = 59.08, Std = 20.31) significantly outperform agents with either the ‘Spatial Info’ (Stage 1: Mean = 154.80, Std = 54.01; Stage 2: Mean = 95.61, Std = 47.08) or the ‘Inferred s’ (Stage 1: Mean = 150.35, Std = 45.60; Stage 2: Mean = 86.63, Std = 24.61) state space (p < 0.001).

Figure 65: Mean agent performance (time-to-goal, in steps) with three different state spaces (Spatial Info, Inferred s, Inferred z). Lower time to goal corresponds to better navigation performance. Stage 1: Trials 1-125. Stage 2: Trials 126-250. Stage 3: Trials 251-375. Stage 4: Trials 376-500. Error bars correspond to standard error.

Next we turn to the post-change conditions, first examining the result of environment changes on agent performance in Stage 3. We find that, averaged over all change conditions, agents with each of the three state types (Stage 3: Inferred s: Mean = 112.78, Std = 69.76; Inferred z: Mean = 123.45, Std = 105.53; Spatial Info: Mean = 119.45, Std = 86.94) are disrupted by the change, measured as a significant difference between Stage 2 and Stage 3 performance (all p < 0.001). Furthermore, we find no significant differences between the amount of disruption experienced by each set of agents (F(2,372) = 0.46, p = 0.63).

Finally, we analyzed the artificial agents’ ability to recover their performance from the disruption caused by the environmental change, measured as a significant difference between Stage 2 and Stage 4 performance. We find that agents utilizing the ‘Inferred s’ (Stage 4: Mean = 64.19, Std = 25.28; p < 0.001) and ‘Inferred z’ (Stage 4: Mean = 81.26, Std = 87.30; p = 0.016) state spaces are able to recover their performance after the change, while the ‘Spatial Info’ (Stage 4: Mean = 86.25, Std = 63.33; p = 0.25) agents are not. We find, however, that the effect in the case of agents utilizing ‘Inferred z’ is relatively small, and further analysis shows that agents utilizing the ‘Inferred s’ state space outperform agents with either the ‘Inferred z’ (p < 0.035) or ‘Spatial Info’ (p < 0.006) state spaces.

We next turn to a more in-depth analysis of the performance of the trained agents within each of the five different change conditions. Doing so allows us to better examine the source of the difference between the ‘Inferred s’ and the other two state spaces in their ability to recover their performance after the environment change. Figure 66 presents the per-condition learning curves for agents utilizing each state type. We find that, qualitatively, the overall trends for the no-change and visual-change conditions are the same for agents with all three state spaces.
In the case of both the no-change and the visual-change conditions, there is no disruption from the change (or lack thereof), and likewise no need to recover from a disruption in performance. In both cases, the mean time-to-goal actually decreases between Stage 2 and Stage 3. We find that this trend matches that of the human participants, where there was no significant disruption from the change in visual appearance, or in the no-change condition.

Figure 66: Mean agent performance within each change condition, utilizing one of three different state spaces (panels: Inferred z, Spatial Info, Inferred s). Lower time to goal corresponds to better navigation performance. Stage 1: Trials 1-125. Stage 2: Trials 126-250. Stage 3: Trials 251-375. Stage 4: Trials 376-500. Error bars correspond to standard error.

Turning to the terrain change conditions, we find different response patterns for each of the three sets of agents. Returning to the results presented from the human participants, disruption was found for both the terrain-more and terrain-less conditions, with terrain-more producing the greater disruption overall. In the set of ‘Spatial Info’ state space agents, the terrain-more condition (Mean = 155.024, Std = 37.35) is more disruptive than the terrain-less condition (Mean = 70.44, Std = 22.82) (p < 0.001). In the ‘Inferred s’ set of agents, the terrain-more (Mean = 115.22, Std = 30.32) and terrain-less (Mean = 111.60, Std = 31.73) conditions are equivalently disruptive (p = 0.737). In the ‘Inferred z’ set of agents, the terrain-less condition (Mean = 149.64, Std = 89.39) is more disruptive than the terrain-more condition (Mean = 84.64, Std = 26.85) (p < 0.001). As such, none of the three sets of agents clearly resembles the human performance profile.

Finally, we turn to the goal-change condition. In this condition, human participants were significantly disrupted by the goal location changing, but recovered to pre-change performance levels by the end of the block of trials. We find that among the artificial agents, the goal-change condition results in significantly greater disruption than all other conditions for all three groups of agents (p < 0.001). Furthermore, we find that the agents utilizing the ‘Inferred s’ state space are able to fully recover from this disruption (Stage 2: Mean = 86.76, Std = 30.24; Stage 4: Mean = 82.99, Std = 26.69; p = 0.764), whereas agents utilizing either the ‘Inferred z’ (Stage 2: Mean = 49.61, Std = 18.11; Stage 4: Mean = 227.95, Std = 99.14; p < 0.001) or ‘Spatial Info’ (Stage 2: Mean = 78.06, Std = 34.30; Stage 4: Mean = 136.81, Std = 90.68; p < 0.001) state spaces are not. We can interpret these results as providing evidence that the agents utilizing the ‘Inferred s’ state space best match the behavior of the human participants.

VI.5 Discussion

In this chapter we sought to understand the behavior of both humans and artificial agents when performing a memory-based navigation task in a complex virtual environment. We found broadly that both humans and agents are able to learn to consistently navigate to a hidden goal location, doing so from a continuous stream of high-dimensional visual images presented to them.

We demonstrated that the structure of the environment has a significant impact on human performance, with lower fractal dimension topographies corresponding to participants reaching the goal location faster and more consistently. This can be seen as a partial replication of the results of Juliani et al.
(2016), who found a similar trend on a set of topographies generated using the same methods described here. It can also be interpreted within the context of a larger body of work suggesting that humans respond to visual stimuli of differing fractal dimension with a general processing preference for stimuli of low-to-mid fractal dimension (Spehar et al., 2015; Bies, Blanc-Goldhammer, et al., 2016).

Using the results from the human participants, we also examined the effect that various changes to the environment had on the learning process. We did this in order to find evidence for either a model-based, model-free, or hybrid decision making strategy. Previous work has found that humans navigate using a hybrid strategy (Daw et al., 2005; Momennejad & Haynes, 2012), which manifests as selective disruption to various kinds of environmental changes. This hybrid strategy has been modeled in the past using a successor representation learning algorithm (Momennejad et al., 2017; de Cothi, 2020). We find partial evidence for a hybrid strategy, with humans showing no disruption for visual changes, apparent disruptions for changes to the terrain and goal location, but importantly a consistent recovery from changes in goal location, and a less consistent recovery from changes in terrain.

Finally, we trained a set of artificial agents to perform a modified version of this task using the PPO algorithm. We proposed three different state spaces to use as input into these agents: one based on the pre-computed location and orientation of the agent, one based on the inferred ‘where’ state space s from a TSWM model, and one based on the inferred ‘what’ state space z from the same model. We found that all agents were able to perform the task well, but that only the agents trained using the inferred s latent state showed adaptation to changes in goal location consistent with human behavior.

While not directly analogous to the traditional means of classifying model-free, model-based, and hybrid behavioral strategies, there is a connection which can be made between these state spaces and these behavioral strategies. Rather than interpreting decision making strategy as being the result of an algorithm, we can interpret it as being the result of the representations utilized in a learning process. Here we compared three separate representations, each conveying different kinds of information to the agent.

The inferred z state space consists of an auto-encoded compressed representation of the observation, and thus can be interpreted as providing the basis for a purely reactive policy mapping in the agent. As such, in order to account for changes to goal location, a complex series of mappings from visual features of the observation to predicted value needs to be re-aligned. In contrast, the inferred s state contains spatial information, but unlike the ‘Spatial Info’ state, it contains information which is adapted to the structure of the environmental topography. Recall that these learned representations contain place-like firing properties, which show signs of a ‘geodesic’ representation, known to be found in place cells (Stachenfeld et al., 2017). We believe that it is this structural accommodation within the representation which enables the agents utilizing this state space to better adapt to all change conditions.
CHAPTER VII

GENERAL DISCUSSION AND CONCLUSION

Humans and other mammals are able to quickly make sense of their environment, and in doing so skillfully navigate their surroundings. The capacity to do so has long been connected to the notion of a cognitive map (Tolman, 1948), which has been proposed to be a major role of the hippocampus (O’Keefe & Nadel, 1978). In parallel, research into human episodic memory led to an understanding of the central role that the hippocampus also plays in memory encoding and retrieval (Tulving, 2002).

Recent theories connect these two functions under the notion of an experience-construction system (Hassabis & Maguire, 2009). In such a system, the dynamics of an environment are learned through experience, and then used to aid both in planning future actions in that environment and in memory recall and imagination. All three of these abilities rest on the capacity of the hippocampus to spontaneously generate coherent trajectories of experience, a phenomenon referred to as replay (Foster, 2017), or preplay in the case of novel sequences (Dragoi & Tonegawa, 2011).

Simply referring to the hippocampus as an experience construction system is, however, insufficient if we fail to define what an experience actually is. In most cases, experiences can be thought of as being tied to the perceptual, affective, and cognitive phenomena at a given delineated period of time. These phenomena are largely associated with the cortex, with an experience of visual perception being associated with the visual cortex, for example. It has been proposed that the role of the hippocampus is to provide a low-dimensional index to these high-dimensional cortical states corresponding to phenomenal experiences (Teyler & DiScenna, 1986). Rather than learning the transition dynamics between entire cortical states, the hippocampus needs only to learn the transition dynamics which govern the indices, which correspond to a kind of grammar (Liu et al., 2018).

The theories of cognitive maps, episodic memory, experience-construction, and memory indexing provide a blueprint for the potential function of the hippocampus, and the broader medial temporal lobe, within mammals. These theories also collectively describe a hypothetical system which bears a strong resemblance to a class of recent neural network models referred to as generative temporal models (Gemici et al., 2017; Ha & Schmidhuber, 2018). In their simplest form, generative temporal models contain a system by which observations from the environment are compressed into a latent state (indexing of episodic memories), a dynamics model is learned over these latent states (experience-construction), and these states and the dynamics model are then used to guide goal-directed action (cognitive map). This work has provided a series of demonstrations by which such capacities can be realized by generative temporal models.
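The three components just listed can be made concrete with a schematic sketch. The following, written with PyTorch, is purely illustrative of the general structure of a generative temporal model (encoder, latent dynamics model, and policy); the module choices, shapes, and names are assumptions made for the example, and do not correspond to any specific model presented in this work.

# Schematic sketch of a generative temporal model: compress observations into a
# latent state (indexing), learn dynamics over latents (experience construction),
# and act from the latent state (cognitive map). All choices are illustrative.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 128))   # observation -> latent
dynamics = nn.GRUCell(input_size=128 + 6, hidden_size=128)            # (latent, action) -> next latent
policy = nn.Linear(128, 6)                                            # latent -> action logits

obs = torch.rand(1, 3, 64, 64)
z = encoder(obs)                                        # compress the observation
logits = policy(z)                                      # act from the latent state
action = torch.distributions.Categorical(logits=logits).sample()
action_onehot = nn.functional.one_hot(action, 6).float()
z_next = dynamics(torch.cat([z, action_onehot], dim=-1), z)   # imagine the next latent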
VII.1 Maps, Memories, and Models

This work has attempted to empirically demonstrate the connection between generative temporal models and the medial temporal lobe by presenting a series of models, and demonstrating their properties with respect to the theories outlined above. Starting with a simple world model, we demonstrated that place and time cells can be learned in an unsupervised fashion, and that these cells show activity patterns which are biased by the behavioral policy of the learning agent. We then demonstrated that dynamics models can be learned using these latent representations, and that the learned model displays temporal community structure, a key element of hippocampal representation (Schapiro et al., 2016). Furthermore, we showed that the process of latent state inference and generation within a generative temporal model can be connected to pattern separation and completion within the hippocampus.

We next turned to the question of goal-directed navigation. Building from the actor-critic theory of learning in the dorsal and ventral striatum (O’Doherty et al., 2004), we demonstrated that the learned latent representations from a generative temporal model are useful as a basis function for performing reinforcement learning. Then, taking inspiration from more contemporary theories of striatal-hippocampal axis function (van der Meer et al., 2010), as well as evidence for a successor representation in CA1 of the hippocampus (Stachenfeld et al., 2017), we demonstrated that the learned latent space from a world model also enables successful learning using the successor representation in a goal-switch task. We next introduced a simple extension to the successor representation algorithm which enables it to be used with an extended class of basis functions, thus speeding up the learning process. Finally, we showed that the dynamics model of the world model additionally improves performance when used to provide Dyna-like updates (Sutton, 1991), which can be seen as a form of experience replay, similar to that which takes place spontaneously within the hippocampus (Pezzulo et al., 2014).

In the following chapter, we turned to an important aspect of cognitive maps: the ability to learn representations which are based on the structure of an environment, and invariant to that environment’s content. In the visual system, context and content information are separated into separate streams, the dorsal and ventral streams, respectively. Within the medial temporal lobe, this separation takes place largely within the entorhinal cortex (Knierim, Neunuebel, & Deshmukh, 2014), with the lateral entorhinal cortex containing content information in the form of object-detecting cells (Deshmukh & Knierim, 2011), and the medial entorhinal cortex containing contextual information in the form of spatially selective grid cells (Hafting et al., 2005).

In order to capture this separation of content and structure, we presented a novel architecture, the Dual Stream World Model (DSWM), which separately encoded incoming observations from the environment into different latent representations. By separating the streams, the ‘what’ latent representation was trained only to auto-encode the observation, while the ‘where’ representation was trained to extract relevant spatial information from the observation. The ‘where’ latent states were then used as keys, and the ‘what’ latent states as values, in a dictionary-based storage and retrieval system. Additionally, a forward model was learned over the ‘where’ information, enabling generalization between environments with shared structure but varying content. We evaluated this model on a set of 2D and 3D environments, demonstrating both the ability to generate more coherent trajectories than a single-stream model, and a more useful representation for goal-directed navigation.
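The dictionary-based storage and retrieval scheme just described can be sketched as follows. This is a minimal illustration of a differentiable-neural-dictionary-style memory in which ‘where’ latents serve as keys and ‘what’ latents as values; the use of cosine similarity as the retrieval kernel, and all names and dimensions, are illustrative assumptions rather than the exact DSWM implementation.

# Minimal sketch of key-value memory: write (where, what) pairs, then read a 'what'
# value as a similarity-weighted average over stored values given a 'where' query.
import torch
import torch.nn.functional as F

class NeuralDictionary:
    def __init__(self):
        self.keys, self.values = [], []          # lists of 1-D tensors

    def write(self, key: torch.Tensor, value: torch.Tensor):
        self.keys.append(key)
        self.values.append(value)

    def read(self, query: torch.Tensor) -> torch.Tensor:
        keys = torch.stack(self.keys)             # (N, key_dim)
        values = torch.stack(self.values)         # (N, value_dim)
        sims = F.cosine_similarity(keys, query.unsqueeze(0), dim=-1)   # (N,)
        weights = torch.softmax(sims / 0.1, dim=0)                     # sharpen the kernel
        return weights @ values                    # similarity-weighted 'what' retrieval

# Example: store one (where -> what) pair, then query with a nearby 'where' latent.
memory = NeuralDictionary()
memory.write(torch.randn(128), torch.randn(128))
recalled_what = memory.read(torch.randn(128))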
After demonstrating the capacity for content generalization with a DSWM, we next turned to the question of structural generalization. We first introduced a context latent variable into the world model, and demonstrated various methods for training this representation. First, we showed that the representation could learn to identify the environment index when trained on a fixed set of environment topographies, and that this representation was then useful for modeling the dynamics within the environments. We then demonstrated that the context representation could be trained to predict a 2D image of the environment topography, and that this led to greater performance when predicting the transition dynamics of environments.

With an understanding of the role that a contextual representation learned using a supervised loss signal could play, we next introduced a fully unsupervised loss function to train the contextual representation, and demonstrated that it outperformed both supervised learning signals. We then extended this contextual representation to the DSWM model, introducing the Tri-Stream World Model (TSWM). We showed that this additional contextual representation can be interpreted as playing a similar role to that of the parahippocampal area, providing spatial context information useful for understanding transition dynamics in novel environments (R. A. Epstein, 2008).

With a fully realized generative temporal model, capable of generalization over changes in environment content, structure, and goal location, we then turned our attention back to the biological systems which inspired this model: humans. We examined human navigation ability in a complex hidden-goal navigation task in a visually rich 3D virtual environment. We first demonstrated that human performance in this task was impacted by the statistical structure of the environment topography in a way consistent with fractal fluency theory (Bies, Blanc-Goldhammer, et al., 2016; Juliani et al., 2016), with participants performing best in environments with low-to-mid level fractal complexity.

We next used the virtual navigation task to assess whether there was evidence for a hybrid decision making strategy when adapting to environment changes, as recently proposed (Momennejad & Haynes, 2012; Momennejad et al., 2017; de Cothi, 2020). We found that humans are able to near-instantly adapt to changes in environment appearance, quickly adapt to changes in goal location, and more slowly adapt to changes in environment topography. Due to the difference between the disruption and adaptation profiles in the goal-change and terrain-change conditions, we can interpret these results as providing some evidence for a hybrid strategy.

In order to better understand these behavioral trends, and their relationship to various learning algorithms, we trained a series of artificial agents using deep reinforcement learning to perform a modified version of the hidden-goal navigation task. We compared three different state space types: one based on the inferred z from a TSWM, one based on the inferred s from the same model, and one based on the pre-computed location and orientation of the agent. We found that only the agents utilizing the inferred s representation showed signs of full adaptation to the goal-change condition, showing a similar performance profile to that of humans.
Given the similarity of the inferred s latent space and the place cells found in mammals, along with the hypothesized role of place cells in guiding navigation, we believe that the specific properties of this representation may be essential to some of the findings suggesting that humans follow a hybrid decision making strategy. This is especially the case when we consider that a key property of both the inferred s latent state and the successor representations utilized to model hybrid decision making strategies is their conformity to the topographical structure of an environment (Stachenfeld et al., 2017). We believe that this analysis can serve as the starting point for a novel approach to determining the kind of behavioral strategy being employed in a task. The nature of the representation being utilized to guide a decision making policy contains important priors about the environment, which are just as important, if not more so, than the learning algorithm being used on top of these representations.

VII.2 Connections to Contemporary Modeling Research

The generative temporal models presented in the preceding chapters can be seen as a small subset of a growing class of models within the literature. While we largely focused on the popular World Model, introduced by Ha and Schmidhuber (2018), there are a number of other relevant models within the field. We chose the World Model for its popularity in the field of machine learning, its simplicity, and because the original work by Ha and Schmidhuber contained the basic building blocks of encoding into a latent state, learning the dynamics of the state, and then using those learned dynamics to learn a behavioral policy.

Since the introduction of the World Model, there have been a number of relevant advancements which have improved the adaptability and scalability of generative temporal models. As mentioned in the introduction, these fall into a few categories, depending on the nature of the task being learned. Two major themes include the introduction of memory augmentation, and the separation of the latent state into multiple separate variables.

In memory augmentation, an additional differentiable memory mechanism is used to store and retrieve latent states. This allows the model to quickly adapt to changes in the environment without the need for backpropagation to update the weights of the network, an often slower and more data-intensive process. Within the literature, the nature of this memory mechanism has varied, with some model architectures adopting a simple differentiable neural dictionary (Pritzel et al., 2017), such as the Generative Temporal Model with Spatial Memory (GTM-SM) (Fraccaro et al., 2018). Others have opted for more complex storage and retrieval mechanisms, such as the Memory-Based Predictor (MBP), which utilizes a differentiable memory store with multi-headed storage and retrieval mechanisms (Wayne et al., 2018). Still other models have sought to rely on more biologically plausible mechanisms, such as a Hopfield network for storage and retrieval of latent states, as in the Tolman-Eichenbaum Machine (TEM) (Whittington et al., 2019).

In the Dual-Stream and Tri-Stream world models, we chose a straightforward implementation of the differentiable neural dictionary (DND), described by Pritzel et al. (2017). We made this choice in order to avoid the more complex storage and look-up mechanisms used in the MBP, as well as to avoid the capacity limitations inherent in Hopfield networks.
As such, the DSWM bears a resemblance to the GTM-SM, however we use a more struc- tured latent representation for the key state, whereas in their work a simple two-dimensional vector is used which corresponds to the x and y coordinates of the agent location. By us- ing an arbitrary learned latent state for the key, our model can be applied to both spatial and non-spatial environments, as well as be applied to downstream linear RL tasks. By not using more complex storage mechanisms with multiple read and write heads, such as the MBP, our model stores more redundant information, and can only access one relevant experience at a time. In the experiments presented in this work, this limitation does not present an issue, but in more realistic environments, where many different memories need to be stored corresponding to different events which take place in the same location, our retrieval mechanism would likely under-perform relative to these other models. The second major theme has been the separation of the latent state into multiple sep- arate latent variables. Doing so enables each latent variable to represent a unique sub- set of the entire latent state, and enables novel model architectures which can deal with each aspect of the state in a unique way. For example, the Recurrent State Space Model (RSSM) (Hafner et al., 2018) introduces both a discrete and stochastic latent state, enabling 164 the model to separately model known and unknown aspects of the environment dynamics independently. Multiple latent states have also been utilized within hierarchical models, where lower-level latent states help to condition higher-level states, such as in the Stochas- tic Latent Actor Critic Model (SLAC) (A. X. Lee, Nagabandi, Abbeel, & Levine, 2019). Aside from separating the variables based on hierarchy or stochasticity, the latent states can also be separated based on the type of environmental information being stored, such as in the TEM (Whittington et al., 2019), and GTM-SM (Fraccaro et al., 2018), where content and context variables are modeled separately, in both cases enables content-based generalization. We chose to focus on the separation of latent states based on content (what), context (how), and location (where). As such, our model is similar to the TEM and GTM-SM models. Doing so enables the model to separately learn to encode each of these three vari- ables from the stream of incoming observations, and as such to generalize over changes within the distributions of each of them. This generalization takes the form of adaptability to changes in the content of an environment with the same structure, as well as the ability to adapt to changes in the structure of the environment itself. Both the TEM and GTM-SM models demonstrate content generalization, but not structural generalization, which is a more difficult problem. While Chapter V presents initial results toward structural general- ization, we note that the improvements from the contextual latent state are relatively mod- est, and the problem remains not fully solved. While both our TSWM and the GTM-SM model demonstrate learning from high-dimensional egocentric visual observations, TEM was demonstrated using only low-dimensional “toy” problems. Despite this, TEM has been shown to match real biological data in terms of both the presence of grid cells as well as place cells, something not shown in our work. Lastly, we want to address the choice of latent distribution in all of the models presented in this work. 
All related contemporary models which we have discussed up to this point have utilized either a gaussian or a deterministic latent state. This likely follows from the great initial success demonstrated by the use of variational auto-encoders with a gaussian latent distribution (Kingma & Welling, 2013). This choice is not unfounded, as it has recently been demonstrated that representations in primate inferotemporal cortex can be modeled using a gaussian VAE trained with a disentanglement loss function (Higgins et al., 2020).

Instead of gaussian distributions, we chose to utilize gumbel-softmax distributions for all latent states within our models. We were motivated in this decision by recent theoretical work which proposed that the medial temporal lobe is involved in clustering high-dimensional state information in the cortex (Mok & Love, 2019). While Mok and Love (2019) utilize a non-differentiable k-means clustering algorithm, we opted for the gumbel-softmax distribution due to its capacity to be used as a latent distribution within a fully differentiable neural network (Jang et al., 2016). The efficacy of a gumbel-softmax distribution for learning a latent state space has also been previously demonstrated in a generative temporal model on a simple T-Maze navigation task (Corneil, Gerstner, & Brea, 2018). We hypothesized that such a distribution would enable a “soft” probabilistic form of state grouping, similar to the potential functional role of time and place cells.

From a practical perspective, we also found that for simple auto-encoding tasks, models utilizing the gumbel-softmax distribution outperform those utilizing a gaussian distribution, as described in Chapter II. More importantly, the use of this distribution allows for the natural development of time or place cells, depending on the nature of the observation stream being learned by the model. We find that this is an inherent property of the distribution, with such cell types consistently developing under a variety of conditions. This is in contrast to recent modeling work showing the development of grid cells, where highly specific hyperparameters and activation functions are needed, and the results are difficult to reproduce (Banino et al., 2018; Sorscher, Mel, Ganguli, & Ocko, 2019).

VII.3 Biological Implications and Open Questions

The connection between the medial temporal lobe and the class of neural networks known as generative temporal models presented here is a starting point for a much more in-depth set of potential future analyses. The connections drawn in this work raise a number of relevant questions regarding the biological plausibility of the models presented here, as well as pose potential research questions which could be explored within the context of empirical biological research.

The first question which can be asked regards the nature of the gumbel-softmax distribution as a basic building block of hippocampal representation. We have demonstrated here that this distribution induces place- and time-like cells when used in a model trained to perform auto-encoding of spatial and temporal information, respectively. This auto-encoding process can be interpreted as being part of the hippocampal indexing system (Teyler & DiScenna, 1986). In particular, there is evidence that the dentate gyrus within the medial temporal lobe contains sparse connections from the entorhinal cortex and the CA3 region of the hippocampus (Leutgeb et al., 2007).
These sparse connections have been described as performing pattern separation, and we find evidence of this in the induced gumbel-softmax representations in the experiments presented here. It would be possible through empirical research to verify whether the representation induced by this pathway matches more precisely the properties of a gumbel-softmax or similar distribution.

The next relevant question extends from our specific implementation of the hippocampal indexing theory within the context of a generative temporal model. By utilizing such a probabilistic model, we inherently arrive at an interpretation of hippocampal representations within the context of inferred and generated latent variables. We have proposed that this can be seen to map onto the dentate gyrus, CA3, and CA1 regions of the hippocampus. The inference process thus taking place is as follows: a latent state is inferred from information within the entorhinal cortex, made sparse (and “pattern separated”) by the dentate gyrus, and represented in CA3, with a prediction of the future latent state (“pattern completion”) generated within CA1. The properties of place cells in CA3 and CA1 provide some evidence for this, as they match the induced distributions from the inferred and generated latent states in the models we have presented here. Indeed, this theory has been recently proposed by simultaneous other work (Sanders, Wilson, & Gershman, 2020). Further biological recording could be done within the DG, CA3, and CA1 regions to determine whether the place cells within these regions best match those of inferred and generated latent variables from a generative temporal model.

Directly related to the question of hippocampal inference is the phenomenon of remapping (Fyhn, Hafting, Treves, Moser, & Moser, 2007), whereby large changes in the structure or appearance of the environment induce a new set of place cells to fire. Within the context of the theoretical models discussed here, this can be seen as a unique state space being instantiated. While we did not directly address remapping in the work presented, there are potential extensions which would make the study of this phenomenon possible. We believe that an unsupervised loss function for the s latent space (as opposed to the supervised spatial loss function demonstrated) within the DSWM could lead to remapping of s. Furthermore, in this work we inferred s using only integration within a recurrent neural network. It is a promising avenue of research to explore the extent to which neural networks which allow for greater conditioning, such as hypernetworks or networks with fast weights, would better adapt to this problem (Ha et al., 2016). Indeed, models such as those presented in Whittington et al. (2019) and Sanders et al. (2020) display remapping-like behavior, thus demonstrating its possibility within a generative temporal model.

Through our analysis of context-augmented generative temporal models, we drew a connection between a latent representation developed to extend the expressibility of the dynamics model and the parahippocampal area. This connection was based on the evidence that the parahippocampal area responds preferentially to stimuli which provide spatial contextual information (R. A. Epstein, 2008). The hypothesis we put forth is that this area integrates these contextual spatial cues from sensory information in order to aid and modify the dynamics model represented within the hippocampus itself.
This would imply a dissociation between state information represented within the hippocampus itself, and contextual information represented within the parahippocampal area. The implicit contextual model we presented here can be seen as one possible implementation of this system, and can serve as the basis for a prediction of biological function.

Moving beyond the medial temporal lobe, we also presented a potential novel model of the hippocampal-striatal axis. The traditional interpretation of this system has been within the context of an actor-critic model, where the hippocampus provides the state representation, the ventral striatum the value estimation, and the dorsal striatum the policy (O’Doherty et al., 2004). Using recent work suggesting that the outgoing CA1 representation from the hippocampus is best modeled using a successor representation (Stachenfeld et al., 2017), we proposed that the ventral striatum may act as a reward representation as opposed to a value representation. Thus, the hippocampal-striatal axis could be thought of as implementing a successor learning algorithm, as opposed to an actor-critic algorithm. We find some additional biological evidence for this in empirical work showing that the ventral striatum learns a more local representation of value, which may be more in line with reward identification or prediction than a traditional notion of value as predicted discounted reward (van der Meer et al., 2010). To fully test this hypothesis, a more detailed study of the role of the dorsal and ventral striatum in learning is necessary, as a successor-based theory of policy learning would make specific predictions about the nature of the induced policy. For example, we have utilized a cosine similarity metric to measure state similarity, and thus determine the reward and value function values. Such a mathematical operation can be directly tested for.
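As an illustration of the kind of prediction such a successor-based account makes, the following sketch derives a reward vector from the cosine similarity between state latents and a goal latent, and reads out value as the product of a successor matrix with that reward vector (V = Mr). The tabular form, the stand-in successor matrix, and all names and dimensions are assumptions made for illustration only, not the model used in this work.

# Illustrative sketch: cosine similarity to a goal latent provides a reward
# representation, and a successor matrix M converts it into a value function.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

n_states, dim = 50, 128
phi = np.random.rand(n_states, dim)          # latent representation of each state
goal = phi[17]                               # latent of the current goal state

# Reward representation: similarity of each state's latent to the goal latent.
reward = np.array([cosine(phi[s], goal) for s in range(n_states)])

# Successor representation over states (an arbitrary stand-in for a learned M).
M = np.eye(n_states) + 0.1 * np.random.rand(n_states, n_states)

value = M @ reward                            # expected discounted future reward
best_next = int(np.argmax(value))             # e.g., a greedy readout over states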
The hippocampus has a number of additional downstream connections beyond the striatum. One group of particular interest is the more frontal regions. Both the medial prefrontal cortex (mPFC) and the orbital frontal cortex (OFC) have been studied in their relationship to the hippocampus. In the case of the former, it has been demonstrated that mPFC provides a goal-like signal to the hippocampal region (Ito, Zhang, Witter, Moser, & Moser, 2015). Such a signal could help determine the specific nature of hippocampal replay events, biasing the generated trajectories towards states known to currently be salient, or of interest to higher-level attention. Such a system was formalized in a model by Erdem and Hasselmo (2012). In the case of the latter region, it has been shown that the OFC is involved in value estimation, and represents states at a more abstracted level than that of the hippocampus (Wikenheiser & Schoenbaum, 2016). While different in their purpose, both regions point to the mutual notion of hierarchical representation. By representing state spaces at higher levels of abstraction than what is possible within the hippocampus, animals are able to reason over longer spans of time, and do so in ways more generalizable to diverse circumstances. We see modeling these dynamics within the context of the generative temporal models presented here as an intriguing future direction.

A last major question is the plausibility of a differentiable neural dictionary and recurrent neural network for representing the latent state generation (“pattern completion”) process induced by the CA3 region of the hippocampus. We recognize that this modeling choice was made largely for computational convenience, rather than biological plausibility. One alternative used in related work is the Hopfield network, which was originally inspired by the hippocampus (Hopfield & Tank, 1985; Whittington et al., 2019). While the original Hopfield networks had restrictive computational limitations with respect to the number of possible patterns which they could store and retrieve, modern versions of these networks enable the storage and retrieval of orders of magnitude more patterns, and have been successfully applied to real-world problems such as natural language text generation (Ramsauer et al., 2020). We believe that such networks represent a promising future direction, both for modeling the hippocampus and as a potential source of additional hypotheses regarding the computational properties of the hippocampus itself.

VII.4 Conclusion

The function of the hippocampus and medial temporal lobe can seem miraculous. As humans, we are able not only to recall a seemingly vast number of memories from our childhood to today, but we are also able to put these memories into context, and create narratives out of them. These narratives are not only re-tellings of the past, but also serve to help create new stories and to plan out possible future events. The coherence of these plans then makes possible skillful navigation and action within our ever-changing world in order to realize them. The models presented here represent a modest attempt at formalizing the system which makes this possible. It is our hope that this formalism can help to provide a common language for future developments within the fields of both neuroscience and machine learning as they continue to develop, providing reciprocal insights into the nature of both biological and artificial intelligence.

REFERENCES CITED

Aggleton, J. P., & Brown, M. W. (1999). Episodic memory, amnesia, and the hippocampal–anterior thalamic axis. Behavioral and brain sciences, 22(3), 425–444.
Atallah, H. E., Lopez-Paniagua, D., Rudy, J. W., & O’Reilly, R. C. (2007). Separate neural substrates for skill learning and performance in the ventral and dorsal striatum. Nature neuroscience, 10(1), 126–131.
Banino, A., Barry, C., Uria, B., Blundell, C., Lillicrap, T., Mirowski, P., . . . others (2018). Vector-based navigation using grid-like representations in artificial agents. Nature, 557(7705), 429.
Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., & Silver, D. (2017). Successor features for transfer in reinforcement learning. In Advances in neural information processing systems (pp. 4055–4065).
Behrens, T. E., Muller, T. H., Whittington, J. C., Mark, S., Baram, A. B., Stachenfeld, K. L., & Kurth-Nelson, Z. (2018). What is a cognitive map? organizing knowledge for flexible behavior. Neuron, 100(2), 490–509.
Bellemare, M., Dabney, W., Dadashi, R., Taiga, A. A., Castro, P. S., Le Roux, N., . . . Lyle, C. (2019). A geometric perspective on optimal representations for reinforcement learning. In Advances in neural information processing systems (pp. 4358–4369).
Bies, A. J., Blanc-Goldhammer, D. R., Boydston, C. R., Taylor, R. P., & Sereno, M. E. (2016).
Aesthetic responses to exact fractals driven by physical complexity. Frontiers in human neuroscience, 10, 210. Bies, A. J., Boydston, C. R., Taylor, R. P., & Sereno, M. E. (2016). Relationship be- tween fractal dimension and spectral scaling decay rate in computer-generated frac- tals. Symmetry, 8(7), 66. Bliss, T. V., & Collingridge, G. L. (1993). A synaptic model of memory: long-term potentiation in the hippocampus. Nature, 361(6407), 31. Burgess, N., Barry, C., & O’keefe, J. (2007). An oscillatory interference model of grid cell firing. Hippocampus, 17(9), 801–812. Bush, D., Barry, C., Manson, D., & Burgess, N. (2015). Using grid cells for navigation. Neuron, 87(3), 507–520. Chadwick, M. J., Hassabis, D., Weiskopf, N., & Maguire, E. A. (2010). Decoding indi- vidual episodic memory traces in the human hippocampus. Current Biology, 20(6), 544–547. Chadwick, M. J., Jolly, A. E., Amos, D. P., Hassabis, D., & Spiers, H. J. (2015). A goal direction signal in the human entorhinal/subicular region. Current Biology, 25(1), 87–92. 172 Climer, J. R., Newman, E. L., & Hasselmo, M. E. (2013). Phase coding by grid cells in unconstrained environments: two-dimensional phase precession. European Journal of Neuroscience, 38(4), 2526–2541. Corneil, D., Gerstner, W., & Brea, J. (2018). Efficient model-based deep reinforcement learning with variational state tabulation. arXiv preprint arXiv:1802.04325. Cueva, C. J., & Wei, X.-X. (2018). Emergence of grid-like representations by train- ing recurrent neural networks to perform spatial localization. arXiv preprint arXiv:1803.07770. Davidson, T. J., Kloosterman, F., & Wilson, M. A. (2009). Hippocampal re- play of extended experience. Neuron, 63(4), 497 - 507. Retrieved from http://www.sciencedirect.com/science/article/pii/S0896627309005820 doi: https://doi.org/10.1016/j.neuron.2009.07.027 Daw, N., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature neuroscience, 8(12), 1704. Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4), 613–624. de Cothi, W. J. (2020). Predictive maps in rats and humans for spatial navigation (Un- published doctoral dissertation). UCL (University College London). Deshmukh, S. S., & Knierim, J. J. (2011). Representation of non-spatial and spatial in- formation in the lateral entorhinal cortex. Frontiers in behavioral neuroscience, 5, 69. Deuker, L., Bellmund, J. L., Schröder, T. N., & Doeller, C. F. (2016). An event map of memory space in the hippocampus. Elife, 5, e16534. Diba, K., & Buzsáki, G. (2007). Forward and reverse hippocampal place-cell sequences during ripples. Nature neuroscience, 10(10), 1241. Doeller, C. F., Barry, C., & Burgess, N. (2010). Evidence for grid cells in a human memory network. Nature, 463(7281), 657. Dragoi, G. (2020). Cell assemblies, sequences and temporal coding in the hippocampus. Current Opinion in Neurobiology, 64, 111–118. Dragoi, G., & Tonegawa, S. (2011). Preplay of future place cell sequences by hippocampal cellular assemblies. Nature, 469(7330), 397. Dragoi, G., & Tonegawa, S. (2013). Distinct preplay of multiple novel spatial experiences in the rat. Proceedings of the National Academy of Sciences, 110(22), 9100–9105. Eichenbaum, H. (2014). Time cells in the hippocampus: a new dimension for mapping memories. Nature Reviews Neuroscience, 15(11), 732. 173 Ekstrom, A. D., Kahana, M. J., Caplan, J. B., Fields, T. A., Isham, E. 
A., Newman, E. L., & Fried, I. (2003). Cellular networks underlying human spatial navigation. Nature, 425(6954), 184. Epstein, R., & Kanwisher, N. (1998). A cortical representation of the local visual environ- ment. Nature, 392(6676), 598–601. Epstein, R. A. (2008). Parahippocampal and retrosplenial contributions to human spatial navigation. Trends in cognitive sciences, 12(10), 388–396. Erdem, U. M., & Hasselmo, M. (2012). A goal-directed spatial navigation model using forward trajectory planning based on grid cells. European Journal of Neuroscience, 35(6), 916–931. Fiete, I. R., Burak, Y., & Brookings, T. (2008). What grid cells convey about rat location. Journal of Neuroscience, 28(27), 6858–6871. Foster, D. (2017). Replay comes of age. Annual review of neuroscience, 40, 581–602. Foster, D., Morris, R., & Dayan, P. (2000). A model of hippocampally dependent naviga- tion, using the temporal difference learning rule. Hippocampus, 10(1), 1–16. Foster, D., & Wilson, M. (2006). Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature, 440(7084), 680. Fraccaro, M., Rezende, D. J., Zwols, Y., Pritzel, A., Eslami, S., & Viola, F. (2018). Gen- erative temporal models with spatial memory for partially observed environments. arXiv preprint arXiv:1804.09401. Frank, L. M., Stanley, G. B., & Brown, E. N. (2004). Hippocampal plasticity across multiple days of exposure to novel environments. Journal of Neuroscience, 24(35), 7681–7689. Friston, K., & Kiebel, S. (2009). Predictive coding under the free-energy principle. Philo- sophical Transactions of the Royal Society B: Biological Sciences, 364(1521), 1211– 1221. Fyhn, M., Hafting, T., Treves, A., Moser, M.-B., & Moser, E. I. (2007). Hippocampal remapping and grid realignment in entorhinal cortex. Nature, 446(7132), 190–194. Garvert, M. M., Dolan, R. J., & Behrens, T. E. (2017). A map of abstract relational knowledge in the human hippocampal–entorhinal cortex. Elife, 6, e17086. Gemici, M., Hung, C.-C., Santoro, A., Wayne, G., Mohamed, S., Rezende, D. J., . . . Lillicrap, T. (2017). Generative temporal models with memory. arXiv preprint arXiv:1702.04649. Gershman, S. J., Moore, C. D., Todd, M. T., Norman, K. A., & Sederberg, P. B. (2012). The successor representation and temporal context. Neural Computation, 24(6), 1553– 1568. 174 Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., . . . others (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471. Gupta, A. S., van der Meer, M. A., Touretzky, D. S., & Redish, A. D. (2010). Hippocampal replay is not a simple function of experience. Neuron, 65(5), 695–705. Ha, D., Dai, A., & Le, Q. V. (2016). Hypernetworks. arXiv preprint arXiv:1609.09106. Ha, D., & Schmidhuber, J. (2018). World models. arXiv preprint arXiv:1803.10122. Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2019). Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603. Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., & Davidson, J. (2018). Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551. Hafting, T., Fyhn, M., Molden, S., Moser, M.-B., & Moser, E. I. (2005). Microstructure of a spatial map in the entorhinal cortex. Nature, 436(7052), 801. Hassabis, D., Kumaran, D., Vann, S. D., & Maguire, E. A. (2007). Patients with hippocam- pal amnesia cannot imagine new experiences. 
Proceedings of the National Academy of Sciences, 104(5), 1726–1731. Hassabis, D., & Maguire, E. A. (2009). The construction system of the brain. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1521), 1263–1271. Hasselmo, M. E. (2009). A model of episodic memory: mental time travel along encoded trajectories using grid cells. Neurobiology of learning and memory, 92(4), 559–573. Higgins, I., Chang, L., Langston, V., Hassabis, D., Summerfield, C., Tsao, D., & Botvinick, M. (2020). Unsupervised deep learning identifies semantic disentanglement in single inferotemporal neurons. arXiv preprint arXiv:2006.14304. Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., . . . Lerchner, A. (2016). beta-vae: Learning basic visual concepts with a constrained variational framework. International conference on learning representations. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780. Hollup, S. A., Molden, S., Donnett, J. G., Moser, M.-B., & Moser, E. I. (2001). Accumu- lation of hippocampal place fields at the goal location in an annular watermaze task. Journal of Neuroscience, 21(5), 1635–1644. Hopfield, J. J., & Tank, D. W. (1985). “neural” computation of decisions in optimization problems. Biological cybernetics, 52(3), 141–152. Horner, A. J., Bisby, J. A., Zotow, E., Bush, D., & Burgess, N. (2016). Grid-like processing of imagined navigation. Current Biology, 26(6), 842–847. 175 Howard, L., Javadi, A., Yu, Y., Mill, R., Morrison, L., Knight, R., . . . Spiers, H. (2014). The hippocampus and entorhinal cortex encode the path and euclidean distances to goals during navigation. Current Biology, 24(12), 1331 - 1340. Retrieved from http://www.sciencedirect.com/science/article/pii/S0960982214005260 doi: https://doi.org/10.1016/j.cub.2014.05.001 Howard, M. W., Fotedar, M. S., Datey, A. V., & Hasselmo, M. E. (2005). The temporal context model in spatial navigation and relational learning: toward a common ex- planation of medial temporal lobe function across domains. Psychological review, 112(1), 75. Howard, M. W., & Kahana, M. J. (2002). A distributed representation of temporal context. Journal of Mathematical Psychology, 46(3), 269–299. Ito, H. T., Zhang, S.-J., Witter, M. P., Moser, E. I., & Moser, M.-B. (2015). A prefrontal–thalamo–hippocampal circuit for goal-directed spatial navigation. Nature, 522(7554), 50. Jang, E., Gu, S., & Poole, B. (2016). Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. Ji, D., & Wilson, M. A. (2007). Coordinated memory replay in the visual cortex and hippocampus during sleep. Nature neuroscience, 10(1), 100. Johnson, A., & Redish, A. D. (2005). Hippocampal replay contributes to within session learning in a temporal difference reinforcement learning model. Neural Networks, 18(9), 1163–1171. Juliani, A. W., Berges, V.-P., Vckay, E., Gao, Y., Henry, H., Mattar, M., & Lange, D. (2018). Unity: A general platform for intelligent agents. arXiv preprint arXiv:1809.02627. Juliani, A. W., Bies, A. J., Boydston, C. R., Taylor, R. P., & Sereno, M. E. (2016). Naviga- tion performance in virtual environments varies with fractal dimension of landscape. Journal of environmental psychology, 47, 155–165. Kable, J. W., & Glimcher, P. W. (2007). The neural correlates of subjective value during intertemporal choice. Nature neuroscience, 10(12), 1625. Karlsson, M. P., & Frank, L. M. (2009). Awake replay of remote experiences in the hippocampus. 
Nature neuroscience, 12(7), 913. Kay, K., Chung, J. E., Sosa, M., Schor, J. S., Karlsson, M. P., Larkin, M. C., . . . Frank, L. M. (2020). Constant sub-second cycling between representations of possible futures in the hippocampus. Cell, 180(3), 552–567. Kim, T., Ahn, S., & Bengio, Y. (2019). Variational temporal abstraction. In Advances in neural information processing systems (pp. 11570–11579). Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. 176 Knierim, J. J., Neunuebel, J. P., & Deshmukh, S. S. (2014). Functional correlates of the lateral and medial entorhinal cortex: objects, path integration and local–global reference frames. Philosophical Transactions of the Royal Society B: Biological Sciences, 369(1635), 20130369. Kravitz, D. J., Saleem, K. S., Baker, C. I., & Mishkin, M. (2011). A new neural framework for visuospatial processing. Nature Reviews Neuroscience, 12(4), 217. Lansink, C. S., Goltstein, P. M., Lankelma, J. V., McNaughton, B. L., & Pennartz, C. M. (2009). Hippocampus leads ventral striatum in replay of place-reward information. PLoS biology, 7(8), e1000173. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. nature, 521(7553), 436. Lee, A. K., & Wilson, M. A. (2002). Memory of sequential experience in the hippocampus during slow wave sleep. Neuron, 36(6), 1183–1194. Lee, A. X., Nagabandi, A., Abbeel, P., & Levine, S. (2019). Stochastic latent actor- critic: Deep reinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953. Lehn, H., Steffenach, H.-A., van Strien, N. M., Veltman, D. J., Witter, M. P., & Håberg, A. K. (2009). A specific role of the human hippocampus in recall of temporal sequences. Journal of Neuroscience, 29(11), 3475–3484. Leutgeb, J. K., Leutgeb, S., Moser, M.-B., & Moser, E. I. (2007). Pattern separation in the dentate gyrus and ca3 of the hippocampus. science, 315(5814), 961–966. Lever, C., Burton, S., Jeewajee, A., O’Keefe, J., & Burgess, N. (2009). Boundary vec- tor cells in the subiculum of the hippocampal formation. Journal of Neuroscience, 29(31), 9771–9777. Liu, K., Sibille, J., & Dragoi, G. (2018). Generative predictive codes by multiplexed hippocampal neuronal tuplets. Neuron, 99(6), 1329–1341. Louie, K., & Wilson, M. A. (2001). Temporally structured replay of awake hippocampal ensemble activity during rapid eye movement sleep. Neuron, 29(1), 145–156. MacDonald, C. J., Lepage, K. Q., Eden, U. T., & Eichenbaum, H. (2011). Hippocampal “time cells” bridge the gap in memory for discontiguous events. Neuron, 71(4), 737– 749. Mandelbrot, B. B. (1983). The fractal geometry of nature (Vol. 173). WH freeman New York. Marr, D., Willshaw, D., & McNaughton, B. (1991). Simple memory: a theory for archicor- tex. In From the retina to the neocortex (pp. 59–128). Springer. Mattar, M. G., & Daw, N. D. (2018). Prioritized memory access explains planning and hippocampal replay. Nature Neuroscience, 21(11), 1609. 177 McNaughton, B. L., Battaglia, F. P., Jensen, O., Moser, E. I., & Moser, M.-B. (2006). Path integration and the neural basis of the’cognitive map’. Nature Reviews Neuroscience, 7(8), 663. McNaughton, B. L., Chen, L., & Markus, E. (1991). “dead reckoning,” landmark learn- ing, and the sense of direction: a neurophysiological and computational hypothesis. Journal of Cognitive Neuroscience, 3(2), 190–202. Mehta, M. R., Quirk, M. C., & Wilson, M. A. (2000). Experience-dependent asymmetric shape of hippocampal receptive fields. Neuron, 25(3), 707–715. 
Mittelstaedt, M.-L., & Mittelstaedt, H. (1980). Homing by path integration in a mammal. Naturwissenschaften, 67(11), 566–567. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., . . . others (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529. Mok, R. M., & Love, B. C. (2019). A non-spatial account of place and grid cells based on clustering models of concept learning. Nature communications, 10(1), 1–9. Momennejad, I., & Haynes, J.-D. (2012). Human anterior prefrontal cortex encodes the ‘what’and ‘when’of future intentions. Neuroimage, 61(1), 139–148. Momennejad, I., Otto, A. R., Daw, N. D., & Norman, K. A. (2018). Offline replay supports planning in human reinforcement learning. eLife, 7, e32548. Momennejad, I., Russek, E. M., Cheong, J. H., Botvinick, M. M., Daw, N., & Gershman, S. J. (2017). The successor representation in human reinforcement learning. Nature Human Behaviour, 1(9), 680. Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine learning, 13(1), 103–130. Morris, R. G., Garrud, P., Rawlins, J. a., & O’Keefe, J. (1982). Place navigation impaired in rats with hippocampal lesions. Nature, 297(5868), 681. Muller, R. U., & Kubie, J. L. (1987). The effects of changes in the environment on the spatial firing of hippocampal complex-spike cells. Journal of Neuroscience, 7(7), 1951–1968. Muller, R. U., Kubie, J. L., & Ranck, J. B. (1987). Spatial firing patterns of hippocampal complex-spike cells in a fixed environment. Journal of Neuroscience, 7(7), 1935– 1950. Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology, 53(3), 139–154. Nyberg, L., Habib, R., McIntosh, A. R., & Tulving, E. (2000). Reactivation of encoding- related brain activity during memory retrieval. Proceedings of the National Academy of Sciences, 97(20), 11120–11124. 178 O’Doherty, J., Dayan, P., Schultz, J., Deichmann, R., Friston, K., & Dolan, R. J. (2004). Dissociable roles of ventral and dorsal striatum in instrumental conditioning. science, 304(5669), 452–454. O’Keefe, J. (1976). Place units in the hippocampus of the freely moving rat. Experimental neurology, 51(1), 78–109. O’Keefe, J., & Nadel, L. (1978). The hippocampus as a cognitive map. Oxford: Clarendon Press. Ólafsdóttir, H. F., Barry, C., Saleem, A. B., Hassabis, D., & Spiers, H. J. (2015). Hip- pocampal place cells construct reward related sequences through unexplored space. Elife, 4, e06063. O’Neill, M. J. (1992). Effects of familiarity and plan complexity on wayfinding in simu- lated buildings. Journal of Environmental Psychology, 12(4), 319–327. Pastalkova, E., Itskov, V., Amarasingham, A., & Buzsáki, G. (2008). Internally generated cell assembly sequences in the rat hippocampus. Science, 321(5894), 1322–1327. Peng, J., & Williams, R. J. (1993). Efficient learning and planning within the dyna frame- work. Adaptive Behavior, 1(4), 437–454. Pennartz, C., Ito, R., Verschure, P., Battaglia, F., & Robbins, T. (2011). The hippocampal– striatal axis in learning, prediction and goal-directed behavior. Trends in neuro- sciences, 34(10), 548–559. Peters, J., & Büchel, C. (2010). Neural representations of subjective reward value. Be- havioural brain research, 213(2), 135–141. Pezzulo, G., Kemere, C., & Van Der Meer, M. A. (2017). Internally generated hippocampal sequences as a vantage point to probe future-oriented cognition. Annals of the New York Academy of Sciences, 1396(1), 144–165. 
Pezzulo, G., van der Meer, M. A., Lansink, C. S., & Pennartz, C. M. (2014). Internally generated sequences in learning and executing goal-directed behavior. Trends in Cognitive Sciences, 18(12), 647–657.

Pfeiffer, B. E., & Foster, D. J. (2013). Hippocampal place-cell sequences depict future paths to remembered goals. Nature, 497(7447), 74.

Poucet, B., & Hok, V. (2017). Remembering goal locations. Current Opinion in Behavioral Sciences, 17, 51–56. Retrieved from http://www.sciencedirect.com/science/article/pii/S2352154616302832 doi: 10.1016/j.cobeha.2017.06.003

Preston, A. R., & Eichenbaum, H. (2013). Interplay of hippocampus and prefrontal cortex in memory. Current Biology, 23(17), R764–R773.

Pritzel, A., Uria, B., Srinivasan, S., Badia, A. P., Vinyals, O., Hassabis, D., ... Blundell, C. (2017). Neural episodic control. In Proceedings of the 34th International Conference on Machine Learning (Vol. 70, pp. 2827–2836).

Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., & Fried, I. (2005). Invariant visual representation by single neurons in the human brain. Nature, 435(7045), 1102–1107.

Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Gruber, L., ... others (2020). Hopfield networks is all you need. arXiv preprint arXiv:2008.02217.

Russek, E. M., Momennejad, I., Botvinick, M. M., Gershman, S. J., & Daw, N. D. (2017). Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLoS Computational Biology, 13(9), e1005768.

Samsonovich, A., & McNaughton, B. L. (1997). Path integration and cognitive mapping in a continuous attractor neural network model. Journal of Neuroscience, 17(15), 5900–5920.

Sanders, H., Wilson, M. A., & Gershman, S. J. (2020). Hippocampal remapping as hidden state inference. eLife, 9, e51140.

Sargolini, F., Fyhn, M., Hafting, T., McNaughton, B. L., Witter, M. P., Moser, M.-B., & Moser, E. I. (2006). Conjunctive representation of position, direction, and velocity in entorhinal cortex. Science, 312(5774), 758–762.

Schapiro, A. C., Rogers, T. T., Cordova, N. I., Turk-Browne, N. B., & Botvinick, M. M. (2013). Neural representations of events arise from temporal community structure. Nature Neuroscience, 16(4), 486–492.

Schapiro, A. C., Turk-Browne, N. B., Norman, K. A., & Botvinick, M. M. (2016). Statistical learning of temporal community structure in the hippocampus. Hippocampus, 26(1), 3–8.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Scoville, W. B., & Milner, B. (1957). Loss of recent memory after bilateral hippocampal lesions. Journal of Neurology, Neurosurgery, and Psychiatry, 20(1), 11.

Silva, D., Feng, T., & Foster, D. J. (2015). Trajectory events across hippocampal place cells require previous experience. Nature Neuroscience, 18(12), 1772.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... others (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484.

Solstad, T., Boccara, C. N., Kropff, E., Moser, M.-B., & Moser, E. I. (2008). Representation of geometric borders in the entorhinal cortex. Science, 322(5909), 1865–1868.

Sorscher, B., Mel, G., Ganguli, S., & Ocko, S. (2019). A unified theory for the origin of grid cells through the lens of pattern formation. In Advances in Neural Information Processing Systems (pp. 10003–10013).
Spehar, B., Wong, S., van de Klundert, S., Lui, J., Clifford, C. W. G., & Taylor, R. (2015). Beauty and the beholder: The role of visual sensitivity in visual preference. Frontiers in Human Neuroscience, 9, 514.

Spiers, H. J., & Maguire, E. A. (2007). A navigational guidance system in the human brain. Hippocampus, 17(8), 618–626. Retrieved from https://onlinelibrary.wiley.com/doi/abs/10.1002/hipo.20298 doi: 10.1002/hipo.20298

Stachenfeld, K. L., Botvinick, M. M., & Gershman, S. J. (2017). The hippocampus as a predictive map. Nature Neuroscience, 20(11), 1643.

Sun, C., Yang, W., Martin, J., & Tonegawa, S. (2020). Hippocampal neurons represent events as transferable units of experience. Nature Neuroscience, 23(5), 651–663.

Sutton, R. S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4), 160–163.

Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In Learning and computational neuroscience: Foundations of adaptive networks (pp. 497–537). MIT Press.

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.

Tanji, J., & Hoshi, E. (2001). Behavioral planning in the prefrontal cortex. Current Opinion in Neurobiology, 11(2), 164–170.

Taube, J. S., Muller, R. U., & Ranck, J. B. (1990). Head-direction cells recorded from the postsubiculum in freely moving rats. I. Description and quantitative analysis. Journal of Neuroscience, 10(2), 420–435.

Taylor, R., Spehar, B., Hagerhall, C., & Van Donkelaar, P. (2011). Perceptual and physiological responses to Jackson Pollock's fractals. Frontiers in Human Neuroscience, 5, 60.

Tessereau, C., O'Dea, R., Coombes, S., & Bast, T. (2020). Reinforcement learning approaches to hippocampus-dependent flexible spatial navigation. bioRxiv.

Teyler, T. J., & DiScenna, P. (1986). The hippocampal memory indexing theory. Behavioral Neuroscience, 100(2), 147.

Tolman, E. C. (1948). Cognitive maps in rats and men. Psychological Review, 55(4), 189.

Tsao, A., Sugar, J., Lu, L., Wang, C., Knierim, J. J., Moser, M.-B., & Moser, E. I. (2018). Integrating time from experience in the lateral entorhinal cortex. Nature, 561(7721), 57–62.

Tulving, E. (2002). Episodic memory: From mind to brain. Annual Review of Psychology, 53(1), 1–25.

Tulving, E., & Markowitsch, H. J. (1998). Episodic and declarative memory: Role of the hippocampus. Hippocampus, 8(3), 198–204.

van der Meer, M. A., Johnson, A., Schmitzer-Torbert, N. C., & Redish, A. D. (2010). Triple dissociation of information processing in dorsal striatum, ventral striatum, and hippocampus on a learned spatial decision task. Neuron, 67(1), 25–32.

Van Essen, D. C., & Maunsell, J. H. (1983). Hierarchical organization and functional streams in the visual cortex. Trends in Neurosciences, 6, 370–375.

Van Hoesen, G. W. (1982). The parahippocampal gyrus: New observations regarding its cortical connections in the monkey. Trends in Neurosciences, 5, 345–350.

Vikbladh, O. M., Meager, M. R., King, J., Blackmon, K., Devinsky, O., Shohamy, D., ... Daw, N. D. (2019). Hippocampal contributions to model-based planning and spatial memory. Neuron.

Viswanathan, G. M., Da Luz, M. G., Raposo, E. P., & Stanley, H. E. (2011). The physics of foraging: An introduction to random searches and biological encounters. Cambridge University Press.
Wayne, G., Hung, C.-C., Amos, D., Mirza, M., Ahuja, A., Grabska-Barwinska, A., ... others (2018). Unsupervised predictive memory in a goal-directed agent. arXiv preprint arXiv:1803.10760.

Whittington, J. C., Muller, T. H., Mark, S., Chen, G., Barry, C., Burgess, N., & Behrens, T. E. (2019). The Tolman-Eichenbaum machine: Unifying space and relational memory through generalisation in the hippocampal formation. bioRxiv, 770495.

Wikenheiser, A. M., & Schoenbaum, G. (2016). Over the river, through the woods: Cognitive maps in the hippocampus and orbitofrontal cortex. Nature Reviews Neuroscience, 17(8), 513.

Williams, R. J., & Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4), 490–501.

Wilson, M. A., & McNaughton, B. L. (1994). Reactivation of hippocampal ensemble memories during sleep. Science, 265(5172), 676–679.

Yassa, M. A., & Stark, C. E. (2011). Pattern separation in the hippocampus. Trends in Neurosciences, 34(10), 515–525.

Zaehle, T., Jordan, K., Wüstenberg, T., Baudewig, J., Dechent, P., & Mast, F. W. (2007). The neural basis of the egocentric and allocentric spatial frame of reference. Brain Research, 1137, 92–103.