SEMANTIC SEGMENTATION OF SATELLITE IMAGERY USING POSITIVE AND UNLABELED LEARNING

by

MOHAMMAD ESHGHI

A DISSERTATION

Presented to the Department of Geography and the Division of Graduate Studies of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

June 2022

DISSERTATION APPROVAL PAGE

Student: Mohammad Eshghi

Title: Semantic Segmentation of Satellite Imagery Using Positive and Unlabeled Learning

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Geography by:

Prof. Amy Lobben, Chair
Prof. Hui (Henry) Luan, Core Member
Prof. Lucas Silva, Core Member
Prof. Daniel Lowd, Institutional Representative

and

Krista Chronister, Vice Provost for Graduate Studies

Original approval signatures are on file with the University of Oregon Division of Graduate Studies.

Degree awarded June 2022

© 2022 Mohammad Eshghi
This work is licensed under a Creative Commons Attribution 4.0 License.

DISSERTATION ABSTRACT

Mohammad Eshghi
Doctor of Philosophy
Department of Geography
June 2022

Title: Semantic Segmentation of Satellite Imagery Using Positive and Unlabeled Learning

Recent advances of deep learning in computer vision have revolutionized digital image processing, and the adoption of vision-based deep learning models in remote sensing has been promising. However, despite their success in remote sensing image processing, deep learning models suffer from labeled-data scarcity, that is, the lack of large-scale labeled datasets. This drawback deserves attention because manually labeling data is labor-intensive and time-consuming. In addition, in many applications the only information of interest is the presence of an application-specific landcover or object within an image, and it is therefore not reasonable to spend extra time and cost to fully label the rest of the image. Remote sensing image processing thus benefits greatly from positive and unlabeled learning, a more general setting of semi-supervised learning that addresses the availability of only a few labeled examples of the presence of the application-specific event in a dataset.

This dissertation investigates the possibility of leveraging transfer learning and ensemble learning frameworks in a positive and unlabeled learning setting for semantic segmentation of satellite imagery. First, I create positive and unlabeled satellite imagery datasets from an available binary positive and negative dataset to be used for model development. Next, I develop a deep homogeneous transfer positive and unlabeled learning model, which utilizes two distinct positive-and-negative and positive-and-unlabeled satellite imagery datasets acquired by the same satellite sensor (i.e. similar-domain images). Building upon this, I extend the homogeneous aspect of the developed model to the heterogeneous case. In doing so, the developed model is able not only to learn from similar-domain satellite and non-satellite images, but also to leverage satellite images from dissimilar domains. In the next stage, I develop a deep ensemble positive and unlabeled learning model in order to incorporate the advantages of multiple different models for the same task. Then, I investigate the possibility of mixing the proposed models in the transfer learning and ensemble learning frameworks for PU learning. Finally, I conclude this dissertation by discussing possible next steps for future work.
CURRICULUM VITAE

NAME OF AUTHOR: Mohammad Eshghi

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene, OR, USA
K.N. Toosi University of Technology, Tehran, Iran

DEGREES AWARDED:
Doctor of Philosophy, Geography, 2022, University of Oregon
Master of Science, Computer and Information Science, 2021, University of Oregon
Master of Science, Geographic Information Systems Engineering, 2016, K.N. Toosi University of Technology
Bachelor of Science, Geodesy and Geomatics Engineering, 2013, K.N. Toosi University of Technology

AREAS OF SPECIAL INTEREST:
Machine Learning
GIScience
Computer Vision
Remote Sensing

PROFESSIONAL EXPERIENCE:
Graduate Employee, University of Oregon, Eugene, OR, USA, 2016-2022
Data Scientist (Intern), Apple, Cupertino, CA, USA, 2021
Machine Learning Research Scientist (Intern), Radiant Earth Foundation, Washington, D.C., USA, 2021
Data Engineer (Intern), Apple, Cupertino, CA, USA, 2020
Instructor, Machine Learning (GeoAI), University of Oregon, Eugene, OR, USA, 2020
Full-Stack Web Developer and Data Analyst, University of Oregon, Eugene, OR, USA, 2017-2020
Instructor, GIScience-I, University of Oregon, Eugene, OR, USA, 2017

GRANTS, AWARDS AND HONORS:
Graduate Research Fellow (National Science Foundation funded), National Socio-Environmental Synthesis Center (SESYNC), 2021
Mather Graduate Fellowship, Department of Geography, University of Oregon, 2021
Graduate Research Pursuit (National Science Foundation funded), National Socio-Environmental Synthesis Center (SESYNC), 2019-2021
Travel grant, Summer Institute on Cyberinfrastructure for Socioenvironmental Synthesis (National Science Foundation funded), National Socio-Environmental Synthesis Center (SESYNC), 2019
Travel grant, Summer School on Reproducible Problem Solving with CyberGIS and Geospatial Data Science (National Science Foundation funded), American Association of Geographers-University Consortium for Geographic Information Science (AAG-UCGIS), 2019
Rippey Graduate Student Research Award, Department of Geography, University of Oregon, 2019, 2021
Graduate Student Research Travel Award, Department of Geography, University of Oregon, 2018

PUBLICATIONS:
Eshghi, M., & Schmidtke, H. R. (2018). An approach for safer navigation under severe hurricane damage. Journal of Reliable Intelligent Environments, 4, 161–185.

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my advisor, Prof. Amy Lobben, whose consistent support and encouragement I will never forget. Furthermore, I am thankful to and grateful for my parents and my sisters, whose constant, limitless, and unconditional love and support keep me motivated and confident. My accomplishments and success are because of their sacrifices for me. Last but not least, I thank my dear partner, Emily, for her love, patience, and support.

To my mother and father, to whom I owe everything. To my sisters and my partner, for being the driving force in my life.

TABLE OF CONTENTS

Chapter

I. INTRODUCTION
1.1. Dissertation Scope & Contributions
1.1.1. Deep Transfer Positive and Unlabeled Learning
1.1.2. Deep Ensemble Positive and Unlabeled Learning
1.2. Dissertation Outline
II. LITERATURE REVIEW
2.1. Positive and Unlabeled Learning
2.2. Transfer Learning
2.3. Ensemble Learning
III. DATASETS
3.1. Standard Open Image Datasets
3.2. Massachusetts Buildings Dataset
3.3. Inria Aerial Image Dataset
3.3.1. Positive and Unlabeled Dataset
3.3.1.1. Homogeneous Case
IV. DEEP TRANSFER POSITIVE AND UNLABELED LEARNING
4.1. Background
4.1.1. Semi-supervised Learning
4.1.2. Weakly-supervised Learning
4.1.3. Semi-supervised Domain Adaptation
4.1.4. Unsupervised Domain Adaptation
4.1.5. The motivation behind this work
4.2. Baselines
4.2.1. Homogeneous case
4.2.2. Heterogeneous case
4.3. Methodologies
4.3.1. Homogeneous case
4.3.1.1. Model architecture: UNET
4.3.1.2. Model backbone: VGG16
4.3.1.3. Learning module: supervised
4.3.1.4. Learning module: self-training
4.3.1.5. Learning module: positive-unlabeled
4.3.1.6. Performance metric: Accuracy
4.3.1.7. Performance metric: IoU
4.3.1.8. Performance metric: F1-score
4.3.2. Heterogeneous case
4.3.2.1. Model architecture: DeepLabV2
4.3.2.2. Model backbone: ResNet-101
4.3.2.3. Learning module: consistency-training
4.4. Results
4.4.1. Homogeneous case
4.4.1.1. Conclusions
4.4.2. Heterogeneous case
4.4.2.1. Ablation Study
4.4.2.2. Conclusions
4.5. Conclusion
V. DEEP ENSEMBLE POSITIVE AND UNLABELED LEARNING
5.1. Background
5.2. Methodologies
5.2.1. Dropout
5.2.2. EMA
5.2.3. SWA
5.2.4. Feature ensemble
5.2.5. Model ensemble
5.2.6. TreeNet
5.2.7. Contextual ensemble
5.2.8. Multi-scale ensemble
5.3. Results
5.4. Conclusion
VI. CONCLUSIONS & FUTURE WORK
6.1. Future Work
REFERENCES CITED

LIST OF FIGURES

1. From left to right: fully-labeled binary data and classifier; one-class labeled data and classifier; partially-labeled binary data plus unlabeled data and a semi-supervised classifier; and positive and unlabeled data and classifier. Blue points are positive data, red points are negative data; bright colors are labeled data and pale colors are unlabeled data.
2. Transfer learning is a framework for reusing the knowledge extracted in a large-data source domain to train a model in a limited-data target domain.
3. Sample images from the ImageNet-ILSVRC dataset (Russakovsky et al., 2015).
4. (a) Left and right columns show images and their corresponding labels, respectively. From top to bottom: a sample from the Massachusetts buildings dataset (Mnih, 2013) and samples from its derived patches, one from the middle and the rest from the top-left, top-right, bottom-left, and bottom-right corners, respectively, showing the mirroring effect when not enough pixels are available for patch creation. (b) Selected final patches: images and their corresponding labels in the left and right columns.
5. (a) Left and right columns show images and their corresponding labels, respectively. From top to bottom: a sample from the Inria Aerial Image Dataset (Maggiori, Tarabalka, Charpiat, & Alliez, 2017) and samples from its derived patches, one from the middle and the rest from the top-left, top-right, bottom-left, and bottom-right corners, respectively, showing the mirroring effect when not enough pixels are available for patch creation. (b) Selected final patches: images and their corresponding labels in the left and right columns.
6. PU data is created using a moving window over the labels corresponding to each image within the train set. At each location of the moving window, pixels that belong to the positive class are changed to unlabeled with complete randomness (i.e. no selection bias).
7. Left and right columns show images and their corresponding PU labels created by the moving-window strategy.
8. Different PU train datasets created from the Inria Aerial Image dataset with different missing probabilities for the positive class, all of which satisfy the SCAR assumption—graphs in black, blue, green, yellow, red, and orange show the distribution of the positive class and of the positive data in the PU5, PU6, PU7, PU8, and PU9 datasets, respectively.
9. Homogeneous transfer positive and unlabeled learning architecture.
10. UNET architecture; the left side is the contracting module (i.e. encoder), the right side is the expansive module (i.e. decoder), the brown square is the segmentation output, and the grey squares are intermediary feature maps to be concatenated to their corresponding upsampled layers.
11. The electromagnetic spectrum captured by satellite remote sensing (shown as SRS in the image). Image adopted from Pettorelli, Schulte to Bühne, C. Shapiro, and Glover-Kapfer (2018).
12. The misalignments of different remote sensors across different bands. Image adopted from Rocchio and Barsi (n.d.).
13. Heterogeneous transfer positive and unlabeled learning architecture.
14. DeepLabV2 architecture; top-to-bottom are three repetitions of the left-to-right path in each row over 1.0, 0.75, and 0.5 downscaled input images; the left-to-right paths are multi-scale feature generators with different atrous rates from 6 to 24; the final feature map is upsampled to generate an output of the same size as the input.
15. A sample building block showing the idea of residual learning. Figure adopted from K. He, Zhang, Ren, and Sun (2016).
16. Residual learning block: 2- versus 3-layer. Figure adopted from K. He et al. (2016).
17. The training loss (in red) and validation loss (in green) for the Target-PU model without ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.
18. The training loss (in red) and validation loss (in green) for the Target-PU model with ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.
19. The training loss (in red) and validation loss (in green) for the Target-FT model without ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.
20. The training loss (in red) and validation loss (in green) for the Target-FT model with ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.
21. The training loss (in red) and validation loss (in green) for the Seamless-U model without ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.
22. The training loss (in red) and validation loss (in green) for the Seamless-U model with ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.
23. The training loss (in red) and validation loss (in green) for the Seamless-PU model without ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.
24. The training loss (in red) and validation loss (in green) for the Seamless-PU model with ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.
25. Left-to-right: sample image patches from the homogeneous PU Inria dataset, the corresponding ground truth, and predictions from four models trained on the PU5 case: (i) Target-PU without pre-trained weights, (ii) Target-FT without pre-trained weights, (iii) Seamless-U with pre-trained weights, and (iv) Seamless-PU with pre-trained weights.
26. Left-to-right: sample image patches from the homogeneous PU Inria dataset, the corresponding ground truth, and predictions from four models trained on the PU9 case: (i) Target-PU without pre-trained weights, (ii) Target-FT without pre-trained weights, (iii) Seamless-U with pre-trained weights, and (iv) Seamless-PU with pre-trained weights.
27. Left-to-right: sample image patches from the heterogeneous PU Inria dataset, the corresponding ground truth, and predictions from (i) PixMatch (Augmentations), (ii) Seamless-PU with UNET trained on PU5, (iii) Seamless-PU with DeepLabV2 trained on PU5, (iv) Seamless-PU with UNET trained on PU9, and (v) Seamless-PU with DeepLabV2 trained on PU9.
28. Different deep ensemble architectures.
29. Self-ensemble multi-scale UNET.
30. Cosine similarity of model weights for the TreeNet (Decoder) model over different checkpoints at different epochs during the training phase for the PU9 dataset.
31. Cosine similarity of model weights for the TreeNet (Encoder) model over different checkpoints at different epochs during the training phase for the PU9 dataset.
32. Cosine similarity of model weights for the ModelNet model over different checkpoints at different epochs during the training phase for the PU9 dataset.
33. Left-to-right: sample image patches from the homogeneous PU Inria dataset, the corresponding ground truth, and predictions from the baseline model and the proposed model trained on the PU9 dataset.

LIST OF TABLES

1. Massachusetts Buildings Dataset characteristics
2. Inria Aerial Image Dataset characteristics
3. The performance of the proposed model and the baselines in the target domain
4. Performance drop of the PUDA model (Seamless-PU) without the consistency loss on heterogeneous PU data from the Inria Aerial Image Dataset. The model uses a warm start with the ImageNet pre-trained weights.
5. Performance of the unsupervised domain adaptation model (PixMatch) and the proposed PUDA model (Seamless-PU). All utilize a warm start with the ImageNet pre-trained weights.
6. The effect of the magnitude with which the consistency loss is incorporated, shown for the case of the PU9 dataset.
7. The performance of different ensemble models compared to the performance of a single model.
8. The mixture of ensemble and transfer PU models.

CHAPTER I

INTRODUCTION

Remotely-sensed (RS) imagery is among the most valuable geographically referenced spatial big data and provides a unique opportunity for understanding the earth's surface (Miller & Goodchild, 2015; S. Wang et al., 2015). Such an opportunity is of great importance in different scientific domains including, but not limited to, geography, ecology, epidemiology, the social sciences, and emergency management (S. Wang et al., 2015). In these scientific domains, it is crucial to have access to, and to be able to process, the most recent imagery in order to extract the necessary information. This requires frequent image acquisition and automatic image processing. In terms of image acquisition, the production rate of RS imagery has exploded in recent years due to the increase in the number of satellites and advances in drone imagery. With regard to automatic image processing, on the other hand, the recent advances of deep learning in computer vision have revolutionized digital image processing. For example, as a result of the success of convolutional neural networks in different vision tasks such as object detection (Girshick, Donahue, Darrell, & Malik, 2015; Redmon, Divvala, Girshick, & Farhadi, 2016) and semantic segmentation (Long, Shelhamer, & Darrell, 2015; Noh, Hong, & Han, 2015; Ronneberger, Fischer, & Brox, 2015a), interest in deep learning models has greatly increased within the remote sensing research community (X. X. Zhu et al., 2017).
The semantic segmentation task (also known as pixel-based classification in the remote sensing research community) refers to assigning a label (of a landcover type or an object) to every pixel within an image (Kemker, Salvaggio, & Kanan, 2018). Semantic segmentation of RS imagery usually requires labeled datasets with samples that are representative of all landcover types within the area in order to train a classifier (W. Li, Guo, & Elkan, 2010), whereas only one specific class of landcover types (or a specific class of objects) might be of interest (Foody, Mathur, Sanchez-Hernandez, & Boyd, 2006). For example, landcover classes such as agricultural lands may not be of interest when the objective is to identify man-made structures such as roads or buildings, and vice versa (Foody et al., 2006; W. Li et al., 2010). In such cases, the intrinsically labor-intensive and time-consuming approach of creating a set of representative training samples can be avoided. However, this renders the widely-used supervised learning framework unsuitable, and thus alternative approaches, such as the positive and unlabeled learning framework, should be taken.

Positive and unlabeled (PU) learning is a more general setting of semi-supervised learning, in which a binary classifier is learned from a (limited) set of labeled data for the positive class only (i.e. the class of interest) and a very large amount of unlabeled data containing both positive and negative classes (Bekker & Davis, 2020). Despite its significance, little attention has been paid to PU learning within the remote sensing literature, whereas machine learning and deep learning for remote sensing have a rich literature—for more details on machine learning and deep learning in remote sensing, see Maulik and Chakraborty (2017) and Y. Li, Zhang, Xue, Jiang, and Shen (2018).

1.1 Dissertation Scope & Contributions

In this dissertation, I investigate the potential of deep learning-based PU learning for semantic segmentation of satellite imagery. The contributions of this dissertation can be categorized into two major learning frameworks: (i) Transfer Positive and Unlabeled Learning, and (ii) Ensemble Positive and Unlabeled Learning. The following presents the two research questions and an overview of the main contributions of this dissertation.

RQ1. How can Transfer Learning be incorporated in the context of positive and unlabeled learning for semantic segmentation of satellite imagery?

RQ2. How can Ensemble Learning be incorporated in the context of positive and unlabeled learning for semantic segmentation of satellite imagery?

1.1.1 Deep Transfer Positive and Unlabeled Learning. Transfer learning (TL) is the idea of transferring informative knowledge from one domain (i.e. a source domain) to another (i.e. a target domain) (Torrey & Shavlik, 2010). TL is suitable when learning from a source domain with a large amount of labeled data can inform a model for a target domain with limited labeled data, so that the resulting model does better than models trained only on the limited labeled data in the target domain (Karbalayghareh, Qian, & Dougherty, 2018)—for example, the limited labeled data in the target domain could be PU data (J. Chen & Liu, 2014). Although transfer PU learning has been studied to some extent outside the remote sensing literature (J. Chen & Liu, 2014; Mignone & Pio, 2018), no study has been found (especially with a deep learning-based approach) for semantic segmentation of RS imagery.
TL can be categorized into homogeneous and heterogeneous TL. In homogeneous TL, the source and target domains have the same feature space, XS = XT. Heterogeneous TL, however, is considered when the source and target domains have different feature spaces, XS ≠ XT (Bashath et al., 2022a). I consider homogeneous TL for situations in which a model is developed on a set of satellite images from a geographic location and then used on images from the same geographic location. The problem here is that the spectral signatures of objects are not exactly the same—i.e. different marginal distributions, PS(X) ≠ PT(X), but the same feature spaces, XS = XT. I propose a solution which focuses on both shared feature spaces and shared parameters in a deep learning-based framework. Next, I consider heterogeneous TL for situations in which a model is developed for a learning task on a set of general image datasets (like ImageNet) or on satellite images from one geographic location, and then used on satellite images, possibly from another geographic location. In this case, since the two domains are related (i.e. both are images), transferring information is still possible. I extend my proposed model for the homogeneous case to the heterogeneous case, in which the goal is to close the gap between the feature spaces of the source and target domains.

1.1.2 Deep Ensemble Positive and Unlabeled Learning. Ensemble learning (EL) is essentially a methodological framework based on a voting mechanism, in which the predictive power of multiple different learning algorithms is fused in order to achieve a better predictive performance through collective voting—as opposed to that of any of the participating learning algorithms individually (Dong, Yu, Cao, Shi, & Ma, 2020a; Sagi & Rokach, 2018a). In remote sensing, research has shown the robustness of ensemble classifiers relative to single-method strategies (e.g. X. Huang & Zhang, 2012a). However, little research has focused on ensemble PU learning in remote sensing, leaving a gap in this area of research; R. Liu et al. (2018) is among the few studies targeting ensemble PU learning for RS imagery. Therefore, this dissertation investigates deep ensemble learning frameworks in a PU learning setting. Building upon this, I propose a deep ensemble PU learning model that benefits from both the ensemble and PU worlds for semantic segmentation of RS imagery.

1.2 Dissertation Outline

The remainder of this dissertation is organized as follows. In Chapter II, I provide an overview of studies related to (i) PU learning, (ii) TL, and (iii) EL, in general and with a focus on the remote sensing literature in particular. Next, in Chapter III, I introduce the datasets used in this dissertation as well as the PU dataset that I create on top of one of them. Chapter IV presents my approach for homogeneous and heterogeneous transfer PU learning. Then, I present the proposed ensemble PU learning in Chapter V. Finally, I conclude and summarize my contributions and discuss the opportunities for future work in Chapter VI.

CHAPTER II

LITERATURE REVIEW

In this chapter, I review the recent research in positive and unlabeled (PU) learning and the other learning frameworks that are used as the basis for developing my methodologies. The literature review is categorized into three sections: (i) PU Learning, (ii) Transfer Learning, and (iii) Ensemble Learning.
2.1 Positive and Unlabeled Learning

PU learning is a more general setting of both semi-supervised learning and one-class classification, in which a binary classifier is learned from a limited set of labeled data from only one class and a very large amount of unlabeled data containing examples from both positive and negative classes. Therefore, PU learning is considered for situations where the data for the negative class is either absent or distributed too diversely (X. Chen, Gong, & Yang, 2021; M. Du Plessis, Niu, & Sugiyama, 2015). As shown in Fig. 1, PU learning differs from semi-supervised learning because it uses labeled examples from the positive class only. It also differs from one-class classification since it utilizes unlabeled data examples (Bekker & Davis, 2020). It should be mentioned that the negative and unlabeled (NU) learning task applies when the labeled data belong to the negative class, and thus NU learning is essentially the same as PU learning (Hammoudeh & Lowd, 2020; Niu, du Plessis, Sakai, Ma, & Sugiyama, 2016). The significance of PU learning is due to its large number of applications (X. Chen et al., 2021). However, PU learning is challenging due to the difficulty of model selection, model building, and model assessment (W. Li, 2013).

Figure 1. From left to right: fully-labeled binary data and classifier; one-class labeled data and classifier; partially-labeled binary data plus unlabeled data and a semi-supervised classifier; and positive and unlabeled data and classifier. Blue points are positive data, red points are negative data; bright colors are labeled data and pale colors are unlabeled data.

In binary classification, a classifier is trained using a set of training examples of the form {x, y}, where x is a set of attributes/predictors/variables and y is the class label; usually y = 1 refers to the positive class and y = 0 to the negative class. In PU learning, however, the training data set contains examples of the form {x, y, s}, where x and y are the same as in the binary case. The new element here is s, which denotes whether the example is labeled (s = 1) or not (s = 0). In situations with PU data, the binary loss functions are no longer valid, which has resulted in the emergence of different PU learning algorithms.
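To make this representation concrete, the following minimal sketch (NumPy; the function name, the uniform labeling rate c, and the array encoding are illustrative assumptions, not the moving-window procedure used to build the PU datasets in Chapter III) converts fully-labeled binary data into PU data under the Selected Completely At Random (SCAR) assumption discussed below:

```python
import numpy as np

def make_pu_labels(y, c, seed=0):
    """Turn fully-labeled binary labels y (1 = positive, 0 = negative)
    into PU labels s under SCAR: each positive example is labeled
    (s = 1) with constant probability c, independently of its
    features; everything else becomes unlabeled (s = 0)."""
    rng = np.random.default_rng(seed)
    s = np.zeros_like(y)
    keep = rng.random(y.shape) < c
    s[(y == 1) & keep] = 1
    return s  # s == 0 mixes unlabeled positives with all negatives
```

Note that y is needed above only to simulate the labeling mechanism; a PU learner sees only x and s.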
Categories of PU learning algorithms include two-step techniques, biased learning models, and learning with incorporation of the positive class prior (Bekker & Davis, 2020; Jaskie & Spanias, 2019).

The two-step techniques assume that positive data are very similar to each other and different from negative data. Therefore, by identifying reliable negative samples from the unlabeled data in the first step, a supervised or semi-supervised learning model can be trained in the second step using the positive, reliable negative, and remaining unlabeled samples (Bekker & Davis, 2020). For example, Dhurandhar and Gurumoorthy (2020) propose an approach in which, using an unsupervised framework independent of classifiers, they identify and weigh positive and negative examples from unlabeled examples. B. Liu, Lee, Yu, and Li (2002) employ a spy technique in which they first put a subset of randomly selected positive examples into the unlabeled set and then determine a probability threshold to identify negative examples from the unlabeled data. P. Yang, Liu, and Yang (2017) propose an adaptive sampling approach for identifying reliable negative data from the unlabeled data. They first consider all unlabeled data as negative data. Then, using a predictive model trained on the positive and negative (PN) data, the negative-class probability for the unlabeled data is calculated. The probabilities of all unlabeled data belonging to the positive or negative class are calculated iteratively by repeating the last two steps, which, at the end, results in one or more robust negative dataset(s) based on a likelihood threshold. The two-step strategy has a drawback: identifying negative samples is not always reliable and cannot always be accurate, and such situations result in poor performance.

In biased techniques, the unlabeled data are considered noisy samples of the negative class. For example, Biased-SVM considers a modified cost function to address the noisy data (Jaskie & Spanias, 2019; B. Liu, Dai, Li, Lee, & Yu, 2003). W. S. Lee and Liu (2003) perform weighted logistic regression in which positive samples have larger weights than the negative samples (Bekker & Davis, 2020). H. Shi, Pan, Yang, and Gong (2018) solve PU learning by focusing on risk minimization. They propose a decomposition technique for the loss function in order to model the noisy negative labels, and they show that estimating the negative class centroid can reduce the adverse effect caused by noisy negative labels. To improve this further, Gong et al. (2019) propose a kernelized version of the algorithm of H. Shi et al. (2018) in order to address non-linear classifiers/decision boundaries. All in all, the performance of biased techniques depends on the number of positive samples within the unlabeled set. In situations where a large number of the unlabeled data belong to the positive class, the negative unlabeled data add too much uncertainty to the learning process, which results in poor performance.

PU models that incorporate a class prior are probabilistic approaches that rely on an estimate of the ratio of positive instances in the population. Such estimation requires an assumption on the labeling mechanism, one of the most popular being Selected Completely At Random (SCAR) (Bekker & Davis, 2020). Under SCAR, it is assumed that the labeled data are uniformly selected from the positive data, which makes it possible to use traditional classifiers (Bekker & Davis, 2020; Elkan & Noto, 2008). For example, Elkan and Noto (2008) first train a non-traditional classifier on PU data and estimate the weights (i.e. the label frequency, or class prior) of examples on a validation set; they then convert the non-traditional classifier to a traditional classifier using the found weights. Recent research has been trying to relax the SCAR assumption and consider less restrictive assumptions on the labeled data. For example, Bekker, Robberechts, and Davis (2019) relax the SCAR assumption to the less restrictive Selected At Random (SAR) assumption, which allows non-uniform labeling mechanisms. The SCAR assumption also relies on the availability of the class prior, and research (i.e. M. C. Du Plessis, Niu, & Sugiyama, 2016; Jain, Delano, Sharma, & Radivojac, 2020; Łazęcka, Mielniczuk, & Teisseyre, 2021; Perini, Vercruyssen, & Davis, 2020) has demonstrated how to estimate the class prior from data. On the other hand, in some cases (C. Zhang, Hou, & Zhang, 2020), the goal is to learn a model without pre-estimating the class prior (in an isolated step).
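The correction of Elkan and Noto (2008) referenced above admits a compact sketch. Under SCAR, a non-traditional classifier g(x) ≈ P(s = 1 | x), trained to separate labeled from unlabeled examples, relates to the desired classifier via P(y = 1 | x) = P(s = 1 | x)/c, where c = P(s = 1 | y = 1) is the label frequency. A minimal sketch follows (scikit-learn; the logistic-regression base classifier and the variable names are illustrative choices, not the paper's prescription):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def elkan_noto(X_train, s_train, X_val_labeled):
    # non-traditional classifier: labeled (s = 1) vs. unlabeled (s = 0)
    g = LogisticRegression(max_iter=1000).fit(X_train, s_train)
    # estimate c = P(s = 1 | y = 1) as the mean score over held-out
    # labeled examples, which are known positives under SCAR
    c = g.predict_proba(X_val_labeled)[:, 1].mean()
    # traditional classifier: P(y = 1 | x) = P(s = 1 | x) / c
    return lambda X: np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)
```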
A further advancement in the class-prior-based category is to find suitable risk estimators for PU learning. The approaches in this category aim to apply distinct loss functions that satisfy specific conditions to PU risk estimators. For example, M. C. Du Plessis, Niu, and Sugiyama (2014) propose cost-sensitive learning between positive data and unlabeled data, which results in using non-convex loss functions, such as the ramp loss, in order to avoid a systematic estimation bias. The cost-sensitive classifier does depend on the class-prior probability estimate unless the unlabeled data contain mostly positive examples. As an improvement on the work by M. C. Du Plessis et al. (2014), M. Du Plessis et al. (2015) propose a convex PU loss called the double hinge loss, which considers an ordinary convex loss function for unlabeled samples and a composite loss function for positive samples. Their loss function performs as accurately as the non-convex ramp loss while (i) it can still cancel the systematic estimation bias, (ii) it causes estimators to converge to the optimal solutions at the optimal parametric rate, and (iii) it has a much lower computational burden. As a further improvement, Kiryo, Niu, Du Plessis, and Sugiyama (2017) propose a non-negative risk estimator for PU learning, which modifies the lower bound of the double hinge loss. In comparison to unbiased risk estimators, the non-negative risk estimator does not produce negative empirical risks when learning flexible models such as deep neural networks. The non-negative risk estimator is more robust against overfitting, and thus it can be used with deep neural networks even when positive data are limited. Building on the non-negative risk estimator, X. Chen, Liu, Tu, Cao, and Yang (2018) propose a manifold non-negative risk estimator by adding a manifold regularizer to the non-negative risk estimator.
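Since the non-negative estimator of Kiryo et al. (2017) recurs throughout this dissertation, a minimal sketch is given below (PyTorch, with a sigmoid surrogate loss; the simple clamping shown here stands in for the gradient-switching heuristic of the original paper, and the class prior pi_p is assumed known or pre-estimated):

```python
import torch

def sigmoid_loss(z, t):
    # surrogate loss l(z, t) = sigmoid(-t * z) for targets t in {+1, -1}
    return torch.sigmoid(-t * z)

def nnpu_risk(logits_p, logits_u, pi_p):
    """Non-negative PU risk (after Kiryo et al., 2017).
    logits_p: classifier scores on labeled positive examples
    logits_u: classifier scores on unlabeled examples
    pi_p: class prior P(y = +1), assumed known or pre-estimated"""
    ones_p = torch.ones_like(logits_p)
    ones_u = torch.ones_like(logits_u)
    risk_p_pos = pi_p * sigmoid_loss(logits_p, ones_p).mean()
    risk_p_neg = pi_p * sigmoid_loss(logits_p, -ones_p).mean()
    risk_u_neg = sigmoid_loss(logits_u, -ones_u).mean()
    # the unbiased negative-risk term can go below zero with flexible
    # models; clamping it at zero yields the non-negative estimator
    return risk_p_pos + torch.clamp(risk_u_neg - risk_p_neg, min=0.0)
```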
Following on deep learning-based models, H. Chen, Liu, Wang, Zhao, and Wu (2020) propose a general variational principle for PU learning, which introduces a loss function that can be efficiently calculated without involving class-prior estimation. Na et al. (2020) propose a generative PU learning model based on variational auto-encoders, called VAE-PU, that does not rely on the SCAR assumption; the proposed deep generative model is used to virtually generate the unobserved data of PU data in order to satisfy the risk-function requirement resulting from the SCAR-independence condition. Guo et al. (2020) propose PUGAN, based on deep Generative Adversarial Networks (GANs), in which the assumption is that the data produced by the generator should be considered unlabeled rather than negative, so that the problem in GAN models becomes PU learning rather than standard PN learning. Hou, Chaib-Draa, Li, and Zhao (2017) propose GenPU, in which the GAN framework is used to identify both the positive and the negative data distributions. Hu et al. (2021) propose a GAN-based model called PAN, in which they develop a new objective function based on the Kullback–Leibler divergence in order to address the problem that a GAN focuses disproportionately on the positive class (which, in the case of GANs, is the real data). Zheng, Yuan, Wu, Li, and Lu (2019) propose a one-class GAN for fraud detection with only benign users as training data: the generator searches for the probability distribution of the malicious users, and the discriminator attempts to distinguish the generated malicious examples from the benign examples in the training data. Chiaroni, Rahal, Hueber, and Dufaux (2018) apply GANs to learn the distribution of the negative class by generating fake images whose distribution is close to that of the negative class within the unlabeled dataset; a supervised convolutional neural network (CNN) classifier is then trained on the positive and the fake generated negative samples. Finally, in contrast to most PU research, which focuses on classification, Y. Yang, Liang, and Carin (2020) pose object detection as a PU problem, due to the challenge of collecting complete labels for object detection given the large number of instances, which makes this task more difficult than a classification task. They utilize the Faster-RCNN architecture (Ren, He, Girshick, & Sun, 2015) with the NNPU loss function adopted from Kiryo et al. (2017).

Niu et al. (2016) compare PU learning to PN learning and demonstrate that, under certain conditions, PU learning can perform as well as PN learning. They establish risk bounds of the risk minimizers in the PN and PU cases in order to investigate settings in which PU learning could outperform PN learning. They estimate error bounds of the risk minimizers and discover that the PU bound can be tighter than the PN bound under certain assumptions, such as the Rademacher complexity of the decision-function class being bounded by a constant, and depending on the sizes of the datasets. This behavior of the error bounds is proved in both the finite-sample and asymptotic cases, in which the size of the unlabeled data should increase at a rate faster than the sizes of the positive and negative data. Their results are independent of the specific form of the function class and/or the data distribution; however, they still rely on the availability of the class-prior probability.

• Positive and Unlabeled Learning in Remote Sensing. Remote sensing classification is a complex process and requires consideration of many factors, including the selection of training samples and of suitable classification approaches (Lu & Weng, 2007). With regard to the type of classification approach, there are three general categories: (i) pixel-, (ii) sub-pixel-, and (iii) object-based methods. Pixel-based semantic segmentation algorithms assign a label to every pixel in an image (Kemker et al., 2018), for which supervised learning has traditionally been used (Dey, Zhang, & Zhong, 2010; W. Li et al., 2010). A crucial step in the supervised approach is to collect representative training samples for all landcover types in the image to ensure the success of training a classifier (W. Li, 2013; Tuia, Persello, & Bruzzone, 2016). If the goal is to identify only a specific class of landcover types, such as urban versus non-urban regions (W. Li et al., 2010), then it is not reasonable to collect negative data in a representative way, since the negative class is too diverse (M. Du Plessis et al., 2015). Therefore, many researchers have started investigating PU learning classifiers for semantic segmentation of RS imagery from both active and passive sensors.

PU learning in remote sensing research originates from ecological niche modeling, for cases when the only available data is information about the presence of the species of interest. One-class classifiers such as the one-class Support Vector Machine (SVM) (Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 2001) and Maximum Entropy (MaxEnt) models (Phillips, Anderson, & Schapire, 2006) are the most used approaches for one-class species distribution modeling (W. Li, Guo, & Elkan, 2011).
These methods are adopted for species distribution extraction, such as of different types of vegetation, using semantic segmentation of RS imagery (Rapinel & Hubert-Moy, 2021). For example, X. Liu, Liu, Gong, Lin, and Lv (2017) study the effectiveness of different one-class classification methods for detecting invasive plants. In addition, one-class classifiers are used for other tasks such as single land-cover detection (W. Li & Guo, 2010), building detection (Krupiński, Lewiński, & Malinowski, 2019; W. Yang, Yin, Song, Liu, & Xu, 2013), mapping forced labor (McDonald et al., 2021), and flood masks (Brill, Schlaffer, Martinis, Schröter, & Kreibich, 2021). Because of their extensive adoption, these methods have been extensively analyzed for one-class classification of RS imagery, demonstrating that even the best performers among them still function poorly in many cases (Mack & Waske, 2017). Although other flavors of one-class classification, such as sparse representation (Y. Chen, Nasrabadi, & Tran, 2012; Ran, Zhang, Li, & Du, 2016; Song, Li, & Jia, 2020; Song, Li, Li, & Plaza, 2016), have been studied with performances competitive with other methods in their category, there remain a few important challenges in training an accurate one-class classification model, including (i) when the size and/or the number of the training image(s) is large, and (ii) when the number of positive data examples is small (Mack et al., 2016).

In order to take advantage of the unlabeled data, building on the idea introduced by Elkan and Noto (2008), W. Li et al. (2010) propose a PU learning approach that trains a classifier on PU data under the SCAR assumption and then divides the classifier's output by the constant probability that a positive example is labeled, in order to obtain the final desired classifier. An advantage of their proposed model is that it can be implemented with neural networks. Their model appears to be superior to Biased SVM, the Gaussian domain descriptor, and one-class SVM (W. Li et al., 2010), as well as to binary classifiers such as SVMs and maximum likelihood classifiers (Deng, Li, Liu, Guo, & Newsam, 2018). Their proposed model has therefore been used for many applications, including extraction of built-up areas (Djerriri, Benyelles, Attaf, & Cheriguene, 2019; Xia, Yokoya, & Adriano, 2021), time-series analysis of RS images (Desloires, Ienco, Botrel, & Ranc, 2022), and crop mapping (L. Zhao et al., 2019). B. Liu, Zhu, Wang, Liu, and Yu (2011) have also demonstrated the success of the model by Elkan and Noto (2008) for SAR imagery. C. Zhu, Liu, Yu, Liu, and Yu (2012) point out that the PU model by Elkan and Noto (2008) is very sensitive to the precision of the estimate of the positive-class frequency, which can be low when few positive data are present.
Therefore, they mitigate this issue with a two-step approach: (i) unreliable positive and reliable negative examples are identified using the spy detection method of B. Liu et al. (2002), and these examples are assigned different weights calculated by the method of Elkan and Noto (2008); then, (ii) these examples, along with the (reliable) positive examples and their respective weights, are used to train a binary classifier. Finally, Gui, Xu, Wang, Yang, and Pu (2020) propose a two-step PU learning approach for extracting built-up areas from PolSAR imagery.

The possibility of using PU learning in a deep learning framework is important for remote sensing research. Xia and Yokoya (2021) investigate a self-paced PU learning approach for identifying and assessing damage to buildings. Their approach includes a two-step PU learning procedure and a three-part loss function, one part of which is a hybrid of the supervised cross-entropy loss and the unbiased risk estimator introduced by M. Du Plessis et al. (2015). Jian, Chen, and Cheng (2021) investigate the method proposed in Zheng et al. (2019) for change detection in time series of RS imagery. Lei et al. (2021) propose a deep learning-based PU learning approach that takes advantage of CNNs and the PU loss function by Kiryo et al. (2017). They construct the PU data from PN data for the training set, but keep the PN data for the test set. The positive data are created very carefully to ensure zero noise. In addition, they create a subset of unlabeled data/pixels within an image, which could be due to the architecture style they choose: the architecture is a point-wise classifier rather than a full segmentation architecture, which may slow prediction. Finally, Santara et al. (2019) explore PU learning for hyperspectral imagery using NNPU-based risk minimization by Kiryo et al. (2017) and a one-versus-all classification based on a two-step PU learning approach.

2.2 Transfer Learning

Transfer learning (TL) is the idea of transferring informative knowledge from one domain (i.e. a source domain) to another domain (i.e. a target domain) (Fig. 2). TL is suitable when learning from a source domain with a large amount of labeled data can inform the model for a target domain with limited labeled data, so that it performs better than training from scratch on the limited target-domain labels (Torrey & Shavlik, 2010). A PU data domain is an example of a limited-data domain (Mignone & Pio, 2018). Transferring informative knowledge among domains is all the more important when a single domain may not have enough information from labeled data, but the domains combined do have enough information to construct a representative model—examples of this in PU learning are Mignone and Pio (2018) and B. Liu et al. (2022). Therefore, the key with TL is the ability to transfer only the beneficial part of the knowledge learned in the source domain to the target domain (Day & Khoshgoftaar, 2017).

Figure 2. Transfer learning is a framework for reusing the knowledge extracted in a large-data source domain to train a model in a limited-data target domain.

Concepts in TL include the domain and its corresponding task(s). A domain is defined by the feature space, X, and the marginal probability distribution over the feature space, P(X). Given a domain, a task is defined by the label space, Y, and a predictive function, f(·) = P(Y|X), that provides a prediction for the label of a given data example (e.g. a pixel) (Tan et al., 2018). Since TL is a family of different strategies and does not refer to a specific methodology (Bashath et al., 2022b), TL categorization has evolved to include two approaches, homogeneous and heterogeneous (Weiss, Khoshgoftaar, & Wang, 2016). TL is considered homogeneous when the data in the source and target domains have the same feature space (XS = XT) and labels (YS = YT). As a note, if XS = XT but the feature distributions are different (PS(X) ≠ PT(X)), it is called domain adaptation (DA). TL is defined as heterogeneous when the data in the source and target domains have nonequivalent (and non-overlapping) feature spaces (XS ≠ XT), with possibly nonequivalent labels (YS ≠ YT) (Day & Khoshgoftaar, 2017).
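In the notation of Tan et al. (2018), these definitions and the resulting taxonomy can be summarized compactly (a restatement of the prose above; subscripts S and T denote the source and target domains):

```latex
\begin{aligned}
&\mathcal{D} = \{\mathcal{X},\, P(X)\}, \qquad
 \mathcal{T} = \{\mathcal{Y},\, f(\cdot) = P(Y \mid X)\} \\
&\text{homogeneous TL:}\ \ \mathcal{X}_S = \mathcal{X}_T,\ \mathcal{Y}_S = \mathcal{Y}_T \\
&\text{domain adaptation:}\ \ \mathcal{X}_S = \mathcal{X}_T,\ P_S(X) \neq P_T(X) \\
&\text{heterogeneous TL:}\ \ \mathcal{X}_S \neq \mathcal{X}_T\ \ (\text{possibly } \mathcal{Y}_S \neq \mathcal{Y}_T)
\end{aligned}
```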
Homogeneous TL can be categorized into instance-, feature- (symmetric or asymmetric), parameter-, relational-information-, and hybrid-based techniques (Weiss et al., 2016), some of which are of interest in my dissertation. In instance-based techniques, the data samples in the source domain are reweighted to be used directly in the target domain—e.g. Asgarian et al. (2018). Feature-based techniques target the gap between the marginal and conditional distributions of the source and target domains, and try to close the gap using asymmetric or symmetric feature transformations. Parameter-based techniques try to share model parameters between the source and target domains. For example, Oquab, Bottou, Laptev, and Sivic (2014) investigate the idea of transferring image representations learned by a CNN-based model trained on a source domain for reuse in a model on a target domain. Finally, hybrid-based techniques transfer knowledge using both instances and shared parameters (Weiss et al., 2016). In heterogeneous TL, on the other hand, since the problem is the difference in feature spaces, only feature-based techniques are considered (Day & Khoshgoftaar, 2017).

DA has gained a lot of attention as well. For example, Ganin and Lempitsky (2015) propose an approach that trains deep architectures on labeled data in the source domain and unlabeled data in the target domain. Their architecture uses a deep feature extractor connected to a deep label predictor for the source-domain data and to a deep domain classifier that performs the unsupervised DA on both source- and target-domain data; the latter ensures the extraction of domain-invariant features. Universal training (i.e. shared encoders/feature extractors) can focus on either one domain or multiple domains—see, for example, Nam, Lee, Park, Yoon, and Yoo (2021), Ouali, Hudelot, and Tami (2020), Kim and Kim (2020), Z. Wang et al. (2020), Kalluri, Varma, Chandraker, and Jawahar (2019), and Hoffman, Wang, Yu, and Darrell (2016). In addition, there are GAN-based approaches to DA, such as Musto and Zinelli (2020), that translate source-domain images to target-domain images using a GAN—it should be noted that GAN-based translation approaches are widely used in the remote sensing literature.

The potential for leveraging the advantages of TL and DA in PU learning is huge. However, research in PU learning mostly focuses on the knowledge of a single domain for learning a classifier; not enough work has focused on utilizing TL for PU learning (B. Liu et al., 2022). Among the few, for example, B. Liu et al. (2022) propose a mixed transfer and ensemble learning framework for PU learning. They take a parameter-based approach, in which SVM parameters and regularization terms are shared between the source and target domains. Meng, Xie, and Sun (2021) take an instance-based approach, in which they use a PU learning model to identify examples that are close to the target domain, which are later used together with the target-domain data in a gradient-boosted decision tree model. As a limitation of their work, they show that negative transfer (i.e. when TL does not improve, or even worsens, the quality of learning) can occur in the case of multi-source tasks.
Mignone, Pio, D’Elia, and Ceci (2020) study a setting in which both the source and target domains have PU data. They propose an approach in which the unlabeled data are first converted to weighted labeled data in the source and target domains separately; they then use a parameter-based approach to create a new classification model from different models trained separately in each of the source and target domains. Bhat and Culotta (2017) use a feature-based approach for document classification, in which they reweight the feature importances based on information extracted from a base document classifier on PU data. Hammoudeh and Lowd (2020) investigate PU learning under covariate shift, where the distribution of positive data shifts in the test data in comparison to the training data (as opposed to a setting with a fully-labeled source domain and a PU target domain). Loghmani, Vincze, and Tommasi (2020) convert the open-set DA problem to a PU problem: instead of a source domain with label space YS and a target domain with label space YT such that YS ⊂ YT, they consider the source data as positive and the target data as unlabeled, which violates the SCAR assumption. Therefore, they use a mix of an autoencoder loss (as their domain-agnostic discriminative loss of choice) and the NNPU loss of Kiryo et al. (2017), in order to avoid the unreliable predictions that occur in deep neural networks when NNPU with a logarithmic loss is used with unlabeled samples belonging to a different domain. Finally, Sonntag, Behrens, and Schmidt-Thieme (2022) investigate PU-DA, where the source domain is completely labeled and the target domain has PU data. They propose a two-step approach in which reliable positive and negative pseudo-labels are selected in the target domain using a PU loss on the target domain and a classic loss on the source domain; they then train a supervised classifier on the target domain.

• Transfer Learning in Remote Sensing. TL has gained a lot of attention in the remote sensing literature. M. Zou and Zhong (2018) investigate a parameter-based TL in which the model parameter values (i.e. weights) of a model pre-trained on an open image dataset such as ImageNet (Russakovsky et al., 2015) are used for fine-tuning a model in the target domain (i.e. the domain of interest). They do this by either adjusting all the weights or reusing them as-is, depending on whether or not the size of the training data in the target domain is large enough. A. X. Wang, Tran, Desai, Lobell, and Ermon (2018) also show the success of using a model pre-trained on images of one geographic location, fine-tuned on images of another geographic location. X. He, Chen, and Ghamisi (2019) use a fully connected layer (as the starting point of their model) to convert hyperspectral images with more than three channels into three-channel data, and then use ImageNet pre-trained weights for the rest of the model. Wurm, Stark, Zhu, Weigand, and Taubenböck (2019) take advantage of transferring pre-trained fully convolutional networks for mapping slums in various satellite images. Other examples of this TL approach are Najjar, Kaneko, and Miyanaga (2017) and Z. Zhou et al. (2021). The success of reusing pre-trained model parameters is due to the similarities in the low-level features of the two domains (M. Zou & Zhong, 2018). However, other studies, such as Xie, Jean, Burke, Lobell, and Ermon (2016), have been able to obtain the same benefits using data-rich proxy labels for cases in which the two learning tasks are not very related (Ghosh, Jia, & Kumar, 2021). All in all, although this type of parameter-based approach improves the model relative to training from scratch, better results could be obtained with a feature-based approach or a hybrid of both.
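In practice, the parameter-based recipe above reduces to reusing pre-trained weights and selectively re-training them. A minimal sketch (PyTorch/torchvision; the VGG16 backbone, the freezing depth, and the two-class head are illustrative assumptions, not a prescription from the cited studies):

```python
import torch.nn as nn
from torchvision import models

# load a VGG16 encoder pre-trained on ImageNet
# (older torchvision versions use pretrained=True instead)
backbone = models.vgg16(weights="IMAGENET1K_V1")

# "reuse": freeze the early convolutional blocks, whose low-level
# features (edges, textures) transfer across image domains
for param in backbone.features[:17].parameters():
    param.requires_grad = False

# "adjust": replace the classification head for the target task and
# fine-tune the remaining layers on the (limited) target-domain data
backbone.classifier[-1] = nn.Linear(4096, 2)
```

Freezing fewer layers corresponds to adjusting all the weights when target-domain data are plentiful; freezing more corresponds to reusing them when data are scarce.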
The spectral shifts among the feature distributions of RS images of different geographic locations can cause a model trained on one geographic location to fail when used in a rather different geographic location. Therefore, in order to obtain a model that is robust to shifts among the datasets, remote sensing turns to DA methods (Tuia et al., 2016). DA can be done in an unsupervised or a semi-supervised fashion: with unsupervised DA, the two unlabeled domains must match, while semi-supervised DA assumes that the source domain contains fully labeled data and the target domain contains a set of unlabeled data (Tuia et al., 2016). Tasar, Happy, Tarabalka, and Alliez (2020b) address the domain shift between the train and test datasets. In their approach, they use a generative adversarial style-transfer network to create fake images that have the style of the test images; these fake images are used to fine-tune a model trained on the training data. Tasar, Tarabalka, Giros, Alliez, and Clerc (2020) address multi-source DA. They use a generative adversarial style-transfer network to remove the marginal gap between each source domain and the target domain, which results in all the data having similar marginal distributions. A model is then trained on the transformed source data and used for prediction on the data in the target domain. There are many other style-transfer approaches, such as Ji, Wang, and Luo (2020), Tasar, Giros, Tarabalka, Alliez, and Clerc (2020), L. Shi, Wang, Pan, and Shi (2020), D. Zhao, Li, Yuan, and Shi (2021), Tasar, Happy, Tarabalka, and Alliez (2020a), Y. Zhang et al. (2019), and M.-Y. Liu, Breuel, and Kautz (2017). Finally, in their transformation-based DA, Chakraborty and Roy (2020) adopt a hierarchy of different approaches, including a stacked auto-encoder neural network and fuzzy theory.

Lucas, Pelletier, Schmidt, Webb, and Petitjean (2021) address semi-supervised DA, in which some labeled samples are available in the target domain. Their proposed model is first trained on a source domain, and then trained on the target domain using a source-regularized loss function that shrinks the model weight estimates toward those learned on the source domain. To alleviate the failures of general unsupervised DA methods on RS imagery, which differs from the datasets such methods are usually developed on (e.g. urban-scene/driving datasets), Iqbal and Ali (2020) propose a weakly-supervised DA for built-up region segmentation, in which they assume the availability of weak labels, i.e. image-level labels in the target domain alongside pixel-level unlabeled images. They guide the adaptation in the segmentation model by using image classification on the weak labels during the training process.
2.3 Ensemble Learning

Ensemble learning (EL) is a methodological framework with the goal of establishing better predictive performance by training and combining multiple learners, each of which, individually, may not produce as strong a predictive performance (Dong, Yu, Cao, Shi, & Ma, 2020b). If the individual learning models are of the same type, this approach is known as homogeneous EL, and as heterogeneous EL otherwise (Ganaie, Hu, et al., 2021). The fundamental idea of EL is based on the voting mechanism, implemented using different strategies including decreasing variance (bagging), decreasing bias (boosting), or ideally both (stacking) (Z.-H. Zhou, 2021). A primary tenet behind EL is to avoid inductive bias by allowing a better search in the hypothesis space, or even extending it, such that the generalization performance is improved by avoiding a single poor learner and/or a model hypothesis space that does not properly approximate the ground truth hypothesis (Z.-H. Zhou, 2021).

Although EL is very successful in machine learning research, it faces new challenges with the incorporation of deep neural networks due to their huge increase in training time and space (Y. Yang, Lv, & Chen, 2021). Efforts to address these challenges resulted in the emergence of deep EL, which combines the advantages of both deep learning and EL (Ganaie et al., 2021). Deep EL algorithms can be categorized into three main approaches: (i) extracting features from deep learning models and using them as input to other traditional classifiers, (ii) ensembling the outputs of individual learners that are deep learning models, and (iii) training end-to-end deep learning architectures that incorporate EL principles (Y. Yang et al., 2021). Some of these deep EL categories utilize the traditional EL approaches such as bagging, boosting, and stacking, and are mostly accompanied by a TL approach—for example, Doshi and Yilmaz (2020) propose an ensemble model for the object detection task.

For the first category of deep EL algorithms, Korzh, Joaristi, and Serra (2018) propose a four-step deep transfer ensemble learning for the classification task. They first fine-tune each of multiple CNNs with different architectures, and then they fine-tune an ensemble of these models, which includes the initial weights identified in the previous step. Finally, they use the ensemble of the CNNs as a feature extractor for another classifier such as SVMs. Although their approach could be considered in the second category of deep EL algorithms too, it has more overlap with the first category. For the second category of deep EL algorithms, Kandaswamy, Silva, Alexandre, and Santos (2015) propose an approach to reduce the impact of layer selection in TL through an ensemble of multiple different styles of transfer for previously learned generic features. Kumar, Kim, Lyndon, Fulham, and Feng (2016) propose an ensemble of CNNs, in which they first fine-tune the pre-trained CNNs on a large dataset of natural images, and then they apply the fine-tuned CNNs as either a feature extractor for an SVM classifier or a classifier for medical image classification. As mentioned, their approach investigates the first category of deep EL algorithms as well (i.e. CNNs as feature extractors for SVM classifiers).
Finally, for the third category of deep EL algorithms, Devassy and Antony (2021) propose a mix of transfer and ensemble learning, in which they use pre-trained models to create intermediate feature maps that are concatenated and fed to a set of dense layers with a binary classification output—the pre-trained weights and layers are frozen in the EL stage. Yu et al. (2021) propose a CNN-based EL for image dehazing, in which two subnetworks are used, one for capturing a global image representation using TL, and the other for capturing a domain-specific representation. These two subnetworks create two feature maps that are then concatenated and inserted into the rest of the main network for the purpose of outputting dehazed images. Bousselham et al. (2021) propose an end-to-end deep EL for semantic segmentation that avoids the heavy cost of multi-stage training of ensembles of deep networks. In order to do this, they take advantage of multiple independent decoders, each of which gets trained on one of the multi-scale feature sets produced by a feature pyramid network approach.

Some other approaches in deep EL that do not fit into the aforementioned categorization are also worth mentioning. Nozza, Fersini, and Messina (2016) demonstrate the positive effect of EL, through different ensemble methods such as simple voting, on reducing the cross-domain generalization error that occurs in domain adaptation. Nigam, Huang, and Ramanan (2018) investigate the possibility of an ensemble of models learned from datasets that differ in scene structure, viewpoints, and object statistics in order to transfer knowledge from multiple diverse domains to the target domain.

In terms of EL for PU learning, P. Yang, Humphrey, James, Yang, and Jothi (2016) propose an ensemble of PU learning models in order to alleviate the class imbalance within the dataset. First, for each classifier, they create a balanced training set by randomly subsampling from the unlabeled set. Then, they train SVM classifiers using a biased PU approach. They use a correction factor approach (Elkan & Noto, 2008) to reduce the prediction bias introduced by considering all of the unlabeled data as negative. This bias is further reduced by an ensemble of these SVM models, which is used to produce the final prediction. Nguyen, Li, and Ng (2012) apply an ensemble of classifiers to the PU problem in time series classification. Claesen, De Smet, Suykens, and De Moor (2015) propose a bagging-based ensemble of a variation of SVM models for PU learning. In their approach: (i) the variability between different models is increased by resampling from both positive and unlabeled data, which results in more robustness against false positives, and (ii) the relative misclassification penalty between positive and unlabeled examples is controlled by introducing an extra degree of freedom in the model. Basile, Di Mauro, Esposito, Ferilli, and Vergari (2019) propose a general probabilistic generative approach to PU learning, in which the goal is to find reliable negative examples that lie in the lowest density regions of the positive class distribution. They then create a mixture of generative models by adopting the bagging ensemble from the discriminative framework. P. Yang, Li, Chua, Kwoh, and Ng (2014) propose a two-step PU learning approach in which, after identifying the reliable negative examples, both positive and reliable negative examples are used to assign weights to unlabeled data.
The assigned weights represent the likelihood of the unlabeled data belonging to either the positive or the negative class. Finally, an ensemble of classifiers is used for the final prediction. Jowkar and Mansoori (2016) propose a two-step PU learning approach in which the second step consists of a graph-based ensemble of three weighted classifiers.

• Ensemble Learning in Remote Sensing. In remote sensing, researchers have demonstrated the robustness of ensemble classifiers over a single-method strategy—for example, Benediktsson, Chanussot, and Fauvel (2007), X. Huang and Zhang (2012b), and Rahman, Smith, and Timms (2013). X. He and Chen (2020) take advantage of TL by reusing pre-trained models in order to alleviate limited training data in the target domain. Then, they use a classifier-level ensemble of multiple different fine-tuned models. Since their approach considers a target domain of hyperspectral images, and the pre-trained models come from domains with three channels, they randomly select three channels of the hyperspectral images for the fine-tuning process. In a different vein, some researchers have moved away from a random selection of channels from hyperspectral images and have proposed different approaches that take advantage of all available channels within hyperspectral images. For example, X. He et al. (2019) propose a fully connected layer added to the beginning of three-channel-compatible networks’ architectures to map the multi-channel hyperspectral images onto three-channel inputs for such networks. Korzh et al. (2018) propose a four-step approach for an ensemble of pre-trained CNNs. First, they fine-tune each CNN separately; then, they construct an ensemble of these fine-tuned networks. Their approach improves EL performance in two ways: (i) using the ensemble of models, themselves, as either the final classifier or the feature extractor fed into another model as the final classifier, and (ii) fine-tuning the ensemble of models using a joint loss function. Fan, Xu, and Zhang (2021) propose a classifier-level bagging-based ensemble for classifying house damage. Jamali et al. (2021) investigate the use of two different deep EL approaches for complex wetland classification: (i) a majority-voting classifier-level ensemble, and (ii) a feature-level ensemble of multiple deep networks fed into a final classifier. They show that the latter results in improved predictive performance. Iyer, Sriram, and Lal (2021) utilize an ensemble of deep learning models, in which they calculate the mean of the probability scores of each model output per class, and then make a prediction based on the maximum of the averaged probability values. Plazas, Ramos-Pollán, and Martínez (2021) propose a classifier-level ensemble-based semi-supervised deep learning approach that iteratively labels those unlabeled data with high-confidence predictions. Gu et al. (2022) propose an ensemble semi-supervised learning approach in which, first, supervised deep CNN classifiers are trained on labelled data to be used as feature extractors. Then, in the second stage, unlabeled data are exposed to a self-learning process, which exploits pseudo-labelling continuously in order to improve the model. For self-learning, the models fine-tuned in the first step are used to produce different feature maps to increase the diversification among the learners in the ensemble module. The ensemble framework uses a cross-checking mechanism among the classifiers such that their disagreement is minimized.
This way, the ensemble components are trained together, as opposed to the traditional separate training strategy. Although research has shown promising results for ensemble classifiers in remote sensing, little research has focused on ensemble PU learning in remote sensing, leaving a gap in this area of research. In one of the few examples, R. Liu et al. (2018) investigate EL for PU learning. Three ensemble methods (majority vote, weighted average, and weighted vote combination rules) are compared with different stand-alone one-class classification and PU learning methods. They show the superiority of the weighted average and weighted vote ensembles over other models for classifying urban areas from RS imagery. Wu, Qiu, Jia, and Liu (2020) investigate a bagging-based approach to PU learning using the decision tree method for generating a landslide susceptibility map. Finally, X. Liu, Liu, Datta, Frey, and Koch (2020) propose a weighted voting ensemble of three one-class classification models for mapping invasive plants.

The literature reviewed in this chapter covered the PU, Transfer, and Ensemble learning paradigms. The goal of these learning paradigms is to alleviate the labeled data scarcity in many real-world applications, including vision-based applications. Although each of these methods has been studied extensively in general, and has been introduced and used in remote sensing in particular, the potential of combining the advantages of these methods has not been researched much. Such research is still in its infancy, with a few studies demonstrating promising results in different learning tasks. Therefore, the goal of this dissertation is to investigate such potential for the semantic segmentation task. More specifically, I investigate two of the possible avenues, Transfer PU learning and Ensemble PU learning, for semantic segmentation of RS imagery, where labeled data scarcity is a burden on developing learning models.

CHAPTER III

DATASETS

Several non-remotely-sensed and remotely-sensed imageries and their derived positive and unlabeled variation datasets were used in this dissertation research. All of the datasets are online and freely available. In the sections below, I briefly describe each of these datasets: (i) the standard open image dataset ImageNet, (ii) the Massachusetts Buildings Dataset, and (iii) the Inria Aerial Image Buildings Dataset, in that order.

3.1 Standard Open Image Datasets

ImageNet, the ILSVRC image dataset (Russakovsky et al., 2015), is used as the standard open image dataset for this dissertation. ImageNet consists of 1,281,167 train images, 50,000 validation images, and 100,000 test images for 1000 object classes—see Fig. 3 for sample images. Although this dataset is not directly used for training the models in this dissertation, it is indirectly used by incorporating the pre-generated parameters/weights of models pre-trained on it (as the starting point or warm start) for some of my models and baselines.

3.2 Massachusetts Buildings Dataset

The Massachusetts buildings dataset (Mnih, 2013) is a high resolution dataset with a spatial resolution of 1.0 meter. This dataset consists of a single source covering almost 340 km2 in total of mostly urban and suburban areas with buildings of various sizes. The reference map consists of labels for each image that show pixels that either do or do not belong to buildings—most of the labels are derived automatically, with the test and validation sets being manually corrected for better performance assessments.
The dataset consists of 151 images of size 1500 × 1500, including 137 images for the train set, 4 images for the validation set, and 10 images for the test set. Images are converted into patches of size 256 × 256—if not enough pixels are available at the edges of an image, enough pixels are created with the mirroring technique (from the OpenCV library) applied to the patch’s edges in order to create a 256 × 256 patch. This process results in 5436 total image patches, including 4932 images for the train set, 144 images for the validation set, and 360 images for the test set—see Fig. 4a for a sample image with some of its corresponding derived patches. Then, in order to avoid the class imbalance problem (since it is out of the scope of this dissertation), a patch selection is performed based on the criterion that, within each patch, at least 25% of the pixels must belong to the class building. It should be noted that an upper bound for the positive class is not needed, since class imbalance in favor of the class of interest is not problematic. Therefore, the final dataset consists of a total of 3559 image patches, including 3201 images for the train set, 83 images for the validation set, and 275 images for the test set—see Fig. 4b for a sample of final patches. Table 1 presents a summary of the dataset characteristics before and after pre-processing. Finally, the dataset is standardized using equation 3.1, where c ∈ C refers to one specific channel at a time, and µ_c and σ_c are the mean and the standard deviation for that specific channel calculated over the entire train set; a code sketch of this pipeline follows at the end of this section.

x̂_c = (x_c − µ_c) / σ_c    (3.1)

Figure 3. Sample images from the ImageNet-ILSVRC dataset (Russakovsky et al., 2015).

Table 1. Massachusetts Buildings Dataset Characteristics

Dataset     Total Images   Training Images   Validation Images   Testing Images   Image Size    Resolution (m/p)
Original    151            137               4                   10               1500 × 1500   1.0
Processed   3559           3201              83                  275              256 × 256     1.0

Figure 4. (a) Left and right columns show images and their corresponding labels, respectively. From top to bottom: a sample from the Massachusetts buildings dataset (Mnih, 2013) and samples from its derived patches, one from the middle, and the rest from the top-left, top-right, bottom-left, and bottom-right corners, respectively, showing the mirroring effect when not enough pixels are available for patch creation. (b) Selected final patches: images and their corresponding labels in the left and right columns.
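To make the patching and standardization pipeline concrete, the following is a minimal sketch in Python. The function names and the threshold parameter are mine, and the use of OpenCV’s BORDER_REFLECT for mirroring is an assumption consistent with the description above, not necessarily the exact implementation used in this dissertation.

    import cv2
    import numpy as np

    def extract_patches(image, label, patch=256, min_positive=0.25):
        # Mirror-pad the right and bottom edges (cv2.BORDER_REFLECT) so the image
        # tiles evenly into patch-by-patch pieces, matching the mirroring
        # technique described above.
        h, w = image.shape[:2]
        pad_h, pad_w = (-h) % patch, (-w) % patch
        image = cv2.copyMakeBorder(image, 0, pad_h, 0, pad_w, cv2.BORDER_REFLECT)
        label = cv2.copyMakeBorder(label, 0, pad_h, 0, pad_w, cv2.BORDER_REFLECT)
        patches = []
        for i in range(0, h + pad_h, patch):
            for j in range(0, w + pad_w, patch):
                img_p = image[i:i + patch, j:j + patch]
                lbl_p = label[i:i + patch, j:j + patch]
                # Patch selection: keep a patch only if at least 25% of its
                # pixels belong to the class building.
                if (lbl_p == 1).mean() >= min_positive:
                    patches.append((img_p, lbl_p))
        return patches

    def standardize(x, mean, std):
        # Per-channel standardization (equation 3.1): x_hat_c = (x_c - mu_c) / sigma_c,
        # with mu_c and sigma_c computed over the entire train set.
        return (x - mean) / std

Per the description above, mean and std would be computed once per channel over all training patches and then applied unchanged to the train, validation, and test images.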
3.3 Inria Aerial Image Dataset

The Inria Aerial Image Labeling Dataset (Maggiori et al., 2017) consists of public domain aerial orthorectified RGB-color imagery with a spatial resolution of 0.3 meter. The images cover dissimilar urban settlements in different geographic locations, such as cities and towns with different building densities. The dataset covers a total of 810 km2, 405 km2 in each of the train and test sets. Due to the unavailability of reference maps for the test set, the available original train set is used to create my train, validation, and test sets, with proportions of 70%, 15%, and 15% of the original train set for this dissertation. The reference map consists of public domain official building footprints, with labels for each pixel either belonging or not belonging to buildings. The dataset consists of 180 images of size 5000 × 5000. As with the Massachusetts buildings dataset, images are converted into patches of size 256 × 256 with the same policy: if not enough pixels are available at the edges of an image, enough pixels are created with the mirroring technique in order to have a 256 × 256 patch. This process results in 72000 total image patches of size 256 × 256. Then, again, in order to avoid the class imbalance problem, a patch selection is performed based on the criterion that, within each patch, at least 25% of the pixels must belong to the class building. Therefore, the final dataset consists of a total of 18702 image patches. Thus, with the 70%, 15%, and 15% portion strategy, the train set, validation set, and test set have 13092, 2805, and 2805 images, respectively. See Table 2 and Fig. 5 for, respectively, a summary of the dataset characteristics (before and after pre-processing) and a sample image with some of its corresponding derived patches (before and after pre-processing). Finally, the dataset is standardized in the same manner as for the Massachusetts buildings dataset.

Figure 5. (a) Left and right columns show images and their corresponding labels, respectively. From top to bottom: a sample from the Inria Aerial Image Dataset (Maggiori et al., 2017) and samples from its derived patches, one from the middle, and the rest from the top-left, top-right, bottom-left, and bottom-right corners, respectively, showing the mirroring effect when not enough pixels are available for patch creation. (b) Selected final patches: images and their corresponding labels in the left and right columns.
Table 2. Inria Aerial Image Dataset Characteristics

Dataset     Total Images   Training Images   Validation Images   Testing Images   Image Size    Resolution (m/p)
Original    180            –                 –                   –                5000 × 5000   0.3
Processed   18702          13092             2805                2805             256 × 256     0.3

3.3.1 Positive and Unlabeled Dataset. A bottleneck for developing models on domains with limited or no labeled data (such as the positive and unlabeled (PU) learning problem) is the validation of the model itself, since there are no fully-labeled samples available for such validation (Tuia et al., 2016). Although research such as W. Li and Guo (2013) has been trying to develop performance metrics on naturally PU data for PU model evaluation and assessment, the commonly accepted and used performance metrics are still those developed for positive and negative (PN) data. Therefore, for this dissertation, I create PU train data on top of the PN train data from the Inria Aerial Image dataset. In doing so, PU data are used for model development, and PN data are used for model evaluation and assessment.

I use the pre-processed data that I created above from the Inria Aerial Image dataset to create the PU data. I choose these data owing to the fact that they do not suffer from a class-imbalance problem, which could lead to generating suboptimal classifiers due to the heavy influence of the majority class, in comparison to the influence of the minority class, on model training (Chawla, Japkowicz, & Kotcz, 2004). Although there is research such as X. Chen et al. (2021) that investigates the class-imbalance problem in PU learning, it is out of the scope of this dissertation, and, thus, the PU dataset is created in a way that avoids such a problem (as shown in Fig. 6). The process for creating the PU data involves applying a moving window of size 32 × 32 and stride 32 that scans the label maps corresponding to each image within the train set. At each location of the moving window, the pixels that belong to the positive class are changed randomly to unlabeled, with some probability (which, in this case, is 0.5); a code sketch of this procedure follows below. Therefore, since there is complete randomness and no selection bias, it is safe to say that the created data follow the Selected Completely At Random (SCAR) assumption. It could be argued that the chosen size for the moving window may introduce a bias, and, thus, the SCAR assumption may not be met; however, the counter-argument is that the moving window’s size is random and chosen independently from the positive class characteristics, arguably still fulfilling the SCAR assumption. Fig. 7 shows some samples of PU train data resulting from the PU data making process.

Figure 6. PU data are created using a moving window over the labels corresponding to each image within the train set. At each location of the moving window, pixels that belong to the positive class are changed to unlabeled with complete randomness (i.e. no selection bias).

Figure 7. Left and right columns show images and their corresponding PU labels created by the moving window strategy.
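The moving-window corruption just described can be sketched as follows. This is a minimal reading of the procedure, not the dissertation’s actual code: in particular, whether the flip decision is drawn once per window or once per positive pixel is my assumption (the sketch draws it once per window, which is one way the window size matters), and the encoding of unlabeled pixels as 0 is hypothetical.

    import numpy as np

    def make_pu_mask(pn_mask, prob=0.5, win=32, seed=None):
        # Slide a win-by-win window with stride win over a PN label map and, at
        # each window location, turn that window's positive pixels into unlabeled
        # ones with the given probability, independently of image content (SCAR).
        # Encoding assumed here: 1 = labeled positive, 0 = unlabeled.
        rng = np.random.default_rng(seed)
        pu = pn_mask.copy()
        h, w = pn_mask.shape
        for i in range(0, h, win):
            for j in range(0, w, win):
                if rng.random() < prob:
                    block = pu[i:i + win, j:j + win]
                    block[block == 1] = 0
        return pu

Re-running this with prob set to 0.6, 0.7, 0.8, and 0.9 would yield the PU6–PU9 variants described next; under SCAR, the resulting label frequency is p(s = 1 | y = 1) = 1 − prob, e.g. roughly 0.1 for PU9.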
Next, I repeat the aforementioned procedure in order to create different PU train datasets with different levels of missing labels for the positive class. Specifically, the probability with which the positive-class pixels under the moving window are converted to unlabeled pixels is changed to 0.6, 0.7, 0.8, and 0.9, for a total of four additional cases. Thus, there are five different PU train datasets derived from the Inria Aerial Image dataset, named PU5, PU6, PU7, PU8, and PU9, all of which satisfy the SCAR assumption—see Fig. 8. In other words, PU9 means that the probability that a pixel belonging to the positive class is labeled is 1 − 0.9 = 0.1 = p(s = 1|y = 1).

Figure 8. Different PU train datasets created from the Inria Aerial Image dataset with different missing probabilities for the positive class, all of which satisfy the SCAR assumption—graphs in black, blue, green, yellow, red, and orange show the distribution of the positive class and of the positive data in the PU5, PU6, PU7, PU8, and PU9 datasets, respectively.

3.3.1.1 Homogeneous Case. As mentioned in the previous chapter, the spectral shifts among the feature distributions of remotely-sensed images of different geographic locations can cause a model trained on one geographic location to fail when used in a different geographic location (Tuia et al., 2016). This failure can be considered the heterogeneous case of transfer learning/domain adaptation. However, if the training and testing (or future inference) happen at the same geographic location, then homogeneous transfer learning/domain adaptation is the case. Since this dissertation investigates both homogeneous and heterogeneous transfer learning/domain adaptation, and the PU dataset created so far is for the heterogeneous case, here I explain the dataset that is created for the homogeneous case.

The Inria Aerial Image dataset includes different geographic locations (i.e. cities). So, I selected image patches and their corresponding label maps for one of the cities—in this case Chicago, with 6855 patches. Next, I randomly selected a subset containing half of the patches with PN labels, using an 85% and 15% portion strategy for the train and validation sets. Then, the other half is used for creating PU data with the 70%, 15%, and 15% portion strategy for the train, validation, and test sets. This results in 2914 and 514 images for the train and validation sets in the PN data, and 2399, 514, and 514 images for the train, validation, and test sets in the PU data, respectively. As in the previous case, this selection process resulted in PU5, PU6, PU7, PU8, and PU9 datasets, each of which represents a different level of missing labels for the positive class. All of the datasets described above will be used in the following two chapters, in which I develop learning models that address the research questions of this dissertation.

CHAPTER IV

DEEP TRANSFER POSITIVE AND UNLABELED LEARNING

The large quantities of frequently-acquired multi-temporal and multi-source remotely-sensed (RS) imageries provide great opportunities for real-time earth monitoring (Tuia et al., 2016). However, such monitoring requires automatic image processing, for which supervised learning models have shown success, though with the downside of relying on the availability of fully-labeled data for model training.
In addition, there is no guarantee that models trained on images from one geographic location will perform well on images from another geographic location, due to distribution shifts within the RS imageries between the two geographic locations that arise from differences in acquisition techniques, atmospheric conditions, objects of interest, and so forth (Tuia et al., 2016). There are two lines of research that address these issues: (i) transfer learning (TL) and (ii) positive and unlabeled (PU) learning. TL, and one of its specific variations, domain adaptation (DA), leverage the benefits of the multi-temporal and multi-source aspects of RS imageries. On the other hand, the goal in PU learning is to alleviate the labeled data scarcity, especially when creating fully-labeled data may not be fundamentally justified due to the interest in only one specific landcover class or object (W. Li et al., 2010). Both TL/DA (D. Zhao et al., 2021) and PU learning (Deng et al., 2018) have shown promising results in semantic segmentation of RS imageries and are of great importance for (close-to) real-time predictions on newly acquired RS imageries. However, researchers have not yet attempted to combine these two approaches, especially in the context of RS imageries. Therefore, in this chapter, I investigate the possibility of hybrid methodologies that benefit from both TL/DA and PU learning. Specifically, I start with the simpler scenario, homogeneous TL/DA and PU learning. Then, I discuss the advantages and disadvantages of such a scenario. Next, I relax the homogeneous assumption for the proposed model such that it can handle the heterogeneous case. Finally, I discuss the limitations and opportunities for further improvements.

4.1 Background

Deep learning models have been successful in many computer vision tasks, including one of the most important of them, semantic segmentation. Such success is mainly due to the supervised learning approach, which depends on collecting large amounts of dense pixel-wise labels (Mittal, Tatarchenko, & Brox, 2019). However, the high cost of collecting labeled data has resulted in researchers shifting focus to other promising areas, such as semi-supervised learning, weakly-supervised learning, and unsupervised and semi-supervised domain adaptation.

4.1.1 Semi-supervised Learning. Semi-supervised learning (SSL) focuses on the target domain (i.e. the domain of interest) and tries to alleviate the labeled data scarcity by learning a generalized model from a large amount of unlabeled data and a few limited labeled data. French, Aila, Laine, Mackiewicz, and Finlayson (2019) investigate the consistency regularization technique for semi-supervised semantic segmentation. The idea behind consistency regularization is that the trained model should generate consistent predictions for inputs of the same unlabeled image under different perturbations. They show that incorporating augmentation techniques such as CutOut (DeVries & Taylor, 2017) and CutMix (Yun et al., 2019) (with CutMix showing the superior results) improves the performance of semi-supervised semantic segmentation. The learning approach is further improved by French and Mackiewicz (2021), who incorporate color augmentation.
Finally, Olsson, Tranheden, Pinto, and Svensson (2021) propose the ClassMix augmentation technique, in which two unlabeled images are used to create a new augmented image by superimposing part of one image onto the other with the help of an object boundaries mask extracted from the network’s prediction segmentations. In addition, the corresponding prediction segmentations for each of the two input unlabeled images are also augmented to create the corresponding label map for the augmented output image.

Souly, Spampinato, and Shah (2017) use generative adversarial networks (GANs) for SSL such that the generator creates additional training data while the discriminator is a segmentation network with an additional class to account for the fake class, which is produced by the generator. The three-part loss function takes into account a standard GAN loss for the images produced by the generator, a loss for unlabeled data, and a standard cross-entropy loss for the labeled data. Hung, Tsai, Liou, Lin, and Yang (2018) propose an adversarial-network-based semi-supervised semantic segmentation. Their method consists of a segmentation network and an adversarial discriminator, in which they replace the typical GAN discriminator with a fully convolutional network (FCN)-based discriminator for distinguishing the ground truth segmentation distribution from the probability maps predicted by the segmentation network. The discriminator is the key to the semi-supervised setting, since it provides predicted labels as pseudo labels for unlabeled data to be used for the training of the segmentation network. Similarly, Mittal et al. (2019) propose a model in which the generator is the segmentation network and the discriminator distinguishes the ground truth segmentation maps from the generated ones. However, in addition to the GAN branch, they add a multi-label classifier as a second branch, based on a modified Mean Teacher (Tarvainen & Valpola, 2017), which is responsible for filtering the segmentation network’s false positive predictions. Alonso, Sabater, Ferstl, Montesano, and Murillo (2021) propose a teacher-student model with a memory bank that contains relevant, high-quality feature vectors from labeled data. The teacher network creates pseudo-labels for unlabeled data to be used, along with information from labeled data, for training the student network. The key element in their approach is the memory bank and its corresponding contrastive learning loss, which enforces the output features from the student network to be similar to the ones in the teacher network’s memory bank. Ke, Qiu, Li, Yan, and Lau (2020) propose a general framework for pixel-wise SSL tasks that can be applied to a wide range of pixel-wise tasks without structural adaptation. Their model utilizes two (segmentation) models with different initializations to form perturbations between them, and a flaw detector network that checks the differences between the models’ outputs and the ground truth, where, in the corresponding losses for unlabeled data, the terms for such differences are assumed to be zero.

However, while SSL has been successful, it still requires some fully-labeled images to capture the complete semantics within an image of a complete scene or object. RS imageries are large; fully labeling even one of them carries the same labeling burdens. And small patches of such images may not contain the complete semantics of objects/classes.
In addition, the negative class, in the case of RS imageries, is so diverse that, when only one class is of interest for prediction, fully labeling the rest becomes far more undesirable. Finally, and most importantly, SSL does not address the more challenging problem of distribution shifts among RS images, a problem that makes deploying the trained models for inference much more difficult.

4.1.2 Weakly-supervised Learning. Weakly-supervised approaches assume the availability of zero labeled data and take into account auxiliary information instead, one of the most studied forms being image-level class labels (Ahn & Kwak, 2018; Z. Huang, Wang, Wang, Liu, & Wang, 2018; J. Lee, Kim, Lee, Lee, & Yoon, 2019). For example, Ahn and Kwak (2018) assume the availability of image-level class labels. First, they generate pixel-level segmentation labels for training images given their image-level class labels. Then, they train a semantic segmentation network using the generated segmentation labels. Other non-image-level-based weakly-supervised methods have also been proposed. For example, Papandreou, Chen, Murphy, and Yuille (2015) adopt an approach in which either bounding boxes of semantic objects in an image or image-level labels are available. D. Lin, Dai, Jia, He, and Sun (2016) take advantage of scribble-based image labels that are easier and faster to generate. They propose a two-part model in which a graphical model creates fully labeled images by exploring the unmarked pixels using the propagated information from scribbles, and an FCN is trained using the labels generated by the graphical model and provides semantic predictions for the graphical model.

While researchers have achieved some success, there are two problems with weakly-supervised approaches. First, the needed auxiliary information is too vague for the diverse information represented in RS images (or even their image patches). In addition, such auxiliary information is still too difficult to generate, since RS imageries cover a large geographic area. Creating auxiliary information requires going through so many different small image patches, which, although it does not represent as much work as creating dense pixel-level information, still takes considerable time and energy. Second, although weakly-supervised approaches incorporate weakly labeled examples in addition to pixel-level labels, they do “not exploit the unlabeled data to extract additional training signal” (Ouali et al., 2020, p. 2).

4.1.3 Semi-supervised Domain Adaptation. Semi-supervised domain adaptation (SSDA) tries to reduce the data distribution shift between source and target domains using fully labeled source data and a limited amount of labeled target data. SSDA is mainly focused on image classification with different approaches, including consistency training of different forms such as entropy minimization (Berthelot et al., 2019), pseudo-labeling (Sohn et al., 2020), divergence-based consistency (Gong, Wang, & Liu, 2021), and adaptive consistency (Abuduweili, Li, Shi, Xu, & Dou, 2021). However, the semantic segmentation task is more challenging, and it requires methods developed originally for it rather than image classification methods adjusted to it (S. Chen, Jia, He, Shi, & Liu, 2021). There are approaches that use image-to-image translation techniques in order to reduce the feature discrepancy between source and target images. For example, Musto and Zinelli (2020) propose a mixture of pixel-level and feature-level domain alignments in a GAN framework for translating source images to the target style.
These kinds of methods are either applied stand-alone before the segmentation process or co-trained with the segmentation network. The former is an extra step and a burden for fast training, while the latter adds extra computation burdens. However, there is research that takes advantage of adversarial learning in a different way. For example, Z. Wang et al. (2020) propose an adversarial approach with a shared feature generator strategy. In addition to the cross-entropy loss for the labeled data, they introduce two adversarial losses for cross-domain feature alignment: (i) a global adaptation using a GAN-based discriminator trying to distinguish the input feature maps of the source and target domains, and (ii) a semantic-level adaptation on both source and target data. S. Chen et al. (2021) note that such an approach is unstable in training due to the co-occurrence of adversarial training and weak supervision. They also claim that the full potential of the labeled data from the two domains is not utilized; therefore, they propose region-level and sample-level data mixing methods applied to source and target labeled data to reduce the data distribution gap. Two complementary teacher models are considered for training on the mixed data, where each of them is fed with data from one of the mixing methods. Then, an ensemble of these two pre-trained domain-mixed teachers is used as the teacher network for the student network in a knowledge distillation pipeline that includes a cross-entropy loss for target labeled data and a Kullback–Leibler (KL)-divergence loss for unlabeled data and its corresponding pseudo labels from the mixed teacher network. Finally, they use a self-training component to train the teachers for further improvement using the pseudo labels generated by the student network.

Finally, the idea of shared networks, specifically sharing feature generators (i.e. encoders), has been receiving attention due to its potential for decreasing deployment costs in favor of real-time inference. Kalluri et al. (2019) propose a universal segmentation framework that shares an encoder among different domains, where each has its own decoder; thus, their model can handle label space discrepancy. In addition to addressing feature alignment among domains, they propose a pixel-level entropy regularization as a separate module for propagating information among labeled and unlabeled images within all domains. Ouali et al. (2020) also propose an approach based on shared encoders between labeled and unlabeled images within a domain and among domains. They claim that, for semantic segmentation, it is easier to capture the low density regions in the hidden representations rather than in the input images, and, thus, they perform a cross-consistency training technique based on the resiliency of the model (i.e. the shared encoder and main decoder) to perturbations of the features generated for unlabeled images by the shared encoder. Auxiliary decoders are used to help propagate such resiliency of the shared encoder and main decoder using an unsupervised loss calculated on the outputs of the main and auxiliary decoders. The shared encoder and main decoder are further trained using the labeled images. They also show that this SSL technique can be expanded to SSDA by an alternate-training approach among domains using domain-specific main and auxiliary decoders while the shared encoder remains unique across domains.
Although, in comparison to SSL, SSDA addresses the problem of distribution shifts that often happens among RS imageries, it still requires some fully labeled images that may not be easily available in the case of RS imageries.

4.1.4 Unsupervised Domain Adaptation. Unsupervised domain adaptation (UDA) tries to reduce the data distribution shift between source and target domains using fully labeled source data and unlabeled target data (as opposed to the partial labels considered in SSDA). The approaches in UDA can be broadly categorized into (i) adversarial learning, by either transferring the style of labeled source data to the target domain (Hoffman et al., 2018) or adding a discriminator network to a segmentation network for improving the segmentation network (Luo, Zheng, Guan, Yu, & Yang, 2019), (ii) self-training (i.e. pseudo-labeling) (Y. Zou, Yu, Kumar, & Wang, 2018), (iii) consistency training (Melas-Kyriazi & Manrai, 2021), and (iv) a mix of some or all of them (Y.-C. Chen, Lin, Yang, & Huang, 2019; Y. Li, Yuan, & Vasconcelos, 2019), which is usually the case in recent models. For example, Vu, Jain, Bucher, Cord, and Pérez (2019) address the gap between data distributions using two entropy minimization approaches on pixel-wise predictions: (i) a direct MinEnt entropy minimization and (ii) an adversarial AdvEnt entropy minimization. Pan, Shin, Rameau, Lee, and Kweon (2020) adopt the adversarial AdvEnt entropy minimization approach to address inter-domain adaptation, while they propose an intra-domain adaptation using image-level mean predicted entropy maps for target domain images to classify them into easy and hard subdomains. Then, they use the pseudo labels from the easy subdomain to adapt the segmentation network from the easy to the hard subdomain. However, to capture the domain gaps among predicted labels within pixels, Yan et al. (2021) propose a pixel-level intra-domain adaptation as an alternative to image-level intra-domain adaptation. Saito, Watanabe, Ushiku, and Harada (2018) propose an adversarial approach in which the most discriminative features are learned through alternating training of a shared generator and two classifiers. Truong et al. (2021) propose a generalized form of Adversarial Entropy Minimization, Bijective Maximum Likelihood, which does not assume pixel independence and incorporates a bijective mapping network (learned using the label set of the source domain) for calculating the loss on unlabeled target data.

Research in UDA has achieved notable success in semantic image segmentation. However, “the domain gap cannot be fully alleviated due to the lack of strong supervision in the target domain” (S. Chen et al., 2021, p. 2). Also, the successful adversarial approaches are computationally expensive, are not stable during training time, and are challenging with respect to hyperparameters (Melas-Kyriazi & Manrai, 2021), which makes training the models challenging.

4.1.5 The motivation behind this work. Although collecting dense pixel-wise fully-labeled images is not preferable in RS research, it is justifiable to collect some limited labeled pixels for the landcover class of interest. Such partially labeled pixels within images can be generated by quick eyeballing and skimming of RS imageries, in comparison to the difficult process of creating dense pixel-wise labels either through field investigation or visual interpretation. Therefore, with such a labeled data situation, the aforementioned techniques are not quite applicable.
However, they provide a reasonable foundation that can be assembled, along with ideas from PU research, into a seamless PU method that takes advantage of RS imageries across domains. Considering this, I start with the naive homogeneous assumption for model development and show that such an assumption is practically hard to meet. Therefore, I move on to the heterogeneous case and its practicality for RS imageries.

The rest of this chapter is organized into multiple sections, each of which covers both the homogeneous and heterogeneous cases. In § 4.2, I discuss the baselines against which I evaluate my model. Next, I present the proposed approach for PU domain adaptation in § 4.3. In § 4.4, I examine the results, and, finally, I summarize and discuss future work in § 4.5.

4.2 Baselines

There is a dearth of research on positive and unlabeled domain adaptation (PUDA), making it hard to find good baselines. A few different studies (Lei et al., 2021; Loghmani et al., 2020; Sonntag et al., 2022) implicitly or explicitly address the problem of PUDA. However, none of these research papers is suitable to be considered as a baseline here, and I explain why below.

Lei et al. (2021) do not explicitly target PUDA. However, they consider multi-modality (including different times and locations) in the RS imageries they use. Given the way their train and test datasets are defined, one can consider this approach to be PUDA. Still, there are some major problems with their approach that make it hard to use as a baseline. Their model consists of a feature extractor and fully connected classification heads. Therefore, it introduces high computation and time costs because of the point-wise predictions. The positive (and negative) labeled data are selected very carefully using field investigation and interpretation, which negates the promise of PU learning to remove the heavy costs of labeling data. In addition, such cherry-picked labeled data make it hard to leverage more images for training, which can be seen in the very limited number of pixels used in their experiments. Next, they select a limited number of unlabeled data, and my interpretation is that they do so to avoid the class imbalance problem. However, in this way, they do not take advantage of all of the available unlabeled data in the images. Finally, their novelty seems to come from their use of the NNPU loss (Kiryo et al., 2017). Therefore, I define one of my baselines to be a more general case of Lei et al. (2021) that addresses these issues while using the NNPU loss as they did.

Loghmani et al. (2020) formulate PUDA such that the source data are considered positive and the target data unlabeled. Therefore, they completely ignore the available labels in the target domain, which makes their setting UDA-like. However, their approach is open-set DA, which is different from (closed-set) UDA. Therefore, since the open-set DA assumption that they consider is not applicable to the cases considered in this dissertation, instead of their approach, I consider a state-of-the-art UDA model as a baseline.

Sonntag et al. (2022) address PUDA by integrating UDA and PU learning for image classification. Although this research is the closest to what is done in this chapter, there are still problems that prevent it from being considered as a baseline. First, they assume the same feature space for source and target domains. While this does not invalidate the method for the homogeneous case, it is definitely problematic for the heterogeneous case.
What is more important is that their approach is geared towards image classification. The major element in their model is the candidate set that is constituted during the learning phase in step 1, which identifies reliable positive and negative examples for supervised learning in step 2. The creation of the candidate set may be feasible for image classification with tens of thousands of images at most (like the size of the datasets that they used). However, the creation of such a candidate set is not practically possible for pixels in semantic segmentation. Even if it were possible, step 2 of the approach would be similar to the method used by Lei et al. (2021), which introduces the difficulties that were mentioned before. For example, the positive and negative pixels would be distributed sparsely in a salt-and-pepper manner, preventing the supervised model in step 2 from capturing the global and local relationships needed for semantic segmentation. Therefore, following the same approach as Sonntag et al. (2022), I consider PU learning and UDA, with different settings that will be discussed later, as baselines for my proposed model.

The above research is not suitable to be considered as baselines for the reasons mentioned. Next, I discuss the baselines that I use for my proposed model.

4.2.1 Homogeneous case. I consider three different baselines for this setting: (i) PU learning on target-only data, with and without transferring pre-trained weights from ImageNet; (ii) training a model in a supervised manner using fully-labeled source data and then fine-tuning it in a PU learning manner on PU target data, where the supervised part is considered with and without transferring pre-trained weights from ImageNet; and (iii) the model that I propose with no PU loss, meaning that the target data are considered as unlabeled. The reason for the last baseline is to show the power of the proposed model in comparison to the other two baselines, and how PU data in the target domain can help improve the model’s performance. Having shown the potential of the proposed model and the incapability of the PU baseline in the heterogeneous case, I use a UDA model as the baseline for the heterogeneous case, as explained below.

4.2.2 Heterogeneous case. I consider one of the state-of-the-art models in UDA, PixMatch (Melas-Kyriazi & Manrai, 2021). The reason behind choosing PixMatch is its great performance, as well as the comprehensive performance comparison that its authors conducted with all the competitive and state-of-the-art models at the time of the PixMatch paper. There are several perturbation settings, such as Fourier and CutMix, in the PixMatch model, which are considered as different baselines. In addition, I adopt the network architecture and backbone from the PixMatch model for the proposed model in order to have a further and fairer comparison of the two models. Finally, it should be mentioned that a UDA model alone is sufficient as the baseline, since the proposed model is already compared against the NNPU-based baselines in the homogeneous setting.

4.3 Methodologies

The source and target domains provide two different sets of images. The source domain D_S = {(x_s, y_s)}_{s∈S} contains n_S images of form x_s with their corresponding fully-labeled ground truth semantic maps of form y_s. The target domain D_T = {(x_t, y_t)}_{t∈T} contains n_T images of form x_t with their corresponding PU semantic maps of form y_t = y_tp ∪ y_tu, where y_tu contains pixels that come from both positive and negative classes.
The ratio of labeled-positive to unlabeled data is controlled in the five different datasets PU5–PU9, as described in Chapter III. In all settings, the models are trained on the train and validation sets and evaluated on the test set. In the following pages, I present the proposed PUDA approach, starting with the homogeneous case and then moving on to the heterogeneous case.

4.3.1 Homogeneous case. The assumption here is that if the source and target domains contain images from the same geographic location, acquired by the same sensors, then their feature spaces should be the same, but with different feature distributions due to differences in acquisition times and atmospheric effects. At the core of the proposed approach is the shared feature generator (i.e. encoder) idea that is used by researchers such as Kalluri et al. (2019), Ouali et al. (2020), and Z. Wang et al. (2020). As shown in Fig. 9 and equation 4.1, the model consists of a supervised learning module, a self-training learning module, and a PU learning module. For the proposed model, UNET is chosen as the semantic segmentation architecture, with VGG16 as its encoder backbone. In the following sections, I start by reviewing the UNET architecture, the VGG16 encoder backbone, and the proposed model architecture; then, I discuss each of the aforementioned learning modules. Finally, I mention the performance metrics used for the models’ assessment and evaluation: accuracy, IoU, and F1-score.

L_homogeneous = L_S + λ_pu · L_T_pu + λ_pseudo · L_T_pseudo    (4.1)

Figure 9. Homogeneous transfer positive and unlabeled learning architecture (diagram elements: a supervised PN model on source images and ground truth, a PU model on target images and PU ground truth, a shared component between the two, and pseudo labels passed from the former to the latter).

4.3.1.1 Model architecture: UNET. UNET, by Ronneberger, Fischer, and Brox (2015b), is a segmentation network that was originally developed for biomedical images. The UNET architecture consists of two modules, a contracting module and an expansive module. The contracting module (i.e. encoder or feature extractor) consists of a stack of convolutional, batch-normalization, activation, and max-pooling layers. The contracting module captures the context of the input image and outputs a feature map. The expansive module is somewhat symmetric to the contracting module: it replaces max-pooling layers with transposed-convolution layers as the upsampling operators in order to reconstruct an output with the same size as the input image. Between each pair of upsampling operations, a high-resolution intermediary feature map from the contracting path is attached to the current upsampled layer for localization and for propagating context information, and is then fed to a convolution layer before being passed to the next upsampling operator. Since the network is based purely on convolutional layers and does not have any fully connected layers, it can accept input images of any size (e.g. 256 × 256 in the case of this dissertation). The batch-normalization layer is not part of the original network; rather, it was added later in order to allow the network to capture the data distribution for better convergence.
Fig. 10 shows the architecture of the UNET model: (i) the contracting module consists of the repeated application, two or three times, of a 3 × 3 convolution layer with padding and stride 1, a batch-normalization layer, and a rectified linear unit (ReLU) activation layer, followed by a 2 × 2 max pooling operation with stride 2 for downsampling, and (ii) the expansive module consists of repeated steps of a 4 × 4 transposed convolution for feature upsampling, a ReLU activation layer, a concatenation of the corresponding feature map from the contracting path, a 3 × 3 convolution layer to condense features, and another ReLU activation layer. Finally, a 1 × 1 convolution layer is applied to the last upsampled layer in order to map the feature layer to the desired number of classes.

Figure 10. UNET architecture; the left side is the contracting module (i.e. encoder), the right side is the expansive module (i.e. decoder), the brown square is the segmentation output, and the grey squares are intermediary feature maps to be concatenated to their corresponding upsampled layers. (Legend: Conv, Batch-Norm, ReLU; Copy, Concat; Max Pool; Up-Conv; Conv.)

4.3.1.2 Model backbone: VGG16. VGG16, by Simonyan and Zisserman (2014), is a Convolutional Neural Network (CNN) architecture developed for image classification. Since semantic segmentation networks evolved from image classification networks, they use one of the many available image classification networks as the backbone for their feature generators. In VGG16, there are either two or three 3 × 3 convolutional layers followed by a max-pooling layer. Such a stack of layers is applied repeatedly, which results in a reduction in the number of parameters. Finally, two Fully-Connected (FC) layers are applied, followed by a softmax layer that generates class probabilities for the output. When considered for semantic segmentation networks, the FC and softmax layers are factored out and the remainder of the network is used as the feature generator. As mentioned in the UNET section, I use a modified version of the original VGG16 architecture, in which batch-normalization layers are added to the network.

4.3.1.3 Learning module: supervised. The supervised loss function is the standard cross-entropy loss on the source domain. As shown in equation 4.2, for the source input image x_s: p_s^{h,w} is the output probability distribution at pixel (h,w); y_s^{h,w} is the corresponding ground truth label for that pixel; y_s is the corresponding ground truth semantic map; n_S is the total number of pixels in the source domain S; H is the binary cross-entropy loss shown in equation 4.3; and L_S is the total loss over all images in the source domain.

L_S = (1/n_S) ∑_{s∈S} ∑_{(h,w)∈s} H(y_s^{h,w}, p_s^{h,w}(y_s|x_s))    (4.2)

H(y, ŷ) = −y log(ŷ) − (1 − y) log(1 − ŷ)    (4.3)

4.3.1.4 Learning module: self-training. The self-training loss function is also a standard cross-entropy loss, on the target domain. However, it consists of two parts: one for the unlabeled pixels, which get their pseudo-labels from the source model, and one for the labeled positive pixels. Considering that the labeled positive pixels are used in the PU loss too, having an extra loss term in the self-training loss for the labeled positive pixels means putting more emphasis on the available information for model training. In particular, this is appropriate since the labeling mechanism is under the SCAR assumption, and thus such double emphasis on the labeled positive data does not add any bias to the learning procedure.
4.3.1.4 Learning module: self-training. The self-training loss function is also the standard cross-entropy loss, applied on the target domain. However, it consists of two parts: one for the unlabeled pixels, which get their pseudo-labels from the source model, and one for the labeled positive pixels. Considering that the labeled positive pixels are also used in the PU loss, having an extra loss term for them in the self-training loss puts more emphasis on the available information during model training. In particular, this is appropriate because the labeling mechanism follows the SCAR assumption, and thus such double emphasis on the labeled positive data does not add any bias to the learning procedure.

For calculating the self-training loss, I first need to calculate the pseudo-labels for the unlabeled pixels within all images in the target domain. Therefore, for each target image $x_t$, I pass the image through the source model to obtain the pseudo-labels $\hat{y}_{pseudo} = \arg\max(p_s(y_s \mid x_t))$ for the unlabeled pixels. For calculating the total loss over all images in the target domain, $\mathcal{L}^T_{pseudo}$, as shown in equation 4.4, the following terms are needed for a target input image $x_t$: (i) $\hat{y}_{pseudo}^{h,w}$, the pseudo-label generated by the source model for unlabeled pixel $(h,w)$; (ii) $y_t^{h,w}$, the ground truth label for positive-labeled target pixel $(h,w)$; (iii) $p_t^{h,w}$, the target output probability distribution at pixel $(h,w)$; (iv) $y_t$, the corresponding target ground truth semantic map; and (v) $n_T$, the total number of pixels in the target domain $T$. The extent to which the self-training loss contributes to the final loss is controlled by the hyper-parameter $\lambda_{pseudo}$, as shown in equation 4.1.

$$\mathcal{L}^T_{pseudo} = \frac{1}{n_T} \sum_{t \in T} \Bigg[ \sum_{(h,w) \in t_u} H\big(\hat{y}_{pseudo}^{h,w},\, p_t^{h,w}(y_t \mid x_t)\big) + \sum_{(h,w) \in t_p} H\big(y_t^{h,w},\, p_t^{h,w}(y_t \mid x_t)\big) \Bigg] \tag{4.4}$$

4.3.1.5 Learning module: positive-unlabeled. In comparison to the positive-negative (PN) problem, where fully labeled data in the label space $Y \in \{-1, +1\}$ are available, the PU problem addresses the availability of some positive labeled data together with unlabeled data from both the positive and negative classes. In the PN problem, positive and negative data are sampled from their corresponding distributions $p_p(x) = p(x \mid Y = +1)$ and $p_n(x) = p(x \mid Y = -1)$. When the goal is to learn a decision function $F$ for the problem, the empirical risk of $F$ is estimated from PN data as follows:

$$\hat{R}_{pn}(F) = \pi_p \hat{R}_p^+(F) + \pi_n \hat{R}_n^-(F) \tag{4.5}$$

where $\pi_p = p(Y = +1)$ and $\pi_n = p(Y = -1) = 1 - \pi_p$ are the positive and negative class-prior probabilities, respectively, and $\hat{R}_p^+(F) = \frac{1}{n_p} \sum_{i=1}^{n_p} \ell(F(x_i^p), +1)$ and $\hat{R}_n^-(F) = \frac{1}{n_n} \sum_{i=1}^{n_n} \ell(F(x_i^n), -1)$. In PU learning, the negative data are absent; instead, unlabeled data are available with distribution $p(x)$. In addition, the identity $\pi_n p_n(x) = p(x) - \pi_p p_p(x)$ makes it possible to replace $\pi_n R_n^-(F)$ with $R_u^-(F) - \pi_p R_p^-(F)$. Therefore, equation 4.5 can be re-written as follows:

$$\hat{R}_{pu}(F) = \pi_p \hat{R}_p^+(F) + \hat{R}_u^-(F) - \pi_p \hat{R}_p^-(F) \tag{4.6}$$

where $\hat{R}_p^-(F) = \frac{1}{n_p} \sum_{i=1}^{n_p} \ell(F(x_i^p), -1)$ and $\hat{R}_u^-(F) = \frac{1}{n_u} \sum_{i=1}^{n_u} \ell(F(x_i^u), -1)$. Equation 4.6 can yield negative empirical risks when flexible models such as neural networks overfit the data. Given that semantic segmentation models are neural networks, this type of overfitting can pose a significant problem. A modification to equation 4.6 solves the negative empirical risk problem, as shown in the following:

$$\hat{R}_{pu}(F) = \pi_p \hat{R}_p^+(F) + \max\Big\{0,\; \hat{R}_u^-(F) - \pi_p \hat{R}_p^-(F)\Big\} \tag{4.7}$$

Equation 4.7 (proposed by Kiryo et al., 2017) is referred to as the non-negative PU (NNPU) risk estimator. This PU loss function is used in the target domain and is re-written in equation 4.8, where $p_t^{h,w}$ is the target output probability distribution at pixel $(h,w)$, evaluated over the label space $y_t$ at the labeled pixels $t_p$ and unlabeled pixels $t_u$ of each target image; and $n_{T_p}$ and $n_{T_u}$ are the total numbers of labeled and unlabeled pixels in the target domain $T$, respectively. The extent to which the PU loss contributes to the final loss is controlled by the hyper-parameter $\lambda_{pu}$, as shown in equation 4.1.

$$\mathcal{L}^T_{pu} = \frac{\pi_p}{n_{T_p}} \sum_{t_p \in T} \sum_{(h,w) \in t_p} H\big({+1},\, p_t^{h,w}(y_t \mid x_{t_p})\big) + \max\Bigg\{0,\; \frac{1}{n_{T_u}} \sum_{t_u \in T} \sum_{(h,w) \in t_u} H\big({-1},\, p_t^{h,w}(y_t \mid x_{t_u})\big) - \frac{\pi_p}{n_{T_p}} \sum_{t_p \in T} \sum_{(h,w) \in t_p} H\big({-1},\, p_t^{h,w}(y_t \mid x_{t_p})\big)\Bigg\} \tag{4.8}$$
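A minimal sketch of the NNPU estimator of equations 4.7 and 4.8, written as a per-pixel PyTorch loss with binary cross-entropy as the surrogate loss; the gradient-correction refinement of Kiryo et al. (2017) (governed by their α and β settings mentioned below) is omitted here for brevity:

```python
import torch

def nnpu_loss(p: torch.Tensor, pos_mask: torch.Tensor,
              prior: float = 0.5, eps: float = 1e-7) -> torch.Tensor:
    """Non-negative PU risk (Kiryo et al., 2017) applied per pixel (Eq. 4.8).

    p:        predicted foreground probabilities in (0, 1), shape (B, 1, H, W)
    pos_mask: 1.0 where a pixel carries a positive label, 0.0 where unlabeled
    prior:    assumed class-prior probability pi_p
    """
    pos = pos_mask.float()
    unl = 1.0 - pos
    n_p = pos.sum().clamp(min=1.0)            # n_{T_p}
    n_u = unl.sum().clamp(min=1.0)            # n_{T_u}

    loss_as_pos = -torch.log(p + eps)         # H(+1, p)
    loss_as_neg = -torch.log(1.0 - p + eps)   # H(-1, p)

    risk_p_pos = (loss_as_pos * pos).sum() / n_p   # R_p^+
    risk_p_neg = (loss_as_neg * pos).sum() / n_p   # R_p^-
    risk_u_neg = (loss_as_neg * unl).sum() / n_u   # R_u^-

    # Eq. 4.7: clamp the estimated negative risk at zero
    return prior * risk_p_pos + torch.clamp(risk_u_neg - prior * risk_p_neg,
                                            min=0.0)
```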
4.3.1.6 Performance metric: accuracy. Pixel accuracy measures the percentage of correctly classified pixels within an image, which can then be averaged over all images within the dataset. Since I am addressing binary classification, the per-class and global pixel accuracies are the same, representing the pixel accuracy for the class of interest. The pixel accuracy is calculated using equation 4.9, where $TP$, $TN$, $FP$, and $FN$ are the true positive, true negative, false positive, and false negative counts obtained from the confusion matrix.

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{4.9}$$

4.3.1.7 Performance metric: IoU. Pixel accuracy can be misleading if there is class imbalance within the dataset. Therefore, pixel accuracy is always accompanied by other performance metrics such as IoU (or Jaccard index) and F1-score (or Dice coefficient), especially for semantic segmentation. Intersection over Union (IoU), shown in equation 4.10, quantifies the percentage of overlapping pixels between the model's prediction output and the ground truth semantic map. This metric can be reported per class or averaged over all classes as mean-IoU, but since my case concerns binary classes, I report IoU for the class of interest.

$$IoU = \frac{TP}{TP + FP + FN} \tag{4.10}$$

4.3.1.8 Performance metric: F1-score. The F1-score (or Dice coefficient) combines two popular metrics, precision and recall, and is designed to handle data with class imbalance. As shown in equation 4.11, the F1-score equals two times the number of overlapping pixels divided by the total number of pixels in both the model's prediction output and the ground truth semantic map.

$$F1 = \frac{2TP}{2TP + FP + FN} \tag{4.11}$$
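A minimal sketch of these three metrics computed from the confusion-matrix counts of equations 4.9 to 4.11, for binary masks:

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Pixel accuracy, IoU, and F1 for binary masks (Eqs. 4.9-4.11).

    pred, gt: integer arrays of the same shape, with 1 = class of interest.
    """
    tp = int(np.sum((pred == 1) & (gt == 1)))  # true positives
    tn = int(np.sum((pred == 0) & (gt == 0)))  # true negatives
    fp = int(np.sum((pred == 1) & (gt == 0)))  # false positives
    fn = int(np.sum((pred == 0) & (gt == 1)))  # false negatives
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "iou": tp / max(tp + fp + fn, 1),
        "f1": 2 * tp / max(2 * tp + fp + fn, 1),
    }
```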
4.3.2 Heterogeneous case. The assumption under which I am working here is that the homogeneous case is naive, since such situations are rare; a more realistic case occurs when the source and target domains contain images from different geographic locations, acquired by different sensors (i.e. satellites), with different acquisition times and, thus, different atmospheric effects (Tuia et al., 2016). As shown in Fig. 11, remote sensors are designed to capture information from Earth's surface within specific intervals of the electromagnetic spectrum called bands. The bands of RS imagery may not overlap for different sensors, causing different feature spaces for images acquired by those sensors. As shown in Fig. 12, there are misalignments of different remote sensors across different bands. In addition, the magnitude of the captured reflectance (one of the atmospheric effects) may not be the same for different satellites, which also affects the feature space. Therefore, the heterogeneous case is applicable whenever the feature spaces differ, mainly due to differences in sensors.

Figure 11. The electromagnetic spectrum captured by satellite remote sensing (shown as SRS in the image). Image adapted from Pettorelli et al. (2018).

Figure 12. The misalignments of different remote sensors across different bands. Image adapted from Rocchio and Barsi (n.d.).

I propose to take advantage of consistency-training in order to tackle the differences among feature spaces. The reasoning is that, although the feature spaces are different, they overlap: the centers of the bands are close or similar, with differences in the upper and/or lower bounds of the bands' interval ranges. Consistency-training approaches have shown their effectiveness in training models that are robust to changes such as perturbation-based changes in the input data (e.g. Y. Yang & Soatto, 2020). Therefore, I add a consistency loss to the model proposed in the homogeneous section, as shown in Fig. 13 and equation 4.12. The meaning of this addition can be understood from a teacher-student network perspective (G. Hinton, Vinyals, Dean, et al., 2015): the source model can be considered the teacher of the student network in the target domain. The differences in the electromagnetic spectrum ranges of corresponding bands (RGB in the case of this dissertation) can be considered a kind of perturbation added to images from one sensor to make them resemble images from another sensor. Thus, the proposed consistency loss, explained below, ensures that the model is robust to such changes in the input data, which results in a better feature generator for the model. Besides the UNET architecture with its VGG16 encoder backbone, I also incorporate the DeepLabV2 architecture with ResNet-101 as its encoder backbone, since these are used in the PixMatch model. In the following sections, I review the DeepLabV2 architecture, the ResNet-101 encoder backbone, and the proposed consistency-training module. The performance metrics used for model assessment and evaluation are the same as before: accuracy, IoU, and F1-score.

$$\mathcal{L}^{S}_{heterogeneous} = \mathcal{L}^{S}_{homogeneous} \qquad\qquad \mathcal{L}^{T}_{heterogeneous} = \mathcal{L}^{T}_{homogeneous} + \lambda_{consistency}\,\mathcal{L}^{T}_{consistency} \tag{4.12}$$

Figure 13. Heterogeneous transfer positive and unlabeled learning architecture.

4.3.2.1 Model architecture: DeepLabV2. DeepLabV2, by L.-C. Chen, Papandreou, Kokkinos, Murphy, and Yuille (2017), is also a segmentation network, with the innovative Atrous Convolution at its core. DeepLabV2 leverages high-level global features and fine-grained details through a multi-scale feature aggregation technique with a final element-wise summation that creates the final feature map. Next, the final feature map is passed to a bilinear interpolation layer to reconstruct an output of the same size as the input image. Fig. 14 shows the architecture schema of the DeepLabV2 model. The Atrous Convolution is an improvement on convolution kernels (for semantic segmentation, in comparison to image classification) and provides the opportunity to capture the global relationships between each pixel and its neighboring pixels at different levels, controlled by the atrous (i.e. dilation) rate, while keeping the computational costs low. DeepLabV2 applies the Atrous Convolution through the Spatial Pyramid Pooling (SPP) technique in order to create multi-scale feature maps with different atrous rates from 6 to 24. Finally, the whole path for creating the final feature map is repeated three times over 1.0, 0.75, and 0.5 downscaled input images as another form of multi-scale feature fusion.

Figure 14. DeepLabV2 architecture; top-to-bottom are three repetitions of the left-to-right path, one row each for the 1.0, 0.75, and 0.5 downscaled input images; left-to-right are multi-scale feature generators with different atrous rates from 6 to 24; the final feature map is upsampled to generate an output of the same size as the input.

It should be mentioned that DeepLabV2 also uses a Fully-connected Conditional Random Field (CRF) module, a probabilistic method that improves the label predictions of pixels around object boundaries based on pixel correlations.
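To illustrate the multi-rate idea (a sketch of the mechanism, not DeepLabV2's exact implementation), the following PyTorch module runs parallel dilated 3 × 3 convolutions at different atrous rates and fuses them by element-wise summation; the class name and rate set are assumptions consistent with the description above:

```python
import torch
import torch.nn as nn

class ASPPHead(nn.Module):
    """Sketch of an atrous-spatial-pyramid-pooling head: parallel 3x3
    convolutions with different dilation rates, fused by summation."""

    def __init__(self, in_ch: int, n_classes: int, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, n_classes, kernel_size=3,
                      padding=r, dilation=r)  # padding=r preserves H x W
            for r in rates
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # element-wise sum of the multi-rate branch outputs
        return torch.stack([b(x) for b in self.branches], dim=0).sum(dim=0)
```

Larger dilation rates enlarge the receptive field without adding parameters, which is what lets the head capture relationships among a pixel and increasingly distant neighbors at low computational cost.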
4.3.2.2 Model backbone: ResNet-101. ResNet-101, by K. He et al. (2016), is also a CNN architecture developed for image classification; it is likewise based on stacked convolutional and max-pooling layers followed, at the end, by an average-pooling layer, an FC layer, and a softmax layer. Again, when it is used for semantic segmentation, these last three layers are factored out. The innovation of the general ResNet architecture (the 101 refers to the total number of layers) is the residual block, which addresses the vanishing gradient problem in deep networks, a problem that causes accuracy degradation as more layers are added to an existing network. Residual blocks (Fig. 15) take advantage of the residual function: instead of learning a mapping function $H(x)$ (i.e. a stack of convolutional layers) between the input and the output, the attempt is to learn a residual function $F(x) = H(x) - x$ (equivalently, $H(x) = F(x) + x$). Learning the residual function is much easier and avoids the vanishing gradient problem. As opposed to ResNets with smaller numbers of layers, the 101-layer version uses 3-layer (instead of 2-layer) deep residual blocks (Fig. 16). In each residual block, if the output dimensions are the same as the input, the input is added directly to the output; otherwise, a linear projection is first applied to ensure that the dimensions match.

Figure 15. A sample building block showing the idea of residual learning. Figure adapted from K. He et al. (2016).

Figure 16. Residual learning block: 2- versus 3-layer. Figure adapted from K. He et al. (2016).

4.3.2.3 Learning module: consistency-training. The consistency-training loss is applied to the target model and measures the ℓ1-norm (shown as $\|\cdot\|_1$ in equation 4.13), i.e. the absolute value of the difference, between the probability predictions of the source model and the target model given the same source input image $x_s$. In equation 4.13, $p_s^{h,w}$ and $p_t^{h,w}$ are the output probability distributions of the source and target models at pixel $(h,w)$, respectively, over the label spaces $y_s = y_t$; and $n_S$ is the total number of pixels in the source domain $S$. The extent to which the consistency loss contributes to the final loss is controlled by the hyper-parameter $\lambda_{consistency}$, as shown in equation 4.12.

$$\mathcal{L}^T_{consistency} = \frac{1}{n_S} \sum_{s \in S} \sum_{(h,w) \in s} \big\| p_s^{h,w}(y_s \mid x_s) - p_t^{h,w}(y_t \mid x_s) \big\|_1 \tag{4.13}$$
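Equation 4.13 reduces to a mean absolute difference between the two probability maps; a minimal sketch, assuming (as is typical in teacher-student setups) that the source-model predictions are detached so that gradients only update the target model:

```python
import torch

def consistency_loss(p_src: torch.Tensor, p_tgt: torch.Tensor) -> torch.Tensor:
    """L1 consistency between source- and target-model probability maps for
    the same source image (Eq. 4.13); the mean implements the 1/n_S term."""
    # detaching the teacher output is an assumption of this sketch
    return torch.mean(torch.abs(p_src.detach() - p_tgt))
```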
4.4 Results

In this section, I present the results of the proposed method and the corresponding baselines. I again start with the homogeneous case and then move on to the heterogeneous case.

4.4.1 Homogeneous case. The performance of the models in this section is evaluated using the PN and PU datasets as the source and target domains from the homogeneous dataset created from the Inria Image Dataset, which is discussed in Chapter III. As for the baselines, two different models are considered. First, a PU model with the NNPU loss (Kiryo et al., 2017) is trained only on PU images within the target domain in two different scenarios, (i) with and (ii) without pre-trained weights from the ImageNet dataset; this model is called Target-PU. The second model, Target-FT, extends the first: the model is first trained on images from the source domain, with or without ImageNet pre-trained weights, and its weights are then used as a warm start for the PU learning phase on PU images within the target domain. The effect of the proposed model is analyzed without and with the PU learning module, resulting in two models called Seamless-U and Seamless-PU, respectively. Each of these two models is again considered in two cases: (i) with and (ii) without pre-trained weights from the ImageNet dataset. The reason for having Seamless-U alongside Seamless-PU is to investigate to what extent the assumptions made for the homogeneous case are effective in practice.

Using the three RGB channels as model input makes it possible to use the pre-trained weights from the ImageNet dataset. The optimizer is Stochastic Gradient Descent with learning rate, momentum, and weight decay equal to 0.01, 0.9, and 1e-4, respectively. I adjust the learning rate during the training phase by reducing it according to a LambdaLR scheduler. It is assumed that the class-prior probability is known and equal to $\pi_p = 0.5$. Although the exact value of the class-prior calculated from the dataset is $\sim 0.42$, I consider a buffer, since having the exact and precise value of the class-prior is too ideal and rarely (i.e. almost never) happens in a real scenario. This assumption is justifiable in the sense that it does not affect the comparison of the models, since Kiryo et al. (2017) show the robustness of NNPU to misspecification of the class-prior probability within the range $[0.8\pi_p, 1.2\pi_p] = [0.36, 0.54]$. Finally, the rest of the settings of the NNPU loss, such as the values for $\alpha$ and $\beta$, are the same as in the original paper by Kiryo et al. (2017).
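The training configuration just described corresponds to a setup along the following lines; the decay function passed to LambdaLR is an illustrative assumption, since the exact schedule is not specified here, and the placeholder model stands in for the UNET-VGG16 network:

```python
import torch

model = torch.nn.Conv2d(3, 1, 1)  # placeholder for the UNET-VGG16 network

# SGD with lr=0.01, momentum=0.9, and weight decay 1e-4, as stated above
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)

# LambdaLR multiplies the base lr by the returned factor each epoch;
# the polynomial decay below is only an assumed example schedule
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: (1.0 - epoch / 100) ** 0.9)
```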
Table 3 shows the results of running all models on the PU5, PU6, PU7, PU8, and PU9 datasets, i.e. for different levels of positive-class labeling probability. Since the Seamless-U model uses the target domain images as unlabeled, its performance is the same across the different PU datasets; this is indicated by arrows in Table 3.

Table 3. The performance of the proposed model and the baselines in the target domain.

Method | PU5 (Acc / IoU / F1) | PU6 (Acc / IoU / F1) | PU7 (Acc / IoU / F1)
Target-PU | 77.69 / 63.45 / 77.57 | 81.06 / 67.46 / 80.49 | 81.52 / 66.95 / 80.13
Target-PU (pre-train) | 73.94 / 53.42 / 69.55 | 74.50 / 53.71 / 69.76 | 74.68 / 51.67 / 67.87
Target-FT | 77.70 / 64.00 / 77.89 | 80.31 / 67.16 / 80.30 | 81.98 / 67.94 / 80.81
Target-FT (pre-train) | 77.34 / 63.51 / 77.57 | 79.61 / 66.28 / 79.66 | 81.26 / 67.16 / 80.26
Seamless-U | 87.62 / 73.69 / 84.79 | ← | ←
Seamless-U (pre-train) | 88.81 / 76.01 / 86.31 | ← | ←
Seamless-PU | 87.67 / 74.48 / 85.31 | 87.70 / 74.56 / 85.37 | 87.80 / 74.05 / 85.05
Seamless-PU (pre-train) | 88.85 / 76.63 / 86.71 | 89.21 / 77.21 / 87.09 | 89.19 / 76.54 / 86.66

Method | PU8 (Acc / IoU / F1) | PU9 (Acc / IoU / F1) | Best Case
Target-PU | 80.75 / 66.37 / 79.69 | 81.38 / 66.59 / 79.87 | PU6
Target-PU (pre-train) | 74.26 / 51.96 / 68.27 | 74.12 / 51.54 / 67.52 | PU6
Target-FT | 81.96 / 68.11 / 80.94 | 82.50 / 68.44 / 81.21 | PU9
Target-FT (pre-train) | 80.73 / 66.45 / 79.79 | 80.09 / 65.65 / 79.22 | PU7
Seamless-U | ← | ← | N/A
Seamless-U (pre-train) | ← | ← | N/A
Seamless-PU | 87.44 / 73.89 / 84.92 | 87.67 / 74.44 / 85.30 | PU6
Seamless-PU (pre-train) | 88.92 / 76.41 / 86.58 | 88.58 / 76.01 / 86.21 | PU6

The performance of both baselines degrades when they are trained with pre-trained weights from the ImageNet dataset. However, such pre-trained weights help improve the proposed model in both the U and PU cases. One interpretation is that the proposed approach finds a feature generator that is more domain-invariant, and thus it can better incorporate other related domains, such as the ImageNet dataset, without introducing a negative transfer effect. This is not the case for the baselines, which suffer negative transfer when using image domains that are not very similar to RS imagery. The best performance among all models on all PU datasets belongs to the proposed approach with the domain-invariant feature generator that takes advantage of both the PU learning module and ImageNet's pre-trained weights.

Next, I address the models' behavior on the different PU datasets. Looking at Table 3, it can be seen that the PU dataset on which the models perform best is not consistent. In other words, adding more labeled positive data does not consistently reduce the models' error or improve their performance. This goes against the general belief that more labeled data yields a better trained model. What is important, however, is that the proposed model outperforms all other models on all datasets, and especially on the PU9 dataset, which makes it valuable considering the small amount of positive labeled data required for such performance.

In the case of the Target-PU and Target-FT models, as shown in Fig. 17 and Fig. 19, the less labeled data is available, the more the gap between the train and validation losses widens. This means the models are less generalizable, since they receive fewer signals from the smaller amount of labeled data. In general, there is one other possible reason for such divergence, namely that the train and validation sets are not representative of each other in some PU datasets. However, this is not the case here, since the images share the same geographic location, the same satellite sensor, the same time, and the same atmospheric conditions. The divergence between the train and validation losses is more noticeable when using ImageNet pre-trained weights than when training from random initial weights (Fig. 17 vs. Fig. 18; and Fig. 19 vs. Fig. 20), which can be seen as the negative transfer effect discussed above. However, such divergence does not appear for the Seamless-U and Seamless-PU models, since these models try to learn the maximum information available in the joint information space of the two domains. These results are illustrated in Fig. 21, 22, 23, and 24. The figures do, however, show some irregularities in the downward trend of the loss function, which could be due to switching the encoder back and forth between the two domains during the learning phase.

Fig. 25 qualitatively supports the quantitative results in Table 3. It shows a sample of image patches, their corresponding ground truth, and predictions from the better-performing variant (with or without pre-trained weights) of each model trained on the PU5 case: (i) Target-PU without pre-trained weights, (ii) Target-FT without pre-trained weights, (iii) Seamless-U with pre-trained weights, and (iv) Seamless-PU with pre-trained weights.
Both baselines, even when provided with the maximum amount of labeled positive data (i.e. PU5), produce a very large number of false positives, while both Seamless-U and Seamless-PU do not suffer from this. Also, the false negative rate appears to be higher for Seamless-U than for Seamless-PU, which, along with the quantitative results, makes the Seamless-PU model the best among all. Finally, Fig. 26 shows the models' outputs at the other extreme, when the minimum amount of labeled positive data is available (i.e. PU9). Again, it can be seen that the proposed Seamless-PU model outperforms the rest of the models.

Figure 17. The training loss (in red) and validation loss (in green) for the Target-PU model without ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.

Figure 18. The training loss (in red) and validation loss (in green) for the Target-PU model with ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.

Figure 19. The training loss (in red) and validation loss (in green) for the Target-FT model without ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.
Figure 20. The training loss (in red) and validation loss (in green) for the Target-FT model with ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.

Figure 21. The training loss (in red) and validation loss (in green) for the Seamless-U model without ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.

Figure 22. The training loss (in red) and validation loss (in green) for the Seamless-U model with ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.

Figure 23. The training loss (in red) and validation loss (in green) for the Seamless-PU model without ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.
Figure 24. The training loss (in red) and validation loss (in green) for the Seamless-PU model with ImageNet's pre-trained weights on the (a) PU5, (b) PU6, (c) PU7, (d) PU8, and (e) PU9 datasets.

4.4.1.1 Conclusions. In this section, I presented the results of the proposed model for the homogeneous case, which is the basic building block for the heterogeneous case. I demonstrated that, even in this easier scenario (compared to the heterogeneous one), and with or without transferring pre-learned knowledge (i.e. model weights) from similar or different image domains, NNPU-based baselines do not perform as well as the proposed approach and do not show competitive results. What is more, NNPU-based baselines degrade in heterogeneous cases and cannot serve as baselines for the proposed model there. For the proposed model, the domain-invariant feature learner, accompanied by signals from PU data, has shown promising results and suggests its potential suitability for the heterogeneous setting. Building upon this, in the next section I present the results of the proposed model with the consistency-learning module for the heterogeneous setting, and show that, with only a fraction of labeled positive data, the model can handle this even more challenging scenario and perform much better than state-of-the-art UDA models.

Figure 25. Left to right: sample image patches from the homogeneous PU Inria dataset, the corresponding ground truth, and predictions from four models trained on the PU5 case: (i) Target-PU without pre-trained weights, (ii) Target-FT without pre-trained weights, (iii) Seamless-U with pre-trained weights, and (iv) Seamless-PU with pre-trained weights.

Figure 26. Left to right: sample image patches from the homogeneous PU Inria dataset, the corresponding ground truth, and predictions from four models trained on the PU9 case: (i) Target-PU without pre-trained weights, (ii) Target-FT without pre-trained weights, (iii) Seamless-U with pre-trained weights, and (iv) Seamless-PU with pre-trained weights.

4.4.2 Heterogeneous case. The performance of the models in this section is evaluated using the heterogeneous dataset, which contains the PN Massachusetts Buildings Dataset as the source domain and, as the target domain, the complete PU datasets with different labeling frequencies for the positive class (i.e. PU5, PU6, PU7, PU8, and PU9) created from the Inria Image Dataset in Chapter III. PixMatch (Melas-Kyriazi & Manrai, 2021) is considered as the baseline, with the hyper-parameter settings recommended in the original paper. PixMatch has different variations based on the type of perturbation used in it, and the extent to which each perturbation contributes is controlled by its coefficient $\lambda_i$. The original paper does not specify the value of $\lambda$ for the CutMix, Fourier, and Fourier+CutMix perturbations.
Therefore, I use the best value of $\lambda$ reported for the Augmentation perturbation, which is equal to 0.15. For the proposed Seamless-PU model, all the hyper-parameters are the same as in the homogeneous section.

Table 4 shows the performance drop of the proposed Seamless-PU model when trained in a heterogeneous scenario. The drop is not very large in the case of PU5, which has the highest amount of labeled positive data. However, for the PU9 case, which is the most interesting one, the model performs poorly. This situation calls for some additions to the proposed model to make up for the challenges caused by the heterogeneous scenario.

Table 4. Performance drop of the PUDA model (Seamless-PU) without the consistency loss on heterogeneous PU data from the Inria Image Dataset. The model uses a warm start from the ImageNet pre-trained weights.

PU Dataset | Acc | IoU | F1
PU5 | 83.58 | 68.85 | 81.51
PU6 | 79.64 | 62.28 | 76.70
PU7 | 79.12 | 60.35 | 75.22
PU8 | 72.68 | 53.04 | 69.26
PU9 | 65.31 | 34.06 | 50.66

According to Table 5, for RS imagery, the PixMatch model in its vanilla form performs as well as when accompanied by the augmentation perturbation, and even better than when the other perturbations are incorporated. In addition, the performance gap among the different perturbations is relatively large. These observations do not follow the results reported in the PixMatch paper for urban scene datasets. One explanation could be that the type of data determines which perturbation technique is most effective. This hypothesis is based on the results of Balestriero, Bottou, and LeCun (2022), who discover that not all data augmentation techniques affect the different classes within a dataset in the same way; a specific data augmentation can cause a performance drop for some classes while the rest of the classes experience performance gains. In addition, the performance of the PixMatch model depends heavily on the network architecture used. For example, changing the network architecture from DeepLabV2 with a ResNet-101 backbone to UNET with a VGG16 backbone degrades the PixMatch(Augmentations) model's IoU from 33.67 to 1.79.

As shown in Table 5, the proposed Seamless-PU model outperforms the PixMatch baseline model. I also investigate the effect of the model architecture and the type of backbone on the performance of the Seamless-PU model, changing the architecture to DeepLabV2 and the backbone to ResNet-101. Since the PixMatch model also uses an exponentially weighted moving average component in its learning process, I add such a component to the modified Seamless-PU model as well, to make everything the same and the comparison fair. Although this setting of the Seamless-PU model still outperforms the PixMatch baseline, it results in performance improvements only on the PU5, PU6, and PU7 datasets, whereas it results in performance degradations on the PU8 and PU9 datasets. The reasons behind such performance changes may be (i) the difference in the number of learnable parameters between the two architectures and/or (ii) the operations used in each of the two architectures.
In terms of the former, the DeepLabV2 model has 42,610,632 (∼43 million) learnable parameters, of which 42,500,032 (∼42.5 million) belong to the encoder, whereas the UNET model has 49,698,434 (∼50 million) learnable parameters, of which 14,723,136 (∼15 million) belong to the encoder, with each decoder having 17,487,649 (∼17.5 million) learnable parameters. The lightweight characteristics of the UNET model in the target domain are an advantage when facing a limited amount of labeled data. For the latter, my hypothesis is that operations such as the Atrous Convolution may depend on the supervision that comes from labeled data, which causes the model's performance degradation (this hypothesis needs to be further investigated in the future). Finally, Fig. 27 shows sample image patches from the heterogeneous PU Inria dataset, their corresponding ground truth, and predictions from PixMatch(Augmentations), Seamless-PU with UNET trained on the PU5 dataset, Seamless-PU with DeepLabV2 trained on the PU5 dataset, Seamless-PU with UNET trained on the PU9 dataset, and Seamless-PU with DeepLabV2 trained on the PU9 dataset.

Table 5. Performance of the unsupervised domain adaptation model (PixMatch) and the proposed PUDA model (Seamless-PU). All models use a warm start from the ImageNet pre-trained weights.

Method | Architecture | Backbone | Acc | IoU | F1 | PU Dataset
PixMatch(Fourier) | UNET | VGG16 | 57.82 | 01.78 | 03.50 | N/A
PixMatch(CutMix) | UNET | VGG16 | 57.82 | 01.87 | 03.67 | N/A
PixMatch(Augmentations) | UNET | VGG16 | 57.82 | 01.79 | 03.51 | N/A
PixMatch(Fourier + CutMix) | UNET | VGG16 | 57.81 | 01.28 | 02.52 | N/A
PixMatch | DeepLabV2 | ResNet-101 | 66.82 | 33.19 | 49.56 | N/A
PixMatch(Fourier) | DeepLabV2 | ResNet-101 | 54.21 | 23.15 | 37.51 | N/A
PixMatch(CutMix) | DeepLabV2 | ResNet-101 | 54.14 | 20.54 | 34.00 | N/A
PixMatch(Augmentations) | DeepLabV2 | ResNet-101 | 63.83 | 33.67 | 50.15 | N/A
PixMatch(Fourier + CutMix) | DeepLabV2 | ResNet-101 | 54.20 | 22.58 | 36.75 | N/A
Seamless-PU | UNET | VGG16 | 83.08 | 67.36 | 80.45 | PU5
Seamless-PU | UNET | VGG16 | 81.06 | 64.40 | 78.29 | PU6
Seamless-PU | UNET | VGG16 | 79.57 | 59.62 | 74.65 | PU7
Seamless-PU | UNET | VGG16 | 77.66 | 57.52 | 72.99 | PU8
Seamless-PU | UNET | VGG16 | 77.09 | 60.33 | 75.20 | PU9
Seamless-PU | DeepLabV2 | ResNet-101 | 84.35 | 70.46 | 82.63 | PU5
Seamless-PU | DeepLabV2 | ResNet-101 | 84.37 | 70.08 | 82.37 | PU6
Seamless-PU | DeepLabV2 | ResNet-101 | 83.67 | 67.53 | 80.55 | PU7
Seamless-PU | DeepLabV2 | ResNet-101 | 74.82 | 46.47 | 63.33 | PU8
Seamless-PU | DeepLabV2 | ResNet-101 | 70.39 | 38.00 | 54.90 | PU9

Figure 27. Left to right: sample image patches from the heterogeneous PU Inria dataset, the corresponding ground truth, and predictions from (i) PixMatch(Augmentations), (ii) Seamless-PU with UNET trained on PU5, (iii) Seamless-PU with DeepLabV2 trained on PU5, (iv) Seamless-PU with UNET trained on PU9, and (v) Seamless-PU with DeepLabV2 trained on PU9.

4.4.2.1 Ablation study. I tested the effect of the consistency-training module through the size of its contribution to the proposed Seamless-PU model, controlled by $\lambda_{consistency}$. Using different values for $\lambda_{consistency}$, I investigate different levels of trade-off in focusing on the heterogeneity of the data within the learning process. Table 6 shows the results for $\lambda_{consistency}$ = 0.05, 0.10, 0.15, 0.20, 0.25, 0.50, and 1.00, for both the UNET and DeepLabV2 architectures on the PU9 dataset. The behavior of $\lambda_{consistency}$ is not consistent across the two architectures. $\lambda_{consistency} = 0.1$ gives the best result for DeepLabV2, which is still lower than the worst-performing value of $\lambda_{consistency}$ for UNET. The best case for UNET is when the consistency-training module fully contributes to the learning process (i.e. $\lambda_{consistency} = 1.0$).

Table 6. The effect of the magnitude with which the consistency loss is incorporated, shown for the case of the PU9 dataset.
Architecture (Backbone) | Metric | λ = 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.50 | 1.00
DeepLabV2 (ResNet-101) | Acc | 70.33 | 71.77 | 72.26 | 70.35 | 72.00 | 67.44 | 70.39
DeepLabV2 (ResNet-101) | IoU | 37.37 | 42.24 | 41.79 | 36.21 | 41.10 | 27.78 | 38.00
DeepLabV2 (ResNet-101) | F1 | 54.32 | 59.30 | 58.85 | 53.04 | 58.16 | 43.33 | 54.90
UNET (VGG16) | Acc | 72.06 | 72.55 | 75.59 | 72.68 | 69.58 | 72.41 | 77.09
UNET (VGG16) | IoU | 46.40 | 44.78 | 49.89 | 44.35 | 42.47 | 42.89 | 60.33
UNET (VGG16) | F1 | 63.32 | 61.75 | 66.45 | 61.34 | 59.54 | 59.87 | 75.20

4.4.2.2 Conclusions. In this section, I presented the results of the proposed positive and unlabeled domain adaptation model when the heterogeneity of the data is taken into account within the training process. I demonstrated that, even in the hard heterogeneous scenario, the proposed approach performs well and outperforms the UDA models that target such scenarios. The incorporation of the consistency-training loss has shown promising results. However, it yielded relatively weaker results for the more complex architecture on PU datasets with relatively smaller amounts of labeled data for the positive class. Therefore, the relationship between the consistency-training module and the complexity of the model architecture needs further investigation.

4.5 Conclusion

The proposed model in both the homogeneous and heterogeneous settings outperforms the state-of-the-art baselines while (i) it does not impose additional computational burdens such as the style-transfer modules of adversarial approaches, (ii) it leverages a relatively lightweight architecture and backbone, (iii) it outperforms stand-alone PU learning and has potential for multi-domain learning, (iv) it performs twice as well as UDA models with only a fraction of labeled data from the positive class, and (v) it allows PU learning to achieve performance improvements by taking advantage of other available fully labeled datasets through transfer learning. Finally, this work can be considered multi-source multi-target domain adaptation without the hassle of introducing different networks/encoders for each domain (Isobe et al., 2021) and/or an extra domain alignment module that explicitly generates intermediate adapted domains (S. Zhao, Li, Xu, & Keutzer, 2020).

CHAPTER V

DEEP ENSEMBLE POSITIVE AND UNLABELED LEARNING

The performance of machine learning models may degrade if they fail to capture the underlying structure within the data (Dong et al., 2020a). Such failure is more probable when labeled data are not available, as in the case of positive and unlabeled (PU) data, and it can therefore cause PU models to perform worse than fully supervised models. Furthermore, defining the hypothesis space and selecting the algorithms that search it can introduce inductive biases (Utgoff, 1986, 2012), which can result in machine learning models failing to learn the problem and thus failing to generalize. One approach to addressing these two problems is ensemble learning (Dong et al., 2020a; Opitz & Maclin, 1999), which aims to leverage the collective predictive power of multiple different models in order to achieve better predictive performance than any of the participating individual models alone (Sagi & Rokach, 2018b).

Recently, ensemble learning has gained attention in PU learning research in different areas, for example Nguyen et al. (2012), P. Yang et al. (2014), P. Yang et al. (2016), Claesen et al. (2015), Jowkar and Mansoori (2016), and Basile et al. (2019).
However, despite the effectiveness and popularity of ensemble learning in remote sensing research (see Du et al., 2012, for a survey of such methods), little research has been done on ensemble PU learning in remote sensing (R. Liu et al., 2018, is one of the few examples). Furthermore, there is a need to investigate the fusion of ensemble learning with deep learning methodology in remote sensing research, since the effectiveness and power of such a fusion has been shown in deep learning research (Fort, Hu, & Lakshminarayanan, 2019), even in a limited data regime (Brigato & Iocchi, 2021). Research such as Ekim and Sertel (2021) has recognized this need in remote sensing and investigated some native ensemble approaches developed within the deep learning framework for supervised RS image classification. However, there is a dearth of research in the area of ensemble PU learning for RS imagery in general, and for RS image segmentation in particular. Therefore, in this chapter, I investigate the performance of deep learning-based ensemble methodologies for PU learning of RS imagery, and then I propose an ensemble PU learning model that outperforms all the other models. Finally, I discuss the limitations of the proposed model and future research opportunities.

5.1 Background

Despite the great performance of deep learning models in computer vision tasks, owing to the positive inductive bias introduced by convolutional neural networks (CNNs) (Cohen & Shashua, 2016; LeCun, Bengio, et al., 1995), the convergence of such models to a global minimum may never happen (G. Huang et al., 2017). This impossibility of converging to a global minimum is due to many other inductive biases, such as the increasing number of model parameters and model complexity (Wasay & Idreos, 2020), incomplete knowledge of the inductive bias of convolutional operations (Cohen & Shashua, 2016; Wasay & Idreos, 2020), and optimization techniques (Dauphin et al., 2014). Although researchers have sought to understand such inductive biases, for example pooling schemes in CNNs (Cohen & Shashua, 2016) and Stochastic Gradient Descent (SGD) (Dauphin et al., 2014), there are many more parameters embedded in the training of deep (CNN-based) learning models, making their inductive bias hard to quantify. As an example of such research, it has been shown that, as the number of model parameters increases, the number of local minima grows exponentially (Cohen & Shashua, 2016; Kawaguchi, 2016). The good news, however, is that not all local minima have an adverse effect, and deep learning models can still be generalizable (Keskar, Mudigere, Nocedal, Smelyanskiy, & Tang, 2016). For this reason, deep learning models still perform well even when converging to local optima. In addition, models that converge to different local optima with similar error rates can make different prediction errors, and thus an ensemble of a diverse collection of such models, each converging to a different local optimum, can reduce the final error rate (Cohen & Shashua, 2016; Fort et al., 2019). In such situations, the ensemble learner outperforms each of the participating individual models (Ekim & Sertel, 2021).

Deep ensemble learning approaches can be categorized into two main streams: (i) those that attempt to learn a single model through ensembling the model parameters in the training phase, and (ii) those that attempt to learn an ensemble of models and take advantage of multi-modal optimization.
There are different approaches to ensembling model parameters, such as Dropout (G. E. Hinton, Srivastava, Krizhevsky, Sutskever, & Salakhutdinov, 2012), Snapshot Ensemble (G. Huang et al., 2017), Fast Geometric Ensemble (Garipov, Izmailov, Podoprikhin, Vetrov, & Wilson, 2018), Stochastic Weight Averaging (Izmailov, Podoprikhin, Garipov, Vetrov, & Wilson, 2018), and Exponential Moving Average (Tarvainen & Valpola, 2017).

Dropout, by G. E. Hinton et al. (2012) and Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov (2014), was originally introduced as a regularization technique to prevent the model overfitting that is common in deep neural networks. Helmbold and Long (2017) show that dropout can result in a better balance between model complexity and performance, since the penalty imposed by dropout can grow exponentially as the number of hidden layers increases, while the penalty imposed by traditional regularizers, such as the L2-norm, grows only linearly. In addition, they show that dropout is scale-invariant with respect to, for example, the model parameters, making it possible to find any local optimum. Finally, Warde-Farley, Goodfellow, Courville, and Bengio (2013) introduce dropout as an ensemble learning technique, similar to bagging or boosting; this behavior of dropout has been further studied in work such as Z. Zhang, Dalca, and Sabuncu (2019).

As for the other approaches, G. Huang et al. (2017) propose a single-training multi-model learning approach called Snapshot Ensemble (SE). The authors show that, using a cyclic cosine annealing schedule for the learning rate, a model can escape multiple local optima while visiting them along its optimization path, saving the corresponding model parameters at each local optimum along the way. Therefore, at the end of a single training run, there are multiple saved models, each corresponding to a different local optimum, and an ensemble of these models is used to produce the final predictions. Garipov et al. (2018) propose the Fast Geometric Ensemble (FGE), which is based on the same logic as SE but utilizes a piecewise-linear cyclical learning rate instead. They also show that the local optima of deep neural networks are connected by paths along which the train and test losses stay low as one moves from one local optimum to another, and thus such paths can be utilized for an ensemble. Izmailov et al. (2018) propose Stochastic Weight Averaging (SWA), an improvement on FGE in that SWA approximates FGE while requiring only one model at test time. In contrast to FGE, which averages the predictions of different models, SWA averages the weights of such models. More specifically, a model is first trained in the conventional manner, using either the full number of epochs or a proportion of them, to create the initial model parameters. Training then continues with a cyclical learning rate in order to traverse multiple local optima, and the corresponding model parameters are captured along the way (with a constant learning rate, model parameters are captured at every epoch). At the end, the average of all the captured model parameters constitutes the final model parameters. Ekim and Sertel (2021) investigate SE, FGE, and SWA for image classification of RS imagery, and show that SWA outperforms the other two and the non-ensemble baseline. Finally, Tarvainen and Valpola (2017) propose an improvement on Temporal Ensembling (Laine & Aila, 2016), in which model weights are averaged using the Exponential Moving Average (EMA) method over the epochs of the training phase.
The model-ensemble approaches can be categorized into Knowledge Distillation (discussed in Chapter IV) (Tarvainen & Valpola, 2017), TreeNets (S. Lee, Purushwalkam, Cogswell, Crandall, & Batra, 2015), AdaBoost-based approaches (Mosca & Magoulas, 2017), and MotherNets (Wasay, Hentschel, Liao, Chen, & Idreos, 2020). S. Lee et al. (2015) propose TreeNets as a spectrum of ensembles ranging from single models to non-parameter-sharing ensembles of multiple models. Within this spectrum, it is possible to have models that share some layers at the beginning of the network and then diverge from each other to create different classification heads, resulting in different outputs to be averaged for the final output. Mosca and Magoulas (2017) propose using AdaBoost for CNN-based models, such that the learned model parameters of the previous round of sub-training are transferred and used as part of the extended model in the next round. MotherNets, by Wasay et al. (2020), attempt to reduce the training time of an ensemble of individual models. MotherNets capture the structural similarity within a cluster of networks: the maximum common core structure of the models in each cluster is identified as the MotherNet of that cluster. After the MotherNet is trained, the individual models within the cluster inherit its parameters and are trained further. Finally, the experiments of Fort et al. (2019) show that models with different random initializations explore the weight space better than other approaches, such as Bayesian networks, and thus deep ensembles with different random initializations are able to capture different modes of the space of solutions. A deep ensemble whose members are initialized differently can therefore capture a multi-modal solution landscape, each member discovering a different local optimum in the solution space, and such a deep ensemble can achieve better prediction performance.

All in all, each of the aforementioned avenues of deep ensemble learning has its advantages and disadvantages with respect to the balance among final performance, training time, and the number of added parameters. In addition, most of these approaches were developed for supervised settings and/or for the image classification task. Therefore, I evaluate some of these approaches against a non-ensemble model for semantic segmentation with PU data. The rest of this chapter is organized as follows. In § 5.2, I first introduce the selected ensemble approaches and the baseline against which I evaluate them, and then I introduce the proposed approach. In § 5.3, I assess the results, and, finally, I summarize and discuss future work in § 5.4.

5.2 Methodologies

In this section, different deep ensemble learning approaches are evaluated against the performance of a non-ensemble baseline model in a PU learning scenario. The Target-PU model (introduced in Chapter IV) is chosen as the baseline. The baseline uses random weights from a Gaussian distribution and does not use a warm start from ImageNet pre-trained weights, since using such weights results in a negative transfer effect (see Chapter IV). Thus, for a fair and better comparison, the ensemble models do not incorporate ImageNet pre-trained weights either. The selected ensemble models for this chapter are dropout, EMA, SWA, feature ensemble, model ensemble, TreeNet, and contextual ensemble.
The structure of all of these models is the same as the baseline, except for the contextual ensemble, which has a smaller depth than the baseline model. All models utilize a UNET architecture with VGG16 as the backbone. Finally, based on the lessons learned from these models, I propose a multi-scale ensemble approach that performs the best among them.

5.2.1 Dropout. Inspired by the results in Bartolome, Zhang, and Ramaswami (2018), I place dropout layers in the expansive path of the UNET network, such that dropout with probability p = 0.1 is applied after each concatenation of an upsampled layer and its corresponding feature map from the contracting path.

5.2.2 EMA. The model's temporal ensemble is computed using an exponential moving average of the model weights over the epochs of the training stage. As shown in equation 5.1, the averaged model weights at time $t$ ($\bar{\theta}_t$) are calculated from the average of the model parameters over the previous $t-1$ steps ($\bar{\theta}_{t-1}$) and the current parameter values at time $t$ ($\theta_t$). The value of $\gamma$ is chosen to be 0.999, which turns the simple average into a weighted average.

$$\bar{\theta}_t = \gamma \bar{\theta}_{t-1} + (1 - \gamma)\theta_t \tag{5.1}$$

5.2.3 SWA. SWA trains two models in the training phase. The first model scans the weight space to find different local optima using a cyclical learning rate scheduler; for convenience, I call this the cyclical model. After a warm-up stage for the cyclical model, its weights are used as the starting point for the second model, which I call the SWA model. The cyclical model continues contributing to the SWA model until the end of the training phase. The SWA model maintains an exponential moving average (equation 5.2) of its own weights and the weights of the cyclical model. The contribution of the cyclical model's weights is controlled and decreased over time by adjusting the parameter $\alpha$ from 1 to a value very close to 0.

$$\theta_{SWA} = (1 - \alpha)\theta_{SWA} + \alpha\,\theta_{cyclical} \tag{5.2}$$
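A minimal sketch of the two weight-averaging updates in equations 5.1 and 5.2, written as in-place PyTorch parameter updates; the model objects and the schedule that decays α are assumed to be defined elsewhere:

```python
import torch

@torch.no_grad()
def ema_update(avg_model, model, gamma: float = 0.999):
    """Exponential moving average of model weights (Eq. 5.1)."""
    for p_avg, p in zip(avg_model.parameters(), model.parameters()):
        p_avg.mul_(gamma).add_(p, alpha=1.0 - gamma)

@torch.no_grad()
def swa_update(swa_model, cyclical_model, alpha: float):
    """Moving average toward the cyclical model (Eq. 5.2); alpha is decayed
    from 1 toward a value close to 0 over the course of training."""
    for p_swa, p_cyc in zip(swa_model.parameters(),
                            cyclical_model.parameters()):
        p_swa.mul_(1.0 - alpha).add_(p_cyc, alpha=alpha)
```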
5.2.4 Feature ensemble. The idea is that two (or more) feature generators contribute to the same decoder; that is, the ensembling happens at the feature level instead of at the model-output level. Since UNET takes advantage of multiple mid-level features that are used directly in the expansive path, I implement the feature ensemble at the last feature map and at all mid-level feature maps (Fig. 28a). The ensembling of feature maps is done using a 1 × 1 convolution operation.

5.2.5 Model ensemble. In this ensemble approach, multiple random initializations are used for training multiple models, and their outputs are then used collectively in order to ensure diversity (Fig. 28b). This kind of ensemble is expensive in terms of training time; however, it may result in the best performance, as suggested by Fort et al. (2019). I investigate two 2-model ensembles and one 3-model ensemble with (i) a random initialization strategy and (ii) an averaging method for constructing the ensemble's final output from the outputs of the individual models.

5.2.6 TreeNet. TreeNets form a spectrum between a non-ensemble single model and a model ensemble. I investigate two different ways of implementing TreeNets: one in which the divergence happens at the encoder level (Fig. 28c), and one in which it happens at the decoder level (Fig. 28d).

5.2.7 Contextual ensemble. Various studies, such as Marmanis et al. (2016), Z. Zhou, Siddiquee, Tajbakhsh, and Liang (2019), Ma, Li, Zhang, Tang, and Guo (2021), and L. Chen et al. (2021), have tried to build ensemble networks by modifying the structure of the UNET network. Most recently, Q. Zhou et al. (2022) propose a UNET-based ensemble for semantic segmentation called the contextual ensemble network (Fig. 28e), which fully explores contextual features by modifying the expansive module of the UNET network. At each level within the expansive module, a stack of multi-scale feature representations from previous stages is concatenated to the stack of upsampled features and feature maps copied from the contracting module. This process combines feature maps created with different receptive fields and thus allows harvesting and leveraging multi-scale context clues.

Figure 28. Different deep ensemble architectures: (a) feature ensemble; (b) model ensemble; (c) TreeNet ensemble at the encoder; (d) TreeNet ensemble at the decoder; (e) contextual ensemble.

5.2.8 Multi-scale ensemble. The idea of feature sharing and ensembling multiple predictions is not new. The Feature Pyramid Network (FPN), by T.-Y. Lin et al. (2017), proposes using feature maps at different scales to produce multiple predictions that are combined into a final prediction. Such a final prediction performs better than a prediction based solely on the single feature map output of a feature generator network. FPN was originally developed for object detection, with a brief extension idea for image segmentation within the original paper. Following FPN, works such as Tao, Sapra, and Catanzaro (2020) and Bousselham et al. (2021) have tried to provide an ensemble learning framework for semantic segmentation. However, such models usually incorporate complex modules, such as different variations of attention and transformer modules (Z. Liu et al., 2021; Vaswani et al., 2017). Although successful, these models were developed for fully supervised learning, and, as shown in Chapter IV in the comparison of UNET-VGG16 and DeepLabV2-ResNet101, high model complexity becomes harmful rather than beneficial as the proportion of labeled data from the positive class decreases (i.e. from PU5 to PU9). This matters because it is desirable to have a model that performs well with the least amount of available labeled data from the positive class. In addition, such complex models require pre-trained weights for optimal performance, whereas pre-trained weights cause negative transfer in the case of PU data (as shown in Chapter IV). Considering all this, I propose a model based on three elements: (i) the successful lightweight UNET structure for PU learning explored in Chapter IV, (ii) the multi-scale multi-prediction idea of FPN, and (iii) the multiple-local-optima exploration idea of the SWA model. The idea of using the SWA model as the third element comes from the results of Fort et al. (2019), which show that mixing multiple ensemble approaches can further improve prediction performance.

As shown in Fig. 29, in addition to the final layer created by a series of upsampling (deconvolution) operations, the two preceding layers in the expansive module are used to create two additional prediction segmentation maps. These two additional layers are upsampled to create outputs of the same size as the original output. The upsampling is done by a bilinear operation (rather than a deconvolution operation) in order to (i) keep the model complexity low and (ii) maintain the information at each layer without deforming it. The final prediction for each of these two additional upsampled layers is produced by applying a 1 × 1 convolution, in the same way as for the original prediction output. Finally, all three prediction segmentation maps are averaged to produce the final prediction segmentation map.

Figure 29. Self-ensemble multi-scale UNET.
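A minimal sketch of the multi-scale prediction head just described, assuming hypothetical channel counts for the three decoder feature maps (the class name and argument names are illustrative, not the dissertation's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleHead(nn.Module):
    """Sketch of the self-ensemble multi-scale UNET head (Fig. 29): the two
    penultimate decoder feature maps are bilinearly upsampled to full
    resolution, mapped to predictions with 1x1 convolutions, and the three
    prediction maps are averaged."""

    def __init__(self, ch_full: int, ch_mid: int, ch_low: int,
                 n_classes: int = 1):
        super().__init__()
        self.head_full = nn.Conv2d(ch_full, n_classes, kernel_size=1)
        self.head_mid = nn.Conv2d(ch_mid, n_classes, kernel_size=1)
        self.head_low = nn.Conv2d(ch_low, n_classes, kernel_size=1)

    def forward(self, f_full, f_mid, f_low):
        size = f_full.shape[-2:]
        p1 = self.head_full(f_full)
        # bilinear (not deconvolution) upsampling keeps complexity low and
        # does not deform the information carried by each layer
        p2 = self.head_mid(F.interpolate(f_mid, size=size,
                                         mode="bilinear", align_corners=False))
        p3 = self.head_low(F.interpolate(f_low, size=size,
                                         mode="bilinear", align_corners=False))
        return (p1 + p2 + p3) / 3.0
```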
5.3 Results

The performance of the models in this section is evaluated using the PU dataset derived from the homogeneous dataset created from the Inria Image Dataset, as discussed in Chapter III. The loss function for all models is the NNPU loss, and the same evaluation metrics as in Chapter IV are used here. The Target-PU model (introduced in Chapter IV) is chosen as the non-ensemble baseline and is initialized using random weights from a Gaussian distribution. The Dropout, EMA, and SWA models are also initialized using random weights from a Gaussian distribution. The two encoders in the feature ensemble model are initialized using random weights from a Gaussian distribution and a Xavier uniform distribution, respectively; the same scheme is used for the TreeNet models. In the case of the model ensemble, there are two 2-model ensembles and one 3-model ensemble. In each of the 2-model ensembles, one model is initialized using random weights from a Gaussian distribution and the other using random weights from either a Xavier uniform or a Kaiming uniform distribution, respectively. In the 3-model ensemble, Gaussian, Xavier uniform, and Kaiming uniform distributions are used for the three models' weight initializations. Finally, the contextual ensemble and the proposed multi-scale model are initialized using random weights from a Gaussian distribution. The optimizer and learning rate scheduler hyper-parameters are the same as in Chapter IV for all models except the SWA model, for which I use the configuration suggested by the original paper (Izmailov et al., 2018).
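Since every model in this section is trained with the NNPU loss, the following is a minimal sketch of the non-negative PU risk estimator behind it (Kiryo et al., 2017). The logistic surrogate loss, the class-prior argument pi_p, and the function name are assumptions made for illustration; the loss code actually used in this dissertation may differ in its details.

```python
import torch
import torch.nn.functional as F

def nnpu_loss(scores_p, scores_u, pi_p):
    """Non-negative PU risk estimator (Kiryo et al., 2017), sketched with the
    logistic surrogate loss l(z, y) = softplus(-y * z).

    scores_p: raw model scores on labeled positive pixels/examples
    scores_u: raw model scores on unlabeled pixels/examples
    pi_p:     class prior, i.e., the assumed fraction of positives
    """
    risk_p_pos = F.softplus(-scores_p).mean()  # positives scored as positive
    risk_p_neg = F.softplus(scores_p).mean()   # positives scored as negative
    risk_u_neg = F.softplus(scores_u).mean()   # unlabeled scored as negative
    neg_risk = risk_u_neg - pi_p * risk_p_neg
    # Clamping at zero is what makes the estimator "non-negative"; the original
    # paper additionally adjusts the gradient step when this term goes negative,
    # a refinement omitted from this sketch.
    return pi_p * risk_p_pos + torch.clamp(neg_risk, min=0.0)
```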
According to Table 7, and excluding the proposed model, the SWA model performs the best and outperforms the baseline. The model ensemble takes second place, outperforming the baseline in some cases, such as on the PU9 dataset. Although Fort et al. (2019) show that ensembles of models with different random initializations can result in better performance, such improvements are evidently not always realized and are not guaranteed for PU data. The TreeNet models perform surprisingly poorly, given that the model ensemble does well.

Since the model ensemble is at the extreme end of the spectrum of TreeNet architectures, I further investigate the performance gap between the TreeNets and the model ensemble by analyzing the changes in the models' parameters during the training phase using the cosine similarity metric. The cosine similarity of two vectors in a multi-dimensional space is the cosine of the angle between the two vectors, and it can therefore be used to measure their alignment. Following Fort et al. (2019), cosine similarity here measures weight-space alignment along the optimization trajectory, so each of the two vectors contains the parameters (i.e., weights) of one of the two models being compared. I calculate the cosine similarity using equation 5.3 at checkpoints taken every five epochs during the training phase, where θi is the parameter set of model i. The cosine similarity of two vectors ranges from −1 to 1, where −1, 0, and 1 indicate strongly opposite, independent, and similar vectors, respectively.

cos(θ1, θ2) = θ1ᵀθ2 / (||θ1|| ||θ2||)     (5.3)

Although the shared part of the TreeNet models explores the weight space quite well (Fig. 30a and Fig. 31a), it controls how, and to what extent, the separated branches within the models explore the weight space. As shown in Fig. 30b and Fig. 31b, the smaller the shared part of a TreeNet network is, the higher the chance that the separate branches explore different areas of the weight space. For example, when the separation point is at the decoder level, the feature map generation is identical for the two branches, so the decoders tend to end up exploring the same area of the weight space. In contrast, when the separation point is within the encoder, the two models can drift away from each other in the weight space, which gives them a higher chance of finding two distinct local minima whose ensemble performs better. This is also the case when the models are completely separated: for example, Fig. 32 compares the two models with Xavier and Kaiming random weight initializations within the 3-model ensemble network. Although the two models start in the same neighborhood of the weight space, they drift apart during the training phase, which again gives them a higher chance of finding two local minima whose ensemble performs better.

Figure 30. Cosine similarity of model weights for the TreeNet (Decoder) model over checkpoints at different epochs during the training phase for the PU9 dataset: (a) the shared part (the encoder); (b) the unshared part (the two decoders).
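For completeness, equation 5.3 can be evaluated between two checkpointed models with a few lines of code; the sketch below assumes PyTorch models, and the helper name flatten_params is illustrative.

```python
import torch

def flatten_params(model):
    """Concatenate all of a model's parameters into a single 1-D vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def cosine_similarity(model_a, model_b):
    """Equation 5.3: cos(theta_1, theta_2) = theta_1^T theta_2 / (||theta_1|| ||theta_2||)."""
    v1, v2 = flatten_params(model_a), flatten_params(model_b)
    return torch.dot(v1, v2) / (v1.norm() * v2.norm())

# Evaluating this for every pair of checkpoints of two branches (or models)
# fills one of the similarity matrices visualized in Figs. 30-32.
```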
Table 7. The performance of different ensemble models compared to the performance of a single model.

Method                       PU5 (Acc/IoU/F1)     PU6 (Acc/IoU/F1)     PU7 (Acc/IoU/F1)     PU8 (Acc/IoU/F1)     PU9 (Acc/IoU/F1)     Best Case
Baseline (non-ensemble)      77.69/63.45/77.57    81.06/67.46/80.49    81.52/66.95/80.13    80.75/66.37/79.69    81.38/66.59/79.87    PU6
Dropout                      77.09/62.91/77.20    80.40/66.49/79.85    81.27/66.85/80.10    81.13/66.60/79.93    80.87/65.67/79.25    PU7
EMA                          40.94/40.94/58.07    43.33/39.85/56.95    42.48/38.90/55.98    43.13/39.63/56.73    43.20/39.71/56.82    PU5
SWA                          77.61/63.75/77.79    41.85/41.85/58.91    40.50/40.49/57.54    82.54/68.88/81.49    83.13/69.50/81.95    PU9
Feature Ensemble             57.69/07.84/14.53    50.81/27.69/43.34    40.55/40.44/57.48    42.13/40.32/57.35    48.76/31.48/47.85    PU7
2-Model Ensemble (Xavier)    76.72/62.67/76.98    79.84/66.28/79.64    81.32/66.84/80.05    80.51/66.39/79.67    81.20/66.93/80.12    PU9
2-Model Ensemble (Kaiming)   75.89/61.76/76.29    80.34/66.59/79.88    80.57/65.90/79.38    80.35/66.12/79.48    81.24/66.71/79.95    PU9
3-Model Ensemble             77.03/62.90/77.16    80.87/67.22/80.33    81.46/66.92/80.11    80.96/66.83/79.98    81.92/67.67/80.64    PU9
TreeNet (Encoder)            50.12/28.65/44.51    57.95/01.21/02.39    48.18/31.70/48.09    41.61/40.91/57.94    57.94/03.62/06.98    PU8
TreeNet (Decoder)            52.00/25.02/40.01    51.89/25.12/40.13    53.29/22.16/36.26    49.21/30.62/46.83    49.51/30.20/46.36    PU8
Contextual Ensemble          77.18/60.04/74.98    77.16/58.89/74.02    78.38/59.33/74.37    75.00/54.87/70.76    78.55/60.19/74.91    PU9
Proposed Model               79.58/65.83/79.34    82.77/69.90/82.21    82.00/68.05/80.91    82.78/69.17/81.68    83.93/70.18/82.42    PU9

Figure 31. Cosine similarity of model weights for the TreeNet (Encoder) model over checkpoints at different epochs during the training phase for the PU9 dataset: (a) the shared part (the shared encoder); (b) the unshared part (the two branches).

Figure 32. Cosine similarity of model weights for the two separately initialized models within the model ensemble over checkpoints at different epochs during the training phase for the PU9 dataset.

The contextual ensemble performs poorly because of the many convolution operations added to the expansive module of the network. I investigated options for decreasing this complexity, such as bilinear upsampling; this improved the model's performance, but the baseline model still performed better. I also tried the proposed multi-scale ensemble model without weight sharing, using three separate models with Gaussian, Xavier uniform, and Kaiming uniform distributions for the three models' weight initializations. However, this approach resulted in a minor decrease in the model's prediction performance; for example, the IoU for PU9 drops from 70.18% to 70.03%.
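Output-level ensembling of the kind used by the model ensemble and by the no-weight-sharing variant above reduces to averaging the prediction maps of independently trained models. A minimal sketch, assuming a list of trained PyTorch models, follows.

```python
import torch

@torch.no_grad()
def ensemble_predict(models, x):
    """Average the prediction maps of independently initialized and
    trained models (the output-level ensemble of Fig. 28b)."""
    return torch.stack([m(x) for m in models], dim=0).mean(dim=0)
```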
Fig. 33 shows a sample of image patches from the homogeneous PU Inria dataset together with the corresponding ground truth and the predictions of the baseline model and the proposed model, both trained on the PU9 dataset.

Figure 33. Left to right: sample image patches from the homogeneous PU Inria dataset, the corresponding ground truth, and predictions from the baseline model and the proposed model trained on the PU9 dataset.

Finally, since the proposed ensemble model is end-to-end and compatible with the network structure and the learning scheme proposed for PU domain adaptation, it is possible to create a mixture of the two in order to extend the performance improvement further. As shown in Table 8, the mixture of the proposed ensemble and transfer learning models performs the best on the PU5, PU7, and PU9 datasets, whereas its performance on PU6 and PU8 is weaker. Altogether, these results show the superior performance of the proposed model on my most challenging dataset, PU9, which has the minimum amount of available labeled data from the positive class. However, further research is needed to investigate the unsatisfying performance of the model on the PU6 and PU8 datasets.

Table 8. The mixture of the ensemble and transfer PU models.

Dataset   Metric   Seamless-PU   Ensemble Seamless-PU
PU5       Acc      88.85         89.54
          IoU      76.63         77.78
          F1       86.71         87.44
PU6       Acc      89.21         84.78
          IoU      77.21         68.91
          F1       87.09         81.53
PU7       Acc      89.19         89.54
          IoU      76.54         77.08
          F1       86.66         87.00
PU8       Acc      88.92         88.42
          IoU      76.41         75.84
          F1       86.58         86.17
PU9       Acc      88.58         89.00
          IoU      76.01         76.63
          F1       86.21         86.67

5.4 Conclusion

In this chapter, I investigated most of the available ensemble approaches developed for deep learning frameworks. Most of them, however, were developed for image classification tasks in a supervised learning framework. I demonstrated that these methods, except for SWA and the model ensemble, do not perform well in the case of PU data for semantic segmentation of RS imageries. Next, I proposed a lightweight, end-to-end ensemble approach that outperforms all the other models and shows promising results for creating a mixture with the proposed domain adaptation model for PU learning.

CHAPTER VI

CONCLUSIONS & FUTURE WORK

Remotely-sensed (RS) imageries are key elements of research in many fields of study. The large volume of RS imageries with high spatial and temporal resolution that is produced frequently requires machine learning algorithms for automatic information extraction. The labor-intensive and time-consuming nature of creating a set of representative labeled training samples, along with the many applications interested in identifying only one specific landcover or object from RS imageries, calls for incorporating the positive and unlabeled (PU) learning methodology into remote sensing research.

The research presented in this dissertation is among the first to address the problem of PU learning in the geospatial domain in general, and in remote sensing in particular. As the occurrence of PU data is inevitable and very much a reality in geospatial applications, this research aims to connect research in the machine learning and remote sensing communities. In this dissertation, I investigated two research questions:

RQ1. How can Transfer Learning be incorporated in the context of positive and unlabeled learning for semantic segmentation of satellite imagery?

RQ2. How can Ensemble Learning be incorporated in the context of positive and unlabeled learning for semantic segmentation of satellite imagery?

In summary, my research demonstrated that the naive mixture of existing PU learning models with transfer learning paradigms does not result in acceptable model performance.
To address this problem, I developed a method that overcomes the negative transfer effect for PU learning. In addition, I showed that simply adopting methods developed in computer vision, including UDA models, may not perform as well as they do in other domains, such as urban scene semantic segmentation. The proposed model shows promising results with a limited amount of labeled data from the positive class. Finally, the proposed model can be seen as a multi-target-domain model, since each city is a separate domain within the dataset. This makes the proposed model more powerful, since it can cope with changes in landcover structure at different geographic locations (for example, changes in the material used for building rooftops at different locations). Also, since there is no constraint on the number of source domains, the proposed model can easily be extended to the multi-source-domain case as well.

Next, to address the ensemble PU learning framework, I first investigated the performance of the available ensemble learning methodologies proposed for deep learning. Most of these approaches were developed for fully supervised settings and address the image classification task rather than semantic segmentation. I demonstrated that, again, simply adopting such models may not produce competitive results in comparison to non-ensemble PU models. By learning from the behavior of such models, I proposed an ensemble PU learning model that outperforms all of the discussed models and shows promising results. The proposed deep ensemble PU learning is general and end-to-end, and thus it can easily be combined with other deep learning methods, such as the proposed deep transfer PU learning.

In summary, the results of my research deliver an expansion of the family of positive and unlabeled learning methods. Such a deliverable is a valuable tool for the many real-world problems that need RS imagery segmentation but for which collecting a comprehensively labeled dataset is costly.

6.1 Future Work

The results of my dissertation research illuminated two areas of future work. First, I would like to investigate relaxing the SCAR assumption in generating PU data for research, and thus relaxing this assumption in the PU loss function. This is because it is more realistic that users are biased towards buildings/objects that they can recognize better when creating distinct patches of labeled pixels within an image. Second, it is also more realistic that a group of classes together constitutes the positive class, which falls under multi-positive and unlabeled (multi-PU) learning. For example, this is the case in remote sensing for agriculture, in which multi-class methods are preferred in order to identify a set of crops of interest. There is some research in this area in the computer vision literature, such as Xu, Xu, Xu, and Tao (2017), Shu, Lin, Yan, and Li (2020), and Teisseyre (2021), that addresses multi-PU learning. Therefore, in the future, I plan to extend my research in the context of RS imageries so that the models proposed in this dissertation cover the multi-PU case.

REFERENCES CITED

Abuduweili, A., Li, X., Shi, H., Xu, C.-Z., & Dou, D. (2021). Adaptive consistency regularization for semi-supervised transfer learning. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 6923–6932).

Ahn, J., & Kwak, S. (2018). Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation.
In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 4981–4990). Alonso, I., Sabater, A., Ferstl, D., Montesano, L., & Murillo, A. C. (2021). Semi-supervised semantic segmentation with pixel-level contrastive learning from a class-wise memory bank. In Proceedings of the ieee/cvf international conference on computer vision (pp. 8219–8228). Asgarian, A., Sobhani, P., Zhang, J. C., Mihailescu, M., Sibilia, A., Ashraf, A. B., & Taati, B. (2018). A hybrid instance-based transfer learning method. Machine Learning for Health (ML4H) Workshop at NeurIPS . Balestriero, R., Bottou, L., & LeCun, Y. (2022). The effects of regularization and data augmentation are class dependent. arXiv preprint arXiv:2204.03632 . Bartolome, C., Zhang, Y., & Ramaswami, A. (2018). Deepcell: Automating cell nuclei detection with. Bashath, S., Perera, N., Tripathi, S., Manjang, K., Dehmer, M., & Streib, F. E. (2022a). A data-centric review of deep transfer learning with applications to text data. Information Sciences , 585 , 498–528. Bashath, S., Perera, N., Tripathi, S., Manjang, K., Dehmer, M., & Streib, F. E. (2022b). A data-centric review of deep transfer learning with applications to text data. Information Sciences , 585 , 498–528. Basile, T. M. A., Di Mauro, N., Esposito, F., Ferilli, S., & Vergari, A. (2019). Ensembles of density estimators for positive-unlabeled learning. Journal of Intelligent Information Systems , 53 (2), 199–217. Bekker, J., & Davis, J. (2020). Learning from positive and unlabeled data: A survey. Machine Learning , 109 (4), 719–760. 116 Bekker, J., Robberechts, P., & Davis, J. (2019). Beyond the selected completely at random assumption for learning from positive and unlabeled data. In Joint european conference on machine learning and knowledge discovery in databases (pp. 71–85). Benediktsson, J. A., Chanussot, J., & Fauvel, M. (2007). Multiple classifier systems in remote sensing: from basics to recent developments. In International workshop on multiple classifier systems (pp. 501–512). Berthelot, D., Carlini, N., Cubuk, E. D., Kurakin, A., Sohn, K., Zhang, H., & Raffel, C. (2019). Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785 . Bhat, S., & Culotta, A. (2017). Identifying leading indicators of product recalls from online reviews using positive unlabeled learning and domain adaptation. In Proceedings of the international aaai conference on web and social media (Vol. 11, pp. 480–483). Bousselham, W., Thibault, G., Pagano, L., Machireddy, A., Gray, J., Chang, Y. H., & Song, X. (2021). Efficient self-ensemble framework for semantic segmentation. arXiv preprint arXiv:2111.13280 . Brigato, L., & Iocchi, L. (2021). On the effectiveness of neural ensembles for image classification with small datasets. arXiv preprint arXiv:2111.14493 . Brill, F., Schlaffer, S., Martinis, S., Schröter, K., & Kreibich, H. (2021). Extrapolating satellite-based flood masks by one-class classification—a test case in houston. Remote Sensing , 13 (11), 2042. Chakraborty, S., & Roy, M. (2020). A multi-level weighted transformation based neuro-fuzzy domain adaptation technique using stacked auto-encoder for land-cover classification. International Journal of Remote Sensing , 41 (17), 6831–6857. Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Special issue on learning from imbalanced data sets. ACM SIGKDD explorations newsletter , 6 (1), 1–6. Chen, H., Liu, F., Wang, Y., Zhao, L., & Wu, H. (2020). 
A variational approach for learning from positive and unlabeled data. 34th Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada.. Chen, J., & Liu, X. (2014). Transfer learning with one-class data. Pattern Recognition Letters , 37 , 32–40. 117 Chen, L., Dou, X., Peng, J., Li, W., Sun, B., & Li, H. (2021). Efcnet: Ensemble full convolutional network for semantic segmentation of high-resolution remote sensing images. IEEE Geoscience and Remote Sensing Letters , 19 , 1–5. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40 (4), 834–848. Chen, S., Jia, X., He, J., Shi, Y., & Liu, J. (2021). Semi-supervised domain adaptation based on dual-level domain mixing for semantic segmentation. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 11018–11027). Chen, X., Gong, C., & Yang, J. (2021). Cost-sensitive positive and unlabeled learning. Information Sciences , 558 , 229–245. Chen, X., Liu, F., Tu, E., Cao, L., & Yang, J. (2018). Deep-pumr: Deep positive and unlabeled learning with manifold regularization. In International conference on neural information processing (pp. 12–20). Chen, Y., Nasrabadi, N. M., & Tran, T. D. (2012). Hyperspectral image classification via kernel sparse representation. IEEE Transactions on Geoscience and Remote sensing , 51 (1), 217–231. Chen, Y.-C., Lin, Y.-Y., Yang, M.-H., & Huang, J.-B. (2019). Crdoco: Pixel-level domain transfer with cross-domain consistency. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 1791–1800). Chiaroni, F., Rahal, M.-C., Hueber, N., & Dufaux, F. (2018). Learning with a generative adversarial network from a positive unlabeled dataset for image classification. In 2018 25th ieee international conference on image processing (icip) (pp. 1368–1372). Claesen, M., De Smet, F., Suykens, J. A., & De Moor, B. (2015). A robust ensemble approach to learn from positive and unlabeled data using svm base models. Neurocomputing , 160 , 73–84. Cohen, N., & Shashua, A. (2016). Inductive bias of deep convolutional networks through pooling geometry. arXiv preprint arXiv:1605.06743 . Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in neural information processing systems , 27 . 118 Day, O., & Khoshgoftaar, T. M. (2017). A survey on heterogeneous transfer learning. Journal of Big Data, 4 (1), 1–42. Deng, X., Li, W., Liu, X., Guo, Q., & Newsam, S. (2018). One-class remote sensing classification: one-class vs. binary classifiers. International Journal of Remote Sensing , 39 (6), 1890–1910. Desloires, J., Ienco, D., Botrel, A., & Ranc, N. (2022). Positive unlabelled learning for satellite images’ time series analysis: An application to cereal and forest mapping. Remote Sensing , 14 (1), 140. Devassy, B. R., & Antony, J. K. (2021). Histopathological image classification using ensemble transfer learning. In International conference on machine learning and big data analytics (pp. 203–212). DeVries, T., & Taylor, G. W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 . Dey, V., Zhang, Y., & Zhong, M. (2010). 
A review on image segmentation techniques with remote sensing perspective (Vol. 38). na Vienna, Austria. Dhurandhar, A., & Gurumoorthy, K. S. (2020). Classifier invariant approach to learn from positive-unlabeled data. In 2020 ieee international conference on data mining (icdm) (pp. 102–111). Djerriri, K., Benyelles, Z., Attaf, D., & Cheriguene, R. S. (2019). Extraction of built-up areas from remote sensing imagery using one-class classification. In L. Bruzzone & F. Bovolo (Eds.), Image and signal processing for remote sensing xxv (Vol. 11155, pp. 610 – 616). SPIE. Retrieved from https://doi.org/10.1117/12.2535598 doi: 10.1117/12.2535598 Dong, X., Yu, Z., Cao, W., Shi, Y., & Ma, Q. (2020a). A survey on ensemble learning. Frontiers of Computer Science, 14 (2), 241–258. Dong, X., Yu, Z., Cao, W., Shi, Y., & Ma, Q. (2020b). A survey on ensemble learning. Frontiers of Computer Science, 14 (2), 241–258. Doshi, K., & Yilmaz, Y. (2020). Road damage detection using deep ensemble learning. In 2020 ieee international conference on big data (big data) (pp. 5540–5544). Du, P., Xia, J., Zhang, W., Tan, K., Liu, Y., & Liu, S. (2012). Multiple classifier system for remote sensing image classification: A review. Sensors , 12 (4), 4764–4792. 119 Du Plessis, M., Niu, G., & Sugiyama, M. (2015). Convex formulation for learning from positive and unlabeled data. In International conference on machine learning (pp. 1386–1394). Du Plessis, M. C., Niu, G., & Sugiyama, M. (2014). Analysis of learning from positive and unlabeled data. Advances in neural information processing systems , 27 . Du Plessis, M. C., Niu, G., & Sugiyama, M. (2016). Class-prior estimation for learning from positive and unlabeled data. In Asian conference on machine learning (pp. 221–236). Ekim, B., & Sertel, E. (2021). Deep neural network ensembles for remote sensing land cover and land use classification. International Journal of Digital Earth, 14 (12), 1868–1881. Elkan, C., & Noto, K. (2008). Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th acm sigkdd international conference on knowledge discovery and data mining (pp. 213–220). Fan, J., Xu, C., & Zhang, J. (2021). An ensemble learning approach of multi-model for classifying house damage. In 2021 2nd international conference on big data & artificial intelligence & software engineering (icbase) (pp. 145–152). Foody, G. M., Mathur, A., Sanchez-Hernandez, C., & Boyd, D. S. (2006). Training set size requirements for the classification of a specific class. Remote Sensing of Environment , 104 (1), 1–14. Fort, S., Hu, H., & Lakshminarayanan, B. (2019). Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757 . French, G., Aila, T., Laine, S., Mackiewicz, M., & Finlayson, G. (2019). Semi-supervised semantic segmentation needs strong, high-dimensional perturbations. French, G., & Mackiewicz, M. (2021). Colour augmentation for improved semi-supervised semantic segmentation. arXiv preprint arXiv:2110.04487 . Ganaie, M., Hu, M., et al. (2021). Ensemble deep learning: A review. arXiv preprint arXiv:2104.02395 . Ganin, Y., & Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In International conference on machine learning (pp. 1180–1189). 120 Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., & Wilson, A. G. (2018). Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems , 31 . Ghosh, R., Jia, X., & Kumar, V. (2021). 
Land cover mapping in limited labels scenario: A survey. arXiv preprint arXiv:2103.02429 . Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2015). Region-based convolutional networks for accurate object detection and segmentation. IEEE transactions on pattern analysis and machine intelligence, 38 (1), 142–158. Gong, C., Shi, H., Liu, T., Zhang, C., Yang, J., & Tao, D. (2019). Loss decomposition and centroid estimation for positive and unlabeled learning. IEEE transactions on pattern analysis and machine intelligence, 43 (3), 918–932. Gong, C., Wang, D., & Liu, Q. (2021). Alphamatch: Improving consistency for semi-supervised learning with alpha-divergence. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 13683–13692). Gu, X., Zhang, C., Shen, Q., Han, J., Angelov, P. P., & Atkinson, P. M. (2022). A self-training hierarchical prototype-based ensemble framework for remote sensing scene classification. Information Fusion, 80 , 179–204. Gui, R., Xu, X., Wang, L., Yang, R., & Pu, F. (2020). Eigenvalue statistical components-based pu-learning for polsar built-up areas extraction and cross-domain analysis. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing , 13 , 3192–3203. Guo, T., Xu, C., Huang, J., Wang, Y., Shi, B., Xu, C., & Tao, D. (2020). On positive-unlabeled classification in gan. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 8385–8393). Hammoudeh, Z., & Lowd, D. (2020). Learning from positive and unlabeled data with arbitrary positive shift. Advances in Neural Information Processing Systems , 33 , 13088–13099. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 770–778). He, X., & Chen, Y. (2020). Transferring cnn ensemble for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters , 18 (5), 876–880. 121 He, X., Chen, Y., & Ghamisi, P. (2019). Heterogeneous transfer learning for hyperspectral image classification based on convolutional neural network. IEEE Transactions on Geoscience and Remote Sensing , 58 (5), 3246–3263. Helmbold, D. P., & Long, P. M. (2017). Surprising properties of dropout in deep networks. In Conference on learning theory (pp. 1123–1146). Hinton, G., Vinyals, O., Dean, J., et al. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 , 2 (7). Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 . Hoffman, J., Tzeng, E., Park, T., Zhu, J.-Y., Isola, P., Saenko, K., . . . Darrell, T. (2018). Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning (pp. 1989–1998). Hoffman, J., Wang, D., Yu, F., & Darrell, T. (2016). Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649 . Hou, M., Chaib-Draa, B., Li, C., & Zhao, Q. (2017). Generative adversarial positive-unlabelled learning. arXiv preprint arXiv:1711.08054 . Hu, W., Le, R., Liu, B., Ji, F., Ma, J., Zhao, D., & Yan, R. (2021). Predictive adversarial learning from positive and unlabeled data. In Proceedings of the aaai conference on artificial intelligence (Vol. 35, pp. 7806–7814). Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., & Weinberger, K. Q. (2017). 
Snapshot ensembles: Train 1, get m for free. arXiv preprint arXiv:1704.00109 . Huang, X., & Zhang, L. (2012a). An svm ensemble approach combining spectral, structural, and semantic features for the classification of high-resolution remotely sensed imagery. IEEE transactions on geoscience and remote sensing , 51 (1), 257–272. Huang, X., & Zhang, L. (2012b). An svm ensemble approach combining spectral, structural, and semantic features for the classification of high-resolution remotely sensed imagery. IEEE transactions on geoscience and remote sensing , 51 (1), 257–272. Huang, Z., Wang, X., Wang, J., Liu, W., & Wang, J. (2018). Weakly-supervised semantic segmentation network with deep seeded region growing. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 7014–7023). 122 Hung, W.-C., Tsai, Y.-H., Liou, Y.-T., Lin, Y.-Y., & Yang, M.-H. (2018). Adversarial learning for semi-supervised semantic segmentation. arXiv preprint arXiv:1802.07934 . Iqbal, J., & Ali, M. (2020). Weakly-supervised domain adaptation for built-up region segmentation in aerial and satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing , 167 , 263–275. Isobe, T., Jia, X., Chen, S., He, J., Shi, Y., Liu, J., . . . Wang, S. (2021). Multi-target domain adaptation with collaborative consistency learning. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 8187–8196). Iyer, P., Sriram, A., & Lal, S. (2021). Deep learning ensemble method for classification of satellite hyperspectral images. Remote Sensing Applications: Society and Environment , 23 , 100580. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407 . Jain, S., Delano, J., Sharma, H., & Radivojac, P. (2020). Class prior estimation with biased positives and unlabeled examples. In Proceedings of the aaai conference on artificial intelligence (Vol. 34, pp. 4255–4263). Jamali, A., Mahdianpari, M., Brisco, B., Granger, J., Mohammadimanesh, F., & Salehi, B. (2021). Comparing solo versus ensemble convolutional neural networks for wetland classification using multi-spectral satellite imagery. Remote Sensing , 13 (11), 2046. Jaskie, K., & Spanias, A. (2019). Positive and unlabeled learning algorithms and applications: A survey. In 2019 10th international conference on information, intelligence, systems and applications (iisa) (pp. 1–8). Ji, S., Wang, D., & Luo, M. (2020). Generative adversarial network-based full-space domain adaptation for land cover classification from multiple-source remote sensing images. IEEE Transactions on Geoscience and Remote Sensing , 59 (5), 3816–3828. Jian, P., Chen, K., & Cheng, W. (2021). Gan-based one-class classification for remote-sensing image change detection. IEEE Geoscience and Remote Sensing Letters , 19 , 1–5. Jowkar, G.-H., & Mansoori, E. G. (2016). Perceptron ensemble of graph-based positive-unlabeled learning for disease gene identification. Computational biology and chemistry , 64 , 263–270. 123 Kalluri, T., Varma, G., Chandraker, M., & Jawahar, C. (2019). Universal semi-supervised semantic segmentation. In Proceedings of the ieee/cvf international conference on computer vision (pp. 5259–5270). Kandaswamy, C., Silva, L. M., Alexandre, L. A., & Santos, J. M. (2015). Deep transfer learning ensemble for classification. In International work-conference on artificial neural networks (pp. 335–348). 
Karbalayghareh, A., Qian, X., & Dougherty, E. R. (2018). Optimal bayesian transfer learning. IEEE Transactions on Signal Processing , 66 (14), 3724–3739. Kawaguchi, K. (2016). Deep learning without poor local minima. Advances in neural information processing systems , 29 . Ke, Z., Qiu, D., Li, K., Yan, Q., & Lau, R. W. (2020). Guided collaborative training for pixel-wise semi-supervised learning. In European conference on computer vision (pp. 429–445). Kemker, R., Salvaggio, C., & Kanan, C. (2018). Algorithms for semantic segmentation of multispectral remote sensing imagery using deep learning. ISPRS journal of photogrammetry and remote sensing , 145 , 60–77. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 . Kim, T., & Kim, C. (2020). Attract, perturb, and explore: Learning a feature alignment network for semi-supervised domain adaptation. In European conference on computer vision (pp. 591–607). Kiryo, R., Niu, G., Du Plessis, M. C., & Sugiyama, M. (2017). Positive-unlabeled learning with non-negative risk estimator. Advances in neural information processing systems , 30 . Korzh, O., Joaristi, M., & Serra, E. (2018). Convolutional neural network ensemble fine-tuning for extended transfer learning. In International conference on big data (pp. 110–123). Krupiński, M., Lewiński, S., & Malinowski, R. (2019). One class svm for building detection on sentinel-2 images. In Photonics applications in astronomy, communications, industry, and high-energy physics experiments 2019 (Vol. 11176, p. 1117635). Kumar, A., Kim, J., Lyndon, D., Fulham, M., & Feng, D. (2016). An ensemble of fine-tuned convolutional neural networks for medical image classification. IEEE journal of biomedical and health informatics , 21 (1), 31–40. 124 Laine, S., & Aila, T. (2016). Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242 . Łazęcka, M., Mielniczuk, J., & Teisseyre, P. (2021). Estimating the class prior for positive and unlabelled data via logistic regression. Advances in Data Analysis and Classification, 15 (4), 1039–1068. LeCun, Y., Bengio, Y., et al. (1995). Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks , 3361 (10), 1995. Lee, J., Kim, E., Lee, S., Lee, J., & Yoon, S. (2019). Ficklenet: Weakly and semi-supervised semantic image segmentation using stochastic inference. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 5267–5276). Lee, S., Purushwalkam, S., Cogswell, M., Crandall, D., & Batra, D. (2015). Why m heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314 . Lee, W. S., & Liu, B. (2003). Learning with positive and unlabeled examples using weighted logistic regression. In Icml (Vol. 3, pp. 448–455). Lei, L., Wang, X., Zhong, Y., Zhao, H., Hu, X., & Luo, C. (2021). Docc: Deep one-class crop classification via positive and unlabeled learning for multi-modal satellite imagery. International Journal of Applied Earth Observation and Geoinformation, 105 , 102598. Li, W. (2013). Geographic modeling with one-class data. University of California, Merced. Li, W., & Guo, Q. (2010). A maximum entropy approach to one-class classification of remote sensing imagery. International Journal of Remote Sensing , 31 (8), 2227–2235. Li, W., & Guo, Q. (2013). 
A new accuracy assessment method for one-class remote sensing classification. IEEE transactions on geoscience and remote sensing , 52 (8), 4621–4632. Li, W., Guo, Q., & Elkan, C. (2010). A positive and unlabeled learning algorithm for one-class classification of remote-sensing data. IEEE transactions on geoscience and remote sensing , 49 (2), 717–725. Li, W., Guo, Q., & Elkan, C. (2011). Can we model the probability of presence of species without absence data? Ecography , 34 (6), 1096–1105. 125 Li, Y., Yuan, L., & Vasconcelos, N. (2019). Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 6936–6945). Li, Y., Zhang, H., Xue, X., Jiang, Y., & Shen, Q. (2018). Deep learning for remote sensing image classification: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery , 8 (6), e1264. Lin, D., Dai, J., Jia, J., He, K., & Sun, J. (2016). Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 3159–3167). Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 2117–2125). Liu, B., Dai, Y., Li, X., Lee, W. S., & Yu, P. S. (2003). Building text classifiers using positive and unlabeled examples. In Third ieee international conference on data mining (pp. 179–186). Liu, B., Lee, W. S., Yu, P. S., & Li, X. (2002). Partially supervised classification of text documents. In Icml (Vol. 2, pp. 387–394). Liu, B., Liu, C., Xiao, Y., Liu, L., Li, W., & Chen, X. (2022). Adaboost-based transfer learning method for positive and unlabelled learning problem. Knowledge-Based Systems , 108162. Liu, B., Zhu, C., Wang, K., Liu, X., & Yu, W. (2011). A one-class-extraction framework for high resolution sar image classification. In Proceedings of 2011 ieee cie international conference on radar (Vol. 1, pp. 732–735). Liu, M.-Y., Breuel, T., & Kautz, J. (2017). Unsupervised image-to-image translation networks. Advances in neural information processing systems , 30 . Liu, R., Li, W., Liu, X., Lu, X., Li, T., & Guo, Q. (2018). An ensemble of classifiers based on positive and unlabeled data in one-class remote sensing classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing , 11 (2), 572–584. Liu, X., Liu, H., Datta, P., Frey, J., & Koch, B. (2020). Mapping an invasive plant spartina alterniflora by combining an ensemble one-class classification algorithm with a phenological ndvi time-series analysis approach in middle coast of jiangsu, china. Remote Sensing , 12 (24), 4010. 126 Liu, X., Liu, H., Gong, H., Lin, Z., & Lv, S. (2017). Appling the one-class classification method of maxent to detect an invasive plant spartina alterniflora with time-series analysis. Remote Sensing , 9 (11), 1120. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., . . . Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the ieee/cvf international conference on computer vision (pp. 10012–10022). Loghmani, M. R., Vincze, M., & Tommasi, T. (2020). Positive-unlabeled learning for open set domain adaptation. Pattern Recognition Letters , 136 , 198–204. Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. 
In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 3431–3440). Lu, D., & Weng, Q. (2007). A survey of image classification methods and techniques for improving classification performance. International journal of Remote sensing , 28 (5), 823–870. Lucas, B., Pelletier, C., Schmidt, D., Webb, G. I., & Petitjean, F. (2021). A bayesian-inspired, deep learning-based, semi-supervised domain adaptation technique for land cover mapping. Machine Learning , 1–33. Luo, Y., Zheng, L., Guan, T., Yu, J., & Yang, Y. (2019). Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 2507–2516). Ma, S., Li, X., Zhang, Z., Tang, J., & Guo, F. (2021). Geu-net: Rethinking the information transmission in the skip connection of u-net architecture. In 2021 ieee international conference on bioinformatics and biomedicine (bibm) (pp. 1020–1025). Mack, B., Roscher, R., Stenzel, S., Feilhauer, H., Schmidtlein, S., & Waske, B. (2016). Mapping raised bogs with an iterative one-class classification approach. ISPRS Journal of Photogrammetry and Remote Sensing , 120 , 53–64. Mack, B., & Waske, B. (2017). In-depth comparisons of maxent, biased svm and one-class svm for one-class classification of remote sensing data. Remote sensing letters , 8 (3), 290–299. Maggiori, E., Tarabalka, Y., Charpiat, G., & Alliez, P. (2017). Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark. In 2017 ieee international geoscience and remote sensing symposium (igarss) (pp. 3226–3229). 127 Marmanis, D., Wegner, J. D., Galliani, S., Schindler, K., Datcu, M., & Stilla, U. (2016). Semantic segmentation of aerial images with an ensemble of cnss. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2016 , 3 , 473–480. Maulik, U., & Chakraborty, D. (2017). Remote sensing image classification: A survey of support-vector-machine-based advanced techniques. IEEE Geoscience and Remote Sensing Magazine, 5 (1), 33–52. McDonald, G. G., Costello, C., Bone, J., Cabral, R. B., Farabee, V., Hochberg, T., . . . Zahn, O. (2021). Satellites can reveal global extent of forced labor in the world’s fishing fleet. Proceedings of the National Academy of Sciences , 118 (3). Melas-Kyriazi, L., & Manrai, A. K. (2021). Pixmatch: Unsupervised domain adaptation via pixelwise consistency training. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 12435–12445). Meng, Z., Xie, Y., & Sun, J. (2021). Short-term load forecasting by transfer learning based on positive and unlabeled learning. In The 2nd international conference on computing and data science (pp. 1–5). Mignone, P., & Pio, G. (2018). Positive unlabeled link prediction via transfer learning for gene network reconstruction. In International symposium on methodologies for intelligent systems (pp. 13–23). Mignone, P., Pio, G., D’Elia, D., & Ceci, M. (2020). Exploiting transfer learning for the reconstruction of the human gene regulatory network. Bioinformatics , 36 (5), 1553–1561. Miller, H. J., & Goodchild, M. F. (2015). Data-driven geography. GeoJournal , 80 (4), 449–461. Mittal, S., Tatarchenko, M., & Brox, T. (2019). Semi-supervised semantic segmentation with high- and low-level consistency. IEEE transactions on pattern analysis and machine intelligence, 43 (4), 1369–1379. Mnih, V. (2013). 
Machine learning for aerial image labeling (Unpublished doctoral dissertation). University of Toronto. Mosca, A., & Magoulas, G. D. (2017). Deep incremental boosting. arXiv preprint arXiv:1708.03704 . Musto, L., & Zinelli, A. (2020). Semantically adaptive image-to-image translation for domain adaptation of semantic segmentation. arXiv preprint arXiv:2009.01166 . 128 Na, B., Kim, H., Song, K., Joo, W., Kim, Y.-Y., & Moon, I.-C. (2020). Deep generative positive-unlabeled learning under selection bias. In Proceedings of the 29th acm international conference on information & knowledge management (pp. 1155–1164). Najjar, A., Kaneko, S., & Miyanaga, Y. (2017). Combining satellite imagery and open data to map road safety. In Thirty-first aaai conference on artificial intelligence. Nam, H., Lee, H., Park, J., Yoon, W., & Yoo, D. (2021). Reducing domain gap by reducing style bias. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 8690–8699). Nguyen, M. N., Li, X.-L., & Ng, S.-K. (2012). Ensemble based positive unlabeled learning for time series classification. In International conference on database systems for advanced applications (pp. 243–257). Nigam, I., Huang, C., & Ramanan, D. (2018). Ensemble knowledge transfer for semantic segmentation. In 2018 ieee winter conference on applications of computer vision (wacv) (pp. 1499–1508). Niu, G., du Plessis, M. C., Sakai, T., Ma, Y., & Sugiyama, M. (2016). Theoretical comparisons of positive-unlabeled learning against positive-negative learning. Advances in neural information processing systems , 29 . Noh, H., Hong, S., & Han, B. (2015). Learning deconvolution network for semantic segmentation. In Proceedings of the ieee international conference on computer vision (pp. 1520–1528). Nozza, D., Fersini, E., & Messina, E. (2016). Deep learning and ensemble methods for domain adaptation. In 2016 ieee 28th international conference on tools with artificial intelligence (ictai) (pp. 184–189). Olsson, V., Tranheden, W., Pinto, J., & Svensson, L. (2021). Classmix: Segmentation-based data augmentation for semi-supervised learning. In Proceedings of the ieee/cvf winter conference on applications of computer vision (pp. 1369–1378). Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of artificial intelligence research, 11 , 169–198. Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2014). Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 1717–1724). 129 Ouali, Y., Hudelot, C., & Tami, M. (2020). Semi-supervised semantic segmentation with cross-consistency training. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 12674–12684). Pan, F., Shin, I., Rameau, F., Lee, S., & Kweon, I. S. (2020). Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 3764–3773). Papandreou, G., Chen, L.-C., Murphy, K. P., & Yuille, A. L. (2015). Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the ieee international conference on computer vision (pp. 1742–1750). Perini, L., Vercruyssen, V., & Davis, J. (2020). Class prior estimation in active positive and unlabeled learning. 
In Proceedings of the 29th international joint conference on artificial intelligence and the 17th pacific rim international conference on artificial intelligence (ijcai-pricai 2020) (pp. 2915–2921). Pettorelli, N., Schulte to Bühne, H., Shapiro, A. C., & Glover-Kapfer, P. (2018). Satellite remote sensing for conservation. WWF Conservation Technology Series 1(4). Retrieved 03-19-2022, from https://www.researchgate.net/profile/Aurelie-Shapiro/publication/324537528_Conservation_Technology_Series_Issue_4_SATELLITE_REMOTE_SENSING_FOR_CONSERVATION/links/5ad45106a6fdcc2935800fac/Conservation-Technology-Series-Issue-4-SATELLITE-REMOTE-SENSING-FOR-CONSERVATION.pdf?origin=publication_detail Phillips, S. J., Anderson, R. P., & Schapire, R. E. (2006). Maximum entropy modeling of species geographic distributions. Ecological modelling , 190 (3-4), 231–259. Plazas, M., Ramos-Pollán, R., & Martínez, F. (2021). Ensemble-based approach for semisupervised learning in remote sensing. Journal of Applied Remote Sensing , 15 (3), 034509. Rahman, A., Smith, D. V., & Timms, G. (2013). Multiple classifier system for automated quality assessment of marine sensor data. In 2013 ieee eighth international conference on intelligent sensors, sensor networks and information processing (pp. 362–367). Ran, Q., Zhang, M., Li, W., & Du, Q. (2016). Change detection with one-class sparse representation classifier. Journal of Applied Remote Sensing , 10 (4), 042006. 130 Rapinel, S., & Hubert-Moy, L. (2021). One-class classification of natural vegetation using remote sensing: A review. Remote Sensing , 13 (10), 1892. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 779–788). Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems , 28 , 91–99. Rocchio, L., & Barsi, J. (n.d.). Different bands comparisons among multiple satellites. Retrieved 03-19-2022, from https://landsat.gsfc.nasa.gov/about/technical-details/ Ronneberger, O., Fischer, P., & Brox, T. (2015a). U-net: Convolutional networks for biomedical image segmentation. In International conference on medical image computing and computer-assisted intervention (pp. 234–241). Ronneberger, O., Fischer, P., & Brox, T. (2015b). U-net: Convolutional networks for biomedical image segmentation. In International conference on medical image computing and computer-assisted intervention (pp. 234–241). Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., . . . Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115 (3), 211-252. doi: 10.1007/s11263-015-0816-y Sagi, O., & Rokach, L. (2018a). Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery , 8 (4), e1249. Sagi, O., & Rokach, L. (2018b). Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery , 8 (4), e1249. Saito, K., Watanabe, K., Ushiku, Y., & Harada, T. (2018). Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 3723–3732). Santara, A., Datta, J., Sarkar, S., Garg, A., Padia, K., & Mitra, P. (2019).
Punch: Positive unlabelled classification based information retrieval in hyperspectral images. arXiv preprint arXiv:1904.04547.
Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471.
Shi, H., Pan, S., Yang, J., & Gong, C. (2018). Positive and unlabeled learning via loss decomposition and centroid estimation. In IJCAI (pp. 2689–2695).
Shi, L., Wang, Z., Pan, B., & Shi, Z. (2020). An end-to-end network for remote sensing imagery semantic segmentation via joint pixel- and representation-level domain adaptation. IEEE Geoscience and Remote Sensing Letters, 18(11), 1896–1900.
Shu, S., Lin, Z., Yan, Y., & Li, L. (2020). Learning from multi-class positive and unlabeled data. In 2020 IEEE International Conference on Data Mining (ICDM) (pp. 1256–1261).
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C. A., . . . Li, C.-L. (2020). FixMatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33, 596–608.
Song, B., Li, P., & Jia, X. (2020). New feature selection methods using sparse representation for one-class classification of remote sensing images. IEEE Geoscience and Remote Sensing Letters, 18(10), 1761–1765.
Song, B., Li, P., Li, J., & Plaza, A. (2016). One-class classification of remote sensing images using kernel sparse representation. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 9(4), 1613–1623.
Sonntag, J., Behrens, G., & Schmidt-Thieme, L. (2022). Positive-unlabeled domain adaptation. arXiv preprint arXiv:2202.05695.
Souly, N., Spampinato, C., & Shah, M. (2017). Semi supervised semantic segmentation using generative adversarial network. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5688–5696).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.
Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., & Liu, C. (2018). A survey on deep transfer learning. In International Conference on Artificial Neural Networks (pp. 270–279).
Tao, A., Sapra, K., & Catanzaro, B. (2020). Hierarchical multi-scale attention for semantic segmentation. arXiv preprint arXiv:2005.10821.
Tarvainen, A., & Valpola, H. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems, 30.
Tasar, O., Giros, A., Tarabalka, Y., Alliez, P., & Clerc, S. (2020). DAugNet: Unsupervised, multisource, multitarget, and life-long domain adaptation for semantic segmentation of satellite images. IEEE Transactions on Geoscience and Remote Sensing, 59(2), 1067–1081.
Tasar, O., Happy, S., Tarabalka, Y., & Alliez, P. (2020a). ColorMapGAN: Unsupervised domain adaptation for semantic segmentation using color mapping generative adversarial networks. IEEE Transactions on Geoscience and Remote Sensing, 58(10), 7178–7193.
Tasar, O., Happy, S., Tarabalka, Y., & Alliez, P. (2020b). SemI2I: Semantically consistent image-to-image translation for domain adaptation of remote sensing data. In IGARSS 2020 – 2020 IEEE International Geoscience and Remote Sensing Symposium (pp. 1837–1840).
Tasar, O., Tarabalka, Y., Giros, A., Alliez, P., & Clerc, S. (2020). StandardGAN: Multi-source domain adaptation for semantic segmentation of very high resolution satellite images by data standardization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 192–193).
Teisseyre, P. (2021). Classifier chains for positive unlabelled multi-label learning. Knowledge-Based Systems, 213, 106709.
Torrey, L., & Shavlik, J. (2010). Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques (pp. 242–264). IGI Global.
Truong, T.-D., Duong, C. N., Le, N., Phung, S. L., Rainwater, C., & Luu, K. (2021). BiMaL: Bijective maximum likelihood approach to domain adaptation in semantic scene segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8548–8557).
Tuia, D., Persello, C., & Bruzzone, L. (2016). Domain adaptation for the classification of remote sensing data: An overview of recent advances. IEEE Geoscience and Remote Sensing Magazine, 4(2), 41–57.
Utgoff, P. E. (1986). Shift of bias for inductive concept learning. Machine Learning: An Artificial Intelligence Approach, 2, 107–148.
Utgoff, P. E. (2012). Machine learning of inductive bias (Vol. 15). Springer Science & Business Media.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., . . . Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Vu, T.-H., Jain, H., Bucher, M., Cord, M., & Pérez, P. (2019). ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2517–2526).
Wang, A. X., Tran, C., Desai, N., Lobell, D., & Ermon, S. (2018). Deep transfer learning for crop yield prediction with remote sensing data. In Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable Societies (pp. 1–5).
Wang, S., Hu, H., Lin, T., Liu, Y., Padmanabhan, A., & Soltani, K. (2015). CyberGIS for data-intensive knowledge discovery. SIGSPATIAL Special, 6(2), 26–33.
Wang, Z., Wei, Y., Feris, R., Xiong, J., Hwu, W.-M., Huang, T. S., & Shi, H. (2020). Alleviating semantic-level shift: A semi-supervised domain adaptation method for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 936–937).
Warde-Farley, D., Goodfellow, I. J., Courville, A., & Bengio, Y. (2013). An empirical analysis of dropout in piecewise linear networks. arXiv preprint arXiv:1312.6197.
Wasay, A., Hentschel, B., Liao, Y., Chen, S., & Idreos, S. (2020). MotherNets: Rapid deep ensemble learning. Proceedings of Machine Learning and Systems, 2, 199–215.
Wasay, A., & Idreos, S. (2020). More or less: When and how to build convolutional neural network ensembles. In International Conference on Learning Representations.
Weiss, K., Khoshgoftaar, T. M., & Wang, D. (2016). A survey of transfer learning. Journal of Big Data, 3(1), 1–40.
Wu, B., Qiu, W., Jia, J., & Liu, N. (2020). Landslide susceptibility modeling using bagging-based positive-unlabeled learning. IEEE Geoscience and Remote Sensing Letters, 18(5), 766–770.
Wurm, M., Stark, T., Zhu, X. X., Weigand, M., & Taubenböck, H. (2019). Semantic segmentation of slums in satellite images using transfer learning on fully convolutional neural networks. ISPRS Journal of Photogrammetry and Remote Sensing, 150, 59–69.
Xia, J., & Yokoya, N. (2021). Building damage mapping with self-positive unlabeled learning. Artificial Intelligence for Humanitarian Assistance and Disaster Response Workshop, NeurIPS.
Xia, J., Yokoya, N., & Adriano, B. (2021). Building damage mapping with self-positive unlabeled learning. arXiv preprint arXiv:2111.02586.
Xie, M., Jean, N., Burke, M., Lobell, D., & Ermon, S. (2016). Transfer learning from deep features for remote sensing and poverty mapping. In Thirtieth AAAI Conference on Artificial Intelligence.
Xu, Y., Xu, C., Xu, C., & Tao, D. (2017). Multi-positive and unlabeled learning. In IJCAI (pp. 3182–3188).
Yan, Z., Yu, X., Qin, Y., Wu, Y., Han, X., & Cui, S. (2021). Pixel-level intra-domain adaptation for semantic segmentation. In Proceedings of the 29th ACM International Conference on Multimedia (pp. 404–413).
Yang, P., Humphrey, S. J., James, D. E., Yang, Y. H., & Jothi, R. (2016). Positive-unlabeled ensemble learning for kinase substrate prediction from dynamic phosphoproteomics data. Bioinformatics, 32(2), 252–259.
Yang, P., Li, X., Chua, H.-N., Kwoh, C.-K., & Ng, S.-K. (2014). Ensemble positive unlabeled learning for disease gene identification. PLoS ONE, 9(5), e97079.
Yang, P., Liu, W., & Yang, J. Y. H. (2017). Positive unlabeled learning via wrapper-based adaptive sampling. In IJCAI (pp. 3273–3279).
Yang, W., Yin, X., Song, H., Liu, Y., & Xu, X. (2013). Extraction of built-up areas from fully polarimetric SAR imagery via PU learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(4), 1207–1216.
Yang, Y., Liang, K. J., & Carin, L. (2020). Object detection as a positive-unlabeled problem. British Machine Vision Conference (BMVC).
Yang, Y., Lv, H., & Chen, N. (2021). A survey on ensemble learning under the era of deep learning. arXiv preprint arXiv:2101.08387.
Yang, Y., & Soatto, S. (2020). FDA: Fourier domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4085–4095).
Yu, Y., Liu, H., Fu, M., Chen, J., Wang, X., & Wang, K. (2021). A two-branch neural network for non-homogeneous dehazing via ensemble learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 193–202).
Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6023–6032).
Zhang, C., Hou, Y., & Zhang, Y. (2020). Learning from positive and unlabeled data without explicit estimation of class prior. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 6762–6769).
Zhang, Y., Zong, R., Han, J., Zheng, H., Lou, Q., Zhang, D., & Wang, D. (2019). TransLand: An adversarial transfer learning approach for migratable urban land usage classification using remote sensing. In 2019 IEEE International Conference on Big Data (Big Data) (pp. 1567–1576).
Zhang, Z., Dalca, A. V., & Sabuncu, M. R. (2019). Confidence calibration for convolutional neural networks using structured dropout. arXiv preprint arXiv:1906.09551.
Zhao, D., Li, J., Yuan, B., & Shi, Z. (2021). V2RNet: An unsupervised semantic segmentation algorithm for remote sensing images via cross-domain transfer learning. In 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS (pp. 4676–4679).
Zhao, L., Li, Q., Zhang, Y., Du, X., Wang, H., & Shen, Y. (2019). Study on the potential of whitening transformation in improving single crop mapping accuracy. Journal of Applied Remote Sensing, 13(3), 034512.
Zhao, S., Li, B., Xu, P., & Keutzer, K. (2020). Multi-source domain adaptation in the deep learning era: A systematic survey. arXiv preprint arXiv:2002.12169.
Zheng, P., Yuan, S., Wu, X., Li, J., & Lu, A. (2019). One-class adversarial nets for fraud detection. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, pp. 1286–1293).
Zhou, Q., Wu, X., Zhang, S., Kang, B., Ge, Z., & Latecki, L. J. (2022). Contextual ensemble network for semantic segmentation. Pattern Recognition, 122, 108290.
Zhou, Z., Siddiquee, M. M. R., Tajbakhsh, N., & Liang, J. (2019). UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Transactions on Medical Imaging, 39(6), 1856–1867.
Zhou, Z., Zhang, F., Xiao, H., Wang, F., Hong, X., Wu, K., & Zhang, J. (2021). A novel ground-based cloud image segmentation method by using deep transfer learning. IEEE Geoscience and Remote Sensing Letters, 19, 1–5.
Zhou, Z.-H. (2021). Ensemble learning. In Machine Learning (pp. 181–210). Springer.
Zhu, C., Liu, B., Yu, Q., Liu, X., & Yu, W. (2012). A spy positive and unlabeled learning classifier and its application in HR SAR image scene interpretation. In 2012 IEEE Radar Conference (pp. 0516–0521).
Zhu, X. X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., & Fraundorfer, F. (2017). Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine, 5(4), 8–36.
Zou, M., & Zhong, Y. (2018). Transfer learning for classification of optical satellite image. Sensing and Imaging, 19(1), 1–13.
Zou, Y., Yu, Z., Kumar, B. V., & Wang, J. (2018, September). Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV).