ISSN: 2168-6831

WWW.GRSS-IEEE.ORG
DECEMBER 2021 VOLUME 9, NUMBER 4

FEATURES
8 Methods for Small, Weak Object Detection in Optical High-Resolution Remote Sensing Images
by Wei Han, Jia Chen, Lizhe Wang, Ruyi Feng, Fengpeng Li, Lin Wu, Tian Tian, and Jining Yan
35 Hyperspectral Image Clustering
by Han Zhai, Hongyan Zhang, Pingxiang Li, and Liangpei Zhang
68 Change Detection From Very-High-Spatial-Resolution Optical Remote Sensing Images
by Dawei Wen, Xin Huang, Francesca Bovolo, Jiayi Li, Xinli Ke, Anlu Zhang, and Jón Atli Benediktsson
102 The CCSDS 123.0-B-2 "Low-Complexity Lossless and Near-Lossless Multispectral and Hyperspectral Image Compression" Standard
by Miguel Hernández-Cabronero, Aaron B. Kiely, Matthew Klimesh, Ian Blanes, Jonathan Ligo, Enrico Magli, and Joan Serra-Sagristà
120 Advances and Opportunities in Remote Sensing Image Geometric Registration
by Ruitao Feng, Huanfeng Shen, Jianjun Bai, and Xinghua Li
143 Deep Learning Meets SAR
by Xiao Xiang Zhu, Sina Montazeri, Mohsin Ali, Yuansheng Hua, Yuanyuan Wang, Lichao Mou, Yilei Shi, Feng Xu, and Richard Bamler
173 Forward-Looking Ground-Penetrating Radar
by Davide Comite, Fauzia Ahmad, Moeness G. Amin, and Traian Dogaru
191 Gaussianizing the Earth
by J. Emmanuel Johnson, Valero Laparra, María Piles, and Gustau Camps-Valls

ON THE COVER: The cover of this issue illustrates the development trend of high-resolution remote sensing (HRRS) data sets over the last decade. The feature by Han et al., on page 8, reviews the use of these data sets in the development, verification, and evaluation of new algorithms for the detection of objects in HRRS images.

SCOPE: IEEE Geoscience and Remote Sensing Magazine (GRSM) will inform readers of activities in the IEEE Geoscience and Remote Sensing Society, its technical committees, and chapters. GRSM will also inform and educate readers via technical papers, provide information on international remote sensing activities and new satellite missions, publish contributions on education activities, industrial and university profiles, conference news, book reviews, and a calendar of important events.

Digital Object Identifier 10.1109/MGRS.2021.3120176
FEATURES (CONTINUED)
209 Wireless Sensor Networks Applied to Precision Agriculture
by Mónica Karel Huerta, Andrea García-Cedeño, Juan Carlos Guillermo, and Roger Clotet
223 Spectral Variability in Hyperspectral Data Unmixing
by Ricardo Augusto Borsoi, Tales Imbiriba, José Carlos Moreira Bermudez, Cédric Richard, Jocelyn Chanussot, Lucas Drumetz, Jean-Yves Tourneret, Alina Zare, and Christian Jutten

COLUMNS & DEPARTMENTS
3 FROM THE EDITOR
6 PRESIDENT'S MESSAGE
271 WOMEN IN GRSS
274 TECHNICAL COMMITTEES
284 CHAPTERS
289 EDUCATION
293 IN MEMORIAM

EDITORIAL BOARD
Dr. James L. Garrison, Editor-in-Chief, School of Aeronautics and Astronautics, Purdue University, West Lafayette, Indiana 47907 USA, Email: jlg@ieee.org
Dr. Paolo Gamba, University of Pavia, Italy
Dr. Linda Hayden, Center of Excellence in Remote Sensing Education and Research, Elizabeth City State University, USA, Email: haydenl@mindspring.com
Dr. Irena Hajnsek, ETH Zürich, Switzerland, and DLR, Germany, Email: Irena.Hajnsek@dlr.de
Dr. Michael Inggs, University of Cape Town, South Africa, Email: mikings@gmail.com
Dr. John Kerekes, Cochair, Conference Advisory Committee, Rochester Institute of Technology, USA, Email: kerekes@cis.rit.edu
Dr. David M. Le Vine, NASA Goddard Space Flight Center, USA, Email: David.M.LeVine@nasa.gov
Dr. Gail Skofronick Jackson, NASA Goddard Space Flight Center, USA, Email: Gail.S.Jackson@nasa.gov
Dr. Marwan Younis, DLR, Germany, Email: marwan.younis@dlr.de

MISSION STATEMENT
The IEEE Geoscience and Remote Sensing Society of the IEEE seeks to advance science and technology in geoscience, remote sensing, and related fields using conferences, education, and other resources.

GRS OFFICERS
President: Dr. David Kunkee, The Aerospace Corporation, USA
Executive Vice President: Dr. Mariko Burgin, Jet Propulsion Laboratory (JPL), USA
Vice President of Publications: Dr. William Emery, University of Colorado, USA
Vice President of Information Resources: Dr. Sidharth Misra, Jet Propulsion Laboratory (JPL), USA
Vice President of Professional Activities: Dr. Lorenzo Bruzzone, University of Trento, Italy
Vice President of Meetings and Symposia: Dr. Saibun Tjuatja, The University of Texas at Arlington
Vice President of Technical Activities: Dr. Fabio Pacifici, Maxar, USA
Secretary: Dr. Steven C. Reising, Colorado State University, USA
Chief Financial Officer: Dr. John Kerekes, Rochester Institute of Technology, USA

IEEE PERIODICALS MAGAZINES DEPARTMENT
Journals Production Manager: Sara T. Scudder
Senior Managing Editor: Geraldine Krolin-Taylor
Senior Art Director: Janet Dudar
Associate Art Director: Gail A. Schnitzer
Production Coordinator: Theresa L. Smith
Director, Business Development, Media & Advertising: Mark David, +1 732 465 6473, m.david@ieee.org, Fax: +1 732 981 1855
Advertising Production Manager: Felicia Spagnoli
Production Director: Peter M. Tuohy
Editorial Services Director: Kevin Lisankie
Senior Director, Publishing Operations: Dawn M. Melley

IEEE Geoscience and Remote Sensing Magazine (ISSN 2168-6831) is published quarterly by The Institute of Electrical and Electronics Engineers, Inc., IEEE Headquarters: 3 Park Ave., 17th Floor, New York, NY 10016-5997, +1 212 419 7900. Responsibility for the contents rests upon the authors and not upon the IEEE, the Society, or its members. IEEE Service Center (for orders, subscriptions, address changes): 445 Hoes Lane, Piscataway, NJ 08854, +1 732 981 0060. Individual copies: IEEE members US$20.00 (first copy only), nonmembers US$110.00 per copy. Subscription rates: included in Society fee for each member of the IEEE Geoscience and Remote Sensing Society. Nonmember subscription prices available on request. Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limits of U.S. Copyright Law for private use of patrons: 1) those post-1977 articles that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923 USA; 2) pre-1978 articles without fee. For all other copying, reprint, or republication information, write to: Copyrights and Permission Department, IEEE Publishing Services, 445 Hoes Lane, Piscataway, NJ 08854 USA. Copyright © 2021 by the Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Application to Mail at Periodicals Postage Prices is pending at New York, New York, and at additional mailing offices. Canadian GST #125634188. Canada Post Corporation (Canadian distribution) publications mail agreement number 40013885. Return undeliverable Canadian addresses to PO Box 122, Niagara Falls, ON L2E 6S8 Canada. Printed in USA. IEEE prohibits discrimination, harassment, and bullying. For more information, visit http://www.ieee.org/web/aboutus/whatis/policies/p9-26.html.

Digital Object Identifier 10.1109/MGRS.2021.3120166
FROM THE EDITOR
BY JAMES L. GARRISON

Introducing the December Issue

Digital Object Identifier 10.1109/MGRS.2021.3129109
Date of current version: 14 January 2022

Welcome to the December 2021 issue of IEEE Geoscience and Remote Sensing Magazine! Our theme in this issue is "innovative methods for different modalities." With this in mind, we have nine feature articles covering a variety of different processing and analysis techniques with applications across a range of remote sensing modalities.
Our first five features lie in the optical spectrum. We start off with the problem of identifying small, weak, and typically anthropogenic objects in high-resolution remote sensing images. Applications such as urban monitoring, military reconnaissance, and national security all make use of this capability. Han et al. review the challenges of this problem. A broad range of object-detection frameworks are described, including template matching, object-based image analysis, classical machine learning, and deep learning (DL). These are applied to 13 widely used data sets and evaluated for detection speed and accuracy. Recent advances to improve performance in the presence of image degradation, sensor limitations, object variation, and insufficient training data, as well as improvements in suppressing background information and incorporating related context information, are presented. The article concludes by identifying promising future research directions, including the use of multisource data fusion, weakly supervised detection, automatic neural architecture search, and a universal object framework. This issue's cover image was taken from Figure 6 of the article.
Hyperspectral images (HSIs) are high-dimensional data sets, which can be characterized as having a "cube" structure with thousands of spectral bands forming the third dimension. The interpretation of HSI data using supervised methods requires a large amount of high-quality labeled data for training. Collecting and processing a sufficiently large training set is very labor and time intensive, but there is a risk of underfitting if an insufficient number of examples are provided. Unsupervised classification methods can expand the use and interpretation of HSIs in some applications. Zhai et al. review the current status of clustering, a widely used unsupervised method that groups similar pixels and separates dissimilar pixels into different classes based only upon the properties of the hyperspectral data themselves, obviating the need for labeled samples. They group clustering methods into nine main categories: centroid based, density based, probability based, bionics based, intelligent computing based, graph based, subspace clustering, DL based, and hybrid-mechanism based. Several popular clustering methods are then evaluated on two widely used images: Indian Pines and the University of Houston. Quantitative measures of clustering performance (e.g., overall accuracy and purity), along with running time, were compared across these different methods. Spectral–spatial methods were generally found to outperform spectral-based approaches, suggesting the value that spatial information adds to improve clustering. Centroid-, density-, and probability-based methods generally did not perform well because HSIs often do not meet their basic assumptions, but they have low complexity and are efficient with large data sets.
Two examples of recently developed subspace clustering methods were found to show good potential for use with HSIs, but at a large computational cost. The article concludes by identifying several HSI clustering challenges and possible future research lines, including the tradeoff between accuracy and efficiency, pointing toward hybrid approaches and the integration with high-performance computing. Multifeature methods and object- or subpixel-based methods are also identified, along with DL, as future research directions. Finally, automatic estimation of the number of clusters is an important research problem that has not, thus far, received much attention.
The feature by Wen et al. is concerned with change detection, an important technique in remote sensing. This
becomes a particularly challenging problem with the advent of very high resolution (VHR) images. A comprehensive review of the research on VHR change detection is provided, covering methods, applications, and a discussion of future directions.
Moving on, the next article addresses compression, a necessity for handling an increasing amount of data while being limited by communication bandwidth or power. As indicated in the previous article, HSI generates a substantially larger volume of data than other imagers (up to 5 TB per day in the case of HyspIRI), so effective compression is required. "Near-lossless" algorithms can provide a balance between reduction in data volume and error by allowing the user to specify a bound on the maximum error introduced by compression. In our fourth feature, Hernández-Cabronero et al. present a comprehensive review of the Consultative Committee for Space Data Systems (CCSDS) 123.0-B-2 Standard, "Low-Complexity Lossless and Near-Lossless Multispectral and Hyperspectral Image Compression," the latest in a series of standards developed by the CCSDS. CCSDS 123.0-B-2 incorporates support for near-lossless compression to achieve significantly better results. It has a number of novel features, including enhanced performance on low-entropy data, modes to facilitate efficient hardware implementation, and support for ancillary information. Decompression is backward compatible with data generated by CCSDS 123.0-B-1. Compression performance was demonstrated using mostly public data consisting of 17 multispectral images, 38 HSIs, and two sounder data samples, produced from 14 different instruments. Generally, the new standard was able to meet state-of-the-art performance specifications in absolute or relative error measurements.
Our fifth feature, by Feng et al., addresses systematic geometric distortions in attempting to align two or more remote sensing images collected at different times, with different viewing angles, or from different instruments. Registration techniques have been developed to perform this alignment using information in the images themselves. This is often a required preprocessing step for advanced methods such as image mosaicking or image fusion. A review of intensity-based, feature-based, and combination approaches to registration is presented, along with evaluation methods for registration performance (tie-point accuracy, transformation model performance, and alignment error). Some future trends include acceleration of the registration process, the use of compressed sensing methods, and frame-by-frame alignment. A combination of different advanced methods and high-performance computing may be necessary to meet future requirements for high-resolution, heterogeneous, and cross-scale remote sensing images.
The next feature, by Zhu et al., marks a good transition from optical to microwave modalities, describing the largely unrealized potential to apply DL methods (which have a long history in optical remote sensing) to synthetic aperture radar (SAR) data. DL models seek to encode input data into effective feature representations for target tasks. Common methods include convolutional neural networks, recurrent neural networks, and generative adversarial networks. Most of the DL approaches are supervised, however, and the existence of high-quality benchmark data for training is important.
Although DL has proven quite effective in extracting data from optical images, its application to SAR has been quite limited, mostly due to the lack of these large and representative benchmark data sets. In addition, some of the specific characteristics of SAR signals have made the direct application of DL models more difficult. These characteristics include their larger dynamic range, signal statistics, and imaging geometry, and the fact that native SAR data are complex valued, with much of the information content in the phase. This article reviews six typical applications of DL to SAR: terrain surface classification, object detection, parameter inversion, despeckling, interferometric SAR, and the data fusion of SAR with optical images. The generation of representative training data sets, unsupervised DL, interferometric data processing, quantification of uncertainty, large-scale nonlinear optimization problems, and cognitive sensors are identified as promising future trends in this area. Several spaceborne SAR missions are expected to be launched in the upcoming years. Hopefully, this article will encourage more joint initiatives in this area.
Forward-looking ground-penetrating radar (FL-GPR) has found important applications in real-time security, military situational awareness, and humanitarian demining. Typically mounted on a vehicle, FL-GPR can provide target detection from a standoff distance. Comite et al. review methods of detecting, locating, and imaging surface targets from array-based FL-GPR systems, considering aspects of both the electromagnetic modeling and signal processing in the problem formulation and solutions. These are challenging problems, as the signal return is strongly influenced by soil conditions and surface roughness. Furthermore, the target signature can be quite weak because most of the transmitted energy is forward scattered, and returns from the ground interface can dominate the radar measurements and obscure the target. Electromagnetic modeling and image-formation methods applied to this problem are introduced. The article also reviews migration approaches adapted from seismology, microwave tomography, and data-adaptive/compressive sensing. The use of FL-GPR from unmanned aerial vehicles is a promising future research area with a number of challenges, such as antenna design. Other open issues concern the detection of nonmetallic targets and real-time operation under realistic conditions. As in many other remote sensing fields, machine learning is attracting interest relevant to FL-GPR. Multiplatform data fusion under communication and computation constraints is another important research area.
Copious amounts of data do not necessarily mean large quantities of information. Quantifying the information content in Earth science and climate data can be difficult, as the application of information theory requires a good estimation of the probability densities. For many types of remote sensing data, producing the density estimate is problematic.
Johnson et al. review work on "Gaussianization" methods to produce statistics that can be used to estimate information-theoretic measures (e.g., entropy, total correlation, and mutual information). This methodology scales to high dimensions, uses a simple orthogonal transform, and does not assume any parametric form for the density. This approach is demonstrated on several distinct types of data, including radar backscattering intensities, hyperspectral data, and aerial optical images. It is also applied to quantify the information content of soil–vegetation status in agroecosystems. Code and demonstrations of the implemented algorithms and information-theoretic measures are provided.
Next, we have a literature review on the use of wireless sensor networks for precision agriculture, focusing on Latin America. Huerta et al. describe how these networks have been applied to improve traditional agricultural processes in the region by monitoring the weather and environment in a noninvasive manner. They document the growth and global distribution of publications on this topic and the benefit of this technology to the agricultural industry in terms of time, production, and environmental factors.
Our last feature concerns spectral variability in hyperspectral images. Bayesian, parametric, and local endmember (EM) techniques have been developed to address this problem. A literature review covers both classic and recent approaches and provides a new taxonomy to organize these methods from the perspective of the user, based upon the necessary amount of supervision and the computational cost. The article concludes with an outline of future research directions.
"Women in GRSS" reports on the IGARSS GRSS Diversity Fireside Chat, held in conjunction with the 2021 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), and the Women in Engineering (WIE) International Leadership Conference, both held virtually this year.
This issue contains two Technical Committee (TC) columns. The first, from the Information Analysis and Data Fusion TC, presents results from this year's data fusion contest with the theme of "Geospatial AI for Social Good." The second is from the Frequency Allocations in Remote Sensing TC, which reviews items relevant to microwave remote sensing on the World Radiocommunication Conference agenda for 2023.
The student branch of the University of Chinese Academy of Sciences, established in 2013, is featured in our "Chapters" column. The "Education" column reports on the "Green in the City" high school program targeting 16- and 17-year-old pupils in Flanders, Belgium, and held in conjunction with IGARSS.
Lastly, I am sorry to report on the loss of two very active members of the geoscience and remote sensing community: Tom von Deak, who worked to ensure that radiofrequency spectrum needs for Earth science remote sensing (continued on p. 7)
PRESIDENT'S MESSAGE
BY DAVID KUNKEE

GRSS Accomplishments in 2021: Success and Unexpected Turns

Digital Object Identifier 10.1109/MGRS.2021.3129110
Date of current version: 14 January 2022

In this last message of the year, I think it is important to summarize the accomplishments of the Society in 2021 and describe our preparations for success in 2022. By the numbers, our Society continues to grow. We added two student branch Chapters in September, with another one expected in November. When this one receives its final approval, the IEEE Geoscience and Remote Sensing Society (GRSS) community will total 70 Chapters, 28 student branch Chapters, and 10 ambassadors engaged with communities in locations where we hope GRSS Chapters will form in the near future. This means that the GRSS community now consists of 98 combined Chapters, with membership in the Society surpassing 5,300 members and submissions to our journals surpassing expectations. The numbers confirm continued success for GRSS in 2021, but this year also brought some unexpected twists and turns, with high expectations at the beginning to immediately resume in-person meetings.
In the spring, GRSS completed the transition of its website to a new provider and updated its appearance and structure. I hope that it offers more content and is more straightforward to navigate. The process of improving the website is ongoing, and we are continuing to transition and add material. It is great to see this result from many past planning sessions and discussions.
Extensive preparation by the 2021 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) team to address numerous possible outcomes this summer regarding the state of COVID-19 worldwide resulted in a decision to pivot from the hybrid meeting format planned at the beginning of the year to a fully virtual meeting. IGARSS 2021 again enjoyed record attendance and continued success overall, including an in-person drone workshop and an evolved meeting format. We are leveraging lessons learned from the past two IGARSSs to provide the best experience for the upcoming IGARSS 2022, which is currently expected to have both online and in-person content for those who can attend the meeting in Kuala Lumpur.
This past year, GRSS education and outreach activities also expanded to include schools offered in all seasons, not just summer, and we expanded GRSS course offerings through the IEEE Learning Network. Don't forget our cool videos, and we are now sponsoring the second season of our "Down to Earth" podcast. Before the end of the year, we also plan to reinstate in-person engagements with our booth at the upcoming American Geophysical Union meeting in December.
Also underway is the third GRSS Student Grand Challenge. This activity is a collaboration between the Van Allen Foundation of the University of Montpellier and IEEE GRSS. The combined activity consists of four projects overall: REmote Sensing detection of Plastic POllution in the Gulf of LIons, optiCal floAt for PlasTic quAntIficatioN, Remote Identification of Microplastics using Ocean Surface Anomalies, and the Micro-PLAStic in the SEA Detection experiment, with GRSS facilitating the latter three projects. The project kickoff meetings, with three of the four participants of this collaboration, were held in October, with the fourth project kickoff meeting anticipated for November. Tracking plastics and debris in our oceans is an important topic requiring a multidisciplinary approach.
The value of a wide variety of data, both remotely and in situ sensed, needs to be assessed. New sensors may be needed to supply data to better understand the problem, to provide assessments for decision makers, and to support designs that better control the problem and its impacts. The four projects underway focus on different approaches to the overall problem and possible mitigations. It is exciting to see the enthusiasm and value of these different approaches coming together. In November, GRSS cosponsored the 2021 Asia Pacific Conference on Synthetic Aperture Radar, which was held in Bali, Indonesia, thanks to conference organizers
Josaphat Sumantyo, the GRSS Instrumentation and Future Technologies Technical Committee, and Arifin Nugroho, chair of the GRSS Indonesian Chapter. Thanks also for the keynote presentations provided by GRSS Administrative Committee (AdCom) members Alberto Moreira and Paul Rosen.
Next, I would like to announce that in September, GRSS selected the proposing team from Brisbane, Australia, to host IGARSS 2025, which is now planned for early August 2025. Congratulations to Prof. Xiuping Jia, Prof. Jeffrey Walker, and Prof. Jocelyn Chanussot, general cochairs of IGARSS 2025. I would also like to thank all the teams that participated in the competition for their hard work and preparation. We recognize and appreciate your efforts, and we hope that you will continue to support the long-term success of IGARSS. The call for proposals for IGARSS 2026 from IEEE Regions 1–7 and 9 has now been posted on the GRSS website. Interested groups should submit a letter of intent and a preliminary proposal (preproposal) to the vice president of meetings and symposia of the GRSS at vp_meetings_symposia@grss-ieee.org by 1 March 2022.
I am also happy to report that GRSS now has a published standard (IEEE 4003-2021) on IEEE Xplore describing global navigation satellite systems reflectometry data sets. This standard is notable not only because it was developed almost independently of industry representatives but also because a draft for balloting was produced in two years despite changes in leadership. The GRSS Standards Committee has several more IEEE standards projects ongoing. In future AdCom meetings, GRSS may consider further defining the role of standards activity as it relates to the Society's core mission.
In 2021, GRSS leadership held four additional executive sessions of AdCom meetings. These online sessions provided some extra time for discussion on important topics, which has been difficult due to the inability to hold in-person meetings. This year, all AdCom meetings were again held virtually due to the continuing changing nature of the global pandemic, although we are now planning to restart in-person meetings, beginning with our spring AdCom meeting in March. From our recent November AdCom meeting, some key decisions include the adoption of changes to our Bylaws and Operations and Procedures (OPs) Manual that better reflect the Society's practice and help ensure transparency in our future operations. The scope of these changes included the addition of required clauses and conditions for our GRSS awards committees, changes to the conference advisory committee charter, and a reduction of the GRSS past-president term of office from three to two years. The roles of social media chair and social media ambassadors were also codified in our documents. Finally, additions to the OPs Manual in November defined the terms of our associate and topical associate editors. Please look for these updates and additions to the GRSS Bylaws and OPs Manual on our website. There is a requirement from IEEE to allow a 30-day review period for changes to the Bylaws before they become effective.
Considering the scope of the November meeting, I would like to thank the AdCom for their many contributions to GRSS activities throughout the year as well as their time spent preparing and reporting at all of the meetings throughout the year. Of note, the November AdCom meeting included 15 portfolio topics with 52 live presentations and 20 consent agenda presentations. The live meeting was held in short sessions spread over three days to cover the scope of activities within the Society. It was clear from listening to the many speakers at the November meeting that the level of activity is continuing to grow with our membership.
To conclude my December letter, it is with a very heavy heart that I forward news of the passing of Dr. Gail Skofronick-Jackson due to an accident while she was on the island of St. Croix in the U.S. Virgin Islands. Gail was a close friend and colleague to many of us within GRSS, NASA, and the international Earth science community. Within GRSS, she served as a member of the AdCom from 2012 to 2016, was a member of the organizing committee of IGARSS 2020, and for several years organized and led GRSS Women in Engineering activities. Gail was a brilliant scientist, continually enthusiastic to learn about the world around us, and always very thoughtful of others. I am grateful for all of the times we were able to share with her at our various meetings and activities. GRS

FROM THE EDITOR (continued from p. 5)
were represented in international proceedings, and Dr. Gail Skofronick-Jackson, NASA program manager, IEEE Fellow, former Administrative Committee member, and leader of WIE activities. Their memorials begin on page 289.
As I have mentioned in the past few issues, IEEE Geoscience and Remote Sensing Magazine has now implemented a two-stage review process to give more timely feedback to potential authors. Short (five pages or fewer, excluding references) white papers will be submitted first. These will then be reviewed by associate editors or members of the editorial board. Following a positive review of the white paper, authors may be invited to submit a full manuscript, which will then undergo a complete peer review. Contributions to our regular columns ("Chapters," "Space Agencies," "Women in GRSS," "Education," "Software and Data Sets," and "Conference Reports") are always welcome. White papers, columns, and invited manuscripts should be submitted through Manuscript Central at http://mc.manuscriptcentral.com/grsm. Proposals for special issues should be sent to me directly at jlg@ieee.org. Please continue to stay safe! GRS
Methods for Small, Weak Object Detection in Optical High-Resolution Remote Sensing Images
A survey of advances and challenges
WEI HAN, JIA CHEN, LIZHE WANG, RUYI FENG, FENGPENG LI, LIN WU, TIAN TIAN, AND JINING YAN

Digital Object Identifier 10.1109/MGRS.2020.3041450
Date of current version: 25 January 2021

Object detection that focuses on locating objects of interest and categorizing them has long played a critical role in the development of remote sensing imagery. Following significant improvements in Earth observation technologies, the objects in high-resolution remote sensing (HRRS) images show additional detailed information and more complex patterns. Some applications, such as urban monitoring, military reconnaissance, and national security, have urgent needs in terms of identifying small-scale (small) and weak-feature-response (weak) objects. However, these kinds of objects usually take up only a small proportion of an image, have their own variations in color, shape, and texture, and have features that are easily affected by weather, illumination, and occlusion. These characteristics of small, weak objects make their detection a more challenging task than generic object detection. This article comprehensively reviews the existing challenges and corresponding technologies for addressing that task and its specific problems.

INTRODUCTION
Object detection in remote sensing images aims at locating objects of interest on the ground and categorizing them. The term object generally refers to man-made or highly structured bodies (vehicles, buildings, ships, and so forth) that are independent of complex background environments as well as landscapes. As a fundamental task in the field of satellite and aerial image analysis, object detection plays an important role in a wide range of applications, such as urban planning, geographic information processing, precision agriculture, and environmental monitoring.
In the past 20 years, the increasing image interpretation accuracy of these applications has enabled them to meet the requirements of actual scenarios and has thus significantly promoted the development of Earth observation technologies and object-detection approaches. The spatial, temporal, and spectral resolutions of Earth observation sensors have also been greatly improved [1]–[3]. For instance, the images from Google Earth (Google Inc.) [4] have resolutions of up to approximately 0.5 m. WorldView-3 (DigitalGlobe, Inc.) [5] provides a 0.31-m panchromatic resolution and a 1.24-m multispectral resolution. These HRRS images show more texture and shape and additional detailed information about geospatial objects as well as complex spatial patterns. The data volume of HRRS images has also dramatically increased, and a massive number of images is now accessible. The advantage of HRRS images is that they offer the most economical and efficient way to achieve full-time, high-precision Earth surface monitoring with global coverage, and the fast detection of small-scale (small), weak-feature-response (weak), and nonuniformly (sparsely or densely) distributed objects is of great significance for meeting the requirements of real scenarios in many special applications, such as military reconnaissance, national security, urban monitoring, and geological disaster monitoring.
Unlike natural images, which are often clearer and contain several categories of objects, HRRS images cover an extensive range of the Earth's surface and involve a massive number of objects. The objects vary in scale, color, shape, and texture; their features are easily affected by weather, illumination, occlusion, and imaging conditions. In addition, the great distance between the sensor and targets means that some kinds of targets occupy only a few to dozens of pixels in the imaging plane and are presented as small objects that can easily be overwhelmed by a bright background [6]. Objects of this kind are usually characterized by a low signal-to-noise ratio (SNR) and inadequate structure information, which is presented as a weak feature response. These characteristics make the detection of small, weak targets a more challenging task in remote sensing.
The past decade has witnessed major advances in object detection in remote sensing images. At an early stage, various models based on prior knowledge [7]–[10] were proposed for target detection in satellite images. As image resolution increases, prior-knowledge-based models increase in uncertainty because the high complexity of HRRS images tends to cause limited detection accuracy. More recently, various forms of machine learning (ML) approaches [11] have played a critical role in object detection. With the increasing availability of big data and remarkable advances in data mining, novel methods have come into use for HRRS image processing. Deep learning (DL) models [12]–[15] have attracted serious attention and become dominant tools for processing large-scale, high-dimension data; they have achieved satisfactory accuracy for several tasks in the field of remote sensing. By stacking multiple nonlinear layers, DL models extract semantic information about objects as well as the context relationships among them and the background.
DL models demonstrate superiority in the extraction and fusion of multiscale features and have therefore outperformed the early models, bringing significant developments in remote sensing object representations. In recent decades, many works have presented ML- and DL-based models, leading to the creation of a series of benchmark data sets for promoting remote sensing and small, weak object detection [16]–[19].
Although several survey papers on object detection have been published, they have focused mainly on detection technologies from the image-processing aspect [20], [21], on reviewing some categories of approaches, such as ML- [11] and DL-based methods [19], or on some specific detection problems and tasks, including vehicle detection [22] and saliency-based methods [23]–[25] for remote sensing object detection. There is still a lack of a comprehensive review of existing works that addresses the problems of small, weak object detection. Based on the aforementioned analysis, this article
concentrates on the challenges and recent advances in addressing these problems; its contributions can be summarized as follows:
◗◗ This article systematically analyzes the challenges of small, weak target detection. According to their causes, the challenges are divided into three aspects: image quality, object variations, and complex context.
◗◗ The technical evolution of object detection, including the main developments in the fields of computer vision and remote sensing, is comprehensively covered; the existing benchmark data sets and their contributions to small, weak object detection are introduced and analyzed.
◗◗ The existing works that address the various challenges are also summarized, and some promising research directions for further improvements to small, weak object detection are discussed.

DIFFICULTIES AND CHALLENGES IN REMOTE SENSING SMALL, WEAK OBJECT DETECTION

GENERIC OBJECT DETECTION IN REMOTE SENSING
Object detection, a fundamental and essential task, has attracted broad attention over the past decades. The task is defined as follows: given a remote sensing image, determine whether it includes instances of objects from predefined categories, and, if it does, predict the spatial location and the extent of each instance [27]. Although thousands of geospatial objects occupy optical remote sensing images, research scholars interested in this topic use the term objects to refer to human-made or highly structured bodies (e.g., ships, vehicles, and airplanes) that have shape boundaries and are independent of the background environment and landscape items [11] rather than unstructured bodies or scenes, such as the sky or clouds. Generally, the spatial location and extent of an object can be defined using a bounding box (BB) (a horizontal or orientation rectangle tightly bounding the object) or a precise pixelwise segmentation mask, as shown in Figure 1.

FIGURE 1. The different annotation types in the HRSC2016 [26] data set. (a) The original image, (b) the HBBs' annotation, (c) the OBBs' annotation, and (d) the pixelwise segmentation mask.

Over the past several years, BB annotation has become the most widely used method for evaluating detection performance in remote sensing images; it can define the location of an object by the corner coordinates of a rectangle. The main advantage of this type of annotation is that it focuses on locating only objects of interest, ignoring the context. Therefore, it can greatly save labor costs
in labeling data and makes it possible to quickly create large-scale object-detection data sets for specific applications. The precise pixelwise segmentation mask is an annotation method wherein each pixel in the image is assigned a category label, such as forest, farmland, road, or background. This type can be applied in scenarios in which the environmental context is important, but it requires more expert knowledge and labor to be successful. Due to the massive number of object categories and instances, complex backgrounds, and the large data volume of HRRS images, precise pixelwise segmentation mask annotation is rarely used in large-scale remote sensing target detection.
There are two types of widely used BB annotation methods: horizontal BBs (HBBs) and orientation (rotation) BBs (OBBs). As shown in Figure 1, HBBs (axis-aligned rectangles) were first used to localize objects. However, objects in HRRS images often appear in arbitrary orientations and may be densely distributed. In some extreme but common scenarios, this annotation method encloses both the background and targets of interest; it cannot accurately or compactly outline the locations of objects and may decrease detector performance. The OBB annotation method, which can be regarded as adding an angle to the HBB, is utilized to obtain a tight bounding of rotated objects. For this article, we review mainly methods that use these two types of BB annotations.
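To make the two box conventions concrete, the short Python sketch below shows one common way to represent them and to derive the enclosing HBB from an OBB. The field names and the (center, width, height, angle) parameterization are illustrative assumptions; benchmark data sets differ in whether they store OBBs as four corner points or as a rotated rectangle, so this is a sketch rather than the format of any particular data set.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HBB:
    """Horizontal bounding box given by its corner coordinates."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


@dataclass
class OBB:
    """Oriented bounding box: a rectangle plus a rotation angle."""
    cx: float     # center x
    cy: float     # center y
    w: float      # rectangle width
    h: float      # rectangle height
    angle: float  # rotation in radians, counterclockwise

    def corners(self) -> List[Tuple[float, float]]:
        """Return the four corner points of the rotated rectangle."""
        cos_a, sin_a = math.cos(self.angle), math.sin(self.angle)
        half = [(-self.w / 2, -self.h / 2), (self.w / 2, -self.h / 2),
                (self.w / 2, self.h / 2), (-self.w / 2, self.h / 2)]
        return [(self.cx + dx * cos_a - dy * sin_a,
                 self.cy + dx * sin_a + dy * cos_a) for dx, dy in half]

    def to_hbb(self) -> HBB:
        """Smallest axis-aligned box that encloses the oriented box."""
        xs, ys = zip(*self.corners())
        return HBB(min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    ship = OBB(cx=100.0, cy=60.0, w=30.0, h=8.0, angle=math.radians(35))
    print(ship.corners())  # four (x, y) corner points
    print(ship.to_hbb())   # enclosing horizontal box
```

The enclosing HBB is always at least as large as the OBB, which is why HBB annotations of elongated, rotated objects such as ships inevitably include background pixels, as noted above.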
DIFFICULTIES AND CHALLENGES IN SMALL, WEAK OBJECT DETECTION
Relevant works on small, weak object detection in infrared images started to appear long ago, when the spatial resolution of remote sensing images was relatively low and infrared images were the main data source for object detection. Many works have addressed solutions to such problems [28]–[32]. Related works covering the analysis of object detection in infrared images [33]–[35] originally defined a small object as one with a total spatial extent of fewer than 80 pixels (a width of fewer than nine pixels), which is less than 0.2% of an image of 256 × 256 pixels. As shown in Figure 2, the long imaging distance means that the target takes up only a few dozen pixels in the imaging plane, presenting as a small target. Objects of this kind are basically shapeless and have no available texture features. Small objects are usually characterized as having a low SNR, small size, and inadequate structure information because of undulant clutter and the imaging distance. These characteristics make small targets very difficult to detect, and small targets are easily overwhelmed by a bright background [6]. Therefore, a small, weak object is more formally defined as 1) small: the scale of the target is small, or the target's proportion of the total image is low; and 2) weak: the features of the target are insufficient and easily affected by its background.

FIGURE 2. Small, weak object detection in infrared images.

Thanks to the acquisition of HRRS images and the requirements of the related applications, small, weak object detection has attracted increasing attention. Although numerous efforts have been made to develop detectors and benchmark data sets for promoting the development of small, weak object detection in HRRS images, there is still no consensus on the definition of small, weak object detection. Kang et al. [16] proposed a complex background benchmark wherein vehicles were set as the small objects. VEDAI (for "vehicle detection in aerial imagery") [17] is a data set created for small target detection, but the authors did not propose a specific definition for a small object. Xia et al. [18] proposed a new large-scale benchmark for HRRS image detection, dividing the object instances into three classes according to the pixel width of their BBs: small for a width from 10 to 50, middle for a width from 50 to 300, and large for a width of greater than 300. This was the first work to clearly define a scale for small objects in HRRS images. For the aspect of the weak feature response of objects, no related work has proposed sufficient discussion or drawn a clear conclusion.
In this section, we comprehensively consider the factors that affect detection performance and then summarize the difficulties of and challenges to small, weak object detection in HRRS images. In Table 1, each influencing factor is examined from the three aspects of image quality, object variations, and complex context, as follows.

TABLE 1. THE CHALLENGES AFFECTING SMALL, WEAK OBJECT DETECTION.
Image quality: mixed noise, patch missing, occlusion caused by cloud, blur, shadow, and multisource data.
Object variations: small size, high intraclass variations, changes of object features caused by illumination and background, antagonism of the background and the target, a lack of annotation samples, nonuniform distribution, and an imbalance of positive and negative training examples.
Complex context: many types and quantities of background targets and complex distribution patterns.

1) Image quality: In the process of HRRS image acquisition, the imaging environment, satellite platform, optical system,
and electronic equipment may affect the image quality, which leads to a certain degree of degradation of the acquired images. As presented in Figure 3, these images cannot fully meet the requirements of precise interpretation in real-world applications. There are two main categories of factors that degrade image quality. The first is the factors that may appear in the imaging process, such as noise, blurring, cloud occlusion, missing information, shadow, and so on. These factors are the main reason for remote sensing image degradation. The second category of factors arises from the limitations of sensor production technologies and application scenarios. Because spectral, spatial, and temporal resolutions are often mutually restricted, imaging sensors can achieve high resolution in only one of these three aspects. For these kinds of low-quality images, some methods for improving the image quality should be applied. Multisource satellite data with different resolutions should be complementary to obtain the required data.

FIGURE 3. Some problems caused by image quality: (a) blur, (b) noise, (c) missing information, (d) cloud, (e) shadow, and (f) multisource data (IKONOS, 0.8–1 m; WorldView-3, 0.31 m).

2) Object variations: An HRRS image can cover an extensive area of the Earth's surface and contain many kinds of objects. The scale variations of object instances in HRRS images are great, and some objects are very small. As depicted in Figure 4, some objects always take up a small proportion of a total image and show a weak feature response; for example, the width of a small ship can be fewer than 25 pixels. Changes in resolution, scale, color, shape, and texture residing within a single category create high intraclass variations for objects. These kinds of small-object instances are also likely to crowd into a specific region of aerial images. Additionally, HRRS images are noisy, and the features of objects easily change when affected by weather, illumination, and occlusions. Some specific targets are adversarial and camouflaged, making them difficult to identify effectively. Another critical problem is that the annotation samples may be insufficient. At present, there are more than 2,000 satellites in orbit around the world; they generate more than a petabyte of data every day. However, there are roughly only 100 GB of annotated data for target detection.
3) Complex context: Generally, the background and context of objects of interest are complex and crowded with other types of objects, as displayed in Figure 5. Natural images are often taken from horizontal perspectives, while HRRS images are typically taken as bird's-eye views; this implies that many objects of interest form complex spatial patterns with the background. The intricate patterns increase the difficulty of object detection in HRRS images.
Considering the three aspects of the challenges, remote sensing small, weak objects can be characterized as follows: 1) data quality, i.e., HRRS images for small, weak object detection may be of low quality due to the noise, illumination change, occlusion, and so forth introduced in the imaging process; 2) objects, i.e., they are of small scale and have a weak feature response, with many categories showing high intraclass variations and a nonuniform distribution, and they may lack annotation samples; and 3) context, i.e., the context is complex and changeable,
and targets are easily hidden in the background. All of these characteristics make small, weak object detection a more challenging task than generic object detection. To promote its development, more work is needed to address these different aspects of the challenges and their difficulties.

FIGURE 4. Some problems caused by object variations: (a) small-scale weak feature; (b) high intraclass variations; (c) multiclass densely distributed instances; (d) occlusions; and (e) camouflage and adversariness.

FIGURE 5. The problems introduced by complex contexts. (a) A complex context and (b) massive background objects.

A REVIEW OF HRRS OBJECT-DETECTION BENCHMARK DATA SETS AND PERFORMANCE EVALUATION

HRRS OBJECT-DETECTION BENCHMARK DATA SETS
Throughout the development of object detection and recognition, data sets have played a critical role not only as common resources for the evaluation and verification of algorithm performance but also in pushing research into increasingly complex and challenging problems [20]. Over the past decade, in particular, detection and recognition methods based on DL have achieved tremendous success in addressing visual-understanding problems in the computer vision community; large amounts of annotation data, including Pascal visual object classes [44], ImageNet [45], Microsoft common objects in context (COCO) [46], and Open Images [47], have played a key role in this success. The development of Earth observation technologies and access to a large number of HRRS images make it possible to build large-scale data sets for capturing the vast richness and diversity of objects, promoting unprecedented performance in remote sensing object detection.
In past decades, research groups in remote sensing have released many public data sets with different characteristics for solving different problems. There are 13 widely used data sets: the Institute for Computer Science and Control in Hungary-Inria (SZTAKI-Inria) [36], Northwestern Polytechnical University very high resolution (NWPU VHR)-10 [37], Chinese Academy of Sciences UCAS-AOD [38], remote sensing object detection (RSOD) [40], data set for object detection in aerial images (DOTA), object detection in aerial images (ODAI) [18], VEDAI [17], high-resolution ship collection 2016 (HRSC2016) [26], 3K vehicle [16], cars overhead with context (COWC) [39], xView [41], HRRS detection (HRRSD) [42], [43], and detection in optical remote sensing images (DIOR) [19]. The attributes of
these data sets are listed in Table 2 for comparison. The development trend and some representative small, weak targets of the data sets are displayed in Figures 6 and 7, respectively. Each data set is introduced in this section. The things-and-stuff data set [48] is excluded from this discussion because of its relatively low spatial resolution.

TABLE 2. COMPARISONS OF THE AVAILABLE BENCHMARK DATA SETS IN THE EARTH OBSERVATION COMMUNITY.
SZTAKI-INRIA [36]: 1 category; 9 images; 665 instances; image width ~800; source: QuickBird, IKONOS, and Google Earth; resolution 0.5–1 m; OBBs; 2012; single category, high-resolution satellite images, multiple sensors.
NWPU VHR-10 [37]: 10 categories; 800 images; 3,775 instances; image width ~1,000; source: Google Earth; resolution 0.3–2 m; HBBs; 2014; multiple categories, clean background.
UCAS-AOD [38]: 2 categories; 910 images; 6,029 instances; image width 1,280; source: Google Earth; resolution 0.3–2 m; HBBs; 2015; airplane and vehicle detection.
VEDAI [17]: 9 categories; 1,210 images; 3,640 instances; image width 1,024; source: Utah AGRC; resolution 0.125 m; OBBs; 2015; small-scale objects, multispectral and multiresolution images, illumination changes.
3K vehicle [16]: 2 categories; 20 images; 14,235 instances; image width 5,616; source: DLR 3K camera system; resolution 0.13 m; OBBs; 2015; small-scale objects, VHR images.
COWC [39]: 1 category; 53 images; 32,716 instances; image width 2,000–19,000; six sources; resolution 0.15 m; dot annotation; 2016; small-scale objects, multisensor images.
HRSC2016 [26]: 1 category; 1,061 images; 2,976 instances; image width ~1,000; source: Google Earth; resolution 0.4–2 m; three annotation types; 2016; sufficient object variations, complex background.
RSOD [40]: 4 categories; 976 images; 6,950 instances; image width ~1,000; source: Google Earth, Tianditu; resolution 0.3–3 m; HBBs; 2017; multisensor and multiresolution images.
DOTA [18]: 15 categories; 2,806 images; 188,282 instances; image width 800–4,000; source: Google Earth, JL-1, and GF-2; resolution 0.3–1 m; HBBs and OBBs; 2018; multisensor and multiresolution images, nonuniform distribution, many object categories, sufficient object variations.
ODAI [18]: 16 categories; 2,806 images; ~400,000 instances; image width 800–4,000; source: Google Earth, JL-1, and GF-2; resolution 0.3–1 m; HBBs and OBBs; 2019; improved version of DOTA, more instances and categories, especially for small, weak objects.
xView [41]: 60 categories; 1,128 images; ~1,000,000 instances; image width 2,000–4,000; source: WorldView-3; resolution 0.3 m; HBBs; 2018; complex background, many categories, massive instances, dense distribution, noise, blur, occlusion.
HRRSD [42], [43]: 13 categories; 21,761 images; 55,740 instances; image width ~11,000; source: Google Earth; resolution 0.15–1.2 m; HBBs; 2019; many categories, many instances, sufficient variations.
DIOR [19]: 20 categories; 23,463 images; 192,472 instances; image width 800; source: Google Earth; resolution 0.5–30 m; HBBs; 2019; complex background, many categories, noise, blur, occlusion.
DLR: German Aerospace Center; COWC: cars overhead with context; AGRC: Automated Geographic Reference Center.

SZTAKI-INRIA
This benchmark data set, from SZTAKI and the Inria Sophia Antipolis-Méditerranée Research Center in France [36], was created for building detection and is a multisensor aerial set from QuickBird, IKONOS, and Google Earth [4]. It contains nine images and 665 building instances, annotated with OBBs. The images of the data set have three bands: red, green, and blue (RGB).

NWPU VHR-10
This available 10-class geospatial object-detection data set from NWPU in Xi'an, China, is used for research purposes [37]. The object classes are airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle. The data set contains 800 VHR remote sensing images cropped from the Google Earth and Vaihingen data sets, which were then manually annotated by experts into 3,775 instances with HBBs. The image resolutions range from 0.5 to 2 m.
UCAS-AOD
This UCAS data set, collected from Google Earth, contains two detection classes: airplane and vehicle [38]. The airplane category has 600 images with 3,210 instances, while the vehicle category has 310 images with 2,819 vehicles.

VEDAI
This data set was created for the task of multiclass vehicle detection in satellite images [17]. It consists of nine categories with a total of 3,640 instances, including boat, car, camping car, plane, pickup, tractor, truck, van, and a category labeled "other." The data set has 1,210 images, each of which is 1,024 × 1,024 pixels with VHR (12.5 cm). VEDAI is provided as a tool to benchmark automatic target-recognition algorithms in unconstrained environments. The vehicles contained in the database, in addition to being small, exhibit different characteristics, such as multiple orientations, illumination/shadowing changes, peculiarities, and occlusions. Furthermore, each image is available in several spectral bands and resolutions.

FIGURE 6. The development trend of existing HRRS data sets.

3K VEHICLE
This data set is another of those used for vehicle detection [16]. It has 20 images with 5,616 × 3,744 pixels and a spatial resolution of 13 cm. It contains 14,235 vehicles with OBBs. The images were captured by the German Aerospace Center
3K camera system (a real-time airborne digital monitoring system) at a height of 1 km above the ground.

CARS OVERHEAD WITH CONTEXT
Also created for vehicle detection, the cars overhead with context (COWC) data set images are standardized to 12.5 cm per pixel at ground level from their original resolutions [39]. The set contains 32,716 unique cars from six sources: Toronto, Canada; Selwyn, New Zealand; Potsdam and Vaihingen, Germany; Columbus, Ohio; and Utah, the United States, covering different geographical locations and produced by different imaging sensors. The car sizes range from 24 to 48 pixels. Two of the sets (Vaihingen and Columbus) are in gray scale; the others are in RGB color. It should be noted that each car in the annotated images has a dot placed on its center.

HRSC2016
HRSC2016 is a benchmark data set for boat detection [26]; it has 1,061 images and 2,976 instances from Google Earth with HBB annotations. The image sizes vary from 300 × 300 to 1,500 × 900 pixels. The images contain large variations of scale, position, shape, and appearance.

DOTA AND ODAI
DOTA is a larger-scale data set with HBB and OBB annotations [18]. It contains 2,806 large images and classifies objects into 15 categories, including baseball diamond, ground track field, small and large vehicles, tennis court, basketball court, storage tank, soccer field, roundabout, swimming pool, helicopter, bridge, harbor, ship, and plane. The fully annotated DOTA contains 188,282 object instances, which vary greatly in scale, orientation, and aspect ratio; the resolutions range from 0.3 to 1 m. The images are collected mainly from Google Earth [4], but some are taken from JL-1 and the rest from GF-2 of the China Center for Resources Satellite Data and Application. ODAI is an updated version of the DOTA data set and contains 0.4 million annotated object instances in 16 categories. Both DOTA and ODAI use the same aerial images, but ODAI has revised and updated the annotation of objects, adding many small-object instances (approximately 10 pixels or fewer) that were missed in DOTA and extending the categories by adding a new one: a container crane.

RSOD
RSOD consists of 976 images and 6,950 object instances involving four categories [40]. The data set was collected from Google Earth and Tianditu, and its resolutions range from 0.3 to 3 m.

FIGURE 7. Small, weak object examples of existing HRRS data sets. (a) Small, weak objects (e.g., ship, windmill, and ground track field) in large-scale data sets and (b) small, weak objects in small-scale data sets for vehicle detection.

XVIEW
This is one of the largest published aerial data sets, covering 60 object classes [41]. It contains images from complex
scenes and more than one million object instances with HBB annotations. Compared with images in existing HRRS data sets, xView images are high resolution, multispectral, and labeled with a greater variety of objects. The images, collected from WorldView-3, have a resolution of up to 0.3 m. The variations of scale, color, shape, and texture make the data set more challenging for the remote sensing community.

HRRSD
The HRRSD data set is a large-scale benchmark with 21,758 RGB images extracted from Google Earth and has spatial resolutions ranging from 0.15 to 1.2 m [42], [43]. There are 13 categories of objects, which allows this to be considered an extended version of the NWPU VHR-10 data set with additional classes, such as crossroads, parking lots, and T junctions. This data set is class balanced, and each category has 3,700–5,000 instances.

DIOR
The DIOR data set is a recently released large-scale aerial data set [19]. It contains 23,463 images with 800 × 800 pixels and 192,472 instances labeled with HBBs. The images were collected from Google Earth and have resolutions ranging from 0.5 to 30 m. This data set has sufficient variations of scale, weather, season, imaging conditions, and quality as well as high interclass similarity and intraclass diversity. It is also one of the larger-scale data sets, with massive numbers of images and object instances.

COMPARISON
As shown in Figure 6, early HRRS data sets, such as SZTAKI-INRIA [36] and NWPU VHR-10 [37], contained a small number of categories and instances for the detection of large or easily recognized objects. In the years since, scholars have forged ahead to introduce massive numbers of instances, many categories, multisensor data, complex context, and low-quality images to create large-scale, challenging data sets, such as xView [41], DIOR [19], HRRSD [42], [43], and DOTA [18], which are becoming more and more in line with the conditions of actual applications. These four satellite data sets contain more than 13 object categories and more than 50,000 object instances, with resolutions ranging from 0.3 to 30 m, all available for the development of detectors that adapt to large-scale object detection. Some representative small, weak examples from the aforementioned data sets are collected in Figure 7(a). It can be seen that the objects in the large-scale data sets, such as ships and windmills, are very small scaled, weak featured, and easily affected by the context and by low-quality data. These data sets are therefore very challenging and well suited to developing detectors for small, weak object detection.

VEDAI [17], 3K vehicle [16], and COWC [39] are three relatively small-scale data sets used for vehicle detection. As shown in Figure 7, their images have VHR (up to about 12.5 cm), and their objects, whose sizes fall within a fixed range, are beneficial for developing and testing a model to detect small objects. They can supplement the large-scale data sets mentioned previously. However, some critical limitations of these data sets are that the object categories are few, including only cars or airplanes, and the image quality is high, which is not consistent with actual scenarios. VHR data sets with more categories, variations, and challenges need to be developed further.

EVALUATION METRICS
There are two categories of metrics for evaluating detector performance: detection speed, in frames per second (FPS), and detection accuracy, in terms of precision, recall, and average precision (AP). FPS expresses how fast the detector is, i.e., the number of image frames the detector can process per second. For example, if the time needed for a detector to analyze a standard-scale image is 0.04 s, its detection speed is 25 FPS.

For a given input image $I$, the outputs of a detector are the predicted results $\{(b_j, c_j, p_j)\}_{j=1}^{M}$ (indexed by the object order $j$, where $M$ is the number of predicted detections), consisting of the BB $b_j$, the predicted label $c_j$, and the confidence score $p_j$. The ground-truth boxes are $\{(B_k, C_k)\}_{k=1}^{N}$ (indexed by the order $k$, where $N$ is the number of ground-truth boxes), consisting of the BB $B_k$ and the label $C_k$. The predictions $\{(b_j, c_j, p_j)\}_{j=1}^{M}$ are greedily matched to $\{(B_k, C_k)\}_{k=1}^{N}$. For a given confidence threshold $t$ and an intersection-over-union (IoU) threshold $\epsilon$, a predicted result $(b_j, c_j, p_j)$ is set as a true positive (TP) if the following criteria are met:
◗ The predicted label $c_j$ is equal to the label $C_k$ of a ground-truth box $(B_k, C_k)$, and $p_j$ is greater than $t$.
◗ The IoU value between the predicted BB $b_j$ and the ground-truth BB $B_k$ is larger than $\epsilon$, where $\mathrm{IoU}(b_j, B_k)$ is computed as

$$\mathrm{IoU}(b_j, B_k) = \frac{\mathrm{area}(b_j \cap B_k)}{\mathrm{area}(b_j \cup B_k)}, \qquad (1)$$

and the symbols $\cap$ and $\cup$ denote the intersection and union, respectively. The value of $\epsilon$ is generally set to 0.5. Otherwise, the predicted result is regarded as a false-positive (FP) sample.

Precision is the proportion of correct detections among the total detection results predicted by the detector. Based on the counts of TP and FP results, it can be computed by

$$P(t) = \frac{TP}{TP + FP}. \qquad (2)$$

Recall is defined as the proportion of all positive instances that are found by the detector. It can be formulated as

$$R(t) = \frac{TP}{N}, \qquad (3)$$

where $N$ is the number of ground-truth boxes. Precision and recall can be used to derive AP, which is the metric most used in recent works. AP is usually computed for each class separately. The precision $P(t)$ and the recall $R(t)$ can be computed as functions of the confidence threshold $t$; by varying $t$, different pairs $(P, R)$ can be obtained, and, in principle, this allows precision to be considered as a function of recall, from which the AP value can be found. The mean AP, the average of the AP values of all the object categories, has therefore been adopted as the final measure to evaluate the overall accuracy [44], [45], [49].
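The quantities in (1)–(3) and the resulting AP can be computed with a few lines of NumPy. The following sketch is illustrative only: it handles a single class, uses greedy matching at an IoU threshold of 0.5, and integrates an interpolated precision–recall curve; it is not the exact evaluation protocol of any particular benchmark.

```python
# A minimal, single-class sketch of (1)-(3) and AP; the greedy matching and the
# interpolated precision-recall integration are illustrative choices.
import numpy as np

def iou(box_a, box_b):
    """IoU of two HBBs given as (x_min, y_min, x_max, y_max), as in (1)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def average_precision(pred_boxes, pred_scores, gt_boxes, iou_thr=0.5):
    """Greedily match predictions (highest confidence first) to ground truth,
    then integrate the precision-recall curve to obtain AP for one class."""
    if len(pred_scores) == 0:
        return 0.0
    order = np.argsort(-np.asarray(pred_scores))
    tp = np.zeros(len(order))
    matched = set()
    for rank, j in enumerate(order):
        ious = [(-1.0 if k in matched else iou(pred_boxes[j], g))
                for k, g in enumerate(gt_boxes)]
        best_k = int(np.argmax(ious)) if ious else -1
        if best_k >= 0 and ious[best_k] >= iou_thr:  # TP criterion of (1)
            tp[rank] = 1.0
            matched.add(best_k)                      # each ground truth is matched once
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / max(len(gt_boxes), 1)          # (3)
    precision = cum_tp / (cum_tp + cum_fp + 1e-12)   # (2)
    # Monotone (interpolated) precision envelope, then area under the P-R curve.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([precision[0]], precision))
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))
```

Averaging this value over all object categories yields the mean AP used to report overall accuracy.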
TABLE 3. COMPARISONS OF FIVE CATEGORIES OF DETECTION METHODS.

Method name | Main categories/milestones | Highlights | Limitations
Template matching | Rigid template matching [7], [50], [51]; deformable template matching [52]–[54] | Simple and fast to implement, no training samples required | Limited by variations of object appearance, consumes more prior knowledge
Knowledge | Geometric information [8], [55]; context knowledge [9], [56] | Detects objects with a coarse-to-fine hierarchical architecture, combines more prior information | Defining the detection rules and knowledge is subjective and labor consuming
OBIA | Multiresolution segmentation [57]–[59] | Flexible incorporation of different features, GIS-like functionality and expert knowledge | Lacks generic solutions for the full automation of the segmentation process; defining the classification rules is subjective and not robust
Classical ML | Features: HoGs [10], BoWs [60], texture features [61], [62], and so on; classifiers: SVM [63], [64], AdaBoost [65], [66], kNN [67], and CRF [68], [69] | Automatically learns object feature representations, better scalability and compatibility | Requires many labeled training samples; detection accuracy depends on the training samples and the feature extractor
DL | Two stage: RCNN [70], SPPNet [71], fast RCNN [72], faster RCNN [73], RFCN [74], and so forth; one stage: YOLO [75], SSD [76], RetinaNet [77], and CornerNet [78] | End-to-end framework without manual intervention, automatically learns high-level features, adapts to large-scale complex image processing | Requires a large number of labeled samples, consumes massive computing resources

OBIA: object-based image analysis; GIS: geographic information system; HoGs: histogram of oriented gradients; BoWs: bag of words; SVM: support vector machine; kNN: k-nearest neighbor; CRF: conditional random field; RCNN: region-based convolutional neural network; SPPNet: spatial pyramid pooling network; RFCN: region-based fully convolutional network; YOLO: you only look once; SSD: single-shot multibox detector.

A BRIEF REVIEW OF OBJECT-DETECTION FRAMEWORKS
Incredible progress has been made in feature representations and classifiers for object detection. In terms of feature representation and recognition, an impressive change is the shift from handcrafted features to DL features. In terms of localization, the sliding-window strategy is mainstream. However, the number of windows is extensive and increases dramatically with the number of image pixels, especially when processing remote sensing images. Therefore, scholars focus mainly on the design of effective and efficient object-detection strategies; these include sharing feature computations, cascading, reducing per-window computations, the fast localization of objects of interest, and the reduction of computational costs. In the following, we briefly review milestone works (see Table 3 and Figure 8).
FIGURE 8. A road map of object-detection frameworks, from template-matching and knowledge-based methods through OBIA-based and classical ML methods to DL methods (roughly 1980–2020). SVM: support vector machine; BoWs: bag of words; HoGs: histogram of oriented gradients; RCNN: region-based convolutional neural network; SPPNet: spatial pyramid pooling network; FPN: feature pyramid network; SSD: single-shot multibox detector; YOLO: you only look once.
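Before reviewing the individual frameworks, a back-of-the-envelope count makes concrete the sliding-window cost noted above; the window size, stride, and image sizes below are arbitrary illustrative values.

```python
# Back-of-the-envelope count of sliding windows; all sizes and strides are illustrative.
def window_count(image_w, image_h, win=128, stride=32):
    nx = (image_w - win) // stride + 1
    ny = (image_h - win) // stride + 1
    return nx * ny

print(window_count(1_000, 1_000))    # 784 windows for a 1,000 x 1,000 image
print(window_count(20_000, 20_000))  # 386,884 windows for a 20,000 x 20,000 HRRS scene
```

Each additional window scale or aspect ratio multiplies these counts, which is why shared feature computation and fast proposal generation matter so much for large remote sensing scenes.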
OBJECT-DETECTION METHODS BASED ON TEMPLATE MATCHING
Methods based on template matching [11] are a simple approach to object detection; they find matches in an input image based on a series of predefined templates. The two main steps are 1) template generation, in which a template for each object category is generated by manual design or learned from the training set, and 2) similarity measurement, in which, given an input image, the template is used to match the entire image at each possible position to find the matches. The methods can be classified into two groups: rigid template matching and deformable template matching. Early research concentrated mainly on rigid template matching, applying it to detect specific objects with simple appearances and small variations [7], [50], [51]. Because of its ability to both impose geometric constraints on the shape and integrate the local image evidence, deformable template matching is more powerful and flexible than rigid template matching in processing shape deformations and intraclass variations [52]–[54]. Object-detection methods based on template matching are simple and easy to implement for a specific task; expert knowledge is needed only to design the templates, and no training samples are required. However, designing the templates calls for considerable prior knowledge and extensive computation, and the templates cope poorly with changes in object scale, rotation, shape, and viewpoint.
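A minimal sketch of the rigid template-matching idea described above: slide a template over the image and score every position with normalized cross-correlation. It is purely illustrative (brute force, grayscale, single scale) and is not drawn from the cited works.

```python
# Rigid template matching by normalized cross-correlation (NCC); a brute-force
# sketch of the "similarity measurement" step, O(image size x template size).
import numpy as np

def ncc_match(image: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Return an NCC score map; the peak indicates the best-matching position."""
    th, tw = template.shape
    ih, iw = image.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum()) + 1e-12
    scores = np.zeros((ih - th + 1, iw - tw + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            scores[y, x] = (p * t).sum() / (np.sqrt((p ** 2).sum()) * t_norm + 1e-12)
    return scores

# Usage: the location of the maximum score is the detected (top-left) position.
# y_best, x_best = np.unravel_index(scores.argmax(), scores.shape)
```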
OBJECT-DETECTION METHODS BASED ON KNOWLEDGE
Object-detection methods based on knowledge turn object detection into a hypothesis-testing problem by establishing various types of knowledge and rules. The establishment of the knowledge and rules is the most important step. Two widely used types of knowledge are geometric and context knowledge. Geometric information was the most important and most widely used knowledge in early target detection; users can encode prior knowledge by adopting parametric, specific, or generic shape models [8], [55]. Context knowledge, which captures object–background context and the relationships among objects and their surrounding regions, is also crucial and widely used [9], [56]. Methods of this kind enable users to perform the detection process through a coarse-to-fine hierarchical structure. However, decisions on how to define the prior-knowledge detection rules are subjective, and this poses critical challenges to the methods. Rules that are too loose cause false positives; rules that are too tight cause false negatives.

OBJECT-DETECTION METHODS USING OBJECT-BASED IMAGE ANALYSIS
With the increasing availability of submeter images, object-based image analysis (OBIA) has been presented for classifying or mapping HRRS imagery into meaningful objects [57]–[59]. It contains two steps: image segmentation and object classification. First, imagery is segmented into homogeneous regions (also called objects), each representing a relatively homogeneous group of pixels; this is achieved by selecting the desired scale, shape, and compactness criteria. In the second step, a classification process is applied to these objects. An advantage of OBIA-based methods is that they exploit the knowledge of geographic information systems to overcome the limitations of pixel-based image-classification methods. The real challenges to the satisfactory performance of OBIA methods lie in defining appropriate segmentation parameters for objects of varying size, shape, and spatial distribution. In addition, accuracy assessments of OBIA are difficult, although many efforts have been made to address the problem. The technique's advantages lie in its flexible incorporation of shape, texture, geometry, and contextual semantic features as well as expert knowledge, making it context aware and multisource capable. Generic solutions to the full automation of the segmentation process are still missing, and the expert knowledge needed to define the classification rules is still subjective; these problems limit the technique's adaptability to different tasks.

OBJECT-DETECTION METHODS BASED ON CLASSICAL ML
Due to the remarkable advances of ML techniques, especially their impressive feature representations and powerful classifiers, many recent approaches have treated object detection as a classification problem, achieving significant improvements. ML object detection can be performed by training a classifier that captures the variations in object appearances and views from a set of training data. The classifier takes a set of regions (object proposals or image patches) with their feature representations as the input; the output consists of their corresponding predicted labels. The most important components in the process of object detection are feature extraction, feature fusion, and classifier training. The dimension-reduction step is an optional operation. A histogram of oriented gradients (HoGs) feature [10], a bag-of-words (BoWs) feature [60], texture [61], [62], sparse representation-based [79], and Haar-like features [80] are common. The classifiers include support vector machines (SVMs) [63], [64], AdaBoost [65], [66], k-nearest neighbors [67], and conditional random fields [68], [69]. Methods based on ML can be established automatically using ML techniques. Their scalability and compatibility are both greatly improved, but these methods need a large number of training samples to learn classifiers and are not well suited to large-scale data sets. In addition, the representation ability of the learned features is not sufficiently robust to deal with variations in an object's appearance.
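A minimal sketch of the classical ML pipeline described above: handcrafted HoG features plus a linear SVM trained on labeled patches. It assumes scikit-image and scikit-learn are available; the patch size and HoG parameters are arbitrary choices, not those of the cited works.

```python
# Classical ML detection pipeline in miniature: HoG features + a linear SVM.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(patches):
    """Extract a HoG descriptor for each fixed-size grayscale patch."""
    return np.array([
        hog(p, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for p in patches
    ])

def train_patch_classifier(patches, labels):
    """patches: (N, 64, 64) grayscale crops; labels: 1 = object, 0 = background."""
    clf = LinearSVC(C=1.0)
    clf.fit(hog_features(patches), labels)
    return clf

# At test time, candidate windows (e.g., from a sliding window or proposals) are
# scored with the same feature extractor and classifier:
# scores = clf.decision_function(hog_features(candidate_patches))
```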
DL-BASED DETECTION FRAMEWORKS
We discuss DL detectors separately from the ML methods described previously because of the great success of DL-based techniques in recent years. Deep convolutional neural networks (CNNs) can extract high-level feature representations of an input image and improve classification performance. Girshick et al. [70] took the lead, applying CNNs to object detection by developing region-based CNN (RCNN) features. Since then, many milestones have marked the unprecedented speed of the development of object detection. The main milestone approaches are reviewed in the following sections; they can be categorized into two classes according to the presence or absence of a proposal-generation stage: two- and one-stage detection frameworks. In the next sections, existing milestones of the two categories of detection frameworks are introduced first, and then the advances of DL-based detectors in small, weak object detection are reviewed.

TWO-STAGE DETECTION FRAMEWORKS
As depicted in Figure 9(a), for an input image, a two-stage detector first extracts DL features using a pretrained CNN architecture. Then, in the region proposal step, many regions of interest (RoIs), i.e., regions where a target may likely exist, are generated. Finally, a detection head with a classifier and a regressor simultaneously predicts the location and category of a target for each RoI. The critical characteristic of two-stage detection frameworks is that they contain a preprocessing component for generating object proposals. These kinds of detectors have dominated object recognition since the creation of RCNNs [70] due to their remarkable detection performance on benchmark data sets.

FIGURE 9. The main structures of mainstream frameworks. (a) An illustration of two-stage detection frameworks (input image → feature extractor → feature maps → RPN and per-RoI multiclass classification and BB regression → output results), using a faster RCNN as an example. (b) An illustration of one-stage detection frameworks (input image → feature extractor → feature maps → per-grid multiclass classification and BB regression → output results), using YOLO as an example. RPN: region proposal network; RoI: region of interest.

REGIONS WITH CNN FEATURES
The main principle of RCNNs [70] is that they first extract a set of object proposals (candidate boxes) using a selective search. The proposals are resized to a fixed scale and fed into a CNN model pretrained on ImageNet [12] to extract high-level features; example backbones include the Visual Geometry Group network [81], residual neural networks (ResNets) [13], and ResNeXt [82]. Then, a linear SVM classifier is used to predict the presence of an object and the object category for each proposed region. RCNNs have achieved remarkable improvement in natural image object detection, but they have obvious drawbacks; for example, the selective search strategy may generate more than 2,000 proposal candidates for one image, significantly increasing the computation cost and slowing the detection speed.
SPATIAL PYRAMID POOLING NETWORK
To reduce the computational costs incurred by an RCNN, He et al. [71], [83] proposed the spatial pyramid pooling network (SPPNet), wherein the SPP layer is the main improvement. Instead of requiring an input image of fixed size, the SPP layer can generate a fixed-length feature representation regardless of the size of the input proposals. During the detection process, the feature maps need only be computed once from the entire image. The SPP layer can then extract the corresponding region of the feature maps and generate a fixed-size feature representation for each region proposal. This significantly speeds up detection by avoiding repeated computation of the feature maps. SPPNet achieved speeds more than 20 times faster than those of RCNNs. However, it is not an end-to-end framework and can fine-tune only its fully connected layers, thus limiting the efficiency and performance of the model.

FAST RCNN AND FASTER RCNN
In 2015, Girshick [72] proposed the fast RCNN detection framework, which uses a unified neural module to localize and recognize targets. It increases detection precision and accelerates detection speed because it can train a classifier and a BB regressor simultaneously. Although fast RCNN outperforms RCNNs and SPPNet, it is restricted by the proposal-generation strategy. The faster RCNN framework presented by Ren et al. [73] is a fully end-to-end framework. It breaks through the speed bottleneck of fast RCNN by introducing a region proposal network (RPN) that generates object proposals with a CNN model. It achieved a near-real-time detection speed and state-of-the-art accuracy. From RCNNs to faster RCNN, the building blocks of a detector, including region proposal generation, feature extraction, and BB regression, have been gradually improved and unified into an effective learning framework.

REGION-BASED FULLY CONVOLUTIONAL NETWORK
The regionwise subnetwork for localizing and recognizing an object in faster RCNN still needs to be applied per region proposal (several hundred proposals per image). To address this problem in faster RCNN, Dai et al. [74] proposed the region-based fully convolutional network (RFCN), a fully convolutional architecture with most of the computations shared over the entire image. Dai et al. constructed a set of position-sensitive score maps by using a bank of specialized convolutional layers as the FCN output and adding a position-sensitive RoI pooling (RoIPool) layer on top. An RFCN with ResNet101 can achieve an accuracy comparable to that of faster RCNN, often with faster running times.

MASK RCNN
Mask RCNN was presented by He et al. [84], [85] to tackle pixelwise object-instance segmentation by extending faster RCNN. Mask RCNN adopts the same two-stage pipeline with an identical first stage (RPN). In the second stage, mask RCNN adds a branch that outputs a binary mask for each RoI in parallel with the class prediction and box offset. The new branch is an FCN [86], [87] on top of a CNN feature map. To avoid the misalignments caused by the original RoIPool layer, an RoI alignment layer was proposed to preserve the pixel-level spatial correspondence. With a ResNeXt101-feature pyramid network (FPN) backbone, mask RCNN achieved the top results for COCO object-instance segmentation and BB object detection [46].
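To make the two-stage pipeline of Figure 9(a) concrete, the following sketch runs a pretrained faster RCNN (ResNet50-FPN backbone) from torchvision on a single image tile. The exact weights argument depends on the torchvision version (older releases use pretrained=True), and the model here is pretrained on natural images, so in practice it would be fine-tuned on HRRS data.

```python
# Minimal two-stage detector inference with torchvision's Faster R-CNN; the
# weights argument is version dependent (older torchvision uses pretrained=True).
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A single RGB tile in [0, 1], shape (3, H, W); large HRRS scenes are usually tiled first.
image = torch.rand(3, 800, 800)

with torch.no_grad():
    outputs = model([image])          # list with one dict per input image

boxes = outputs[0]["boxes"]           # (M, 4) HBBs as (x_min, y_min, x_max, y_max)
scores = outputs[0]["scores"]         # (M,) confidence scores, sorted descending
labels = outputs[0]["labels"]         # (M,) predicted class indices
keep = scores > 0.5                   # confidence threshold t from the metrics section
print(boxes[keep].shape)
```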
FPN
The previous examples detect objects using only the top layer of the feature-extraction network. In some cases, this is not suitable for localizing objects, especially small ones. Lin et al. [88] proposed the FPN, whose top-down architecture has lateral skip connections to the feature maps at all of the remaining scales. It shows great advances for detecting objects with a wide variety of scales and aspect ratios and has become a basic building block in many recent detectors.

CHAINED CASCADE NETWORK AND CASCADE RCNN
Two-stage object detection can be considered a cascade structure; the first detector removes large amounts of background, and the second stage classifies the remaining regions. Recently, end-to-end learning of more than two cascaded classifiers and regressors for generic object detection was proposed in the chained cascade network [89], extended in cascade RCNN [90], and later applied to simultaneous object detection and instance segmentation [91]. These models have a sequence of detection heads trained with increasing IoU thresholds. The subsequent heads, with their increasing IoU thresholds, train on more abundant positive samples to conduct accurate detection and avoid the problem of overfitting.

ONE-STAGE DETECTION FRAMEWORKS
Although two-stage detectors perform satisfactorily, they are computation intensive and therefore unsuitable for scenarios with limited storage and computational capability. Research scholars have therefore started to design one-stage, unified detection approaches to accelerate detection speed. As displayed in Figure 9(b), a one-stage detector directly predicts the locations of the BBs and the class probabilities in an entire image by using a single CNN. It does not involve the steps of region proposal generation, feature resampling, and postclassification but instead encapsulates all of the computations in a single network [20].

YOU ONLY LOOK ONCE
You only look once (YOLO), presented by Redmon et al. [75], is considered the first one-stage detector in the DL era. The model divides the entire image into many regions and then predicts the category probabilities and BB offsets for each region simultaneously. Two improved versions, YOLO v2 and v3, were proposed later [92], [93]; these further promote detection precision while retaining high detection speed. Although they have obvious speed advantages, these models have a lower localization accuracy than do the two-stage models, especially for small-scale objects.
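A minimal sketch of the one-stage prediction head just described and illustrated in Figure 9(b): a convolution maps every cell of the feature grid to box offsets, an objectness score, and class probabilities. The grid size, channel width, number of anchors, and classes are illustrative values, not those of any published YOLO configuration.

```python
# One-stage detection head in miniature: every cell of a feature grid predicts
# box offsets, an objectness score, and class probabilities (YOLO/SSD style).
import torch
import torch.nn as nn

num_classes, num_anchors = 10, 3
head = nn.Conv2d(256, num_anchors * (5 + num_classes), kernel_size=1)

features = torch.rand(1, 256, 32, 32)         # feature grid from the backbone
out = head(features)                           # (1, 3*(5+10), 32, 32)
out = out.view(1, num_anchors, 5 + num_classes, 32, 32)

box_offsets = out[:, :, :4]                    # (tx, ty, tw, th) per anchor and cell
objectness  = out[:, :, 4].sigmoid()           # probability that the cell contains a target
class_probs = out[:, :, 5:].softmax(dim=2)     # category probabilities
print(box_offsets.shape, objectness.shape, class_probs.shape)
```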
SINGLE-SHOT MULTIBOX DETECTOR
To further boost the localization accuracy of one-stage detectors, Liu et al. developed the single-shot multibox detector (SSD) [76], which is faster than YOLO and achieves better detection accuracy. The main idea of SSD is that it effectively combines the RPN idea from faster RCNN with multiscale feature maps, thus achieving high detection accuracy while keeping a fast detection speed. Unlike two-stage detectors, an SSD predicts only a fixed number of BBs, followed by a nonmaximal suppression (NMS) operation to obtain the final results. The network architecture of an SSD uses FCNs. It carries out detection on multiple feature maps, each of which predicts a category score and location offset for each box of an appropriate size.

RETINANET
For years, there has been a large gap between the accuracies of one- and two-stage detectors. Lin et al. [77] claimed that the central cause of this gap is the extreme foreground–background class imbalance encountered during the training of dense detectors. To counter this, a new loss function, focal loss, was proposed in RetinaNet to improve the standard cross-entropy loss. Focal loss makes the detector focus more on hard-to-classify examples during training. It enables one-stage detectors to achieve detection performance comparable to that of two-stage detectors while maintaining a high detection speed.

CORNERNET
Law et al. [78], [94], reasoning that the anchor boxes used for regressing the locations of objects could cause a huge imbalance between positive and negative examples, proposed CornerNet, which formulates BB object detection as the identification of paired top-left and bottom-right key points. In CornerNet, the backbone network consists of two stacked hourglass networks [95], with a simple corner pooling approach to better localize corners. Although its accuracy improved, its detection speed was obviously lower than that of SSD and YOLO. CornerNet may also generate incorrect BBs because it is difficult to decide which pairs of key points belong to the same objects. Duan et al. [96] addressed the problem by detecting each object as a triplet of key points, introducing an extra point at the center of a proposal.

DL FRAMEWORKS FOR SMALL, WEAK OBJECT DETECTION
Though there is no clear definition of small, weak object detection in the field of remote sensing, some excellent DL-based works have been carried out to address the related challenges. Data augmentation is a straightforward and simple technique used to improve the detection accuracy of small objects. Kisantal et al. [97] simply oversampled images with small objects and augmented each of those by copying and pasting objects many times for small-object detection. Features of different levels in DL models can effectively retain the location and semantic information of targets with different scales. The development of multiscale detection, that is, detecting objects at an appropriate feature level, is marked by many milestones, such as the FPN [88] and path aggregation [98], extended [99], multilevel [100], and multiscale FPNs [101]. These models have proved their superiority and achieved satisfactory performances, especially for small-scale object detection.
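A minimal PyTorch sketch of the FPN idea underlying the multiscale milestones just listed: 1 × 1 lateral convolutions project backbone maps to a common width, and coarser maps are upsampled and added top down. The channel sizes and number of levels are illustrative and do not reproduce the configuration of [88].

```python
# A minimal FPN top-down pathway: lateral 1x1 convs project backbone feature maps
# to a common width, then coarser maps are upsampled and added element-wise.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):
        """feats: backbone maps ordered fine-to-coarse (C2, C3, C4, C5)."""
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):   # top-down: add upsampled coarser map
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]  # P2..P5

# Smoke test with feature maps at strides 4, 8, 16, and 32 of a 256 x 256 input.
feats = [torch.rand(1, c, 256 // s, 256 // s)
         for c, s in zip((256, 512, 1024, 2048), (4, 8, 16, 32))]
pyramid = TinyFPN()(feats)
print([p.shape for p in pyramid])  # all levels have 256 channels
```

The fine-resolution levels of the resulting pyramid retain the spatial detail on which small-object localization depends, which is why FPN-style necks recur throughout the small, weak object detectors discussed below.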
Although there has been success with multiscale detection, some objects lack the discriminative features necessary for recognition. Deng et al. [102] developed a feature-level superresolution method that enhances the features of small RoIs. Li et al. [103] proposed a perceptual generative adversarial network (GAN) to improve the representations of tiny objects to large objects with similar characteristics for more precise detection. Visual attention is an effective method used to highlight objects of interest, so it is used to detect small and dim objects. Yang et al. [104] developed a multicategory rotation detector for small, cluttered, and rotated objects wherein a supervised pixel-attention network and a channel-attention network are jointly used for highlighting small and cluttered objects. Lim et al. [105] combined the context information and the objects of interest for addressing the limited information of small objects. To address the nonuniform distribution, Yang et al. [106] presented a clustered detection network wherein a cluster proposal subnetwork can conduct object cluster regions and a scale-estimation subnetwork estimates object scales for each region. The clusterbased scale estimation is more accurate than the ones based on single objects, and the clustered regions implicitly model the prior context information. The detailed techniques and approaches for addressing small, weak target detection are summarized in the next section. ADVANCES FOR ADDRESSING DIFFERENT CHALLENGES IN SMALL, WEAK OBJECT DETECTION Inspired by the significant progress of object-detection methods and technologies, extensive studies have been devoted to object detection in remote sensing. Having thoroughly reviewed the recent progress of representative methods for remote sensing object detection, we introduce some critical technologies and methods that address the challenges to small, weak object detection. All of the mentioned approaches are divided into three aspects for solving the challenges discussed in the “Difficulties and Challenges in Remote Sensing Small, Weak Object Detection” section. HANDLING THE CHALLENGES INVOLVED IN IMAGE QUALITY In remote sensing image acquisition, there are various kinds of uncertain factors, such as noise, blurring, thin IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
clouds, missing information, and shadows, which may cause some degree of image degradation. In addition, due to the limitations of manufacturing technologies and the characteristics of imaging sensors, remote sensing images can reach a high resolution in only one aspect of spectral, spatial, and temporal resolution. These low-quality images cause the missing or false detection of small, weak objects. Therefore, improving the quality of remote sensing images is of great significance for small, weak object detection. In the following, the problems to be solved by the current methods for improving image quality are summarized from two aspects: image degradation and imaging sensor limitations. HANDLING IMAGE DEGRADATION The factors that cause image degradation can be divided into two categories: 1) the atmospheric influence on the reflection wave of ground objects and 2) the loss of information caused by the damaged components of the imaging sensors. Furthermore, a variety of degradation models, such as noise, blurring, thin clouds, missing information, and shadows, have been produced. Over the past few years, many approaches have been developed for addressing these different types of degradation models. In general, noise cannot be entirely avoided while acquiring remote sensing images. The most common types are additive, multiplicative speckle, and stripe noises. Some classical denoising methods are described in [107]–[109]. The causes of blurring in remote sensing images are optical blurring, mainly caused by imaging components; motion blurring, caused by relative motion between the target and sensor; and atmospheric blurring, caused by atmospheric turbulence. Most deblurring models use regularization terms to keep the solution stable and suppress the corresponding noise interference. In general, existing works for image deblurring can be divided into 1) image restoration with a known blur kernel function and 2) blind image restoration with an unknown blur kernel function [110], [111]. A large number of remote sensing images are likely covered by clouds, which can be characterized as thin and thick clouds. Thin clouds lead to the color fading of objects and reduce the contrast of objects in the images, making them difficult to recognize. In recent years, many approaches [112]–[114] have been proposed for thin cloud removal. Thick clouds and damaged sensors cause the loss of some image regions. In this case, the surface information of Earth obtained by images is incomplete and difficult to acquire for real-world applications. Some representative methods [115]–[117] have been developed to restore the missing parts of remote sensing images. Because of the imaging angle of sensors, shadows are one of the basic characteristics of remote sensing images. Tall trees, scattered buildings, mountains, and so on may cause shadows. Many small, weak objects in shadows are more difficult to recognize. Some effective methods for removing shadows are introduced in [118]–[121]. DECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE HANDLING SENSOR LIMITATIONS Due to the limitations of sensors, remote sensing images achieve high performance in only one aspect of spatial, spectral, and temporal resolution, which cannot meet the requirements for some specific tasks. Additionally, when processing remote sensing images, it is necessary to discretize the time, space, spectrum, and observation angle information from original images to save them in the form of digital images. 
The process of discretization often means downsampling data, which inevitably leads to a loss of information. To some extent, image-fusion models that fuse single or multisource images with different resolutions can remedy the degradation of remote sensing images and improve the data quality. Information-complementary fusion methods include spatial and spectral [122], temporal and spatial [123], and multispectral and hyperspectral fusion [124]. HANDLING THE CHALLENGES INVOLVED WITH OBJECT VARIATIONS HRRS images always contain massive object categories and instances, which are variant in scale, appearance, and distribution. The features easily change, as they are affected by weather, illumination, and occlusions. Additionally, due to large image sizes, the problem of unbalanced positive and negative training examples is quite serious, and high-quality training instances are relatively few. Obtaining large-scale annotation data sets is another critical problem for achieving satisfactory detection performance. The aforementioned challenges of object variations are divided into four types in this article: scale variations, high intraclass variations, the imbalance of positive and negative examples, and a lack of annotation data sets. The scale problem should belong with high intraclass variations; however, because of its importance in remote sensing object detection, we list it separately and summarize the corresponding methods. The methods used to address the challenges of these four aspects are introduced in the following sections. HANDLING SCALE VARIATIONS In the remote sensing community, scale variations, overlarge images, complex image backgrounds, and the nonuniform distribution of training samples make detection tasks more challenging, especially for small and cluttered objects. Some targets, such as football fields and harbors, are wider than 150 m and occupy 300 pixels in an image, while the widths of some other targets, such as vehicles, are fewer than 3 m and can occupy only 10 pixels in an image. The multiscale detection of objects with different sizes and aspect ratios is one of the main challenges in remote sensing object detection. Many scholars have further improved the model and achieved better results for robust multiscale detection. There are three main categories of detection methods used in Earth observation. The first category uses an image or sliding-window pyramids as the input. Zhang et al. [125], [126] resized the input image to different scales and extracted image features on each scale. Yao et al. [127]–[129] used 23
multiscale sliding windows with different step sizes to conduct training with images for generating potential candidate boxes. This method, however, is too time- and computation consuming to meet the requirements of practical applications. The second category is based mainly on various multiscale features of a manual design, such as a scale-invariant feature transform (SIFT) [130], an HoG [10], and a BoW [60]. Beril et al. [131] utilized the SIFT feature and graph theory to detect buildings and urban areas. Shi et al. [40], [132] combined both circle-frequency and HoG features to learn the appearances and shapes of objects. Sun et al. [134] developed a spatial sparse-coding BoW model to build the visual vocabulary by clustering local features; it can effectively fuse local and global features. However, the two categories of methods pose difficulties when it comes to achieving satisfactory performances for remote sensing target detection because they all depend on handcrafted features—extracted according to expert experience—and are not robust enough to process complex remote sensing images. Since 2014, many learning-based detectors that incorporate the object proposal strategy, coupled with the remarkable performance of DL-based features [13], [14], [81], [135], have enabled significant improvements in the performance of object localization and recognition [136]–[138]. Multireference and multiresolution detection, developed on this basis, have become the two most widely used fundamental blocks in the task of object detection [21]. The main idea of multireference detection is to predefine a set of reference boxes (anchor boxes) with different sizes and aspect ratios and then to predict the detection box based on those references. The milestone models are faster RCNN [73], RetinaNet [77], and mask RCNN [84], [85]. Multiresolution detection detects objects with different scales by constructing a feature pyramid at different layers of the network. The shallow layers hold information about small objects, while the deep layers contain information about large objects. The main improvements are in the FPN [88]. To detect multiscale objects, especially small ones, in HRRS images, Guo et al. [139] and Zhang et al. [140] designed unified multiscale detection frameworks; they used a modified FPN as well as anchors with different scales and aspect ratios. Qiu et al. [141] developed an adaptive aspect ratio multiscale network, which utilizes a multiscale feature gatefusion subnetwork and an aspect ratio attention network to learn the weights of different feature maps and automatically select the appropriate aspect ratios in accordance with the aspect ratios of objects. Wu et al. [142] introduced multiscale and rotation-insensitive convolutional channel features by involving two modules, the rotation-insensitive descriptor and the multiscale aggregated descriptor. AlAlimi et al. [143] designed a unique shallow-deep feature extraction that employs a squeeze and excitation network and ResNet to obtain feature maps. Deng et al. [144] addressed the problems of scale variants by applying different filters to several intermediate layers. Li et al. [145] proposed 24 multiscale convolutional feature fusion to detect multisensor HRRS images using a symmetric encoder–decoder module to extract and fuse multiscale and high-level spatial features. Some scholars have focused their research work on segmentation methods. Dong and You [146], [147] utilized a graph-segmentation algorithm. 
Based on multiscale saliency maps, it is constructed to overcome the problem of ship scale change to accurately locate candidate regions. Kang et al. [148] designed an FCN with dense SPP for building detection that can extract dense and multiscale features simultaneously. Mo et al. [149] focused on generating an anchor of the most suitable scale for each category and developed a class-specific anchor block, which provides better initial values for an RPN. Xie et al. [150] used multidetectors with different sensitivities and accessed the fused features to finish the task of target detection. Superresolution [102] and GANs [103] have also been used to restore or enhance the features response of small targets during the detection process. HANDLING INTRACLASS VARIATIONS Objects in HRRS images vary in color, texture, and shape feature because of the vast number of object instances and categories as well as the influences of weather, illumination, imaging condition, and occlusion. For real-scenario HRRS image object detection, powerful object representations should be extracted with robustness and discrimination. Many recent works have been devoted to handling changes in object variations by applying DL models to remote sensing object detection. However, CNN models lack the ability to be spatially invariant for generating transformations of input data. In processing HRRS images, the performance of these models is limited due to the intraclass variations of objects. Data augmentation is the most straightforward method used to address intraclass variations, including rotation and resizing. To some extent, these operations can make detectors learn robustness with regard to rotation and scale, although these methods can involve expensive training and a massive number of model parameters. Therefore, many attempts have been made to learn invariant CNN representations with respect to different transformations, including scale [151]–[153], rotation [151], [154]–[156], or both [157]. Early deformable part-based models (DPMs) [157], which represent objects by components arranged in a deformable configuration, were successful for generic object detection, but these models are less sensitive to object variations in both pose and viewpoint. Many scholars have attempted to combine DPMs with CNNs, aiming to realize the advantages of both [159]–[161]. To address the problem of occlusions, deformable RoIPool [161]–[163] and deformable convolution have been proposed to achieve more flexibility in fixed geometric structures [27]. Another method, the application of GANs [164], [165] to generate missing parts of objects and context, is promising. IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
HANDLING THE IMBALANCE OF POSITIVE AND NEGATIVE EXAMPLES
In essence, training a detector is a problem in imbalanced data learning. For detectors based on a sliding window, the imbalance between objects of interest and backgrounds may be as extreme as $10^4$–$10^5$ background windows for each object [21]. For a modern detection task with a prediction of the object aspect ratio, the imbalance increases to greater than $10^6$. In this case, a vast number of negative and easy samples would dominate the training process, and the detector would achieve poor performance for hard-to-recognize objects, especially small, weak objects.

Hard negative mining focuses on solving the problem of imbalanced data during the training process. Bootstrapping was a milestone technique for addressing the training data imbalance in object detection: the training starts with a small number of background samples, and new misclassified backgrounds are added iteratively during the training process [166]. Later, in the DL era, detectors such as faster RCNN [73] and YOLO [75] used a weighted balancing method for positive and negative samples. However, that method cannot completely address the imbalanced data problem. Bootstrapping was reused in DL-based detectors [76], [167]. In RefineDet [168], an anchor-refinement module is designed to filter easy negatives. An alternative improvement is to design new loss functions [77], [170] by reshaping the standard cross-entropy loss to put more focus on difficult, misclassified examples. The recent A-Fast-RCNN detection model [164], which utilizes GANs to handle occlusion and deformation samples, is also regarded as an example of a hard-mining approach. Pang et al. [172] proposed an IoU-balanced sampling method to adaptively select high-quality negative examples from the proposal candidates to stabilize the training process.

In the Earth observation literature, recent research reveals that detection data sets contain an overwhelming number of easy examples and only a few difficult examples. Many scholars have therefore tried to mine the more representative difficult examples to balance the proportion of foreground–background class examples. Traditional methods usually freeze the model to mine negative examples; however, positive sample mining is also essential to avoid missed detections. Besides, freezing the model to collect difficult examples dramatically slows the training of the model. Cheng et al. [173] developed a two-step iterative training strategy that alternates between updating the detection model given the training set and adaptively selecting the difficult negative examples used to update the detection model. Focusing on airport detection, Cai et al. [174] and Xu et al. [175] applied cascade strategies to automatically select difficult examples according to the loss values of proposals. The cascade strategies significantly inhibited the false alarms that existed in airport detection.

HANDLING INSUFFICIENT TRAINING DATA
The difficulty of acquiring annotated samples means that the training data are usually not sufficient for obtaining ideal models, and data augmentation is the most straightforward method for increasing the training data. In addition, research scholars have developed many methods to address the problem; these can be divided into three categories: transfer learning (TL), active learning (AL), and weakly supervised learning (WSL).
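Before detailing these three strategies, the loss-reshaping approach to class imbalance mentioned in the preceding subsection [77], [170] can be made concrete with a minimal binary focal-loss sketch; the γ and α values are the commonly used defaults, and the anchor counts in the example are illustrative.

```python
# Binary focal loss as used in RetinaNet-style dense detectors:
# FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t); gamma=2, alpha=0.25 are common defaults.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits: raw per-anchor scores; targets: 1 for object anchors, 0 for background."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Example: 10,000 anchors of which only 20 are positives, mimicking the extreme
# foreground-background imbalance; easy negatives contribute almost nothing to the loss.
logits = torch.randn(10_000)
targets = torch.zeros(10_000)
targets[:20] = 1.0
print(focal_loss(logits, targets).item())
```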
TL can effectively transfer welltrained knowledge from one or more source tasks to another task; this needs only a small amount of labeled data and eliminates the drudgery of preliminary learning [176]– [179]. Dong et al. [180] proposed a Sig-NMS-based faster RCNN with TL; this can annotate not only the class of an object but also its location. Chan-Hon-Tong et al. [181] and Kellenberger et al. [182] exploited an AL-based strategy to find very confident samples for the quick retrieval of TPs in the target data set. Another method, WSL, addresses the data insufficiency problem by training detection using image-level labels only. Recently, research works on WSL have followed different branches. Some scholars have utilized multi-instance learning for WSL [183]–[185]. If an image contains many object candidates, it is considered to involve a set of labeled bags, with each bag containing many instances; image-level annotation acts as the label. The object detector is then obtained by alternating detector training, using the detector to select the most likely object instances in positive images. Research works on CNN visualization have demonstrated that the convolution layer of a CNN model behaves as a target detector even though there is no supervision of the object’s location. Therefore, class-activation mapping sheds light on a way to give a CNN model localization ability by training it on image-level labels [186]–[188]. Some scholars automatically select the most informative regions and train them with image-level annotation [189]. Another method masks out different regions of the image to localize the object [190]. Interactive annotation [184] and generative adversarial training have also been used for WSL [191]. To address the problem of a lack of annotated HRRS data sets, Zhang et al. [192] employed an iterative, weakly supervised learning framework to automatically mine and augment a training data set from the original images. Cao et al. [193] proposed a novel multi-instance-detection algorithm based on learning, using it to learn instancewise detectors from such a “weak annotation.” In the algorithm, a density estimator is adopted to estimate the density map of vehicle instances from the positive regions; a multi-instance SVM is then trained to classify and locate vehicle instances from this map. Although existing WSL methods take scenes as being isolated and ignore the mutual cues between scene pairs when optimizing deep networks, Li et al. [194] exploited both the separate scene category information and the mutual cues between scene pairs to train deep networks well enough to pursue superior objectdetection performance. 25
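A common way to put TL into practice, as described above, is to start from a detector pretrained on natural images and retrain only a new detection head on a small annotated HRRS set. The following sketch uses the torchvision fine-tuning pattern; the weights argument, the number of classes, and the choice of frozen layers are illustrative assumptions that depend on the library version and the task.

```python
# Transfer-learning sketch: reuse a COCO-pretrained Faster R-CNN backbone and
# replace its box predictor for a small HRRS data set (torchvision-tutorial pattern).
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 11  # e.g., 10 object categories + background (illustrative)

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Optionally freeze the backbone so that only the detection head is trained,
# which reduces the number of labeled samples needed.
for p in model.backbone.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.005, momentum=0.9, weight_decay=5e-4)
```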
HANDLING COMPLEX CONTEXT Objects of interest are always embedded in a typical context with surrounding environments and objects. An HRRS image usually involves a broad range of space and contains many kinds of objects that form an intricate spatial pattern. The complex background of the objects of interest increases the difficulty of highly accurate detection; however, many existing works have demonstrated that the proper use of context information can improve the performance of detectors. Current works on the adaptation of complex backgrounds have been divided into two categories: 1) detection with a suppressing background and 2) detection with related context information. DETECTION WITH SUPPRESSING BACKGROUND Many early works, taking advantage of the remarkable feature-extraction ability of the CNN model, directly applied the models to adapt to the complex, changeable background and learn discriminative features for HRRS image detection [126], [195]. To effectively distinguish between the target and background information, Xiao et al. [196] designed an encoder–decoder network to perform paired semantic segmentation for per-pixel prediction. The top-left and bottom-right parts of the objects of interest are then predicted, and the rotated minimum BB is generated as the rotated anchor. Compared to the presented methods, this method is more robust across different data sets. DETECTION WITH RELATED CONTEXT INFORMATION The remote sensing community has long acknowledged that context information benefits the improvement of object detection. Therefore, more work has been done to explore how to make good use of that information. Context information can be placed into two categories: local and global context [21]. Local context refers to visual information such as the texture, color, and objects in the region that surrounds the targets to be detected. In contrast, global context employs scene semantics as the additional information for target detection. Existing methods focus mainly on fusing local contexts to improve detection performance. Gong et al. [197] integrated the context RoIs’ mining layer into the detector. The layer can extract local context features by mapping context RoIs to multilevel feature maps. Considering the limited label information provided by objects—especially small objects—in the feature map, Mo et al. [149] doubled the size of the region proposal box, with the center in the predicted box, to incorporate the local context information and thus improve the discriminative ability of features in recognizing the objects. Ma et al. proposed a multimodel decision fusion network [198], based on gated recurrent units (GRUs) [199], in which one of the subnetworks is designed to learn the local context of objects of interest and the object–object relationships. GRUs are used to merge all of the features and form discriminative-feature representation. Bell et al. [200] developed the inside–outside network (ION) to exploit information both inside and outside the 26 RoIs; it integrates the contextual information outside the RoIs by using spatial recurrent neural networks. Xiao et al. [129] fused auxiliary features within and around the RoIs to represent the complementary information of each region proposal for airport detection, effectively alleviating detection problems caused by the diversity of illumination intensities in remote sensing images. To generate accurate rotation BBs in large-scale aerial images, Feng et al. 
[202] proposed a detection network that introduced a novel sequence local context module. It can extract local context features, thus making the rotated BB fit the ship tightly. The accurate BB can include the discriminative parts, such as the prow, and exclude noise information, such as the background. Other works have promoted the global context as additional information. Focusing on the task of vehicle detection, Tao et al. [158] proposed a vehicle-detection method driven by scene context. This first classifies the input image into different scene categories (e.g., road, parking lot, and others) and then detects vehicles in different scenes separately from the contextual information provided by the prior scene. Incorporating the scene before vehicle detection can effectively confine the region where vehicles may be present and apply a more flexible postprocessing strategy according to different scene types. By analyzing the relationship of objects and scenes in remote sensing images, Chen et al. [133] found that most of the objects appeared in their relevant scenes. The objects have a strong correlation with the contextual information of their scene. Chen et al. [133] proposed a scene-contextual FPN that fuses the global scene features into region proposal features for training the classifier. Both global and local contextual information is valuable, so the fusion of the two may achieve a better performance. Relevant work has been carried out on this approach. Zhang and Liu et al. [169] proposed a context-aware detection network to improve the accuracy of target detection; this can learn the correlations of the global information (at the scene level) and the local neighboring objects or features (at the object level). Li and Gong [156] used a double-channel network to fuse the local and global features to enhance the discrimination of the feature. FUTURE RESEARCH DIRECTIONS Despite tremendous recent progress in small, weak object detection, the main technologies are still primitive and cannot satisfactorily address all the difficulties and challenges. Our analysis shows that future research may focus on (but should not be limited to) the following areas. DETECTION WITH MULTISOURCE DATA FUSION Detectors for small, weak object detection may not be stable. The fusion of multiple sources/modalities of data, such as 3D point clouds, lidar, and Internet data, is of great importance for improving detection accuracy. Two critical problems should be addressed: how to encode multisource or multimodal data into a unified input for the detectors IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
WEAKLY SUPERVISED DETECTION
Recent state-of-the-art approaches require many samples with accurate annotation, in the manner of fully supervised learning. However, labeling samples is labor intensive and time consuming. Meanwhile, weakly/partially annotated or unlabeled samples are easily accessible and abundant. Therefore, it is essential to leverage DL-based models that can learn from these samples to boost detection ability.

LIGHTWEIGHT OBJECT MODEL
The number of layers in existing CNN models for extracting features has dramatically increased from several [14] to hundreds [13], [206]. These models have millions of parameters and need massive computation resources and training data to obtain an ideal model. To train CNN models efficiently, much work has been done to develop a series of lightweight and compact models. However, a significant gap in efficiency between detectors and the human eye remains.

AUTOMATIC NEURAL ARCHITECTURE SEARCH
Most existing target detectors are based on manual design. Meeting problems of ever-increasing complexity requires increasing domain knowledge and expertise. Recently, a natural research direction has been to automatically search for and build a detector that balances performance against the number of parameters, as in automated ML [201]. Related work should be carried out for small, weak object detection.

IMPROVEMENT OF IMAGE QUALITY
Affected by imaging conditions such as weather, light, and the resolution of sensors, remote sensing images may not meet the requirements of usage, as they are blurred or noisy or have low resolution. Algorithms for image fusion, image denoising, and superresolution have been developed to address these problems. These should be combined with detection methods to improve detection performance.

UNIVERSAL OBJECT FRAMEWORK
Recently, increasing efforts have been made in learning universal representations, reinforcement learning, and lifelong learning; these are effective in learning, transferring, and reasoning about knowledge from massive data. It is meaningful to design a universal object framework based on state-of-the-art advances, one that can gradually self-evolve and improve detection performance.

CONCLUSIONS
To meet the requirements of some applications, the task of small, weak object detection, which is more challenging than generic object detection, has become increasingly important and has attracted much attention. During the last several years, considerable efforts have been made to develop various methods that address small, weak object detection. This article presented a systematic review of the advances in small, weak object detection in the remote sensing community. Having analyzed the challenges and difficulties of small, weak target detection, we discussed the technical evolution of object detection and benchmark data sets. Finally, we categorized the existing works according to the challenges they address and outlined some promising research directions for the further improvement of small, weak object detection. Research on small, weak object detection is still far from complete, but given the breakthroughs over the past several years, we are optimistic about future developments.
ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China under grants U1711266 and 41925007 and the Fundamental Research Funds for the Central Universities, China University of Geosciences, Wuhan (no. 162301212697). Lizhe Wang and Ruyi Feng are the corresponding authors.

AUTHOR INFORMATION
Wei Han (weihan@cug.edu.cn) is with the School of Computer Science, China University of Geosciences, Wuhan, 430078, China, and also the Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, 430078, China.
Jia Chen (chen_jia@cug.edu.cn) is with the School of Computer Science, China University of Geosciences, Wuhan, 430078, China, and also the Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, 430078, China.
Lizhe Wang (lizhe.wang@foxmail.com) is with the School of Computer Science, China University of Geosciences, Wuhan, 430078, China, the Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, 430078, China, and the Key Laboratory of Geological Survey and Evaluation of the Ministry of Education, China University of Geosciences, Wuhan, 430078, China.
Ruyi Feng (fengry@cug.edu.cn) is with the School of Computer Science, China University of Geosciences, Wuhan, 430078, China, and also the Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, 430078, China.
Fengpeng Li (li_feng_peng@cug.edu.cn) is with the School of Computer Science, China University of Geosciences, Wuhan, 430078, China, and also the Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, 430078, China.
Lin Wu (wulin@cug.edu.cn) is with the Key Laboratory of Geological Survey and Evaluation of the Ministry of Education, China University of Geosciences, Wuhan, 430074, China.
Tian Tian (tiantian@cug.edu.cn) is with the School of Computer Science, China University of Geosciences, Wuhan, 430078, China, and also the Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, 430078, China.
Jining Yan (yanjn@cug.edu.cn) is with the School of Computer Science, China University of Geosciences, Wuhan, 430078, China, and also the Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, 430078, China.

REFERENCES
[1] Z. Lin et al., "A contextual and multitemporal active-fire detection algorithm based on FengYun-2G S-VISSR data," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 11, pp. 8840–8852, 2019. doi: 10.1109/TGRS.2019.2923248. [2] Z. Lin et al., "An active fire detection algorithm based on multitemporal FengYun-3C VIRR data," Remote Sens. Environ., vol. 211, pp. 376–387, June 2018. doi: 10.1016/j.rse.2018.04.027. [3] N. Wang, F. Chen, B. Yu, and Y. Qin, "Segmentation of large-scale remotely sensed images on a spark platform: A strategy for handling massive image tiles with the MapReduce model," ISPRS J. Photogram. Remote Sens., vol. 162, pp. 137–147, Apr. 2020. doi: 10.1016/j.isprsjprs.2020.02.012. [4] N. Gorelick, M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore, "Google earth engine: Planetary-scale geospatial analysis for everyone," Remote Sens. Environ., vol. 202, pp. 18–27, Dec. 2017. doi: 10.1016/j.rse.2017.06.031. [5] D. Li, Y. Ke, H. Gong, and X. Li, "Object-based urban tree species classification using bi-temporal worldview-2 and worldview-3 images," Remote Sens., vol. 7, no. 12, pp. 16,917–16,937, 2015. doi: 10.3390/rs71215861. [6] K. Huang and X. Mao, "Detectability of infrared small targets," Infrared Phys. Techn., vol. 53, no. 3, pp. 208–217, 2010. doi: 10.1016/j.infrared.2009.12.001. [7] D. M. McKeown Jr. and J. L. Denlinger, "Cooperative methods for road tracking in aerial imagery," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 1988, pp. 662–672. doi: 10.1109/CVPR.1988.196307. [8] S. Leninisha and K. Vani, "Water flow based geometric active deformable model for road network," ISPRS J. Photogram. Remote Sens., vol. 102, pp. 140–147, Apr. 2015. doi: 10.1016/j.isprsjprs.2015.01.013. [9] J. Peng and Y. Liu, "Model and context-driven building extraction in dense urban aerial images," Int. J. Remote Sens., vol. 26, no. 7, pp. 1289–1307, 2005. doi: 10.1080/01431160512331326675. [10] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), 2005, pp. 886–893. doi: 10.1109/CVPR.2005.177. [11] G. Cheng and J. Han, "A survey on object detection in optical remote sensing images," ISPRS J. Photogram. Remote Sens., vol. 117, pp. 11–28, July 2016. doi: 10.1016/j.isprsjprs.2016.03.014. [12] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), 2009, pp. 248–255. [13] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90. [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. 25: 26th Annu. Conf. Neural Inf. Process. Syst., 2012, pp. 1106–1114. [15] F. Li, R. Feng, W. Han, and L. Wang, "High-resolution remote sensing image scene classification via key filter bank based on convolutional neural network," IEEE Trans. Geosci. Remote Sens., vol. 58, no. 11, pp. 
8077–8092, 2020. doi: 10.1109/TGRS .2020.2987060. [16] K. Liu and G. Mattyus, “Fast multiclass vehicle detection on aerial images,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 9, pp. 1938–1942, 2015. doi: 10.1109/LGRS.2015.2439517. [17] S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery: A small target detection benchmark,” J. Vis. Commun. Image R, vol. 34, pp. 187–203, 2016. doi: 10.1016/j.jvcir.2015.11.002. [18] G. Xia et al., “DOTA: A large-scale dataset for object detection in aerial images,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3974–3983. doi: 10.1109/CVPR.2018.00418. [19] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in optical remote sensing images: A survey and a new benchmark,” ISPRS J. Photogram. Remote Sens., vol. 159, pp. 296–307, Jan. 2020. doi: 10.1016/j.isprsjprs.2019.11.023. [20] L. Liu et al., “Deep learning for generic object detection: A survey,” Int. J. Comput. Vis., vol. 128, no. 2, pp. 261–318, 2020. doi: 10.1007/s11263-019-01247-4. [21] Z. Zou, Z. Shi, Y. Guo, and J. Ye, “Object detection in 20 years: A survey,” 2019. [Online]. Available: http://arxiv.org/ abs/1905.05055 [22] M. Manana, C. Tu, and P. A. Owolawi, “A survey on vehicle detection based on convolution neural networks,” in Proc. 3rd IEEE Int. Conf. Comput. Commun. (ICCC), 2017, pp. 1751–1755. doi: 10.1109/CompComm.2017.8322840. [23] A. Borji, M.-M. Cheng, Q. Hou, H. Jiang, and J. Li, “Salient object detection: A survey,” Comput. Vis. Media, vol. 1411, no. 7, pp. 1–34, 2014. doi: 10.1007/s41095-019-0149-9. [24] W. Wang, Q. Lai, H. Fu, J. Shen, and H. Ling, “Salient object detection in the deep learning era: An in-depth survey,” 2019, arXiv:1904.09146. [25] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu, “Advanced deep-learning techniques for salient and category-specific object detection: A survey,” IEEE Signal Process. Mag., vol. 35, no. 1, pp. 84–100, 2018. doi: 10.1109/MSP.2017.2749125. [26] Z. Liu, H. Wang, L. Weng, and Y. Yang, “Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 8, pp. 1074–1078, 2016. doi: 10.1109/ LGRS.2016.2565705. [27] X. Zhang, Y. Yang, Z. Han, H. Wang, and C. Gao, “Object class detection: A survey,” ACM Comput. Surv., vol. 46, no. 1, pp. 10:1–10:53, 2013. doi: 10.1145/2522968.2522978. [28] G.-D. Wang, C.-Y. Chen, and X.-B. Shen, “Facet-based infrared small target detection method,” Electron. Lett., vol. 41, no. 22, pp. 1244–1246, 2005. doi: 10.1049/el:20052289. [29] G. J. Klinker, S. A. Shafer, and T. Kanade, “Image segmentation and reflection analysis through color,” in Proc. Appl. Artific. Intell. VI, vol. 937, 1988, pp. 229–244. doi: 10.1117/12.946980. IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
[30] P. W. Kruse, “Principles of uncooled infrared focal plane arrays,” in Semiconductors Semimetals, vol. 47, P. W. Kruse and D. D. Skatrud, Amsterdam, The Netherlands: Elsevier, 1997, pp. 17–42. [31] J. Han, Y. Ma, B. Zhou, F. Fan, K. Liang, and Y. Fang, “A robust infrared small target detection algorithm based on human visual system,” IEEE Geosci. Remote Sens. Lett., vol. 11, no. 12, pp. 2168–2172, 2014. doi: 10.1109/LGRS.2014.2323236. [32] B. Lei, B. Wang, G. Sun, Y. Xu, P. Hong, C. Liu, and S. Yue, “A fast detection method for small weak infrared target in complex background,” in Proc. Infrared, Millimeter-Wave, Terahertz Technol. IV, vol. 10030, 2016, p. 100301V. doi: 10.1117/12.2245912. [33] A. G. Tartakovsky, S. Kligys, and A. Petrov, “Adaptive sequential algorithms for detecting targets in a heavy IR clutter,” in Proc. Signal Data Process. Small Targets 1999, vol. 3809, pp. 119– 130. doi: 10.1117/12.364013. [34] A. G. Tartakovsky and R. B. Blazek, “Effective adaptive spatialtemporal technique for clutter rejection in IRST,” in Proc. Signal Data Process. Small Targets 2000, vol. 4048, pp. 85–95. doi: 10.1117/12.392023. [35] B. L. Rozovskii, A. Petrov, and R. B. Blazek, “Interactive banks of Bayesian matched filters,” in Proc. Signal Data Process Small Targets 2000, vol. 4048, pp. 122–133. doi: 10.1117/12.391972. [36] C. Benedek, X. Descombes, and J. Zerubia, “Building development monitoring in multitemporal remotely sensed image pairs with stochastic birth-death dynamics,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 33–50, 2012. doi: 10.1109/TPAMI.2011.94. [37] G. Cheng, J. Han, P. Zhou, and L. Guo, “Multi-class geospatial object detection and geographic image classification based on collection of part detectors,” ISPRS J. Photogram. Remote Sens., vol. 98, pp. 119–132, 2014. doi: 10.1016/j.isprsjprs.2014.10.002. [38] H. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye, and J. Jiao, “Orientation robust object detection in aerial images using deep convolutional neural network,” in Proc. IEEE Int. Conf. Image Process., 2015, pp. 3735–3739. doi: 10.1109/ICIP.2015.7351502. [39] T. N. Mundhenk, G. Konjevod, W. A. Sakla, and K. Boakye, “A large contextual dataset for classification, detection and counting of cars with deep learning,” in Proc. Comput. Vis. - ECCV 2016 - 14th Euro. Conf., Amsterdam, The Netherlands, pp. 785– 800. doi: 10.1007/978-3-319-46487-9_48. [40] Z. Xiao, Q. Liu, G. Tang, and X. Zhai, “Elliptic fourier transformation-based histograms of oriented gradients for rotationally invariant object detection in remote-sensing images,” Int. J. Remote Sens., vol. 36, no. 2, pp. 618–644, 2015. doi: 10.1080/01431161.2014.999881. [41] D. Lam et al., “xView: Objects in context in overhead imagery,” 2018, arXiv:1802.07856. [42] Y. Zhang, Y. Yuan, Y. Feng, and X. Lu, “Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 8, pp. 5535–5548, 2019. doi: 10.1109/TGRS.2019.2900302. [43] X. Lu, Y. Zhang, Y. Yuan, and Y. Feng, “Gated and axis-concentrated localization network for remote sensing object detection,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 1, pp. 179–192, 2020. doi: 10.1109/TGRS.2019.293517. DECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE [44] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman, “The Pascal visual object classes challenge: A retrospective,” Int. J. Comput. Vis., vol. 111, no. 1, pp. 98–136, 2015. 
doi: 10.1007/s11263-014-0733-5. [45] O. Russakovsky et al., “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015. doi: 10.1007/s11263-015-0816-y. [46] T. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. Comput. Vis. - ECCV 2014 - 13th Euro. Conf., D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds., in Lecture Notes in Computer Science, vol. 8693, 2014, pp. 740–755. doi: 10.1007/978-3-319-10602-1_48. [47] A. Kuznetsova et al., “The open images dataset V4: Unified image classification, object detection, and visual relationship detection at scale,” 2018. [Online]. Available: http://arxiv.org/ abs/1811.00982 [48] G. Heitz and D. Koller, “Learning spatial context: Using stuff to find things,” in ECCV 2008: Proc. 10th Euro. Conf. Comput. Vis., Part I, pp. 30–43. doi: 10.1007/978-3-540-88682-2_4. [49] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman, “The Pascal visual object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010. doi: 10.1007/s11263-009-0275-4. [50] J. Zhang, X. Lin, Z. Liu, and J. Shen, “Semi-automatic road tracking by template matching and distance transformation in urban areas,” Int. J. Remote Sens., vol. 32, no. 23, pp. 8331–8347, 2011. doi: 10.1080/01431161.2010.540587. [51] J. Zhou, W. F. Bischof, and T. Caelli, “Road tracking in aerial images based on human–computer interaction and Bayesian filtering,” ISPRS J. Photogram. Remote Sens., vol. 61, no. 2, pp. 108–124, 2006. doi: 10.1016/j.isprsjprs.2006.09.002. [52] M. A. Fischler and R. A. Elschlager, “The representation and matching of pictorial structures,” IEEE Trans. Comput., vol. C -22, no. 1, pp. 67–92, 1973. doi: 10.1109/ T-C.1973.223602. [53] A. K. Jain, Y. Zhong, and M. Dubuisson-Jolly, “Deformable template models: A review,” Signal Process., vol. 71, no. 2, pp. 109–129, 1998. doi: 10.1016/S0165-1684(98)00139-X. [54] C. Xu and H. Duan, “Artificial bee colony (ABC) optimized edge potential function (EPF) approach to target recognition for low-altitude aircraft,” Pattern Recognit. Lett., vol. 31, no. 13, pp. 1759–1772, 2010. doi: 10.1016/j. patrec.2009.11.018. [55] A. Huertas and R. Nevatia, “Detecting buildings in aerial images,” Comput. Vis. Graph. Image Process., vol. 41, no. 2, pp. 131– 152, 1988. doi: 10.1016/0734-189X(88)90016-3. [56] R. B. Irvin and D. M. McKeown, “Methods for exploiting the relationship between buildings and their shadows in aerial imagery,” IEEE Trans. Syst., Man, Cybern., vol. 19, no. 6, pp. 1564–1575, 1989. doi: 10.1109/21.44071. [57] T. Blaschke, “Object based image analysis for remote sensing,” ISPRS J. Photogram. Remote Sens., vol. 65, no. 1, pp. 2–16, 2010. doi: 10.1016/j.isprsjprs.2009.06.004. [58] T. Blaschke et al., “Geographic object-based image analysis–Towards a new paradigm,” ISPRS J. Photogram. Remote Sens., vol. 87, pp. 180–191, Jan. 2014. doi: 10.1016/j.isprsjprs.2013.09.014. 29
[59] T. Blaschke, S. Lang, and G. Hay, Object-Based Image Analysis: Spatial Concepts for Knowledge-Driven Remote Sensing Applications. Berlin: Springer-Verlag, 2008. [60] F. Li and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. CVPR, pp. 524–531. doi: 10.1109/CVPR.2005.16. [61] Ö. Aytekin, U. Zöngür, and U. Halici, “Texture-based airport runway detection,” IEEE Geosci. Remote Sens. Lett., vol. 10, no. 3, pp. 471–475, 2013. doi: 10.1109/LGRS.2012.2210189. [62] C. Senaras, M. Ozay, and F. T. Yarman-Vural, “Building detection with decision fusion,” IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens., vol. 6, no. 3, pp. 1295–1304, 2013. doi: 10.1109/ JSTARS.2013.2249498. [63] V. Vapnik, Statistical Learning Theory. Hoboken, NJ: Wiley, 1998. [64] J. Inglada, “Automatic recognition of man-made objects in high resolution optical remote sensing images by svm classification of geometric image features,” ISPRS J. Photogram. Remote Sens., vol. 62, no. 3, pp. 236–248, 2007. doi: 10.1016/j. isprsjprs.2007.05.011. [65] Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm,” in Proc. 13th Int. Conf. Machine Learn. (ICML ‘96), Bari, Italy, 1996, pp. 148–156. doi: 10.5555/3091696.3091715. [66] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997. doi: 10.1006/ jcss.1997.1504. [67] E. Blanzieri and F. Melgani, “Nearest neighbor classification of remote sensing images with the maximal margin principle,” IEEE Trans. Geosci. Remote Sens., vol. 46, no. 6, pp. 1804–1811, 2008. doi: 10.1109/TGRS.2008.916090. [68] E. Li, J. Femiani, S. Xu, X. Zhang, and P. Wonka, “Robust rooftop extraction from visible band images using higher order CRF,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 8, pp. 4483– 4495, 2015. doi: 10.1109/TGRS.2015.2400462. [69] P. Zhong and R. Wang, “A multiple conditional random fields ensemble model for urban area detection in remote sensing optical images,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 12, pp. 3978–3988, 2007. doi: 10.1109/TGRS.2007.907109. [70] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., (CVPR), 2014, pp. 580–587. doi: 10.1109/CVPR.2014.81. [71] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, 2015. doi: 10.1109/TPAMI.2015.2389824. [72] R. B. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Santiago, Chile, 2015, pp. 1440–1448. doi: 10.1109/ICCV.2015.169. [73] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Annu. Conf. Neural Inf. Process. Syst., 2015, pp. 91–99. [74] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” in Proc. Annu. Conf. Neural Inf. Process. Syst., 2016, pp. 379–387. 30 [75] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 779–788. [76] W. Liu et al., “SSD: Single shot multibox detector,” in Proc. 14th Euro. Conf. Comput. 
Vis., Amsterdam, The Netherlands, 2016, pp. 21–37. doi: 10.1007/978-3-319-46448-0_2. [77] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 318–327, 2020. doi: 10.1109/ TPAMI.2018.2858826. [78] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” Int. J. Comput. Vis., vol. 128, no. 3, pp. 642–656, 2020. doi: 10.1007/s11263-019-01204-1. [79] Y. Zhao and J. Yang, “Hyperspectral image denoising via sparse representation and low-rank constraint,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 1, pp. 296–308, 2015. doi: 10.1109/ TGRS.2014.2321557. [80] P. A. Viola and M. J. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. 2001 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 511–518. doi: 10.1109/CVPR.2001.990517. [81] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014. [Online]. Available: http://arxiv.org/abs/1409.1556 [82] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proc. 2017 IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5987–5995. doi: 10.1109/CVPR.2017.634. [83] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in Proc. 13th Euro. Conf. Comput. Vis., 2014, pp. 346–361. doi: 10.1007/978-3-319-10578-9_23. [84] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 2980– 2988. doi: 10.1109/ICCV.2017.324. [85] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 386– 397, 2020. doi: 10.1109/TPAMI.2018.2844175. [86] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 3431–3440. doi: 10.1109/CVPR.2015.7298965. [87] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, 2017. doi: 10.1109/ TPAMI.2016.2572683. [88] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 936–944. doi: 10.1109/CVPR.2017.106. [89] W. Ouyang, K. Wang, X. Zhu, and X. Wang, “Chained cascade network for object detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 1956–1964. doi: 10.1109/ICCV.2017.214. [90] Z. Cai and N. Vasconcelos, “Cascade R-CNN: Delving into high quality object detection,” in Proc. IEEE Conf. Comput. Vis. IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
Pattern Recognit. (CVPR), 2018, pp. 6154–6162. doi: 10.1109/CVPR.2018.00644. [91] K. Chen et al., "Hybrid task cascade for instance segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 4974–4983. doi: 10.1109/CVPR.2019.00511. [92] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 6517–6525. doi: 10.1109/CVPR.2017.690. [93] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018. [Online]. Available: http://arxiv.org/abs/1804.02767 [94] H. Law and J. Deng, "CornerNet: Detecting objects as paired keypoints," in Proc. 15th Euro. Conf. Comput. Vis., 2018, pp. 765–781. [95] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in Proc. 14th Euro. Conf. Comput. Vis., Amsterdam, The Netherlands, 2016, pp. 483–499. [96] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "Centernet: Keypoint triplets for object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 6568–6577. [97] M. Kisantal, Z. Wojna, J. Murawski, J. Naruniec, and K. Cho, "Augmentation for small object detection," 2019. [Online]. Available: http://arxiv.org/abs/1902.07296 [98] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 8759–8768. [99] C. Deng, M. Wang, L. Liu, and Y. Liu, "Extended feature pyramid network for small object detection," 2020. [Online]. Available: https://arxiv.org/abs/2003.07021 [100] Q. Zhao et al., "M2Det: A single-shot object detector based on multi-level feature pyramid network," in Proc. 33rd AAAI Conf. Artific. Intell., 2019, pp. 9259–9266. doi: 10.1609/aaai.v33i01.33019259. [101] Z. Liu, G. Gao, L. Sun, and Z. Fang, "HRDNet: High-resolution detection network for small objects," 2020. [Online]. Available: https://arxiv.org/abs/2006.07607 [102] J. Noh, W. Bae, W. Lee, J. Seo, and G. Kim, "Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 9724–9733. [103] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan, "Perceptual generative adversarial networks for small object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 1951–1959. doi: 10.1109/CVPR.2017.211. [104] X. Yang et al., "SCRDet: Towards more robust detection for small, cluttered and rotated objects," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 8231–8240. doi: 10.1109/ICCV.2019.00832. [105] J. Lim, M. Astrid, H. Yoon, and S. Lee, "Small object detection using context and attention," 2019. [Online]. Available: http://arxiv.org/abs/1912.06319 [106] F. Yang, H. Fan, P. Chu, E. Blasch, and H. Ling, "Clustered object detection in aerial images," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 8310–8319. [107] V. S. Frost, J. A. Stiles, K. S. Shanmugan, and J. C. Holtzman, "A model for radar images and its application to adaptive digital filtering of multiplicative noise," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-4, no. 2, pp. 157–166, 1982. doi: 10.1109/TPAMI.1982.4767223. [108] D. T. Kuan, A. A. Sawchuk, T. C. Strand, and P. Chavel, "Adaptive noise smoothing filter for images with signal-dependent noise," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-7, no. 2, pp. 165–177, 1985. 
doi: 10.1109/TPAMI.1985.4767641. [109] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Phys. D, Nonlinear Phenomena, vol. 60, nos. 1–4, pp. 259–268, 1992. doi: 10.1016/01672789(92)90242-F. [110] C. R. Vogel and M. E. Oman, “Fast, robust total variationbased reconstruction of noisy, blurred images,” IEEE Trans. Image Process., vol. 7, no. 6, pp. 813–824, 1998. doi: 10.1109/ 83.679423. [111] J. Cai, H. Ji, C. Liu, and Z. Shen, “Framelet-based blind motion deblurring from a single image,” IEEE Trans. Image Process., vol. 21, no. 2, pp. 562–572, 2012. doi: 10.1109/TIP.2011.2164413. [112] M. Xu, M. R. Pickering, A. J. Plaza, and X. Jia, “Thin cloud removal based on signal transmission principles and spectral mi x ture analysis,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 3, pp. 1659–1669, 2016. doi: 10.1109/TGRS.2015. 2486780. [113] Y. Zhang, B. Guindon, and J. Cihlar, “An image transform to characterize and compensate for spatial variations in thin cloud contamination of Landsat images,” Remote Sens. Environ., vol. 82, nos. 2–3, pp. 173–187, 2002. doi: 10.1016/S00344257(02)00034-2. [114] S. Le Hégarat-Mascle and C. André, “Use of Markov random fields for automatic cloud/shadow detection on high resolution optical images,” ISPRS J. Photogram. Remote Sens., vol. 64, no. 4, pp. 351–366, 2009. doi: 10.1016/j.isprsjprs.2008. 12.007. [115] J. Zhang, M. K. Clayton, and P. A. Townsend, “Missing data and regression models for spatial images,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 3, pp. 1574–1582, 2015. doi: 10.1109/ TGRS.2014.2345513. [116] C. Zeng, H. Shen, and L. Zhang, “Recovering missing pixels for Landsat ETM + SLC-off imagery using multi-temporal regression analysis and a regularization method,” Remote Sens. Environ., vol. 131, pp. 182–194, Apr. 2013. doi: 10.1016/j.rse.2012. 12.012. [117] X. Li, H. Shen, L. Zhang, and H. Li, “Sparse-based reconstruction of missing information in remote sensing images from spectral/temporal complementary information,” ISPRS J. Photogram. Remote Sens., vol. 106, pp. 1–15, Aug. 2015. doi: 10.1016/j.isprsjprs.2015.03.009. [118] H. Li, L. Zhang, and H. Shen, “An adaptive nonlocal regularized shadow removal method for aerial remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 1, pp. 106–120, 2014. doi: 10.1109/TGRS.2012.2236562. [119] G. D. Finlayson, S. D. Hordley, and M. S. Drew, “Removing shadows from images using retinex,” in Proc. 10th Color Imag. Conf., Color Sci. Eng. Syst., Technol., Appl., 2002, pp. 73–79. [120] A. Suzuki, A. Shio, H. Arai, and S. Ohtsuka, “Dynamic shadow compensation of aerial images based on color and spatial analysis,” in Proc. 15th Int. Conf. Pattern Recognit. (ICPR’00), 31
Barcelona, Spain, 2000, pp. 1317–1320. doi: 10.1109/ICPR.2000. 905339. [121] H. Song, B. Huang, and K. Zhang, “Shadow detection and re­­ construction in high-resolution satellite images via morphological filtering and example-based learning,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2545–2554, 2014. doi: 10.1109/TGRS.2013.2262722. [122] H. Li, B. S. Manjunath, and S. K. Mitra, “Multi-sensor image fusion using the wavelet transform,” in Proc. 1994 Int. Conf. Image Process., pp. 51–55. doi: 10.1109/ICIP.1994.413273. [123] F. Gao, J. G. Masek, M. R. Schwaller, and F. G. Hall, “On the blending of the Landsat and MODIS surface reflectance: Predicting daily Landsat surface reflectance,” IEEE Trans. Geosci. Remote Sens., vol. 44, no. 8, pp. 2207–2218, 2006. doi: 10.1109/ TGRS.2006.872081. [124] Q. Wei, J. M. Bioucas-Dias, N. Dobigeon, and J. Tourneret, “Hyperspectral and multispectral image fusion based on a sparse representation,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 7, pp. 3658–3668, 2015. doi: 10.1109/TGRS.2014.2381272. [125] L. Zhang and Y. Zhang, “Airport detection and aircraft recognition based on two-layer saliency model in high spatial resolution remote-sensing images,” IEEE J Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 4, pp. 1511–1524, 2017. doi: 10.1109/JSTARS.2016.2620900. [126] Y. Long, Y. Gong, Z. Xiao, and Q. Liu, “Accurate object localization in remote sensing images based on convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2486–2498, 2017. doi: 10.1109/TGRS.2016.2645610. [127] X. Yao, J. Han, L. Guo, S. Bu, and Z. Liu, “A coarse-to-fine model for airport detection from remote sensing images using target-oriented visual saliency and CRF,” Neurocomputing, vol. 164, pp. 162–172, Sept. 2015. doi: 10.1016/j.neucom.2015. 02.073. [128] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, “Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 6, pp. 3325–3337, 2015. doi: 10.1109/TGRS.2014.2374218. [129] Z. Xiao, Y. Gong, Y. Long, D. Li, X. Wang, and H. Liu, “Airport detection based on a multiscale fusion feature for optical remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 9, pp. 1469–1473, 2017. doi: 10.1109/LGRS.2017. 2712638. [130] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004. doi: 10.1023/B:VISI.0000029664.99615.94. [131] B. Sirmaçek and C. Ünsalan, “Urban-area and building detection using SIFT keypoints and graph theory,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 4, pp. 1156–1167, 2009. doi: 10.1109/ TGRS.2008.2008440. [132] Z. Shi, X. Yu, Z. Jiang, and B. Li, “Ship detection in high-resolution optical imagery based on anomaly detector and local shape feature,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 8, pp. 4511–4523, 2014. doi: 10.1109/TGRS.2013.2282355. [133] C. Tao, L. Mi, Y. Li, J. Qi, Y. Xiao, and J. Zhang, “Scene contextdriven vehicle detection in high-resolution aerial images,” 32 IEEE Trans. Geosci. Remote Sens., vol. 57, no. 10, pp. 7339–7351, 2019. doi: 10.1109/TGRS.2019.2912985. [134] H. Sun, X. Sun, H. Wang, Y. Li, and X. Li, “Automatic target detection in high-resolution remote sensing images using spatial sparse coding bag-of-words model,” IEEE Geosci. Remote Sens. Lett., vol. 9, no. 1, pp. 109–113, 2011. doi: 10.1109/ LGRS.2011.2161569. [135] C. 
Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 1–9. doi: 10.1109/CVPR.2015.7298594. [136] S. Zhuang, P. Wang, B. Jiang, G. Wang, and C. Wang, “A single shot framework with multi-scale feature fusion for geospatial object detection,” Remote Sens., vol. 11, no. 5, p. 594, 2019. doi: 10.3390/rs11050594. [137] S. Chen, R. Zhan, and J. Zhang, “Geospatial object detection in remote sensing imagery based on multiscale single-shot detector with activated semantics,” Remote Sens., vol. 10, no. 6, p. 820, 2018. doi: 10.3390/rs10060. [138] W. Li, R. Dong, H. Fu, and L. Yu, “Large-scale oil palm tree detection from high-resolution satellite images using two-stage convolutional neural networks,” Remote Sens., vol. 11, no. 1, p. 11, 2019. doi: 10.3390/rs11010011. [139] W. Guo, W. Yang, H. Zhang, and G. Hua, “Geospatial object detection in high resolution satellite images based on multiscale convolutional neural network,” Remote Sens., vol. 10, no. 1, p. 131, 2018. doi: 10.3390/rs10010131. [140] X. Zhang et al., “Geospatial object detection on high resolution remote sensing imagery based on Double multi-scale feature Pyramid Network,” Remote Sens., vol. 11, no. 7, p. 755, 2019. doi: 10.3390/rs11070755. [141] H. Qiu, H. Li, Q. Wu, F. Meng, K. N. Ngan, and H. Shi, “A2RMNet: Adaptively Aspect Ratio Multi-Scale Network for object detection in remote sensing images,” Remote Sens., vol. 11, no. 13, pp. 1–23, 2019. doi: 10.3390/rs11131594. [142] X. Wu, D. Hong, P. Ghamisi, W. Li, and R. Tao, “MsRi-CCF: Multi-scale and rotation-insensitive convolutional channel features for geospatial object detection,” Remote Sens., vol. 10, no. 12, p. 1990, 2018. doi: 10.3390/rs10121990. [143] D. AL-Alimi, Y. Shao, R. Feng, M. A. Al-Qaness, M. A. Elaziz, and S. Kim, “Multi-scale geospatial object detection based on shallow-deep feature extraction,” Remote Sens., vol. 11, no. 21, 2019. [144] Z. Deng, H. Sun, S. Zhou, J. Zhao, L. Lei, and H. Zou, “Multiscale object detection in remote sensing imagery with convolutional neural networks,” ISPRS J. Photogram. Remote Sens., vol. 145, pp. 3–22, Nov. 2018. doi: 10.1016/j.isprsjprs. 2018.04.003. [145] Z. Li, H. Shen, Q. Cheng, Y. Liu, S. You, and Z. He, “Deep learning based cloud detection for medium and high resolution remote sensing images of different sensors,” ISPRS J. Photogram. Remote Sens., vol. 150, pp. 197–212, Mar. 2019. doi: 10.1016/j. isprsjprs.2019.02.017. [146] C. Dong, J. Liu, F. Xu, and C. Liu, “Ship detection from optical remote sensing images using multi-scale analysis and Fourier HOG descriptor,” Remote Sens., vol. 11, no. 13, p. 1529, 2019. doi: 10.3390/rs11131529. IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
[147] Y. You, Z. Li, B. Ran, J. Cao, S. Lv, and F. Liu, “Broad area target search system for ship detection via deep convolutional neural network,” Remote Sens., vol. 11, no. 17, p. 1965, 2019. doi: 10.3390/rs11171965. [148] W. Kang, Y. Xiang, F. Wang, and H. You, “EU-Net: An efficient fully convolutional network for building extraction from optical remote sensing images,” Remote Sens., vol. 11, no. 23, p. 2813, 2019. doi: 10.3390/rs11232813. [149] N. Mo, L. Yan, R. Zhu, and H. Xie, “Class-specific anchor based and context-guided multi-class object detection in High Resolution Remote Sensing Imagery with a convolutional neural network,” Remote Sens., vol. 11, no. 3, p. 272, 2019. doi: 10.3390/rs11030272. [150] W. Xie, H. Qin, Y. Li, Z. Wang, and J. Lei, “A novel effectively optimized one-stage network for object detection in remote sensing imagery,” Remote Sens., vol. 11, no. 11, p. 1376, 2019. doi: 10.3390/rs11111376. [151] G. Cheng, P. Zhou, and J. Han, “RIFD-CNN: Rotation-invariant and fisher discriminative convolutional neural networks for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 2884–2893. doi: 10.1109/ CVPR.2016.315. [152] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1872–1886, 2013. doi: 10.1109/TPAMI.2012.230. [153] H. He, Y. Lin, F. Chen, H. Tai, and Z. Yin, “Inshore ship detection in remote sensing images via weighted pose voting,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 6, pp. 3091–3107, 2017. doi: 10.1109/TGRS.2017.2658950. [154] G. Cheng, J. Han, P. Zhou, and D. Xu, “Learning rotationinvariant and Fisher discriminative convolutional neural networks for object detection,” IEEE Trans. Image Process., vol. 28, no. 1, pp. 265–278, 2019. doi: 10.1109/TIP.2018. 2867198. [155] Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao, “Oriented response networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 4961–4970. [156] K. Li, G. Cheng, S. Bu, and X. You, “Rotation-insensitive and context-augmented object detection in remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 4, pp. 2337– 2348, 2018. doi: 10.1109/TGRS.2017.2778300] [157] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Proc. Annu. Conf. Neural Inf. Process. Syst., 2015, pp. 2017–2025. [158] C. Chen, W. Gong, Y. Chen, and W. Li, “Object detection in remote sensing images based on a scene-contextual feature pyramid network,” Remote Sens., vol. 11, no. 3, p. 339, 2019. doi: 10.3390/rs11030339. [159] R. B. Girshick, F. N. Iandola, T. Darrell, and J. Malik, “Deformable part models are convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 437– 446. doi: 10.1109/CVPR.2015.7298641. [160] W. Ouyang et al., “DeepID-Net: Deformable deep convolutional neural networks for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 2403–2412. doi: 10.1109/CVPR.2015.7298854. DECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE [161] J. Dai et al., “Deformable convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 764–773. [162] T. Mordan, N. Thome, G. Hénaff, and M. Cord, “End-to-end learning of latent deformable part-based representations for object detection,” Int. J. Comput. Vis., vol. 127, nos. 11–12, pp. 1659–1679, 2019. doi: 10.1007/s11263-018-1109-z. [163] W. Ouyang and X. 
Wang, “Joint deep learning for pedestrian detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2013, pp. 2056–2063. doi: 10.1109/ICCV.2013.257. [164] X. Wang, A. Shrivastava, and A. Gupta, “A-fast-RCNN: Hard positive generation via adversary for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 3039–3048. doi: 10.1109/CVPR.2017.324. [165] S. Zhang, J. Yang, and B. Schiele, “Occluded pedestrian detection through guided attention in CNNs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 6995–7003. doi: 10.1109/CVPR.2018.00731. [166] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), 2005, pp. 886–893. doi: 10.1109/CVPR.2005.177. [167] A. Shrivastava, A. Gupta, and R. B. Girshick, “Training regionbased object detectors with online hard example mining,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 761–769. doi: 10.1109/CVPR.2016.89. [168] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, “Single-shot refinement neural network for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 4203–4212. [169] G. Zhang, S. Lu, and W. Zhang, “CAD-net: A context-aware detection network for objects in remote sensing imagery,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 12, pp. 10,015–10,024, 2019. doi: 10.1109/TGRS.2019.2930982. [170] J. Jin, K. Fu, and C. Zhang, “Traffic sign recognition with hinge loss trained convolutional neural networks,” IEEE Trans. Intell. Transp. Syst., vol. 15, no. 5, pp. 1991–2000, 2014. doi: 10.1109/ TITS.2014.2308281. [171] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 2261– 2269. doi: 10.1109/CVPR.2017.243. [172] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, “Libra R-CNN: Towards balanced learning for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 821–830. doi: 10.1109/CVPR.2019.00091. [173] G. Cheng et al., “Object detection in remote sensing imagery using a discriminatively trained mixture model,” ISPRS J. Photogram. Remote Sens., vol. 85, pp. 32–43, Nov. 2013. doi: 10.1016/j.isprsjprs.2013.08.001. [174] B. Cai, Z. Jiang, H. Zhang, D. Zhao, and Y. Yao, “Airport detection using end-to-end convolutional neural network with hard example mining,” Remote Sens., vol. 9, no. 11, pp. 1–20, 2017. doi: 10.3390/rs9111198. [175] Y. Xu, M. Zhu, S. Li, H. Feng, S. Ma, and J. Che, “End-to-end airport detection in remote sensing images combining cascade region proposal networks and multi-threshold detection networks,” Remote Sens., vol. 10, no. 10, pp. 1–17, 2018. doi: 10.3390/rs10101516. 33
[176] M. Zhu, Y. Xu, S. Ma, S. Li, H. Ma, and Y. Han, “Effective airplane detection in remote sensing images based on multilayer feature fusion and improved nonmaximal suppression algorithm,” Remote Sens., vol. 11, no. 9, p. 1062, 2019. doi: 10.3390/ rs11091062. [177] G. Zhou and Y. Zhang, “Transfer and association: A novel detection method for targets without prior homogeneous samples,” Remote Sens., vol. 11, no. 12, p. 1492, 2019. doi: 10.3390/ rs11121492. [178] Z. Chen, T. Zhang, and C. Ouyang, “End-to-end airplane detection using transfer learning in remote sensing images,” Remote Sens., vol. 10, no. 1, pp. 1–15, 2018. doi: 10.3390/rs10010139. [179] C. Liu, S. Li, F. Chang, and W. Dong, “Supplemental boosting and cascaded ConvNet based transfer learning structure for fast traffic sign detection in unknown application scenes,” Sensors, vol. 18, no. 7, p. 2386, 2018. doi: 10.3390/s18072386. [180] R. Dong, D. Xu, J. Zhao, L. Jiao, and J. An, “Sig-NMS-based RCNN combining transfer learning for small target detection in VHR optical remote sensing imagery,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 11, pp. 8534–8545, 2019. doi: 10.1109/ TGRS.2019.2921396. [181] A. Chan-Hon-Tong and N. Audebert, “Object detection in re­­mote sensing images with center only,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2018, pp. 7054–7057. doi: 10.1109/ IGARSS.2018.8517860. [182] B. Kellenberger, D. Marcos, S. Lobry, and D. Tuia, “Half a percent of labels is enough: Efficient animal detection in UAV imagery using deep CNNs and active learning,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 12, pp. 9524–9533, 2019. doi: 10.1109/TGRS.2019.2927393. [183] R. G. Cinbis, J. J. Verbeek, and C. Schmid, “Weakly supervised object localization with multi-fold multiple instance learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 1, pp. 189–203, 2017. doi: 10.1109/TPAMI.2016.2535231. [184] D. P. Papadopoulos, J. R. R. Uijlings, F. Keller, and V. Ferrari, “We don’t need no bounding-boxes: Training object class detectors using only human verification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 854–863. doi: 10.1109/CVPR.2016.99. [185] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, “Solving the multiple instance problem with axis-parallel rectangles,” Artif. Intell., vol. 89, nos. 1–2, pp. 31–71, 1997. doi: 10.1016/ S0004-3702(96)00034-3. [186] Y. Zhu, Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao, “Soft proposal networks for weakly supervised object localization,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 1859–1868. [187] A. Diba, V. Sharma, A. M. Pazandeh, H. Pirsiavash, and L. V. Gool, “Weakly supervised cascaded convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 5131–5139. [188] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 2921–2929. doi: 10.1109/CVPR.2016.319. [189] H. Bilen and A. Vedaldi, “Weakly supervised deep detection networks,” in Proc. IEEE Conf. Comput. Vis. Pattern 34 Recognit. (CVPR), 2016, pp. 2846–2854. doi: 10.1109/CVPR. 2016.311. [190] L. Bazzani, A. Bergamo, D. Anguelov, and L. Torresani, “Selftaught object localization with deep networks,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), 2016, pp. 1–9. doi: 10.1109/WACV.2016.7477688. [191] Y. Shen, R. Ji, S. Zhang, W. Zuo, and Y. 
Wang, “Generative adversarial learning towards fast weakly supervised detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 5764–5773. doi: 10.1109/CVPR.2018.00604. [192] F. Zhang, B. Du, L. Zhang, and M. Xu, “Weakly supervised learning based on coupled convolutional neural networks for aircraft detection,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 9, pp. 5553–5563, 2016. doi: 10.1109/TGRS.2016. 2569141. [193] L. Cao et al., “Weakly supervised vehicle detection in satellite images via multi-instance discriminative learning,” Pattern Recognit., vol. 64, pp. 417–424, Apr. 2017. doi: 10.1016/j.patcog.2016.10.033. [194] Y. Li, Y. Zhang, X. Huang, and A. L. Yuille, “Deep networks under scene-level supervision for multi-class geospatial object detection from remote sensing images,” ISPRS J. Photogram. Remote Sens., vol. 146, pp. 182–196, Sept. 2018. doi: 10.1016/j. isprsjprs.2018.09.014. [195] Y. Ren, C. Zhu, and S. Xiao, “Small object detection in optical remote sensing images via modified Faster R-CNN,” Appl. Sci., vol. 8, no. 5, p. 813, 2018. doi: 10.3390/app8050813. [196] X. Xiao, Z. Zhou, B. Wang, L. Li, and L. Miao, “Ship detection under complex backgrounds based on accurate rotated anchor boxes from paired semantic segmentation,” Remote Sens., vol. 11, no. 21, pp. 1–18, 2019. doi: 10.3390/rs11212506. [197] Y. Gong et al., “Context-aware convolutional neural network for object detection in VHR remote sensing imagery,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 1, pp. 34–44, 2020. doi: 10.1109/TGRS.2019.2930246. [198] W. Ma, Q. Guo, Y. Wu, W. Zhao, X. Zhang, and L. Jiao, “A novel multi-model decision fusion network for object detection in remote sensing images,” Remote Sens., vol. 11, no. 7, pp. 1–18, 2019. doi: 10.3390/rs11070737. [199] K. Cho et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proc. 2014 Conf. Empir. Methods Nat. Lang. Process. (EMNLP), 1724–1734. doi: 10.3115/v1/D14-1179. [200] S. Bell, C. L. Zitnick, K. Bala, and R. B. Girshick, “Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks,” in Proc. 2016 IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 2874–2883. doi: 10.1109/ CVPR.2016.314. [201] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” in Proc. 5th Int. Conf. Learn. Represent. (ICLR), 2017. [202] Y. Feng, W. Diao, X. Sun, M. Yan, and X. Gao, “Towards automated ship detection and category recognition from highresolution aerial images,” Remote Sens., vol. 11, no. 16, pp. 1–23, 2019. GRS IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
Hyperspectral Image Clustering
Current achievements and future lines
HAN ZHAI, HONGYAN ZHANG, PINGXIANG LI, AND LIANGPEI ZHANG

Hyperspectral remote sensing organically combines traditional space imaging with advanced spectral measurement technologies, delivering advantages stemming from continuous spectrum data and rich spatial information. This development of hyperspectral technology takes remote sensing into a brand-new phase, making the technology widely applicable in various fields. Hyperspectral clustering analysis is widely utilized in hyperspectral image (HSI) interpretation and information extraction, which can reveal the natural partition pattern of pixels in an unsupervised way. In this article, current hyperspectral clustering algorithms are systematically reviewed and summarized in nine main categories: centroid-based, density-based, probability-based, bionics-based, intelligent computing-based, graph-based, subspace clustering, deep learning-based, and hybrid mechanism-based. The performance of several popular hyperspectral clustering methods is demonstrated on two widely used data sets. HSI clustering challenges and possible future research lines are identified.

Digital Object Identifier 10.1109/MGRS.2020.3032575
Date of current version: 19 January 2021

THE NECESSITY FOR HSI CLUSTERING
Hyperspectral sensors can image an area of interest at a nanometer spectral resolution and collect rich spectral information to capture subtle differences among various ground objects [1]–[3]. An HSI has a 3D cube structure, containing tens and up to hundreds of bands, as shown in Figure 1, to support the fine recognition of ground objects [4]–[9]. This is good news in numerous applications, such as mineral exploration [10], [11], vegetation monitoring [12], [13], the quantitative inversion of physical and biological parameters [14], [15], military reconnaissance [16], [17], and so forth. However, with such high-dimensional data, the interpretation of HSIs commonly relies on a great quantity of high-quality labeled samples to avoid the Hughes phenomenon caused by having insufficient training examples and the underfitting problem that results from the inadequate training of the classifiers [18]–[20]. Unfortunately, in practice, sample collection is commonly time consuming, labor intensive, expensive, and inefficient, and, in some remote and uninhabited areas, training samples can be unavailable, which greatly limits the applications of hyperspectral remote sensing. Therefore, it is necessary to develop unsupervised ground object recognition
theory and methods to overcome the restrictions related to labeled samples and prior knowledge.
Clustering is an effective unsupervised pattern recognition and information extraction technique, and it is a common means for HSI interpretation [21]–[25]. Hyperspectral clustering groups similar pixels and separates dissimilar pixels, with each assemblage corresponding to a certain class, by fully mining the structural properties of hyperspectral data according to a similarity criterion, such as distance [26], [27], correlation [28], spectral angle [29], and pair-wise pixel metrics [30]. Because no labeled samples are required, clustering seems more attractive in many applications, in contrast to supervised classification. Especially when there is no labeled sample, clustering can be an effective approach for ground object recognition, improving the application potential of hyperspectral remote sensing to a large degree.
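For concreteness, the spectral angle criterion cited above [29] measures the similarity of two pixel spectra as the angle between them, which is insensitive to overall brightness differences. A minimal NumPy sketch follows (the helper name is ours, not from the article); smaller angles indicate more similar spectral shapes.

```python
import numpy as np

def spectral_angle(x, y, eps=1e-12):
    """Spectral angle (radians) between two pixel spectra of equal length."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)
    # Clip guards against arccos domain errors from floating-point round-off.
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

# A scaled copy of a spectrum has (near) zero angle to the original.
s = np.array([0.12, 0.30, 0.45, 0.40])
print(spectral_angle(s, 2.5 * s))  # ~0.0
```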
HSIs have a much more complex internal structure than handwritten figures, text, natural pictures, and multispectral images. In addition, there is a large spectral variability in HSIs, as pixels from the same class have different spectra, given the complexity of the imaging environment. Generally, in the high-dimensional feature space, the distribution of pixels is relatively sparse and uniform, with no clear rules to follow. Accordingly, hyperspectral clustering is commonly a more challenging task.

FIGURE 1. The 3D cube structure of an HSI.

Hyperspectral clustering has experienced decades of development, and a great quantity of methods has been put forward. However, to the best of our knowledge, very few studies have systematically and comprehensively reviewed the current research status of hyperspectral clustering. Therefore, in this article, we fill this gap and investigate the current hyperspectral clustering methods in the literature to provide a detailed summary and analysis of various clustering methods, and we discuss challenges and possible future directions.

REVIEW OF CURRENT HYPERSPECTRAL CLUSTERING METHODS
Hyperspectral clustering generally includes two major tasks, i.e., estimating the number of clusters and constructing the proper clustering model. However, studies of the first task are relatively few in the hyperspectral clustering field. In [31]–[33], the number of clusters is automatically estimated by evolution algorithms and by using statistical histograms. However, these methods are generally bound to specific clustering models, such as the fuzzy c-means (FCM) model [34], and are not universally applicable. In addition, many density-based models can automatically estimate the number of clusters [35]–[37]. However, due to the inherent defects of density-based clustering, such techniques are generally less effective when applied to HSIs, which will be discussed in a later section. In some studies [38], [39], the optimal number of clusters is determined by a series of experiments. However, this strategy is time consuming and not practical in many use instances. In many cases, the number of clusters is regarded as a manually input parameter [21], [22], [40]–[44]. This number can be determined by visually interpreting the original HSIs [21], [41], which is simple and convenient but subjective and not fully automated. More often, in practice, this quantity is set as the number of classes in the ground truth [22], [42]–[44]. Generally speaking, cluster number estimation has always been an important topic in hyperspectral clustering research, while clustering model construction is the core of hyperspectral clustering, whose reasonability and effectiveness have a direct influence on the final clustering accuracy. Thus, the clustering methodology/model has always been a focus in the HSI processing field, and most of the existing work concentrates on clustering methodologies. In this article, we also focus on clustering models and methods.
On the basis of the principle and the working mechanism, the current hyperspectral clustering methods can be classified into nine main types: 1) centroid-based methods, 2) density-based methods, 3) probability-based methods, 4) bionics-based methods, 5) intelligent computing-based methods, 6) graph-based methods, 7) subspace clustering methods, 8) deep learning-based methods, and 9) hybrid mechanism-based methods. In practice, an HSI can be expressed as a 2D data matrix, i.e., $Y = [Y_1, Y_2, \ldots, Y_{MN}] \in \mathbb{R}^{D \times MN}$, with each column denoting a pixel, where $D$ and $MN$ represent the number of bands and the number of pixels, respectively. For hyperspectral clustering, in the case of $c$ different classes, the core task is to partition the pixels into $c$ different groups based on a certain clustering model, with each group corresponding to a certain class. Different methods deal with the internal structure and the complexity of HSIs with various model assumptions, which determines their clustering effect to a large degree. A taxonomy of the hyperspectral clustering methods considered in this article appears in Table 1.
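To make the $Y \in \mathbb{R}^{D \times MN}$ formulation concrete, the sketch below flattens a hyperspectral cube into that pixel matrix and partitions it into c groups with plain k-means as a stand-in for the more elaborate models reviewed next. The cube here is random and purely for shape illustration, and scikit-learn is assumed to be available; this is not code from the surveyed methods.

```python
import numpy as np
from sklearn.cluster import KMeans

M, N, D, c = 100, 100, 50, 5                 # rows, columns, bands, clusters
cube = np.random.rand(M, N, D)               # stand-in for a real HSI cube

# Flatten to the D x MN matrix Y, one column per pixel.
Y = cube.reshape(M * N, D).T                 # shape (D, MN)

# k-means expects samples as rows, so cluster Y.T and reshape labels to a map.
labels = KMeans(n_clusters=c, n_init=10, random_state=0).fit_predict(Y.T)
cluster_map = labels.reshape(M, N)           # unsupervised "classification" map
print(Y.shape, cluster_map.shape)            # (50, 10000) (100, 100)
```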
For hyperspectral clustering, in the case of c different classes, the core task is to partition the pixels into c different groups based on a certain clustering model, with each group corresponding to a certain class. Different methods deal with the internal structure and the complexity of HSIs under various model assumptions, which determines their clustering effect to a large degree. A taxonomy of the hyperspectral clustering methods considered in this article appears in Table 1.

TABLE 1. THE TAXONOMY OF HYPERSPECTRAL CLUSTERING METHODS.

Centroid — Mechanism: assumes the cluster has a ball-like structure in the feature space; clusters HSIs by iteratively minimizing the overall partition error. Subcategories and representative methods: hard partition (k-means [47], ISODATA [49], NC-k-means [52]); soft partition (FCM [45], FCM-S1 [64], FLDNICM [69]).

Density — Mechanism: assumes clusters are dense point sets separated by sparse areas in the feature space; clusters HSIs based on the local density and relative distances of pixels. Representative methods: CFSFDP [71], DAE [72], SSDL [77].

Probability — Mechanism: assumes pixels from the same class satisfy a probability distribution model; clusters HSIs based on a probability rule. Representative methods: GMM [79], ICAMM [80], CLDD [86].

Bionics — Mechanism: simulates the complex internal structure of HSIs with a certain biological model; clusters HSIs through a biological evolution algorithm. Representative methods: SOM [88], UAIC [42], UADSM [39].

Intelligent computing — Mechanism: based on other clustering models; utilizes advanced intelligent computing algorithms to search for the global optimal solution to the clustering model. Subcategories and representative methods: single objective (FCIDE [92], MoDEFC [31], PSO-GMM [93]); multiple objective (AFCMDE [32], AFCMOMA [94], MOPSO [38]).

Graph — Mechanism: models the similarity among pixels with an adjacency matrix; clusters HSIs with a graph cut algorithm. Subcategories and representative methods: complete graph (SC [105], SENP [106], NLTV [107]); bipartite graph (SSCC-BG [115], S-SC [116], BGP-CJS [117]); abbreviated graph (FSCAG [43], SGCNR [121]).

Subspace clustering — Mechanism: models the internal complex structure of HSIs via the union of subspaces; explores the underlying adjacency between pixels through self-representation learning; groups HSIs by applying spectral clustering (SC) to the adjacency matrix induced by the coefficient matrix. Subcategories and representative methods: spectral–spatial subspace clustering (S⁴C [40], L2-SSC [41], SSC-3DEPF [128]); multiple-view subspace clustering (SSMLC [139], k-SSMLC [140], p-SSMLC [141]); kernel subspace clustering (KSSC-SMP [142], KSLRSC [143]).

Deep learning — Mechanism: relies on deep neural networks to learn more discriminative features for clustering and more accurately simulate the nonlinearity of data. Subcategories and representative methods: autoencoder (DCN [147], DMC [148], DSCNet [151]); separated network (CCNN [155], DBNC [156], JSL [159]); generative network (CatGAN [162], DAGMC [164], VaDE [166]).

Hybrid mechanism — Mechanism: deals with the clustering task by combining two or more clustering models. Representative methods: k-GMM [168], k-FDPC [169], SDCR [174].

Abbreviations: ISODATA: iterative self-organizing data analysis technique algorithm; NC-k-means: neighborhood-constrained k-means; FCM-S1: FCM with mean filtered spatial information; FLDNICM: fuzzy local double neighborhood information c-means; CFSFDP: clustering by the fast searching and finding of density peaks; DAE: density analysis ensemble; SSDL: spectral–spatial (SS) diffusion learning; GMM: Gaussian mixture model (MM); ICAMM: independent component analysis MM;
CLDD: clustering based on the latent Dirichlet distribution; SOM: self-organizing map; UAIC: unsupervised artificial immune classifier; UADSM: unsupervised spectral matching classifier based on artificial deoxyribonucleic acid (DNA) computing; FCIDE: fuzzy clustering (FC) using improved differential evolution (DE); MoDEFC: modified DE FC; PSO-GMM: particle swarm optimization-based GMM; AFCMDE: automatic FC based on multiple-objective DE; AFCMOMA: adaptive multiple-objective memetic FC algorithm; MOPSO: multiple-objective PSO; SC: spectral clustering; SENP: Schroedinger eigenmap with nondiagonal potentials; NLTV: graph-based nonlocal total variation; SSCC-BG: SS coclustering based on a bipartite graph (BG); S-SC: sequential SC; BGP-CJS: BG partition-based coclustering with joint sparsity; FSCAG: fast SC with anchor graph; SGCNR: scalable graph-based clustering with nonnegative relaxation; S⁴C: SS sparse subspace clustering; L2-SSC: ℓ₂-norm regularized sparse subspace clustering; SSC-3DEPF: SSC based on 3D edge-preserving filtering; SSMLC: SS-based multiple-view low-rank SSC; p-SSMLC: parallel SSMLC; DCN: deep clustering network; DMC: deep multiple-manifold clustering; DSCNet: deep SSC based on an autoencoder network; CCNN: clustering based on a convolutional neural network; DBNC: deep belief network nonparametric clustering; JSL: joint unsupervised learning; CatGAN: categorical generative adversarial network; DAGMC: deep adversarial Gaussian mixture autoencoder clustering; VaDE: variational deep embedding; k-GMM: hybridization of the k-means and the GMM; k-FDPC: hybridization of the k-means and fast finding of density peaks clustering; SDCR: sparse dictionary-based anchor regression.
CENTROID-BASED CLUSTERING METHODS

Centroid-based methods are the most classical and representative clustering approach, and they were also the earliest to be introduced to HSI analysis [45], [46]. Such techniques are based on the assumption that a cluster has a "ball-like" structure in the feature space. Starting with random initializations, such methods iteratively update the centroids and their associated pixel partitions until the overall partition error meets the tolerance requirement or the number of iterations reaches the predefined maximum value, as illustrated in Figure 2. The partition error is generally defined as the sum of squared distances between the assigned pixels and the corresponding centroids across all classes. Centroid-based clustering methods mainly include two types, i.e., hard partition clustering and soft partition clustering, based on whether a pixel can belong to multiple classes or not.

FIGURE 2. The centroid-based clustering mechanism. (a) The original pixel points. (b) The initialization (randomly selecting the centroids). (c) The pixel assignment. (d) Updating the centroids. (e) Updating the pixel assignment. (f) The clustering result.

HARD PARTITION-BASED CLUSTERING

Hard partition-based methods allow each pixel to belong to only one class and assign each pixel to the nearest cluster. A typical example is the k-means [47], commonly considered the originator of clustering analysis and one of the earliest clustering methods applied to HSIs [46]. The principle of the k-means is simple: it segments HSIs by minimizing the partition error across all c classes, as in (1):

$\min \sum_{i=1}^{c} \sum_{j=1}^{n_i} \| Y_j - \mu_i \|_2^2$,  (1a)

$\| Y_j - \mu_{i^*} \|_2^2 \le \| Y_j - \mu_i \|_2^2, \quad i \in \{1, \ldots, c\}, \qquad \mu_i = \frac{1}{n_i} \sum_{l=1}^{n_i} Y_l$,  (1b)

where μ_i denotes the centroid of the ith cluster and n_i represents the number of pixels in the ith cluster. Specifically, the k-means starts with randomly selected centroids and then iteratively updates the cluster centroids, with each pixel Y_j assigned to the nearest cluster centroid μ_{i*} based on the distance metric, according to (1b) [48], until the cluster centroids do not change or the total partition error in (1a) does not significantly vary.
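To make the iteration in (1) concrete, the following is a minimal numpy sketch of hard-partition clustering on the flattened pixel matrix; the function name, random initialization, and stopping tolerance are illustrative assumptions rather than the exact procedure of any referenced method.

```python
# A minimal sketch of hard-partition (k-means-style) clustering as in (1).
import numpy as np

def kmeans_hsi(Y, c, n_iter=100, tol=1e-6, seed=0):
    """Y: D x MN pixel matrix; returns per-pixel labels and the c centroids."""
    rng = np.random.default_rng(seed)
    D, MN = Y.shape
    centroids = Y[:, rng.choice(MN, size=c, replace=False)].astype(float)
    prev_err = np.inf
    for _ in range(n_iter):
        # squared distance of every pixel to every centroid: shape (c, MN)
        dist = ((Y[:, None, :] - centroids[:, :, None]) ** 2).sum(axis=0)
        labels = dist.argmin(axis=0)                     # assignment rule (1b)
        err = dist[labels, np.arange(MN)].sum()          # overall partition error (1a)
        for i in range(c):                               # centroid update in (1b)
            if np.any(labels == i):
                centroids[:, i] = Y[:, labels == i].mean(axis=1)
        if abs(prev_err - err) < tol:                    # stop when the error stabilizes
            break
        prev_err = err
    return labels, centroids
```

On a real HSI, Y would be the flattened matrix from the earlier snippet, possibly after band selection or dimension reduction.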
Based on the k-means, numerous improved methods were developed. For example, the iterative self-organizing data analysis technique algorithm (ISODATA) was proposed to improve the clustering effect by integrating a dynamic cluster adjustment mechanism into the clustering process, and it was successfully applied to HSIs [49]. In [50], a distributed k-means clustering method was developed for HSIs to further improve efficiency and practicability by employing parallel computing techniques. In [51], a kernel k-means was used for HSI feature extraction, which conducts clustering in the much-higher-dimensional kernel space to relieve the nonlinearity of HSIs. In addition, a neighborhood-constrained k-means (NC-k-means) approach was put forward, inspired by the clearly evident spatial correlation among neighboring pixels [52]. With a pure neighborhood index integrated into (1), the spatial information of HSIs is incorporated to help with the spectral analysis, and a much better clustering result is obtained. Furthermore, a two-stage k-means clustering technique combined with a neighboring union histogram (k-NUH) was developed, integrating the spatial information by the NUH [53]. It divides HSIs into several uncorrelated groups and computes the NUH of each collection based on the first few principal components. Then, it employs a two-stage k-means model to cluster HSIs from rough to fine. Moreover, an improved k-means (I-means) algorithm was proposed for HSI mineral mapping. It takes the spectral information divergence as the similarity measurement and initializes the centroids via three different strategies [54].

SOFT PARTITION-BASED CLUSTERING

Differing from hard partition-based approaches, soft partition-based methods consider the uncertainty of the pixel partitioning during the clustering process, allowing each pixel to belong to multiple classes, which may be more suitable for HSIs, due to the mixed pixel problem. Such techniques assign a fuzzy membership to each pixel in the range of [0, 1], with the sum of the memberships across all c classes being equal to one. The most representative soft partition-based clustering method is the FCM model [34], [45], which can be formulated as in (2):

$\min \sum_{i=1}^{c} \sum_{j=1}^{MN} U_{i,j}^{m} \| Y_j - \mu_i \|_2^2$,  (2a)

$\mu_i = \frac{\sum_{j=1}^{MN} U_{i,j}^{m} Y_j}{\sum_{j=1}^{MN} U_{i,j}^{m}}, \quad U_{i,j} = \frac{\| Y_j - \mu_i \|^{-2/(m-1)}}{\sum_{l=1}^{c} \| Y_j - \mu_l \|^{-2/(m-1)}}$,  (2b)

where U denotes the fuzzy membership matrix, with each element U_{i,j} standing for the fuzzy membership of the jth pixel belonging to the ith centroid; μ_i represents the ith centroid, which can be updated according to (2b); and m is the fuzzy exponent. Based on the FCM, many enhanced methods were successively proposed. In [55] and [56], two weighted FCM models were developed, i.e., fuzzy weighted c-means (FWCM) and new weighted FCM (NW-FCM). These two approaches weight the similarity between neighboring pixels and the center pixel, which effectively improves the clustering performance. In [57], an uncertainty analysis-based FCM (UAFCM) algorithm was introduced. It detects pixels that have a large uncertainty through an entropy and squared-error-based criterion and reclassifies those pixels to refine the clustering results. In addition, to address the nonlinearity of HSIs, a kernel FCM was used in HSI semisupervised classification [58]. To overcome the sensitivity of the FCM to initialization, an improved FCM algorithm based on the support vector domain description (SVDD) was proposed for HSIs [59]. It estimates the cluster centroids based on the SVDD to reduce the influence of noise and outliers on the centroids. Furthermore, in [33], an automatic histogram-based FCM (AHFCM) algorithm was developed. It obtains the initializations and the number of clusters for the FCM through two steps, clustering each band by calculating the slopes in the histogram and automatically fusing the labeled images. However, these techniques take only the spectral information into account, so they are susceptible to noise and singular points, and the spatial homogeneity of the clustering result is difficult to guarantee.
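The alternating updates in (2) can be sketched in a few lines of numpy; as with the previous snippet, the helper name, random initialization, and tolerance are illustrative assumptions rather than the formulation of any particular reference.

```python
# A compact sketch of the FCM centroid and membership updates in (2).
import numpy as np

def fcm_hsi(Y, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Y: D x MN pixel matrix; m: fuzzy exponent; returns hard labels, U, and centroids."""
    rng = np.random.default_rng(seed)
    D, MN = Y.shape
    U = rng.random((c, MN))
    U /= U.sum(axis=0, keepdims=True)             # memberships sum to one per pixel
    for _ in range(n_iter):
        Um = U ** m
        centroids = (Y @ Um.T) / Um.sum(axis=1)   # centroid update in (2b)
        dist = ((Y[:, None, :] - centroids[:, :, None]) ** 2).sum(axis=0)
        dist = np.maximum(dist, 1e-12)            # avoid division by zero
        U_new = dist ** (-1.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0, keepdims=True) # membership update in (2b)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U.argmax(axis=0), U, centroids
```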
To overcome these obstacles, a large number of enhanced FCM models that incorporate spatial information were developed. A representative example can be found in the spatial model for fuzzy clustering (SMFC) [60], with the formulation shown in (3):

$\min \sum_{i=1}^{c} \sum_{j=1}^{MN} U_{i,j}^{m} \| Y_j - \mu_i \|_2^2 + \frac{\beta}{2} \sum_{i=1}^{c} \sum_{j=1}^{MN} U_{i,j}^{m} \sum_{p \in M_i} \sum_{q \in N_j} U_{p,q}^{m}$,  (3a)

$\mu_i = \frac{\sum_{j=1}^{MN} U_{i,j}^{m} Y_j}{\sum_{j=1}^{MN} U_{i,j}^{m}}, \quad U_{i,j} = \frac{\left( \| Y_j - \mu_i \|_2^2 + \beta \sum_{p \in M_i} \sum_{q \in N_j} U_{p,q}^{m} \right)^{-1/(m-1)}}{\sum_{l=1}^{c} \left( \| Y_j - \mu_l \|_2^2 + \beta \sum_{p \in M_i} \sum_{q \in N_j} U_{p,q}^{m} \right)^{-1/(m-1)}}$,  (3b)

where N_j represents the neighbors of pixel j, M_i = {1, 2, …, c}\{i}, and β is a tradeoff parameter. By adding a spatial penalty term to (3), the spatial neighborhood information is integrated to smooth the membership matrix, which leads to a more accurate result. In addition, in [61], a conditional FCM (C-FCM) algorithm was proposed. It simultaneously makes use of spectral–spatial information via the generalized multiplication of the spatial information and the spectral information. These methods have been successfully applied to HSIs [62]. To better utilize the spatial information, a neighborhood constraint clustering (NCC) algorithm was put forward [62]. It exploits the local spatial information via a neighborhood homogeneity index and obtains smoother clustering results with a higher accuracy for HSIs. In addition, through adding a spatial constraint term to (2), an FCM with spatial information (FCM-S) algorithm was proposed [63]. It explores the spatial neighborhood information through a local window that is opened for each target pixel and obtains much better performance compared to the FCM. However, the FCM-S is computationally complex. To tackle this problem, two improved versions were developed, i.e., the FCM-S1 and FCM-S2, which, respectively, employ the mean filtered result and the median filtered result to simplify the spatial information calculation [64]. These techniques were then successfully applied to HSIs [65].
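As a loose illustration of how such spatial terms enter the membership update, the sketch below augments each pixel's spectral distance with the distance of its mean-filtered neighborhood, in the spirit of FCM-S1 [64]; the hypercube layout, the 3 × 3 window, and the weight β are assumptions made only for illustration.

```python
# Illustrative, FCM-S1-style spatially regularized membership update: noisy pixels are
# pulled toward the label suggested by their local neighborhood.
import numpy as np
from scipy.ndimage import uniform_filter

def spatial_fcm_step(Y_img, centroids, m=2.0, beta=0.5):
    """Y_img: M x N x D hypercube; centroids: D x c matrix; returns a c x MN membership matrix."""
    Y_img = np.asarray(Y_img, dtype=float)
    M, N, D = Y_img.shape
    Y = Y_img.reshape(-1, D).T                                        # D x MN pixel matrix
    Y_bar = uniform_filter(Y_img, size=(3, 3, 1)).reshape(-1, D).T    # 3x3 mean-filtered copy
    d_spec = ((Y[:, None, :] - centroids[:, :, None]) ** 2).sum(axis=0)
    d_spat = ((Y_bar[:, None, :] - centroids[:, :, None]) ** 2).sum(axis=0)
    dist = np.maximum(d_spec + beta * d_spat, 1e-12)                  # spectral + spatial penalty
    U = dist ** (-1.0 / (m - 1.0))
    return U / U.sum(axis=0, keepdims=True)                           # one membership update
```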
However, the spatial regularization parameters are difficult to determine, and the global information is poorly utilized. To overcome the drawbacks of these models, an adaptive memetic fuzzy clustering algorithm with spatial information (AMASFC) was proposed [65]. Through adaptively determining the spatial regularization parameters based on the information entropy and by simultaneously exploring the local information and the global information via the memetic algorithm, the clustering accuracy is further improved. Furthermore, a fuzzy approach with the spatial membership relations (FASMR) algorithm was proposed [66]. It incorporates the spatial information via a Gaussian filter and explores the membership relations among pixels in a local neighborhood. Moreover, by defining a fuzzy factor to integrate the spectral information and the local spatial information and to avoid parameter determination, a new fuzzy local information c-means clustering model (FLICM) was developed [67]. It was then applied to HSIs [68]. However, the FLICM has drawbacks, such as fuzzy edges and poor maintenance of spatial details. Faced with these obstacles, an adaptive FLICM (ADFLICM) algorithm was put forward [68]. It constructs a pixel spatial attraction model to adaptively measure the effects of neighboring pixels through weighting, which better recognizes the boundaries among different classes and maintains the details. Then, by flexibly exploiting the local spatial information and the spectral information, an improved version, the fuzzy local double neighborhood information c-means clustering (FLDNICM) algorithm, was introduced [69]. A fuzzy prior probability function is constructed based on the mutually dependent information between neighboring pixels and the center pixel to accurately model the spatial contextual information of HSIs. In this way, the clustering accuracy is further improved.

Generally speaking, due to their simplicity and efficiency, centroid-based methods are very popular in many practical applications. However, centroid-based methods, in essence, belong to the "mountain-climbing" algorithms, which easily sink into local optimal solutions [65], [70]. What is worse, the "ball-like" structure assumption generally cannot be satisfied by HSIs, due to their complex internal structure and large spectral variability, which limits the approaches' clustering performance to a large degree.

DENSITY-BASED CLUSTERING METHODS

Density-based clustering methods partition pixels according to density criteria, under the basic assumption that clusters are generally dense point sets separated by sparse areas in the feature space. Such methods cluster HSIs based on the local density and the relative distances of pixels, as detailed in Figure 3. A typical example is the clustering by the fast searching and finding of density peaks (CFSFDP) algorithm [71]. It assumes that cluster centroids are surrounded by pixel points that have a lower density and are relatively far from pixel points with a higher density, computing two quantities for each pixel, i.e., the local density ρ_i and the relative distance δ_i, as in (4), to search for the optimal centroids:

$\rho_i = \sum_{j} \chi(d_{ij} - d_c), \quad \chi(x) = \begin{cases} 1, & \text{if } x < 0 \\ 0, & \text{otherwise} \end{cases}$,  (4a)

$\delta_i = \begin{cases} \min_{j:\, \rho_j > \rho_i} (d_{ij}) \\ \max_{j} (d_{ij}), & \text{if } \rho_i = \max(\rho) \end{cases}$,  (4b)
FIGURE 3. The density-based clustering mechanism. (a) The HSI. (b) The density assumption for the clusters. (c) Searching the cluster centroids based on the local density and the relative distances. (d) The clustering result.
where d_ij denotes the distance between pixels i and j, and d_c represents the cutoff distance. Cluster centroids can be found by constructing a decision graph, i.e., a δ–ρ graph, or determined by the measurement γ_i = ρ_i × δ_i. Pixels with significantly large δ and relatively large ρ values in the decision graph, or with a significantly large γ, are considered to be centroids. Then, by assigning each pixel to the nearest cluster centroid, the final clustering result can be obtained. To further improve the efficiency and accuracy of CFSFDP, an enhanced version, i.e., density analysis ensemble (DAE) clustering, was developed for HSIs [72]. The DAE uses a random subspace ensemble to establish a series of clustering systems, with each individual system corresponding to a density analysis. Subsequently, the final clustering result is obtained by majority voting. Another representative method is the density-based spatial clustering of applications with noise (DBSCAN) [73]. The core idea of DBSCAN is to find pixels that have a higher density and connect them to generate clusters. The approach was utilized for HSI band selection and obtained good results [74]. In addition, the mean shift (MS) is also a typical density-based model, based on the rule of density gradient ascent [75]. In [76], an adaptive MS algorithm was put forward by integrating nonnegative matrix factorization (NMF) and bandwidth selection, which better segments HSIs. In addition, in recent years, a series of nearest-neighbor density-based clustering methods were developed for HSIs. For example, the k-nearest-neighbor density-based clustering (KNNCLUST) method was proposed by extending the k-nearest-neighbor (KNN) model to an iterative procedure to automatically estimate the number of clusters [35]. Each pixel is assigned based on its KNNs and the distances to those neighbors by using the Bayes decision rule. Then, KNNCLUST was applied to HSIs, and a stochastic extended version, i.e., the kernel stochastic expectation maximization (KSEM), was developed for HSIs [36]. The KSEM employs KNNs to estimate the contextual class conditional distribution, which it iteratively updates with the posterior probability to account for the current clustering result. Then, the KSEM defines the stopping criterion based on the clustering entropy to make the conditional distribution converge to a stationary clustering result. As a result, the KSEM outperforms KNNCLUST. Moreover, a graph watershed clustering based on nearest neighbors (GWNN) algorithm was introduced for HSIs to alleviate the quadratic complexity of KNN estimation [37]. GWNN utilizes a labeling rule similar to KNNCLUST to account for the local density values and introduces a coarse-to-fine multiresolution scheme, instead of a full KNN graph computation with all pixels. Consequently, GWNN effectively enhances the efficiency of the model and obtains a high clustering accuracy. Furthermore, in [77], an unsupervised spectral–spatial diffusion learning (SSDL)-based clustering algorithm was proposed for HSIs. SSDL takes advantage of geometrical estimation and diffusion-inspired labeling to excavate the spectral–spatial duality of HSIs, based on the diffusion distances. SSDL includes two main steps, i.e., finding the cluster modes through density estimation and geometric analysis and assigning pixels to the corresponding modes based on the spectral–spatial proximity.
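For readers who want to experiment with the density-peak quantities in (4), the following rough numpy/scipy sketch computes ρ, δ, and γ on a small pixel subset and then uses a simplified nearest-center assignment; the subsampling, the cutoff distance, and the assignment shortcut are illustrative simplifications rather than the full CFSFDP procedure.

```python
# A rough sketch of the density-peak rule in (4) on a subsampled pixel set.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def density_peaks(Y_sub, d_c, n_clusters):
    """Y_sub: D x n matrix of (sub-sampled) pixels; d_c: cutoff distance."""
    d = squareform(pdist(Y_sub.T))                 # pairwise distances d_ij
    rho = (d < d_c).sum(axis=1) - 1                # local density rho_i, cf. (4a)
    delta = np.empty_like(rho, dtype=float)
    for i in range(len(rho)):
        higher = np.where(rho > rho[i])[0]         # pixels with higher density
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()   # cf. (4b)
    gamma = rho * delta                            # decision score gamma_i = rho_i * delta_i
    centers = np.argsort(gamma)[::-1][:n_clusters] # pixels with the largest gamma
    # simplified assignment: each pixel goes to its nearest selected center
    labels = centers[np.argmin(d[:, centers], axis=1)]
    return labels, centers
```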
In addition, based on SSDL, an enhanced spectral–spatial diffusion geometry (SSDG)-based clustering method was developed [78]. SSDG introduces the spatially regularized random walk strategy to the diffusion construction, regularizes neighboring pixels by Markov diffusion, searches cluster modes via kernel density estimation and the diffusion distance, and assigns pixels based on the selected modes. As a result, SSDG further improves the clustering accuracy.

In short, density-based methods are relatively robust to noise and the shapes of clusters. In addition, many density-based methods can automatically estimate the number of clusters. However, the relatively sparse and uniform distribution of the high-dimensional feature space of HSIs means that the assumption of density-based clustering methods is not fully satisfied, which degrades the clustering effect to a large degree.

PROBABILITY-BASED CLUSTERING METHODS

Probability-based clustering methods partition pixels based on certain likelihood criteria. Such methods assume that pixels from the same class generally obey a certain probability distribution, with each cluster modeled by a multivariate conditional distribution with specific parameters and the HSIs modeled by the joint probability distribution, as in Figure 4. Then, the final clustering result can be obtained by maximizing the likelihood function based on a certain probability rule, such as expectation maximization (EM), the maximum posterior probability, or the Bayesian rule. A representative probability-based clustering method is the Gaussian mixture model (GMM) [79]. The GMM is based on the assumption that hyperspectral pixels generally satisfy the Gaussian distribution, and it models each cluster with a multivariate Gaussian conditional distribution, as in (5):

$p(Y_i) = \sum_{j=1}^{c} P_j\, g(Y_i \mid m_j, C_j)$,  (5a)

$p(Y) = \prod_{i=1}^{MN} p(Y_i)$,  (5b)

where g is a Gaussian probability density function (pdf) and P_j is the prior probability of the jth cluster, with m_j and C_j denoting the mean vector and the covariance matrix of the jth cluster. Then, according to a certain probability rule, such as EM, the GMM partitions pixels into c different clusters to obtain the final clustering result.
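A hedged, minimal way to try GMM-based clustering as in (5) is to hand the flattened pixel matrix to an off-the-shelf EM implementation; the snippet below uses scikit-learn's GaussianMixture, and the optional PCA step and all parameter values are illustrative assumptions rather than part of any reviewed method.

```python
# Illustrative GMM clustering of HSI pixels via EM, cf. (5).
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def gmm_cluster(Y, c, n_pc=20, seed=0):
    """Y: D x MN pixel matrix; returns one cluster label per pixel."""
    X = Y.T                                                       # samples in rows for scikit-learn
    X = PCA(n_components=min(n_pc, X.shape[1])).fit_transform(X)  # optional DR step for large D
    gmm = GaussianMixture(n_components=c, covariance_type="full", random_state=seed)
    return gmm.fit_predict(X)                                     # EM fit, then maximum-posterior labels
```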
Considering that hyperspectral pixels commonly do not strictly obey the Gaussian distribution, an independent component analysis mixture model (ICAMM) was constructed for HSIs [80], [81]. The ICAMM represents each cluster as a non-Gaussian distribution, as in (6):

$p(Y_i \mid \Theta) = \sum_{j=1}^{c} p(Y_i \mid \omega_j, \theta_j)\, P(\omega_j)$,  (6a)

$p(Y \mid \Theta) = \prod_{i=1}^{MN} p(Y_i \mid \Theta)$,  (6b)

where Θ = [θ_1, θ_2, …, θ_c] is the class parameter set and P(ω_j) is the prior probability of the jth class ω_j. Then, the independent components and the mixing matrix of each class are estimated based on the modified information maximization model, and the membership probability of each pixel belonging to the various classes is computed. Based on the maximum membership probability rule, the pixel partition result can be obtained from the ICA model. A weighted principal component analysis ICA (WPCA-ICA) method was developed to extract the independent features based on second- and higher-order statistics, which performs better for HSIs [82]. Furthermore, in [83], a nonparametric stochastic expectation maximization (NPSEM) algorithm was proposed, which extends stochastic EM to a nonparametric representation to further improve the model's practicability. The NPSEM was then introduced to HSIs and performed well [36]. In [84], a pairwise Markov field (PMF) model was constructed to segment noisy and blurred astronomical HSIs. It integrates the PMF model into the Bayesian framework to optimize the probability model, and it segments HSIs based on faint signals. In addition, to better learn the similarity among hyperspectral pixels, a layered sparse adaptive possibility c-means clustering (LSAPCM) approach was developed [85]. It integrates the layered possibility into the FCM framework to extend the architecture to a probability optimization model, and it produces good clustering results. In [86], a novel clustering model based on the latent Dirichlet distribution (CLDD) was constructed by introducing the topic model to simulate the structure of HSIs, with each topic modeled by the LDD. Moreover, considering that the mixed pixels of HSIs generally degrade the GMM performance, a Bayesian clustering method based on the spectral mixture model (SMM) and the Markov random field (MRF) was put forward for HSIs [87]. The Bayesian SMM-MRF utilizes the SMM to obtain the end-member abundance for each mixed pixel, and it assigns the mixed pixel according to the dominant end-member. Subsequently, this method integrates the SMM into the Bayesian framework to construct a conditional distribution of the mixed pixels to search for the dominant end-member, with the MRF utilized to optimize the label prior. Last, by solving the maximum posterior probability problem based on the EM rule, the pixel partition result is obtained. By considering the mixed pixel problem and comprehensively utilizing spectral–spatial information, the Bayesian SMM-MRF achieves good performance.

As a whole, probability-based clustering methods have strict mathematical foundations and employ various probability theories to optimize the clustering model. However, the complex internal structure and large spectral variability of HSIs mean that hyperspectral pixels do not strictly obey specific probability distributions, and thus they are inconsistent with the assumptions of such methods. As a result, probability-based clustering methods may fail to obtain good performance for HSIs.

FIGURE 4. The probability-based clustering mechanism. (a) The HSI. (b) The probability model construction and optimization. (c) The clustering result.

BIONICS-BASED CLUSTERING METHODS

Bionics-based clustering methods employ certain biological models, such as artificial neural networks (NNs), to simulate the complex internal structure of HSIs and partition pixels based on certain biological evolution algorithms, as described in Figure 5.
A typical example is the self-organizing map (SOM) model, which is an unsupervised learning method based on the Kohonen NN and has been successfully applied to HSIs [42], [88]. The SOM automatically learns the underlying similarity among the input pixels and then puts similar pixels close together in the network. The SOM generally consists of an input layer and a competitive layer, with a learning stage and a clustering stage. In the learning stage, the winning neurons are selected based on Euclidean distance, and then the weights of the winning neurons and the neighboring neurons are
updated, as in (7). In the clustering stage, similar pixels are mapped to the neighboring neurons:

$W_{ij}^{k+1} = W_{ij}^{k} + \Delta W_{ij}, \quad \Delta W_{ij} = \eta \exp\left( - d_{j, I(Y_i)}^{2} / 2\sigma^{2} \right) (Y_j - W_{ij})$,  (7)

where W_{ij}^{k} denotes the weight between neurons i and j in the kth iteration, ΔW_{ij} stands for the weight gain, d means the Euclidean distance, I(·) represents the activated neuron, η is the learning rate, and σ is the kernel parameter.

FIGURE 5. The bionics-based clustering mechanism. (a) The HSI. (b) The biological model construction. (c) The biological evolution optimization. (d) The clustering result.
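To make the update rule in (7) more tangible, here is a minimal numpy sketch of training a one-dimensional competitive layer; the map size, learning rate, kernel width, and epoch count are illustrative assumptions, and practical SOM implementations usually decay η and σ over time.

```python
# A minimal sketch of the SOM weight update in (7) for a 1D competitive layer.
import numpy as np

def som_train(Y, n_neurons, n_epochs=10, eta=0.5, sigma=1.0, seed=0):
    """Y: D x MN pixel matrix; returns the trained weight matrix W (D x n_neurons)."""
    rng = np.random.default_rng(seed)
    D, MN = Y.shape
    W = Y[:, rng.choice(MN, size=n_neurons, replace=False)].astype(float)
    grid = np.arange(n_neurons)                   # neuron coordinates on the map
    for _ in range(n_epochs):
        for j in rng.permutation(MN):
            y = Y[:, j].astype(float)
            winner = np.argmin(((W - y[:, None]) ** 2).sum(axis=0))  # winning neuron I(Y_j)
            h = np.exp(-((grid - winner) ** 2) / (2 * sigma ** 2))   # neighborhood kernel
            W += eta * h[None, :] * (y[:, None] - W)                 # weight update as in (7)
    return W
```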
To better simulate the complexity of HSIs, many advanced biological models have been constructed. For example, in [42], an unsupervised artificial immune classifier (UAIC) was proposed. The UAIC utilizes an artificial immune system to simulate the complex internal structure of HSIs and employs a series of biological computation techniques, such as clonal selection, the immune network, and immune memory, to partition pixels. Specifically, cluster centroids are randomly selected, and each pixel is assigned to a cluster with the maximum affinity between antigens and antibodies. An immune evolution algorithm is utilized to update the antibody population and the memory cell (MC) pooling until convergence. As a result, the UAIC obtains a relatively good result for HSIs. Then, an enhanced version of the UAIC, i.e., an unsupervised artificial immune network for remote sensing classification (RSUAIN), was constructed to further improve the clustering performance [89]. Instead of utilizing the distance threshold scalar to update the MC pooling and constrain the number of MCs, the RSUAIN introduces two immunological parameters, i.e., the death rate and the suppression rate, to update the MC matrix and determine the structure of the network by controlling the connection of network cells. Then, the RSUAIN forces each class to have an inner network connection and enhances the diversity of the MC population via the suppression rate to improve the evolution quality. In addition, considering the large volume, high dimension, and spectral diversity of HSIs, an unsupervised spectral-matching classifier based on artificial deoxyribonucleic acid (DNA) computing (UADSM) was put forward [39]. The UADSM employs an artificial DNA model to simulate the complexity of HSIs, and it clusters pixels through a series of artificial DNA computing techniques, including DNA spectral coding, optimization, and matching. The UADSM extracts multiple spectral features, such as the shape, amplitude, and slope, to enhance the discriminability of the features and optimizes clusters by recombining DNA strands. Based on the normalized DNA spectral similarity, the spectral signature of each pixel is assigned to the corresponding cluster to obtain the clustering result. Moreover, in [90], a novel context-aware unsupervised discriminative extreme learning machine (CUDELM) algorithm was developed for HSIs. The CUDELM introduces the extended NN, i.e., the ELM, to efficiently learn the structural information. Then, local spectral–spatial information is incorporated into the hidden layer features via a context-aware propagation filter, and the local and global structural information is integrated through regularization to learn more discriminative features. Consequently, the CUDELM yields accurate clustering results for HSIs. Besides, in [91], a new weighted incremental NN (WINN) method was developed for HSI segmentation. The WINN models the topology of pixels by using a set of weighted nodes, with the weights determined by the local density, and clusters the net through a watershed-like procedure to obtain the final clustering result.

On the whole, bionics-based clustering methods can effectively simulate the internal complexity of HSIs to some degree, and they may produce accurate clustering results by employing advanced biological evolution algorithms. However, these methods still face obstacles. For example, the complex structure of HSIs cannot always be well fitted by specific biological models in practice, and the large spectral variability further reduces the modeling accuracy, which limits the clustering performance.

INTELLIGENT COMPUTING-BASED CLUSTERING METHODS

Intelligent computing-based clustering methods are generally founded on other clustering models, such as the centroid-based clustering model, and utilize advanced intelligent computing algorithms, such as genetic evolution, differential evolution, and particle swarm optimization (PSO), to search for the global optimal solution of the clustering model and further improve the clustering performance, as presented in Figure 6. According to the number of objective functions in the optimization problem, intelligent computing-based clustering methods can be further divided into two types: 1) single-objective-based clustering and 2) multiobjective-based clustering.

SINGLE-OBJECTIVE-BASED CLUSTERING

The single-objective-based clustering method has only a single objective function in the optimization problem, with an intelligent computing technique utilized to search for the global optimal solution. A representative single-objective-based clustering method is the fuzzy clustering using an improved differential evolution (FCIDE) algorithm [92]. It introduces a certain validation index as the fitness function and searches for the optimal solution based on the differential evolution algorithm. Specifically, FCIDE utilizes the clustering separation (CS) measure or the Davies–Bouldin (DB) measure as the validation index to define the fitness function, as in (8):

$f = \frac{1}{CS_i(K) + eps} \quad \text{or} \quad f = \frac{1}{DB_i(K) + eps}$,  (8)

where K denotes the number of clusters and eps is an adjustment factor. The definitions of CS and DB can be found in [92]. Based on FCIDE, in [31], a modified differential evolution fuzzy clustering (MoDEFC) algorithm was put forward to further improve the clustering performance. MoDEFC constructs a model using the Xie–Beni index as the validation index. FCIDE and MoDEFC were then introduced to HSIs, delivering good performance [32]. In addition, the AMASFC method employs the memetic algorithm to combine local and global information to search for the optimal solution, and it further improves the clustering accuracy [65]. Moreover, considering that the GMM-EM easily falls into local optimal solutions, a novel PSO-based GMM clustering (PSO-GMM) method was developed for HSIs [93]. It uses the advanced PSO algorithm instead of EM to search for a global optimal solution and improves the parameterization and parameter-updating approaches to overcome the degeneracy problem.
Consequently, the clustering accuracy is effectively improved.

FIGURE 6. The intelligent computing-based clustering mechanism. (a) The HSI. (b) The clustering model construction. (c) Intelligent computing. (d) The clustering result.
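The single-objective idea in (8) can be sketched with off-the-shelf tools: candidate centroid sets are scored with a cluster-validity index and evolved by differential evolution. The index choice (Davies–Bouldin), parameter values, and the fact that the index is minimized directly (rather than maximizing 1/(DB + eps)) are assumptions made only for illustration, and in practice the search would typically run on a reduced feature space.

```python
# Illustrative differential-evolution clustering with a validity-index fitness, cf. (8).
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.metrics import davies_bouldin_score

def de_cluster(Y, c, maxiter=30, seed=0):
    """Y: D x MN pixel matrix; returns labels found by evolving the centroid positions."""
    X = Y.T
    D = X.shape[1]
    bounds = [(X[:, d].min(), X[:, d].max()) for d in range(D)] * c  # one box per centroid coordinate

    def fitness(theta):
        centroids = theta.reshape(c, D)
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        if len(np.unique(labels)) < 2:           # degenerate partitions get a poor score
            return 1e6
        return davies_bouldin_score(X, labels)   # lower DB index = better partition

    result = differential_evolution(fitness, bounds, maxiter=maxiter, seed=seed, polish=False)
    best = result.x.reshape(c, D)
    return np.argmin(((X[:, None, :] - best[None, :, :]) ** 2).sum(-1), axis=1)
```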
MULTIOBJECTIVE-BASED CLUSTERING

Multiobjective-based clustering methods generally address more than one optimization problem and simultaneously search for optimal solutions based on certain intelligent computing techniques. Compared with single-objective-based clustering methods, multiobjective-based clustering approaches are more popular and generally perform better, as they consider numerous factors at the same time, e.g., spectral and spatial information, and local and global information. A representative example is the automatic fuzzy clustering based on multiobjective differential evolution (AFCMDE) algorithm [32]. It extends the MoDEFC model to a multiobjective version for an improved ability to learn the complexity of remote sensing images, with two objective functions included, i.e., the partition error and the Xie–Beni index, as in (9):

$\min f(Y) = [f_1(Y), f_2(Y)]$,  (9a)

$f_1 = \sum_{i=1}^{c} \sum_{j=1}^{MN} U_{i,j}^{m} \| Y_j - \mu_i \|_2^2, \quad f_2 = \frac{\sum_{i=1}^{c} \sum_{j=1}^{MN} U_{i,j}^{m} \| Y_j - \mu_i \|_2^2}{MN \min_{i \ne k} \| \mu_i - \mu_k \|_2^2}$,  (9b)

$\mu_i = \frac{\sum_{j=1}^{MN} U_{i,j}^{m} Y_j}{\sum_{j=1}^{MN} U_{i,j}^{m}}, \quad U_{i,j} = \frac{\| Y_j - \mu_i \|^{-2/(m-1)}}{\sum_{l=1}^{c} \| Y_j - \mu_l \|^{-2/(m-1)}}$.  (9c)

Specifically, AFCMDE consists of two layers, i.e., optimization and clustering. In the optimization layer, a feasible number of clusters is obtained by minimizing these two objective functions. In the clustering layer, a nondominated sorting method is utilized to update the population and search the Pareto front to obtain the final clustering result. Through multiobjective optimization, AFCMDE outperforms MoDEFC. Then, based on AFCMDE, a multiobjective memetic FCM algorithm (AFCMOMA) was presented to further improve the optimization capability of the model [94]. The approach introduces the memetic algorithm to balance the local and global search abilities and adds a new population-updating strategy to obtain more high-quality individual samples. As a result, the clustering accuracy is further improved. In addition, a novel social recognition-based multiobjective gravitational algorithm (SMGSA) was developed for HSIs to learn the similarity relationships among pixels [95]. The SMGSA algorithm searches individual pixels among the elite ones obtained via the gravitational force and the general ones learned from the social recognition model, based on the whole population, to generate an outstanding exploitation ability. Furthermore, in [38], a novel multiobjective PSO (MOPSO) method was proposed for HSIs to simultaneously solve three problems, i.e., estimating the clustering statistical parameters, searching for the best discriminative bands, and estimating the number of clusters, using three different optimization criteria. Moreover, based on the advanced sparse subspace clustering (SSC) model, a multiobjective SSC (MOSSC) method was put forward for HSIs. It treats the sparse constraint term and the data fidelity term as two objective functions to avoid the manual determination of the regularization parameter required in the SSC model [96].

Commonly, with the help of advanced intelligent optimization algorithms to search for the global optimal solution to the clustering model, intelligent computing-based clustering methods may perform better than traditional clustering approaches. However, such techniques still have several disadvantages that limit their practical applications to some degree. For example, the principle of such methods is relatively complex, with a high application threshold.
In addition, such techniques are generally based on other clustering models, and their performance is limited by the inherent defects of the underlying clustering models, such as the FCM and GMM.

GRAPH-BASED CLUSTERING METHODS

Graph-based clustering is one of the recently developed advanced hyperspectral clustering approaches, and it evolved from graph theory. Such methods generally model the relationships among hyperspectral pixels with an adjacency matrix W̃ ∈ ℝ^(MN × MN), also known as a similarity graph, whose elements represent the similarity between a corresponding pair of pixels or the penalty factor for separating the corresponding two pixels into different subgraphs. The adjacency matrix is the basis of graph clustering, and its quality directly affects the final clustering accuracy. In practice, it is generally constructed by the ε-ball strategy [97], the KNN strategy [98], or the full connection strategy [99]. Then, by applying a certain graph cut algorithm to minimize the total cutting cost of the adjacency matrix, the final clustering result can be obtained, as shown in Figure 7. Specifically, graph cut is a very important part of graph theory. It aims to segment a graph into several disjoint and distinctive subgraphs by maximizing the intrasubgraph similarity and minimizing the intersubgraph similarity, with each subgraph denoting a specific class. Over decades of development, many graph cut algorithms have been proposed, including minimum cut [100], ratio cut [101], normalized cut [102], average cut [103], minimum–maximum cut [104], and so on. Among them, the normalized cut algorithm is the most widely used. According to the differences among the constructed graphs, graph-based clustering methods can be coarsely divided into three main kinds: 1) complete graph-based clustering, 2) bipartite graph-based clustering, and 3) abbreviated graph-based clustering.

COMPLETE GRAPH-BASED CLUSTERING

Complete graph-based clustering methods group HSIs based on an adjacency matrix W̃ that covers all pixels; the matrix contains the similarity between any pair of pixels, at a size of MN × MN. A typical example is spectral clustering (SC) [102], [105], which generally employs the normalized cut algorithm to conduct graph cutting, with the spectral analysis model formulated as the following optimization problem:
$\min_{F} \operatorname{Tr}(F^{T} L F), \quad \text{subject to } F^{T} F = I$,  (10)

where L is the graph Laplacian matrix, i.e., L = D − W̃, and D is the degree matrix, which is a diagonal matrix with the diagonal elements D_ii = Σ_j W̃_ij. SC commonly solves the optimization problem via singular value decomposition (SVD). By extracting the c eigenvectors corresponding to the c smallest eigenvalues, an optimal F can be obtained, where c is the number of clusters. Then, by applying the k-means to F, the final clustering result can be obtained.

FIGURE 7. The graph-based clustering mechanism. (a) The HSI. (b) The adjacency matrix. (c) The graph cut. (d) The clustering result.

In [106], a novel Schroedinger eigenmap with nondiagonal potentials for spectral–spatial clustering (SENP) algorithm was proposed for HSIs. The approach employs a Schroedinger eigenmap, which is an extension of the graph Laplacian matrix, to integrate barrier and cluster potentials to accurately model the similarity between pixels. Then, different kinds of nondiagonal potentials are explored within the model to encode the spatial proximity and integrate the spectral proximity through manifold learning. As a result, the graph discriminability is enhanced, and a more accurate clustering result is obtained. In addition, in [107], a graph-based nonlocal total variation (NLTV) method was developed. It explores the spatial information of HSIs with an NLTV constraint to construct a more accurate similarity graph, and it introduces the primal–dual hybrid gradient algorithm to efficiently solve the graph cut problem. Consequently, NLTV obtains accurate clustering results for HSIs. Furthermore, a joint spectral–spatial clustering with a block-diagonal amplified affinity matrix (JC-BAAM) algorithm was proposed. It considers the size and shape differences of the spatial neighborhoods of different hyperspectral pixels to promote the block-diagonal property of the affinity matrix and increase the separability between different classes [108]. Besides, by paying special attention to small variations in data density and scaling the clusters based on the latent structure, a novel graph-based clustering (GC) algorithm was developed for HSIs. It obtains a better effect for small classes that have few pixels [109]. In addition, a graph clustering-based method was put forward to solve semisupervised and unsupervised classification problems for HSIs [110]. It constructs a pairwise pixel similarity graph and develops a parallel Nyström extension model that randomly samples the graph to obtain a low-rank approximation of the graph Laplacian for SC.
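The spectral analysis step in (10) can be prototyped in a few lines: build a symmetrized KNN Gaussian similarity graph, take the c eigenvectors of the graph Laplacian associated with the smallest eigenvalues, and run the k-means on them. The kernel width, the neighborhood size, and the use of the unnormalized Laplacian instead of the normalized cut are illustrative simplifications.

```python
# A brief sketch of graph construction and the spectral analysis in (10).
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_cluster(Y, c, k=10, sigma=1.0, seed=0):
    """Y: D x n matrix of pixels (or a pixel subset); returns n cluster labels."""
    d2 = cdist(Y.T, Y.T, "sqeuclidean")
    W = np.exp(-d2 / (2 * sigma ** 2))                # Gaussian kernel similarities
    mask = np.zeros_like(W, dtype=bool)               # keep only the k strongest links per pixel
    idx = np.argsort(-W, axis=1)[:, :k]
    np.put_along_axis(mask, idx, True, axis=1)
    W = np.where(mask | mask.T, W, 0.0)               # symmetrized KNN graph
    np.fill_diagonal(W, 0.0)
    Dm = np.diag(W.sum(axis=1))
    L = Dm - W                                        # graph Laplacian L = D - W
    _, F = eigh(L, subset_by_index=[0, c - 1])        # c eigenvectors, smallest eigenvalues
    return KMeans(n_clusters=c, random_state=seed, n_init=10).fit_predict(F)
```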
Moreover, some other extended graph-based clustering models, i.e., manifold-based models, were developed for HSIs. For example, in [111], a multimanifold SC (MMSC) algorithm was proposed for HSIs that constructs a nearest-neighbor connectivity model based on the shared nearest neighborhood and estimates the tangent space with a weighted principal component analysis (PCA). Then, an enhanced MMSC, i.e., contractive autoencoder-based MMSC (CA-MMSC), was developed for HSIs to estimate the tangent space via a contractive autoencoder and obtain better performance [112]. In [113], a rank 2 NMF-based hierarchical clustering (H2NMF) algorithm was developed for HSIs. It first treats all pixels as one cluster and then splits one cluster into two disjoint clusters using a rank 2 NMF model until stable results are obtained. In addition, in [114], an orthogonal graph-regularized NMF (OGNMF) method was introduced. It combines the orthogonal graph constraints with the NMF model to learn the local structure information of HSIs and achieves a relatively good clustering effect. In addition, a robust manifold factorization-based clustering (RMFC) algorithm was proposed for HSIs [22]. It employs a low-rank matrix factorization framework to simultaneously deal with the dimension reduction (DR) task and the clustering task, with manifold regularization to enhance the robustness of the clustering model. With the help of the out-of-sample extension trick, it can be extended to large HSIs.

BIPARTITE GRAPH-BASED CLUSTERING

Bipartite graph-based clustering is an extended version of complete graph-based clustering, and it has been successfully applied to HSIs with good effect. In contrast to the complete graph, the bipartite graph models the relationships between two different sets, i.e., the anchor set and the pixel set, to obtain a structured similarity matrix W̃ ∈ ℝ^((MN+n) × (MN+n)) at a larger size, as in (11):

$\widetilde{W} = \begin{pmatrix} 0 & A \\ A^{T} & 0 \end{pmatrix}$.  (11)

Here, A is generally constructed based on Gaussian kernel distances, with the KNN strategy utilized, as in (12):

$A_{ij} = \begin{cases} \exp\left( -d \| Y_i - \hat{Y}_j \|^{2} \right), & Y_i \in M_k(\hat{Y}_j) \text{ or } \hat{Y}_j \in M_k(Y_i) \\ 0, & \text{otherwise} \end{cases}$,  (12)

where Ŷ ∈ ℝ^(D × n) is the anchor matrix derived from the HSI matrix and d is the kernel parameter. A representative bipartite graph-based clustering method is spectral–spatial coclustering based on a bipartite graph (SSCC-BG) [115]. It extracts anchors from the cluster centroids of the k-means and then constructs a bipartite graph between centroids and pixels. SSCC-BG obtains good clustering results for HSIs by fusing spectral information and spatial information into the graph. In addition, in [116], a sequential SC (S-SC) method was developed to efficiently cluster HSIs. It employs the minibatch k-means to determine the anchors and conduct cluster assignments. Then, based on the bipartite graph, S-SC utilizes a sequential SVD for the product of the rows and columns of AᵀA, instead of directly decomposing it, which effectively reduces the computational complexity and improves the efficiency of the model. Furthermore, in [117], a novel bipartite graph partition-based coclustering with joint sparsity (BGP-CJS) was put forward for HSIs. The technique builds a more informative bipartite graph with an A learned from a joint-sparsity-constrained optimization problem. Then, an efficient spectral graph-based normalized cut method is proposed to simultaneously cluster the rows and columns of the similarity matrix. Consequently, the BGP-CJS further improves the clustering accuracy.

ABBREVIATED GRAPH-BASED CLUSTERING

To overcome the large computational complexity of complete graph-based clustering, efficient abbreviated graph-based clustering has been developed, which selects only a few important and representative points, i.e., key points, to construct a similarity graph at a much smaller size of n × n, where n is the number of key points. In practice, the abbreviated graph is generally induced by the anchor graph.
Hence, anchor graph-based clustering methods are the most representative abbreviated graph-based clustering models. In many recent studies, the anchor graph is utilized to evaluate the similarity among pixels instead of the complete graph [118], [119]. Such methods generally include two main steps, i.e., anchor selection and relation matrix construction. The representative pixels or cluster centroids are considered as anchors, which are commonly obtained via random selection or preclustering. The relation matrix A ∈ ℝ^(MN × n) models the relationships among anchors and pixels, and it is generally constructed based on certain similarity measurements, such as the Gaussian kernel distance. In [120], A is taken as a variable and obtained by learning, as in (13), which leads to a more accurate A:

$\min_{A\mathbf{1} = \mathbf{1},\ A_{ij} \ge 0} \sum_{i=1}^{MN} \sum_{j=1}^{n} \| Y_i - \hat{Y}_j \|_2^2 A_{ij} + \gamma \| A \|_F^2$,  (13)

where γ is the regularization parameter and 1 ∈ ℝ^(n × 1) is a vector whose elements are all ones. With the obtained A, the adjacency matrix W̃ can be constructed as W̃ = A K⁻¹ Aᵀ, where K ∈ ℝ^(n × n) is a diagonal matrix with the diagonal elements K_jj = Σ_{i=1}^{MN} A_ij. Such methods are generally much more efficient, and they are more scalable to large HSIs, given their small computing demands. However, because only a few key points are utilized to approximate the structure information of HSIs, the underlying adjacency among pixels cannot be accurately mined. Consequently, the clustering accuracy of such methods is generally discounted.

A typical example of the abbreviated graph-based clustering method is the fast SC with an anchor graph (FSCAG) algorithm [43]. To ensure the clustering efficiency, FSCAG randomly selects the anchors from the original hyperspectral pixels. Then, the relation matrix A is learned from (13), with a spatial constraint based on the mean filtered results of HSIs, inspired by FCM-S1, to incorporate spatial information into the anchor graph. Last, through spectral analysis of the adjacency matrix W̃ induced by A, as in (10), the final clustering result is yielded.
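A sketch of this anchor-graph pipeline is given below: anchors from a quick k-means, a KNN Gaussian relation matrix A in the spirit of (12) (fixed here rather than learned as in (13)), and the induced adjacency W̃ = A K⁻¹ Aᵀ. The parameter values are illustrative, and the dense MN × MN return value is kept only for clarity; in practice the factorized form A K⁻¹ Aᵀ would never be materialized.

```python
# Illustrative anchor-graph construction and the induced adjacency matrix.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def anchor_graph(Y, n_anchors=200, k=5, gamma=1.0, seed=0):
    """Y: D x MN pixel matrix; returns the anchor-induced adjacency (dense, for clarity only)."""
    X = Y.T
    anchors = KMeans(n_clusters=n_anchors, random_state=seed, n_init=4).fit(X).cluster_centers_
    d2 = cdist(X, anchors, "sqeuclidean")              # MN x n distances to the anchors
    A = np.exp(-gamma * d2)
    cut = np.sort(A, axis=1)[:, -k][:, None]           # keep each pixel's k nearest anchors
    A = np.where(A >= cut, A, 0.0)
    A /= A.sum(axis=1, keepdims=True)                  # rows sum to one, cf. the constraint in (13)
    K_inv = np.diag(1.0 / np.maximum(A.sum(axis=0), 1e-12))
    return A @ K_inv @ A.T                             # adjacency W = A K^{-1} A^T
```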
In addition, in [121], a scalable graph-based clustering with nonnegative relaxation (SGCNR) algorithm was proposed. It learns A from (13) to construct the adjacency matrix W̃. Then, by adding an additional nonnegative constraint to the spectral analysis model to more accurately relax it from the discrete case to the continuous case, improved clustering results can be obtained.

In summary, because of the flexible graph construction means, powerful structure information mining ability, and relatively robust clustering performance, graph-based clustering methods have drawn wide attention and become one of the research hot spots in the hyperspectral clustering field. However, they are generally restricted by computational complexity, and they need to strike a compromise between accuracy and efficiency, as in abbreviated graph-based clustering. In addition, due to the inadequate consideration of the interactions among pixels during the graph construction process and the influences from the large spectral variability and high correlations among hyperspectral pixels, such techniques generally cannot accurately mine the underlying adjacency among pixels, which limits their clustering performance to a certain degree.

SUBSPACE CLUSTERING METHODS

Subspace clustering is another recently developed advanced hyperspectral clustering approach founded on graph-based clustering models. Such methods generally model same-class pixels that have various spectral signatures with a subspace and approximate the complex internal structure of HSIs by a union of subspaces, as detailed in Figure 8, which may relieve the large spectral variability and improve the modeling accuracy. Then, such methods explore the underlying adjacency among pixels through self-representation learning via an overcomplete dictionary derived from the HSI data, with a certain prior structural constraint imposed on the representation coefficient matrix to obtain stable solutions, as in (14). By fully exploring the interactions among pixels and the contribution of each atom to the target pixel, the learned adjacency matrix may be more accurate and informative, and it may guarantee that pixels are segmented into the correct subspaces:

$\min_{C} G(C) \quad \text{subject to } Y = YC + N$,  (14)

where C ∈ ℝ^(MN × MN) is the representation coefficient matrix, which contains the pairwise-pixel similarity and reveals the latent partition pattern of pixels to a certain degree. Here, G(·) denotes the prior structural constraint on C, including the sparse constraint [122], low-rank constraint [123], energy constraint [124], and so on. Then, the adjacency matrix W̃ can be induced by the coefficient matrix C, such as W̃ = |C| + |C|ᵀ. Last, by applying SC to the adjacency matrix, the final clustering result can be obtained [125], [126]. The most representative subspace clustering model is SSC [122], [127], which exploits the underlying adjacency among hyperspectral pixels by solving the following sparsity-promoting optimization problem, based on the basic assumption that each target pixel can be recovered by only a few atoms from its own subspace in the HSI self-dictionary.
FIGURE 8. The subspace clustering mechanism. (a) The HSI. (b) Subspace modeling. (c) Self-representation learning. (d) The similarity graph. (e) The clustering result. DN: digital number.

The SSC model can be formulated as in (15):

$\min_{C} \| C \|_1 + \frac{\lambda}{2} \| Y - YC \|_F^2 \quad \text{subject to } \operatorname{diag}(C) = 0,\ C^{T}\mathbf{1} = \mathbf{1}$,  (15)

where λ is a regularization parameter to balance the sparsity term and the data fidelity term, diag(C) = 0 avoids the trivial solution caused by representing each pixel by itself, and Cᵀ1 = 1 means that the affine subspace model is adopted.
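A hedged, per-pixel sketch of the sparse self-representation problem in (15) is shown below, using a LASSO solver column by column; the affine constraint Cᵀ1 = 1 is dropped for brevity, the parameter values are illustrative, and real HSIs usually require a far more efficient solver than this loop.

```python
# Illustrative column-wise sparse self-representation in the spirit of (15),
# followed by the adjacency matrix induced from the coefficients.
import numpy as np
from sklearn.linear_model import Lasso

def ssc_coefficients(Y, alpha=1e-3):
    """Y: D x n matrix of (sub-sampled) pixels; returns the n x n coefficient matrix C and W."""
    D, n = Y.shape
    C = np.zeros((n, n))
    for j in range(n):
        mask = np.arange(n) != j                       # enforce diag(C) = 0: exclude pixel j itself
        model = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        model.fit(Y[:, mask], Y[:, j])                 # sparse code of pixel j over the self-dictionary
        C[mask, j] = model.coef_
    W = np.abs(C) + np.abs(C).T                        # adjacency induced from C for spectral clustering
    return C, W
```

The resulting W would then be handed to a spectral clustering step such as the earlier sketch of (10).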
Although SSC has shown significant potential in hyperspectral clustering, the clustering performance is still limited by some shortcomings, e.g., ignoring the importance of the spatial information and the nonlinearity of HSIs. Based on this fact, in recent years, many enhanced subspace clustering algorithms have been proposed to further improve the clustering performance and exploit the potential of subspace clustering. On the basis of the working mechanism, such methods can be coarsely summarized into three main categories: 1) spectral–spatial subspace clustering, 2) multiview subspace clustering, and 3) kernel subspace clustering.

SPECTRAL–SPATIAL SUBSPACE CLUSTERING

Spectral–spatial subspace clustering methods focus on exploring the spectral–spatial duality of HSIs within the self-representation framework to reduce the influence of salt-and-pepper noise and enhance the spatial homogeneity of the clustering result. By incorporating spatial information to help the spectral analysis in the representation domain, the piecewise smoothness of the representation coefficient matrix can be effectively enhanced, and the representation bias can be reduced to some degree. As a result, the clustering performance can be effectively improved. In general, with a certain spatial constraint, the spectral–spatial subspace clustering model can be formulated as follows:

$\min_{C} G(C) + \alpha R(C) \quad \text{subject to } Y = YC + N$,  (16)

where R(·) denotes the spatial regularization term and α is a regularization parameter to trade off the importance between the spectral term and the spatial term. A typical example is the spectral–spatial SSC (S⁴C) algorithm [40]. It promotes the target pixels to be represented by highly related atoms via a weighting strategy and incorporates the spatial neighborhood information to generate an integrated self-representation model by constructing an eight-neighborhood local average spatial regularization, based on the assumption that the average coefficient in a small local window should be close to the coefficient of the center pixel. Considering that the assumption of the local average constraint cannot be satisfied in areas with a complex land cover distribution, a new ℓ₂-norm regularized SSC (L2-SSC) algorithm was proposed [41]. It incorporates spatial information in a more refined way by constructing an efficient four-neighborhood ℓ₂-norm spatial regularization, which further improves the clustering performance. In addition, in [128], a spectral–spatial SSC based on 3D edge-preserving filtering (SSC-3DEPF) algorithm was put forward. It applies 3D edge-preserving filtering to the sparse coefficient matrix obtained by SSC to extract the spectral–spatial information and generate a more accurate coefficient matrix, which is favorable for clustering. In [129], a joint SSC (JSSC) method was proposed to make use of spatial information through joint sparse representation. It forces the pixels in a spatial neighborhood to share the same sparse basis. In [44], based on the sparse coefficient matrix learned by SSC, two enhanced methods were put forward to construct a more accurate adjacency matrix, i.e., cosine–Euclidean (CE) and CE dynamic weighting (CEDW).
These two methods simultaneously utilize the spectral and spatial information, with the cosine similarity exploited to measure the spectral similarity and Euclidean distances utilized to incorporate the spatial information. Moreover, in [130], a Laplacian-regularized low-rank subspace clustering (LLRSC) algorithm was proposed. It incorporates three different Laplacian regularizations into the low-rank subspace clustering (LRSC) model to explore the importance of the correlation information of HSIs, and it achieves good performance in HSI band selection. In [131], a spectral–spatial LRSC (SS-LRSC) model was developed. It utilizes a new modulation strategy to incorporate the correlations into the low-rank representation matrix through weighting and local spatial bilateral filtering, which performs well for HSIs. Furthermore, in [132], a Gaussian kernel dynamic similarity matrix-based SSC (GKD-SSC) method was introduced. It improves the quality of the adjacency matrix by simultaneously utilizing the sparse coefficient matrix obtained by SSC and the Gaussian kernel similarity based on the distances between pixels after PCA processing. Considering the large computational complexity of sparse recovery-based methods, a novel total variation (TV)-regularized collaborative representation clustering algorithm with a locally adaptive dictionary (TV-CRC-LAD) was proposed for HSIs [21]. This approach exploits the collaborative and competitive relationships among pixels from all classes in the self-representation process and deals with the clustering task within the collaborative representation framework, with less complexity. Then, it reduces the serious interferences from unrelated atoms in the whole dictionary by constructing a locally adaptive dictionary for each target pixel and integrates spatial information to enhance the piecewise smoothness of the coefficient matrix via the TV regularization. As a result, the TV-CRC-LAD may perform better for HSIs. In addition, to overcome the large computational complexity and the time and memory cost of the self-dictionarybased methods, a sketched subspace clustering method was developed [133]. It conducts the self-representation learning under a sketched dictionary with much fewer atoms obtained by random projection, which reduces the computational complexity and enhances the scalability of the model to a large degree. Then, the sketched subspace clustering method was introduced to HSIs, and based on it, through the TV constraint to incorporate spatial information, an enhanced method was proposed, i.e., TV sketched subspace clustering [134]. Furthermore, considering that pixel-based clustering methods generally encounter several obstacles and that they were easily affected by salt-and-pepper noise and could not accurately model the spatial neighborhoods of hyperspectral pixels with various shapes, several object/super-pixel 49
based SSC methods were developed for HSIs. A typical example is the mass center-reweighted object-oriented SSC (MCR-OOSSC) algorithm [135]. It flexibly models spatial neighborhoods with various shapes via objects obtained from oversegmentation and extracts more representative and discriminative object mass centers as features to construct the object sparse representation model, as in (17):

$$\min_{\tilde{C}}\ \|\tilde{C}\|_1 + \frac{\lambda}{2}\,\|\tilde{Y} - \tilde{Y}\tilde{C}\|_F^2 \quad \text{subject to}\ \operatorname{diag}(\tilde{C}) = 0,\ \tilde{C}^{T}\mathbf{1} = \mathbf{1}. \qquad (17)$$

Here, $\tilde{Y} \in \mathbb{R}^{D \times G}$ is the object mass center data matrix and $\tilde{C} \in \mathbb{R}^{G \times G}$ is the associated sparse coefficient matrix, with G denoting the number of objects. Based on the MCR-OOSSC approach, in [136], a higher-order superpixel-based SSC algorithm with a conditional random field (SP-SSC-CRF) was proposed. It integrates the advantages of the S⁴C and OOSSC methods to generate an enhanced model and utilizes the conditional random field to further smooth the within-class noise. In general, these object/superpixel-based methods improve the clustering performance to a certain degree and greatly reduce the time cost by converting pixel clustering to object clustering, which significantly increases the attractiveness of subspace clustering in practical applications. Moreover, to better evaluate the discriminative information and more accurately learn the nonlinear structure of HSIs, a Laplacian-regularized deep subspace clustering (LRDSC) algorithm was proposed [137]. It combines subspace clustering with the deep convolutional autoencoder network to learn the nonlinearity of HSIs and extracts spectral–spatial information through 3D convolutions and deconvolutions with skip connections to fully exploit multilevel features. Consequently, LRDSC obtains highly competitive clustering performance for HSIs.

MULTIVIEW SUBSPACE CLUSTERING
Multiview subspace clustering methods take full advantage of complementary information found in different domains of HSIs to further improve the clustering performance. Generally, each view corresponds to a specific feature domain, such as the spectral feature domain, the texture feature domain, the shape feature domain, and so on. Such techniques generally construct a unified model to integrate multiview feature self-representation problems. A typical example is the spectral–spatial-based multiview low-rank SSC (SSMLC) method [138], which has been applied to HSIs [139]. Specifically, it generates the spectral view by spectral partitioning to obtain correlated bands and creates the spatial view by morphological processing. In addition, another view is generated by PCA to remove the serious noise in HSIs. By integrating different views within the SSC framework, SSMLC can be modeled as in (18):

$$\min_{C_1, C_2, \ldots, C_m}\ \frac{1}{2}\sum_{i=1}^{m}\left(\beta_1\|C_i\|_{*} + \beta_2\|C_i\|_1\right) + \gamma \sum_{1 \le i, j \le m,\ i \ne j}\|C_i - C_j\|_F^2 \quad \text{subject to}\ Y_i = Y_i C_i,\ \operatorname{diag}(C_i) = 0, \qquad (18)$$

where m denotes the number of views and $Y_i$ indicates the feature matrix of the ith view, with $C_i$ representing the associated coefficient matrix. The terms $\beta_1$, $\beta_2$, and $\gamma$ are three regularization parameters. The second term forces the coefficient matrices learned from different views to share the same pattern. Through multiview learning, the complementary information of HSIs can be effectively integrated, and the discriminability of the representation coefficients can be enhanced to some degree, which leads to more accurate clustering results.
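The object/superpixel strategy behind (17) can be sketched roughly as follows. This is an illustration under assumptions, not MCR-OOSSC itself: scikit-image SLIC is used for the oversegmentation, the superpixel mean spectra stand in for the mass centers, and the object-level clustering step is a simple spectral clustering stand-in into which a sparse self-representation model such as (17) could be plugged instead.

```python
import numpy as np
from skimage.segmentation import slic
from sklearn.cluster import SpectralClustering

def superpixel_mass_centers(cube, n_segments=200):
    """cube: H x W x D hyperspectral block. Oversegments it with SLIC and returns
    the D x G matrix of superpixel mean spectra plus the H x W superpixel map."""
    seg = slic(cube, n_segments=n_segments, compactness=0.1,
               channel_axis=-1, start_label=0)   # older skimage: multichannel=True
    G = seg.max() + 1
    centers = np.stack([cube[seg == g].mean(axis=0) for g in range(G)], axis=1)
    return centers, seg

def object_level_clustering(cube, n_clusters, n_segments=200):
    centers, seg = superpixel_mass_centers(cube, n_segments)
    # Stand-in object-level clustering on the G mass centers; a sparse
    # self-representation model as in (17) could replace this step.
    obj_labels = SpectralClustering(n_clusters=n_clusters,
                                    affinity="rbf").fit_predict(centers.T)
    return obj_labels[seg]                        # project object labels back to pixels

cube = np.random.rand(40, 40, 50)                 # toy stand-in for an HSI block
print(object_level_clustering(cube, n_clusters=4).shape)
```

Because G is far smaller than the number of pixels, whatever model is applied at the object level runs on a much smaller problem, which is the main source of the speedup noted above.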
Considering the nonlinearity of HSIs, the SSMLC model was extended to a kernel version, i.e., k-SSMLC [140]. It further improves the clustering accuracy by introducing the kernel technique to address the nonlinearly separable problem of HSIs in the multiview subspace clustering framework. In addition, to overcome the large computational burden of multiview subspace clustering, a parallel SSMLC (p-SSMLC) method was put forward [141]. It adopts a simple parallel strategy to reduce the time cost of SSMLC. Specifically, given the large size of remote sensing images, the HSI is first partitioned into many nonoverlapping 3D blocks. Then, the SSMLC method is applied to each 3D block to obtain the local clustering results. Last, by merging these local clustering outcomes, the final clustering result is obtained. By employing the advanced parallel computing technique, the overall time cost is significantly reduced, which further improves the practicability of the computationally expensive multiview subspace clustering models.

KERNEL SUBSPACE CLUSTERING
Due to the complex imaging environment and serious interference from various nonlinear factors in the imaging process, HSIs generally have an obvious nonlinear structure, and different classes are generally not linearly separable. However, most subspace clustering methods are based on the linear subspace assumption, which utilizes the union of linear subspaces to approximate the complex nonlinear internal structure of HSIs, leading to a large systematic error and poor separability among different classes. As a result, the clustering performance is degraded to some degree. Based on this fact, kernel subspace clustering methods have been developed to relieve the nonlinearity of HSIs and further improve the clustering performance through a kernel self-representation model, instead of a linear model, to more accurately mine the latent adjacency among pixels. Such methods first map pixels from the original feature space into a much higher dimensional kernel space to approximately transform the nonlinearly separable problem into a linearly separable one. Then, the self-representation property of the mapped features in the reproducing kernel space is exploited to construct a kernel self-representation model, as in (19), which generally leads to a more accurate coefficient matrix:

$$\min_{C}\ H(C) + \frac{\lambda}{2}\,\|K(Y) - K(Y)C\|_F^2 \quad \text{subject to}\ \operatorname{diag}(C) = 0, \qquad (19)$$

where K(·) denotes the kernelized data matrix, with the Gaussian radial basis function commonly utilized.
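As a rough numerical illustration of the kernel self-representation idea in (19), and not of any specific published algorithm, the following sketch takes H(C) to be a Frobenius-norm penalty, which admits a closed-form solution in terms of the Gram matrix, and then clusters the resulting affinity. The RBF bandwidth and regularization weight are arbitrary placeholders, and the diag(C) = 0 constraint is omitted.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering

def kernel_self_representation(Y, n_clusters, lam=1.0, gamma=0.5):
    """Y: D x N pixel matrix. Builds an RBF Gram matrix K and solves the
    ridge-type kernel self-representation
        min_C ||Phi(Y) - Phi(Y)C||_F^2 + lam * ||C||_F^2,
    whose closed form is C = (K + lam*I)^{-1} K."""
    K = rbf_kernel(Y.T, gamma=gamma)              # N x N Gram matrix
    N = K.shape[0]
    C = np.linalg.solve(K + lam * np.eye(N), K)   # coefficient matrix
    W = np.abs(C) + np.abs(C).T                   # symmetric affinity graph
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(W)

Y = np.random.rand(200, 80)                       # toy D x N stand-in
print(kernel_self_representation(Y, n_clusters=3))
```

Sparse or low-rank choices of H(C), as in the methods discussed next, require iterative solvers instead of this closed form.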
A typical kernel subspace clustering method is the kernel SSC algorithm with a spatial maximum pooling operation (KSSC-SMP) [142]. The KSSC-SMP extends SSC to nonlinear manifolds to construct the KSSC model to relieve the nonlinearly separable problem of HSIs to some degree. Then, it incorporates spatial neighborhood information through spatial maximum pooling to generate more discriminative features. Consequently, the KSSC-SMP may outperform linear SSC methods. In addition, in [143], a kernel sparse and LRSC (KSLRSC) algorithm was proposed. It utilizes sparse and low-rank constraints to simultaneously explore the local and global structure information of HSIs. Accordingly, the underlying adjacency among pixels can be more accurately learned. Then, the KSLRSC method is extended to semisupervised classification for HSIs. Furthermore, by adding a TV denoising constraint into the KSSC model to enhance the similarity among pixels from the same subspace, a KSSC with TV denoising (KSSC-TVD) algorithm was put forward for HSIs [144]. In addition, the k-SSMLC method is also a typical kernel subspace clustering model [140], which extends the linear multiview subspace clustering model to a kernel version to further improve the clustering accuracy. In general, because of their accurate modeling and powerful information extraction capability, subspace clustering methods have shown great potential for HSI clustering and achieved very competitive performance. In recent years, subspace clustering has gained progressively more attention. However, such methods are generally accompanied by a large computational complexity and massive time and memory consumption, which limits their applications to some degree.

DEEP LEARNING-BASED CLUSTERING METHODS
Deep learning-based clustering methods have been developed recently and are among the most advanced clustering techniques [145]. These approaches rely on deep NNs (DNNs), such as fully connected networks (FCNs) and convolutional NNs (CNNs), to learn more discriminative features for clustering and to more accurately simulate the nonlinearity of data, as shown in Figure 9.

FIGURE 9. The deep learning-based clustering mechanism. (a) The HSI. (b) Deep learning. (c) The clustering result.

Such methods generally deploy two components, i.e., the network and the clustering model. Since there are no available labeled samples, these models are generally optimized in an unsupervised way. According to the basic architecture, deep learning-based clustering methods can be further divided into three main categories: 1) autoencoder-based clustering, 2) separated network-based clustering, and 3) generative network-based clustering [146].

AUTOENCODER-BASED CLUSTERING
Autoencoder-based clustering methods are the earliest and most representative deep clustering approaches. An autoencoder is an unsupervised NN with the advantages of simplicity and effectiveness. It generally consists of an encoder for data representation and a decoder for data reconstruction, and it self-trains by minimizing the reconstruction error. A typical example is the deep clustering network (DCN) [147]. It implements DR via a deep autoencoder network to learn more k-means-friendly features and optimizes the DR and clustering tasks in a unified framework, as shown in (20):

$$\min_{I,\,\{s_i\}}\ \sum_{i=1}^{MN}\left(\ell\big(g(f(Y_i)), Y_i\big) + \frac{\gamma}{2}\,\|f(Y_i) - I s_i\|_2^2\right) \quad \text{subject to}\ s_{j,i} \in \{0, 1\},\ \mathbf{1}^{T} s_i = 1,\ \forall i, j, \qquad (20)$$
where f(·) and g(·) denote the nonlinear mapping functions of the encoder and the decoder, respectively; ℓ(·) stands for the reconstruction loss function, defined as $\ell(Y_i, X_i) = \|Y_i - X_i\|_2^2$, with $X_i$ being the reconstructed sample; I represents the centroid matrix, with its ith column referring to the ith cluster centroid; and $s_i$ means the assignment vector with only one nonzero element. The first term is the network loss, and the second term is the clustering assignment loss, with γ being a tradeoff parameter. Then, the DCN solves this problem with an alternating stochastic algorithm. Through learning more discriminative features via a deep autoencoder network, the DCN outperforms traditional clustering methods. A set of deep clustering models has been developed based on an autoencoder approach. In [148], a deep multimanifold clustering (DMC) method was proposed. It integrates a locality-preserving constraint into the deep autoencoder network to learn the latent embedded manifolds and more informative features, using both the reconstruction loss and the locality-preserving loss. Then, the proximity of the representations to the centroids is employed as the penalty to enhance the representations' clustering friendliness. Furthermore, a dual autoencoder-based deep SC (DAE-DSC) method was proposed that jointly optimizes the deep autoencoder and the deep SC networks [149]. It employs a dual autoencoder network to learn more robust reduced representations and uses mutual information to more effectively preserve discriminative information. Moreover, a general deep clustering model was developed by integrating the traditional clustering models, e.g., the k-means and the GMM, into the deep networks [150]. It yields a much higher accuracy than traditional methods. The autoencoder was successfully introduced to subspace clustering and delivered competitive performance by employing deep networks to more effectively deal with feature extraction and data nonlinearity simulation tasks. For example, in [151], a deep subspace clustering algorithm based on an autoencoder network (DSCNet) was proposed. It inserts a self-expressive layer between the encoder and the decoder to learn the pairwise adjacency between data points via back propagation, and it integrates the reconstruction loss and the self-representation loss to learn more discriminative representations. In addition, a structured autoencoder-based subspace clustering (StructAE) method was developed, which constructs a structured autoencoder network to more effectively preserve the local and global structure information of data [152]. Furthermore, a self-supervised convolutional subspace clustering network (S2ConvSCN) was put forward [153]. It employs a convolutional autoencoder network to fully explore the spatial information of an image, and it adds a self-expression module and an SC module into the network to generate a trainable end-to-end model. Then, it jointly optimizes the feature extraction and subspace clustering in a self-supervised way. Moreover, the LRDSC algorithm introduced previously obtains a higher clustering accuracy for HSIs by introducing a convolutional autoencoder to subspace clustering [137]. In addition, in [154], a deep subspace clustering band selection model was developed for HSIs, combining a convolutional autoencoder network with the SSC model and producing a good band selection effect.
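A heavily simplified sketch of the joint objective in (20), alternating between network updates and centroid/assignment updates, is given below. It is not the published DCN implementation; the layer sizes, learning rate, tradeoff weight, and toy data are arbitrary, and a full-batch loop replaces the stochastic alternating scheme used in [147].

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, n_bands, n_latent=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bands, 64), nn.ReLU(),
                                     nn.Linear(64, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 64), nn.ReLU(),
                                     nn.Linear(64, n_bands))

    def forward(self, y):
        z = self.encoder(y)
        return z, self.decoder(z)

def dcn_style_clustering(Y, n_clusters, gamma=0.1, epochs=50, lr=1e-3):
    """Y: N x D tensor of pixel spectra. Alternates between (a) recomputing hard
    assignments and centroids in the latent space and (b) updating the network
    with reconstruction plus clustering-assignment losses, as in (20)."""
    net = TinyAutoencoder(Y.shape[1])
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    with torch.no_grad():
        z, _ = net(Y)
        centroids = z[torch.randperm(Y.shape[0])[:n_clusters]].clone()
    for _ in range(epochs):
        with torch.no_grad():                       # (a) assignment/centroid step
            z, _ = net(Y)
            assign = torch.cdist(z, centroids).argmin(dim=1)
            for k in range(n_clusters):
                if (assign == k).any():
                    centroids[k] = z[assign == k].mean(dim=0)
        z, y_hat = net(Y)                           # (b) network step
        loss = ((y_hat - Y) ** 2).sum(dim=1).mean() \
               + 0.5 * gamma * ((z - centroids[assign]) ** 2).sum(dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return assign

Y = torch.rand(500, 100)          # toy stand-in: 500 pixels, 100 bands
print(dcn_style_clustering(Y, n_clusters=5)[:20])
```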
SEPARATED NETWORK-BASED CLUSTERING
Separated network-based clustering methods generally optimize a deep network only by the clustering loss, with the network and the clustering model separated. Although the basic network can be very deep, these methods may fail to learn informative features for clustering, due to the absence of a network constraint, such as the reconstruction loss. Therefore, the initialization of the network is crucial for these methods. Generally, the network is pretrained or randomly initialized. A typical example is clustering based on CNN (CCNN) [155]. It deals with the feature extraction and clustering tasks within the CNN framework in an iterative way. First, it employs a CNN pretrained on ImageNet to extract features for initial clustering, with c randomly selected centroids and the minibatch k-means utilized. Then, it exploits the difference between the labels predicted by the CNN and the minibatch k-means to fine-tune the network, based on the stochastic gradient descent algorithm, and simultaneously updates the cluster centroids, as in (21). The CCNN employs feature drift compensation to relieve mismatching to further improve the clustering accuracy:

$$\mathrm{SSE} = \frac{1}{2}\sum_{j=1}^{c}\left(y_j - t_j\right)^2, \qquad (21a)$$

$$\mu_j^{(k)} = \left(1 - \gamma_j\right)\mu_j^{(k-1)} + \gamma_j\, h_{\mathrm{new}}^{(k)}, \qquad (21b)$$

where SSE denotes the sum of the squared error; $y_j$ and $t_j$ stand for the labels predicted by the CNN and the minibatch k-means, respectively; $\mu_j^{(k)}$ represents the jth centroid in the kth iteration; $h_{\mathrm{new}}^{(k)}$ indicates the extracted features from a minibatch assigned to $\mu_j$; and $\gamma_j$ is the learning rate of the jth centroid, which is defined as the reciprocal of the number of samples in the jth cluster. Unsupervised pretraining is widely utilized for separated network-based clustering methods. In [156], a deep belief network (DBN) nonparametric clustering (DBNC) algorithm was proposed that relies on an unsupervised pretrained DBN. It learns the reduced representations of data through the pretrained DBN and employs nonparametric clustering with the maximum margin to perform clustering, with the parameters of the top layer of the DBN fine-tuned subsequently. In addition, a deep embedded clustering (DEC) method was developed [157]. It first pretrains a stacked autoencoder network, based on the reconstruction loss in an unsupervised way, and drops the decoder part. Then, it fine-tunes the network with the clustering loss and refines the clustering result based on the Kullback–Leibler divergence between the soft assignment and the auxiliary distribution in an iterative way. Based on the DEC model, an improved version was put forward, by employing a
convolutional autoencoder network to learn more informative features for clustering [158].
Random initialization is also often utilized for separated network-based clustering models. For instance, in [159], a CNN-based joint unsupervised learning (JSL) algorithm was proposed. It starts with a random initialization and formulates a recurrent framework to jointly update the representations and clusters during the training process, with clustering as the forward pass and representation as the backward pass. In addition, a deep SSC (DSSC) method was developed, which combines a DNN with SSC [160]. It randomly initializes the network and iteratively refines the sparse coding and the clustering results in the forward propagation stage, with the parameters of the DNN updated in the backward propagation stage. Furthermore, a DNN-based SC (SpectralNet) method was proposed [161]. It randomly initializes the parameters of the network and considers three terms in the unsupervised training process: 1) affinity learning based on a Siamese network, 2) embedding learning under an orthogonality constraint to map the data into eigenspace, and 3) the clustering assignment. As a result, it significantly outperforms traditional SC methods.

GENERATIVE NETWORK-BASED CLUSTERING
Differing from autoencoder- and separated network-based deep clustering methods, generative network-based clustering approaches simultaneously perform clustering and uncover the underlying structure of data to generate new samples. These methods generally aim at learning the real structure of data as accurately as possible to create high-quality samples. Therefore, they can more effectively guarantee the discriminability of the extracted features. The most representative generative deep clustering methods are the generative adversarial network (GAN)-based approaches. These methods commonly include two parts, i.e., the generator and the discriminator, and improve the quality of the extracted features through the antagonism between the generator and the discriminator. A typical example is the categorical GAN (CatGAN) [162]. It plays a minimum–maximum adversarial game to learn a discriminative classifier in an unsupervised way, by trading off mutual information between the observations and the predicted class distribution. Its discriminator and generator are defined as in (22):

$$L_D = \max_{D}\ G\big[p(y \mid D)\big] - \mathbb{E}_{Y_i}\Big[G\big[p(y \mid Y_i, D)\big]\Big] + \mathbb{E}_{z}\Big[G\big[p(y \mid G(z), D)\big]\Big], \qquad (22a)$$

$$L_G = \min_{G}\ -G\big[p(y \mid D)\big] + \mathbb{E}_{z}\Big[G\big[p(y \mid G(z), D)\big]\Big], \qquad (22b)$$

where G[·] denotes the empirical entropy, y is the predicted label of a given example $Y_i$, and z is a generated noise vector from a prior distribution P(z), with D and G representing the discriminator and the generator, respectively. Through adversarial learning, the CatGAN effectively
enhances the discriminability and robustness of the classifier and yields a high clustering accuracy. Based on the CatGAN, in [163], an information-maximizing GAN (InfoGAN) was proposed. It enhances the clustering performance by exploiting the mutual information between a fixed small subset of latent variables and the observations. Furthermore, a deep adversarial GMM autoencoder clustering (DAGMC) algorithm was developed [164]. It uses an adversarial autoencoder network to learn the reduced representations and employs a tunable GMM for clustering. It simultaneously considers the autoencoder, the GMM, and the adversarial losses in its objectives, which are optimized by the stochastic gradient descent algorithm. Moreover, a deep adversarial subspace clustering (DASC) method was put forward [165]. It is also based on an adversarial autoencoder network, with a generator for subspace estimation and the clustering assignment and a discriminator for clustering performance evaluation. Then, it progressively learns more informative representations, with the self-representation and subspace clustering tasks supervised by adversarial learning. In addition to GAN-based models, a set of variational autoencoder (VAE)-based generative deep clustering methods has been developed, integrating certain probability models into a deep network to learn the distribution of data for sample generation. For example, in [166], a variational deep embedding (VaDE) algorithm was proposed. It integrates the GMM into the VAE network for sample generation and optimizes the clustering problem by maximizing the evidence lower bound via the stochastic gradient variational Bayes estimator. In addition, a VAE with Gaussian mixture (VAE-GM) method was put forward [167]. It generates samples from a prior distribution, i.e., a Gaussian mixture, and introduces the minimum information constraint to relieve the over-regularization of the VAE. As a result, it yields a high clustering accuracy. Overall, deep learning-based clustering methods can bring about a higher clustering accuracy due to their powerful feature learning and nonlinearity-fitting capabilities. Accordingly, they have become a research hot spot in the clustering field. However, most deep clustering methods are concentrated in the computer vision field, with rare trials in the hyperspectral remote sensing arena. Hence, more deep learning-based hyperspectral clustering methods should be developed to promote the development of this field. In addition, most of the relevant works focus on improving the clustering performance while ignoring the theoretical exploration behind the performance, which leads to the poor interpretability of these methods and limits their popularization and application to a certain degree.

HYBRID MECHANISM-BASED CLUSTERING MODELS
Hybrid mechanism-based methods deal with the clustering task by combining two or more models, as presented in Figure 10. Considering that a single clustering model generally has certain shortcomings, such techniques integrate
the advantages of different schemes to further improve the clustering performance. A typical example is the combination of the centroid-based clustering scheme with other approaches, such as the k-GMM [168]. The k-GMM combines centroid- and probability-based clustering. Taking advantage of the k-means and the GMM, k-GMM obtains better clustering accuracy for HSIs. In addition, in [169], an improved fast density peaks-based clustering (k-FDPC) algorithm was proposed, which is a hybridization of the k-means and the CFSFDP. Based on the CFSFDP approach, this algorithm calculates the local density based on an adaptive bandwidth pdf, and it searches cluster centroids by fitting the density and distance decision graph. Subsequently, it infers the pixel assignment through the k-means. As a result, k-FDPC outperforms both k-means and CFSFDP. In addition, bionics-based clustering can also be combined with other approaches. For example, in [170], a fuzzy Kohonen local information c-means clustering (FKLICM) method was put forward. It employs the Kohonen NN to model the complexity of remote sensing images and integrates the discriminative rules of the FLICM to enhance the discriminability of the model. Consequently, more accurate clustering results are obtained. In addition, by combining the advanced artificial bee colony (ABC) model with MRF, a novel ABC–MRF clustering algorithm was developed for HSIs [171]. The ABC model is utilized to better search cluster centroids and optimize the objective function, with the MRF utilized to incorporate spatial neighborhood information to further improve the clustering accuracy. Moreover, the graph-based clustering scheme can be flexibly combined with other clustering schemes as well. For example, in [172], a Gaussian SC model (GSC) was constructed by integrating the powerful information extraction ability of the graph model into the GMM framework. In [173], by combining the k-means with the graph model, a graph-based k-means (G-k-means) technique was developed. It utilizes the graph model to estimate the parameters and initializations for the k-means, which effectively improves the clustering performance and obtains an accurate segmentation result for HSIs in Mars exploration. In addition, by combining anchor graph-based clustering with subspace clustering, a sparse dictionary-based anchor regression (SDCR) algorithm was introduced for HSIs [174]. It constructs a more representative dictionary through dictionary learning with double sparsity constraints, and it utilizes the anchor subspace regression to efficiently evaluate the similarity between hyperspectral pixels. With the help of SC, the final clustering result is obtained. By integrating the advantages of the anchor graph and the subspace, SDCR achieves good performance. Generally speaking, by comprehensively taking advantage of two or more different clustering schemes, hybrid mechanism-based clustering methods can overcome the defects of both techniques and may bring about better clustering performance. In theory, hybridization can be extended to any two or more clustering schemes, and progressively more attractive hybrid clustering methods may be developed through future research.
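In the spirit of graph-initialized k-means hybrids such as [173], the following sketch (an illustration under assumptions, not that paper's algorithm) lets a graph-based clustering provide the initial centroids that seed an ordinary k-means run on the raw spectra; the affinity choice and parameters are placeholders.

```python
import numpy as np
from sklearn.cluster import SpectralClustering, KMeans

def graph_seeded_kmeans(Y, n_clusters):
    """Y: N x D pixel matrix. Spectral clustering on an RBF affinity gives a
    coarse partition; its per-cluster mean spectra then initialize k-means."""
    coarse = SpectralClustering(n_clusters=n_clusters,
                                affinity="rbf").fit_predict(Y)
    init = np.stack([Y[coarse == k].mean(axis=0) for k in range(n_clusters)])
    km = KMeans(n_clusters=n_clusters, init=init, n_init=1).fit(Y)
    return km.labels_

Y = np.random.rand(300, 50)       # toy stand-in for N pixels with 50 bands
print(graph_seeded_kmeans(Y, n_clusters=4)[:20])
```

The point of such hybrids is that the cheap centroid-based model does the bulk of the assignment work while the graph model supplies the structural prior it lacks.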
FIGURE 10. The hybrid mechanism-based clustering scheme. (a) The HSI. (b) The hybrid clustering model construction. (c) The hybrid model optimization. (d) The clustering result.

EXPERIMENTS
In this section, the performance of some popular and representative hyperspectral clustering algorithms is evaluated, including FCM (https://github.com/wwwwwwzj/fcm) [45], FCM-S1 [64], CFSFDP (https://github.com/DesperadoZ/Density_Peak_Clustering) [71], GMM (https://github.com/AdamaTG/Matlab_GMM) [79], SC (https://github.com/jhliu17/SpectralClustering) [105], FSCAG [43], SGCNR [121], SSC (http://vision.jhu.edu/code/) [40], [122], and L2-SSC [41]. Specifically, FCM is one of the most representative centroid-based clustering methods, while FCM-S1 is a classical improved version of FCM, achieved by incorporating spatial information. CFSFDP is a representative density-based
clustering approach. GMM is a typical probability-based clustering method. SC is a complete graph-based clustering technique, while FSCAG and SGCNR are two recently developed state-of-the-art abbreviated graph-based clustering approaches. SSC is a representative subspace clustering method, and L2-SSC is a very competitive spectral–spatial subspace clustering technique for HSIs. These clustering methods were tested on two well-known HSIs, i.e., the Indian Pines image and the University of Houston image, with both cluster maps and quantitative assessments provided for comprehensive evaluation and comparison. Specifically, the producer's accuracy (PA), user's accuracy (UA), overall accuracy (OA), kappa, and purity were utilized for quantitative analysis. In addition, the running time of each method is also given. The parameters of each approach were manually adjusted to be optimal. In the experiments, the clusters' thematic information was automatically determined by the widely used Hungarian algorithm [175], [176], with the number of clusters set as the quantity of the classes in the ground truth [22], [43], [121]. Until now, there has been no unified standard for the utilization of unlabeled pixels in the ground truth in the hyperspectral clustering field. Some works utilize all the pixels and give the cluster map of the whole image [21], [39], [68], while other works give only the cluster map of the labeled pixels in the ground truth [22], [77], [117]. Generally speaking, each of these strategies has advantages. The former is more in line with the working mechanism of unsupervised clustering, as there is no available prior knowledge. The latter can more clearly present the differences between the clustering results of different algorithms. In this article, the latter is utilized. The Indian Pines image was collected by the Airborne Visible/Infrared Imaging Spectrometer sensor over northwestern Indiana on 12 June 1992. This image has a size of 145 × 145 pixels and 220 spectral bands, with a spatial resolution of 20 m. In the experiments, only 200 bands were utilized for analysis, with 20 bad-quality bands removed. This scene covers an agricultural area and has a relatively concentrated land cover distribution. It contains 16 different classes, with many subclasses of vegetation. As in [21] and [177], nine main classes are utilized for clustering. The false-color image and the ground truth are shown in Figure 11(a) and (b). Figure 11(c) displays the mean spectra of the nine classes, with the t-distributed stochastic neighbor embedding (t-SNE) graph of labeled samples of the nine classes given in Figure 11(d) [178], [179], from which it can be found that different classes are mixed together and are difficult to separate, leading to a very challenging clustering task. The University of Houston image is a relatively new HSI data set, provided by the 2013 IEEE Geoscience and Remote Sensing Society Data Fusion Competition. It was obtained above the University of Houston by the National Center for Airborne Laser Mapping sensor on 23 June 2012. Different from the Indian Pines image, this scene mainly covers an urban area with a relatively complex land cover distribution. The image has a size of 349 × 1,905 pixels, with 144 spectral bands. In the experiment, a typical subset at a size of 160 × 150 × 144 was utilized, with seven main classes included [21].
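The label alignment and accuracy computation described above can be sketched generically as follows. This is an illustrative implementation, not the exact evaluation code used for Tables 2 and 3: it assumes the comparison is restricted to labeled pixels, uses SciPy's linear_sum_assignment for the Hungarian step, and the variable names and toy data are placeholders.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def align_and_score(y_true, y_pred):
    """Maps cluster indices to ground-truth classes with the Hungarian algorithm
    (maximizing the matched pixel count), then reports OA, kappa, and purity.
    y_true, y_pred: 1D integer arrays over the labeled pixels only."""
    cm = confusion_matrix(y_true, y_pred)              # classes x clusters
    rows, cols = linear_sum_assignment(-cm)            # maximize the diagonal
    mapping = {c: r for r, c in zip(rows, cols)}
    y_mapped = np.array([mapping.get(c, -1) for c in y_pred])
    oa = (y_mapped == y_true).mean()
    kappa = cohen_kappa_score(y_true, y_mapped)
    # Purity: each cluster votes for its majority ground-truth class.
    purity = sum(np.bincount(y_true[y_pred == c]).max()
                 for c in np.unique(y_pred)) / y_true.size
    return oa, kappa, purity

y_true = np.random.randint(0, 3, 500)                  # toy ground truth
y_pred = np.random.randint(0, 3, 500)                  # toy cluster labels
print(align_and_score(y_true, y_pred))
```

PA and UA then follow per class from the confusion matrix of y_true against y_mapped.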
The false-color image, the ground truth, the mean spectra, and the t-SNE graph of labeled samples of the seven classes appear in Figure 12. Cluster maps of the different methods are provided in Figures 13 and 14, with the quantitative evaluations given in Tables 2 and 3, respectively. Comprehensively analyzing the experimental results, it can be seen that, in general, the spectral–spatial methods outperform the spectral-based approaches by taking full advantage of the spectral–spatial duality of HSIs to obtain smoother clustering results with a higher accuracy, which suggests that the spatial information is informative and favorable for clustering. Specifically, it can be seen that FCM and FCM-S1 fail to perform well for both HSI data sets, with a large number of misclassifications, a significant amount of within-class noise in the cluster map, and relatively lower clustering accuracy. Generally speaking, centroid-based methods are more suitable for data that have a well-separated and near-spherical geometric structure [180]. However, this performance guarantee is generally violated for HSIs. Comparatively speaking, these methods perform better on the second image scene, where there is better separability, as shown in Figures 11 and 12. Similarly, GMM also performs poorly for both image scenes because its assumption that samples from different classes obey a union of Gaussian distributions cannot be fully satisfied by HSIs. Although CFSFDP obtains relatively smooth clustering results for both scenes, there are several important classes that are not effectively recognized, especially for the Indian Pines image. This is because density-based methods commonly have strong assumptions about the distribution of the feature space and are suitable for data with a multimodal distribution and nonlinear structure [180]. Unfortunately, the complexity of HSIs generally conflicts with the performance guarantee of these methods. By comparison, SC performs better, as it more accurately exploits the similarity among pixels by means of the graph. It obtains the second-best clustering accuracy for the Indian Pines image and the fourth-best clustering accuracy for the University of Houston image. Since the abbreviated graph cannot accurately model the relationships among pixels, SGCNR and FSCAG fail to obtain good performance for both scenes, although they are very efficient. There are a large number of misclassifications and a notable amount of noise in the cluster map for both scenes. In general, graph-based methods also need certain performance guarantees and are more suitable for data whose geometric structure is such that samples from different classes are almost orthogonal or the overlap between classes is small relative to their separability [180]–[182], conditions that cannot be well satisfied by HSIs.
Relative to these methods, the recently developed subspace clustering techniques, i.e., SSC and L2-SSC, may better model the complex structure of HSIs and relieve the large spectral variability with the subspace model. Through self-representation learning, interactions among pixels can be more effectively exploited, and the underlying adjacency between pixels can be more accurately learned, which might guarantee that pixels are partitioned into the correct groups. As a result, SSC and L2-SSC have a relatively good performance and show significant potential for HSIs. L2-SSC achieves the best clustering results, with smoother cluster maps and higher clustering accuracy for both scenes. However, behind the good performance, some restrictive assumptions are needed to guarantee their effectiveness, i.e., a tolerable noise level to support a strict subspace model, enough samples for each subspace, and a low affinity between different subspaces, which has been theoretically proved [183]. It can be found that subspace clustering methods fail to obtain a satisfactory accuracy for the Indian Pines image, due to the strong noise and serious overlap between different classes, while the approaches perform well for the noiseless University of Houston image, with larger distances between different classes. In addition, it should be noted that SSC and L2-SSC are troubled by the large computational complexity and are time consuming compared with the other clustering methods, which is a shortcoming of such approaches that needs to be solved.

FIGURE 11. The Indian Pines data set. (a) The original image (red: 40; green: 30; blue: 20). (b) The ground truth. (c) The mean spectra of the nine classes. (d) The t-SNE graph of labeled samples of the nine classes.
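Panels such as Figures 11(d) and 12(d) can be reproduced in spirit with a short t-SNE script. This is a generic illustration (the data loading and class selection are assumed to have been done already, and the t-SNE settings are placeholders rather than those used for the figures):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(X, labels, title="t-SNE of labeled samples"):
    """X: N x D matrix of labeled pixel spectra; labels: N integer class ids.
    Projects the spectra to 2D with t-SNE and scatters them by class."""
    emb = TSNE(n_components=2, init="pca", perplexity=30,
               random_state=0).fit_transform(X)
    for c in np.unique(labels):
        plt.scatter(emb[labels == c, 0], emb[labels == c, 1], s=4, label=str(c))
    plt.xlabel("Dimension One After DR")
    plt.ylabel("Dimension Two After DR")
    plt.title(title)
    plt.legend(markerscale=3, fontsize=7)
    plt.show()

X = np.random.rand(900, 200)                 # toy stand-in for labeled spectra
plot_tsne(X, np.random.randint(0, 9, 900))
```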
FIGURE 12. The University of Houston data set. (a) The original image (red: 110; green: 40; blue: 10). (b) The ground truth. (c) The mean spectra of the seven classes. (d) The t-SNE graph of labeled samples of the seven classes.

SUMMARY AND DISCUSSION
Hyperspectral remote sensing images provide a wealth of spectral information and show subtle differences between various classes to support fine land cover classification, and they have been an important data resource in various applications. As typical high-dimensional data, the interpretation of HSIs relies on a large number of labeled samples. However, it is very difficult to acquire high-quality labeled samples in practice. Therefore, during recent decades, many clustering methods have been developed for HSIs to deal with the interpretation task in an unsupervised way. In this article, we systematically reviewed the existing hyperspectral clustering methods in the literature and summarized them into nine main kinds, i.e., centroid-based, density-based, probability-based, bionics-based, intelligent computing-based, graph-based, subspace clustering, deep learning-based, and hybrid mechanism-based. In addition, we introduced the principle and mechanism of each type of clustering method and reviewed the representative approaches in detail, with their advantages and disadvantages briefly summarized. From this research, we find that the development of hyperspectral clustering is not balanced. The development of the centroid-, density-, and probability-based clustering methods is more mature, especially for the former two approaches. The achievements of these two kinds of clustering methods are relatively more abundant. Research on bionics-based
clustering is relatively few, which demands more attention in future work. In addition, deep learning has obtained remarkable achievements in the computer vision field, but it has few applications in the hyperspectral clustering arena. More effort and trials are needed in the future. Recently, graph-based clustering and subspace clustering have gained increasing attention due to their relatively good clustering performance, and more and more algorithms have been proposed.

FIGURE 13. Cluster maps of the different methods for the Indian Pines image. (a) The ground truth. (b) FCM. (c) FCM-S1. (d) CFSFDP. (e) GMM. (f) SC. (g) SGCNR. (h) FSCAG. (i) SSC. (j) L2-SSC.

FIGURE 14. Cluster maps of the different methods for the University of Houston image. (a) The ground truth. (b) FCM. (c) FCM-S1. (d) CFSFDP. (e) GMM. (f) SC. (g) SGCNR. (h) FSCAG. (i) SSC. (j) L2-SSC.
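For reference, the simplest centroid- and probability-based baselines compared in Tables 2 and 3 can be run on a flattened HSI in a few lines (an illustrative sketch only: k-means stands in here for FCM, since scikit-learn ships no fuzzy c-means, and the reshaping convention, covariance type, and toy data are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def baseline_cluster_maps(cube, n_clusters):
    """cube: H x W x D hyperspectral image. Flattens it to an (H*W) x D matrix and
    returns H x W cluster maps from a centroid- and a probability-based baseline."""
    H, W, D = cube.shape
    Y = cube.reshape(-1, D)
    km_map = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Y).reshape(H, W)
    gmm_map = GaussianMixture(n_components=n_clusters,
                              covariance_type="diag").fit_predict(Y).reshape(H, W)
    return km_map, gmm_map

cube = np.random.rand(60, 60, 100)     # toy stand-in, e.g., an Indian Pines subset
km_map, gmm_map = baseline_cluster_maps(cube, n_clusters=9)
print(km_map.shape, gmm_map.shape)
```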
Moreover, we comprehensively compared and analyzed the performance of several popular hyperspectral clustering methods on two well-known HSIs. From the experimental results, we find that, in general, spectral–spatial methods outperform spectral-based methods, which indicates the importance of spatial information. Centroid-, density-, and probability-based methods, e.g., FCM, FCM-S1, CFSFDP, and GMM, do not perform well because their assumptions TABLE 2. QUANTITATIVE EVALUATIONS OF THE DIFFERENT METHODS FOR THE INDIAN PINES IMAGE. METHOD CLASS FCM FCM-S1 CFSFDP GMM SC SGCNR FSCAG SSC L2-SSC PA (%) Cluster 1 35.78 37.9 58.33 25.59 30.21 39.27 28.6 48.39 44.33 Cluster 2 0 0 3.86 17.54 0.96 0 3.01 8.07 0 Cluster 3 56.73 60.25 0 36.65 57.56 39.09 51.76 59.42 65.01 Cluster 4 52.19 54.79 66.71 67.97 77.4 55.21 64.96 67.67 76.71 Cluster 5 98.95 100 0 84.35 99.58 90.79 99.71 88.08 100 Cluster 6 49.38 49.59 59.57 30.31 53.5 40.64 55.95 34.36 45.68 Cluster 7 42.93 42.62 15.72 56.43 47.98 50.26 40.17 48.92 67.09 Cluster 8 33.73 34.57 0 15.08 28.84 16.16 27.52 2.87 2.53 Cluster 9 47.43 45.68 99.68 64.43 55.73 48.08 58.04 60.16 49.96 Cluster 1 52.95 53.34 28.1 37.96 63.78 44.81 65.96 45.22 48.62 Cluster 2 0 0 8.74 7.42 18.18 0 3.14 7.55 0 Cluster 3 30.34 30.04 0 27.75 33.03 36.78 33.73 34.41 34.97 Cluster 4 94.54 93.9 53.22 75.45 91.87 67.05 82.96 96.67 98.94 Cluster 5 69.56 68.92 0 80.28 60.79 62.64 94.51 98.14 93.91 Cluster 6 20.78 21.01 23.22 28.48 21.54 23.05 22.09 28.64 23.97 Cluster 7 51.92 52.89 58.13 45.27 49.41 45.9 47.04 45.03 51.78 Cluster 8 21.93 22.76 0 25.19 23.42 20.82 20.45 4.05 9.74 Cluster 9 88.76 91.02 77.08 84.16 94.5 77.11 92.73 96.09 98.29 OA (%) 43.03 43.55 38.75 45.18 46.92 42.45 43.99 46.27 51.15 Kappa 0.3427 0.3497 0.2927 0.3486 0.3839 0.3227 0.353 0.3676 0.419 Purity 0.5528 0.5596 0.473 0.5205 0.5498 0.4843 0.5642 0.5489 0.5647 Time (s) 69 381 497 30.68 5409 7.42 1.44 32764 13532 UA (%) Cluster 1: corn-notill; cluster 2: corn-minimum-till; cluster 3: grass/pasture; cluster 4: grass/trees; cluster 5: hay-windrowed; cluster 6: soybeans-no-till; cluster 7: soybeans-minimum-till; cluster 8: soybeans-clean; cluster 9: woods. TABLE 3. THE QUANTITATIVE EVALUATION OF THE DIFFERENT METHODS FOR THE UNIVERSITY OF HOUSTON IMAGE. 
METHOD CLASS FCM FCM-S1 CFSFDP GMM SC SGCNR FSCAG SSC L2-SSC PA (%) Cluster 1 51.73 51.12 99.82 68.37 88.72 66.67 74.59 79.53 91.8 Cluster 2 95.16 97.25 93.96 39.52 96.52 38.94 38.68 58.42 99.08 Cluster 3 94.45 95.51 99.94 95.01 93.48 59.78 76.09 95.25 96.61 Cluster 4 49.44 68.5 2.07 28.42 73.69 37.58 87.66 68.67 64.32 Cluster 5 100 100 99.76 99.95 100 76.66 96.71 98.9 100 Cluster 6 16.37 0 98.42 41.21 0.39 79.16 58.82 94.87 98.42 Cluster 7 67.32 72.32 94.91 73.48 54.54 48.92 82.81 69.07 89.09 Cluster 1 100 98.57 98.72 99.75 100 65.93 98.59 99.77 99.27 Cluster 2 47.51 49.54 99.81 23.66 50.87 13.73 20.18 51.04 68.48 Cluster 3 90.32 92.21 76.6 82.6 96.08 81.05 95.58 96.44 92.69 Cluster 4 43.89 79.82 30.1 37.13 78.47 39.32 66.49 88.45 86.65 Cluster 5 82.65 88 96.24 88.93 75.16 59.88 92.81 46 74.28 Cluster 6 17.28 0 99.87 22.04 0.5 43.67 60 100 99.07 Cluster 7 69.65 67.34 98.38 80.43 60.1 79.02 79.16 78.45 99.03 OA (%) 73.99 76.62 86.69 74.01 78.27 57.45 77.14 83.82 91.13 Kappa 0.6614 0.6968 0.8199 0.6556 0.7209 0.4756 0.7131 0.7921 0.8849 Purity 0.8083 0.8301 0.87 0.7903 0.8153 0.6931 0.8479 0.8382 0.9113 Time (s) 32 391 507 39 24382 7.47 1.28 24382 53281 UA (%) Cluster 1: building 1; cluster 2: building 2; cluster 3: grass; cluster 4: trees; cluster 5: grass-synthetic; cluster 6: running track; cluster 7: bare soil. DECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE 59
cannot be fully satisfied by HSIs. FCM and FCM-S1 have a low complexity of O(MNDct) and are relatively efficient, and thus they are suitable for large hyperspectral data sets, where t denotes the number of iterations. CFSFDP has a higher complexity of approximately O((MN)^2) and requires a relatively large memory to store a sizeable pairwise pixel distance matrix, limiting its suitability for large HSIs. GMM has a relatively large complexity of O((MN)^2 ct), which degrades its suitability for large HSIs to some degree. By comparison, complete graph-based methods, e.g., SC, may perform well, but they are troubled by a large computational cost. Specifically, SC has a large complexity of O((MN)^2 Dt) and is time consuming, which reduces its practicability to a large degree. Comparatively speaking, abbreviated graph-based methods, e.g., FSCAG and SGCNR, are very efficient and suitable for large HSIs because of their lower complexity. The complexities of FSCAG and SGCNR are O(MNDu) and O(MND log u + MNc^2 + MNcv + c^3), respectively, where u is the number of anchors and v is the number of nearest neighbors, with u, v ≪ MN [43], [121]. However, their clustering accuracy cannot be guaranteed. Relative to the above methods, subspace clustering approaches, e.g., SSC and L2-SSC, may deal better with the clustering task for HSIs and bring about a competitive clustering performance. However, such methods generally have a very large computational complexity of about O((MN)^3 t) and are very time and memory consuming, which degrades their attractiveness in practice and hinders their application to large hyperspectral data sets to a large degree. In general, clustering is an important and necessary technique for HSI interpretation, but it has much room for improvement. Accuracy, efficiency, and intelligence may be the major lines of development for hyperspectral clustering in the future. Based on the research status of hyperspectral clustering, the challenges and possible future research lines are pointed out as follows.

DEVELOPING EFFECTIVE AND EFFICIENT MODELS
Accuracy and efficiency are both very important in practical applications. However, most current hyperspectral clustering algorithms cannot simultaneously consider these two aspects very well. For example, subspace clustering may bring about a higher clustering accuracy, but significant computational complexity generally follows, which degrades the technique's scalability to large scenes and limits its practical applications. Centroid-, density-, and probability-based methods are generally efficient but have limited clustering accuracy for HSIs. Hence, how to develop more effective and efficient hyperspectral clustering models with a high accuracy and a low time cost is an interesting and attractive topic. Generally speaking, combining the advantages of different clustering models, such as hybrid mechanism-based clustering, may be an effective way to overcome these obstacles. In addition, combining advanced clustering models with high-performance computing techniques, such as parallel computing, may greatly enhance the efficiency while guaranteeing a high clustering accuracy.

DEVELOPING MULTIFEATURE-BASED METHODS
Hyperspectral remote sensing images generally come with the serious problem that pixels from the same class have different spectra, while pixels from different classes have similar spectra, which greatly degrades the separability among different classes.
Multiple features from different views/domains, e.g., spectrum, texture, and geometry, describe ground objects from different views and can provide complementary information to effectively enhance the discriminable capability of a clustering model to improve the clustering accuracy. However, most existing clustering methods integrate the spatial information by means of regularization to simply explore the discriminability of the spectral–spatial information or simply fuse multiple features through concatenation, which does not fully excavate the potential of the multidomain information in HSIs. Therefore, more advanced multiple featured-based clustering methods should be developed to further improve the clustering accuracy. DEVELOPING OBJECT- OR SUPERPIXEL-BASED METHODS Hyperspectral remote sensing images contain abundant and complex spatial neighborhood information; however, they are seriously influenced by noise during the imaging process. Most existing clustering methods are pixel-based methods, which have several inherent shortcomings. First, pixel-based methods are easily affected by salt-and-pepper noise, resulting in fragmented cluster maps. Second, pixel-based methods cannot flexibly model the spatial neighborhoods with various shapes, which leads to an inadequate exploitation of the spatial information of HSIs. Last, due to the large number of pixels, pixel-based methods may be troubled by a large computational cost, especially for graph-based clustering and subspace clustering methods. At this point, object/ superpixel-based clustering techniques can effectively overcome these obstacles. Thus, more object-/superpixel-based clustering methods should be developed for HSIs to further improve the clustering performance. PUSHING DEEP LEARNING INTO THE HYPERSPECTRAL CLUSTERING FIELD Hyperspectral remote sensing images generally have a typical nonlinear structure due to the complex imaging environment and the influences of many nonlinear factors. Thus, pixels from different classes are commonly not linearly separable. On the other hand, low-level spectral or spatial features have a limited discriminability and cannot well distinguish various classes with high similarity. However, most existing clustering methods are based on linear models to approximate the nonlinearity of HSIs, which leads to a large systematical error, or simply deal IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
with the nonlinearity of HSIs through the kernel technique. It should be noted that the kernel approach is, in essence, a template-based model, which results in a large computing amount and can only alleviate the nonlinear separable problem to a certain extent, which restricts the technique’s practical applications. Many deep learning-based methods have been developed in the computer vision field and shown powerful capabilities for nonlinear fitting and feature extraction, but the successful deployments in hyperspectral clustering are very rare. Generally speaking, current hyperspectral clustering methods remain at the stage of shallow learning and only utilize the low-level features of HSIs, which yields a limited clustering accuracy. Due to the huge differences between hyperspectral data and natural figures, directly introducing deep models in the computer vision field to HSIs generally fails to obtain a satisfactory effect. Therefore, how to adjust/ modify deep models to better learn the intrinsic structure of HSIs and extract more informatic and discriminative features to further improve the clustering performance would be a very promising research line. AUTOMATICALLY ESTIMATING THE NUMBER OF CLUSTERS Automatically and accurately estimating the number of clusters is very important for hyperspectral clustering, which promotes clustering applications to be more intelligent and attractive in practical applications. However, most current studies focus on improving the clustering models, and studies on the automatic estimation of the number of clusters are relatively few. Although some methods can automatically estimate the number of clusters for HSIs, they are generally bound to specific clustering models, e.g., FCM, and have a limited universality. Hence, finding a technique to automatically and accurately estimate the number of clusters in a more generic way will be an interesting and important research direction in the future. ACKNOWLEDGMENTS This work was funded, in part, by the Special Foundation for National Science and Technology Basic Research Program of China, under grant 2019FY202503; the National Key Research and Development Program of China, under grant 2018YFB0504500; the National Natural Science Foundation of China, under grants 42001313, 61871298, and 42071322; and the Fundamental Research Funds for Central Universities, under grant G1323520273. Readers who have questions about the article are encouraged to directly contact the corresponding author, Hongyan Zhang (zhanghongyan@whu.edu.cn). AUTHOR INFORMATION Han Zhai (zhaihan@cug.edu.cn) is with the School of Geography and Information Engineering, China University of Geosciences, Wuhan, 430074, China. He is a Member of IEEE. DECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE Hongyan Zhang (zhanghongyan@whu.edu.cn) is with the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, 430079, China. He is a Senior Member of IEEE. Pingxiang Li (pxLi@whu.edu.cn) is with the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, 430079, China. He is a Member of IEEE. Liangpei Zhang (zlp62@whu.edu.cn) is with the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan, 430079, China. He is a Fellow of IEEE. REFERENCES [1] A. Plaza et al., “Recent advances in techniques for hyperspectral image processing,” Remote Sens. Environ., vol. 113, pp. 
S110– S122, Sept. 2009. doi: 10.1016/j.rse.2007.07.028. [2] G. Camps-Valls, D. Tuia, L. Bruzzone, and J. A. Benediktsson, “Advances in hyperspectral image classification: Earth monitoring with statistical learning methods,” IEEE Signal Process. Mag., vol. 31, no. 1, pp. 45–54, Jan. 2014. doi: 10.1109/ MSP.2013.2279179. [3] M. Imani and H. Ghassemian, “An overview on spectral and spatial information fusion for hyperspectral image classification: Current trends and challenges,” Inf. Fusion, vol. 59, pp. 59–83, July 2020. doi: 10.1016/j.inffus.2020.01.007. [4] P. Duan, X. Kang, S. Li, P. Ghamisi, and J. A. Benediktsson, “Fusion of multiple edge-preserving operations for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 12, pp. 10,336–10,349, 2019. doi: 10.1109/TGRS.2019.2933588. [5] H. Zhang, Y. Song, C. Han, and L. Zhang, “Remote sensing image spatiotemporal fusion using a generative adversarial network,” IEEE Trans. Geosci. Remote Sens., early access, 2020. doi: 10.1109/TGRS.2020.3010530. [6] G. Camps-Valls and L. Bruzzone, “Kernel-based methods for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 6, pp. 1351–1362, 2005. doi: 10.1109/ TGRS.2005.846154. [7] H. Zhang, L. Liu, W. He, and L. Zhang, “Hyperspectral image denoising with total variation regularization and nonlocal low-rank tensor decomposition,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 5, pp. 3071–3084, 2019. doi: 10.1109/ TGRS.2019.2947333. [8] H. Zhai, H. Zhang, L. Zhang, and P. Li, “Cloud/shadow detection based on spectral indices for multi/hyperspectral optical remote sensing imagery,” ISPRS J. Photogram. Remote Sens., vol. 144, pp. 235–253, Oct. 2018. doi: 10.1016/j.isprsjprs.2018.07.006. [9] W. He, H. Zhang, and L. Zhang, “Total variation regularized reweighted sparse nonegative matrix factorization for hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 7, pp. 3909–3921, 2017. doi: 10.1109/TGRS.2017.2683719. [10] F. A. Kruse, J. W. Boardman, and J. F. Huntington, “Comparison of airborne hyperspectral data and EO-1 Hyperion for mineral mapping,” IEEE Trans. Geosci. Remote Sens., vol. 41, no. 6, pp. 1388–1400, 2003. doi: 10.1109/TGRS.2003.812908. 61
Change Detection From Very-High-Spatial-Resolution Optical Remote Sensing Images
Methods, applications, and future directions

DAWEI WEN, XIN HUANG, FRANCESCA BOVOLO, JIAYI LI, XINLI KE, ANLU ZHANG, AND JÓN ATLI BENEDIKTSSON

Digital Object Identifier 10.1109/MGRS.2021.3063465
Date of current version: 5 April 2021

Change detection is a vibrant area of research in remote sensing. Thanks to increases in the spatial resolution of remote sensing images, subtle changes at a finer geometrical scale can now be effectively detected. However, change detection from very-high-spatial-resolution (VHR) (≤5 m) remote sensing images is challenging due to limited spectral information, spectral variability, geometric distortion, and information loss. To address these challenges, many change detection algorithms have been developed. However, a comprehensive review of change detection in VHR images is lacking in the existing literature. This review aims to fill the gap and mainly includes three aspects: methods, applications, and future directions.
FIGURE 1. An outline of this review, including (a) applications, (b) methods, and (c) future directions.

BACKGROUND
Change detection is a vibrant area of research with wide-ranging applicability, including damage assessment, land management, and environment monitoring. Due to the revisit property of Earth observation sensors, multitemporal remote sensing images at a large geographical scale can be acquired easily and conveniently. Because of their extensive availability, optical images have become the main data sources for change detection [1]. Since these satellite sensors are able to acquire images with meter and submeter spatial resolutions, ground objects can be investigated in fine spatial detail [2]. Subtle change detection using these VHR images has drawn great interest in both the academic and industrial communities. However, multitemporal VHR images exhibit unique properties, such as limited spectral information, intrinsic spectral variability, spatial displacement, and information loss, that limit the usefulness of traditional change detection methods. Therefore, a great number of studies have been carried out on VHR change detection, and a series of new research topics has emerged along with advances in remote sensing technology and data computing methods. In this regard, a timely overview of VHR change detection is required to summarize the new techniques and applications.

Although a number of reviews about change detection using remote sensing data [3]–[10] exist in the literature, these publications discuss general change detection methods and do not focus on high-spatial-resolution images. Only a few available works involve VHR images, e.g., the reviews in [6] and [7]. However, those two works concern object-based change detection methods for VHR data, neglecting other aspects, e.g., recent technological advances in deep learning and multiview and 3D change detection. Moreover, specific applications of VHR change detection have rarely been summarized and discussed in the currently available literature. Therefore, a comprehensive review of change detection from VHR remote sensing images, including methods, applications, and future directions, is presented (Figure 1).

ISSUES RELATED TO VHR IMAGES AND THEIR CHANGE DETECTION
With the ongoing development of remote sensing imaging techniques, an increasing number of VHR sensors are available, and many new sensors are being planned and launched [11]. New platforms, such as unmanned aerial vehicles (UAVs) and remotely piloted aircraft systems, have grown in popularity [12] and are now providing a large amount of VHR remote sensing data.
TABLE 1. THE MAIN PARAMETERS OF SOME VHR SENSORS.

SENSOR       | SPATIAL RESOLUTION (M) | NUMBER OF BANDS | REVISIT TIME (DAYS) | LAUNCH YEAR
IKONOS       | 1    | Four | One to three     | 1999
QuickBird    | 0.61 | Four | 1.5–2.5          | 2001
SPOT-5       | 2.5  | Four | 26               | 2002
OrbView-3    | 1    | Four | Three            | 2003
Cartosat-2   | 0.8  | One  | Four             | 2007
WorldView-1  | 0.5  | One  | 1.7              | 2007
GeoEye-1     | 0.41 | Four | Fewer than three | 2008
WorldView-2  | 0.46 | Four | 1.1              | 2009
KOMPSAT-3    | 0.7  | Four | Three            | 2012
Ziyuan-3     | 2.1  | Four | Four to five     | 2012
SPOT-6/7     | 2    | Four | One              | 2012/2014
Gaofen-1     | 2    | Four | Fewer than four  | 2013
Gaofen-2     | 0.8  | Four | Four             | 2014
Planet Labs  | 3    | Four | One or two       | 2014
Deimos-2     | 1    | Four | One or two       | 2014
WorldView-3  | 0.31 | 16   | Fewer than one   | 2014
DMC-3        | 1    | Four | One              | 2015
WorldView-4  | 0.31 | Four | Fewer than one   | 2016

SPOT: Satellite Pour l'Observation de la Terre; KOMPSAT: Korean Multipurpose Satellite; DMC: Disaster Monitoring Constellation.

As seen in Table 1, the imaging capabilities of VHR platforms and sensors
are continually being improved with higher spatial resolutions, more spectral bands, and higher temporal revisit frequencies. In addition, most VHR sensors provide an along-track and across-track pair for stereo capture [12], [13]. With the improved capability of VHR remote sensing equipment, it is now becoming possible to achieve subtle, detailed, and frequent 3D change detection. Although change detection using VHR images is advantageous, from a technological point of view, it remains a challenge due to 1) limited spectral information, 2) intrinsic spectral variability, 3) spatial displacement, and 4) information loss, as discussed in the following.
1) Limited spectral information: Compared to coarse- and medium-resolution sensors, images captured by VHR sensors usually provide a smaller number of bands. Although WorldView-3, one of the most advanced VHR sensors, can provide images with 16 spectral bands, most VHR images, e.g., from IKONOS, QuickBird, WorldView-2, and Ziyuan-3, cover only four bands (blue, green, red, and near-infrared) [14]. With limited spectral information, it is difficult to separate classes that have similar spectral signatures because of the low between-class variance [15]–[18]. Researchers have also pointed out that it is difficult to achieve high-accuracy change detection with the limited spectral information [5], [15], [19]–[21] of VHR images. This may inhibit the direct use of traditional spectral-based change detection methods, e.g., change vector analysis (CVA) [22]; a minimal CVA sketch is given after this list. Therefore, other categories of features are often adopted to augment the spectral information for VHR change detection.
2) Spectral variability: There exists a high degree of spectral variability in VHR images. Buildings, for example, have complicated appearances, with various roof superstructures, such as chimneys, water tanks, and pipelines; this leads to significantly heterogeneous spectral characteristics in VHR images [23], [24]. High spectral variability within geographic objects increases the within-class variance, which inevitably leads to the uncertainty of spectral-based image interpretation methods. External factors, such as atmospheric conditions, phenological stages, sun angles, soil moisture, tidal stages, and water turbidity, may make unchanged objects temporally variant in their spectral features and hence result in them being incorrectly identified as changed ones [25], [26]. In addition, temporary objects, such as cars on a road, visible in VHR images can also affect the performance of traditional spectral-based change detection methods using VHR images.

FIGURE 2. The spatial displacement in multispectral data acquired with different viewing geometries in an unchanged urban scene [21]: (a) image (t1), with a satellite zenith angle of 153°, and (b) image (t2), with a satellite zenith angle of 129°12′. (c) The result of traditional spectral-based CVA shows a high number of false alarms (black and white indicate unchanged and changed areas, respectively) [31].

3) Spatial displacement: The VHR imaging systems on optical satellites are highly agile platforms and can operate as constellations [27] that support rapid retargeting, high revisit times (for instance, <1 day for WorldView-3 and WorldView-4), and stereoscopic coverage for rapid disaster response and 3D change detection [28]. However, this imaging mode makes it extremely difficult to acquire multitemporal images with the same or close viewing angles for accurate change detection [29], [30].
As such, multitemporal VHR images may suffer from apparent spatial displacement due to the parallax distortion of land cover objects, especially for high-rise buildings [31]. Specifically, a building may display distinct spatial morphologies (e.g., roofs and facades) in multitemporal VHR images due to different viewing angles (Figure 2). This may lead to a large number of commission errors if traditional spectral and pixel-based change detection methods are adopted. To solve such a problem, precise orthorectification using VHR digital surface models (DSMs) is a feasible solution. In particular, sensors equipped with multiview imaging systems, for instance, the three-line array of Ziyuan-3 and the two cameras of Cartosat-2, are preferred because they collect multiview images nearly simultaneously, providing stereo pairs under similar atmospheric conditions and convenient collection of multitemporal data.
4) Information loss: VHR images suffer from serious information loss owing to the presence of clouds/haze, cloud shadows, and shadows cast by terrain, buildings, and trees. The problem of cloud and cloud shadow contamination can be avoided by selecting cloud-free observations [32]. However, shadows cast by terrain, buildings, and trees seem unavoidable in VHR imagery, especially in urban areas [33]. Although shadow information is useful in building detection and height estimation [34]–[36], it becomes a problem for change detection in wider areas [37]. Since the direction and length of shadows are dependent on the sun's azimuth and elevation angle at the time of image acquisition, shadow-affected areas are different in multitemporal images. Besides, in the case of occlusions by vertical structures (e.g., high-rise buildings and trees), the problem of information loss can be more complicated. With different viewing geometries in multitemporal images, the size and direction of the tilting effect can vary, as shown in Figure 2. Overall, the regions affected by shadow and occlusions may become invisible and different in multitemporal VHR images.
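For readers unfamiliar with CVA, the following is a minimal sketch of the pixel-based version on a pair of co-registered multispectral images: each pixel's spectral difference vector is reduced to a magnitude, which is then thresholded. The mean-plus-two-standard-deviations threshold and the toy data are illustrative assumptions only; in practice, an automatic rule (e.g., Otsu or EM-based thresholding) is usually applied to the change magnitude.

```python
# Minimal pixel-based CVA sketch (illustrative assumptions only).
import numpy as np

def change_vector_analysis(img_t1, img_t2, threshold=None):
    """img_t1, img_t2: co-registered arrays of shape (rows, cols, bands)."""
    diff = img_t2.astype(np.float64) - img_t1.astype(np.float64)  # spectral change vectors
    magnitude = np.sqrt((diff ** 2).sum(axis=-1))                 # per-pixel vector length
    if threshold is None:
        # naive global rule; Otsu or EM-based thresholding is more common in practice
        threshold = magnitude.mean() + 2.0 * magnitude.std()
    return magnitude, magnitude > threshold

# toy usage with synthetic four-band data (hypothetical values)
t1 = np.random.rand(100, 100, 4)
t2 = t1.copy()
t2[40:60, 40:60, :] += 0.5  # simulate a changed patch
magnitude, change_mask = change_vector_analysis(t1, t2)
```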
METHODS
Change detection methods for VHR images are commonly based on two steps: 1) feature extraction and 2) change detection (see Figure 1).

FEATURE EXTRACTION
Change detection methods rely on effective multitemporal feature representation to indicate whether and what changes have occurred. It has been agreed that spectral-based methods become ineffective in dealing with the challenges facing VHR change detection. During the past decades, a large number of image features have been extracted, which can compensate for the limited spectral information contained in VHR images and improve the discriminative capability of image change information. In this review, image features designed for VHR change detection are divided into the following categories: textural, deep, object based, and angular (Figure 3). These are potentially useful for dealing with the challenges of limited spectral information and intrinsic spectral variability. A summary of the major features used for VHR change detection, including categories, subcategories, descriptions, characteristics, most-used sensors, and corresponding references, is presented in Table 2.

FIGURE 3. Features for change detection using high-spatial-resolution remote sensing images. (a) Textural features. (b) Deep features. (c) Object-based features. (d) Angular features. ADF: angular difference feature.

TABLE 2. A SUMMARY OF THE FEATURES USED FOR VHR IMAGE CHANGE DETECTION.

CATEGORY | SUBCATEGORY | DESCRIPTION | CHARACTERISTICS | SENSOR | REFERENCES
Textural features | Statistical | Describe the relationships among the gray levels of local windows | Edge effect, difficulty of identifying parameters | QuickBird [48]–[53] | [43]–[47]
 | Structural | Investigate the geometry, shapes, and edges of regions | | | [48]–[53]
 | Model based | Obtain coefficients from the model describing the relationships among the local image neighborhood | | | [56]–[61]
 | Transform based | Capture local structures in a transformed space | | | [63], [64]
Deep features | Autoencoders | Learn efficient encoding through the optimization of a series of criteria | Complex training and parameter tuning, "black-box" nature, high computational burden, overfitting, and so on | Gaofen-2 [66], [77] and Google Earth images [66], [76] | [70]–[73]
 | Convolutional neural networks | Extract mid- and high-level abstract features by interleaving convolutional and pooling layers | | | [66], [67], [75]–[78]
Object-based features | First level | Radiometry, geometry, and texture for each image object | Determination of appropriate segmentation parameters and uncertainties of the segmentation results | QuickBird [88], [89] | [85], [88], [89]
 | Second level | Relationships between two image objects, e.g., adjacency and proximity, and relationships with neighboring objects | | | [91], [92]
 | Third level | Spatial arrangements of multiple objects | | | [95]
Angular features | Implicit | Orthographic images and DSMs | Availability of multiangle images | Ziyuan-3 [2], [21] | [21], [98], [99]
 | Explicit | Quantify the differences contained in multiangle images, such as angular difference features | | | [2]
TEXTURAL FEATURES
Textural features depict contextual and structural information by using a moving window or kernel, where the parameters of size, direction, and distance must be appropriately determined [5], [38]. Textural features for VHR change detection can be categorized as statistical, structural, model based, and transform based. Statistical textures describe the relationships between the gray levels of local windows, e.g., the gray-level cooccurrence matrix (GLCM), local binary patterns (LBPs), and the pixel shape index (PSI). The GLCM, the most popular statistical texture, measures the contrast (e.g., dissimilarity and homogeneity), orderliness (e.g., the angular second moment and entropy), and statistical (e.g., the mean, variance, and correlation) attributes within local windows [39], [40]. The LBP, an ordered set of binary comparisons of pixel values between the central pixel and its neighboring ones, is invariant to monotonic grayscale change [41]. The PSI aims to measure the length of direction lines, which are extended based on gray-level similarity along a series of directions [42].

Some representative examples of VHR change detection using statistical textures are briefly introduced in the following. Tan et al. [43] adopted the GLCM in an automatic change detection method to consider the variation information of direction, distance, and amplitude in images. Li et al. [44] applied the local similarity of GLCM textures to detect changes and demonstrated that this kind of feature was robust against both noise and spectral similarity. Peng and Zhang [45] used the LBP for change detection from Gaofen-1 imagery, and both qualitative and quantitative analyses demonstrated the effectiveness of the proposed approach. Zhang et al. [46] identified building change types, i.e., new construction, demolition, and reconstruction, by using LBP features and obtained satisfactory change detection results with a high detection accuracy and precise structure boundaries. Liu et al. [47] proposed a line-constrained shape feature, a modified version of the PSI, for building change detection, and the results showed the approach's advantage in individual building change detection in a lightly populated region.
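As a concrete illustration of the statistical textures discussed above, the sketch below computes window-based GLCM contrast and homogeneity together with LBP codes for a single-band image; texture change can then be measured by differencing the resulting feature maps between dates. It assumes the scikit-image (≥ 0.19) functions graycomatrix, graycoprops, and local_binary_pattern, and the window size, gray-level quantization, and LBP parameters are illustrative choices, not values prescribed by the cited studies.

```python
# Window-based GLCM measures and LBP codes (assumes scikit-image >= 0.19).
# The per-pixel loop is written for clarity, not speed.
import numpy as np
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern

def glcm_features(gray_uint8, win=15, levels=32):
    """Per-pixel GLCM contrast and homogeneity from a sliding window."""
    img = (gray_uint8 / 256.0 * levels).astype(np.uint8)  # quantize gray levels
    pad = win // 2
    padded = np.pad(img, pad, mode="reflect")
    contrast = np.zeros(img.shape, dtype=np.float32)
    homogeneity = np.zeros(img.shape, dtype=np.float32)
    for r in range(img.shape[0]):
        for c in range(img.shape[1]):
            patch = padded[r:r + win, c:c + win]
            glcm = graycomatrix(patch, distances=[1], angles=[0, np.pi / 2],
                                levels=levels, symmetric=True, normed=True)
            contrast[r, c] = graycoprops(glcm, "contrast").mean()
            homogeneity[r, c] = graycoprops(glcm, "homogeneity").mean()
    return contrast, homogeneity

def lbp_codes(gray_uint8, radius=2, n_points=16):
    """Rotation-invariant uniform LBP codes (robust to monotonic gray-level change)."""
    return local_binary_pattern(gray_uint8, n_points, radius, method="uniform")

# texture change can then be measured, e.g., as np.abs(contrast_t2 - contrast_t1)
```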
Structural textures, e.g., morphological profiles (MPs) and attribute profiles (APs), facilitate the investigation of the geometries, shapes, and edges of regions, with the convex and concave components being erased so that the geometric information of relevant structures is preserved and unimportant details are attenuated [48], [49]. MPs and APs have proved to be effective in VHR change detection since they can simplify results and reduce noise components (e.g., spectral variations) [48], [49]. For instance, Liu et al. [50] took the geometrical structure of change targets into account using MPs. In addition, the morphological building index (MBI) [36], which is defined as differential MPs with linear structural elements, has been extensively used in VHR change detection in urban areas since it can highlight bright and high-contrast structures, mostly consisting of buildings, in remote sensing images. For example, Huang et al. [51] proposed an automatic building change detection framework based on the MBI. Experimental results showed that the proposed method outperformed supervised classification via a support vector machine (SVM). In addition, point and line features, for instance, Harris [52] and scale-invariant feature transforms (SIFTs) [53], can improve the discriminability of man-made objects, such as buildings, roads, and cars, by describing corners and edges, therefore improving results.
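The following is a minimal sketch of a morphological profile built from openings and closings by reconstruction at increasing structuring-element sizes, the kind of structural feature stack described above. The disk-shaped elements and radii are illustrative assumptions (the MBI, by contrast, relies on linear structural elements), and scikit-image is assumed for the morphological operators.

```python
# Morphological profile from openings/closings by reconstruction (assumes scikit-image).
import numpy as np
from skimage.morphology import disk, erosion, dilation, reconstruction

def opening_by_reconstruction(gray, footprint):
    seed = erosion(gray, footprint)
    return reconstruction(seed, gray, method="dilation")

def closing_by_reconstruction(gray, footprint):
    seed = dilation(gray, footprint)
    return reconstruction(seed, gray, method="erosion")

def morphological_profile(gray, radii=(2, 4, 8)):
    """Stack openings and closings by reconstruction at increasing scales."""
    levels = []
    for r in radii:
        se = disk(r)  # illustrative disk elements; the MBI uses linear elements instead
        levels.append(opening_by_reconstruction(gray, se))
        levels.append(closing_by_reconstruction(gray, se))
    return np.stack(levels, axis=-1)

# a differential profile (differences between successive levels) highlights structures
# that disappear at a given scale, which is the idea underlying indices such as the MBI
```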
Model-based textures, e.g., Markov random fields (MRFs) and fractal models, aim to represent textures through stochastic processes [54]. MRF models present spatial context through a graph-based image representation, where the nodes and edges of the graph express pixels and their relationships with connected nodes, respectively. Fractal models can depict texture roughness and complexity by capturing self-similar and self-affine patterns [55]. A number of MRF-based methods have been proposed to deal with VHR image change detection [56]–[60] because of their ability to describe local spatial relationships. Specifically, Bruzzone and Prieto [57] introduced a change detection method based on an MRF to model prior class probabilities by interpixel dependence, which increased the accuracy and reliability of the change detection results. In [60], spatial constraints between neighboring samples were formulated using an MRF in an active learning process for change detection. Multifractal features were applied to change detection in [61], and experiments on a complex landscape that included urban areas, agricultural fields, trees, and an unregulated river indicated that the features were tolerant, to some degree, to multitemporal differences caused by the viewing geometry and illumination angles.

Transform-based textures, e.g., Gabor, wavelets, and contourlets (CTs), aim to convert images into a new space to capture local structures corresponding to scale, localization, and orientation [62]. For example, Li et al. [63] used a Gabor-based approach to improve the change detection performance since the technique can capture contextual information at different scales and orientations. Wei et al. [64] introduced wavelet pyramid decomposition features to VHR change detection. Thus, in VHR images, the complexity of homogeneous regions can be reduced in low-scale features, and details and edge information can be retained in high-scale ones [64].
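As a small example of transform-based textures, the sketch below stacks Gabor magnitude responses at several orientations; like the other textures, these maps can be differenced between dates. The gabor filter from scikit-image is assumed, and the frequency and number of orientations are illustrative choices.

```python
# Gabor magnitude responses at several orientations (assumes scikit-image).
import numpy as np
from skimage.filters import gabor

def gabor_texture_stack(gray, frequency=0.2, n_orientations=4):
    """Return magnitude responses of Gabor filters at evenly spaced orientations."""
    responses = []
    for k in range(n_orientations):
        theta = k * np.pi / n_orientations
        real, imag = gabor(gray, frequency=frequency, theta=theta)
        responses.append(np.hypot(real, imag))  # magnitude of the complex response
    return np.stack(responses, axis=-1)
```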
In a comparative study conducted by Li et al. [65], a number of representative textural features were selected for change detection using VHR images, and it was shown that texture-based change detection methods can obtain better performance than spectral-based pixel ones. Texture change detection results are demonstrated in Figure 4, and it can be seen that, compared to using individual textures, combining multiple textures can improve change detection accuracy.

FIGURE 4. Change detection results based on textures: (a) image (t1), (b) image (t2), (c) the reference change map, (d) the GLCM, (e) APs, (f) a 2D wavelet transform (WT), (g) a fractal, (h) a fuzzy set (APs plus a 2D WT plus a 3D WT), (i) a fuzzy set (all textures), (j) a random forest (APs plus a 2D WT plus a 3D WT), and (k) a random forest (all textures) [65].

DEEP FEATURES
Deep feature representation based on the layer-wise learning of image patterns is a very promising research direction for change detection in VHR images [66], [67]. Differing from traditional handcrafted features, higher-level abstractions (both linear and nonlinear features) can be automatically extracted and optimized by multilayer neural networks, which can retain crucial variations and discard uncorrelated differences for change detection tasks [68]. In recent years, many deep learning methods have been developed, such as autoencoder (AE) models and convolutional neural networks (CNNs), for deep feature extraction in change detection with VHR images.

The AE is an unsupervised feature learning model that is constructed by minimizing the reconstruction error. However, it may learn a useless feature representation, such as a simple copy of the input [69]. To overcome that issue, variant models, e.g., the denoising AE (DAE) [70], sparse AE (SAE) [71], and Fisher AE (FAE) [72], have been employed for VHR change detection, with denoising, sparsity, and Fisher discriminant criteria, respectively. Specifically, a stacked DAE was used to learn high-level features from the local neighborhood [70]. In [70], it was found that the filters learned by a stacked DAE have a stronger representation capability than existing explicit ones. Based on the SAE, Su et al. [71] transformed a difference image into a suitable feature space for suppressing noise and extracting key change information in the change detection framework. Liu et al. [72] used the FAE for unsupervised layer-wise feature learning and showed that the model can generate more discriminative features than the original AE. In addition to unsupervised feature learning through the optimization of certain criteria, AE-based models can learn effective features in a supervised way by considering label consistency, e.g., the contractive AE [73].
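The following is a minimal denoising-autoencoder sketch in PyTorch, intended only to illustrate the reconstruction-based feature learning idea behind the DAE/SAE/FAE variants cited above rather than to reproduce any of those architectures. The patch size, layer widths, noise level, and training schedule are assumptions; after training, the encoder output serves as the learned feature representation.

```python
# Minimal denoising autoencoder sketch in PyTorch (illustrative architecture only).
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, n_inputs=4 * 9 * 9, n_hidden=64):  # e.g., 9x9 four-band patches
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, n_inputs), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_dae(patches, epochs=20, noise_std=0.1, lr=1e-3):
    """patches: tensor of shape (n_samples, n_inputs), values scaled to [0, 1]."""
    model = DenoisingAE(n_inputs=patches.shape[1])
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        noisy = patches + noise_std * torch.randn_like(patches)  # corrupt the input
        loss = loss_fn(model(noisy), patches)                    # reconstruct the clean data
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model  # model.encoder(x) then yields the learned features
```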
In particular, the loss function was designed by combining high-level features extracted from a pretrained model (i.e., the U-net model trained on an open source data set) and semantic information contained in change detection data sets. Notably, deep learning methods depend on an enormous amount of training data, which may not be available for multitemporal VHR remote sensing imagery [74]. Meanwhile, great differences in spectral properties and image contexts between natural red–green–blue (RGB) images and remote sensing data result in deep features extracted by fine-tuned models that do not fully represent the essential characteristics of remote sensing images. As a result, the contrast between a small number of remote sensing data sets and a large number of natural images during model learning may hamper the further improvement of VHR change detection using deep features. In recent years, large multitemporal data sets have been released, such as 86 image pairs from the DigitalGlobe satellite constellation (i.e., QuickBird, WorldView-1, WorldView-2, and GeoEye-1) [80], 291 pairs of multitemporal aerial images [81], and more than 700,000 labeled instances for building damage assessment [82]. It can be anticipated that more and larger multitemporal VHR remote sensing data sets with diverse image characteristics and various acquisition conditions will appear in the near future. In this case, the essential change features for VHR remote sensing images can be effectively extracted by a deep network specialized for multitemporal remote sensing data.

OBJECT-BASED FEATURES
Object-based features refer to spectral, geometric, textural, extent, and contextual information at the object scale rather than single pixels and groups of pixels within a kernel filter/moving window. In this way, an image object is viewed as the processing unit for change detection. An object is a set of spatially adjacent pixels that are spectrally similar and that can be extracted through image segmentation. Overall, object-based features are effective in VHR change detection since they mitigate radiometric differences, spectral variability, and misregistration errors [38], [83]. However, appropriate segmentation parameters, which are often dependent on subjective and laborious trial-and-error experiments, need to be determined [84]. Furthermore, shortcomings and problems in different multitemporal image segmentation strategies, e.g., 1) the segmentation of only one monotemporal image, 2) the segmentation of stacked multitemporal images, and 3) the independent segmentation of multitemporal images, should be carefully considered and tackled [5], [85]. Specifically, geometric changes (e.g., the size and shape) cannot be captured by 1) and 2) [85]. Moreover, strategy 2) may also result in "sliver objects" caused by image misregistration. As for strategy 3), spatial correspondence between multitemporal objects needs to be established.
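As a concrete, simplified illustration of strategy 2) above (segmenting the stacked bi-temporal image and comparing per-object statistics), the following sketch uses SLIC superpixels from scikit-image; the segmentation algorithm, its parameters, and the per-object mean spectrum are illustrative assumptions rather than the setup of any cited study.

# A sketch of object-based change analysis via strategy 2): one SLIC
# segmentation of the stacked bi-temporal image, then per-object mean-spectrum
# differencing. SLIC and its parameters are illustrative assumptions.
import numpy as np
from skimage.segmentation import slic

def object_based_change(img_t1, img_t2, n_segments=2000):
    """img_t1, img_t2: (H, W, B) co-registered images scaled to [0, 1]."""
    stacked = np.concatenate([img_t1, img_t2], axis=-1)
    # A single segmentation shared by both dates avoids "sliver objects" from
    # independent segmentations but cannot capture geometric changes.
    objects = slic(stacked, n_segments=n_segments, compactness=10,
                   channel_axis=-1, start_label=1)
    change = np.zeros(objects.shape, dtype=np.float32)
    for obj_id in np.unique(objects):
        mask = objects == obj_id
        mean_t1 = img_t1[mask].mean(axis=0)   # per-object mean spectrum, date 1
        mean_t2 = img_t2[mask].mean(axis=0)   # per-object mean spectrum, date 2
        change[mask] = np.linalg.norm(mean_t1 - mean_t2)
    return objects, change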
Generally speaking, three levels of object-based features can be used for change detection [86]. In the first, the object-based features include the radiometry, geometry, and texture for each image object [87]. For instance, in [88], key points of each object are extracted in change detection, which was successfully applied in three landslide scenes and one view that examined land use changes. Bovolo [89] computed the mean values of texture measures in separate parcels for change detection, and better accuracy with high fidelity in the homogeneous and border regions was achieved by the object-based method than with the pixel-based one. However, in these studies, texture is still extracted in a pixel-based manner and depends on the size of a moving window (or kernel). More importantly, kernel- and window-based texture can create between-class texture, leading to an edge effect [87]. Therefore, object-oriented texture computed within the boundary of an object is recommended, such as object-wise GLCM texture measures [87] and object-based MPs [90]. The second-level object-based features exploit relationships between two image objects, e.g., adjacency, proximity, and relations between neighboring objects [87]. For example, Liang et al. [91] considered the relations of neighboring objects in feature extraction for object-oriented change detection. Yu et al. [92] combined a relative border with a "forest with no change" and the normalized difference vegetation index (NDVI) to identify the category of "change from forest to developed land." The third-level features refer to spatial arrangements among multiple objects [87]. Third-level object-based features have been used in image classification, such as urban functional zone extraction [93] and urban village detection [94]. Nevertheless, such features have rarely been used in VHR image change detection. In [95], spatial dependency and sharing boundaries among multiple objects are considered to reduce spurious errors caused by shadow in urban vegetation change detection. Object-based CVA results [85] derived from different multitemporal segmentation strategies are presented in Figure 6, where it can be observed that different multitemporal segmentation strategies can significantly affect change detection results.

FIGURE 5. Change detection results for QuickBird bi-temporal images: (a) image (t1), (b) image (t2), (c) the reference change map, (d) multiclass deep CVA, (e) binary change deep CVA, and (f) object-based CVA [67].

FIGURE 6. Object-based CVA results from different multitemporal segmentation strategies: (a) image (t1), (b) image (t2), (c) the reference change map, (d) the segmentation of image (t1), (e) the segmentation of stacked multitemporal images, and (f) the separate segmentation of each monotemporal image [85].
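To make the deep-feature differencing behind deep CVA (Figure 5) concrete, the following sketch extracts mid-level features from a pretrained CNN for two co-registered dates and computes a per-pixel change magnitude. It is a simplified illustration in the spirit of [67], not a reproduction of that method; the ResNet-18 backbone, the truncation point, the simple global threshold, and the recent-torchvision weights API are all assumptions.

# A simplified sketch in the spirit of deep CVA, not the exact pipeline of [67];
# the ResNet-18 backbone, truncation point, and global threshold are assumptions.
import torch
import torch.nn.functional as F
import torchvision.models as models

# Keep the layers up to the second residual stage so mid-level spatial features
# (stride 8) are retained.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:6]).eval()

@torch.no_grad()
def deep_change_magnitude(img_t1, img_t2):
    """img_t1, img_t2: (1, 3, H, W) co-registered, normalized bi-temporal tensors."""
    f1 = feature_extractor(img_t1)                      # (1, C, H/8, W/8)
    f2 = feature_extractor(img_t2)
    magnitude = torch.linalg.vector_norm(f1 - f2, dim=1, keepdim=True)
    magnitude = F.interpolate(magnitude, size=img_t1.shape[-2:],
                              mode="bilinear", align_corners=False)
    # A crude global threshold stands in for more careful ROC- or EM-based selection.
    threshold = magnitude.mean() + 2 * magnitude.std()
    return magnitude.squeeze(), (magnitude > threshold).squeeze()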
ANGULAR FEATURES
Multiangle satellite images can be acquired by WorldView-2, IKONOS, Cartosat-1, and Ziyuan-3 through across-track and along-track stereoscopy [96]. Spatial and spectral variations encoded in multiangle images can be extracted as new information sources for change detection. To be specific, multiangle observations can capture information about bidirectional reflectance signatures and vertical structures (e.g., trees and buildings) and hence complement conventional spectral and spatial features [27]. In this article, angular features are categorized as 1) implicit ones that are generated by stereo photogrammetry, such as orthographic images and DSMs, and 2) explicit ones that capture angular variations, such as angular difference features [97]. Most existing change detection studies based on multiangle VHR imagery adopt implicit angular features. For example, Chaabouni-Chouayakh et al. [98] presented a fully automatic change detection method for urban monitoring using IKONOS stereo data, and their experimental results verified the effectiveness of the joint use of multispectral and DSM features. Tian et al. [99] investigated building and forest change detection using panchromatic Cartosat-1 stereo imagery, and they found that extracted height values from DSMs can greatly improve change detection accuracy. Huang et al. [21] used photogrammetrically derived orthographic images from multiangle Ziyuan-3 data to monitor subtle changes across urban areas, and it was shown that the use of orthographic images can minimize the influence of spatial inconsistency among multitemporal data, e.g., misregistration and parallax distortion for high-rise buildings. On the other hand, explicit angular features aim at describing the differences contained in multiangle images, e.g., the angular difference feature [100], multiangular built-up index (MABI) [101], multiangle spectral variation feature [27], stacked multiangle spectral feature [102], and bidirectional reflectance distribution function-based index [103]. Benefiting from these explicit angular features, detailed urban and vegetation classifications were achieved using multiangle VHR images. Nevertheless, in the current literature, the previously mentioned explicit angular features have seldom been employed for change detection. One exception is a recent study presented in [2]. In it, the MABI, which indicates spectral and structural variations in multiview images, was used. Specifically, Huang et al. [2] integrated planar (i.e., MBI, Harris, and PanTex) and vertical [multispectral image (MSI), normalized DSM (nDSM), and MABI] features to detect newly constructed buildings and identify their change timing by using time-series, multiview Ziyuan-3 imagery. Figure 7 gives an example of change results from different feature combinations. It shows that the joint use of planar and vertical features can generate more accurate results in terms of change extents and timings. To better evaluate the different kinds of features, we create a Ziyuan-3 multiview change detection (MVCD) data
set, which is available at http://irsip.whu.edu.cn/resources/resources_en_v2.php. It includes both urban and rural scenes with diverse and complex change types, and, moreover, it considers seasonal and illumination influences. These characteristics enable the MVCD to function as a challenging change detection data set. A comparative analysis between different attributes, including the GLCM [39], AP [49], CT [62], MABI [101], object-wise GLCM (GLCM-Obj) [87], and deep features [67], has been carried out. Specifically, the change intensity map was obtained by CVA, and the threshold for each feature was determined based on receiver operating characteristic curves to achieve a balance between commission and omission errors [65]. Qualitative and quantitative experimental results are provided in Figure 8 and Table 3, respectively. The spectral feature fails to detect changes between spectrally similar classes (e.g., bare soil and buildings), and unchanged objects with spectral variation are incorrectly detected as changed ones. The GLCM, AP, and CT can depict textural changes, e.g., the spatial distribution of the gray value, geometry, and local details. Among them, the CT gives more complete changed regions, and the AP produces more false alarms. The MABI emphasizes building changes, but it is not sensitive to other variations (e.g., soil, vegetation, and roads), which therefore leads to a large omission error. The GLCM-Obj generates smoother results with smaller omissions but larger commission errors than its pixel-wise version. Deep CVA outperforms the other methods, but false alarms caused by shadows and seasonal effects can still be observed.

FIGURE 7. Experimental results for the automatic monitoring of newly constructed building areas (NCBAs) using planar (i.e., MBI, Harris, and PanTex) and vertical (MSI, nDSM, and MABI) features [2]. (a) Multitemporal Ziyuan-3 images. (b) NCBAs and their change timing.

FIGURE 8. A comparison of different features for the MVCD data set: (a) image (t1), (b) image (t2), (c) the ground reference, (d) spectral features, (e) the GLCM, (f) APs, (g) CTs, (h) the MABI, (i) the GLCM-Obj, and (j) deep features.

TABLE 3. THE CONSIDERED METHODS' CHANGE DETECTION ACCURACY WHEN USING THE MVCD DATA (%).
METHOD | CORRECTNESS | COMMISSION ERROR | OMISSION ERROR | OVERALL ERROR
Spectral | 70.88 | 24.51 | 29.12 | 26.62
GLCM | 65.05 | 16.1 | 34.95 | 22.04
AP | 71.75 | 40.2 | 28.25 | 33.18
CT | 75.13 | 33.84 | 24.87 | 28.67
MABI | 57.87 | 28.76 | 42.13 | 34.18
GLCM-Obj | 74.51 | 30.47 | 25.49 | 27.76
Deep | 79.98 | 25.46 | 20.02 | 22.41

CHANGE DETECTORS
VHR change detectors can be categorized as algebra-, transform-, and machine learning-based indicators. CVA is one of the most widely used algebraic approaches, and it is carried out by measuring the difference among bi-temporal multifeature vectors to derive a change vector for VHR images [67], [104], [105]. Transform-based methods, such as principal component analysis [106] and multivariate alteration detection [107], attempt to suppress no-change areas and emphasize change information in the transformed feature space. In the machine learning community, change detection is often viewed as a classification problem. In conventional classification-based VHR change detection, spectral–spatial feature extraction and detectors (e.g., SVMs [108] and the random forest [65]) are separately implemented. The recent hot spot, i.e., deep learning, can integrate these two operations in a joint learning framework, which is therefore very promising for VHR change detection [109], [110]. Deep learning-based change detectors can be grouped in terms of different criteria, including learning and fusion strategies, network models, and processing units (Table 4). We first discuss learning strategies. On the basis of a large amount of annotated data, supervised deep learning methods can capture semantic changes, and hence they
are sensitive to actual variations of interest and tolerant to "pseudo changes" (such as geometric deformation and radiation distortions caused by spatial displacement and phenology variation, respectively) [110]–[116]. However, it is difficult to learn a deep model only from the training samples of a study area since the proportion of the change area is usually very small. To tackle this problem, on the one hand, transfer learning [117] and meta-learning [118] are considered to leverage knowledge from other data sources. Transfer learning strategies focus on fine-tuning pretrained models that are designed for different but related tasks. Meta-learning can learn from data, and it can learn how to learn by utilizing previous experiences [119]. Given the
huge difference between VHR remote sensing images and data from other fields (e.g., natural RGB images) in terms of the image modality, spectral bands, spatial resolution, viewing angle, and so on, large amounts of publicly available multitemporal VHR remote sensing data are required to construct a robust VHR deep change detector. On the other hand, semisupervised deep learning methods, with the consideration of unlabeled samples [120], can relieve the burdensome labeling process, although the effects of unlabeled samples as well as the complexity of the semisupervised model should be further investigated. With regard to the fusion strategy, according to how bi-temporal images are dealt with, deep learning-based change detectors can be classified as early fusion and late fusion. Early fusion methods concatenate multitemporal images as a whole input into a deep network [110]. Early fusion is able to capture the hierarchical difference representation, i.e., from low-level grayscale differences in shallow layers to high-level semantic changes in deep layers, while grayscale differences that are not relevant to semantic changes, e.g., spatial misalignment and the internal variability of objects, may propagate to deeper layers and therefore lead to false alarms [113]. In contrast to early fusion, late fusion methods separately learn monotemporal features and concatenate them later as an input to the change detection layers [121]. This kind of network architecture may lead to insufficient learning; e.g., during network training, gradients in high layers are difficult to propagate backward to lower ones [122], which affects the change detection performance. Thus, as an attempt in [113], early and late fusion networks were combined to complement one another. As for network models, AE [123], [124], deep belief networks [125], CNNs [110], [112], [113], [115], [120], [126], recurrent neural networks (RNNs) [127]–[129], generative adversarial networks (GANs) [130], [131], and graph neural networks [132] have been adopted for end-to-end change detection. The CNN is one of the most widely used methods, and mainstream CNN architectures, such as AlexNet [133], VGGNet [134], GoogleNet [135], ResNet [136], and DenseNet [137] as well as their variants, have been considered [138]. RNNs with modules, such as long short-term memory and gated recurrent units as well as their variants, are also widely employed to model the phenological process of multitemporal VHR images, due to the superiority of recurrent layers in processing sequential data and modeling time-series dependence. In addition, the U-net and its variants, which are composed of an encoder to hierarchically extract semantic information and a counterpart decoder to delineate spatial details, can be viewed as AE architectures for VHR change detection. They receive much attention due to their ability to maintain change object spatial details. Recently, some studies proposed hybrid models, such as those in [111] and [127]. For instance, as illustrated in Figure 9, a CNN and an RNN are combined in one end-to-end network to extract joint spectral–spatial–temporal features [111].
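The early and late fusion strategies described above can be contrasted with two schematic PyTorch models; the layer widths, depths, and per-pixel change head below are arbitrary assumptions, not a reproduction of any cited architecture.

# Schematic sketches contrasting early and late fusion for bi-temporal change
# detection; all layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class EarlyFusionNet(nn.Module):
    """Bi-temporal images are concatenated along the band axis before encoding."""
    def __init__(self, bands=4):
        super().__init__()
        self.encoder = nn.Sequential(conv_block(2 * bands, 32), conv_block(32, 64))
        self.head = nn.Conv2d(64, 1, 1)            # per-pixel change logit
    def forward(self, x_t1, x_t2):
        return self.head(self.encoder(torch.cat([x_t1, x_t2], dim=1)))

class LateFusionNet(nn.Module):
    """A Siamese encoder learns monotemporal features that are fused afterward."""
    def __init__(self, bands=4):
        super().__init__()
        self.encoder = nn.Sequential(conv_block(bands, 32), conv_block(32, 64))
        self.head = nn.Conv2d(2 * 64, 1, 1)
    def forward(self, x_t1, x_t2):
        f1, f2 = self.encoder(x_t1), self.encoder(x_t2)   # shared weights
        return self.head(torch.cat([f1, f2], dim=1))

In the late fusion case, the shared (Siamese) encoder guarantees that the two dates are embedded in the same feature space before their features are compared.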
In [139], difference-based methods using edge-based level set evolution (ELSE), region-based level set evolution (RLSE), MRFs, and fully convolutional networks (FCNs) as well as postclassification-based methods with SVMs, CNNs, GANs, Siamese convolutional networks (SCNs), and end-to-end GAN-based Siamese frameworks (GSFs) are compared for landslide detection (Figure 10). Since observing landslides separately from unchanged and other changed regions is required, this kind of change detection is challenging. As can be seen, the four difference-based methods lead to more false alarms. As for the five postclassification methods, deep learning techniques generally outperformed SVMs, due to their explorative capabilities in representing related changes and suppressing irrelevant variations.

TABLE 4. A SUMMARY OF DEEP LEARNING-BASED CHANGE DETECTORS.
CRITERIA | CATEGORY | DESCRIPTION | REFERENCES
Learning strategy | Supervised | Based on a large number of labeled samples | [110]–[116], [121], [124]–[130], [139], [140]
Learning strategy | Transfer learning | Fine-tunes pretrained models that are designed for different but related tasks | [117], [131]
Learning strategy | Meta-learning | Learns from little labeled data and learns how to learn | [118]
Learning strategy | Semisupervised | Joint use of labeled and unlabeled data | [120], [132]
Fusion strategy | Early fusion | Uses concatenated multitemporal images as input | [110], [114], [115], [125]–[129], [131], [132], [139]
Fusion strategy | Late fusion | Learns monotemporal features separately and then concatenates them as a whole input | [111], [112], [116], [117], [121], [130], [140]
Network model | CNN | Stacked convolutional, pooling, and fully connected layers | [110], [112]–[116], [120], [126]
Network model | Recurrent neural network | Models with a recurrent hidden state, e.g., gated recurrent units and long short-term memory | [127]–[129]
Network model | AE | Reconstructs the input with an encoder–decoder structure | [123], [124]
Network model | Deep belief network | Composed of layer-wise restricted Boltzmann machines | [125]
Network model | Graph neural network | Learns graph structure, e.g., relationships between features of pixels/objects | [132]
Network model | Generative adversarial network | Generator and discriminator that are adversarially trained | [130], [131], [139]
Processing unit | Patch | Assigns a label to each patch | [111], [115]–[117], [120], [121], [128]–[130]
Processing unit | Pixel | Predicts change labels for each pixel | [110], [113], [114], [126], [131], [139]
Processing unit | Object | Incorporation of segments/superpixels | [124], [125], [127], [132], [140]
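Among the network models summarized in Table 4, recurrent architectures treat the acquisitions as a sequence. The following minimal sketch classifies the temporal profile of each pixel (or patch feature vector) with an LSTM; the feature dimension, hidden size, and two-class head are illustrative assumptions rather than a cited architecture.

# A sketch of a recurrent change detector over per-pixel time series; all
# dimensions are placeholders.
import torch
import torch.nn as nn

class TemporalChangeClassifier(nn.Module):
    def __init__(self, n_features=4, hidden=32, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)
    def forward(self, series):
        # series: (batch, time_steps, n_features), e.g., spectral bands per date
        _, (h_last, _) = self.lstm(series)
        return self.classifier(h_last[-1])   # change/no-change logits

# Example: classify 1,000 pixel time series with five acquisition dates.
logits = TemporalChangeClassifier()(torch.randn(1000, 5, 4))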
FIGURE 9. An end-to-end architecture composed of a CNN, RNN, and fully connected network for change detection [111]. (a) Image (t1) (top) and image (t2). (b) The convolutional subnetwork. (c) The recurrent subnetwork. (d) The fully convolutional layers. (e) The binary change detection (top) and multiclass change detection.

FIGURE 10. Landslide detection results from different methods: (a) image (t1), (b) image (t2), (c) ELSE, (d) RLSE, (e) an MRF, (f) an FCN, (g) an SVM, (h) a CNN, (i) a GAN, (j) an SCN, (k) a GSF, and (l) the ground truth. White and black indicate areas where landslides are detected and not detected, respectively. Red and blue circles represent landslide pixels that are wrongly detected and omitted [139].
According to the processing unit, deep learning-based detectors are divided into patch- [116], [130], pixel- [110], [117], and object-based [127], [140] varieties. For a patch-based change detection task, a sliding window with a fixed size is used to divide the study area into a series of patches, and each patch is assigned a label by the detector. In this way, each pixel in the patch is assigned the same label. Consequently, rough location—not fine-grained—boundary-of-change information is obtained. However, patch-based change detection can reduce the influence of spatial misalignment to some extent in VHR change detection. Since patch-based deep learning networks view each patch as the change detection unit and encode each patch as a set of feature maps with coarser spatial resolutions, the spatial misalignment of these feature maps becomes smaller, and some errors of spatial alignment are therefore avoided in a change detection task. In other words, when regarding a patch as the change analysis unit, only a very large misalignment can cause an unchanged image patch to be identified as a changed one, and a small misalignment can be tolerated. Several important issues should be noticed for the patch-based method, such as the oversmoothing of results and the selection of the patch size. The multiscale strategy [135] may be appropriate for addressing these issues, but it inevitably leads to larger computation burdens. Pixel-based methods usually employ semantic segmentation architectures to predict pixel-wise change detection results [33]. Specifically, in semantic segmentation architectures, after extracting abstract semantic information through multilayer encoding (e.g., convolution layers), a series of operations, e.g., interpolation, deconvolution, and upsampling, is used to progressively decode semantic information into feature maps that have the same spatial resolution as the input images. Unlike traditional pixel-based change detectors that suffer from misregistration, viewing angle differences, and occlusions, deep learning methods can predict pixel-wise change detection with a highly semantic abstraction of the spatial context. However, object boundaries are often blurred in the change detection results, as up-sampling layers reconstruct the appearance but not the shape of objects. To cope with this issue, better networks are designed. UNet++, for example, combines nested features to preserve change region boundaries, considering that shallow layers are better able to capture spatial details [110]. Object-based deep learning methods are also considered for change detection [127], [140]. A simple approach is to adopt object-based segmentation in the pre/postprocessing step, as shown in [140]. On the other hand, object information can be also considered during the training process by adding object-wise loss terms [127]. However, issues related to conventional multitemporal image segmentation, such as oversegmentation, undersegmentation, and “sliver objects” caused by misregistration, remain unsolved. In the future, object-based detectors need to generate semantic segments and establish spatial correspondence between multitemporal segments. 
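The patch-based processing unit described above can be illustrated with a simple sliding-window sketch in which every pixel of a patch inherits the patch-level decision; the patch size, stride, and scoring function are placeholders (score_fn could be any trained bi-temporal classifier).

# A sketch of patch-wise change labeling with a fixed sliding window; patch
# size, stride, and the scoring model are placeholders.
import numpy as np

def patchwise_change_map(img_t1, img_t2, score_fn, patch=32, stride=32):
    """Assign one change score to every patch and broadcast it to its pixels."""
    h, w = img_t1.shape[:2]
    change = np.zeros((h, w), dtype=np.float32)
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            p1 = img_t1[i:i + patch, j:j + patch]
            p2 = img_t2[i:i + patch, j:j + patch]
            # Every pixel in the patch inherits the patch-level decision, so
            # change boundaries remain coarse, but small misalignments between
            # the two dates are tolerated.
            change[i:i + patch, j:j + patch] = score_fn(p1, p2)
    return change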
The types of characteristics most often used for each criterion (i.e., the learning strategy, fusion strategy, network model, and processing unit) in VHR change detection are summarized in the following:
1) For the learning strategy, supervised learning is the most widely used method for VHR change detection. However, the great amount of labor required to collect a large number of training samples becomes a bottleneck, especially for deep network models, which leads to increasing attention for other learning strategies.
2) Late and early fusion strategies have their own strengths and weaknesses in representing multitemporal features and their differences, and hence hybrid fusion is sometimes chosen.
3) Among various network models, CNNs are the most commonly considered, and they are coupled with other networks, i.e., hybrid models, for instance, CNN–RNNs [111].
4) As for the processing unit, most studies consider patch- and pixel-level models. Patch-level detectors are more tolerant to spatial misalignment, but pixel-based ones are more appropriate for identifying fine-grained changes.

APPLICATIONS OF VHR CHANGE DETECTION
VHR image change detection is widely used in a large number of practical scenarios. A series of representative applications is the focus of this review, including the monitoring and change detection of 1) land cover and land use, 2) buildings, 3) vegetation, 4) crops, 5) lakes and wetlands, 6) ecosystem services, and 7) impervious surfaces.

LAND COVER AND LAND USE CHANGE DETECTION
Compared to coarse- and medium-resolution images, VHR images can reveal detailed and subtle intraurban change information [141]. Specifically, urban change detection by combining multiple features (e.g., object-based spectral, shape, and texture attributes) was presented in [142], where changes to detailed urban objects, e.g., buildings, roads, and playgrounds, can be detected. Huang et al. [21] identified pixel-level change transitions in 2012–2013 using Ziyuan-3 orthographic images, and the experimental result is presented in Figure 11. It can be seen that, even in the one-year period, small-scale changes extensively occurred in the urban area of Wuhan, China. For instance, fine-scale urban land cover transitions caused by pond infilling, building demolitions, building construction, weed growth, and site preparation can be observed. In [143], changes in detailed land cover classes, including bright roofs, gray roofs, tile roofs, brown fields, dark asphalt, light asphalt, and so on, were analyzed using IKONOS and GeoEye-1 images. As for land use change detection, Wu et al. [108] interpreted change transitions, e.g., from sparse housing to industrial areas, by combining spectral and SIFT features. In [144], land use maps of Shenzhen (a highly dynamic and developed megacity in China) were generated in 2005 and 2017 based on VHR satellite data. As demonstrated in Figure 12, detailed land use categories, including residential, commercial,
industrial, infrastructure, grassland, farmland, woodland, water, breeding surfaces, and unused land, were monitored. In addition, the performance of different features, i.e., color histograms (CHs), LBPs, SIFTs, and deep features, was compared, and the best accuracies of 96.9% and 97.1% were obtained by the deep learning method [Figure 12(b)].

FIGURE 11. Land cover change detection using Ziyuan-3 satellite imagery from 2012 and 2013. (a) The change detection result of the study area in Wuhan. (b)–(f) Five example cases of the change detection result and corresponding bi-temporal images [21].

BUILDING CHANGE DETECTION
Buildings are one of the most dynamic artificial structures, and building change detection is important for urban development monitoring (e.g., building demolition and construction) and disaster management (e.g., building damage caused by natural hazards). Numerous methods for building change detection have been proposed [19], [51]–[53], [85], [145]–[157]. Some studies focus on multitemporal building observation and subsequent change analysis, where descriptors for building detection in VHR images are a critical issue. The descriptors can be categorized as template matching (e.g., the snake model) [158], knowledge based (e.g., shadow evidence and the MBI) [36], [159], and machine learning [148], [160]. For example, in [52], the MBI and the Harris detector were used to identify building areas, and then building change detection was conducted through interest point matching. Other types of methods directly explore changes in shapes, colors, and textural properties that are highly related to characteristics of buildings. For example, in [51], multitemporal variations in the MBI and spectral information were used to identify altered buildings. Likewise, in [85], the change feature generated by the MBI and spectral features was considered the indicator of building change. In [161], building changes were detected through the aggregation of spectral and textural features. Figure 13 provides building change detection results from different methods, including SVMs based on MBI features (MBI–SVM), building interest point detection using the MBI and the Harris detector, MBI-based CVA (MBI–CVA), the fusion of the MBI and spectral and shape features, CVA using morphological features, and object-based CVA. It can be seen that automatic methods can achieve performance comparable to or better than supervised ones, i.e., the MBI–SVM [Figure 13(d)]. Meanwhile, the results of the MBI–CVA [Figure 13(g)] show
more small false alarms. The fusion of the MBI and other features, e.g., the Harris detector [Figure 13(f)] and spectral and shape features [Figure 13(h)], can reduce these errors. These results illustrate that effective feature representation is the key to achieving good performance for VHR change detection.

FIGURE 12. Land use change detection in the city of Shenzhen using high-spatial-resolution satellite imagery from 2005 to 2017, including (a) land use maps and (b) an accuracy assessment with different features [144].

FIGURE 13. Building change detection maps obtained by different algorithms: (a) image (t1), (b) image (t2), (c) the reference change map, (d) the MBI–SVM, (e) object-based CVA, (f) the MBI and the Harris detector, (g) the MBI–CVA, (h) the fusion of the MBI with spectral and shape features, and (i) CVA using morphological features [51], [52].

Apart from 2D characteristics, 3D information has been exploited for building change detection in recent
studies. With easier access to 3D data, such as multiview images, 3D information indicated by angular features can be conveniently used. More importantly, misregistration caused by spatial displacement is minimized [162]. Turker and Cetinkaya [163] detected damaged buildings by calculating the difference between digital elevation maps derived from pre- and postearthquake stereo images. In [157], multichannel indicators, such as height differences and texture similarities, are fused to monitor building changes. The incorporation of angular features is effective in improving the performance of building change detection, and it has potential for quantifying 3D dynamic processes in urban renewal and development. However, due to the relatively high cost of 3D data acquisition, such as lidar and multiview UAV images, only a few studies investigate detailed building change processes in 3D space.

FIGURE 14. The annual 3D building change in subset areas of Shanghai that was achieved using multiview satellite imagery. (a) Subset area 1. (b) Subset area 2 [155].

Benefiting from time-series, multiview satellite imagery, Wen et al. [155] analyzed 3D annual building changes in inner city areas of four Chinese megacities (Beijing, Shanghai, Xi'an, and Wuhan). Their results characterized changes in the horizontal direction, such as construction and demolition, and quantified changes in the vertical direction, i.e., height and volume (Figure 14). It should be noted that uncertainty and the cost of 3D data can present a bottleneck for the development and application of 3D building change detection. Specifically, on the one hand, lidar data are relatively accurate but not recurrently acquired. On the other hand, photogrammetrically derived 3D data from multiview images are a sufficiently cost-effective alternative to lidar, but their 3D reconstruction qualities depend on metaparameters of stereo pairs (e.g., intersections, off-nadir angles, sun elevations, azimuth
angles, completeness, and time differences) [164]. Therefore, successful 3D building change detection relies on more advanced models that can produce accurate multitemporal 3D data in an economical and effective way. Very recently, deep learning has been explored for 3D reconstruction from multiview images. For example, a CNN-based method was proposed for dense image matching in [165]. This novel technique may provide a new research orientation for 3D urban change detection when vertical and height information can be accurately derived from multiview satellite images.

VEGETATION CHANGE DETECTION
Analysis of vegetation change is important to understanding ecological transitions [166]. Using VHR imagery, vegetation change can be investigated at a much finer scale, e.g., from forest stands to individual trees. In general, there are three types of vegetation changes: 1) seasonal, caused by plant phenology; 2) gradual, caused by interannual climate variability, land management, and land degradation; and 3) abrupt, caused by disturbances, e.g., urbanization, deforestation, and fires [167]. In [168], to assess seasonal changes, both spectral and textural information extracted from multiseasonal Pléiades imagery (2 m) was used for multiseasonal leaf area index (LAI) mapping. The results showed that the highest LAI occurred in midsummer, followed by late spring, autumn, and winter, and the observed seasonal change trend was similar to that based on the in situ measured LAI. Seasonal changes in the crown scale in an Amazon tropical evergreen forest were assessed by Wang et al. [169] using Planet constellation imagery with a spatial resolution of 3 m. The crown scale fraction of nonphotosynthetic vegetation showed large seasonal trend variability from June to November. As for gradual changes, Gärtner et al. [170] used QuickBird and WorldView-2 imagery to quantify tree crown diameter changes in a degraded riparian tugai forest in northwestern China, and their results indicated that the diameter increased by 1.14 m, on average, during 2005–2011. Tian et al. [171] explored DSMs from satellite stereo sensors to monitor vertical tree growth and found that periodic annual increments at the study sites were in the range of 0.3–0.5 m. In the case of abrupt change, Dalagnol et al. [172] quantified tree canopy loss and gap recovery in tropical forests where there was low-intensity logging by using WorldView-2 and GeoEye-1 images. Their study showed that VHR satellite imagery has potential for tracking small-scale human disturbances. Ardila et al. [173] identified bi-temporal tree crown elliptical objects through the iterative surface fitting of a Gaussian model to crown membership in two urban residential areas in The Netherlands using QuickBird and aerial images. A detection rate of 77% was reported for both removed and planted trees. In addition to coverage, tree crown diameters, and canopy heights, species types are an essential parameter of vegetation community structures. In particular, VHR imagery is able to identify small and highly mixed species. Since different vegetation types exhibit similar spectral characteristics, textures are often used to identify various species. For instance, Lu and He [174] investigated seasonal species variations in a tall grassland in Ontario, Canada, during the growing season (from April to December) in 2015 using UAV images.
The reflectance value, vegetation indices, and GLCM textures were used in the classification, and temporal change analysis revealed the growing process and succession of different species. Notably, some advanced methods, e.g., deep features [175], photogrammetric-derived DSMs from stereo images [176], phenological characteristics [177], and data fusion (e.g., lidar and airborne hyperspectral images) [178], have been considered for the change analysis of vegetation species. Moreover, some researchers attempted to discriminate vegetation function types, e.g., park, roadside, and residential–industrial trees in urban areas [179]. Likewise, vegetation function-type change monitoring is of great significance but has not been addressed in the current research.

MONITORING CROP CHANGES
Information about agricultural land changes, crop type conversions, and crop growth, critical for precision agriculture, can be effectively captured using VHR images. In [180], land cover data for Guanlin, Yixing City, China, in 2006, 2009, 2012, and 2015 were generated using QuickBird images, and a decrease followed by an increase in the agricultural land area was observed. Malinverni et al. [181] quantified the temporal variation of main crop rotations on the Capitanata plain of Southern Italy using WorldView-2 images, and the textural features (e.g., the GLCM and the Gabor wavelet) were employed to improve the classification accuracy. The study suggests that multitemporal classification is preferred in crop mapping, due to its rich phenological characteristics. Furthermore, frequent crop growth monitoring is extremely important for timely decision making in precision agriculture. Therefore, time-series data are recommended, although dense time series of VHR images are relatively difficult to acquire. Recently, new generation micro-/nanosatellites (e.g., Planet) and UAV systems have become available and are able to obtain time-series VHR images, which has potential for agricultural applications. For example, Sadeh et al. [182] detected sowing dates using dense time-series Planet CubeSat data with an interval of two days. As shown in Figure 15, a partly sown field was successfully detected, implying that detailed processes on a near daily basis can be monitored by dense time series of VHR data. Likewise, Bendig et al. [183] monitored plant growth based on crop surface models using stereo UAV images. Notably, height differences between cultivars and their increasing trend during the growing season can be observed. Crop change caused by disease and insect damage can also be located. VHR images are able to identify small-extent disease and insect damage, which is beneficial for controlling problems at early stages. Generally, diseases and insects can result in various kinds of harm to crop canopies, such as the removal of leaves, skeletonizing of leaf tissue,
and discoloration of leaves, and these effects vary depending on the type of disease, insect, and crop [184]. Therefore, different damage shows various spectral and structural characteristics in remote sensing images, which makes the identification of disease and insect problems via VHR images a challenging task. One of the successful applications was presented by Johansen et al. [185], where GeoEye-1 images acquired in 2012, 2013, and 2014 were used to detect canegrub damage in sugarcane fields. In the study, objects with low NDVI values and rough textures were identified as likely to be damaged, and they were further classified as low, medium, and high likelihood. Franke and Menz [186] observed different levels of disease severity in a plot of winter wheat using multitemporal QuickBird images acquired in April, May, and June. The experimental results show that VHR multispectral data are only moderately suitable for damage detection at an early growth stage, a fact attributed to the subtle spectrum and texture differences between damaged and healthy crops [187], [188]. However, VHR hyperspectral sensors seem to have potential to address this issue. For example, in [189], spectral and spatial features were extracted by a CNN from UAV hyperspectral images for the detection of yellow rust across a whole crop cycle of winter wheat. Satisfactory accuracy was achieved through all growing stages, due to the detailed spectral information and rich spatial details in VHR hyperspectral images.

FIGURE 15. A sowing detection result obtained using time-series Planet CubeSat images [182]. (a) RGB satellite imagery. (b) The change result. (c) The sowing detection result.

MONITORING LAKES AND WETLANDS
Lakes and wetlands, which play a critical role in biodiversity, ecosystems, hydrology, and climate regulation, are highly dynamic due to various natural and anthropogenic factors, such as climate change, farming, urbanization, floods, and hydrological interventions [190]. Therefore, accurate and timely monitoring of lakes and wetlands is important for
management, restoration, and protection. Many studies have used remote sensing data for monitoring lakes, from a local to a global scale. They include lake changes between 1975 and 2015 across the Yangtze floodplain in China via Landsat images [191], water clarity changes in lakes and reservoirs across China that were observed using Moderate Resolution Imaging Spectroradiometer (MODIS) data [192] from 2000 to 2017, and global surface water changes between 1984 and 2015 acquired through Landsat images [193]. In these studies, which were subject to relatively low spatial resolution, lakes with large areas were targeted. However, more than 303.6 million of the 304 million lakes at the global scale are smaller than 1 km2 [194]. Therefore, VHR remote sensing images are required for observing them. To our knowledge, however, only a few studies have focused on lake monitoring using VHR images. Cooley et al. [195] tracked water changes in the 470 lakes (0.0025–1.23 km2) in the Yukon Flats of north-central Alaska during mid to late summer (23 June to 1 October) in 2016, using Planet CubeSat images with a spatial resolution of 3 m. A time-series analysis revealed that the area of 83% of the studied lakes had decreased and that 22% of the lakes had lost more than half their surface. Notably, more applications of advanced methods of water detection through VHR images, e.g., deep learning [196] and physical approaches [197], are needed. Furthermore, information about black and odorous water [198] and water types (e.g., rivers, lakes, canals, and ponds) [199] is of increasing interest, and multitemporal monitoring is imperative. In addition to lakes, VHR images have potential for monitoring detailed changes in wetland ecosystems. In [200], the results of five-level mangrove features, including vegetation boundaries, mangrove stands, mangrove zonations, individual tree crowns, and species communities, using different data sets [Landsat (30 m), Advanced Land
Observing Satellite Advanced Visible and Near-Infrared Radiometer 2 (10 m), pan-sharpened WorldView-2 (0.5 m), and lidar] were generated and compared. As described in Figure 16, the Landsat image cannot accurately discriminate the mangrove extent, due to the mixed-pixel problem [Figure 16(e)], and more fine-scale mangrove features, i.e., tree-crown-level species, can be captured only by pan-sharpened WorldView-2 imagery [Figure 16(l)–(p)]. By summarizing the current literature, it can be found that most studies focus on detecting the extent of wetland change but ignore species change. For instance, Hu et al. [201] monitored land cover changes in the Hangzhou Xixi wetland from 2000 to 2013 using IKONOS, QuickBird, and WorldView-2 images. It was shown that the nonwetland area increased by approximately 100%, mostly in the form of herbaceous zones, followed by forests, ponds, cropland, marshes, and rivers. Wu et al. [202] integrated lidar data and multitemporal aerial imagery (1 m) to map wetland inundation dynamics in the Prairie Pothole region of North America, which is characterized by millions of small depressional wetlands. The difficulties of species change detection in wetlands lie in the following aspects. On the one hand, tidal and phenological changes make different plant species highly dynamic on daily and seasonal frequencies, respectively. On the other hand, many species have a similar spectral reflectance during the peak biomass in complex wetland landscapes [203], and the spectral signature of the same species can be influenced by many complex factors, such as the off-nadir angle, sun-viewing geometry, crown porosity, leaf clumping, and ground surface scattering [204]. For instance, in [200], mangrove species were categorized from WorldView-2 images using the nearest-neighbor classifier to extract object-based spectral and textural features within tree crowns, but a low overall accuracy of around 54% was reported. As demonstrated in Figure 16(p), misclassified open scrub Avicennia marina can be clearly observed. To improve the discriminative power among various species, the potential of VHR hyperspectral images, dense time-series data, and vertical information for characterizing detailed spectral, phenological, and height attributes needs to be explored.

FIGURE 16. Five-level mangrove features generated using different data sets [200]. (a) Level 1 TM, (b) level 1 AVNIR-2, (c) level 1 WorldView-2, (d) WorldView-2 RGB image, (e) level 2 TM, (f) level 2 AVNIR-2, (g) level 2 WorldView-2, (h) level 2 WorldView-2+LiDAR, (i) level 3 AVNIR-2, (j) level 3 WorldView-2, (k) level 3 WorldView-2+LiDAR, (l) level 4 pan-sharpened WorldView-2, (m) level 4 pan-sharpened WorldView-2+LiDAR, (n) WorldView-2 PC1,2,1, (o) level 5 pan-sharpened WorldView-2, (p) level 5 pan-sharpened WorldView-2+LiDAR, and (q) aerial photograph.

ECOSYSTEM SERVICES MONITORING
Ecosystem services link ecosystems to human welfare by regarding nature as a stock providing a flow of services (e.g., local climate regulation and water purification) [205]. Monitoring urban ecosystem services is of great value for investigating ecological function changes and can help improve the understanding of urbanization impacts on local ecological benefits. VHR satellite data can monitor spatially explicit ecosystem services at fine scales. Generally speaking, there are two categories of methods to derive ecosystem services: 1) statistical regression and radiative transfer models and 2) land use/cover-based methods [206]. Since in situ observations are not always available and the validity of statistical regression and radiative transfer models is affected by time inconsistencies between ground and remotely sensed measurements, land use/cover-based methods are often preferred.
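A minimal sketch of such a land use/cover-based budget calculation is given below: each class is assigned expert-defined supply and demand scores from a look-up table, and the scores are aggregated over the classified map. The class list and score values are placeholders for illustration, not the coefficients used in the studies cited below.

# A sketch of a land use/cover-based ecosystem service budget; the look-up
# table values are illustrative placeholders.
import numpy as np

# class id -> (supply score, demand score) per unit area (assumed values)
BUDGET_TABLE = {
    1: (5.0, 1.0),   # woodland
    2: (4.0, 0.5),   # water
    3: (2.0, 1.0),   # farmland
    4: (0.5, 4.0),   # residential
    5: (0.2, 5.0),   # commercial/industrial
}

def ecosystem_service_budget(land_cover, pixel_area_m2=16.0):
    """land_cover: 2D array of class ids; returns total supply and demand."""
    supply = demand = 0.0
    for class_id, (s, d) in BUDGET_TABLE.items():
        area = np.count_nonzero(land_cover == class_id) * pixel_area_m2
        supply += s * area
        demand += d * area
    return supply, demand

# Comparing the budgets of two classified dates quantifies the service change.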
For example, in [207], land use/cover maps of Shanghai's urban core from 2000 to 2009 were classified using IKONOS and GeoEye-1 images, and the classes were then transformed into ecosystem service supply and demand budgets, including regulating, provisioning and cultural services, and ecological integrity. An increase of at least 20% in ecosystem service supply budgets was observed, which was mainly attributed to the replacement of continuous urban fabric and industrial areas by high-rise commercial/residential areas despite a slight increase in urban green sites. Huang et al. [144] assessed ecosystem service change in Shenzhen from 2005 to 2017 using Gaofen-2 (4-m) and QuickBird (2.4-m) images. In the study, multitemporal land use maps were generated by a transferred deep CNN (as shown in Figure 12), based on which ecosystem service supply and demand values were estimated. It was found that supply capacity had decreased by 13.7% due to a reduction in woodlands, water, farmland, and so on, but, on the other hand, demand values had grown by 23.5% because urban expansion and redevelopment had increased the amount of residential, commercial, and infrastructure land. The results clearly demonstrated the ecosystem degradation of Shenzhen during the previous 10 years. Ren et al. [208] evaluated the ecosystem services of Guyuan City in 2003, 2009, and 2014 via VHR satellite imagery (e.g., QuickBird and Gaofen-1) and showed that VHR images were advantageous in the dynamic, quantitative, and visual examination of ecological changes. With VHR remote sensing images, fine-scale ecosystem services within urban areas can be effectively quantified. However, most of the current works focus on urban areas and ignore the ecosystem services of natural scenes, such as forests and wetlands. Moreover, these works present only case studies, and large-scale examinations are still lacking.

IMPERVIOUS-SURFACE CHANGE DETECTION
The change detection of impervious surfaces is important in monitoring and understanding urban development and has been extensively studied in the remote sensing literature. However, most of the existing studies monitor the change of impervious surfaces based on coarse- and medium-spatial-resolution satellite imagery, such as MODIS and Landsat [209], [210], which, on the other hand, have difficulty dealing with areas that have low impervious-surface intensities and mixed pixels [211]. During recent decades, images with high spatial resolution have provided new opportunities for subtle impervious-surface monitoring at very fine scales. However, impervious-surface monitoring using VHR imagery is a challenging task. VHR multitemporal images exhibit a large number of details (e.g., buildings, roads, driveways, and sidewalks), greater spatial heterogeneity (e.g., different viewing geometries), and occlusion by urban trees, shadow, and vertical structure layover [212]. To address the problem caused by shadow, Li et al. [213]
extracted multiscale object features and further classified shaded areas to extract impervious surfaces using QuickBird and IKONOS imagery. More recently, Zhang and Huang [214] developed a two-stage object-based classification method based on multilevel features (i.e., spectral, textural, shape, and class related) for time-series impervious-surface change detection in Shenzhen in 2003–2017, including the impervious-surface mapping of both nonshaded and shaded areas. As can be seen in Figure 17, in addition to single changes across the studied period (i.e., cases 1 and 2), some regions (e.g., case 3) experienced multiple changes.

FIGURE 17. Impervious-surface monitoring results from Shenzhen during 2003–2017. (a) Some typical cases of change profiles and (b) change detection results [214]. Red borders represent corresponding change times.

SUMMARY OF VHR CHANGE DETECTION DIMENSIONS
As suggested in [10], remote sensing change detection can be categorized according to different dimensions, e.g., input data, temporal resolutions, change categories, targets, and analysis units. Since this research focuses on VHR optical images, the input data are discussed in terms of spatial resolutions. Therefore, we divide VHR change detection studies by considering the following five categorization schemes:
1) spatial resolution: HR (2–5 m), VHR (1–2 m), and ultra-HR (UHR) (<1 m)
2) temporal resolution: bi-temporal and multitemporal
3) analysis unit: pixel, object, and patch
4) change category: binary change (BC), multiple change (MC), and directional change (DC) categories
5) targets.
In terms of the previously mentioned categorization schemes, a distribution of the literature reviewed in this study appears in Figure 18. Most articles use only bi-temporal images (78.12%) and concern binary change (66.32%). With regard to spatial resolution, 43.75% of the papers use UHR images, followed by VHR (33.33%) and HR (22.92%) images. As for analysis units, pixels and objects have almost the same number of articles, but patch-based change detection is rarely reported. Of the studies reviewed in this research, more than half involve land cover and land use change detection with multiple targets considered, followed by a series of specific targets, including buildings (20%), vegetation (10.53%), crops (8.42%), lakes and wetlands (5.26%), ecosystem services (3.16%), and impervious surfaces (2.1%).

FIGURE 18. The distribution of different dimensions for the studies reviewed in this research: (a) temporal resolution, (b) spatial resolution, (c) change categories, (d) analysis units, and (e) targets.
RECOMMENDATIONS FOR FUTURE WORK

FROM CHANGE DETECTION TO TRACKING
Most VHR change detection studies focus on bi-temporal images and multiple time series. However, change events, such as phenology and urban development, cannot be well characterized by coarse temporal observations. Frequent HR monitoring of both human and natural activities deserves much attention, especially when small satellite constellation (e.g., Planet) images become available. With time-series VHR images, change detection is advanced from simply locating variations via bi-temporal data to dense time-series monitoring [215]. There have been attempts at time-series monitoring using VHR images of buildings [155], crops [216], water [195], impervious surfaces [214], newly constructed building areas [2], forests [217], and landslides [218]. However, most of these methods are merely an extension of bi-temporal techniques by multiple pair comparisons, which is not sufficient to capture the temporal context and semantics and to support time-series analysis. Recently, VHR videos acquired by SkySat-1, Jilin-1, and the UrtheCast Iris camera have shown great potential for near-real-time target tracking from space. Most of the current change detection studies have focused on the appearance/disappearance and shape changes of objects, but studies related to tracking moving objects (e.g., ships, planes, trains, and vehicles) in VHR sequential videos are limited. In [219], the automatic detection and tracking of moving ships using satellite video was achieved based on multiscale saliency and surrounding contrast analysis. Wang et al. [220] presented a UAV-based vehicle detecting and tracking system, which jointly considered edges, optical flows, and local feature points. The first-ranked team at the 2016 IEEE Geoscience and Remote Sensing Society Data Fusion Contest designed an innovative deep neural network with an MSI and spaceborne video as input, and object activity was analyzed using the Kanade–Lucas–Tomasi key point tracker [221], [222]. During the coming years, space videos are likely to be a very important data source for Earth monitoring, and more promising studies based on VHR sequential videos can be expected, while a new era in VHR change detection that shifts from conventional multitemporal change detection to video sequential tracking may dawn. Despite the preceding attempts, change tracking using VHR videos is still in its early stage and needs to be further explored. Notably, unlike conventional videos, challenges related to satellite video processing may include the small size of moving objects (e.g., vehicles), complex backgrounds (e.g., building relief displacement in urban scenes), camera movements, and low frame rates.
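As a small illustration of the key-point tracking mentioned above, the following sketch applies the OpenCV implementation of Kanade–Lucas–Tomasi (KLT) tracking to two consecutive satellite video frames; the detector and tracker parameters are illustrative assumptions, and in practice residual camera motion would first be compensated.

# A sketch of KLT key-point tracking on two consecutive satellite video frames
# with OpenCV; detector and tracker parameters are illustrative assumptions.
import cv2

def klt_track(frame_t0, frame_t1, max_corners=500):
    """frame_t0, frame_t1: consecutive co-registered grayscale frames (uint8)."""
    points_t0 = cv2.goodFeaturesToTrack(frame_t0, max_corners, 0.01, 5)
    points_t1, status, _ = cv2.calcOpticalFlowPyrLK(frame_t0, frame_t1, points_t0,
                                                    None, winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1
    displacements = (points_t1[ok] - points_t0[ok]).reshape(-1, 2)
    # Large displacements suggest moving targets; small ones mostly reflect
    # residual camera motion, which a global affine fit could remove.
    return points_t0[ok].reshape(-1, 2), displacements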
HR GLOBAL CHANGE DETECTION
Remote sensing imagery has long been considered an effective data source for global change detection, due to its large coverage area, convenient access, and frequent revisits. Previous multitemporal global maps of land cover and thematic change detection were often generated at a relatively coarse resolution (i.e., >300 m), e.g., the 8-km-resolution global forest change based on Advanced Very High Resolution Radiometer data for 1982–1999 [223], the 500-m-resolution mapping of the global urban extent from MODIS data from 2005 and 2009 [224], [225], and the 300-m-resolution annual Climate Change Initiative Land Cover maps from 1992 to 2015 [226]. More recently, global-scale change detection with fine spatial resolution (around 30 m) has been attempted with open-source Landsat imagery. Notable examples include the Global Forest Cover database [227], the GlobeLand30 global land cover product [228], the Global Artificial Impervious Area annual maps [229], the Global Surface Water data sets of the European Commission Joint Research Centre [230], and the Global Human Settlement Layer framework [231]. Please note that 30 m is not a high spatial resolution in the common sense, but it should be regarded as high in the case of intercontinental and global mapping. Recently, Gong et al. [232] developed a 10-m-resolution global land cover map from Sentinel-2 images acquired in 2017. The trend is for global products to be developed at finer spatial and temporal resolutions, which can characterize heterogeneous and mixed areas more accurately. For instance, the Planet CubeSats are able to acquire images at a 3–5-m spatial resolution with near-real-time daily global coverage [233], which has potential for VHR global change detection in the future. In addition, cloud computing platforms, such as Google Earth Engine and Amazon Web Services, can facilitate the processing of large volumes of satellite images and speed the development of VHR global mapping [234].

HYPERSPECTRAL CHANGE DETECTION
Hyperspectral data can distinguish more detailed land cover types due to their rich spectral information. For a long time, however, the limited availability of hyperspectral images restricted real applications in precise change detection. Recently, the development of hyperspectral satellites with a relatively fine spatial resolution, e.g., Gaofen-5 (30 m, with 330 spectral bands), Tiangong-1 (10 m, with 128 spectral bands), and Zhuhai-1 (10 m, with 32 spectral bands), and airborne hyperspectral sensors, e.g., HyMap (3 m, with 126 spectral bands) and the Reflective Optics System Imaging Spectrometer (ROSIS) (1.3 m, with 115 spectral bands), has significantly increased the availability of multitemporal hyperspectral images. However, studies related to VHR hyperspectral change detection are very limited, and even the existing methodologies were developed based on synthetic data [235]. Moreover, advances in hyperspectral image classification have benefited from a set of widely used public benchmark data sets, e.g., the ROSIS Pavia University and Airborne Visible/Infrared Imaging Spectrometer Salinas data sets [236]. Therefore, there is an urgent need for public hyperspectral change detection data sets to promote the development of the related research fields.
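As an illustration of the kind of simple baseline that public bi-temporal hyperspectral data sets would enable, the sketch below computes a per-pixel change vector magnitude and spectral angle for two coregistered hyperspectral cubes using plain NumPy. The array shapes, the synthetic change region, and the percentile threshold are assumptions for demonstration only, not a method from the surveyed literature.

```python
import numpy as np

def cva_magnitude(t1, t2):
    """Change vector magnitude per pixel for two (rows, cols, bands) cubes."""
    diff = t2.astype(np.float64) - t1.astype(np.float64)
    return np.linalg.norm(diff, axis=-1)

def spectral_angle(t1, t2, eps=1e-12):
    """Spectral angle (radians) between corresponding pixel spectra."""
    a = t1.astype(np.float64)
    b = t2.astype(np.float64)
    dot = np.sum(a * b, axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return np.arccos(np.clip(dot / (norms + eps), -1.0, 1.0))

# Toy bi-temporal cube: 100 x 100 pixels, 128 bands (synthetic values).
rng = np.random.default_rng(0)
t1 = rng.random((100, 100, 128))
t2 = t1.copy()
t2[40:60, 40:60] += 0.5                      # inject a synthetic change region
mag = cva_magnitude(t1, t2)
change_map = mag > np.percentile(mag, 95)    # simple global threshold on magnitude
print(change_map.sum(), "pixels flagged as changed")
```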
URBAN FUNCTIONAL ZONE CHANGE DETECTION
Currently, the classification of urban functional zones is one of the important research areas in the interpretation of VHR remote sensing images, as urban functional zones can bridge the semantic gap between land cover and human socioeconomic activities. Current urban functional zone mapping not only involves various image features, e.g., deep [237], [238], angular [97], object based [239], and textural [240], but also refers to multisource geographic information, such as points of interest (POIs) [241], social media [242], and mobile phone positioning [100]. In rapidly urbanizing regions, the timely and accurate monitoring of urban functional zones is crucial for planning and management. However, studies on change detection in urban functional zones are lacking. Admittedly, urban functional zone change detection is a difficult task, since land cover change does not necessarily signify the conversion of a functional zone type. Meanwhile, multisource geographic data, e.g., POIs, are widely used for functional zone classification [230], but these data do not provide a time tag, which hampers the dynamic monitoring of urban functional zones. These issues should be overcome to effectively monitor changes in cities.

CONCLUSIONS
With the increasing availability of VHR remote sensing images, precise, frequent, and stereo change detection becomes possible. To the best of our knowledge, a comprehensive review of VHR change detection is lacking in the current literature. Therefore, this article aimed to summarize recent advances in VHR remote sensing image change detection, including methods and applications. The review of methods focused on feature extraction and change detectors for multitemporal VHR images. Applications including change detection for land cover and land use, impervious surfaces, buildings, crops, vegetation, lakes and wetlands, and ecosystem services were reviewed. Finally, some future directions were suggested and discussed for this important research area. Recommendations for future work include focusing on change tracking, global change detection, hyperspectral change detection, and urban functional zone change detection to generate frequent and detailed semantic change information on a global scale.

ACKNOWLEDGMENTS
The authors are grateful to the editor-in-chief, associate editor, and reviewers for their insightful comments and suggestions. This research was supported by the National Natural Science Foundation of China, under grants 41901279, 41771360, and 41971295, and the Chinese Academy of Sciences Interdisciplinary Innovation Team, under grant JCTD-2019-04. (Corresponding author: Xin Huang.)
AUTHOR INFORMATION
Dawei Wen (daweiwen@mail.hzau.edu.cn) received the B.E. degree in surveying and mapping and the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 2013 and 2018, respectively. She is a postdoctoral researcher in the College of Public Administration, Huazhong Agricultural University, Wuhan, 430070, China. Her research interests include the change analysis of multitemporal remote sensing images and remote sensing applications.
Xin Huang (xhuang@whu.edu.cn) received the Ph.D. degree in photogrammetry and remote sensing in 2009 from Wuhan University, Wuhan, China. He is a Luojia Distinguished Professor at Wuhan University, Wuhan, 430079, China, where he teaches remote sensing, photogrammetry, and image interpretation. He is the founder and director of the Institute of Remote Sensing Information Processing, School of Remote Sensing and Information Engineering, Wuhan University. He has published more than 150 peer-reviewed articles (Science Citation Index papers) in international journals. His research interests include remote sensing image processing methods and applications. He was the recipient of the Boeing Award for the Best Paper in Image Analysis and Interpretation from the American Society for Photogrammetry and Remote Sensing (ASPRS) in 2010, the second-place recipient of the John I. Davidson President's Award from ASPRS in 2018, and the winner of the IEEE Geoscience and Remote Sensing Society 2014 Data Fusion Contest. He was an associate editor of Photogrammetric Engineering and Remote Sensing (2016–2019) and of IEEE Geoscience and Remote Sensing Letters (2014–2020), and he now serves as an associate editor of IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (since 2018). He is also an editorial board member of Remote Sensing of Environment (since 2019), Science of Remote Sensing (since 2020), and Remote Sensing (since 2018). He is a Senior Member of IEEE.
Francesca Bovolo (bovolo@fbk.eu) received the B.S. and M.S. degrees in telecommunication engineering (summa cum laude) and the Ph.D. degree in communication and information technologies from the University of Trento, Italy, in 2001, 2003, and 2006, respectively, where she remained as a research fellow until June 2013. She is the founder and head of the Remote Sensing for Digital Earth unit at Fondazione Bruno Kessler, Trento, 38123, Italy, and a member of the Remote Sensing Laboratory, Trento. Her research interests include multitemporal remote sensing image analysis; change detection in multispectral, hyperspectral, synthetic aperture radar, and VHR images; time-series analysis; content-based time-series retrieval; domain adaptation; and lidar and radar sounders. She was the publication chair for the 2015 IEEE International Geoscience and Remote Sensing Symposium. She is the cochair of the Society of Photographic Instrumentation Engineers International Conference on Signal and Image Processing for Remote Sensing. She is a Senior Member of IEEE.
Jiayi Li (zjjerica@whu.edu.cn) received the B.S. degree from Central South University, Changsha, China, in 2011 and the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 2016. She is currently an assistant professor in the School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, 430079, China. She has authored more than 30 peer-reviewed articles (Science Citation Index papers) in international journals.
Her research interests include hyperspectral imagery, sparse representation, computer vision and pattern recognition, and remote sensing images. She is a reviewer for more than 10 international journals, including IEEE Transactions on Geoscience and Remote Sensing, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, IEEE Geoscience and Remote Sensing Letters, IEEE Signal Processing Letters, and International Journal of Remote Sensing. She is the guest editor of the special issue "Change Detection Using Multisource Remotely Sensed Imagery" of Remote Sensing (an open-access journal of the Multidisciplinary Digital Publishing Institute). She is a Member of IEEE.
Xinli Ke (kexl@mail.hzau.edu.cn) received the B.S. degree in land planning and utilization from Huazhong Agricultural University, Wuhan, China, in 2001 and the M.S. degree in cartography and geographical information systems and the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 2006 and 2009, respectively. He is a professor in the College of Public Administration, Huazhong Agricultural University, Wuhan, 430070, China.
Anlu Zhang (zhanganlu@mail.hzau.edu.cn) received the Ph.D. degree in 1999 from Huazhong Agricultural University, Wuhan, China. He has been a professor at Huazhong Agricultural University, Wuhan, 430070, China, since 2000. He is an executive director of the China Land Society; deputy director of the Academic Committee, China Land Society; deputy director of the Youth Working Committee, China Land Society; and a member of the Expert Committee, Land Remediation Center, Ministry of Land and Resources.
Jón Atli Benediktsson (benedikt@hi.is) received the Cand.Sci. degree in electrical engineering from the University of Iceland, Reykjavik, Iceland, in 1984, and the M.S.E.E. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, Indiana, USA, in 1987 and 1990, respectively. He is with the Faculty of Electrical and Computer Engineering, University of Iceland, Reykjavik, IS 107, Iceland. From 2009 to 2015, he was the prorector of science and academic affairs and a professor of electrical and computer engineering at the University of Iceland. In 2015, he became the rector of the University of Iceland. He is a cofounder of Oxymap, Reykjavik, a biomedical start-up company. He has authored and coauthored extensively in his fields of interest. His research interests include remote sensing, image analysis, pattern recognition, biomedical analysis of signals, and signal processing. He is a Fellow of IEEE.
REFERENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] S. Liu, Q. Du, X. Tong, A. Samat, and L. Bruzzone, “Unsupervised change detection in multispectral remote sensing images via spectral-spatial band expansion,” IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 12, no. 9, pp. 3578–3587, 2019. doi: 10.1109/JSTARS.2019.2929514. X. Huang, Y. Cao, and J. Li, “An automatic change detection method for monitoring newly constructed building areas using time-series multi-view high-resolution optical satellite images,” Remote Sensing Environment, vol. 244, p. 111,802, 2020. doi: 10.1016/j.rse.2020.111802. I. J, P. Coppin, K. Nackaerts, B. Muys, and E. Lambin, “Digital change detection methods in ecosystem monitoring: A review,” Int. J. Remote Sens., vol. 25, no. 9, pp. 1565–1596, 2004. doi: 10.1080/0143116031000101675. D. Lu, P. Mausel, E. Brondizio, and E. Moran, “Change detection techniques,” Int. J. Remote Sens., vol. 25, no. 12, pp. 2365– 2401, 2004. doi: 10.1080/0143116031000139863. A. P. Tewkesbury, A. J. Comber, N. J. Tate, A. Lamb, and P. F. Fisher, “A critical synthesis of remotely sensed optical image change detection techniques,” Remote Sens. Environ., vol. 160, pp. 1–14, 2015. doi: 10.1016/j.rse.2015.01.006. M. Hussain, D. Chen, A. Cheng, H. Wei, and D. Stanley, “Change detection from remotely sensed images: From pixelbased to object-based approaches,” ISPRS J. Photogrammetry Remote Sens., vol. 80, pp. 91–106, June 2013. doi: 10.1016/j.isprsjprs.2013.03.006. G. Chen, G. J. Hay, L. M. T. Carvalho, and M. A. Wulder, “Object-based change detection,” Int. J. Remote Sens., vol. 33, no. 14, pp. 4434–4457, 2012. doi: 10.1080/01431161.2011.648285. S. Liu, D. Marinelli, L. Bruzzone, and F. Bovolo, “A review of change detection in multitemporal hyperspectral images: Current techniques, applications, and challenges,” IEEE Geosci. Remote Sens. Mag. (replaces Newslett.), vol. 7, no. 2, pp. 140–158, 2019. doi: 10.1109/MGRS.2019.2898520. F. Bovolo and L. Bruzzone, “The time variable in data fusion: A change detection perspective,” IEEE Geosci. Remote Sens. Mag. (replaces Newslett.), vol. 3, no. 3, pp. 8–26, 2015. doi: 10.1109/ MGRS.2015.2443494. H. Si Salah, S. E. Goldin, A. Rezgui, B. Nour El Islam, and S. AitAoudia, “What is a remote sensing change detection technique? Towards a conceptual framework,” Int. J. Remote Sens., vol. 41, no. 5, pp. 1788–1812, 2020. doi: 10.1080/01431161.2019.1674463. H. Han et al., “A mixed property-based automatic shadow detection approach for VHR multispectral remote sensing images,” Appl. Sci., vol. 8, no. 10, p. 1883, 2018. doi: 10.3390/ app8101883. C. Toth and G. Jóźków, “Remote sensing platforms and sensors: A survey,” ISPRS J. Photogrammetry Remote Sens., vol. 115, pp. 22–36, May 2016. doi: 10.1016/j.isprsjprs.2015.10.004. D. Poli and T. Toutin, “Review of developments in geometric modelling for high resolution satellite pushbroom sensors,” Photogrammetric Rec., vol. 27, no. 137, pp. 58–73, 2012. doi: 10.1111/j.1477-9730.2011.00665.x. M. Dalla Mura, S. Prasad, F. Pacifici, P. Gamba, J. Chanussot, and J. A. Benediktsson, “Challenges and opportu- DECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] nities of multimodality and data fusion in remote sensing,” Proc. IEEE, vol. 103, no. 9, pp. 1585–1601, 2015. doi: 10.1109/ JPROC.2015.2462751. R. Momeni, P. Aplin, and D. 
Boyd, “Mapping complex urban land cover from spaceborne imagery: The influence of spatial resolution, spectral band set and classification approach,” Remote Sens., vol. 8, no. 2, p. 88, 2016. doi: 10.3390/rs8020088. M. Volpi, D. Tuia, F. Bovolo, M. Kanevski, and L. Bruzzone, “Supervised change detection in VHR images using contextual information and support vector machines,” Int. J. Appl. Earth Observat. Geoinf., vol. 20, pp. 77–85, Feb. 2013. doi: 10.1016/j. jag.2011.10.013. J. P. Ardila, W. Bijker, V. A. Tolpekin, and A. Stein, “Multitemporal change detection of urban trees using localized regionbased active contours in VHR images,” Remote Sensing Environ., vol. 124, pp. 413–426, 2012. doi: 10.1016/j.rse.2012.05.027. J. Gong, C. Liu, and X. Huang, “Advances in urban information extraction from high-resolution remote sensing imagery,” Sci. China Earth Sci., vol. 63, no. 4, pp. 463–475, 2020. doi: 10.1007/ s11430-019-9547-x. R. Qin, X. Huang, A. Gruen, and G. Schmitt, “Object-based 3-D building change detection on multitemporal stereo images,” IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 8, no. 5, pp. 2125–2137, 2015. doi: 10.1109/JSTARS.2015.2424275. D. Liu et al., “Integration of historical map and aerial imagery to characterize long-term land-use change and landscape dynamics: An object-based analysis via Random Forests,” Ecol. Indicators, vol. 95, pp. 595–605, Dec. 2018. doi: 10.1016/j. ecolind.2018.08.004. X. Huang, D. Wen, J. Li, and R. Qin, “Multi-level monitoring of subtle urban changes for the megacities of China using highresolution multi-view satellite imagery,” Remote Sens. Environ., vol. 196, pp. 56–75, July 2017. doi: 10.1016/j.rse.2017.05.001. G. Xian and C. Homer, “Updating the 2001 National Land Cover Database impervious surface products to 2006 using Landsat imagery change detection methods,” Remote Sens. Environ., vol. 114, no. 8, pp. 1676–1686, 2010. doi: 10.1016/j. rse.2010.02.018. M. Pesaresi et al., “A global human settlement layer from optical HR/VHR RS data: Concept and first results,” IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 6, no. 5, pp. 2102– 2131, 2013. doi: 10.1109/JSTARS.2013.2271445. L. Bruzzone and F. Bovolo, “A novel framework for the design of change-detection systems for very-high-resolution remote sensing images,” Proc. IEEE, vol. 101, no. 3, pp. 609–630, 2012. doi: 10.1109/JPROC.2012.2197169. M. Lu, J. Chen, H. Tang, Y. Rao, P. Yang, and W. Wu, “Land cover change detection by integrating object-based data blending model of Landsat and MODIS,” Remote Sens. Environ., vol. 184, pp. 374–386, Oct. 2016. doi: 10.1016/j.rse.2016.07.028. S. Ye, D. Chen, and J. Yu, “A targeted change-detection procedure by combining change vector analysis and post-classification approach,” ISPRS J. Photogrammetry Remote Sens., vol. 114, pp. 115–124, Apr. 2016. doi: 10.1016/j.isprsjprs.2016.01.018. N. Longbotham, C. Chaapel, L. Bleiler, C. Padwick, W. J. Emery, and F. Pacifici, “Very high resolution multiangle urban 93
[28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] 94 classification analysis,” IEEE Trans. Geosci. Remote Sens., vol. 50, no. 4, pp. 1155–1170, 2012. doi: 10.1109/TGRS.2011.2165548. D. Poli, F. Remondino, E. Angiuli, and G. Agugiaro, “Radiometric and geometric evaluation of GeoEye-1, WorldView-2 and Pléiades-1A stereo images for 3D information extraction,” ISPRS J. Photogrammetry Remote Sens., vol. 100, pp. 35–47, 2015/02/01/, 2015. doi: 10.1016/j.isprsjprs.2014.04.007. F. Pacifici, N. Longbotham, and W. J. Emery, “The importance of physical quantities for the analysis of multitemporal and multiangular optical very high spatial resolution images,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 10, pp. 6241–6256, 2014. doi: 10.1109/TGRS.2013.2295819. K. Jacobsen, “High resolution satellite imaging systems-an overview,” Photogrammetrie Fernerkundung Geoinf., vol. 2005, pp. 487–496, Jan. 2005. D. Wen, X. Huang, L. Zhang, and J. A. Benediktsson, “A novel automatic change detection method for urban high-resolution remotely sensed imagery based on multiindex scene representation,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 1, pp. 609– 625, 2016. doi: 10.1109/TGRS.2015.2463075. N. Tatar, M. Saadatseresht, H. Arefi, and A. Hadavand, “A robust object-based shadow detection method for cloud-free high resolution satellite images over urban areas and water bodies,” Adv. Space Res., vol. 61, no. 11, pp. 2787–2800, 2018. doi: 10.1016/j.asr.2018.03.011. A. Movia, A. Beinat, and F. Crosilla, “Shadow detection and removal in RGB VHR images for land use unsupervised classification,” ISPRS J. Photogrammetry Remote Sen., vol. 119, pp. 485– 495, Sept. 2016. doi: 10.1016/j.isprsjprs.2016.05.004. G. Liasis and S. Stavrou, “Satellite images analysis for shadow detection and building height estimation,” ISPRS J. Photogrammetry Remote Sens., vol. 119, pp. 437–450, Sept. 2016. doi: 10.1016/j.isprsjprs.2016.07.006. N. Kadhim and M. Mourshed, “A shadow-overlapping algorithm for estimating building heights from VHR satellite images,” IEEE Geosci. Remote Sens. Lett., vol. 15, no. 1, pp. 8–12, 2018. doi: 10.1109/LGRS.2017.2762424. X. Huang and L. Zhang, “A multidirectional and multiscale morphological index for automatic building extraction from multispectral GeoEye-1 imagery,” Photogrammetric Eng. Remote Sens., vol. 77, no. 7, pp. 721–732, 2011. doi: 10.14358/PERS.77.7.721. H. Song, B. Huang, and K. Zhang, “Shadow detection and reconstruction in high-resolution satellite images via morphological filtering and example-based learning,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2545–2554, 2014. doi: 10.1109/TGRS.2013.2262722. T. Blaschke et al., “Geographic object-based image analysis – Towards a new paradigm,” ISPRS J. Photogrammetry Remote Sens., vol. 87, pp. 180–191, Jan. 2014. doi: 10.1016/j.isprsjprs.2013.09.014. R. M. Haralick, K. Shanmugam, and I. H. Dinstein, “Textural features for image classification,” IEEE Trans. on systems, man, and cybernetics, vol. SMC-3, no. 6, pp. 610–621, 1973. doi: 10.1109/TSMC.1973.4309314. M. Hall-Beyer, “GLCM texture: A tutorial, version v3.0,” Univ. of Calgary, 2007. [Online]. Available: ttp://www.fp.ucalgary.ca/ mhallbey/tutorial.htm [41] S. Yao, S. Pan, T. Wang, C. Zheng, W. Shen, and Y. Chong, “A new pedestrian detection method based on combined HOG and LSS features,” Neurocomputing, vol. 151, pp. 1006–1014, Mar. 2015. doi: 10.1016/j.neucom.2014.08.080. [42] L. Zhang, X. Huang, B. Huang, and P. 
Li, “A pixel shape index coupled with spectral information for classification of high spatial resolution remotely sensed imagery,” IEEE Trans. Geosci. Remote Sens., vol. 44, no. 10, pp. 2950–2961, 2006. [43] K. Tan, X. Jin, A. Plaza, X. Wang, L. Xiao, and P. Du, “Automatic change detection in high-resolution remote sensing images by using a multiple classifier system and spectral– spatial features,” IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 9, no. 8, pp. 3439–3451, 2016. doi: 10.1109/ JSTARS.2016.2541678. [44] Z. Li, W. Shi, M. Hao, and H. Zhang, “Unsupervised change detection using spectral features and a texture difference measure for VHR remote-sensing images,” Int. J. Remote Sens., vol. 38, no. 23, pp. 7302–7315, 2017. doi: 10.1080/01431161.2017.1375616. [45] D. Peng and Y. Zhang, “Object-based change detection from satellite imagery by segmentation optimization and multifeatures fusion,” Int. J. Remote Sens., vol. 38, no. 13, pp. 3886– 3905, 2017. doi: 10.1080/01431161.2017.1308033. [46] L. Zhang, B. Zhong, and A. Yang, “Building change detection using object-oriented LBP feature map in very high spatial resolution imagery,” in Proc. 10th Int. Workshop on the Anal. Multitemporal Remote Sens. Images (MultiTemp), 2019, pp. 1–4. doi: 10.1109/Multi-Temp.2019.8866919. [47] H. Liu, M. Yang, J. Chen, J. Hou, and M. Deng, “Line-constrained shape feature for building change detection in VHR remote sensing imagery,” ISPRS Int. J. Geo-Inform., vol. 7, no. 10, p. 410, 2018. doi: 10.3390/ijgi7100410. [48] M. Dalla Mura, J. A. Benediktsson, F. Bovolo, and L. Bruzzone, “An unsupervised technique based on morphological filters for change detection in very high resolution images,” IEEE Geosci. Remote Sens. Lett., vol. 5, no. 3, pp. 433–437, 2008. doi: 10.1109/LGRS.2008.917726. [49] N. Falco, M. Dalla Mura, F. Bovolo, J. A. Benediktsson, and L. Bruzzone, “Change detection in VHR images based on morphological attribute profiles,” IEEE Geosci. Remote Sens. Lett., vol. 10, no. 3, pp. 636–640, 2013. doi: 10.1109/LGRS.2012.2222340. [50] S. Liu, Q. Du, X. Tong, A. Samat, L. Bruzzone, and F. Bovolo, “Multiscale morphological compressed change vector analysis for unsupervised multiple change detection,” IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 10, no. 9, pp. 4124– 4137, 2017. doi: /10.1109/JSTARS.2017.2712119. [51] X. Huang, L. Zhang, and T. Zhu, “Building change detection from multitemporal high-resolution remotely sensed images based on a morphological building index,” IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 7, no. 1, pp. 105–115, 2014. [52] Y. Tang, X. Huang, and L. Zhang, “Fault-tolerant building change detection from urban high-resolution remote sensing imagery,” IEEE Geosci. Remote Sens. Lett., vol. 10, no. 5, pp. 1060–1064, 2013. [53] X. Huang, T. Zhu, L. Zhang, and Y. Tang, “A novel building change index for automatic building change detection from high-resolution remote sensing imagery,” Remote sensing letters, vol. 5, no. 8, pp. 713–722, 2014. doi: 10.1080/2150704X.2014.963732. IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
[54] G. R. Cross and A. K. Jain, “Markov random field texture models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-5, no. 1, pp. 25–39, 1983. doi: 10.1109/TPAMI.1983.4767341. [55] T. Xu, I. D. Moore, and J. C. Gallant, “Fractals, fractal dimensions and landscapes—A review,” Geomorphology, vol. 8, no. 4, pp. 245–262, 1993. doi: 10.1016/0169-555X(93)90022-T. [56] C. Benedek, M. Shadaydeh, Z. Kato, T. Szirányi, and J. Zerubia, “Multilayer Markov Random Field models for change detection in optical remote sensing images,” ISPRS J. Photogrammetry Remote Sensing, vol. 107, pp. 22–37, Sept. 2015. doi: 10.1016/j. isprsjprs.2015.02.006. [57] L. Bruzzone and D. F. Prieto, “An adaptive semiparametric and context-based approach to unsupervised change detection in multitemporal remote-sensing images,” IEEE Trans. Image Process., vol. 11, no. 4, pp. 452–466, 2002. doi: 10.1109/TIP.2002.999678. [58] A. Ghosh, B. N. Subudhi, and L. Bruzzone, “Integration of Gibbs Markov random field and Hopfield-type neural networks for unsupervised change detection in remotely sensed multitemporal images,” IEEE Trans. Image Process., vol. 22, no. 8, pp. 3087–3096, 2013. doi: 10.1109/TIP.2013.2259833. [59] B. N. Subudhi, F. Bovolo, A. Ghosh, and L. Bruzzone, “Spatiocontextual fuzzy clustering with Markov random field model for change detection in remotely sensed images,” Optics Laser Technol., vol. 57, pp. 284–292, Apr. 2014. doi: 10.1016/j.optlastec.2013.10.003. [60] H. Yu, W. Yang, G. Hua, H. Ru, and P. Huang, “Change detection using high resolution remote sensing images based on active learning and Markov random fields,” Remote Sensing, vol. 9, no. 12, p. 1233, 2017. doi: 10.3390/rs9121233. [61] S. Aleksandrowicz, A. Wawrzaszek, W. Drzewiecki, and M. Krupiński, “Change detection using global and local multifractal description,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 8, pp. 1183–1187, 2016. doi: 10.1109/LGRS.2016.2574940. [62] S. Luan, C. Chen, B. Zhang, J. Han, and J. Liu, “Gabor convolutional networks,” IEEE Trans. Image Process., vol. 27, no. 9, pp. 4357–4366, 2017. [63] Z. Li, W. Shi, H. Zhang, and M. Hao, “Change detection based on Gabor wavelet features for very high resolution remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 5, pp. 783–787, 2017. doi: 10.1109/LGRS.2017.2681198. [64] C. Wei, P. Zhao, X. Li, Y. Wang, and F. Liu, “Unsupervised change detection of VHR remote sensing images based on multi-resolution Markov Random Field in wavelet domain,” Int. J. Remote Sens., vol. 40, no. 20, pp. 7750–7766, 2019. doi: 10.1080/01431161.2019.1602792. [65] Q. Li, X. Huang, D. Wen, and H. Liu, “Integrating multiple textural features for remote sensing image change detection,” Photogrammetric Eng. Remote Sens., vol. 83, no. 2, pp. 109–121, 2017. doi: 10.14358/PERS.83.2.109. [66] B. Hou, Y. Wang, and Q. Liu, “Change detection based on deep features and low rank,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 12, pp. 2418–2422, 2017. doi: 10.1109/LGRS.2017.2766840. [67] S. Saha, F. Bovolo, and L. Bruzzone, “Unsupervised deep change vector analysis for multiple-change detection in VHR images,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 6, pp. 3677–3693, 2019. doi: 10.1109/TGRS.2018.2886643. DECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE [68] T. Zhan, M. Gong, J. Liu, and P. Zhang, “Iterative feature mapping network for detecting multiple changes in multi-source remote sensing images,” ISPRS J. Photogrammetry Remote Sens., vol. 146, pp. 38–51, Dec. 2018. 
doi: 10.1016/j.isprsjprs.2018.09.002. [69] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” J. Mach. Learning Res., vol. 11, pp. 3371–3408, Dec. 2010. [70] P. Zhang, M. Gong, L. Su, J. Liu, and Z. Li, “Change detection based on deep feature representation and mapping transformation for multi-spatial-resolution remote sensing images,” ISPRS J. Photogrammetry Remote Sens., vol. 116, pp. 24–41, June 2016. doi: 10.1016/j.isprsjprs.2016.02.013. [71] L. Su, M. Gong, P. Zhang, M. Zhang, J. Liu, and H. Yang, “Deep learning and mapping based ternary change detection for information unbalanced images,” Pattern Recogn., vol. 66, pp. 213–228, June 2017. doi: 10.1016/j.patcog.2017.01.002. [72] G. Liu, L. Li, L. Jiao, Y. Dong, and X. Li, “Stacked Fisher autoencoder for SAR change detection,” Pattern Recogn., vol. 96, p. 106,971, Dec. 2019. doi: 10.1016/j.patcog.2019.106971. [73] N. Lv, C. Chen, T. Qiu, and A. K. Sangaiah, “Deep learning and superpixel feature extraction based on contractive autoencoder for change detection in SAR images,” IEEE Trans. Ind. Inf., vol. 14, no. 12, pp. 5530–5538, 2018. doi: 10.1109/TII.2018.2873492. [74] X. X. Zhu et al., “Deep learning in remote sensing: A comprehensive review and list of resources,” IEEE Geosci. Remote Sens. Mag. (replaces Newslett.), vol. 5, no. 4, pp. 8–36, 2017. doi: 10.1109/MGRS.2017.2762307. [75] T. Zhan, M. Gong, X. Jiang, and M. Zhang, “Unsupervised scale-driven change detection with deep spatial-spectral features for VHR images,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 8, pp. 1–13, 2020. doi: 10.1109/TGRS.2020.2968098. [76] S. Saha, L. Mou, C. Qiu, X. X. Zhu, F. Bovolo, and L. Bruzzone, “Unsupervised deep joint segmentation of multitemporal highresolution images,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 12, pp. 1–13, 2020. doi: 10.1109/TGRS.2020.2990640. [77] Q. Wang, X. Zhang, G. Chen, F. Dai, Y. Gong, and K. Zhu, “Change detection based on Faster R-CNN for high-resolution remote sensing images,” Remote Sensing Letters, vol. 9, no. 10, pp. 923–932, 2018. doi: 10.1080/2150704X.2018.1492172. [78] J. Liu et al., “Convolutional neural network-based transfer learning for optical aerial images change detection,” IEEE Geosci. Remote Sens. Lett., vol. 17, no. 1, pp. 127–131, 2019. doi: 10.1109/LGRS.2019.2916601. [79] M. Volpi and D. Tuia, “Dense semantic labeling of subdecimeter resolution images with convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 2, pp. 881–893, 2016. doi: 10.1109/TGRS.2016.2616585. [80] L. Gueguen and R. Hamid, “Toward a generalizable image representation for large-scale change detection: Application to generic damage analysis,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 6, pp. 3378–3387, 2016. doi: 10.1109/TGRS.2016.2516402. [81] R. Caye Daudt, B. Le Saux, A. Boulch, and Y. Gousseau, “Multitask learning for large-scale semantic change detection,” Comput. Vision Image Understanding, vol. 187, p. 102783, 2019. doi: 10.1016/j.cviu.2019.07.003. 95
[82] R. Gupta et al., “Creating xBD: A dataset for assessing building damage from satellite imagery,” in Proc. IEEE Conf. Comput. Vision and Pattern Recogn. Workshops, 2019, pp. 10–17. [83] J. Zhu, Y. Su, Q. Guo, and T. C. Harmon, “Unsupervised objectbased differencing for land-cover change detection,” Photogrammetric Eng. Remote Sens., vol. 83, no. 3, pp. 225–236, 2017. doi: 10.14358/PERS.83.3.225. [84] D. Ming, J. Li, J. Wang, and M. Zhang, “Scale parameter selection by spatial statistics for GeOBIA: Using mean-shift based multi-scale segmentation as an example,” ISPRS J. Photogrammetry Remote Sens., vol. 106, pp. 28–41, Aug. 2015. doi: 10.1016/ j.isprsjprs.2015.04.010. [85] P. Xiao, M. Yuan, X. Zhang, X. Feng, and Y. Guo, “Cosegmentation for object-based building change detection from highresolution remotely sensed images,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 3, pp. 1587–1603, 2017. doi: 10.1109/ TGRS.2016.2627638. [86] Y. Liu, Q. Guo, and M. Kelly, “A framework of region-based spatial relations for non-overlapping features and its application in object based image analysis,” ISPRS J. Photogrammetry Remote Sens., vol. 63, no. 4, pp. 461–475, 2008. doi: 10.1016/ j.isprsjprs.2008.01.007. [87] M. Kim and M. Madden, Xu, Bo, “GEOBIA vegetation mapping in great smoky mountains national park with spectral and nonspectral ancillary information,” Photogrammetric Eng. Remote Sensing, vol. 76, no. 2, pp. 137–149, 2010. doi: 10.14358/PERS.76.2.137. [88] Z. Lv, T. Liu, and J. A. Benediktsson, “Object-oriented key point vector distance for binary land cover change detection using VHR remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 9, pp. 6524–6533, 2020. doi: 10.1109/ TGRS.2020.2977248. [89] F. Bovolo, “A multilevel parcel-based approach to change detection in very high resolution multitemporal images,” IEEE Geosci. Remote Sens. Lett., vol. 6, no. 1, pp. 33–37, 2009. doi: 10.1109/LGRS.2008.2007429. [90] C. Geiß, M. Klotz, A. Schmitt, and H. Taubenböck, “Objectbased morphological profiles for classification of remote sensing imagery,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 10, pp. 5952–5963, 2016. doi: 10.1109/TGRS.2016.2576978. [91] J. Liang et al., “A comparison of two object-oriented methods for land-use/cover change detection with SPOT 5 imagery,” Sensor Lett., vol. 10, no. 1, pp. 415–424, 2012. doi: 10.1166/ sl.2012.1865. [92] W. Yu, W. Zhou, Y. Qian, and J. Yan, “A new approach for land cover classification and change analysis: Integrating backdating and an object-based method,” Remote Sensing Environment, vol. 177, pp. 37–47, May 2016. doi: 10.1016/j.rse.2016.02.030. [93] X. Zhang, S. Du, Q. Wang, and W. Zhou, “Multiscale geoscene segmentation for extracting urban functional zones from VHR satellite images,” Remote Sens., vol. 10, no. 2, p. 281, 2018. doi: 10.3390/rs10020281. [94] H. Liu, X. Huang, D. Wen, and J. Li, “The use of landscape metrics and transfer learning to explore urban villages in China,” Remote Sens., vol. 9, no. 4, p. 365, 2017. doi: 10.3390/rs9040365. [95] J. Zhou, B. Yu, and J. Qin, “Multi-level spatial analysis for change detection of urban vegetation at individual tree scale,” 96 Remote Sens., vol. 6, no. 9, pp. 9086–9103, 2014. doi: 10.3390/ rs6099086. [96] M. A. Aguilar, M. D. M. Saldana, and F. J. Aguilar, “Generation and quality assessment of stereo-extracted DSM from GeoEye-1 and WorldView-2 imagery,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 2, pp. 1259–1271, 2013. doi: 10.1109/TGRS.2013.2249521. [97] X. Huang, H. 
Chen, and J. Gong, “Angular difference feature extraction for urban scene classification using ZY-3 multi-angle high-resolution satellite imagery,” ISPRS J. Photogrammetry Remote Sens., vol. 135, pp. 127–141, Jan. 2018. doi: 10.1016/j. isprsjprs.2017.11.017. [98] H. Chaabouni-Chouayakh, I. R. Arnau, and P. Reinartz, “Towards automatic 3-D change detection through multispectral and digital elevation model information fusion,” Int. J. Image Data Fusion, vol. 4, no. 1, pp. 89–101, 2013. doi: 10.1080/19479832.2012.739577. [99] J. Tian, P. Reinartz, P. d’Angelo, and M. Ehlers, “Region-based automatic building and forest change detection on Cartosat-1 stereo imagery,” ISPRS J. Photogrammetry Remote Sens., vol. 79, pp. 226–239, May 2013. doi: 10.1016/j.isprsjprs.2013.02.017. [100] W. Tu et al., “Portraying urban functional zones by coupling remote sensing imagery and human sensing data,” Remote Sens., vol. 10, no. 1, p. 141, 2018. doi: 10.3390/rs10010141. [101] C. Liu, X. Huang, Z. Zhu, H. Chen, X. Tang, and J. Gong, “Automatic extraction of built-up area from ZY3 multi-view satellite imagery: Analysis of 45 global cities,” Remote Sens. Environ., vol. 226, pp. 51–73, June 2019. doi: 10.1016/j.rse.2019.03.033. [102] R. Duca and F. D. Frate, “Hyperspectral and multiangle CHRIS–PROBA images for the generation of land cover maps,” IEEE Trans. Geosci. Remote Sens., vol. 46, no. 10, pp. 2857–2866, 2008. doi: 10.1109/TGRS.2008.2000741. [103] Y. Yan, L. Deng, X. Liu, and L. Zhu, “Application of UAV-based multi-angle hyperspectral remote sensing in fine vegetation classification,” Remote Sens., vol. 11, no. 23, p. 2753, 2019. doi: 10.3390/rs11232753. [104] M. Zanetti and L. Bruzzone, “A theoretical framework for change detection based on a compound multiclass statistical model of the difference image,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 2, pp. 1129–1143, 2017. doi: 10.1109/ TGRS.2017.2759663. [105] Y. T. Solano-Correa, F. Bovolo, and L. Bruzzone, “An approach to multiple change detection in VHR optical images based on iterative clustering and adaptive thresholding,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 8, pp. 1–5, 2019. doi: 10.1109/ LGRS.2019.2896385. [106] J. S. Deng, K. Wang, Y. H. Deng, and G. J. Qi, “PCA‐based land‐ use change detection and analysis using multitemporal and multisensor satellite data,” Int. J. Remote Sens., vol. 29, no. 16, pp. 4823–4838, 2008. doi: 10.1080/01431160801950162. [107] A. Tahraoui, R. Kheddam, A. Bouakache, and A. Belhadj-Aissa, “Land change detection using multivariate alteration detection and Chi squared test thresholding,” in Proc. 4th Int. Conf. Adv. Technol. Signal and Image Process. (ATSIP), 2018, pp. 1–6. doi: 10.1109/ATSIP.2018.8364501. [108] C. Wu, L. Zhang, and L. Zhang, “A scene change detection framework for multi-temporal very high resolution remote IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
sensing images,” Signal Process., vol. 124, pp. 184–197, July 2016. doi: 10.1016/j.sigpro.2015.09.020. [109] X. Zhang, R. Fan, L. Ma, X. Liao, and X. Chen, “Change detection in very high-resolution images based on ensemble CNNs,” Int. J. Remote Sens., vol. 41, no. 12, pp. 4757–4779, 2020. doi: 10.1080/01431161.2020.1723818. [110] D. Peng, Y. Zhang, and H. Guan, “End-to-end change detection for high resolution satellite images using improved UNet++,” Remote Sens., vol. 11, no. 11, p. 1382, 2019. doi: 10.3390/rs11111382. [111] L. Mou, L. Bruzzone, and X. X. Zhu, “Learning spectral-spatialtemporal features via a recurrent convolutional neural network for change detection in multispectral imagery,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 2, pp. 924–935, 2019. doi: 10.1109/TGRS.2018.2863224. [112] T. Bao, C. Fu, T. Fang, and H. Huo, “PPCNET: A combined patch-level and pixel-level end-to-end deep network for high-resolution remote sensing image change detection,” IEEE Geosci. Remote Sens. Lett., vol. 17, no. 10, pp. 1–5, 2020. doi: 10.1109/LGRS.2019.2955309. [113] C. Zhang et al., “A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images,” ISPRS J. Photogrammetry Remote Sensing, vol. 166, pp. 183–200, Aug. 2020. doi: 10.1016/j.isprsjprs.2020.06.003. [114] T. Lei, Y. Zhang, Z. Lv, S. Li, S. Liu, and A. K. Nandi, “Landslide inventory mapping from bitemporal images using deep convolutional neural networks,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 6, pp. 982–986, 2019. doi: 10.1109/LGRS.2018.2889307. [115] W. Wiratama, J. Lee, S.-E. Park, and D. Sim, “Dual-dense convolution network for change detection of high-resolution panchromatic imagery,” Appl. Sci., vol. 8, no. 10, p. 1785, 2018. doi: 10.3390/app8101785. [116] W. Zhang and X. Lu, “The spectral-spatial joint learning for change detection in multispectral imagery,” Remote Sens., vol. 11, no. 3, p. 240, 2019. doi: 10.3390/rs11030240. [117] A. Song and J. Choi, “Fully convolutional networks with multiscale 3D filters and transfer learning for change detection in high spatial resolution satellite images,” Remote Sens., vol. 12, no. 5, p. 799, 2020. doi: 10.3390/rs12050799. [118] M. Zhai, H. Liu, and F. Sun, “Lifelong learning for scene recognition in remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 9, pp. 1472–1476, 2019. doi: 10.1109/LGRS.2019.2897652. [119] M. Rußwurm, S. Wang, M. Korner, and D. Lobell, “Meta-learning for few-shot land cover classification,” in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recogn. Workshops, 2020, pp. 200–201. [120] R. Hedjam, A. Abdesselam, and F. Melgani, “Change detection in unlabeled optical remote sensing data using Siamese CNN,” IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 13, pp. 4178–4187, July 2020. doi: 10.1109/JSTARS.2020.3009116. [121] H. Chen, C. Wu, B. Du, L. Zhang, and L. Wang, “Change detection in multisource VHR images via deep siamese convolutional multiple-layers recurrent neural network,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 4, pp. 2848–2864, 2020. doi: 10.1109/ TGRS.2019.2956756. [122] J. Liu, M. Gong, A. K. Qin, and K. C. Tan, “Bipartite differential neural network for unsupervised image change detection,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 3, pp. 876–890, 2020. doi: 10.1109/TNNLS.2019.2910571. DECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE [123] X. Junfeng, Z. Baoming, G. Haitao, L. Jun, and L. 
Yuzhun, “Combining iterative slow feature analysis and deep feature learning for change detection in high-resolution remote sensing images,” J. Appl. Remote Sens., vol. 13, no. 2, pp. 1–16, 2019. doi: 10.1117/1.JRS.13.024506. [124] J. Fan, K. Lin, and M. Han, “A novel joint change detection approach based on weight-clustering sparse autoencoders,” IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 12, no. 2, pp. 685–699, 2019. doi: 10.1109/JSTARS.2019.2892951. [125] A. Argyridis and D. P. Argialas, “Building change detection through multi-scale GEOBIA approach by integrating deep belief networks with fuzzy ontologies,” Int. J. Image Data Fusion, vol. 7, no. 2, pp. 148–171, 2016. [126] P. F. Alcantarilla, S. Stent, G. Ros, R. Arroyo, and R. Gherardi, “Street-view change detection with deconvolutional networks,” Autonom. Robots, vol. 42, no. 7, pp. 1301–1322, 2018. doi: 10.1007/s10514-018-9734-5. [127] R. Jing et al., “Object-based change detection for VHR remote sensing images based on a Trisiamese-LSTM,” Int. J. Remote Sens., vol. 41, no. 16, pp. 6209–6231, 2020. doi: 10.1080/01431161.2020.1734253. [128] J. Geng, J. Fan, H. Wang, and X. Ma, “Change detection of marine reclamation using multispectral images via patchbased recurrent neural network,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2017, pp. 612–615. doi: 10.1109/ IGARSS.2017.8127028. [129] H. Lyu, H. Lu, and L. Mou, “Learning a transferable change rule from a recurrent neural network for land cover change detection,” Remote Sens., vol. 8, no. 6, p. 506, 2016. doi: 10.3390/ rs8060506. [130] M. Gong, X. Niu, P. Zhang, and Z. Li, “Generative adversarial networks for change detection in multispectral imagery,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 12, pp. 2310–2314, 2017. doi: 10.1109/LGRS.2017.2762694. [131] M. Gong, Y. Yang, T. Zhan, X. Niu, and S. Li, “A generative discriminatory classified network for change detection in multispectral imagery,” IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 12, no. 1, pp. 321–333, 2019. doi: 10.1109/ JSTARS.2018.2887108. [132] S. Saha, L. Mou, X. X. Zhu, F. Bovolo, and L. Bruzzone, “Semisupervised change detection using graph convolutional network,” IEEE Geosci. Remote Sens. Lett., 2020. [133] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. Neural Information Process. Syst., 2012, pp. 1097–1105. [134] K. Simonyan and A. Zisserman, “Ver y deep convolutional n e t works for large-scale image recognition,” 2014, arXiv: 1409.1556. [135] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inceptionv4, inception-resnet and the impact of residual connections on learning,” 2016, arXiv:1602.07261. [136] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recogn., 2016, pp. 770–778. [137] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vision Pattern Recogn., 2017, pp. 4700–4708. 97
[138] Y. Wu, Z. Bai, Q. Miao, W. Ma, Y. Yang, and M. Gong, “A classified adversarial network for multi-spectral remote sensing image change detection,” Remote Sensing, vol. 12, no. 13, p. 2098, 2020. doi: 10.3390/rs12132098. [139] B. Fang, G. Chen, L. Pan, R. Kou, and L. Wang, “GAN-based Siamese framework for landslide inventory mapping using bi-temporal optical remote sensing images,” IEEE Geosci. Remote Sens. Lett., vol. 11, no. 11, pp. 1–5, 2020. doi: 10.3390/ rs11111292. [140] A. Song, Y. Kim, and Y. Han, “Uncertainty analysis for objectbased change detection in very high-resolution satellite images using deep learning network,” Remote Sensing, vol. 12, no. 15, p. 2345, 2020. doi: 10.3390/rs12152345. [141] S. I. Toure, D. A. Stow, H-c Shih, J. Weeks, and D. Lopez-Carr, “Land cover and land use change analysis using multi-spatial resolution data and object-based image analysis,” Remote Sens. Environ., vol. 210, pp. 259–268, June 2018. doi: 10.1016/j.rse.2018.03.023. [142] X. Wang, S. Liu, P. Du, H. Liang, J. Xia, and Y. Li, “Object-based change detection in urban areas from high spatial resolution images based on multiple features and ensemble learning,” Remote Sens., vol. 10, no. 2, 2018. doi: 10.3390/rs10020276. [143] M. Chini, C. Bignami, A. Chiancone, and S. Stramondo, “Classification of VHR optical data for land use change analysis by scale object seletion (SOS) algorithm,” in Proc. IEEE Geosci. Remote Sens. Symp., 2014, pp. 2834–2837. [144] X. Huang, X. Han, S. Ma, T. Lin, and J. Gong, “Monitoring ecosystem service change in the City of Shenzhen by the use of high‐ resolution remotely sensed imagery and deep learning,” Land Degradation Develop., vol. 30, no. 12, 2019. doi: 10.1002/ldr.3337. [145] G. Doxani, K. Karantzalos, and M. Tsakiri-Strati, “Monitoring urban changes based on scale-space filtering and object-oriented classification,” Int. J. Appl. Earth Observat. Geoinf., vol. 15, pp. 38–48, Apr. 2012. doi: 10.1016/j.jag.2011.07.002. [146] Z. Guo and S. Du, “Mining parameter information for building extraction and change detection with very high-resolution imagery and GIS data,” GISci. Remote Sens., vol. 54, no. 1, pp. 38– 63, 2017. doi: 10.1080/15481603.2016.1250328. [147] B. Hou, Y. Wang, and Q. Liu, “A saliency guided semi-supervised building change detection method for high resolution remote sensing images,” Sensors, vol. 16, no. 9, p. 1377, 2016. doi: 10.3390/s16091377. [148] X. Huang, H. Liu, and L. Zhang, “Spatiotemporal detection and analysis of urban villages in mega city regions of China using high-resolution remotely sensed imagery,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 7, pp. 3639–3657, 2015. doi: 10.1109/ TGRS.2014.2380779. [149] M. Janalipour and A. Mohammadzadeh, “Building damage detection using object-based image analysis and ANFIS from high-resolution image (case study: BAM earthquake, Iran),” IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 9, no. 5, pp. 1937–1945, 2016. doi: 10.1109/JSTARS.2015.2458582. [150] T. Leichtle, C. Geiß, M. Wurm, T. Lakes, and H. Taubenböck, “Unsupervised change detection in VHR remote sensing imagery–an object-based clustering approach in a dynamic urban environment,” Int. J. Appl. Earth Observat. Geoinf., vol. 54, pp. 15–27, 2017. doi: 10.1016/j.jag.2016.08.010. 98 [151] Y. Li, X. Huang, and H. Liu, “Unsupervised deep feature learning for urban village detection from high-resolution remote sensing images,” Photogrammetric Eng. Remote Sensing, vol. 83, no. 8, pp. 567–579, 2017. doi: 10.14358/PERS.83.8.567. 
[152] S. Radhika, Y. Tamura, and M. Matsui, “Cyclone damage detection on building structures from pre-and post-satellite images using wavelet based pattern recognition,” J. Wind Eng. Ind. Aerodynamics, vol. 136, pp. 23–33, 2015. doi: 10.1016/j. jweia.2014.10.018. [153] N. Sofina and M. Ehlers, “Building change detection using high resolution remotely sensed data and GIS,” IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 9, no. 8, pp. 3430–3438, 2016. doi: 10.1109/JSTARS.2016.2542074. [154] X. Tong et al., “Use of shadows for detection of earthquakeinduced collapsed buildings in high-resolution satellite imagery,” ISPRS J. Photogrammetry Remote Sensing, vol. 79, pp. 53–67, 2013. doi: 10.1016/j.isprsjprs.2013.01.012. [155] D. Wen, X. Huang, A. Zhang, and X. Ke, “Monitoring 3D building change and urban redevelopment patterns in inner city areas of Chinese megacities using multi-view satellite imagery,” Remote Sens., vol. 11, no. 7, p. 763, 2019. doi: 10.3390/rs11070763. [156] J. Tian, S. Cui, and P. Reinartz, “Building change detection based on satellite stereo imagery and digital surface models,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 1, pp. 406–417, 2014. doi: 10.1109/TGRS.2013.2240692. [157] R. Qin, “Change detection on LOD 2 building models with very high resolution spaceborne stereo imagery,” ISPRS J. Photogrammetry Remote Sens., vol. 96, pp. 179–192, Oct. 2014. doi: 10.1016/j.isprsjprs.2014.07.007. [158] A. Kovacs and T. Sziranyi, “Orientation based building outline extraction in aerial images,” ISPRS Ann. Photogrammetry, Remote Sens. Spatial Inf. Sci., vol. I-7, pp. 141–146, July 2012. doi: 10.5194/isprsannals-I-7-141-2012. [159] A. O. Ok, “Automated detection of buildings from single VHR multispectral images using shadow information and graph cuts,” ISPRS J. Photogrammetry Remote Sensing, vol. 86, pp. 21– 40, Dec. 2013. doi: 10.1016/j.isprsjprs.2013.09.004. [160] M. Vakalopoulou, K. Karantzalos, N. Komodakis, and N. Paragios, “Building detection in very high resolution multispectral data with deep learning features,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2015, pp. 1873–1876. [161] M. Janalipour and M. Taleai, “Building change detection after earthquake using multi-criteria decision analysis based on extracted information from high spatial resolution satellite images,” Int. J. Remote Sens., vol. 38, no. 1, pp. 82–99, 2017. doi: 10.1080/01431161.2016.1259673. [162] X. Huang and Y. Wang, “Investigating the effects of 3D urban morphology on the surface urban heat island effect in urban functional zones by using high-resolution remote sensing data: A case study of Wuhan, Central China,” ISPRS J. Photogrammetry Remote Sens., vol. 152, pp. 119–131, June 2019. doi: 10.1016/j. isprsjprs.2019.04.010. [163] M. Turker and B. Cetinkaya, “Automatic detection of earthquake‐ damaged buildings using DEMs created from pre‐ and post‐earthquake stereo aerial photographs,” Int. J. Remote Sens., vol. 26, no. 4, pp. 823–832, 2005. doi: 10.1080/01431160512331316810. IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
The CCSDS 123.0-B-2 "Low-Complexity Lossless and Near-Lossless Multispectral and Hyperspectral Image Compression" Standard

A comprehensive review

MIGUEL HERNÁNDEZ-CABRONERO, AARON B. KIELY, MATTHEW KLIMESH, IAN BLANES, JONATHAN LIGO, ENRICO MAGLI, AND JOAN SERRA-SAGRISTÀ

The Consultative Committee for Space Data Systems (CCSDS) published the CCSDS 123.0-B-2, "Low-Complexity Lossless and Near-Lossless Multispectral and Hyperspectral Image Compression" standard. This standard extends the previous issue, CCSDS 123.0-B-1, which supported only lossless compression, while maintaining backward compatibility. The main novelty of the new issue is support for near-lossless compression, i.e., lossy compression with user-defined absolute and/or relative error limits in the reconstructed images. This new feature is achieved via closed-loop quantization of prediction errors. Two further additions arise from the new near-lossless support: first, the calculation of predicted sample values using sample representatives that may not be equal to the reconstructed sample values, and, second, a new hybrid entropy coder designed to provide enhanced compression performance for low-entropy data, prevalent when nonlossless compression is used. These new features enable significantly smaller compressed data volumes than those achievable with CCSDS 123.0-B-1 while controlling the quality of the decompressed images. As a result, larger amounts of valuable information can be retrieved given a set of bandwidth and energy consumption constraints.

BACKGROUND
During the past 30 years, multispectral imaging and hyperspectral imaging (HSI) have become a staple tool used for geoscience remote sensing and Earth observation [1],
[2]. This type of imagery enables the simultaneous registration of multiple parts of the electromagnetic spectrum, providing invaluable information for many detection, classification, and unmixing problems [3]. As a result, today, remote sensing HSI is used in many commercial, scientific, and defense areas, including precision agriculture, mining, forestry, coastal and oceanic observation, intelligence, and disaster monitoring [3]–[6]. Due to the growing quantity of deployed sensors [7], the number of public and private remote sensing stakeholders [4], and the ongoing effort to improve the analysis of retrieved images [8]–[19], the importance of HSI is likely to increase in the future. Images produced by multispectral and hyperspectral sensors consist of multiple spectral bands, instead of the three—red, green, and blue—present in traditional color images. Depending on the application and the available hardware, the number of registered bands can be on the order of tens, hundreds, and even thousands [20]. Thus, HSI generates significantly larger volumes of data compared to traditional imagers. Moreover, the spatial resolution of the deployed sensors also follows a rising trend, further increasing the amount of data produced. For instance, the HyspIRI sensor developed by NASA can produce up to 5 TB of data per day [21]. However, the downlink channel capacity between the remote sensing devices and the ground stations is constrained, which limits the amount and quality of the retrieved data [22]. Data compression is typically applied to reduce the amount of data to be downloaded, hence improving effective transmission capacity [23]–[27]. Due to hardware and energy constraints, employed algorithms must be tailored to attain a beneficial tradeoff between complexity and efficiency [22], [28]. When lossless compression is applied to images, the resulting compressed data suffice to reconstruct identical copies of the originals. On the other hand, lossy compression enables the transmission of even smaller data volumes at the cost of the reconstructed images not being identical to the originals. Among lossy compression algorithms, those that provide user-controlled bounds on the maximum error introduced in any sample are referred to as near lossless. In spite of the distortion introduced by lossy and nearlossless methods, several studies have concluded that reconstructed images can be successfully used for the intended analysis tasks [29]. This is sometimes observed for compressed images up to 25-times smaller than the original ones [30]. Notwithstanding, a successful analysis can be performed only when the amount of loss is adequate for the type of images and the task at hand [29], [31]. One of the main advantages of near-lossless compressors is that they guarantee the accuracy of all the reconstructed samples in an image. This is in contrast to regular lossy compression approaches, which typically provide competitive average distortion results but no assurance about the fidelity of any given set of samples. Regardless of the employed compression regime, compression algorithms DECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE must meet very stringent limitations in terms of complexity and required computational resources [32]. This constraint is particularly relevant for small satellites and CubeSats, which have attracted much scientific and industrial interest recently [4], [33]. 
The CCSDS, founded in 1982, publishes the standards for spaceflight communication used in more than 900 space missions to date. (An updated list of space missions using CCSDS standards can be found at https://public. ccsds.org/implementations/missions.aspx.) CCSDS standards enable cooperation among space agencies and with industrial associates, seeking enhanced interoperability, reliability, and cost-effectiveness. The latest CCSDS compression standard (CCSDS 123.0-B-2, “Low-Complexity Lossless and Near-Lossless Multispectral and Hyperspectral Image Compression [34]), the central topic of this article, supersedes CCSDS 123.0-B-1 [35] while maintaining backward compatibility. In the CCSDS naming convention, suffixes “-1” and “-2” denote the first and second issues, respectively, of a standard. Hereafter, CCSDS 123.0-B-1 and CCSDS 123.0-B-2 are also denoted as Issue 1 and Issue 2, respectively. Perhaps the most relevant novel feature of Issue 2 is a new near-lossless compression regime, enabled by a closed-loop scalar quantizer in the prediction stage [36]. Note that this in-loop quantization approach enables a higher compression performance than quantization of input samples before prediction [26]. With this new feature, users can specify the maximum error limits—absolute and/or relative—introduced in the decompressed images. Fidelity settings can vary from band to band and can be periodically updated within an image. Another new feature of Issue 2 is a hybrid entropy coder option. It is specifically designed to provide improved performance on low-entropy data, i.e., for the case when prediction errors tend to be small compared to the quantizer stepsize. The hybrid encoder extends the sample-adaptive codes of CCSDS 123.0-B-1 with 16 additional variable-to-variablelength codes, which can represent multiple input symbols using a single codeword. To guarantee backward compatibility, both lossless and near-lossless compression can be performed with either of CCSDS 123.0-B-1’s original entropy coders or with the new hybrid option. A third novelty in the new standard is a new mode within the predictor stage called narrow local sums, which are designed to facilitate the design of efficient hardware implementations. Yet another change introduced in the new standard is added support for optional supplementary information tables, which can provide ancillary image or instrument information, e.g., to identify the wavelengths associated with each spectral band. This article provides a comprehensive overview of Issue 2, paying special attention to the new concepts and capabilities not present in Issue 1. The content hereafter presented extends those presented in a previous conference work [36]. The following overview is more in depth, it assumes no 103
previous knowledge of Issue 1, and a performance evaluation is included. Furthermore, the experimental results discussed here complement those in [37] by providing both a quantitative and qualitative comparison to other relevant compression methods. THE NEW CCSDS 123.0-B-2 STANDARD PREVIOUS WORK The CCSDS Data Compression Working Group (1995–2007; 2020–present) and the Multispectral and Hyperspectral Data Compression (MHDC) Working Group (2007–2020) have developed and maintained several compression standards applicable to remote sensing HSI, listed chronologically in Table 1. The CCSDS 121.0-B-1 standard describes a general-purpose adaptive entropy coder. In CCSDS 121.0B-2, the efficiency and flexibility of this entropy coder was enhanced by allowing larger block sizes and the possibility of using a restricted set of codewords. (As this entropy coder is available in the new CCSDS 123.0-B-2 standard, an overview is provided later in the “Block-Adaptive Coder” section). The CCSDS 122.0-B-1 standard was designed specifically for image data and supports both lossless and lossy regimes. It consists of a spatial discrete wavelet transform, which is then followed by a bit-plane coder. The CCSDS 122.1-B-0 standard extends CCSDS 122.0-B-1 by allowing the application of spectral decorrelation transforms. To provide compatibility between the 122.0 and 122.1 standards, a second issue of 122.0 (CCSDS 122.0-B-2) was also published. Finally, the CCSDS 123.0-B-1 standard formalizes a predictive coding scheme for multispectral and hyperspectral data. This standard is the immediate predecessor of the one addressed in this article, and their functional blocks are described in subsequent subsections. Several hardware implementations can be found in the literature of the CCSDS 123.0-B-1 standard. In [44], a parallelization technique is described that achieves from 31 to 123 Megasamples per second (Ms/s), respectively, on the Xilinx V-7 XC7VX690T and V-5QV FX130T field-programmable TABLE 1. A CHRONOLOGY OF CCSDS DATA-COMPRESSION STANDARDS. NAME RELEASE STATUS REGIME MULTISPECTRAL 121.0-B-1 [38] May 1997 Retired LL No 122.0-B-1 [39] May 2005 Retired LL, LS No 121.0-B-2 [40] April 2012 Retired LL No 123.0-B-1 [35] May 2012 Retired LL Yes 122.0-B-2 [41] September 2017 Active LL, LS No 122.1-B-1 [42] Active LL, LS Yes September 2017 123.0-B-2 [34] February 2019 Active LL, NL Yes 121.0-B-3 [43] August 2020 Active LL No The active recommendations (blue books) are shown in blue while retired (superseded) standards (silver books) are presented in gray. Lossless, lossy, and near-lossless compression regimes are denoted as LL, LS, and NL, respectively. The “multispectral” column indicates whether or not several bands can be compressed simultaneously. 104 gate arrays (FPGAs). In [45], parallelization using C-slow retiming is proposed, which achieves a throughput of up to 213 Ms/s on a space-grade Virtex-5QV FPGA. In [46], another implementation, this one with a throughput of 147 Ms/s on a Xilinx Zynq-7020 FPGA, is described. The FPGA design discussed in [47] allows parallel processing of any number of samples, provided that resource constraints are met. This enables configurable tradeoffs between throughput and power consumption. In [48], a low-cost FPGA design is described for the prediction block of CCSDS 123.0-B-1, with a throughput as high as 20 Ms/s on a Xilinx Zynq-7000 FPGA. In [49]– [51], low-complexity and low-occupancy FPGA designs are proposed. 
These implementations are designed to be independent and combinable in a plug-and-play fashion. The latest version of this system, referred to as SHyLoC 2.0, yields a throughput of 150 Ms/s on a Xilinx Virtex XQR5VFX130 FPGA. The hardware designs for CCSDS 123.0-B-2 are currently ongoing, with the European Commission funding two research projects within the framework of the Horizon 2020 (H2020) program [52], [53] and with NASA and the European Space Agency funding other research projects [54], [55]. To the best of our knowledge, there are no public implementations of Issue 2 available. The extensions to CCSDS compression algorithms have been published as well. In [56], a method to extend lossless predictive coding schemes—in particular, CCSDS 123.0B-1—was proposed. This method enables compression in a lossy regime, producing constant signal-to-noise ratio (SNR) and accurate rate control. In [57], a lightweight arithmetic coder was proposed as a possible replacement for the entropy coder of CCSDS 123.0-B-1. Some algorithms have been suggested related to the prediction stage of Issue 2, based on recursive least-squares theory. These algorithms describe more adaptive prediction methods at the cost of increased computational complexity. In [58], the inverse correlation matrix of the local differences is used to update the prediction weights. In [59], this predictor is enhanced by adaptively selecting the number of local differences to be used. In [60], two prediction modes are described: the first uses only spectral neighbors in the weight update process; the second also employs spatial neighbors. The best of the two for each band in terms of mean absolute error is selected for coding. In [61], the image is divided into nonoverlapping regions, which allows for parallel application of the methods described in [59] and [60]. OVERVIEW OF THE NEW STANDARD The CCSDS 123.0-B-2 standard is based on the fast lossless extended (FLEX) compressor [62]. In turn, FLEX is based on the fast lossless (FL) compressor [63], which was formalized as CCSDS 123.0-B-1. FLEX improves upon FL by adding adjustable lossy compression capabilities while maintaining the option to perform lossless compression. The latest CCSDS compression standard extends FLEX by adding new features, such as relative error limits, periodic error limit updating, and new prediction modes that facilitate hardware IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
implementations. Very importantly, Issue 2 has been designed to retain many of FL’s desirable properties, including low computational complexity; single-pass compression and decompression; automatic adaptation to the data being compressed; and the ability to operate requiring a constant, reasonably sized memory space. Moreover, Issue 2 inherits all the capabilities of CCSDS 123.0-B-1, allowing decompression of the data output by the latter. These features make both issues of the CCSDS 123.0 standard suitable for use onboard spaceborne systems, including small satellite missions. Note that compressed images do not include synchronization markers or any other similar scheme. It is assumed that the transport layer will provide the ability to locate the next image in the event of a bit error or data loss. The general structure of the Issue 2 compressor is shown in Figure 1. Similar to CCSDS 123.0-B-1, the input data—signed or unsigned integers—go through a predictor stage in which previously coded information is employed to predict the value of the next sample to be compressed. As a main novelty of Issue 2, prediction errors are uniformly quantized. The quantization bin sizes are determined by the user’s choice of absolute error limit (i.e., the maximum allowed absolute difference between the original and reconstructed sample values) and/or the relative error limit, which controls the maximum ratio of the error to the sample’s predicted value. Quantized data are then mapped to nonnegative integers, which then are input to the entropy coder. When nonzero error limits are selected, quantizer indices represent approximations of the aforementioned prediction errors, instead of the actual values. In this case, data output by the predictor stage typically exhibit lower entropy rates, which allows the coder to produce smaller compressed files. To make decompression possible, the decoder must be able to make the same predictions as the encoder. To guarantee this, when nonzero error limits are selected, prediction is done using so-called sample representatives instead of the original samples. The following sections provide an informative description of the aforementioned functional blocks. For the sake of readability, some definitions in this description are simplified so as to not contemplate boundary cases, e.g., the image edges when neighboring samples are involved. Interested readers are referred to [34] for complete, normative definitions. A list of the symbols employed hereafter is available in Table 2 for ease of reference. PREDICTOR STAGE The predictor stage is designed to process input samples sequentially in a single pass, producing one mapped quantizer index per input sample. Although CCSDS 123.0-B-1 was designed to accept input samples of, at most, 16 bits, Issue 2 accepts bit depths, D, up to 32 bits. Hereafter, s z (t) denotes the tth sample of the zth spectral band in raster scan order, and d z (t) is its corresponding mapped quantizer index. To obtain d z (t), a prediction of the sample’s original value, denoted as ts z (t), is computed as described in the “Prediction” section, and the prediction error is computed as D z (t) = s z (t) - ts z (t). (1) This prediction error is then quantized, as discussed in the “Quantization” section, to produce a quantizer index q z (t). This index is mapped to a nonnegative value: d z (t), the output of the predictor stage, as described in the “Quantizer Index Mapping” section. 
The quantizer index is also transformed into its corresponding sample representative s''_z(t), as described in the "Sample Representatives" section. These representatives are then used to obtain the predicted values, ŝ_z(t), used in (1). As mentioned previously in this section, the sample value prediction must be based on s''_z(t) instead of s_z(t) to avoid compressor–decompressor prediction differences when compression is not lossless.

FIGURE 1. A structure overview of the CCSDS 123.0-B-2 compressor. The new functional blocks with respect to CCSDS 123.0-B-1 are highlighted in blue, while the modified blocks are shown in green.
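As an illustration of this closed loop, the following Python sketch compresses one band under an absolute error limit only. It is a deliberately minimal toy, not the normative algorithm: the predictor is simply the previous reconstructed value (the standard's adaptive predictor is described below), the bin center doubles as the sample representative, and all names are illustrative.

```python
# Toy illustration of closed-loop (in-loop) quantization with an absolute error
# limit a_z, in the spirit of the predictor stage described above. The predictor
# used here is a deliberate simplification (the previous reconstructed sample);
# the standard's adaptive predictor is described in the following sections.

def sgn(x):
    return (x > 0) - (x < 0)

def closed_loop_quantize(samples, a_z, s_min=0, s_max=2**16 - 1):
    """Return quantizer indices and reconstructed values for one band."""
    indices, reconstructed = [], []
    prev_rep = (s_min + s_max + 1) // 2            # mid-range start (toy choice)
    for s in samples:
        s_hat = prev_rep                           # toy prediction from past output
        delta = s - s_hat                          # prediction error
        m = a_z                                    # absolute error limit only
        q = sgn(delta) * ((abs(delta) + m) // (2 * m + 1))
        s_rec = min(s_max, max(s_min, s_hat + q * (2 * m + 1)))  # clipped bin centre
        assert abs(s - s_rec) <= m                 # near-lossless guarantee
        indices.append(q)
        reconstructed.append(s_rec)
        prev_rep = s_rec        # feed the reconstruction, not the original, back in
    return indices, reconstructed

if __name__ == "__main__":
    band = [503, 510, 498, 702, 696, 1203]
    idx, rec = closed_loop_quantize(band, a_z=2)
    print(idx)   # mostly small indices around zero after the first sample
    print(rec)   # every value within +/-2 of the original
```

Because the encoder predicts from its own reconstructions rather than from the original samples, the decoder can reproduce the same predictions exactly, which is the property that makes the in-loop scheme work.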
TABLE 2. A LIST OF SYMBOLS REFERENCED IN THIS ARTICLE.

GENERAL
s_z(t): Original sample value (tth sample of spectral band z)
D: Dynamic range in bits
s_min, s_max: Minimum and maximum allowed sample values
s_mid: Midrange sample value
N_X, N_Y: Horizontal and vertical spatial dimensions of the image
N_Z: Number of spectral bands of the image

SAMPLE REPRESENTATIVE CALCULATION
s''_z(t): Sample representative for s_z(t)
Θ: Sample representative resolution
φ_z: Sample representative damping for band z
ψ_z: Sample representative offset for band z
ŝ_z(t): Predicted value for s_z(t)
Ω: Prediction weight arithmetic resolution
P: Number of previous bands used for the prediction

PREDICTION
s_{z,y,x}: Alternative notation for s_z(t)
σ_{z,y,x}: Local sum for s_{z,y,x}
d_{z,y,x}, d^N_{z,y,x}, d^W_{z,y,x}, d^{NW}_{z,y,x}: Local differences for s_{z,y,x}
U_{z,y,x}: Local difference vector for s_{z,y,x}
W_{z,y,x}: Prediction weight vector for s_{z,y,x}
d̂_{z,y,x}: Predicted central local difference for s_{z,y,x}
s̃_z(t): Double-resolution predicted value for s_z(t)
v_min, v_max, t_inc, g_z^(i), g_z^*: User-specified weight update parameters

QUANTIZATION
Δ_z(t): Prediction error for s_z(t)
q_z(t): Quantizer index of Δ_z(t)
s'_z(t): Clipped quantizer bin center for Δ_z(t)
a_z: Maximum absolute error in the spectral band z
r_z: Maximum relative error in the spectral band z
m_z(t): Maximum reconstruction error |s_z(t) − s'_z(t)|

QUANTIZER INDEX MAPPING
δ_z(t): Mapped quantizer index for q_z(t)
θ_z(t): Scaled difference between ŝ_z(t) and the closest of s_min and s_max

ENTROPY CODING
U_max: Golomb-power-of-2 (GPO2) length limit
Σ_z(t): Accumulator value for δ_z(t)
C(t): Counter value for δ_z(t)
γ*: Sample-adaptive rescaling counter size
k_z(t): GPO2 code index for δ_z(t)
Σ̃_z(t): High-resolution accumulator value for δ_z(t)
i: Hybrid code index
T_i: Hybrid code entropy-threshold constants
L_i: Hybrid code symbol-limit constants
X: Hybrid code escape symbol

QUANTIZATION
The CCSDS 123.0-B-2 standard allows for quantization of each prediction error Δ_z(t) into a quantizer index q_z(t) so that Δ_z(t)—and, thus, also the input sample s_z(t)—can be reconstructed with maximum error m_z(t). A quantizer with uniform bin size 2·m_z(t) + 1 is used, i.e.,

q_z(t) = sgn(Δ_z(t)) · ⌊ (|Δ_z(t)| + m_z(t)) / (2·m_z(t) + 1) ⌋,  (2)

where the sgn function is defined as

sgn(x) = 1 if x > 0;  0 if x = 0;  −1 if x < 0.  (3)

The users control m_z(t) indirectly by selecting the maximum absolute error a_z, the maximum relative error r_z, or both for each spectral band z. When only absolute error limits are specified,

m_z(t) = a_z.  (4)

When only relative error limits are set,

m_z(t) = ⌊ r_z · |ŝ_z(t)| / 2^D ⌋,  (5)

where ŝ_z(t) is the predicted value for the original sample s_z(t). Setting relative error limits allows for the reconstruction of different samples with dissimilar degrees of precision. More specifically, the samples predicted to have a smaller magnitude are reconstructed with lower error. Note that predicted, rather than actual, sample values are used in (5) to keep the encoder and the decoder synchronized. Thus, absolute error bounds are not guaranteed when only a relative error limit r_z > 0 is specified. When both the absolute and relative error limits are used, m_z(t) is set to the minimum of (4) and (5). When lossless compression is desired in band z, users may set a_z = 0 or r_z = 0 so that m_z(t) = 0. This guarantees that q_z(t) = Δ_z(t), i.e., that the original samples can be reconstructed exactly. It is worth emphasizing that error limits can be set individually for each spectral band.
With this mechanism, higher-importance bands can be reconstructed with greater fidelity (even perfect fidelity), while lesser-priority bands can be represented with lower fidelity using smaller compressed data volumes [56], [64]–[66]. Furthermore, the periodic error limit update option can be activated so that different fidelity choices can be adapted within a band. This option is useful to meet a given downlink transmission rate constraint and/or to better preserve the image regions expected to contain features of interest. It should be highlighted that the standard does not define a specific method for selecting error limit values, e.g., to meet a given downlink rate. This is because error limit values are encoded in the bitstream, and thus, the decoder does not need to know how those error limits were selected.
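The following sketch shows how the per-sample bound m_z(t) follows from the user's choices under the definitions above; the helper name max_error and its defaults are illustrative, not part of the standard.

```python
# Sketch of the per-sample error bound m_z(t) implied by the absolute limit a_z
# and/or the relative limit r_z, following (4) and (5) above. D is the bit depth
# and s_hat the predicted value; None marks a limit that is not in use.

def max_error(s_hat, a_z=None, r_z=None, D=16):
    bounds = []
    if a_z is not None:
        bounds.append(a_z)                       # absolute limit, (4)
    if r_z is not None:
        bounds.append((r_z * abs(s_hat)) >> D)   # relative limit, (5)
    return min(bounds) if bounds else 0          # both set: minimum of (4) and (5)

# Under a relative limit, dim samples must be reconstructed more accurately,
# while the absolute limit caps the allowed error everywhere:
for s_hat in (200, 2000, 20000):
    print(s_hat, max_error(s_hat, a_z=16, r_z=1024, D=16))
# -> 200 3, 2000 16, 20000 16 (the absolute cap of 16 binds for brighter samples)

print(max_error(12345, a_z=0))   # -> 0, i.e., lossless coding of that band
```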
SAMPLE REPRESENTATIVES
The decompressor must duplicate the prediction operation performed by the compressor, but, in general, the original image samples s_z(t) cannot be perfectly reconstructed from the compressed bitstream because of information lost during the quantization stage. Consequently, the prediction calculation (in both the compressor and decompressor) is performed using sample representatives s''_z(t) in place of the original samples s_z(t).

A naive solution to this problem is to use the central point s'_z(t) of the quantizer bin, whose index q_z(t) is transmitted to the decoder. The quantizer bin center s'_z(t) can be calculated as

s'_z(t) = clip( ŝ_z(t) + q_z(t)·(2·m_z(t) + 1), s_min, s_max ),  (6)

where s_min and s_max are the minimum and maximum values, respectively, allowed for an input sample and

clip(x, a, b) = min(b, max(a, x))  (7)

guarantees that s'_z(t) falls within the allowed value range. However, using the quantizer bin center s'_z(t) as the sample representative s''_z(t) for prediction does not always minimize compressed data volume [37]. This is true even for m_z(t) = 0, i.e., lossless compression.

In the CCSDS 123.0-B-2 standard, three user-specified parameters can be used to adjust the choice of s''_z(t). These are the sample representative resolution (Θ), damping (φ_z), and offset (ψ_z) parameters. Based on them, sample representatives s''_z(t) are defined as an integer approximation to

(φ_z / 2^Θ)·ŝ_z(t) + (1 − φ_z / 2^Θ)·( s'_z(t) − (ψ_z / 2^Θ)·sgn(q_z(t))·m_z(t) ).  (8)

Regardless of the parameter choice, the sample representatives always fall between s'_z(t) and ŝ_z(t). Parameter Θ determines the precision with which representatives are computed. Parameter φ_z limits the effect of noisy samples in the representative calculation. In turn, parameter ψ_z establishes a bias toward s'_z(t) or ŝ_z(t), depending on its value. Although Θ is defined for the whole image, φ_z and ψ_z can be chosen on a band-by-band basis. Setting φ_z = ψ_z = 0 causes the sample representatives to be equal to s'_z(t); larger values of φ_z and/or ψ_z produce representatives closer to ŝ_z(t). Note that, depending on the parameter choice, s''_z(t) may not be contained in the quantizer bin identified by q_z(t). The empirical results indicate that setting the damping and offset parameters to values other than zero tends to provide larger benefits to compression performance when spectral bands are closer in wavelength and for images with larger noise prevalence [37].
PREDICTION
The predicted sample value ŝ_z(t) for an input sample s_z(t) is computed causally using sample representatives from spectral bands z − P, …, z, where P ≥ 0 is a user-defined parameter. Within each band, previous sample representatives are used to compute local sums. These can be regarded as preliminary, scaled estimates of the actual sample value. Local sums, in combination with the sample representatives, are used to compute local differences. The predicted value ŝ_z(t) is then calculated using the local sum in the current band z as well as a weighted sum of local differences from the current and previous bands. Local sums can be understood as a local mean subtraction, and prediction as being made in the mean-subtracted domain. Figure 2 displays an overview of the prediction process. Its stages are more precisely described in the following.

FIGURE 2. An overview of the prediction block in CCSDS 123.0-B-2.

Local sums are computed from previous sample representatives using one of the four available modes. Similar to CCSDS 123.0-B-1, each mode is either neighbor- or column-oriented. As a novelty of Issue 2, modes can now be narrow instead of wide. The sample representatives used to calculate the local sums depend on the selected mode, as depicted in Figure 3. In the figure and hereafter, s_{z,y,x} is used to denote the current sample s_z(t), which makes explicit the band index z as well as the spatial coordinates (x, y) within the band. In all of the modes, the highlighted sample representatives are multiplied by the factor indicated in Figure 3 and added together to obtain the local sum σ_{z,y,x} corresponding to s_{z,y,x}. For instance, the narrow neighbor-oriented local sums are computed as

σ_{z,y,x} = s''_{z,y−1,x−1} + 2·s''_{z,y−1,x} + s''_{z,y−1,x+1}.  (9)

As can be observed in Figure 3, column-oriented local sums employ sample representatives at the same x coordinate, whereas neighbor-oriented sums also use sample representatives at contiguous x coordinates. In turn, the new narrow option removes the dependency on s''_{z,y,x−1}, which facilitates pipelining in a hardware implementation at the cost of some compression performance loss [37]. Note that wide and narrow column-oriented modes are identical in the general case. Notwithstanding, only the wide column-oriented mode uses s''_{z,y,x−1} for calculating local sums at the first row, i.e., y = 0, of each spectral band.

FIGURE 3. The local sum calculation modes available in Issue 2. The current sample position is highlighted with a blue border. The sample representatives employed for the corresponding local sum are shown in orange. (a) Wide neighbor-oriented, (b) narrow neighbor-oriented, and (c) column-oriented.

Local differences are computed based on the sample representatives and the local sums. For an input sample s_{z,y,x}, up to four local difference types are computed: the central difference (d_{z,y,x}) and three directional differences, i.e., north (d^N_{z,y,x}), west (d^W_{z,y,x}), and northwest (d^{NW}_{z,y,x}). They are defined as follows:

d_{z,y,x} = 4·s''_{z,y,x} − σ_{z,y,x},
d^N_{z,y,x} = 4·s''_{z,y−1,x} − σ_{z,y,x},
d^W_{z,y,x} = 4·s''_{z,y,x−1} − σ_{z,y,x},
d^{NW}_{z,y,x} = 4·s''_{z,y−1,x−1} − σ_{z,y,x}.  (10)

The predicted sample value is then computed using either the full or reduced prediction modes. In the full prediction mode, the local difference vector U_{z,y,x} is defined using directional differences from the current spectral band and central differences from the previous bands:

U_{z,y,x} = [ d^N_{z,y,x}, d^W_{z,y,x}, d^{NW}_{z,y,x}, d_{z−1,y,x}, …, d_{z−P,y,x} ].  (11)

In the reduced prediction mode, the local difference vector uses only central differences from previous bands:

U_{z,y,x} = [ d_{z−1,y,x}, …, d_{z−P,y,x} ].  (12)

In both modes, a prediction weight vector W_{z,y,x} is used to obtain a weighted sum of local differences, called the predicted central local difference, as

d̂_{z,y,x} = W^T_{z,y,x} U_{z,y,x}.  (13)

The predicted sample is then calculated as an integer approximation to

( d̂_{z,y,x} + 2^Ω·σ_{z,y,x} ) / 2^{Ω+2},  (14)

where Ω is a parameter that controls arithmetic precision. The initial prediction weight vector for each band, W_{z,0,0}, can be defined based on default or user-provided values. In either case, vector elements are updated after processing each input sample s_z(t). The updates are based on the obtained prediction error and several user-defined parameters, namely, v_min, v_max, g_z^(i), g_z^*, and t_inc, which control the rate at which weights are adapted to the original image statistics. More precisely, the smaller values of g_z^(i), g_z^*, v_min, v_max, and 1/t_inc typically produce larger weight updates. This results in a faster adaptation to the source statistics at the cost of worse steady-state compression performance [37].

It is important to highlight that the existence of two prediction modes (full and reduced) as well as two different local mean types (column and neighbor oriented) is present in Issue 2 so that prediction is effective for the image data produced by different types of instruments. For instance, when streaking artifacts are present in the images, reduced column-oriented prediction tends to produce the best results [37].
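As a rough illustration of (9)–(14), the following sketch computes one full-mode prediction with narrow neighbor-oriented local sums. It assumes interior coordinates (no boundary handling), uses a fixed toy weight vector, and omits the weight update rule; names such as predict_full are illustrative.

```python
import numpy as np

# Rough sketch of one full-mode prediction using the narrow neighbor-oriented
# local sum of (9). `rep` holds sample representatives indexed [z, y, x]; the
# coordinates are assumed to be interior (no boundary handling), the weight
# vector is a fixed toy value, and the weight update rule is omitted.

def narrow_neighbor_local_sum(rep, z, y, x):                        # (9)
    return rep[z, y-1, x-1] + 2 * rep[z, y-1, x] + rep[z, y-1, x+1]

def local_differences(rep, z, y, x, sigma):                         # (10)
    central   = 4 * rep[z, y, x]     - sigma
    north     = 4 * rep[z, y-1, x]   - sigma
    west      = 4 * rep[z, y, x-1]   - sigma
    northwest = 4 * rep[z, y-1, x-1] - sigma
    return central, north, west, northwest

def predict_full(rep, z, y, x, weights, P, omega):
    sigma = narrow_neighbor_local_sum(rep, z, y, x)
    _, d_n, d_w, d_nw = local_differences(rep, z, y, x, sigma)
    central_prev = [
        local_differences(rep, z - p, y, x,
                          narrow_neighbor_local_sum(rep, z - p, y, x))[0]
        for p in range(1, P + 1)
    ]
    U = np.array([d_n, d_w, d_nw] + central_prev)                    # (11)
    d_hat = int(weights @ U)                                         # (13)
    return (d_hat + 2**omega * sigma) // 2**(omega + 2)              # (14)

rng = np.random.default_rng(0)
rep = rng.integers(0, 1024, size=(4, 5, 5))    # 4 bands, 5 x 5 samples
w = np.full(3 + 3, 2**13)                      # toy weights, P = 3, full mode
print(predict_full(rep, z=3, y=2, x=2, weights=w, P=3, omega=19))
```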
QUANTIZER INDEX MAPPING
The prediction errors Δ_z(t) obtained in (1), as well as their corresponding quantizer indices q_z(t) defined in (2), may be negative. However, the entropy coders available in CCSDS 123.0-B are defined for nonnegative input values. The quantizer index mapping stage depicted in Figure 1 provides a one-to-one mapping between valid quantizer indices and nonnegative values, referred to as mapped quantizer indices and denoted as δ_z(t). This functional block remains unaltered with respect to the previous issue of the standard [35]. A key property of this mapping is that indices can be represented using the same number of bits as in the original image. This is true because predicted values are guaranteed to satisfy ŝ_z(t) ∈ [s_min, s_max]; i.e., predictions do not exceed the range of allowed sample input values given bit depth D. Thus, the number of possible prediction errors equals the number of elements in the aforementioned interval. Based on this, the mapping is defined as

δ_z(t) = |q_z(t)| + θ_z(t),    if |q_z(t)| > θ_z(t),
δ_z(t) = 2·|q_z(t)|,           if 0 ≤ (−1)^{s̃_z(t)}·q_z(t) ≤ θ_z(t),    (15)
δ_z(t) = 2·|q_z(t)| − 1,       otherwise,

where s̃_z(t) is a double-resolution version of the predicted sample value defined in the "Prediction" section, and θ_z(t) is the difference between the predicted value and the nearest interval endpoint, i.e.,

θ_z(t) = min( ⌊ (ŝ_z(t) − s_min + m_z(t)) / (2·m_z(t) + 1) ⌋,    (16)
              ⌊ (s_max − ŝ_z(t) + m_z(t)) / (2·m_z(t) + 1) ⌋ ).   (17)
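A small sketch of the mapping in (15)–(17) is given below. The double-resolution predicted value s̃_z(t) is approximated here as 2·ŝ_z(t) purely for illustration; the standard defines it exactly within the prediction stage.

```python
# Sketch of the quantizer index mapping of (15)-(17). The double-resolution
# predicted value is approximated as 2*s_hat purely for illustration.

def theta(s_hat, m, s_min, s_max):                          # (16)-(17)
    return min((s_hat - s_min + m) // (2 * m + 1),
               (s_max - s_hat + m) // (2 * m + 1))

def map_index(q, s_hat, m, s_min, s_max):                   # (15)
    s_tilde = 2 * s_hat                                     # illustrative stand-in
    th = theta(s_hat, m, s_min, s_max)
    if abs(q) > th:
        return abs(q) + th
    if 0 <= (-1) ** s_tilde * q <= th:
        return 2 * abs(q)
    return 2 * abs(q) - 1

# Small positive and negative indices interleave into small nonnegative values:
for q in (-3, -1, 0, 1, 2):
    print(q, "->", map_index(q, s_hat=500, m=0, s_min=0, s_max=1023))
# -3 -> 5, -1 -> 1, 0 -> 0, 1 -> 2, 2 -> 4
```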
ENCODER STAGE
The encoder stage compresses the sequence of mapped quantizer indices δ_z(t) produced by the predictor stage into a variable-length bitstream. This operation is reversible, meaning that an identical sequence of mapped quantizer indices can be recovered from the bitstream. These indices allow for an exact or approximate reconstruction of the input image, depending on the error limits set in the predictor stage. In Issue 2, three coders are available for this purpose: sample adaptive, block adaptive, and hybrid. The user must select one of them to code all the mapped quantizer indices for an image. The first two encoding options were already present in the previous issue of the standard [35], while the hybrid coder in Issue 2 is new. The hybrid coder tends to provide better compression performance than the other two options, but the benefit may be small when compression is lossless. An overview of the three available coders is provided in the following sections.

BLOCK-ADAPTIVE CODER
A block-adaptive coder is a separate CCSDS standard, originally specified in [38] and later extended in [40], based on Rice coding. In this coder, the samples are partitioned into disjoint blocks of a fixed length between eight and 64 samples. Each block is encoded using the most effective of five available coding methods: zero block, second extension, fundamental sequence, sample splitting, and no compression. A simplified diagram of this process is shown in Figure 4. Interested readers are referred to [67] for a summary of key operational concepts and a detailed performance analysis of this coder.

FIGURE 4. An overview of the block-adaptive entropy coder. The coding options executed in parallel for each block are highlighted in orange.

SAMPLE-ADAPTIVE CODER
In the sample-adaptive coder, each mapped quantizer index δ_z(t) is compressed using a variable-length codeword from a family of length-limited Golomb-power-of-2 (GPO2) codes. Each GPO2 code is identified by an index k, which is selected based on the statistics of previously coded samples. Given k and δ_z(t), the selected codeword is denoted as R_k(δ_z(t)) and defined as follows:
◗ If ⌊δ_z(t)/2^k⌋ < U_max, R_k(δ_z(t)) consists of ⌊δ_z(t)/2^k⌋ zeros, followed by a one, followed by the k least-significant bits of the binary representation of δ_z(t).
◗ Otherwise, R_k(δ_z(t)) consists of U_max zeros, followed by the binary representation of δ_z(t) using D bits.
Here, U_max is a user-specified parameter utilized to limit the maximum codeword length, and D is the image's bit depth.

Two variables are used to keep track of the input data statistics and to choose the GPO2 family's index k_z(t) to code δ_z(t): an accumulator Σ_z(t) and a counter C(t). The ratio of these two variables determines k_z(t):
◗ If 2·C(t) > Σ_z(t) + ⌊49·C(t)/2^7⌋, then k_z(t) = 0.
◗ Otherwise, k_z(t) is the largest positive integer such that

k_z(t) ≤ D − 2,   C(t)·2^{k_z(t)} ≤ Σ_z(t) + ⌊49·C(t)/2^7⌋.  (18)

Mapped quantizer indices typically follow a nonstationary geometric distribution, for which k_z(t) is a good parameter estimator. Note that the counter and accumulator variables are initialized based on user-specified parameters. The values of the counter and the accumulator variables are updated after coding each input sample δ_z(t − 1). More specifically, C is increased by one, and Σ is increased by δ_z(t − 1). In addition, both C and Σ are periodically divided by two (rounding down) to enable calculation using finite-precision arithmetic. This division is hereafter referred to as renormalization.
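The following sketch illustrates the length-limited GPO2 codeword and the selection rule (18); initialization, renormalization, and the exact update schedule are omitted, and the function names are illustrative.

```python
# Sketch of the length-limited GPO2 codeword and of the accumulator/counter rule
# (18) that selects k in the sample-adaptive coder.

def gpo2_codeword(delta, k, u_max, d_bits):
    """Bit string coding the mapped index `delta` under the kth GPO2 code."""
    prefix = delta >> k
    if prefix < u_max:
        lsb = format(delta & ((1 << k) - 1), "b").zfill(k) if k > 0 else ""
        return "0" * prefix + "1" + lsb
    return "0" * u_max + format(delta, "b").zfill(d_bits)

def select_k(acc, cnt, d_bits):
    """Largest k with cnt * 2**k <= acc + floor(49 * cnt / 2**7), capped at D - 2."""
    if 2 * cnt > acc + (49 * cnt >> 7):
        return 0
    k = 0
    while k + 1 <= d_bits - 2 and cnt << (k + 1) <= acc + (49 * cnt >> 7):
        k += 1
    return k

print(gpo2_codeword(delta=13, k=2, u_max=18, d_bits=16))   # '000' + '1' + '01'
print(select_k(acc=500, cnt=32, d_bits=16))                # -> 4, roughly log2 of the mean index
```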
HYBRID CODER
The hybrid coder uses the statistics of previously encoded data to classify each input mapped quantizer index as either a high- or low-entropy sample. The high-entropy samples are compressed using a variation of the length-limited GPO2 code family described in the "Sample-Adaptive Coder" section. The low-entropy samples are coded using another family of 16 variable-to-variable-length codes, i.e., several input samples can be encoded with a single codeword. A detailed description of these variable-to-variable-length codes can be found in [68]. The ability to adaptively switch between GPO2 and variable-to-variable-length codes gives this coder the name hybrid.

Variable-to-variable-length codes enable very efficient compression of highly predictable (low-entropy) samples, which become more prevalent when near-lossless error limits are used in the predictor stage. Meanwhile, variable-to-variable-length codes introduce variability in the latency between the arrival of a low-entropy mapped quantizer index and the output of the codeword that encodes it. To accommodate this, codewords emitted by the hybrid coder are designed so that they can be decoded in reverse order. This is possible thanks to two main properties of the coder. First, output codewords are suffix-free rather than prefix-free. Second, the compressed image ends with a specification of the final state of the coder. A set of flush tables is provided in the standard to signal the code states in an unambiguous and compact manner. Reverse decoding allows for simpler and more memory-efficient implementations than does FLEX's original hybrid entropy coder [62]. The remainder of this section describes Issue 2's hybrid coder. A flow diagram of this coder's logic is provided in Figure 5 to support this description.

FIGURE 5. A flow diagram of CCSDS 123.0-B-2's hybrid coder. The logical decisions are highlighted in orange, the processes that update the codes' internal state are shown in green, and the processes that emit codewords are presented on a purple background.

The classification of samples as high or low entropy is performed using a statistical approach similar to that of the sample-adaptive coder. Two variables are used to keep track of these statistics: a counter C(t) and a high-resolution accumulator Σ̃_z(t). These variables are updated the same way as in the sample-adaptive coder, with two main differences. First, the variables are updated before coding the input sample; this is done so that decoding can proceed in reverse order. To this effect, the least-significant bit of the accumulator variable is output before renormalization so that the decoder can invert this process. Second, Σ̃_z(t) is increased by 4·δ_z(t) instead of δ_z(t) to enable a more precise estimation of the input data statistics. The ratio Σ̃_z(t)/C(t) determines whether a sample is a high- or low-entropy symbol. More specifically, δ_z(t) is defined as high entropy if and only if

Σ̃_z(t)·2^{14} ≥ T_0·C(t),  (19)

where T_0 is a constant provided in the standard. This definition allows image regions that are well predicted to be coded with the low-entropy codes, and the high-entropy mode to be used otherwise.

Each high-entropy sample is encoded using a family of reversed, length-limited GPO2 codes. As in the sample-adaptive case, each code is identified by an index, k_z(t). For the hybrid coder, k_z(t) is the largest positive integer that satisfies

k_z(t) ≤ max(D − 2, 2),   C(t)·2^{k_z(t)+2} ≤ Σ̃_z(t) + ⌊49·C(t)/2^5⌋.  (20)

The codeword emitted for the high-entropy sample δ_z(t), R′_{k_z(t)}(δ_z(t)), is defined as follows:
◗ If ⌊δ_z(t)/2^{k_z(t)}⌋ < U_max, then R′_{k_z(t)}(δ_z(t)) consists of the k_z(t) least-significant bits of the binary representation of δ_z(t), followed by a one, followed by ⌊δ_z(t)/2^{k_z(t)}⌋ zeros.
◗ Otherwise, R′_{k_z(t)}(δ_z(t)) consists of the D-bit binary representation of δ_z(t), followed by U_max zeros.

The low-entropy samples are processed with one of 16 available variable-to-variable-length codes. The code index used to process a low-entropy sample δ_z(t) is the largest i satisfying

Σ̃_z(t)·2^{14} < C(t)·T_i,   0 ≤ i ≤ 15,  (21)

where T_0, …, T_15 are constants provided in the standard, and T_0 is the constant used in (19). This definition allows the magnitude of recent prediction errors to determine the next variable-to-variable-length code to be used. Each code i has a prefix of previously input samples. When a sample is processed, a symbol is added to the corresponding code's prefix. The standard defines a list of complete prefixes for each code. At this point, if code i's prefix matches any of those complete prefixes, a codeword that uniquely identifies that prefix and its associated sequence of input samples is emitted. After that, the prefix for that code is cleared. It is worth noting that the complete prefixes defined for code i cannot contain sample values satisfying δ_z(t) > L_i, where L_0, …, L_15 are constants defined in the standard. When such a sample, referred to as an unlikely sample, is processed, R′_0(δ_z(t) − L_i − 1) is emitted, and an escape symbol X is added to the prefix instead of δ_z(t). Adding X to any code's prefix is guaranteed to make it complete and trigger the emission of an output codeword. The input symbol limit L_i limits the size of the input alphabet in the low-entropy codes by treating all of the unlikely symbols in the same way. This enables us to reduce the number of codewords in a code. As escape symbols occur with low probability, the efficiency with which these residual values are encoded has only a small impact on the overall coding effectiveness.

COMPRESSION PERFORMANCE

EXPERIMENTAL SETUP
The lossless and near-lossless compression performance of Issue 2 is evaluated in this section. The results are provided for both the block- and sample-adaptive entropy coders already present in Issue 1 and are compared to those of the new hybrid coder defined in Issue 2. The hybrid coder's computational complexity is comprehensively addressed in [69], so execution time results are not presented here. The empirical results were obtained using a varied corpus of 17 multispectral images, 38 hyperspectral images, and two sounder data samples.
These were generated by 14 different instruments deployed in real missions, except for the Pleiades images, which are simulated. Most of the images included are raw, giving more weight to the direct compression of images as they are acquired, while the nonraw instances processed after acquisition are also included to represent some possible onboard calibration. Both pushbroom and whiskbroom sensors are covered in the corpus and include the streaking artifacts that are characteristic of pushbroom instruments (such as Hyperion) in uncalibrated images. A diverse range of spectral separations is considered, and examples of images with significant noise levels (the Moon Mineralogy Mapper) or that are acquired with airborne instruments [the Compact Airborne Spectrographic Imager (CASI)] are included as well. Regarding the dynamic range, all the hyperspectral and sounding instruments produce data with bit depths of at least 11 bits, whereas, for multispectral instruments, samples of lower bit depths are available, too. A summary of this corpus, produced by the CCSDS MHDC Working Group, is provided in Table 3. All of the images are publicly available, except for those produced by the Infrared Atmospheric Sounding Interferometer (IASI) and Meteosat Second Generation instruments, due to licensing restrictions. (The download links for the test images can be found at http://cwe.ccsds.org/sls/docs/sls-dc/123.0-B-Info/TestData.) The “Entropy” column in the table represents the zero-order entropy of the images. Note that this is not a strict bound on compression efficiency and should be regarded as only an assessment of the difficulty of compressing the images. The performance results are obtained by invoking Issue 2’s compressor with the default set of parameters described in [37], except for the Hyperion, IASI, Moderate Resolution Imaging Spectroradiometer (MODIS), and Système Pour l’Observation de la Terre 5 (SPOT5) instruments. For these, the following parameters are modified to enhance compression performance: t inc = 2 9, v min = v max = 0, U max = 32, c* = 11, and c 0 = 4. A full prediction with wide, neighbor-oriented local sums is used in most of the images, including the 111
four aforementioned instruments. Column-oriented local sums are employed for images that present streaking artifacts, i.e., when the average sample values exhibit strong differences for contiguous x positions. A full analysis of the impact of parameter tuning on performance, as well as an identification of the images with streaking artifacts, can be found in [37].

To provide a comparison baseline, the authors' implementation of CCSDS 122.1-B-1, the reference implementation of the JPEG-LS standard, and the original authors' implementation of multiband context-based adaptive lossless image coding (M-CALIC) [70] are included in the comparison as well. (Note that the employed JPEG-LS implementation is available at https://github.com/thorfdbg/libjpeg; to attain lossless and near-lossless compression, this compressor was invoked with parameter −ls 0.) For CCSDS 122.1-B-1, the best-performing configuration in terms of rate distortion, i.e., the float discrete wavelet transform (DWT) and the spectral pairwise-orthogonal transform (POT), is used. JPEG-LS is arguably the best-known compression standard; it offers low complexity and supports both lossless and near-lossless regimes. In turn, M-CALIC is another low-complexity algorithm well known for its competitive compression performance. Note that, because JPEG-LS does not admit an arbitrary number of spectral bands, images are reshaped by concatenating the bands along the y-axis. More specifically, an image with a width, height, and number of bands equal to $N_X$, $N_Y$, and $N_Z$, respectively, is transformed into a one-band image with the same width and a height of $N_Y \cdot N_Z$. No attempt is made to perform decorrelation across spectral bands for JPEG-LS. In contrast, M-CALIC is designed specifically to exploit spectral redundancy in hyperspectral images.
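The band-concatenation reshaping just described can be sketched with NumPy as follows; the array layout (band-first cube) and sizes are illustrative assumptions, not the code used for the experiments.

```python
import numpy as np

# Illustrative sketch of the band-concatenation reshaping described above:
# a cube with N_Z bands of size N_Y x N_X is fed to JPEG-LS as a single-band
# image of width N_X and height N_Y * N_Z (bands stacked along the y-axis).
N_Z, N_Y, N_X = 224, 512, 680          # e.g., an AVIRIS-sized cube (assumed layout)
cube = np.random.randint(0, 2**15, size=(N_Z, N_Y, N_X), dtype=np.uint16)

one_band = cube.reshape(N_Z * N_Y, N_X)  # rows of band 0 first, then band 1, ...
assert one_band.shape == (N_Y * N_Z, N_X)
```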
TABLE 3. A SUMMARY OF THE EMPLOYED CORPUS PROPERTIES. THE ENTROPY (IN BITS) IS AVERAGED FOR ALL OF THE IMAGES IN EACH ROW.

| INSTRUMENT | ACRONYM | IMAGE TYPE | BIT DEPTH D | ENTROPY | NUMBER OF BANDS | WIDTH | HEIGHT | NUMBER OF IMAGES |
| Atmospheric Infrared Sounder | AIRS | Raw | 12 | 11.2 | 1,501 | 90 | 135 | 1 |
| Airborne Visible/Infrared Imaging Spectrometer | AVIRIS | Raw | 15 | 12.6 | 224 | 680 | 512 | 1 |
| — | — | Raw | 10 | 8.6 | 224 | 614 | 512 | 1 |
| — | — | Calibrated | 13 | 10.3 | 224 | 677 | 512 | 13 |
| Compact Airborne Spectrographic Imager | CASI | Raw | 12, 13, and 15 | 11.6 | 72 | 406 | 1,225 | 3 |
| Compact Reconnaissance Imaging Spectrometer for Mars | CRISM | FRT, raw | 11 | 10.1 | 107 | 640 | 510 | 2 |
| — | — | FRT, raw | 12, 13 | 10.4 | 438 | 640 | 510 | 2 |
| — | — | FRT, raw | 12, 13 | 10.6 | 545 | 640 | 510 | 2 |
| — | — | HRL, raw | 12, 13 | 11.2 | 545 | 320 | 450 | 2 |
| — | — | MSP, raw | 11 | 9.8 | 74 | 64 | 2,700 | 2 |
| Hyperion | Hyperion | Raw | 12 | 8.5 | 242 | 256 | 1,024 | 3 |
| Infrared Atmospheric Sounding Interferometer | IASI | Calibrated | 12 | 11 | 8,461 | 66 | 60 | 1 |
| Landsat | Landsat | Raw | 8 | 6.6 | 6 | 1,024 | 1,024 | 3 |
| Moon Mineralogy Mapper | M3 | Target, raw | 12 | 9.7 | 260 | 640 | 512 | 2 |
| — | — | Global, raw | 11, 12 | 9.4 | 86 | 320 | 512 | 2 |
| Moderate Resolution Imaging Spectroradiometer | MODIS | Night, raw | 12 | 10.8 | 17 | 1,354 | 2,030 | 2 |
| — | — | Day, raw | 12, 13 | 8.6 | 14 | 1,354 | 2,030 | 2 |
| — | — | 500 m, raw | 12, 13 | 11.1 | 5 | 2,708 | 4,060 | 2 |
| — | — | 250 m, raw | 12 | 10.4 | 2 | 5,416 | 8,120 | 2 |
| Meteosat Second Generation | MSG | Calibrated | 10 | 8.2 | 11 | 3,712 | 3,712 | 1 |
| Pleiades High Resolution | Pleiades | High resolution, simulated | 12 | 10.8 | 4 | 224 | 2,465 | 1 |
| — | — | High resolution, simulated | 12 | 10.2 | 4 | 224 | 2,448 | 3 |
| SWIR Full Spectrum Imager | SFSI | Calibrated | 15 | 9.9 | 240 | 452 | 140 | 1 |
| — | — | Raw | 9, 11 | 7.4 | 240 | 496 | 140 | 2 |
| Système Pour l'Observation de la Terre 5 High Resolution Geometric | SPOT5 | HRG, processed | 8 | 6.8 | 3 | 1,024 | 1,024 | 1 |
| Vegetation | Vegetation | Raw | 10 | 9.4 | 4 | 1,728 | 10,080 | 2 |

FRT: full-resolution target; HRL: half-resolution long; MSP: multispectral survey; HRG: high-resolution geometric.
LOSSLESS COMPRESSION RESULTS
Lossless compression results are obtained for all of Issue 2's entropy coders, for JPEG-LS, and for M-CALIC by setting the absolute error limit to zero. For each image $I$ in the test corpus, the compression ratio is defined as

$$\mathrm{CR}(I) = \frac{N_X \cdot N_Y \cdot N_Z \cdot D}{\text{compressed data size (bits)}}. \qquad (22)$$

Based on this definition, higher compression ratio values indicate better compression. A distribution of the obtained compression ratios for each compressor is shown in Figure 6. Vertical bar heights indicate the relative frequency of each range of compression ratios. The average compression ratio, plus/minus one standard deviation, is denoted with a dot and two horizontal bars. Note that the aggregated results presented here and in the "Near-Lossless Compression Results" section are not necessarily representative of any particular image or instrument. This is due to their different statistical properties and the fact that a different number of images is available for each instrument.

FIGURE 6. A distribution of lossless compression ratios.

As can be observed, all three entropy coders in Issue 2 yield similar compression ratio distributions and average values. In turn, JPEG-LS and M-CALIC produce average compression ratios 25% and 13% lower, respectively, than those of Issue 2. These differences can be explained by the more advanced predictor stage used in Issue 2. To provide further insight, the average compression ratios grouped by instrument are shown in the "Lossless" columns of Table 4 for Issue 2 using the hybrid coder, for JPEG-LS, and for M-CALIC. Consistent with the previous discussion, the CCSDS compressor yields higher compression efficiency than do JPEG-LS and M-CALIC for most instruments. Improvements of up to 63.7% and 63.4%, respectively, can be observed. Only for the MODIS instrument does JPEG-LS perform better, yielding an average compression ratio 7.7% higher than Issue 2's with the hybrid coder. In turn, M-CALIC improves upon JPEG-LS in all cases and is able to yield results between 0.3% and 8.9% better than Issue 2 for five of the tested instruments. These differences can be explained by the fact that M-CALIC employs an arithmetic entropy coder, which enables better modeling of the source's statistics, although at the cost of higher computational complexity.

NEAR-LOSSLESS COMPRESSION RESULTS
Near-lossless compression results are obtained for all three entropy coders in CCSDS 123.0-B-2 as well as for JPEG-LS and M-CALIC by limiting the maximum absolute error in any pixel of the reconstructed images. This error is hereafter denoted as the peak absolute error (PAE). Two illustrative examples of near-lossless compression using Issue 2 and JPEG-LS are provided in Figure 7.
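For reference, a minimal sketch of the compression ratio of (22) is shown below. The helper name and the file path are illustrative assumptions; any actual evaluation script would differ.

```python
import os

def compression_ratio(n_x, n_y, n_z, bit_depth, compressed_path):
    """Compression ratio as defined in (22): original size in bits divided by
    the compressed size in bits. All names here are illustrative."""
    original_bits = n_x * n_y * n_z * bit_depth
    compressed_bits = os.path.getsize(compressed_path) * 8
    return original_bits / compressed_bits

# Example for an AVIRIS-sized cube compressed to a hypothetical file:
# print(compression_ratio(680, 512, 224, 15, "scene0_lossless.cmp"))
```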
TABLE 4. THE AVERAGE COMPRESSION RATIO RESULTS GROUPED BY INSTRUMENT.

| INSTRUMENT | CCSDS 123.0-B-2 (hybrid coder): Lossless | PAE 1 | PAE 2 | PAE 5 | PAE 16 | JPEG-LS: Lossless | PAE 1 | PAE 2 | PAE 5 | PAE 16 | M-CALIC: Lossless | PAE 1 | PAE 2 | PAE 5 | PAE 16 |
| AIRS | 2.86 | 4.56 | 6.09 | 10.76 | 35.74 | 1.89 | 2.51 | 2.95 | 3.97 | 6.68 | 2.87 | 4.51 | 5.92 | 9.95 | 27.51 |
| AVIRIS | 3.11 | 5.28 | 7.66 | 15.29 | 37.52 | 1.90 | 2.56 | 3.03 | 4.09 | 7.06 | 3.01 | 4.82 | 6.31 | 10.23 | 21.36 |
| CASI | 2.29 | 3.22 | 3.96 | 5.91 | 12.21 | 1.66 | 2.08 | 2.36 | 2.96 | 4.38 | 2.27 | 3.17 | 3.87 | 5.63 | 11.09 |
| CRISM | 3.10 | 5.05 | 6.87 | 11.15 | 22.93 | 2.20 | 3.08 | 3.71 | 5.14 | 8.17 | 2.21 | 3.21 | 4.01 | 6.08 | 13.43 |
| Hyperion | 2.86 | 4.57 | 6.09 | 10.80 | 44.75 | 2.44 | 3.56 | 4.48 | 6.76 | 13.22 | 2.79 | 4.36 | 5.72 | 9.59 | 28.38 |
| IASI | 2.53 | 3.75 | 4.70 | 7.17 | 14.96 | 1.92 | 2.56 | 3.01 | 4.05 | 7.12 | 2.48 | 3.64 | 4.55 | 6.94 | 15.75 |
| Landsat | 2.35 | 4.12 | 6.24 | 12.8 | 41.88 | 2.13 | 3.68 | 5.09 | 8.46 | 20.33 | 2.37 | 3.97 | 5.4 | 9.25 | 19.51 |
| M3 | 4.38 | 7.44 | 9.61 | 14.27 | 24.28 | 2.72 | 4.15 | 5.29 | 7.27 | 10.33 | 2.68 | 4.17 | 5.42 | 8.86 | 22.49 |
| MODIS | 1.94 | 2.60 | 3.07 | 4.12 | 7.35 | 2.09 | 2.77 | 3.24 | 4.27 | 6.95 | 2.13 | 2.72 | 3.22 | 4.35 | 7.39 |
| MSG | 2.77 | 4.49 | 6.06 | 10.01 | 24.18 | 2.64 | 4.20 | 5.39 | 8.08 | 14.78 | 2.73 | 4.12 | 5.31 | 8.22 | 17.45 |
| Pleiades | 1.66 | 2.12 | 2.43 | 3.11 | 5.04 | 1.62 | 2.06 | 2.36 | 3.01 | 4.64 | 1.68 | 2.16 | 2.49 | 3.23 | 5.18 |
| SFSI | 3.07 | 5.18 | 7.02 | 11.97 | 53.21 | 2.58 | 3.75 | 4.65 | 6.99 | 16.5 | 2.91 | 4.39 | 5.65 | 9.13 | 30.05 |
| SPOT5 | 1.55 | 2.21 | 2.74 | 4.22 | 10.00 | 1.45 | 2.03 | 2.48 | 3.63 | 6.69 | 1.54 | 2.22 | 2.74 | 4.07 | 8.90 |
| Vegetation | 1.95 | 2.77 | 3.40 | 5.04 | 10.54 | 1.87 | 2.61 | 3.16 | 4.42 | 7.78 | 2.03 | 2.86 | 3.51 | 5.08 | 10.05 |
| All | 2.67 | 4.20 | 5.55 | 9.07 | 22.98 | 2.12 | 3.00 | 3.67 | 5.17 | 9.20 | 2.35 | 3.44 | 4.34 | 6.7 | 15.43 |

PAE: peak absolute error.
In the top row, it can be observed that Issue 2's hybrid coder enables higher image quality, i.e., a lower PAE, at similar, albeit smaller, compressed data sizes. Furthermore, for sufficiently low PAEs, reconstructed images are hardly distinguishable from the originals. In turn, the bottom row illustrates how moderately larger PAEs introduce some texture artifacts but retain the image's structure, and so might not hinder analysis tasks performed on it [30], [31]. A visual inspection of this row also reveals that Issue 2 introduces distortion patterns similar to those of JPEG-LS. This is expected because both algorithms apply quantization after prediction. It is worth noting that the choice of entropy coder in CCSDS 123.0-B does not affect the obtained reconstructed image, only the compressed data size. Compressed data rate differences aside, a similar discussion regarding visual quality applies to M-CALIC, too. It is omitted here due to space constraints.

FIGURE 7. (a) A crop (256 × 256) of Band 220 of an original AVIRIS f060925t01p00r12_sc00 image (calibrated, 16 bit); (b) and (c) the colocated crops of the same AVIRIS image after reconstruction with CCSDS 123.0-B-2's hybrid coder [compressed at 2.4 bits per sample (bps)] and JPEG-LS (2.9 bps) with absolute error limits of 2 and 16, respectively; (d) a crop (128 × 128) of an original SPOT5 toulouse_spot5_xs_extract1 image (processed, 8 bit); and (e) and (f) the colocated crops of the same SPOT5 image after reconstruction with CCSDS 123.0-B-2's hybrid coder (1.1 bps) and JPEG-LS (1.4 bps) with an absolute error limit of 12. The brightness and magnification have been adjusted in all of the images to facilitate a comparison. The SPOT5 images are presented using false color.

The remainder of this section provides a quantitative discussion of the compression performance of the aforementioned algorithms in relation to the fidelity of the reconstructed images. For each compressor, PAE, and input image $I$, the compressed data rate, expressed in bits per sample (bps), is computed as

$$\text{compressed data rate} = \frac{\text{compressed data size (bits)}}{N_X \cdot N_Y \cdot N_Z}. \qquad (23)$$

In turn, the peak SNR (PSNR) between $I$ and its reconstructed counterpart $\hat{I}$ is defined as

$$\mathrm{PSNR}(I, \hat{I}) = 10 \cdot \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}(I, \hat{I})}\right) \text{ (dB)}. \qquad (24)$$

Here, $\mathrm{MAX}_I$ denotes the dynamic range of an image, i.e., $2^D - 1$, where $D$ is $I$'s bit depth, and $\mathrm{MSE}(I, \hat{I})$ is the mean squared error between $I$ and $\hat{I}$, i.e.,

$$\mathrm{MSE}(I, \hat{I}) = \frac{\sum_{x}^{N_X} \sum_{y}^{N_Y} \sum_{z}^{N_Z} \big(I_{z,y,x} - \hat{I}_{z,y,x}\big)^2}{N_X \cdot N_Y \cdot N_Z}. \qquad (25)$$
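The PSNR and MSE of (24)-(25) can be sketched for a 3D image cube as follows; the synthetic data and function name are illustrative only.

```python
import numpy as np

def psnr(original, reconstructed, bit_depth):
    """PSNR/MSE of a 3D image cube as in (24)-(25); a minimal sketch,
    not the evaluation code used by the authors."""
    original = original.astype(np.float64)
    reconstructed = reconstructed.astype(np.float64)
    mse = np.mean((original - reconstructed) ** 2)
    max_i = 2 ** bit_depth - 1
    return 10.0 * np.log10(max_i ** 2 / mse)

# Example with a synthetic 16-bit cube and a +/-2 PAE reconstruction.
rng = np.random.default_rng(0)
cube = rng.integers(0, 2**16, size=(224, 64, 64))
recon = np.clip(cube + rng.integers(-2, 3, size=cube.shape), 0, 2**16 - 1)
print(round(psnr(cube, recon, bit_depth=16), 2))
```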
The spectral angle is computed at each $(x, y)$ position for each original and reconstructed image pair, defined in [71] as

$$\alpha(x, y) = \cos^{-1}\!\left(\frac{\sum_{z}^{N_Z} I_{z,y,x} \cdot \hat{I}_{z,y,x}}{\sqrt{\sum_{z}^{N_Z} I_{z,y,x}^2 \cdot \sum_{z}^{N_Z} \hat{I}_{z,y,x}^2}}\right). \qquad (26)$$

The mean spectral angle and maximum spectral angle metrics are defined as the average and maximum spectral angle, respectively, for all $(x, y)$ positions in the image.

Figure 8 provides near-lossless compressed data rate results for the three entropy coders of Issue 2, for JPEG-LS, and for M-CALIC, setting PAE limits between 0 (lossless) and 32. For each coder and PAE value, the plotted value is the mean compressed data rate for all the images in the corpus. Markers have been included in the figure at the integer PAE values for which data have been obtained, and linear interpolation is used between them for the sake of readability. The results indicate that, for larger PAE values, the differences between Issue 2's coders become more apparent than for the lossless case. When compared to the block- and sample-adaptive coders, the hybrid coder yields compressed data rates up to 0.2 and 0.6 bps better, respectively. For PAE values up to 5, both JPEG-LS and M-CALIC are outperformed by all entropy coders of Issue 2. For PAE values from 20 onward, M-CALIC improves upon the block-adaptive coder. For PAEs larger than 25, JPEG-LS produces results better than the sample-adaptive coder. Notwithstanding, for PAE values of 2 and above, the hybrid coder's average results are consistently better than those of all other compressors.

The global results presented in Figure 8 are complemented by Table 4, which also reports average compression ratios for several PAE values. In it, the average compression ratios for each instrument are provided. It can be observed that the per-instrument results are generally consistent with the global averages, with similar exceptions as for the lossless case. These behaviors are explained by the different predictor stages and by the way in which each coder handles the low-entropy data prevalent in near-lossless compression. The sample-adaptive coder does not have a mode in which multiple input symbols are compressed in a single codeword. Therefore, the minimum length of any sample-adaptive codeword sets a lower bound for the compression rates achievable by this coder. Both JPEG-LS and the block-adaptive coder have run-length modes that allow coding of consecutive zeros in a single codeword. Thus, their compression performance increases as the prevalence of such runs increases. In turn, the 16 stateful codes featured in the hybrid coder enable more efficient processing of low-entropy data, including inputs that are not sequences of only zeros. Finally, M-CALIC's performance improvement for higher PAEs is due to its arithmetic entropy coder, which is close to optimal for many data distributions.

FIGURE 8. The average compressed data rate in bps as a function of the maximum absolute error.

In addition to considering the compressed data rates and PAE of the reconstructed images, it is useful to consider other distortion metrics to better understand the efficiency of each coder. To complete the rate-distortion compression performance comparison, the average PSNR as a function of the average compressed data rate is plotted in Figure 9. The mean spectral angle and maximum spectral angle metrics are plotted in Figure 10(a) and (b), respectively. All of the metrics are computed for each coder, PAE value (or target bitrate, for CCSDS 122.1-B-1), and test image, and the mean values are used in the plots.

FIGURE 9. The average PSNR results as a function of the average compressed data rate.

Markers are placed at the obtained data points, and linear interpolation is used between them to enhance readability.
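A minimal sketch of the per-pixel spectral angle of (26), and of the mean and maximum spectral angle metrics, is given below; the band-first array layout and the synthetic data are assumptions for illustration.

```python
import numpy as np

def spectral_angles(original, reconstructed):
    """Per-pixel spectral angle (in degrees) between two (N_Z, N_Y, N_X) cubes,
    following (26); a minimal sketch under the assumed band-first layout."""
    original = original.astype(np.float64)
    reconstructed = reconstructed.astype(np.float64)
    dot = np.sum(original * reconstructed, axis=0)
    norms = np.sqrt(np.sum(original**2, axis=0) * np.sum(reconstructed**2, axis=0))
    cos_a = np.clip(dot / norms, -1.0, 1.0)   # guard against rounding outside [-1, 1]
    return np.degrees(np.arccos(cos_a))

# Mean and maximum spectral angle over all (x, y) positions.
rng = np.random.default_rng(1)
cube = rng.integers(1, 4096, size=(100, 32, 32))
recon = cube + rng.integers(-5, 6, size=cube.shape)
angles = spectral_angles(cube, recon)
print(angles.mean(), angles.max())
```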
As in the previous case, the hybrid coder yields better fidelity results than do the other near-lossless coders for all the metrics, especially at low compressed data rates. These and other differences among compressors are comparable to those shown in Figure 8, for similar reasons as mentioned previously in this section.

When compared to CCSDS 122.1-B-1, all of the near-lossless codecs yield significantly better PAE results. This is as expected, as the CCSDS 122.1-B-1 standard is not designed to bound the maximum introduced error but rather to minimize the MSE. At low bitrates, i.e., below 1.25 bps, CCSDS 122.1-B-1 yields the best PSNR results of all the tested codecs. Again, this can be explained by the minimization goal of the standard. At higher bitrates, the hybrid coder of Issue 2 produces the best PSNR results, which illustrates the competitive performance of CCSDS 123.0-B-2. When spectral angles are considered, the relative performance of the near-lossless coders is very similar to the PAE and PSNR cases. In turn, for the mean spectral angle metric, CCSDS 122.1-B-1 improves upon all the other coders for bitrates up to 2 bps. This can be explained by the fact that CCSDS 122.1-B-1 applies a spectral transform across all bands instead of predicting pixel values using a local spatial and spectral neighborhood. Interestingly, when the maximum spectral angle is considered, CCSDS 123.0-B-2 yields better results than does CCSDS 122.1-B-1, except for low bitrates, i.e., below 0.75 bps. This can be explained by the fact that CCSDS 123.0-B-2 is near lossless, i.e., it bounds the maximum error introduced in any pixel of the image.

CONCLUSIONS
Multispectral imaging and HSI have become invaluable tools for many commercial, scientific, and defense applications of remote sensing. With the advent of sensors allowing enhanced spatial and spectral resolution, data compression is paramount to maximize the amount of valuable information retrieved from spaceborne systems. In particular, near-lossless compression can significantly improve the effective capacity of transmission channels while providing strict control of the distortion introduced in the images. Even if rate-control strategies are possible, strong quality guarantees are prioritized over obtaining constant data rates in near-real-time transmission.

The CCSDS 123.0-B-2 compression standard published by the CCSDS enables the specification of absolute and/or relative error limits at the image or band level. This is achieved via the uniform, in-loop quantization of prediction errors, obtaining higher performance at the expense of a simpler implementation. As the decompressor does not have access to the original image samples, sample representatives are used instead in the predictor stage. To fully exploit the lower entropy rates exhibited by quantized data, a new hybrid entropy coder is defined for Issue 2. This coder includes 16 variable-to-variable-length codes selected on a sample-by-sample basis depending on the statistics of previously coded information. One last improvement over CCSDS 123.0-B-1 is the definition of narrow local sums that facilitate the design of highly efficient hardware implementations. Experimental results with a comprehensive corpus of test images indicate that the new hybrid coder yields competitive compression performance results, measurably improving upon the other coding modes of Issue 2 as well as upon the JPEG-LS compression standard and the M-CALIC algorithm.
FIGURE 10. The spectral angle metrics as a function of the compressed data rate. (a) The mean spectral angle and (b) the maximum spectral angle.

The standard obtains state-of-the-art performance in absolute or relative error measurements,
while other approaches may provide better performance in terms of quadratic error at very low rates. Regarding future developments related to this standard, it is unlikely that major changes will be introduced soon.

ACKNOWLEDGMENTS
Miguel Hernández-Cabronero, Ian Blanes, and Joan Serra-Sagristà received partial funding from the postdoctoral fellowship program Beatriu de Pinós, reference 2018-BP00008, funded by the Secretary of Universities and Research (Government of Catalonia) and by the H2020 Programme of Research and Innovation of the European Union (EU) under Marie Skłodowska-Curie grant agreement 801370; from the EU's H2020 program under grant agreement 776151; from the Spanish Government under grant RTI2018-095287-B-I00; and from the Catalan Government under grant 2017SGR-463. The research conducted at the Jet Propulsion Laboratory at the California Institute of Technology was performed under a contract with NASA. Miguel Hernández-Cabronero is the corresponding author.

AUTHOR INFORMATION
Miguel Hernández-Cabronero (miguel.hernandez@uab.cat) is with the Department of Information and Communications Engineering, Universitat Autònoma de Barcelona, Barcelona, 08193, Spain.
Aaron B. Kiely (aaron.b.kiely@jpl.nasa.gov) is with the Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California, 91109, USA. He is a Senior Member of IEEE.
Matthew Klimesh (matthew.a.klimesh@jpl.nasa.gov) is with the Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California, 91109, USA. He is a Senior Member of IEEE.
Ian Blanes (ian.blanes@uab.cat) is with the Department of Information and Communications Engineering, Universitat Autònoma de Barcelona, Barcelona, 08193, Spain. He is a Senior Member of IEEE.
Jonathan Ligo (jonathan.ligo@jhuapl.edu) is with the Applied Physics Laboratory, Johns Hopkins University, Baltimore, Maryland, 20723, USA. He is a Member of IEEE.
Enrico Magli (enrico.magli@polito.it) is with the Department of Electronics and Telecommunications, Politecnico di Torino, Turin, 10129, Italy. He is a Fellow of IEEE.
Joan Serra-Sagristà (joan.serra@uab.cat) is with the Department of Information and Communications Engineering, Universitat Autònoma de Barcelona, Barcelona, 08193, Spain. He is a Senior Member of IEEE.

REFERENCES
[1] M. Parente, J. Kerekes, and R. Heylen, "A special issue on hyperspectral imaging [from the guest editors]," IEEE Geosci. Remote Sens. Mag. (replaces Newslett.), vol. 7, no. 2, pp. 6–7, June 2019. doi: 10.1109/MGRS.2019.2912617.
[2] E. J. Ientilucci and S. Adler-Golden, "Atmospheric compensation of hyperspectral data: An overview and review of in-scene and physics-based approaches," IEEE Geosci. Remote Sens. Mag. (replaces Newslett.), vol. 7, no. 2, pp. 31–50, June 2019. doi: 10.1109/MGRS.2019.2904706.
[3] M. J. Khan, H. S. Khan, A. Yousaf, K. Khurshid, and A. Abbas, "Modern trends in hyperspectral image analysis: A review," IEEE Access, vol. 6, pp. 14,118–14,129, Mar. 2018. doi: 10.1109/ACCESS.2018.2812999.
[4] M. Malyy, Z. Tekic, and A. Golkar, "What drives technology innovation in new space? A preliminary analysis of venture capital investments in earth observation start-ups," IEEE Geosci. Remote Sens. Mag. (replaces Newslett.), vol. 7, no. 1, pp. 59–73, Mar. 2019. doi: 10.1109/MGRS.2018.2886999.
[5] J. Theiler, A. Ziemann, S. Matteoli, and M. Diani, "Spectral variability of remotely sensed target materials: Causes, models, and strategies for mitigation and robust exploitation," IEEE Geosci. Remote Sens. Mag. (replaces Newslett.), vol. 7, no. 2, pp. 8–30, June 2019. doi: 10.1109/MGRS.2019.2890997.
[6] Y. Zhong et al., "Mini-UAV-borne hyperspectral remote sensing: From observation and processing to applications," IEEE Geosci. Remote Sens. Mag. (replaces Newslett.), vol. 6, no. 4, pp. 46–62, Dec. 2018. doi: 10.1109/MGRS.2018.2867592.
[7] G. Denis et al., "Towards disruptions in Earth observation? New Earth Observation systems and markets evolution: Possible scenarios and impacts," Acta Astronaut. (U.K.), vol. 137, pp. 415–433, Aug. 2017. doi: 10.1016/j.actaastro.2017.04.034.
[8] W. Sun and Q. Du, "Hyperspectral band selection: A review," IEEE Geosci. Remote Sens. Mag. (replaces Newslett.), vol. 7, no. 2, pp. 118–139, June 2019. doi: 10.1109/MGRS.2019.2911100.
[9] S. Li, W. Song, L. Fang, Y. Chen, P. Ghamisi, and J. A. Benediktsson, "Deep learning for hyperspectral image classification: An overview," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 9, pp. 6690–6709, 2019. doi: 10.1109/TGRS.2019.2907932.
[10] P. Duan, X. Kang, S. Li, P. Ghamisi, and J. A. Benediktsson, "Fusion of multiple edge-preserving operations for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 12, pp. 10,336–10,349, 2019. doi: 10.1109/TGRS.2019.2933588.
[11] Y. Su, J. Li, A. Plaza, A. Marinoni, P. Gamba, and S. Chakravortty, "DAEN: Deep autoencoder networks for hyperspectral unmixing," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 7, pp. 4309–4321, 2019. doi: 10.1109/TGRS.2018.2890633.
[12] Y. Chen, K. Zhu, L. Zhu, X. He, P. Ghamisi, and J. A. Benediktsson, "Automatic design of convolutional neural network for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 9, pp. 7048–7066, 2019. doi: 10.1109/TGRS.2019.2910603.
[13] J. M. Haut et al., "Cloud deep networks for hyperspectral image analysis," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 12, pp. 9832–9848, 2019. doi: 10.1109/TGRS.2019.2929731.
[14] B. Tu, X. Zhang, X. Kang, J. Wang, and J. A. Benediktsson, "Spatial density peak clustering for hyperspectral image classification with noisy labels," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 7, pp. 5085–5097, 2019. doi: 10.1109/TGRS.2019.2896471.
[15] K. Bhardwaj, S. Patra, and L. Bruzzone, "Threshold-free attribute profile for classification of hyperspectral images," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 10, pp. 7731–7742, 2019. doi: 10.1109/TGRS.2019.2916169.
[16] X. Lu, L. Dong, and Y. Yuan, "Subspace clustering constrained sparse NMF for hyperspectral unmixing," IEEE Trans. Geosci. Remote Sens., vol. 58, no. 5, pp. 3007–3019, 2020. doi: 10.1109/TGRS.2019.2946751.
[17] C. J. Della Porta, A. A. Bekit, B. H. Lampe, and C. Chang, "Hyperspectral image classification via compressive sensing," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 10, pp. 8290–8303, 2019. doi: 10.1109/TGRS.2019.2920112.
[18] J. Nalepa, M. Myller, and M. Kawulok, "Validating hyperspectral image segmentation," IEEE Geosci. Remote Sens. Lett., vol. 16, no. 8, pp. 1264–1268, 2019. doi: 10.1109/LGRS.2019.2895697.
[19] D. Hong, X. Wu, P. Ghamisi, J. Chanussot, N. Yokoya, and X. X. Zhu, "Invariant attribute profiles: A spatial-frequency joint feature extractor for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., 2020, pp. 1–18.
[20] "IASI level 1: Product guide," EUMETSAT, Tech. Rep. EUM/OPS-EPS/MAN/04/0032, Darmstadt, Germany, Sept. 2019.
[21] K. Turpie, S. Veraverbeke, R. Wright, M. Anderson, and D. Quattrochi, "NASA 2014 The Hyperspectral Infrared Imager (HyspIRI) – Science impact of deploying instruments on separate platforms," Jet Propulsion Lab., Tech. Rep. JPL-Publ-14-13, July 2014. [Online]. Available: http://hdl.handle.net/2060/20160001776
[22] S.-E. Qian, Optical Satellite Data Compression and Implementation. Bellingham, WA: SPIE, 2013.
[23] B. Huang, Satellite Data Compression. Berlin, Germany: Springer Science & Business Media, 2011.
[24] K. Sayood, Introduction to Data Compression, 5th ed. San Mateo, CA: Morgan Kaufmann, 2017.
[25] S. Álvarez-Cortés, J. Serra-Sagristà, J. Bartrina-Rapesta, and M. W. Marcellin, "Regression wavelet analysis for near-lossless remote sensing data compression," IEEE Trans. Geosci. Remote Sens., vol. 58, no. 2, pp. 790–798, 2020. doi: 10.1109/TGRS.2019.2940553.
[26] D. Valsesia and E. Magli, "High-throughput onboard hyperspectral image compression with ground-based CNN reconstruction," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 12, pp. 9544–9553, Dec. 2019. doi: 10.1109/TGRS.2019.2927434.
[27] M. Díaz et al., "Real-time hyperspectral image compression onto embedded GPUs," IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 12, no. 8, pp. 2792–2809, 2019. doi: 10.1109/JSTARS.2019.2917088.
[28] S.-E. Qian, Optical Satellite Signal Processing and Enhancement. Bellingham, WA: SPIE, 2013.
[29] Z. Chen, Y. Hu, and Y. Zhang, "Effects of compression on remote sensing image classification based on fractal analysis," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 7, pp. 4577–4590, July 2019. doi: 10.1109/TGRS.2019.2891679.
[30] J. García-Sobrino, J. Serra-Sagristà, and A. J. Pinho, "Competitive segmentation performance on near-lossless and lossy compressed remote sensing images," IEEE Geosci. Remote Sens. Lett., vol. 17, no. 5, pp. 834–838, 2020. doi: 10.1109/LGRS.2019.2934997.
[31] F. Garcia-Vilchez et al., "On the impact of lossy compression on hyperspectral image classification and unmixing," IEEE Geosci. Remote Sens. Lett., vol. 8, no. 2, pp. 253–257, 2010. doi: 10.1109/LGRS.2010.2062484.
[32] I. Blanes, E. Magli, and J. Serra-Sagrista, "A tutorial on image compression for optical space imaging systems," IEEE Geosci. Remote Sens. Mag. (replaces Newslett.), vol. 2, no. 3, pp. 8–26, Sept. 2014. doi: 10.1109/MGRS.2014.2352465.
[33] A. D. George and C. M. Wilson, "Onboard processing with hybrid and reconfigurable computing on small satellites," Proc. IEEE, vol. 106, no. 3, pp. 458–470, 2018.
doi: 10.1109/JPROC.2018.2802438.
[34] Low-Complexity Lossless and Near-Lossless Multispectral and Hyperspectral Image Compression, Consultative Committee for Space Data Systems (CCSDS) Standard CCSDS 123.0-B-2, Feb. 2019. [Online]. Available: https://public.ccsds.org/Pubs/123x0b2c1.pdf
[35] Lossless Multispectral & Hyperspectral Image Compression. Silver Book, Consultative Committee for Space Data Systems (CCSDS) Standard CCSDS 123.0-B-1-S, May 2012. [Online]. Available: https://public.ccsds.org/Pubs/123x0b1ec1s.pdf
[36] A. Kiely et al., "The new CCSDS standard for low-complexity lossless and near-lossless multispectral and hyperspectral image compression," in Proc. 6th Int. Workshop on On-Board Payload Data Compression (OBPDC), 2018, pp. 1–6.
[37] I. Blanes, A. Kiely, M. Hernández-Cabronero, and J. Serra-Sagristà, "Performance impact of parameter tuning on the CCSDS-123.0-B-2 low-complexity lossless and near-lossless multispectral and hyperspectral image compression standard," MDPI Remote Sens., vol. 11, no. 11, p. 1390, 2019. doi: 10.3390/rs11111390.
[38] Lossless Data Compression, Consultative Committee for Space Data Systems (CCSDS) Standard CCSDS 121.0-B-1-S, May 1997. [Online]. Available: https://public.ccsds.org/Pubs/121x0b1sc2.pdf
[39] Image Data Compression, Consultative Committee for Space Data Systems (CCSDS) Standard CCSDS 122.0-B-1-S, May 2005. [Online]. Available: https://public.ccsds.org/Pubs/122x0b1c3s.pdf
[40] Lossless Data Compression, Consultative Committee for Space Data Systems (CCSDS) Standard CCSDS 121.0-B-2, Apr. 2012. [Online]. Available: https://public.ccsds.org/Pubs/121x0b2ec1s.pdf
[41] Image Data Compression, Consultative Committee for Space Data Systems (CCSDS) Standard CCSDS 122.0-B-2, Sept. 2017. [Online]. Available: https://public.ccsds.org/Pubs/122x0b2.pdf
[42] Spectral Preprocessing Transform for Multispectral and Hyperspectral Image Compression, Consultative Committee for Space Data Systems (CCSDS) Standard CCSDS 122.1-B-1, Sept. 2017. [Online]. Available: https://public.ccsds.org/Pubs/122x1b1.pdf
[43] Lossless Data Compression, Consultative Committee for Space Data Systems (CCSDS) Standard CCSDS 121.0-B-3, Aug. 2020. [Online]. Available: https://public.ccsds.org/Pubs/121x0b3.pdf
[44] D. Báscones, C. González, and D. Mozos, "Parallel implementation of the CCSDS 1.2.3 standard for hyperspectral lossless compression," MDPI Remote Sens., vol. 9, no. 10, p. 973, 2017. doi: 10.3390/rs9100973.
[45] A. Tsigkanos, N. Kranitis, G. A. Theodorou, and A. Paschalis, "A 3.3 Gbps CCSDS 123.0-B-1 multispectral hyperspectral image compression hardware accelerator on a space-grade SRAM FPGA," IEEE Trans. Emerg. Topics Comput., early access, July 12, 2018.
[46] J. Fjeldtvedt, M. Orlandić, and T. A. Johansen, "An efficient real-time FPGA implementation of the CCSDS-123 compression standard for hyperspectral images," IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 11, no. 10, pp. 3841–3852, 2018. doi: 10.1109/JSTARS.2018.2869697.
[47] M. Orlandić, J. Fjeldtvedt, and T. A. Johansen, "A parallel FPGA implementation of the CCSDS-123 compression algorithm," MDPI Remote Sens., vol. 11, no. 6, p. 673, 2019. doi: 10.3390/rs11060673.
[48] L. M. V. Pereira, D. A. Santos, C. A. Zeferino, and D. R. Melo, "A low-cost hardware accelerator for CCSDS 123 predictor in FPGA," in Proc. IEEE Int. Symp. Circuits and Syst. (ISCAS), 2019, pp. 1–5. doi: 10.1109/ISCAS.2019.8702428.
[49] L. Santos, L. Berrojo, J. Moreno, J. F. López, and R. Sarmiento, "Multispectral and hyperspectral lossless compressor for space applications (HyLoC): A low-complexity FPGA implementation of the CCSDS 123 standard," IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 9, no. 2, pp. 757–770, 2016. doi: 10.1109/JSTARS.2015.2497163.
[50] L. Santos, A. J. Gomez, and R. Sarmiento, "Implementation of CCSDS standards for lossless multispectral and hyperspectral satellite image compression," IEEE Trans. Aerosp. Electron. Syst., vol. 56, no. 2, pp. 1120–1138, 2020. doi: 10.1109/TAES.2019.2929971.
[51] Y. Barrios, A. J. Sánchez, L. Santos, and R. Sarmiento, "Shyloc 2.0: A versatile hardware solution for on-board data and hyperspectral image compression on future space missions," IEEE Access, vol. 8, pp. 54,269–54,287, 2020. doi: 10.1109/ACCESS.2020.2980767.
[52] "High-speed integrated satellite data systems for leading EU industry," European Commission, Hi-SIDE Project, H2020-COMPET-3-2017 (RIA): High speed data chain, Germany, 2018–2021.
[53] "Next generation satellite processing chain for rapid civil alerts," European Commission, EO-ALERT Project, H2020-COMPET-3-2017 (RIA): High speed data chain, Spain, 2018–2021.
[54] D. Keymeulen et al., "High performance space data acquisition, clouds screening and data compression with modified COTS embedded system-on-chip instrument avionics for space-based next generation imaging spectrometers (NGIS)," in Proc. 6th Int. Workshop on On-Board Payload Data Compression (OBPDC), 2018, pp. 7–15.
[55] "Copernicus Hyperspectral Imaging Mission for the Environment, mission requirements document," European Space Agency, France, 2018. http://esamultimedia.esa.int/docs/EarthObservation/Copernicus_CHIME_MRD_v2.1_Issued20190723.pdf
[56] M. Conoscenti, R. Coppola, and E. Magli, "Constant SNR, rate control, and entropy coding for predictive lossy hyperspectral image compression," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7431–7441, 2016. doi: 10.1109/TGRS.2016.2603998.
[57] J. Bartrina-Rapesta, I. Blanes, F. Aulí-Llinàs, J. Serra-Sagristà, V. Sanchez, and M. W. Marcellin, "A lightweight contextual arithmetic coder for on-board remote sensing data compression," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 8, pp. 4825–4835, 2017. doi: 10.1109/TGRS.2017.2701837.
[58] J. Song, Z. Zhang, and X. Chen, "Lossless compression of hyperspectral imagery via RLS filter," Electron. Lett., vol. 49, no. 16, pp. 992–994, 2013. doi: 10.1049/el.2013.1315.
[59] F. Gao and S. Guo, "Lossless compression of hyperspectral images using conventional recursive least-squares predictor with adaptive prediction bands," J. Appl. Remote Sens., vol. 10, no. 1, p. 015010, 2016.
doi: 10.1117/1.JRS.10.015010.
[60] A. C. Karaca and M. K. Güllü, "Lossless hyperspectral image compression using bimodal conventional recursive least-squares," Remote Sens. Lett., vol. 9, no. 1, pp. 31–40, 2018. doi: 10.1080/2150704X.2017.1375612.
[61] A. C. Karaca and M. K. Güllü, "Superpixel based recursive least-squares method for lossless compression of hyperspectral images," Multidimensional Syst. Signal Process., vol. 30, no. 2, pp. 903–919, 2019.
[62] D. Keymeulen et al., "High performance space computing with system-on-chip instrument avionics for space-based Next Generation Imaging Spectrometers (NGIS)," in Proc. NASA/ESA Conf. Adaptive Hardware and Syst. (AHS), Aug. 2018, pp. 33–36. doi: 10.1109/AHS.2018.8541473.
[63] M. Klimesh, "Low-complexity lossless compression of hyperspectral imagery via adaptive filtering," Jet Propulsion Lab., NASA, Pasadena, CA, Tech. Rep., 2005. [Online]. Available: http://ipnpr.jpl.nasa.gov/progress_report/42-163/163H.pdf
[64] D. Valsesia and E. Magli, "A novel rate control algorithm for onboard predictive coding of multispectral and hyperspectral images," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 10, pp. 6341–6355, 2014. doi: 10.1109/TGRS.2013.2296329.
[65] D. Valsesia and E. Magli, "Fast and lightweight rate control for onboard predictive coding of hyperspectral images," IEEE Geosci. Remote Sens. Lett., vol. 14, no. 3, pp. 394–398, 2017.
[66] R. Guerra, Y. Barrios, M. Díaz, A. Baez, S. López, and R. Sarmiento, "A hardware-friendly hyperspectral lossy compressor for next-generation space-grade field programmable gate arrays," IEEE J. Select. Topics Appl. Earth Observat. Remote Sens., vol. 12, no. 12, pp. 4813–4828, 2019. doi: 10.1109/JSTARS.2019.2919791.
[67] Lossless Data Compression, Green Book, no. 3, Consultative Committee for Space Data Systems (CCSDS), Washington, D.C., 2013.
[68] I. Blanes, A. Kiely, L. Santos, M. Hernández-Cabronero, and J. Serra-Sagristà, "The hybrid entropy encoder of CCSDS 123.0-B-2: Insights and decoding process," in Proc. 7th Int. Workshop on On-Board Payload Data Compression (OBPDC), Sept. 2020, pp. 1–10.
[69] M. Hernández-Cabronero, J. Portell, I. Blanes, and J. Serra-Sagristà, "High-performance lossless compression of hyperspectral remote sensing scenes based on spectral decorrelation," MDPI Remote Sens., vol. 12, no. 18, p. 2955, 2020. doi: 10.3390/rs12182955.
[70] E. Magli, G. Olmo, and E. Quacchio, "Optimized onboard lossless and near-lossless compression of hyperspectral data using CALIC," IEEE Geosci. Remote Sens. Lett., vol. 1, no. 1, pp. 21–25, 2004. doi: 10.1109/LGRS.2003.822312.
[71] F. A. Kruse et al., "The spectral image processing system (SIPS) – interactive visualization and analysis of imaging spectrometer data," AIP Conf. Proc., vol. 283, no. 1, pp. 192–201, 1993.
Advances and Opportunities in Remote Sensing Image Geometric Registration
A systematic review of state-of-the-art approaches and future research directions
RUITAO FENG, HUANFENG SHEN, JIANJUN BAI, AND XINGHUA LI

Digital Object Identifier 10.1109/MGRS.2021.3081763
Date of current version: 28 June 2021

Geometric registration is often an accuracy assurance for most remote sensing image processing and analysis, such as image mosaicking, image fusion, and time-series analysis. In recent decades, geometric registration has attracted considerable attention in the remote sensing community, leading to a large amount of research on the subject. However, few studies have systematically reviewed its current status and deeply investigated its development trends. Moreover, new approaches are constantly emerging, and some issues still need to be solved. Thus, this article presents a survey of state-of-the-art approaches for remote sensing image registration in terms of intensity-based, feature-based, and combination techniques. Optical flow estimation and deep learning-based methods are summarized, and software-operated registration and registration evaluation are introduced. Building on recent advances, promising opportunities are explored.

OVERVIEW
Remote sensing images from various sensors, periods, and viewpoints can provide complementary information about regions of interest (ROIs) and Earth surface observation. Owing to various factors, such as Earth's rotation and curvature and variations in platform altitudes, remote sensing images contain systematic geometric distortions that cannot be thoroughly corrected without high-precision elevation data [through the digital elevation model (DEM) or the digital surface model (DSM)] and control points on the ground. Although the true digital orthophoto map (TDOM) promises accurate spatial positions, it has high production costs and is difficult for general users to obtain. Therefore, most available remote sensing images retain small geometrical distortions after systematic correction, resulting in objects in one image not spatially corresponding to those in another image, as in Figure 1. Furthermore, topographical fluctuations in mountainous regions, differences in imaging viewpoints (shown in Figure 2), and spatial resolutions cause dislocation in two
FIGURE 1. Multitemporal optical image geometrical dislocation. (a) A reference image taken by Landsat 5 on 15 October 1990. (b) A sensed image taken by Landsat 5 on 15 September 1993. (c) The overlapping images of (a) and (b).

images covering the same scene. Thus, geometrical registration techniques are implemented to align two or more images from the image-to-image perspective rather than the imaging mechanism. Consequently, geometrical registration is an image-processing technique that aligns different images of the same scene acquired at various times and viewing angles and with multiple sensors [1]. As a fundamental task in remote sensing information processing, it is a prerequisite for many practical applications, such as image mosaicking [2], image fusion [3], land cover change detection [4], [5], and disaster evaluation [6], [7].

It is worth noting that there is a technical term, coregistration, that is similar to but not exactly the same as image registration. It is now commonly used in aerial and unmanned aerial vehicle image registration, generally including multimode registration and alignment through the aid of auxiliary data. When the registration is conducted with a GPS/inertial measurement unit, it usually establishes a connection between an image and the simulated or real ground [8]. Certainly, the registration technology still works on tie-point generation for the construction of these relationships. With real ground control points (GCPs), the tie points between the reference and sensed bands are produced to register different bands of hyperspectral images [9]. Additionally, when the orientation of the reference image is determined, without GCPs, the coregistration of multitemporal high-resolution image blocks is automatically achieved [10]. Although an increasing number of papers focus on coregistration techniques that perform auxiliary work with positioning data, the core of the process is image registration, as far as we are concerned. Therefore, the emphasis here is put on the opportunities and challenges of geometrical registration in remote sensing fields.

Geometrical registration can be traced to the 1970s, when the United States proposed image registration to analyze target objects in aircraft-aided navigation and weapons systems. Since then, it has rapidly developed, particularly in the domains of remote sensing, computer vision, and medical image processing. Some conclusive studies of computer vision and medical image processing have been published [11]–[16]. Building on a widespread survey of image registration, published in 1992 by Brown [15], a 2003 review [16] comprehensively summarized the subsequent research. In recent years, several overviews of image registration have focused on newly developed approaches inspired by extant versions [17]–[19]. However, these surveys are limited to analyzing and drawing conclusions based on conventional approaches [20]–[22]. Since the first study of multispectral and multitemporal digital imagery registration in 1970 [23], an increasing number of papers have contributed to the field. A total of 140,983 related studies with the keywords image registration or image matching were retrieved, from 1979 to January 2021, from Web of Science (WoS). When screening again using the keyword remote sensing, 46,141 articles were found, as plotted, based on their publication year, in Figure 3. The respective proportions of the total number of papers on WoS per year are also presented.
FIGURE 2. The angle difference from multitemporal images in a mountainous region.

It can be seen that a small number of papers
were presented early in the field's development, with remote sensing image registration accounting for a minimal percentage of annual WoS publications. More recently, a considerable number of studies have been published, peaking in 2019. Thus, a comprehensive analysis is necessary to identify unsolved problems for the rapid development of this field.

FIGURE 3. The number of papers about remote sensing image registration on WoS, per year.

In this article, we summarize various classical approaches to remote sensing image registration as well as recent methods based on deep learning, optical flow estimation, and image registration software. We also point out interesting aspects and analyze development trends from our perspective, without describing specific approaches in detail. Concretely, the registration approaches can be classified into three categories, namely, intensity-based, feature-based, and combination registration, as detailed in Figure 4. The intensity-based technique directly uses pixel intensity information to register images, including the conventional area-based approach and optical flow estimation. The geometrical and advanced features used to register images instead of intensity information define the feature-based approaches. Combination registration mainly consists of the integration of feature- and area-based methods as well as two geometric feature-based techniques. Many detailed classifications are presented in each category.

FIGURE 4. The remote sensing image registration algorithms. GAN: generative adversarial network.

All registration approaches must undergo coordinate transformation and resampling to ultimately acquire the aligned image, as demonstrated in Figure 5. Before this step, transformation models for coordinate recalculation, other than in optical flow estimation, should be constructed. In general, transformation models, such as the affine, projective, piecewise linear, and thin-plate spline models, are derived from global or local parametric models. To calculate these models, images are preprocessed to extract representative features through techniques including geometrical- and advanced-feature extraction and matching. Given that intensity information is directly utilized in area-based registration, feature extraction is omitted, and the transformation model is constructed when matching the intensity information. Since most approaches prefer to contribute to the preliminary steps (e.g., feature extraction, feature matching, and mismatched feature elimination) rather
than designing new transformation models and presenting novel resampling techniques, this article emphasizes the previous steps as well, comprehensively summarizing studies and further predicting development trends.

INTENSITY-BASED REGISTRATION
Intensity-based registration directly employs original or extended intensity information, such as gradients, for registering remote sensing images. In addition to the traditional area-based approach, we classify optical flow estimation, a direct calculation of the displacement of corresponding pixels from intensity information, as intensity-based registration.

AREA-BASED METHOD
In general, area-based registration accords with a similarity criterion established in advance and adopts an optimal search strategy to iteratively find the parameters of the transformation model that yield the maximum or minimum similarity measurement to achieve the spatial registration of images, as illustrated in Figure 6. With the transformation model constantly being optimized, the aligned image changes gradually, which is mainly reflected in the growing black area in the lower- and upper-left-hand corners of the aligned image. This approach differs from image matching, which is generally understood as template matching. Although both methods directly employ intensity information, template matching aims to extract the centroids of matched windows as feature points. This process is not true geometric registration, but it constitutes an important step. Here, we introduce area-based registration. The well-known core of this technique is the similarity metric, which has been researched in terms of spatial- and frequency-domain approaches [16], [24], [25].

SPATIAL-DOMAIN APPROACH
Spatial-domain techniques directly employ the intensity difference and statistical information of all pixels, without any image transformation. These methods generally come at the problem from one of two perspectives, namely, the correlation-like technique or the mutual information (MI) algorithm.

CORRELATION-LIKE SIMILARITY METRIC
This technique determines the spatial alignment of images by directly comparing the similarity of corresponding pixels. It is vulnerable to intensity changes, which may be introduced, for instance, by noise, thick or thin clouds, and differences in the photosensitive components of various sensors. As a fundamental similarity metric, the cross-correlation (CC) algorithm directly calculates the difference between corresponding pixels to iteratively register images until they have the largest CC, which is useful for small rigid-body and affine transformations [26], [27]. Many other correlation-like similarity metrics are available, including the sequential similarity detection algorithm [28], the correlation coefficient [29], [30], normalized CC (NCC) [31]–[33], the sum of squared differences [34], the Hausdorff distance [35], and other minimum distance criteria. NCC, in particular, is very popular and widely applied due to its invariance to linear intensity variations [31], [36], [37]. Recently, the centers of windows well matched by NCC have been used as feature points to solve transformation model parameters [38], namely, image matching.
Supposing $\rho(R, S)$ to be the NCC coefficient of the matched windows, we calculate NCC as follows:

$$\rho(R, S) = \frac{\sum_{i=1}^{m \times n} \big(R(i) - \mu_R\big)\big(S(i) - \mu_S\big)}{\sqrt{\sum_{i=1}^{m \times n} \big(R(i) - \mu_R\big)^2 \cdot \sum_{i=1}^{m \times n} \big(S(i) - \mu_S\big)^2}}, \qquad (1)$$

where the predefined window consists of $m \times n$ pixels, $R(i)$ and $S(i)$ denote specified positions in the windows of the reference and sensed images, and $\mu_R$ and $\mu_S$ are the average intensity values of the specified windows. The algorithm was developed to generate tie points that resist complicated geometric deformation [31], [38], [39]; it has recently been integrated with a novel feature descriptor [e.g., the local self-similarity (LSS) descriptor] for robust feature extraction in multimodal remote sensing image registration [36]. Although NCC is superior to the traditional correlation-like similarity metric, it is unable to handle nonlinear radiometric differences, which is a common problem for correlation-like similarity metrics.

FIGURE 5. General geometrical registration.
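The NCC of (1) for a pair of equally sized windows can be sketched as follows; the window contents and sizes are illustrative assumptions, not data from the cited studies.

```python
import numpy as np

def ncc(window_ref, window_sen):
    """Normalized cross-correlation of two equally sized windows, as in (1);
    a minimal sketch rather than a production matching routine."""
    r = window_ref.astype(np.float64).ravel()
    s = window_sen.astype(np.float64).ravel()
    r -= r.mean()
    s -= s.mean()
    denom = np.sqrt(np.sum(r**2) * np.sum(s**2))   # assumes non-constant windows
    return np.sum(r * s) / denom

# Template-matching style use: slide the window over a search area and keep
# the offset with the highest NCC (illustrative values only).
rng = np.random.default_rng(2)
ref_patch = rng.random((11, 11))
sen_patch = ref_patch + 0.05 * rng.random((11, 11))   # nearly identical window
print(round(ncc(ref_patch, sen_patch), 3))
```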
MI APPROACH
MI appeared more recently than the correlation-like techniques; it has been successfully applied to multispectral and multisensor image registration due to its robustness against nonlinear radiation differences [40]–[43] and is usually calculated by (2). The normalized MI (NMI) method is a measure that is independent of changes in the marginal entropies of two images in their region of overlap [44], [45]. MI and NMI are the same type of statistical similarity measurement, and both are prone to registration errors. Inspired by these approaches, the region–MI approach was developed [46] with consideration of structural information. Furthermore, rotationally invariant regional MI considers not only the spatial information but also the influence that local gray variations and rotation changes have on the computation of the probability density function [45]:

$$\begin{aligned}
\mathrm{MI}(R, S) &= H(R) + H(S) - H(R, S), \\
H(R) &= -\sum_{r \in R} P(r) \log_2 P(r), \\
H(S) &= -\sum_{s \in S} P(s) \log_2 P(s), \\
H(R, S) &= -\sum_{r \in R,\, s \in S} P(r, s) \log_2 P(r, s),
\end{aligned} \qquad (2)$$

where $H(R)$ and $H(S)$ are the Shannon entropies of the reference and sensed images, respectively; $H(R, S)$ represents the joint entropy; $P(r)$ and $P(s)$ are the marginal probability distributions of $R$ and $S$; and $P(r, s)$ is the joint probability distribution that is calculated, in practice, by 2D histogram binning of the discrete random variables. Additionally, there is an MI registration based on displacement maps, which is similar to optical flow estimation. In this variational framework, MI is employed as the similarity metric for displacement calculation [47]. Overall, the MI-like algorithms originating from information theory are a measure of the statistical dependence between two data sets and are particularly suitable for registration with different imaging mechanisms. However, they are computationally expensive, which may be restrictive, as remote sensing images are always relatively large.

FIGURE 6. Conventional area-based registration. Pay attention to how the black-edge region changes in the lower- and upper-left corners of the aligned image. (a) The aligned images overlapping. (b) The sensed image. (c) The original images overlapping. (d) The fifth iteration. (e) The reference image. (f) The first iteration. (g) The fourth iteration. (h) The third iteration. (i) The second iteration.
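The 2D-histogram calculation of MI mentioned above can be sketched as follows; the bin count, synthetic images, and function name are illustrative assumptions.

```python
import numpy as np

def mutual_information(ref, sen, bins=64):
    """Mutual information of two images via 2D histogram binning, as in (2);
    a minimal sketch with illustrative parameter values."""
    joint, _, _ = np.histogram2d(ref.ravel(), sen.ravel(), bins=bins)
    p_rs = joint / joint.sum()
    p_r = p_rs.sum(axis=1)
    p_s = p_rs.sum(axis=0)
    nz = p_rs > 0                      # avoid log2(0)
    return np.sum(p_rs[nz] * np.log2(p_rs[nz] / np.outer(p_r, p_s)[nz]))

rng = np.random.default_rng(3)
reference = rng.integers(0, 256, size=(128, 128))
sensed = np.roll(reference, shift=2, axis=1)     # a 2-pixel misalignment
print(round(mutual_information(reference, sensed), 3))
```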
Overall, the MI-like algorithms originating from information theory are a measure of the statistical dependence between two data sets and are particularly suitable for registration across different imaging mechanisms. However, they are computationally expensive, which may be restrictive, as remote sensing images are usually relatively large.

FREQUENCY-DOMAIN APPROACHES
Frequency-domain approaches indirectly utilize intensity information, transforming an image and exploiting its frequency-domain features for registration. By doing so, they accelerate the computation of relatively small geometric dislocations. Fourier techniques are typical representatives of frequency-domain registration and were first used to register images with translational changes [48]. Phase-based correlation approaches [23], [49]–[51] exploit the Fourier transform to register images by searching for a global optimal match [53]; they compute the cross-power spectrum of the sensed and reference images and seek the location of its peak. The translational and rotational properties of the Fourier transform are employed to calculate the transformation parameters [53]. Frequency-domain approaches are robust against frequency-dependent noise and illumination changes. They also contribute to the acceleration of computational efficiency [54] since they neither involve feature extraction, as feature-based approaches do, nor require an optimization procedure in the spatial domain, which would increase their computational complexity [53]. However, given that the Fourier transform offers poor spatial localization, the operation can be replaced by a wavelet transform with strong spatial and frequency localization [55], which can be applied to remote sensing image registration [56]. Recently, phase congruency (PC) has been used to represent structural information in remote sensing images; it is similar to the image gradient but is invariant to image contrast and brightness variations [57], [58].

In short, most correlation-like approaches are statistical similarity metrics that do not exploit structural information and entail high computational complexity. Owing to their easy hardware implementation, they remain in frequent use for registration evaluation [59]. Fourier techniques have some advantages in terms of computational efficiency, and they are robust against frequency-dependent noise. However, they have limitations in the case of image pairs with significantly different spectral content. Although MI methods offer outstanding performance compared with the two aforementioned algorithms, they do not always provide the global maximum of the entire search space for the transformation, as insufficient image information or limited overlap between the two scenes inevitably reduces their robustness [16], [25]. Overall, intensity-based approaches directly use the pixel values of an image, without error accumulation, offering high-precision registration. However, these algorithms have limitations in terms of large rotations, translations, scale differences, and so on, and are quite time-consuming.
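Referring back to the phase-based correlation approaches above, the following sketch estimates a purely translational offset from the peak of the inverse-transformed cross-power spectrum. It is an illustrative outline under the assumption of an integer-pixel shift and periodic image content, not the implementation used in [23], [49]–[51].

```python
import numpy as np

def phase_correlation_shift(ref, sen):
    """Return the (dy, dx) translation to apply to `sen` so that it aligns with `ref`."""
    F_ref = np.fft.fft2(ref)
    F_sen = np.fft.fft2(sen)
    # Normalized cross-power spectrum; epsilon guards against division by zero.
    cross = F_ref * np.conj(F_sen)
    cross /= np.abs(cross) + 1e-12
    corr = np.fft.ifft2(cross).real
    # The correlation peak indicates the translation (modulo the image size).
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Map shifts larger than half the image size to negative offsets.
    if dy > ref.shape[0] // 2:
        dy -= ref.shape[0]
    if dx > ref.shape[1] // 2:
        dx -= ref.shape[1]
    return dy, dx
```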
OPTICAL FLOW ESTIMATION
Similar to the area-based approaches, optical flow estimation calculates object motions with direct and indirect consistency constraints based on pixel intensity. This technique is popular in computer vision for motion estimation. Owing to the similarity between the displacements of corresponding pixels under the same coordinate system and the optical flow of an object, some studies have utilized optical flow estimation to register remote sensing images [60], [61]. Unlike area-based approaches, optical flow estimation calculates pixel displacements based on intensity and gradient consistency constraints for coordinate recalculation. After resampling, the intensity value is assigned to the new noninteger position, and the aligned image is acquired [62], as summarized in Figure 7.

FIGURE 7. Optical flow estimation for remote sensing image registration. [(u_i, v_j) indicates the pixel coordinates in the reference image, and (u'_i, v'_j) indicates the coordinates of the corresponding pixel in the sensed image. The coordinate difference, which we call the displacement, is depicted as (Δu_i, Δv_j), so that u_i = u'_i + Δu_i and v_j = v'_j + Δv_j.]

Optical flow is a 2D displacement field that describes the apparent motion of brightness patterns between two successive images [63], and its concept was proposed by Gibson [64]. Horn and Schunck (HS) [63] and Lucas and Kanade (LK) [65] proposed differential approaches for optical flow calculation in 1981. Since then, many extensions and modifications have been proposed for video image processing [66]–[68]. Given that the technique is at the initial stage of development in the remote sensing field and that many studies have focused on differential techniques, the following aspects are generally emphasized in research on remote sensing image registration.

DENSE OPTICAL FLOW ESTIMATION
The differential method for dense optical flow calculation proposed by HS is generally called the typical global approach [63]. Dense optical flow calculates each pixel's motion in a scene, as in Figure 8. (In that figure, the regular grid represents image pixels, and the displacement is displayed at equal intervals; only the displacement directions and magnitudes of the green pixels are marked, for brevity.)

FIGURE 8. Dense optical flow.

The HS optical flow integrates the brightness constancy assumption and the global smoothness constraint to separately estimate the pixel motion in the x and y directions. The intensity constancy assumption is markedly susceptible to slight brightness changes [69], which are inevitable for remote sensing images. Applying the spatial gradient constancy assumption to the HS equation, as in (3), is popular in research on multitemporal remote sensing image registration [62], [69]:

E(u, v) = \int_{\Omega} \psi\left( \left| I(\mathbf{x} + \mathbf{w}) - I(\mathbf{x}) \right|^2 + \gamma \left| \nabla I(\mathbf{x} + \mathbf{w}) - \nabla I(\mathbf{x}) \right|^2 \right) d\mathbf{x} + \alpha \int_{\Omega} \psi\left( \left| \nabla_3 u \right|^2 + \left| \nabla_3 v \right|^2 \right) d\mathbf{x},        (3)

where w = (u, v, 1)^T is the pixel displacement to be solved, x = (x, y, t)^T is a pixel coordinate, \psi(s^2) = \sqrt{s^2 + \epsilon^2} is an increasing concave function, and \epsilon is a fixed value. Here, \gamma and \alpha are the weights of the gradient constancy and smoothness terms, respectively, and \nabla_3 = (\partial_x, \partial_y, \partial_t)^T indicates a spatiotemporal smoothness assumption; it is often replaced by the spatial gradient when used for remote sensing image registration. Owing to the advantages of the per-pixel computation of optical flow estimation, very local deformation due to terrain elevation can be eliminated.

Occlusion remains a challenge for accurate dense optical flow calculation [66]; it is similar to land use (LU) and land cover (LC) changes in remote sensing images [62]. Under this circumstance, an object in the reference (sensed) image cannot be found in the sensed (reference) image. For example, in the yellow rounded rectangles in Figure 9(a) and (b), a road disappears in the sensed image. This leads to abnormal pixel displacements, shown in Figure 9(c), whose magnitudes and directions are inconsistent with the neighborhood. The successive abnormal displacements further change the content of the aligned image, although it is highly geometrically aligned with the reference image in Figure 9(d). This change violates the principle of image registration, which is to spatially align the sensed and reference images without altering the image content. After the abnormal displacement correction, the recalculated displacement is similar to that of the surrounding region, as in Figure 9(e). Furthermore, the aligned image is similar to the corresponding region in the sensed image in Figure 9(b), and the two are spatially aligned with the reference image, as in Figure 9(f).

For large-scale movements, which are another concern when applying optical flow to remote sensing image registration, an improved approach was proposed in [70]. The pixel displacement calculated by an extended phase correlation technique is used as the initial motion estimate for the global optical flow to achieve general remote sensing image registration, especially for large-scale movement deformation [70].
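As a minimal illustration of the dense, global family to which (3) belongs, the sketch below implements the classic HS scheme (brightness constancy plus quadratic smoothness) rather than the full robust energy in (3); the iteration count, smoothness weight, and use of SciPy filters are assumptions chosen for the example.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(ref, sen, alpha=10.0, n_iter=200):
    """Classic HS dense flow: per-pixel (u, v) relating `ref` to `sen`."""
    ref = ref.astype(np.float64)
    sen = sen.astype(np.float64)
    # Spatiotemporal derivatives from simple finite differences.
    kx = np.array([[-1.0, 1.0], [-1.0, 1.0]]) * 0.25
    ky = np.array([[-1.0, -1.0], [1.0, 1.0]]) * 0.25
    kt = np.ones((2, 2)) * 0.25
    Ix = convolve(ref, kx) + convolve(sen, kx)
    Iy = convolve(ref, ky) + convolve(sen, ky)
    It = convolve(sen, kt) - convolve(ref, kt)
    # Kernel that averages neighboring flow values for the smoothness term.
    avg = np.array([[1/12, 1/6, 1/12], [1/6, 0.0, 1/6], [1/12, 1/6, 1/12]])
    u = np.zeros_like(ref)
    v = np.zeros_like(ref)
    for _ in range(n_iter):
        u_bar = convolve(u, avg)
        v_bar = convolve(v, avg)
        # Jacobi update derived from the Euler-Lagrange equations of the HS energy.
        num = Ix * u_bar + Iy * v_bar + It
        den = alpha**2 + Ix**2 + Iy**2
        u = u_bar - Ix * num / den
        v = v_bar - Iy * num / den
    return u, v
```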
However, given that dense optical flow estimation calculates the displacement for each pixel, it is not suitable for the real-time registration of large images, although it provides a high-precision result.

SPARSE OPTICAL FLOW ESTIMATION
Sparse optical flow estimation is more popular for remote sensing image registration than its dense counterpart. The sparse optical flow represented by local differences is supported only in specified local regions, such as the positions of the feature points extracted by popular extractors, including the scale-invariant feature transform (SIFT), as shown in Figure 10. This approach assumes that pixel motions are identical within a local neighborhood and estimates the optical flow by performing least-squares regression on a set of similar equations [66]. The LK gradient-based approach [65], as the originating local method, is widely used to estimate the motion of video images, on an equal footing with the HS model. The GeFOLKI algorithm was developed from LK and implemented on a graphics processing unit to achieve real-time and robust optical flow estimation [60], [71]. Furthermore, the GeFOLKI algorithm has been adopted for the coregistration of heterogeneous data, such as synthetic aperture radar (SAR)–lidar and SAR–optical image pairs [61]. Subsequently, given the different imaging mechanisms of SAR and high-resolution optical images, and to benefit from the high registration precision of optical flow estimation, two dense feature descriptors replace the raw intensities when aligning images by an optical-to-SAR flow; this combines the global and local optical flow estimation approaches [72].
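For orientation, a typical sparse pipeline in the LK spirit can be assembled from standard OpenCV building blocks: detect distinct points, then track them from the reference image into the sensed image. This is a generic sketch assuming 8-bit grayscale inputs, not the GeFOLKI implementation of [60], [71]; the detector settings and window size are assumptions.

```python
import cv2
import numpy as np

def sparse_flow(ref_gray, sen_gray, max_corners=500):
    """Track corner points from the reference image into the sensed image (pyramidal LK)."""
    # Distinct points (Shi-Tomasi corners) play the role of the sparse support.
    pts_ref = cv2.goodFeaturesToTrack(ref_gray, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=10)
    # Pyramidal Lucas-Kanade tracking of those points into the sensed image.
    pts_sen, status, _ = cv2.calcOpticalFlowPyrLK(
        ref_gray, sen_gray, pts_ref, None, winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1
    pts_ref = pts_ref.reshape(-1, 2)[ok]
    pts_sen = pts_sen.reshape(-1, 2)[ok]
    # Per-point displacements (the sparse optical flow).
    return pts_ref, pts_sen, pts_sen - pts_ref
```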
(a) (b) (c) (d) (e) (f) FIGURE 9. Abnormal displacement detection and correction. (a) The reference image. (b) The sensed image. (c) The displacement field estimated by (3). (d) The aligned image overlapping (a). (e) The corrected displacement field. (f) The aligned image formed by overlapping the corrected optical flow with (a) The highlighted road in (a) disappears in (b), leading to similar occlusion. its accuracy for remote sensing image registration is relatively low compared with the dense optical flow approach. In addition, it is not vulnerable to LU–LC changes because it does not have similar features for sparse optical flow estimation in the changed region. In summary, optical flow estimation has been developed in computer vision for motion estimation in superresolution reconstruction for several decades, whereas it is in the initial stage of use in remote sensing image registration. Optical flow estimation is a superior pixel displacement calculation approach that is particularly interesting in the case of very local deformation due to, for example, terrain elevation, which has considerable influence on high-resolution image registration [61]. The efficiency of optical flow estimation should be considered when applying it to remote sensing because a wide field of view (WFV) is a characteristic of remote sensing image. Therefore, due to social development and seasonal changes, LU–LC changes are frequent phenomena for multitemporal remote sensing images. The dense optical flow approach is sensitive to such changes, leading to abnormal displacement and the alteration of the content of an aligned image. Therefore, efficient and accurate correction should be integrated into the initial optical flow estimation when used for registration. and automatically detected to represent the original remote sensing image. The feature correspondence is then established between the reference and sensed images by a similarity comparison of the feature descriptors. The geometric relationship is calculated, guiding a sensed image that is spatially aligned with the reference. Ultimately, coordinates in the sensed image are transformed. The transformed coordinates are usually noninteger, and they are calculated by interpolation to acquire their intensity values, as demonstrated in Figure 11. In the following, we summarize geometrical feature extraction and matching because research into this subject has been at the core of the traditional feature-based approach. FEATURE EXTRACTION The feature extraction mentioned here is a representation of feature detection and extraction. Detection aims to locate distinctive features in an image and determine their positions. In the feature-extraction stage, the recognizable descriptor is uniquely constructed, identifying the detected feature. Formerly, features were manually selected. This approach is still in use today, as in the “image-to-image registration” module in Environment for Visualizing Images (ENVI) software. Experts require a considerable amount of time for this approach, FEATURE-BASED REGISTRATION The feature-based approach directly exploits the abstract features of an image, rather than the pixel intensity, for registration. Feature refers to a distinct geometrical or advanced characteristic extracted by a specified approach. Geometrical features are distinct points, line segments, and closed boundary regions in a remote sensing image that can be detected or extracted by extant or novel approaches. 
Advanced features are abstract descriptions of local regions, which are extracted by a neural network (NN) (especially in the deep learning approach) to represent the original image. Geometric features are understood as being conventional for feature-based registration, and the use of advanced features is defined as novel feature-based registration. CONVENTIONAL FEATURE-BASED METHOD In general, salient and distinctive features, such as points, line segments, and closed boundary regions, are manually DECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE Pixels Feature Points Optical Flow FIGURE 10. Sparse optical flow. 127
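The workflow just described (and summarized in Figure 11) can be outlined with standard OpenCV components; the ORB detector, the RANSAC threshold, and the choice of a projective model are illustrative assumptions rather than recommendations from the text.

```python
import cv2
import numpy as np

def register_feature_based(ref_gray, sen_gray):
    """Geometrical feature-based registration: detect, describe, match, fit, resample."""
    orb = cv2.ORB_create(nfeatures=4000)
    kp_ref, des_ref = orb.detectAndCompute(ref_gray, None)
    kp_sen, des_sen = orb.detectAndCompute(sen_gray, None)
    # Feature matching by descriptor similarity (Hamming distance for binary ORB descriptors).
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_sen, des_ref)
    src = np.float32([kp_sen[m.queryIdx].pt for m in matches])
    dst = np.float32([kp_ref[m.trainIdx].pt for m in matches])
    # Mismatched-feature elimination and transformation model construction (RANSAC).
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=3.0)
    # Coordinate transformation and resampling of the sensed image onto the reference grid.
    aligned = cv2.warpPerspective(sen_gray, H, (ref_gray.shape[1], ref_gray.shape[0]))
    return aligned, H, int(inliers.sum())
```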
FEATURE EXTRACTION
The feature extraction mentioned here covers both feature detection and extraction. Detection aims to locate distinctive features in an image and determine their positions. In the feature-extraction stage, a recognizable descriptor is uniquely constructed, identifying the detected feature. Formerly, features were manually selected. This approach is still in use today, as in the "image-to-image registration" module in Environment for Visualizing Images (ENVI) software. Experts require a considerable amount of time for this approach, especially for large remote sensing images. At present, many methods have been proposed to automatically acquire representative features. Common geometrical features, including salient points (line intersections, corners, points on curves with high curvature, and road crossings) [73], [74], polylines (roads, contours, and edges) [41], [75], and polygons (closed boundary regions and lakes) [76], are selected by the specified approach. As shown in Figure 11, the yellow points, line segments, and regions are detected to abstractly describe the original image.

FEATURE POINTS
The local points at which the gray value varies dramatically in all directions are feature points, including corner points, inflection points, and T-intersection points. Many attempts have been made to extract them in computer vision, inspiring the development of feature point extraction in remote sensing. The first corner detection approach was proposed by Moravec in 1977 [77]. This algorithm is fast to compute but is sensitive to noise and vulnerable to image rotation, leading to its rare use in the remote sensing field. The Harris corner detector was proposed in 1988 [78]. This algorithm is invariant under grayscale and rotational changes. It and improved Harris algorithms are applied to remote sensing image processing [38], [74], [79], [80], mainly with respect to multiscale corner detection. Smith and Brady presented the smallest univalue segment assimilating nucleus operator [81], which is insensitive to local noise and has high anti-interference ability [82]. However, it is not widely used in remote sensing image registration [83], whereas the SIFT algorithm is [45], [58], [74], [84]–[90]. The SIFT was developed by Lowe [92] and is invariant under rotation, scale, and translational changes [93]. It has been followed by many improved versions, such as principal component analysis SIFT [94], scale-restriction (SR) SIFT [36], [95], affine SIFT [96], and uniform robust SIFT [97], [98]. Moreover, the speeded-up robust features (SURF) algorithm [99] was proposed by Bay et al. to overcome the time-consuming nature of the SIFT for large-scale remote sensing images [100]–[102]. SURF applies an integral image to compute image derivatives and quantifies the gradient orientations in a small number of histogram bins [103]. Additionally, the features from accelerated segment test (FAST) [104]; binary, robust, independent elementary features (BRIEF) [105]; oriented FAST and rotated BRIEF [106], [107]; Kaze [108]; and accelerated Kaze [109] algorithms are fast tools for descriptor construction but are less widely utilized in remote sensing. In addition, a novel key point detector combining corners and blobs for remote sensing image registration is under development to increase the number of correctly matched features [110]. Recently, looking at intensity differences in multimodal remote sensing images, robust and novel feature descriptors have been adopted to depict detected feature points; these include the local self-similarity (LSS) descriptor, which accommodates effects such as nonlinear intensity differences [36]; the histogram of oriented PC, based on structural similarity measures [57]; and maximally stable PC, representing a novel affine- and contrast-invariant descriptor [111]. All of these incorporate PC information. PC is similar to the image gradient, presenting structural information with resistance to variations in illumination [112]. Therefore, the use of PC information is a trend in the construction of robust feature descriptors for multimodal remote sensing images.
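To make the Harris detector [78] mentioned above concrete, the following sketch computes the classic corner response from the local structure tensor; the Gaussian scale, the sensitivity constant k, and the thresholding rule are conventional default choices assumed for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(img, sigma=1.5, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)^2 from the structure tensor M."""
    img = img.astype(np.float64)
    Ix = sobel(img, axis=1)
    Iy = sobel(img, axis=0)
    # Smoothed products of derivatives form the local structure tensor.
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    det_m = Sxx * Syy - Sxy**2
    trace_m = Sxx + Syy
    return det_m - k * trace_m**2

# Hypothetical usage: keep the strongest responses as candidate feature points.
# resp = harris_response(image)
# corners = np.argwhere(resp > 0.01 * resp.max())
```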
FEATURE LINES
A feature line is also known as a line feature; it is a generalization of feature points and includes general line segments [113], object contours [75], roads, coastlines [114], and rivers [115]. Given that feature lines have more attributes than feature points as control features [116], they have been gradually developed for use in image registration [117] as well as remote sensing image registration [116], [118], [119]. Standard edge detection, as with the Canny detector [120], [121], and detectors based on the Laplacian of Gaussian [122] are conventional feature line detection approaches [16]. Recently, some excellent detectors generating precise and robust line segments have been proposed [123], [124], and they are suitable for line detection in remote sensing images. Feature lines are comparatively less utilized in the remote sensing field than feature points because matching them is an obstacle. They are often abstracted into corners, midpoints, and endpoints as final features [16], thereby losing their geometric value.

FIGURE 11. The geometrical feature-based registration algorithm.
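A common way to obtain such line features, in the spirit of the edge-based detectors cited above, is to chain an edge detector with a line-segment extractor. The sketch below uses Canny edges and the probabilistic Hough transform from OpenCV; all thresholds are illustrative assumptions.

```python
import cv2
import numpy as np

def detect_line_features(gray, canny_low=50, canny_high=150):
    """Return line segments [(x1, y1, x2, y2), ...] detected in an 8-bit grayscale image."""
    edges = cv2.Canny(gray, canny_low, canny_high)
    # The probabilistic Hough transform groups edge pixels into straight segments.
    segments = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                               minLineLength=40, maxLineGap=5)
    return [] if segments is None else segments.reshape(-1, 4)
```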
FEATURE REGION
Feature region is a general term for all closed boundary regions of appropriate size, e.g., lakes [125], forests [126], buildings [113], urban areas [127], and so on. Before robust feature point extraction approaches were developed, feature regions were used to indirectly extract feature points. Regions with high contrast were extracted by filtering [128] and image segmentation [129] and described with moment-invariant descriptors [130], [131]. They are often abstracted by their centers of gravity [128], [132]–[135], which are invariant with respect to rotation, scaling, and skewing and are stable under random noise and gray-level variation [16]. Compared with feature points and lines, the extraction and description of feature regions were relatively early foci of research, and they have been used less in recent feature-based registration.

FEATURE MATCHING AND MISMATCHED FEATURE ELIMINATION
The correspondence relationship between reference and sensed images can be established based on detected feature points, lines, and regions, exploiting the various descriptors of the features [16], [136], [137]. Mismatched features are an inevitable byproduct of general feature matching, and their elimination purifies the correspondences used to generate transformation models that are as accurate as possible. A pair of features with similar attributes is considered a candidate match despite radiometric differences, noise, image distortion, and so forth. Under these circumstances, a robust matching measurement is essential. Feature matching approaches can generally be classified into two categories, namely, feature similarity and spatial relations.

FEATURE SIMILARITY
The constructed feature descriptors are used to establish the correspondence between the features extracted in the reference and sensed images through feature similarity comparison. Feature similarity is evaluated in the feature space by using the Euclidean distance ratio between the first and second nearest neighbors [92]. For efficiency, the k-dimensional tree and best-bin-first algorithms are employed for feature similarity determination [93], [138]. The clustering technique [140], chamfer matching [141], and PC models are frequently used matching approaches, and they are invariant under intensity changes during matching [1].

SPATIAL RELATIONS
Aimed at tie point matching in poor textural regions, approaches based on spatial relations have been developed. Representative of these, graph-based feature point matching considers feature points as graph nodes. Feature matching is then transformed into a node-correspondence problem and solved by graph matching [125], [142]. Graph matching is applied to image feature correspondences, although it is not affine invariant [143]. By finding a consensus nearest-neighbor graph from candidate matches, a graph-transformation matching approach was developed [144]. Targeting the problem in [143], a similar graph matching method for tie point matching in poor textural images was proposed [101]. Furthermore, Xiong and Zhang introduced a novel interest point matching method for high-resolution satellite images [145]. Here, the relative position and angle are used to reduce ambiguity and to avoid false matching, as the approach is suitable for image shifting and rotation. Affine and large-scale transformations are not considered [144].
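The nearest-neighbor distance ratio test described above can be sketched with a k-d tree over the descriptor vectors; the 0.8 ratio threshold follows Lowe's common choice [92] but remains an assumption here, as does the use of SciPy.

```python
import numpy as np
from scipy.spatial import cKDTree

def ratio_test_matches(des_ref, des_sen, ratio=0.8):
    """Match sensed descriptors to reference descriptors with the 1st/2nd nearest-neighbor ratio test."""
    tree = cKDTree(des_ref)
    # For each sensed descriptor, find its two nearest reference descriptors.
    dists, idx = tree.query(des_sen, k=2)
    keep = dists[:, 0] < ratio * dists[:, 1]
    # Pairs (index in sensed set, index in reference set) that pass the test.
    return np.column_stack([np.nonzero(keep)[0], idx[keep, 0]])
```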
MISMATCHED FEATURE ELIMINATION
Although the extracted features in a reference image have been matched with the corresponding ones in the sensed image via the aforementioned approaches, some mismatched feature points are inevitable and further affect the transformation model estimation [32], [76]. Therefore, eliminating mismatched features with a specified approach is necessary [146], [147]. Generally, based on the initial matching result, random sample consensus (RANSAC) is used to remove mismatched points. This method randomly selects a sample from the candidate set in each iteration and finds the largest consensus set to calculate the final model parameters [33], [148]. RANSAC performs well and robustly when there are no more than 50% outliers [144], [149], [150]. Combining the local structure with global information, a restricted spatial order constraints algorithm was developed to find exactly matched feature points in the reference and sensed images [144]. Based on the affine-invariance property of the triangle-area representation (TAR), a robust sample consensus judging algorithm was proposed to efficiently identify bad samples and ensure accuracy with a light computational load [151]. For images with simple patterns, large affine transformations, and low overlapping areas, a mismatch-removal principle based on the TAR value of the k-nearest neighbors was proposed, referred to as k-nearest neighbors–TAR [149]. Furthermore, an improved RANSAC approach called fast sample consensus was developed to obtain correct matching in a few iterations [150], [152]. Thus, most of the reserved feature points in the reference image accurately correspond to the specified feature points in the sensed image; the feature points connected by the yellow lines in Figure 12 will add precision to the transformation model estimation in the following step.

FIGURE 12. Feature matching examples.

The geometrical feature-based approach abstracts an original remote sensing image with distinct features instead of its intensity information, which is efficient and can easily handle large rotations, translations, and scale differences between reference and sensed images. However, position errors in the automatically extracted features are inevitable, and a few mismatched features cannot be eliminated. This leads to a relatively low registration precision compared with the intensity-based approach.
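To illustrate the consensus idea behind RANSAC as used for mismatch elimination above, the sketch below fits a 2D affine model to putative point matches and keeps the largest inlier set; the iteration count and inlier threshold are assumptions, and production code would normally rely on a library implementation instead.

```python
import numpy as np

def ransac_affine(src, dst, n_iter=1000, tol=3.0, rng=None):
    """Fit dst ~ [x, y, 1] @ A by RANSAC; src and dst are (N, 2) matched coordinates."""
    rng = np.random.default_rng(rng)
    src_h = np.hstack([src, np.ones((len(src), 1))])      # homogeneous coordinates
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        sample = rng.choice(len(src), size=3, replace=False)
        # Least-squares affine model from the minimal three-point sample.
        A, *_ = np.linalg.lstsq(src_h[sample], dst[sample], rcond=None)
        residuals = np.linalg.norm(src_h @ A - dst, axis=1)
        inliers = residuals < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit the model on the largest consensus set (the retained matches).
    A, *_ = np.linalg.lstsq(src_h[best_inliers], dst[best_inliers], rcond=None)
    return A, best_inliers
```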
NOVEL FEATURE-BASED REGISTRATION BY DEEP LEARNING
Deep learning provides a new concept for remote sensing image registration. It essentially refers to image registration based on advanced feature extraction [153]. Deep learning originated in computer vision and has a long history [154]. In recent years, it has gradually entered use in remote sensing image applications, such as image fusion [155], [156], LC classification [157], [158], and segmentation [159]. The framework is data driven and can generate image features by learning from many training data sets with a specified principle [158]. Therefore, it is suitable for remote sensing image registration. Some studies have focused on feature matching for this purpose [158], [160]. Most utilize a Siamese network consisting of two parts to train a deep NN (DNN) [161]–[164]. One part extracts features from image patch pairs by training a Siamese, pseudo-Siamese, or improved Siamese network [165]; the other part measures the similarity between these features for image matching. In [164], the DNN inspired the construction of a deep learning framework for remote sensing image registration. In addition, generative adversarial networks (GANs) have been applied to image matching and registration [166], [167]. These approaches first translate one image into the domain of the other by training the GANs, enabling the two images to have similar intensities and feature information [166], [168]. Feature extraction and matching are subsequently performed between the two artificially generated images, effectively improving the performance of image matching. To address the deficiencies of specified-scale NNs, multitask learning has been introduced to improve registration precision [169]. Wang et al. break through the limitations of the traditional deep learning approach, which extracts image features in one network and matches them with another NN: they design an end-to-end network using forward propagation and backward feedback to learn the mapping functions of the patches and their matching labels for remote sensing image registration [164]. Recently, Li et al. paired image blocks from sensed and reference images and directly learned the displacement parameters of the four corners of the sensed block relative to the reference image with a deep learning regression network, which differs from the traditional deep learning method [170].

Deep learning has advantages over the traditional registration approach. It is completely data driven and has strong flexibility, enabling it to theoretically fit any complex mapping function, whereas the traditional registration method can deal only with fixed-pattern registration. Moreover, deep learning extracts abstract, high-level semantic information. Compared with low-level gray and gradient data, deep semantic information is more consistent with the way humans understand images. Therefore, deep learning methods can extract robust features. However, deep learning also has challenges. It depends heavily on image samples; when there is a lack of data or the data quality is poor, deep learning methods have difficulty ensuring the effectiveness of the registration results. Although remote sensing images are now easy to acquire, the lack of manual annotation and standard data is still very serious. Deep learning, in essence, learns the statistical characteristics of a large number of similar images, but its input–output process is a complex, nonlinear mapping without clear physical significance. Additionally, deep learning requires high computing power and has major hardware requirements, limiting its applicability.
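As a schematic of the two-part Siamese design described above (a shared feature extractor plus a similarity measure), the following PyTorch fragment shows the structure only; the layer sizes, patch size, and distance-based similarity are assumptions and do not reproduce any of the networks in [161]–[165].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiamesePatchMatcher(nn.Module):
    """Shared-weight feature extractor; similarity = negative distance between embeddings."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(            # part 1: feature extraction
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128),
        )

    def forward(self, patch_ref, patch_sen):
        # The same weights embed both patches (the two Siamese branches).
        f_ref = self.features(patch_ref)
        f_sen = self.features(patch_sen)
        # part 2: similarity measurement between the embeddings
        return -F.pairwise_distance(f_ref, f_sen)

# Hypothetical usage with batches of 64 x 64 grayscale patches:
# model = SiamesePatchMatcher()
# score = model(torch.rand(8, 1, 64, 64), torch.rand(8, 1, 64, 64))
```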
In short, remote sensing image registration based on deep learning is still in its infancy, and its registration framework is not mature. However, many studies have demonstrated that deep learning methods can achieve or even surpass the optimal level of traditional registration approaches in terms of accuracy and efficiency. We predict that deep learning-based methods will become important solutions to the problem of real-time, high-precision remote sensing image registration.

REGISTRATION BASED ON THE COMBINATION METHOD
As mentioned, feature- and intensity-based approaches have their own advantages. Different feature extractors also have various precisions. To integrate these strengths as fully as possible, combination techniques have been developed. Typically, popular combinations consist of two aspects, namely, feature- and area-based approaches; however, some integrate two geometric feature-based approaches, such as the SIFT and Harris detectors.

COMBINATIONS OF FEATURE- AND AREA-BASED ALGORITHMS
Feature-based approaches are typically suitable for images with more significant structural data than intensity information. However, they are restricted by the distribution and accuracy of the features. On the other hand, area-based approaches are appropriate for images with more distinctive intensity information; however, they require the intensity information of the reference and sensed images to be correlated. Thus, the two methods have complementary pros and cons. To further improve registration accuracy and robustness, some studies focus on a combination of geometric feature- and area-based techniques [171]. Huang et al. [172] proposed a hybrid approach to aligning images by intensities within a scale-invariant feature region. Elsewhere, a wavelet-based feature extraction technique and an area-based method with NCC were combined to reduce the local distortion caused by terrain relief [173]. In a wavelet-based hierarchical pyramid framework, Mekky et al. [174] proposed a hybrid approach using MI and the SIFT; by employing the rough registration parameters of the area-based MI approach, the number of false alarms obtained by the SIFT was reduced.
In addition, Gong et al. employed the robustness of the SIFT and the accuracy of MI, proposing a novel coarse-to-fine registration framework aimed at registering optical and SAR remote sensing images [90]. For multisensor SAR image registration, Suri et al. proposed a multistage registration strategy: the rough parameters of the transformation model are estimated by MI, and this model is introduced during the SIFT matching phase to increase the number of tie points [175]. Combining the SIFT and MI in a similar way, Heo et al. introduced a stereo matching method that produces accurate depth maps [176]. All these approaches can be considered coarse-to-fine processing chains. The basic idea is to improve the result of the feature-based approach by adopting an optimization process from an area-based technique [90], [171]. The combined methods integrate the robustness of the feature-based algorithm with the accuracy of the area-based approach. They are relatively few compared with individual methods, but, from our point of view, their combination will be a focus in the near future. To deal with the possible accumulation of errors, bundle block adjustment is usually needed [178], [179] to register sequential images. Moreover, the integration of different geometric feature-based approaches is being developed as well, for ever-increasing transformation model estimation accuracy, generating registration results that are as precise as possible.

INTEGRATION OF TWO GEOMETRIC FEATURE-BASED APPROACHES
In addition to combinations of feature- and area-based techniques, the integration of two geometric feature-based approaches is a developing trend for high-precision registration. In particular, the feature points extracted by different methods are used to register images in two stages. Yu et al. proposed extracting feature points using the SIFT for the preregistration of Satellite Pour l'Observation de la Terre-5/Thematic Mapper/QuickBird images from different sensors [74]. In the fine registration stage, the Harris algorithm for corner point detection is enforced to detect distinct corners, and the extracted points are matched by the NCC algorithm. Similarly, Lee used SURF to extract the feature points of a low-resolution image after Haar wavelet transformation, which is defined as rough registration [180]; fine registration is the same as in the approach proposed by Yu et al. Recently, Ye et al. utilized SR–SIFT to extract feature points in the preregistration stage to eliminate distinct translation, rotation, and scale differences. To further optimize registration, the Harris algorithm was employed to detect feature points in the reference and prealigned images and describe them by LSS for matching [36]. To register large, high-resolution remote sensing images, a coarse-to-fine strategy combining the Harris–Laplace detector with the SIFT descriptor has been proposed; after rough registration, a large image is divided into small, processable blocks for fine alignment [181]. Additionally, in a new two-step registration, the approximate spatial relationship is calculated with deep features using a convolutional NN in the first step. Then, the previous result is adjusted based on the extracted local features [182]. Another technique combines feature point and feature line methods for the registration of images covering low-texture scenes in the computer vision field [183].
Since low- and repeated-texture regions are common in remote sensing images, feature lines can be employed to supplement the number of feature points. Therefore, besides the combination of two geometric feature-based methods, the integration of different geometrical features has great potential for the high-precision alignment of remote sensing images [22]. Since combination schemes integrate the advantages of two or more registration approaches, they offer remarkable precision. Moreover, in general, preregistration provides a rough result that approximates the final alignment. With fine-tuning in the optimized registration stage, a high-precision registration result is finally acquired. This strategy is suitable for remote sensing image registration with large spatial position differences, although it is as time-consuming as two or more alignment strategies combined.

SOFTWARE-BASED REGISTRATION
Most reviews emphasize the ever-increasing number of image registration approaches that improve on existing methods for registering larger and more complicated images [16], [184]. Few studies have evaluated the performance of software-embedded image registration modules and the packages/tools for image geometric registration [185]. Thus, in this section, we present some examples. The Earth Resources Data Analysis System (ERDAS), ENVI, PCI Geomatica, ER Mapper, and Arc Geographic Information System (GIS) are well-known software packages for remote sensing image processing that include registration modules. (ER Mapper was acquired by ERDAS a few years ago.) They integrate conventional manual and automatic registration programs. Concretely, ENVI can register two remote sensing images or align one image with a map covering the same scene. A user extracts tie points by observing similar objects lying in the two images, such as corners of buildings, road intersections, inflection points of rivers, and so on. With a uniform point distribution, the parameters of a specified transformation model can be estimated. Some general geometric mapping functions are available, including affine, polynomial, and triangulation transformation models. Geometric mapping is generally conducted by an expert and is time-consuming and tedious. It is difficult to avoid subjective factors while extracting tie points, especially when registering WFV images, which require more time than general image registration. To liberate the productive forces and improve registration efficiency, an automatic alignment technique is also included in ENVI. The user specifies the reference and sensed images, respectively; after setting the area-based matching parameters, the tie points for transformation model construction are automatically extracted, and the aligned image is soon obtained. Neither the manually extracted tie points nor the automatically acquired points in ENVI are sufficiently accurate. For example, the coordinates of an extracted feature point are (157.05, 171),
which may suggest the neighborhood of the real corner. Under this circumstance, the calculated geometric spatial relationship is not as precise as it could be. The obtained registration result is usually worse than expected, especially for high-resolution remote sensing images with inconsistent local deformation.

ERDAS was developed by the ERDAS Corporation, in the United States. Compared with ENVI, it can produce tie points with higher location accuracies [for instance, the coordinates of an extracted feature point are (385.776, 75.161), which has more decimal places] to generate precise mapping functions between reference and sensed images that approximate the real geometric relations. Additionally, there are abundant transformation models, such as linear rubber sheeting, nonlinear rubber sheeting, and the direct linear transform. Elevation data can be introduced into the registration to generate a high-precision alignment of mountainous remote sensing images, even using the digital terrain model (DTM). Furthermore, the region and interval of the selected tie points can be set manually in the "AutoSync" module. To acquire a high-precision registration result, the elevation data (DEM or DTM) should be input at the same time as the image to be registered. If higher-spatial-resolution elevation data were included in ERDAS, the corresponding information would be automatically extracted when an image's geographic information was identified to register the input image.

Image registration can also be conducted in ArcGIS, although most researchers would probably utilize this software to solve GIS problems, such as spatial analysis. PCI Geomatica is oriented more toward producing orthophoto and fusion images than toward registering remote sensing images. However, both ArcGIS and PCI Geomatica contain an image registration module. The steps for alignment processing are similar to those of the aforementioned software, including manual registration and automatic operation. Some different transformation models, such as spline, similarity, polynomial, and projective transformations, are used to achieve the high-precision registration of complicated remote sensing images. However, sometimes the result is unsatisfactory for further applications, as the tie points are not uniformly distributed and their number is small.

Pixel Information Expert is a new generation of remote sensing image processing software developed by Beijing Aerospace Hongtu Information Technology. It can handle the dislocation of multisource, heterogeneous remote sensing images since it integrates a novel algorithm with a focus on multimodal remote sensing image registration. It can be tested free for 30 days. In addition, copyrighted geometric registration software, such as the Hyperspectral Image Processing and Analysis System, GeoImager, Titan Image, and so forth, has been produced by the Institute of Remote Sensing, Chinese Academy of Sciences.

Because high-resolution image registration is an important task in remote sensing image processing, much emphasis has been placed on it. To extract dense tie points representing local geometric relationships, SURF and an adaptive binning SIFT descriptor have been combined [186]. With the guidance of the local transformation model, an accurate registration result is obtained. The MATLAB code for the algorithm is provided, with experimental data, at https://www.researchgate.net/publication/320354469_HRImReg. The code is encrypted, and the parameters cannot be adjusted.
It can be used only for comparative experiments to evaluate a proposed approach. When conducting simulation experiments to assess a feature point detector, or when evaluating a mismatched-feature elimination approach on real data, the progressive sparse spatial consensus algorithm can be employed [187]. The code, with experimental data, is publicly available at https://github.com/jiayi-ma?tab=repositories. It has been tested on photographs from the computer vision field; to apply it to remote sensing images, some improvements are needed. Beyond these, there are many commercial and open-source software packages/tools for geometric registration. There are also different points of view, which should be discussed in depth in the future as more resources become available. However, an evaluation of registration approaches should be conducted as well, whenever an aligned image is generated from software or a proposed method.

EVALUATION OF IMAGE REGISTRATION ACCURACY
For the spatial alignment of remote sensing images, it is highly desirable to provide users with an estimate of how accurate the registration actually is. Accuracy evaluation is a nontrivial problem that is present in all literature on remote sensing image registration. We have identified three aspects of measuring registration accuracy on the basis of different considerations: tie point identification, the transformation model performance, and the alignment error. In this section, we review basic approaches for alignment assessment.

ACCURACY OF TIE POINTS
The quality and quantity of tie points are important to guarantee high-precision image registration. The number of redundant tie points, beyond the minimum needed to compute the specified transformation model, is essential information since we generally use as many tie points as possible to calculate the parameters of the mapping function for alignment. Furthermore, we must allow for a residual (Δx_i, Δy_i) of the ith extracted feature point with respect to its original position in the image [188]. If there are N tie points, the root-mean-square error (RMSE) can be estimated as follows:

RMS_{tp} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( (\Delta x_i)^2 + (\Delta y_i)^2 \right)}.        (4)

To enable general comparison, the RMSE should be computed across residuals normalized to the pixel size. Additionally, the bad-point proportion should be calculated to evaluate the extracted feature points; this is the number of residuals that lie above a certain threshold, given by a multiple of the pixel size.
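A direct transcription of (4), together with a simple bad-point count, is sketched below; the one-pixel threshold is an assumed example value.

```python
import numpy as np

def tie_point_stats(dx, dy, pixel_size=1.0, threshold=1.0):
    """RMSE of tie-point residuals, as in (4), and the proportion exceeding a threshold."""
    dx = np.asarray(dx, dtype=float) / pixel_size   # normalize residuals to pixel units
    dy = np.asarray(dy, dtype=float) / pixel_size
    rms_tp = np.sqrt(np.mean(dx**2 + dy**2))
    bad = np.hypot(dx, dy) > threshold              # residuals above the threshold
    return rms_tp, bad.mean()
```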
Besides the mentioned criteria, the distribution of tie points is attracting increased attention. To obtain a uniform distribution of tie points, some papers have proposed extracting feature points within specified subregions [30], with a detection approach employed to extract a specified number of feature points in each subregion. Tie points affect the registration accuracy but are not its sole influencer.

TRANSFORMATION MODEL PERFORMANCE
The transformation model abstractly represents the geometric mapping function from a sensed image to a reference image. The actual between-image geometric distortion is difficult to obtain without prior information, and the estimated transformation approximates the real geometric relationship between the images. Assuming N matched feature points, one part of the N pairs of tie points is taken for mapping function estimation through the least-squares method. The remaining points in the sensed image are employed as test points to be transformed into the reference image system [188]. The distance between each transformed coordinate and the corresponding point in the reference image is calculated as a residual, the root mean square of which represents the performance of the estimated transformation model:

RMS_{te} = \sqrt{\frac{1}{N - T} \sum_{j=1}^{N - T} \left( (x_j - H x'_j)^2 + (y_j - H y'_j)^2 \right)},        (5)

where H denotes the transformation model estimated from T pairs of tie points, and (x, y) and (x', y') represent the corresponding points in the reference and sensed images, respectively. Furthermore, a χ² goodness-of-fit test may be applied [188] to analyze whether the residuals are equally distributed across all quadrants. However, "overfitting" may yield zero error for a mapping model with sufficient degrees of freedom; this is a well-known phenomenon in numerical analysis. Under this circumstance, the registration results may not be optimal.

ALIGNMENT ERROR
The oldest method for estimating registration accuracy is visual assessment by a domain expert, which is still in use and remains the most effective technique, although it cannot be quantified [16], [188]. At present, this is performed using professional software, such as ENVI and ArcGIS, with shutter tools. Similarity metrics used in area-based registration, such as MI, NMI, CC, and so on, are frequently employed to evaluate alignment accuracy [59]. These indicators are, however, easily influenced by changes in image content caused by development and by differences in radiation. To quantitatively present the alignment error, the RMSE is calculated using feature points manually extracted by a specialist, employing (4) [85]. Since image registration aims to achieve the relative spatial alignment of two different images, there is no gold-standard reference image with which to evaluate the registration accuracy. When evaluating outcomes according to at least three criteria, the most indicative results point to the best registration, as different assessments have their own advantages and disadvantages.
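The hold-out evaluation in (5) can be written compactly for an affine model; the 70/30 split is an assumption for illustration, and H here is a 2 × 3 affine matrix estimated by least squares rather than a general mapping.

```python
import numpy as np

def holdout_rmse(ref_pts, sen_pts, fit_ratio=0.7, rng=0):
    """Estimate an affine model on T tie points and report RMS_te, as in (5), on the rest."""
    ref_pts = np.asarray(ref_pts, float)
    sen_pts = np.asarray(sen_pts, float)
    n = len(ref_pts)
    idx = np.random.default_rng(rng).permutation(n)
    fit, test = idx[: int(fit_ratio * n)], idx[int(fit_ratio * n):]
    # Least-squares affine model H from the T fitting pairs: ref ~ [x', y', 1] @ H.
    sen_h = np.hstack([sen_pts, np.ones((n, 1))])
    H, *_ = np.linalg.lstsq(sen_h[fit], ref_pts[fit], rcond=None)
    # Residuals of the held-out (N - T) test points, measured in the reference frame.
    res = sen_h[test] @ H - ref_pts[test]
    rms_te = np.sqrt(np.mean(np.sum(res**2, axis=1)))
    return H, rms_te
```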
FUTURE TRENDS
There has been a large number of independent studies on remote sensing image registration, and much effort has been put into constructing robust feature descriptors and eliminating mismatched features. With the development of sensor technology and application requirements, some novel opportunities and challenges must be addressed for remote sensing image registration. To us, it seems likely that the future of this field will include accelerated, combined, heterogeneous, cross-scale, and smart remote sensing image registration techniques, which are introduced in detail in the following.

ACCELERATED REMOTE SENSING IMAGE REGISTRATION
With the ongoing development of sensor technology, the spatial resolution of remote sensing images increases, resulting in a growing number of features with distinctive details. This huge number of features pushes the real-time registration of remote sensing images further out of reach, causing inefficiency when aligning large-scale images. Thus, constructing descriptors and matching the detected features is time-consuming for general images, especially WFV ones. As proposed in [52], to achieve real-time registration to the greatest extent possible, remote sensing image registration can be operated on a cloud platform based on finite-state chaotic compressed sensing theory. Similarly, cloud computing [91] and some hardware systems may also be effective for accelerating image registration. At present, parallel computing [139] is the easiest path to implementation: an image is divided into several subregions, and the image features in each one are simultaneously extracted, based on the same principles, on different parallel processors, as is the transformation model construction. The parallel commands are easy to implement in MATLAB and on other platforms.

COMBINED APPROACHES FOR IMAGE REGISTRATION
With the development of imaging sensors, the resolution of remote sensing images has increased, and local deformation has become obvious. For example, the geometric distortion caused by terrain relief and high-rise buildings leads to inaccurate registration [36], introducing difficulties for remote sensing image applications. In Figure 13(c), the reference and sensed images cover plain and mountainous regions simultaneously. When the displacements of corresponding pixels are calculated for spatial registration, the enlarged displacements in the specified rectangular regions are as shown in Figure 13(d) and (e). The magnitudes and directions of the displacements in the plain region are similar, but they differ in the mountainous region. Here, multistage registration with a global mapping function cannot exactly describe the spatial relationship between the reference and sensed images, and neither can a local transformation model. Given that displacements vary in different terrain regions, dividing images into a series of regions and registering each with a specified approach may yield a high-precision alignment, indicating a combination of different techniques.
FIGURE 13. The spatial position of corresponding pixels in a remote sensing image of complex terrain. (a) The reference image. (b) The sensed image. (c) The topographic image, with elevation ranging from about 400 to 1,722 m. (d) The displacements in the mountainous region marked with a yellow rectangle in (a) and (b). (e) The displacements in the plain region marked with a red rectangle in (a) and (b).
Concretely, the transformation model is calculated with distinct tie features in the plain region. With this transformation model, rather than directly obtaining the aligned plain region, the displacement guiding the pixels to alignment is estimated. In mountainous regions, dense optical flow estimation borrowed from computer vision is utilized to acquire the displacement of each corresponding pixel. Then, the displacement fields from the different terrain regions are mosaicked (e.g., using the inverse-distance-weighted function for uniform transitions, as in image stitching) to obtain a seamless displacement field for the entire image [177]. This is a creative combination of different registration approaches in a coordinated way, differing from the combined approaches mentioned in the "Registration Based on the Combination Method" section, which operate in a serial mode. Therefore, regional registration accommodating complex geometric relationships that vary with terrain differences may become a significant trend in remote sensing image registration, giving full play to the registration advantages of different approaches in various terrain regions.

HETEROGENEOUS AND CROSS-SCALE IMAGE REGISTRATION
Heterogeneous and cross-scale images collected at once and at different times provide complementary information to improve our understanding of an entire scene during Earth observation or even during disaster rescues. However, such data usually have dramatically different spatial resolutions, intensities, noise, geometries, and so on, owing to different imaging principles. Some studies have focused on their spatial registration, including optical image and SAR registration, optical image and infrared image registration, and satellite image and map registration [36], [57]. These works emphasize the robust construction of descriptors to resist intensity and noise differences and other influential factors. Large scale differences between cross-scale images (much greater than the four-times resolution difference between panchromatic and multispectral images) make it difficult to extract geometrical features from low-resolution images that are similar to those from high-resolution images. Thus, generating the tie features of cross-scale images for transformation model construction, let alone achieving high-precision registration, is difficult. Additionally, high-efficiency heterogeneous and cross-scale image registration remains an open problem that is worth researching in the near future. As a concrete example, the approximately real-time registration of optical and SAR images may offer an approach for analyzing disaster regions as quickly as possible for rescue purposes, by registering and comparing images before and after an event. These applications are vital for rescue operations. Precise and efficient heterogeneous and cross-scale image registration is a mandatory prerequisite for such high-precision, real-time applications.

SMART REMOTE SENSING IMAGE REGISTRATION
To register multiple remote sensing images, one simple and conventional idea is to align them frame by frame, namely, by converting multiple-image registration into pair-to-pair alignment. This process, learning from the simultaneous mosaicking of multiframe images, specifies a reference image connected to the others and stitches the other images to the reference one. Therefore, when the images to be registered are read into the program, the coordinates of the four corners of each image are extracted.
The reference image is determined by comparing these coordinates. As presented in Figure 14, images A, B, C, and D are simultaneously aligned with the reference image (marked in green) according to a general registration strategy, as there is overlap between the images. Unlike frame-to-frame approaches, this technique needs to specify only the reference image, and the intermediate results are not output and input many times, which saves memory and improves computational efficiency. From our point of view, this is smart registration, which is particularly useful for WFV-image generation. However, when images overlap, a more intelligent approach needs to be developed. Moreover, images to be registered may have small overlapping areas. This presents a challenge for high-accuracy alignment because only a small number of geometric and intensity features is available for constructing the transformation model. This problem should be solved intelligently to register images with a low ratio of overlapping regions. Typically, such images are used to produce WFV images by means of stitching. Further solutions should be provided in the future.

FIGURE 14. The spatial position of multiple images to be registered.

Therefore, the large-scale, complex distortion of high-resolution, heterogeneous, and cross-scale remote sensing images must be a focus of future research. In this situation, the traditional single-registration approach may not meet requirements. For real-time, high-precision registration, a combination of alignment approaches and high-performance computing is considered very promising.

CONCLUSIONS
In this article, we presented a comprehensive and quantitative summary of intensity-based, feature-based, and combined approaches to remote sensing image registration. Conventional methods and new applications of deep learning and optical
flow techniques were included. The performance of registration software packages and tools was analyzed. Additionally, novel registration evaluations were presented to support an effective assessment. The development of any approach aims to improve registration accuracy as much as possible because registration is an important preprocessing step for remote sensing images. Several such techniques have been developed, as recounted in this article. However, as resolutions increase, the problem of inconsistent local distortion caused by high-rise buildings and topographic relief has become apparent; this cannot be exactly described by a transformation model. Moreover, WFV images are an emerging trend in satellite image production, enabling a whole ROI to be contained within one image. This poses a challenge for real-time registration and for the memory required by registration processing. Therefore, we believe that future research on remote sensing image registration will use accelerated registration, combined approaches for remote sensing image registration, heterogeneous and cross-scale image registration, and smart registration. Challenges remain, and considerable additional research is required. We perform this research with the advantage of lower entrance barriers than TDOM generation.

ACKNOWLEDGMENTS
The work was supported by the National Natural Science Foundation of China (grants 41971303 and 41701394), the Key Research and Development Program of Shaanxi Province (grant 2020NY-166), and the Fundamental Research Funds for the Central Universities (grant GK202103143). The authors thank the editor-in-chief and associate editor of IEEE Geoscience and Remote Sensing Magazine as well as four anonymous reviewers for their advice on strengthening the manuscript.

AUTHOR INFORMATION
Ruitao Feng (feng-rt@snnu.edu.cn) is with the School of Geography and Tourism, Shaanxi Normal University, Xi'an, 710062, China.
Huanfeng Shen (shenhf@whu.edu.cn) is with the School of Resource and Environment Science and the Collaborative Innovation Center for Geospatial Technology, Wuhan University, Wuhan, 430072, China. He is a Senior Member of IEEE.
Jianjun Bai (bjj@snnu.edu.cn) is with the School of Geography and Tourism, Shaanxi Normal University, Xi'an, 710062, China.
Xinghua Li (lixinghua5540@whu.edu.cn) is with the School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, 430072, China. He is a Senior Member of IEEE.

REFERENCES
[1] A. Wong and D. A. Clausi, "ARRSI: Automatic registration of remote-sensing images," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 5, pp. 1483–1493, 2007. doi: 10.1109/TGRS.2007.892601.
[2] X. Li, N. Hui, H. Shen, Y. Fu, and L. Zhang, "A robust mosaicking procedure for high spatial resolution remote sensing images," ISPRS J. Photogram. Remote Sens., vol. 109, pp. 108–125, Nov. 2015. doi: 10.1016/j.isprsjprs.2015.09.009.
[3] H. Shen, X. Meng, and L. Zhang, "An integrated framework for the spatio-temporal-spectral fusion of remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7135–7148, 2016. doi: 10.1109/TGRS.2016.2596290.
[4] Y. Lu, P. Wu, X. Ma, and X. Li, "Detection and prediction of land use/land cover change using spatiotemporal data fusion and the Cellular Automata–Markov model," Environ. Monitoring Assessment, vol. 191, no. 2, p. 68, 2019. doi: 10.1007/s10661-019-7200-2.
[5] Z. Lv, T. Liu, C. Shi, J. A. Benediktsson, and H. Du, "Novel land cover change detection method based on k-means clustering and adaptive majority voting using bitemporal remote sensing images," IEEE Access, vol. 7, pp. 34,425–34,437, Jan. 2019. doi: 10.1109/ACCESS.2019.2892648.
[6] C. Yuan, F. Wang, S. Wang, and Y. Zhou, "Accuracy evaluation of flood monitoring based on multiscale remote sensing for different landscapes," Geomatics, Natural Hazards Risk, vol. 10, no. 1, pp. 1389–1411, 2019. doi: 10.1080/19475705.2019.1580224.
[7] L. Yang and G. Cervone, "Analysis of remote sensing imagery for disaster assessment using deep learning: A case study of flooding event," Soft Comput., vol. 23, no. 24, pp. 13,393–13,408, 2019. doi: 10.1007/s00500-019-03878-8.
[8] K. Barbieux, "Pushbroom hyperspectral data orientation by combining feature-based and area-based co-registration techniques," Remote Sens., vol. 10, no. 4, p. 645, 2018. doi: 10.3390/rs10040645.
[9] Y. Jiang, J. Wang, L. Zhang, G. Zhang, X. Li, and J. Wu, "Geometric processing and accuracy verification of Zhuhai-1 hyperspectral satellites," Remote Sens., vol. 11, no. 9, p. 996, 2019. doi: 10.3390/rs11090996.
[10] I. Aicardi, F. Nex, M. Gerke, and A. M. Lingua, "An image-based approach for the co-registration of multi-temporal UAV image datasets," Remote Sens., vol. 8, no. 9, p. 779, 2016. doi: 10.3390/rs8090779.
[11] F. P. M. Oliveira and J. M. R. S. Tavares, "Medical image registration: A review," Comput. Methods Biomech. Biomed. Eng., vol. 17, no. 2, pp. 73–93, 2014. doi: 10.1080/10255842.2012.670855.
[12] A. Sotiras, C. Davatzikos, and N. Paragios, "Deformable medical image registration: A survey," IEEE Trans. Med. Imag., vol. 32, no. 7, pp. 1153–1190, 2013. doi: 10.1109/TMI.2013.2265603.
[13] M. A. Viergever, J. B. A. Maintz, S. Klein, K. Murphy, M. Staring, and J. P. W. Pluim, "A survey of medical image registration," Med. Image Anal., vol. 33, pp. 140–144, Oct. 2016. doi: 10.1016/j.media.2016.06.030.
[14] G. Haskins, U. Kruger, and P. Yan, "Deep learning in medical image registration: A survey," Mach. Vis. Appl., vol. 31, nos. 1–2, p. 8, 2020. doi: 10.1007/s00138-020-01060-x.
[15] L. G. Brown, "A survey of image registration techniques," ACM Comput. Surv., vol. 24, no. 4, pp. 325–376, 1992. doi: 10.1145/146370.146374.
[16] B. Zitová and J. Flusser, "Image registration methods: A survey," Image Vis. Comput., vol. 21, no. 11, pp. 977–1000, 2003. doi: 10.1016/S0262-8856(03)00137-9.
[17] M. Deshmukh and U. Bhosle, "A survey of image registration," Int. J. Image Process., vol. 5, no. 3, p. 245, 2011.
[18] Z. Xiong and Y. Zhang, “A critical review of image registration methods,” Int. J. Image Data Fusion, vol. 1, no. 2, pp. 137–158, 2010. doi: 10.1080/19479831003802790. [19] M. V. Wyawahare, P. M. Patil, and H. K. Abhyankar, “Image registration techniques: An overview,” Int. J. Signal Process., Image Process. Pattern Recognit., vol. 2, no. 3, pp. 11–28, 2009. [20] C. Dalmiya and V. Dharun, “A survey of registration techniques in remote sensing images,” Indian J. Sci. Technol., vol. 8, no. 26, pp. 1–7, 2015. doi: 10.17485/ijst/2015/v8i26/81048. [21] R. M. Ezzeldeen, H. H. Ramadan, T. M. Nazmy, M. A. Yehia, and M. S. Abdel-Wahab, “Comparative study for image registration techniques of remote sensing images,” Egyptian J. Remote Sens. Space Sci., vol. 13, no. 1, pp. 31–36, 2010. doi: 10.1016/j. ejrs.2010.07.004. [22] M. P. S. Tondewad and M. M. P. Dale, “Remote sensing image registration methodology: Review and discussion,” Proc. Comput. Sci., vol. 171, pp. 2390–2399, June 2020. doi: 10.1016/j. procs.2020.04.259. [23] P. E. Anuta, “Spatial registration of multispectral and multitemporal digital imagery using fast Fourier transform techniques,” IEEE Trans. Geosci. Electron., vol. 8, no. 4, pp. 353–368, 1970. doi: 10.1109/TGE.1970.271435. [24] X. Xu, X. Li, X. Liu, H. Shen, and Q. Shi, “Multimodal registration of remotely sensed images based on Jeffrey’s divergence,” ISPRS J. Photogram. Remote Sens., vol. 122, pp. 97–115, Dec. 2016. doi: 10.1016/j.isprsjprs.2016.10.005. [25] J. Ma, H. Zhou, J. Zhao, Y. Gao, J. Jiang, and J. Tian, “Robust feature matching for remote sensing image registration via locally linear transforming,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 12, pp. 6469–6481, 2015. doi: 10.1109/ TGRS.2015.2441954. [26] N. Hanaizumi and S. Fujimur, “An automated method for registration of satellite remote sensing images,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 1993, pp. 1348–1350. doi: 10.1109/IGARSS.1993.322087. [27] W. F. Webber, “Techniques for image registration,” in Proc. LARS Symp., West Lafayette, IN, 1973, pp. 1–7. [28] D. I. Barnea and H. F. Silverman, “A class of algorithms for fast digital image registration,” IEEE Trans. Comput., vol. C-21, no. 2, pp. 179–186, 1972. doi: 10.1109/TC.1972.5008923. [29] S. i. Kaneko, Y. Satoh, and S. Igarashi, “Using selective correlation coefficient for robust image registration,” Pattern Recognit., vol. 36, no. 5, pp. 1165–1173, 2003. doi: 10.1016/S00313203(02)00081-X. [30] H. Gonçalves, J. A. Gonçalves, L. Corte-Real, and A. C. Teodoro, “CHAIR: Automatic image registration based on correlation and Hough transform,” Int. J. Remote Sens., vol. 33, no. 24, pp. 7936– 7968, Dec. 20, 2012. doi: 10.1080/01431161.2012.701345. [31] J. Inglada and A. Giros, “On the possibility of automatic multisensor image registration,” IEEE Trans. Geosci. Remote Sens., vol. 42, no. 10, pp. 2104–2120, 2004. doi: 10.1109/TGRS.2004. 835294. [32] J. Ma, J. C. Chan, and F. Canters, “Fully automatic subpixel image registration of multiangle CHRIS/Proba data,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 7, pp. 2829–2839, 2010. doi: 10.1109/TGRS.2010.2042813. DECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE [33] Y. Wu, W. Ma, Q. Su, S. Liu, and Y. Ge, “Remote sensing image registration based on local structural information and global constraint,” J. Appl. Remote Sens., vol. 13, no. 1, p. 1, 2019. doi: 10.1117/1.JRS.13.016518. [34] G. Wolberg and S. Zokai, “Image registration for perspective deformation recovery,” in Proc. 
SPIE, Automatic Target Recognit. X, Orlando, FL, 2000, vol. 4050, pp. 259–270. [35] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge, “Comparing images using the Hausdorff distance,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 9, pp. 850–863, 1993. doi: 10.1109/34.232073. [36] Y. Ye and J. Shan, “A local descriptor based registration method for multispectral remote sensing images with non-linear intensity differences,” ISPRS J. Photogram. Remote Sens., vol. 90, pp. 83–95, 2014. doi: 10.1016/j.isprsjprs.2014.01.009. [37] Y. Hel-Or, H. Hel-Or, and E. David, “Fast template matching in non-linear tone-mapped images,” in Proc. Int. Conf. Comput. Vision (ICCV), Barcelona, Spain, 2011, pp. 1355–1362. doi: 10.1109/ICCV.2011.6126389. [38] Y. Bentoutou, N. Taleb, K. Kpalma, and J. Ronsin, “An automatic image registration for applications in remote sensing,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 9, pp. 2127–2137, 2005. doi: 10.1109/TGRS.2005.853187. [39] K. Taejung and I. Yong-Jo, “Automatic satellite image registration by combination of matching and random sample consensus,” IEEE Trans. Geosci. Remote Sens., vol. 41, no. 5, pp. 1111– 1117, 2003. doi: 10.1109/TGRS.2003.811994. [40] J. P. Kern and M. S. Pattichis, “Robust multispectral image registration using mutual-information models,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 5, pp. 1494–1505, 2007. doi: 10.1109/ TGRS.2007.892599. [41] H. m. Chen, M. K. Arora, and P. K. Varshney, “Mutual information-based image registration for remote sensing data,” Int. J. Remote Sens., vol. 24, no. 18, pp. 3701–3706, 2003. doi: 10.1080/0143116031000117047. [42] A. A. Cole-Rhodes, K. L. Johnson, J. LeMoigne, and I. Zavorin, “Multiresolution registration of remote sensing imagery by optimization of mutual information using a stochastic gradient,” IEEE Trans. Image Process., vol. 12, no. 12, pp. 1495–1511, 2003. doi: 10.1109/TIP.2003.819237. [43] D. Brunner, G. Lemoine, and L. Bruzzone, “Earthquake Damage assessment of buildings using VHR optical and SAR imagery,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 5, pp. 2403– 2420, 2010. doi: 10.1109/TGRS.2009.2038274. [44] X. Wang, W. Yang, A. Wheaton, N. Cooley, and B. Moran, “Efficient registration of optical and IR images for automatic plant water stress assessment,” Comput. Electron. Agriculture, vol. 74, no. 2, pp. 230–237, 2010. doi: 10.1016/j.compag.2010. 08.004. [45] S. Chen, X. Li, L. Zhao, and H. Yang, “Medium-low resolution multisource remote sensing image registration based on SIFT and robust regional mutual information,” Int. J. Remote Sens., vol. 39, no. 10, pp. 3215–3242, 2018. doi: 10.1080/01431161. 2018.1437295. [46] L. Y. Zhao, B. Y. Lü, X. R. Li, and S. H. Chen, “Multi-source remote sensing image registration based on scale-invariant 137
[47] [48] [49] [50] [51] [52] [53] [54] [55] [56] [57] [58] [59] [60] 138 feature transform and optimization of regional mutual information,” Acta Phys. Sin., vol. 64, no. 12, pp. 124204, 1-11), 2015. G. Hermosillo, C. Chefd’Hotel, and O. Faugeras, “Variational methods for multimodal image matching,” Int. J. Comput. Vis., vol. 50, no. 3, pp. 329–343, Dec. 1, 2002. doi: 10.1023/ A:1020830525823. R. N. Bracewell and R. N. Bracewell, The Fourier Transform and Its Applications. New York: McGraw-Hill, 1986. H. Foroosh, J. B. Zerubia, and M. Berthod, “Extension of phase correlation to subpixel registration,” IEEE Trans. Image Process., vol. 11, no. 3, pp. 188–200, Mar. 2002. doi: 10.1109/83.988953. X. Wan, J. G. Liu, and H. Yan, “The illumination robustness of phase correlation for image alignment,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 10, pp. 5746–5759, 2015. doi: 10.1109/ TGRS.2015.2429740. X. Wan, J. Liu, H. Yan, and G. L. K. Morgan, “Illumination-invariant image matching for autonomous UAV localisation based on optical sensing,” ISPRS J. Photogram. Remote Sens., vol. 119, pp. 198–213, Sept. 2016. doi: 10.1016/j.isprsjprs.2016.05.016. Z. Liu, L. Wang, X. Wang, X. Shen, and L. Li, “Secure remote sensing image registration based on compressed sensing in cloud setting,” IEEE Access, vol. 7, pp. 36,516–36,526, Mar. 2019. doi: 10.1109/ACCESS.2019.2903826. M. Xu and P. K. Varshney, “A subspace method for Fourierbased image registration,” IEEE Geosci. Remote Sens. Lett., vol. 6, no. 3, pp. 491–494, 2009. doi: 10.1109/LGRS.2009.2018705. L. Lucchese, S. Leorin, and G. M. Cortelazzo, “Estimation of two-dimensional affine transformations through polar curve matching and its application to image mosaicking and remotesensing data registration,” IEEE Trans. Image Process., vol. 15, no. 10, pp. 3008–3019, 2006. doi: 10.1109/TIP.2006.877519. P. Bao and D. Xu, “Complex wavelet-based image mosaics using edge-preserving visual perception modeling,” Comput. Graph., vol. 23, no. 3, pp. 309–321, 1999. doi: 10.1016/S00978493(99)00040-0. H. Gang and Z. Yun, “Combination of feature-based and area-based image registration technique for high resolution remote sensing image,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2007, pp. 377–380. doi: 10.1109/ IGARSS.2007.4422809. Y. Ye, J. Shan, L. Bruzzone, and L. Shen, “Robust registration of multimodal remote sensing images based on structural similarity,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2941–2958, 2017. doi: 10.1109/TGRS.2017.2656380. H. Yang, X. Li, L. Zhao, and S. Chen, “A novel coarse-to-fine scheme for remote sensing image registration based on SIFT and phase correlation,” Remote Sens., vol. 11, no. 15, p. 1833, 2019. doi: 10.3390/rs11151833. Y. Han, F. Bovolo, and L. Bruzzone, “An approach to fine coregistration between very high resolution multispectral images based on registration noise distribution,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 12, pp. 6650–6662, 2015. doi: 10.1109/ TGRS.2015.2445632. A. Plyer, E. Colin-Koeniguer, and F. Weissgerber, “A new coregistration algorithm for recent applications on urban [61] [62] [63] [64] [65] [66] [67] [68] [69] [70] [71] [72] [73] [74] SAR images,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 11, pp. 2198–2202, 2015. doi: 10.1109/LGRS.2015.2455071. G. Brigot, E. Colin-Koeniguer, A. Plyer, and F. Janez, “Adaptation and evaluation of an optical flow method applied to coregistration of forest remote sensing images,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 
9, no. 7, pp. 2923–2939, 2016. doi: 10.1109/JSTARS.2016. 2578362. R. Feng, X. Li, and H. Shen, “Mountainous remote sensing images registration based on improved optical flow estimation,” ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci., vol. IV-2/ W5, pp. 479–484, June 2019. doi: 10.5194/isprs-annals-IV2-W5-479-2019. B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artif. Intell., vol. 17, nos. 1–3, pp. 185–203, 1981. doi: 10.1016/0004-3702(81)90024-2. J. J. Gibson, The Perception of the Visual World. Oxford: Houghton Mifflin, 1950. B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in Proc. Imag. Understanding Workshop, 1981, pp. 121–130. Z. Tu et al., “A survey of variational and CNN-based optical flow techniques,” Signal Processing: Image Commun., vol. 72, pp. 9–24, Mar. 2019. doi: 10.1016/j.image.2018.12.002. M. J. Black and P. Anandan, “The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields,” Comput. Vision Image Understand., vol. 63, no. 1, pp. 75–104, 1996. doi: 10.1006/cviu.1996.0006. C. Liu, J. Yuen, and A. Torralba, “SIFT Flow: Dense correspondence across scenes and its applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 978–994, 2011. doi: 10.1109/TPAMI.2010.147. T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy optical flow estimation based on a theory for warping,” in Proc. Eur. Conf. Comput. Vision (ECCV), 2004, pp. 25–36. doi: 10.1007/978-3-540-24673-2_3. J.-Y. Xiong, Y.-P. Luo, and G.-R. Tang, “An improved optical flow method for image registration with large-scale movements,” Acta Autom. Sin., vol. 34, no. 7, pp. 760–764, 2008. doi: 10.3724/SP.J.1004.2008.00760. A. Plyer, G. Le Besnerais, and F. Champagnat, “Massively parallel Lucas Kanade optical flow for real-time video processing applications,” J. Real-Time Image Process., vol. 11, no. 4, pp. 713– 730, 2016. doi: 10.1007/s11554-014-0423-0. Y. Xiang, F. Wang, L. Wan, N. Jiao, and H. You, “OS-Flow: A robust algorithm for dense optical and SAR image registration,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 9, pp. 1–20, 2019. doi: 10.1109/TGRS.2019.2905585. C. Huo, C. Pan, L. Huo, and Z. Zhou, “Multilevel SIFT matching for large-size VHR image registration,” IEEE Geosci. Remote Sens. Lett., vol. 9, no. 2, pp. 171–175, 2012. doi: 10.1109/ LGRS.2011.2163491. L. Yu, D. Zhang, and E.-J. Holden, “A fast and fully automatic registration approach based on point features for multisource remote-sensing images,” Comput. Geosci., vol. 34, no. 7, pp. 838–848, 2008. doi: 10.1016/j.cageo.2007.10.005. IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
[75] L. Hui, B. S. Manjunath, and S. K. Mitra, “A contour-based approach to multisensor image registration,” IEEE Trans. Image Process., vol. 4, no. 3, pp. 320–334, 1995. doi: 10.1109/83.366480. [76] H. Goncalves, L. Corte-Real, and J. A. Goncalves, “Automatic image registration through image segmentation and SIFT,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 7, pp. 2589–2600, 2011. doi: 10.1109/TGRS.2011.2109389. [77] H. P. Moravec, “Techniques towards automatic visual obstacle avoidance,” no. 2, p. 584, 1977. [Online]. Available: https://frc .ri.cmu.edu/~hpm/project.archive/robot.papers/1977/aip.txt [78] C. Harris and M. Stephens, “A combined corner and edge detector,” in Proc. Alvey Vision Conf., Manchester, U.K., 1988, vol. 15, pp. 147–151. [79] Y. Xiang, F. Wang, and H. You, “OS-SIFT: A robust SIFT-like algorithm for high-resolution optical-to-SAR image registration in suburban areas,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 6, pp. 3078–3090, 2018. doi: 10.1109/TGRS.2018.2790483. [80] I. Misra, S. M. Moorthi, D. Dhar, and R. Ramakrishnan, “An automatic satellite image registration technique based on Harris corner detection and Random Sample Consensus (RANSAC) outlier rejection model,” in 1st Int. Conf. on Recent Advances in Information Technology (RAIT), 2012, pp. 68–73. [81] S. M. Smith and J. M. Brady, “SUSAN–A new approach to low level image processing,” Int. J. Comput. Vis., vol. 23, no. 1, pp. 45–78, 1997. doi: 10.1023/A:1007963824710. [82] C. Leng, H. Zhang, B. Li, G. Cai, Z. Pei, and L. He, “Local feature descriptor for image matching: A survey,” IEEE Access, vol. 7, pp. 6424–6434, 2019. doi: 10.1109/ACCESS.2018.2888856. [83] W. He and X. Deng, “A modified SUSAN corner detection algorithm based on adaptive gradient threshold for remote sensing image,” in Proc. Int. Conf. Optoelectron. Image Process., 2010, vol. 1, pp. 40–43. [84] R. Feng, X. Li, W. Zou, and H. Shen, “Registration of multitemporal GF-1 remote sensing images with weighting perspective transformation model,” in Proc. IEEE Int. Conf. Image Process. (ICIP), 2017, pp. 2264–2268. doi: 10.1109/ ICIP.2017.8296685. [85] R. Feng, Q. Du, X. Li, and H. Shen, “Robust registration for remote sensing images by combining and localizing feature- and area-based methods,” ISPRS J. Photogram. Remote Sens., vol. 151, pp. 15–26, May 2019. doi: 10.1016/j.isprsjprs.2019.03.002. [86] Y. Duan, X. Huang, J. Xiong, Y. Zhang, and B. Wang, “A combined image matching method for Chinese optical satellite imagery,” Int. J. Digital Earth, vol. 9, no. 9, pp. 851–872, 2016. doi: 10.1080/17538947.2016.1151955. [87] P. K. Konugurthi, R. Kune, R. Nooka, and V. Sarma, “Autonomous ortho-rectification of very high resolution imagery using SIFT and genetic algorithm,” Photogram. Eng. Remote Sens., vol. 82, no. 5, pp. 377–388, 2016. doi: 10.14358/PERS.82.5.377. [88] Q. Li, G. Wang, J. Liu, and S. Chen, “Robust scale-invariant feature matching for remote sensing image registration,” IEEE Geosci. Remote Sens. Lett., vol. 6, no. 2, pp. 287–291, 2009. doi: 10.1109/LGRS.2008.2011751. [89] W. Ma et al., “Remote sensing image registration with modified SIFT and enhanced feature matching,” IEEE Geosci. ReDECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE mote Sens. Lett., vol. 14, no. 1, pp. 3–7, 2017. doi: 10.1109/ LGRS.2016.2600858. [90] M. Gong, S. Zhao, L. Jiao, D. Tian, and S. Wang, “A novel coarse-to-fine scheme for automatic image registration based on SIFT and mutual information,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 7, pp. 
4328–4338, 2014. doi: 10.1109/ TGRS.2013.2281391. [91] C. A. Lee, S. D. Gasster, A. Plaza, C. Chang, and B. Huang, “Recent developments in high performance computing for remote sensing: A review,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 4, no. 3, pp. 508–527, 2011. doi: 10.1109/ JSTARS.2011.2162643. [92] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004. doi: 10.1023/B:VISI.0000029664.99615.94. [93] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” (in English), IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1615–1630, Oct. 2005. doi: 10.1109/ TPAMI.2005.188. [94] K. Yan and R. Sukthankar, “PCA-SIFT: A more distinctive representation for local image descriptors,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit. (CVPR), Washington, D. C., 2004, vol. 2, pp. 506–513. doi: 10.1109/CVPR.2004.1315206. [95] Y. Zheng, Z. Cao, and Y. Xiao, “Multi-spectral remote image registration based on SIFT,” Electron. Lett., vol. 44, no. 2, pp. 107–108, 2008. [96] J. Morel and G. Yu, “ASIFT: A new framework for fully affine invariant image comparison,” SIAM J. Imag. Sci., vol. 2, no. 2, pp. 438–469, 2009. doi: 10.1137/080732730. [97] A. Sedaghat, M. Mokhtarzade, and H. Ebadi, “Uniform robust scale-invariant feature matching for optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 11, pp. 4516–4527, 2011. doi: 10.1109/TGRS.2011.2144607. [98] A. Sedaghat and H. Ebadi, “Distinctive order based self-similarity descriptor for multi-sensor remote sensing image matching,” ISPRS J. Photogram. Remote Sens., vol. 108, pp. 62–71, Oct. 2015. doi: 10.1016/j.isprsjprs.2015.06.003. [99] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust features,” in Proc. Eur. Conf. Comput. Vision (ECCV), Graz, Austria, 2006, pp. 404–417. [100] W. Yan, H. She, and Z. Yuan, “Robust registration of remote sensing image based on SURF and KCCA,” J. Indian Soc. Remote Sens., vol. 42, no. 2, pp. 291–299, 2014. doi: 10.1007/s12524-013-0324-x. [101] X. Yuan, S. Chen, W. Yuan, and Y. Cai, “Poor textural image tie point matching via graph theory,” ISPRS J. Photogram. Remote Sens., vol. 129, pp. 21–31, July 2017. doi: 10.1016/j.isprsjprs.2017.04.015. [102] R. Bouchiha and K. Besbes, “Automatic remote-sensing image registration using SURF,” Int. J. Comput. Theory Eng., vol. 5, no. 1, pp. 88–92, 2013. doi: 10.7763/IJCTE.2013.V5.653. [103] J. Chen et al., “WLD: A robust local image descriptor,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1705–1720, 2010. doi: 10.1109/TPAMI.2009.155. [104] E. Rosten and T. Drummond, “Machine learning for highspeed corner detection,” in Proc. Eur. Conf. Comput. Vision (ECCV), Graz, Austria, 2006, pp. 430–443. 139
[105] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary robust independent elementary features,” in Proc. Eur. Conf. Comput. Vision (ECCV), Crete, Greece, 2010, pp. 778–792. [106] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in Proc. Int. Conf. Comput. Vision (ICCV), Barcelona, Spain, 2011, pp. 2564–2571. [107] D. Ma and H. Lai, “Remote sensing image matching based improved ORB in NSCT domain,” J. Indian soc. Remote Sens., vol. 47, no. 5, pp. 801–807, 2019. doi: 10.1007/s12524-019-00958-y. [108] P. F. Alcantarilla, A. Bartoli, and A. J. Davison, “KAZE Features,” in Proc. Eur. Conf. Comput. Vision (ECCV), Florence, Italy, 2012, pp. 214–227. [109] P. Alcantarilla, J. Nuevo, and A. Bartoli, “Fast explicit diffusion for accelerated features in nonlinear scale spaces,” in Proc. Brit. Mach. Vision Conf. (BMVC), Bristol, U.K., 2013, pp. 1–11. [110] Y. Ye, M. Wang, S. Hao, and Q. Zhu, “A novel keypoint detector combining corners and blobs for remote sensing image registration,” IEEE Geosci. Remote Sens. Lett., vol. 18, no. 3, pp. 451–455, Mar. 31, 2020. doi: 10.1109/LGRS.2020. 2980620. [111] X. Liu, Y. Ai, J. Zhang, and Z. Wang, “A novel affine and contrast invariant descriptor for infrared and visible image registration,” Remote Sens., vol. 10, no. 4, p. 658, 2018. doi: 10.3390/ rs10040658. [112] Z. Ye et al., “Robust fine registration of multisensor remote sensing images based on enhanced subpixel phase correlation,” Sensors, vol. 20, no. 15, p. 4338, Aug. 4, 2020. doi: 10.3390/ s20154338. [113] Y. C. Hsieh, D. M. McKeown, and F. P. Perlant, “Performance evaluation of scene registration and stereo matching for cartographic feature extraction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 2, pp. 214–238, 1992. doi: 10.1109/34.121790. [114] S. Dongseok, J. K. Pollard, and J. Muller, “Accurate geometric correction of ATSR images,” IEEE Trans. Geosci. Remote Sens., vol. 35, no. 4, pp. 997–1006, 1997. doi: 10.1109/36.602542. [115] J. Inglada and F. Adragna, “Automatic multi-sensor image registration by edge matching using genetic algorithms,” in Proc. Int. Geosci. Remote Sens. Symp. (IGARSS), Sydney, NSW, Australia, 2001, vol. 5, pp. 2313–2315. [116] W. Shi and A. Shaker, “The Line‐Based Transformation Model (LBTM) for image‐to‐image registration of high‐resolution satellite image data,” Int. J. Remote Sens., vol. 27, no. 14, pp. 3001– 3012, 2006. doi: 10.1080/01431160500486716. [117] T.-Z. Xiang, G.-S. Xia, X. Bai, and L. Zhang, “Image stitching by line-guided local warping with global similarity constraint,” Pattern Recognit., vol. 83, pp. 481–497, Nov. 2018. doi: 10.1016/j. patcog.2018.06.013. [118] C. Zhao and A. A. Goshtasby, “Registration of multitemporal aerial optical images using line features,” ISPRS J. Photogram. Remote Sens., vol. 117, pp. 149–160, July 2016. doi: 10.1016/j. isprsjprs.2016.04.002. [119] C. Li and W. Shi, “The generalized-line-based iterative transformation model for imagery registration and rectification,” IEEE Geosci. Remote Sens. Lett., vol. 11, no. 8, pp. 1394–1398, 2014. doi: 10.1109/LGRS.2013.2293844. 140 [120] A. O. Ok, J. D. Wegner, C. Heipke, F. Rottensteiner, U. Soergel, and V. Toprak, “Matching of straight line segments from aerial stereo images of urban areas,” ISPRS J. Photogram. Remote Sens., vol. 74, pp. 133–152, Nov. 2012. doi: 10.1016/j.isprsjprs.2012.09.003. [121] J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 
6, pp. 679– 698, 1986. doi: 10.1109/TPAMI.1986.4767851. [122] D. Marr and E. Hildreth, “Theory of edge detection,” Proc. Roy. Soc. Ser. B-Biol. Sci., vol. 207, no. 1167, pp. 187–217, 1980. doi: 10.1098/rspb.1980.0020. [123] R. G. v. Gioi, J. Jakubowicz, J. Morel, and G. Randall, “LSD: A fast line segment detector with a false detection control,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 4, pp. 722–732, 2010. doi: 10.1109/TPAMI.2008.300. [124] C. Akinlar and C. Topal, “EDLines: A real-time line segment detector with a false detection control,” Pattern Recog. Lett., vol. 32, no. 13, pp. 1633–1642, 2011. doi: 10.1016/j.patrec.2011.06.001. [125] A. Goshtasby and G. C. Stockman, “Point pattern matching using convex hull edges,” IEEE Trans. Syst., Man, Cybern., vol. SMC15, no. 5, pp. 631–637, 1985. doi: 10.1109/TSMC.1985.6313439. [126] W. Dorigo, M. Hollaus, W. Wagner, and K. Schadauer, “An application-oriented automated approach for co-registration of forest inventory and airborne laser scanning data,” Int. J. Remote Sens., vol. 31, no. 5, pp. 1133–1153, 2010. doi: 10.1080/01431160903380581. [127] B. Sirmacek and C. Unsalan, “Urban-area and building detection using SIFT keypoints and graph theory,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 4, pp. 1156–1167, 2009. doi: 10.1109/ TGRS.2008.2008440. [128] J. Flusser and T. Suk, “A moment-based approach to registration of images with affine geometric distortion,” IEEE Trans. Geosci. Remote Sens., vol. 32, no. 2, pp. 382–387, 1994. doi: 10.1109/36.295052. [129] N. R. Pal and S. K. Pal, “A review on image segmentation techniques,” Pattern Recognit., vol. 26, no. 9, pp. 1277–1294, 1993. doi: 10.1016/0031-3203(93)90135-J. [130] D. Xiaolong and S. Khorram, “Development of a feature-based approach to automated image registration for multitemporal and multisensor remotely sensed imagery,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. Proc. Remote Sens.-A Sci. Vision Sustainable Develop. (IGARSS), 1997, vol. 1, pp. 243–245. [131] L. M. Fonseca and B. Manjunath, “Registration techniques for multisensor remotely sensed imagery,” Photogram. Eng. Remote Sensing (PERS), vol. 62, no. 9, pp. 1049–1056, 1996. [132] A. Goshtasby, G. C. Stockman, and C. V. Page, “A region-based approach to digital image registration with subpixel accuracy,” IEEE Trans. Geosci. Remote Sens., vol. GE-24, no. 3, pp. 390–399, 1986. doi: 10.1109/TGRS.1986.289597. [133] J. Ton and A. K. Jain, “Registering Landsat images by point matching,” IEEE Trans. Geosci. Remote Sens., vol. 27, no. 5, pp. 642–651, 1989. doi: 10.1109/TGRS.1989.35948. [134] Y. Chen, X. Zhang, Y. Zhang, S. J. Maybank, and Z. Fu, “Visible and infrared image registration based on region features and edginess,” Mach. Vis. Appl., vol. 29, no. 1, pp. 113–123, 2018. doi: 10.1007/s00138-017-0879-6. IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
[135] A. Irani Rahaghi, U. Lemmin, D. Sage, and D. A. Barry, “Achieving high-resolution thermal imagery in low-contrast lake surface waters by aerial remote sensing and image registration,” Remote Sens. Environ., vol. 221, pp. 773–783, Feb. 2019. doi: 10.1016/j.rse.2018.12.018. [136] A. Li, X. Cheng, H. Guan, T. Feng, and Z. Guan, “Novel image registration method based on local structure constraints,” IEEE Geosci. Remote Sens. Lett., vol. 11, no. 9, pp. 1584–1588, 2014. doi: 10.1109/LGRS.2014.2305982. [137] S. Jiang and W. Jiang, “Hierarchical motion consistency constraint for efficient geometrical verification in UAV stereo image matching,” ISPRS J. Photogram. Remote Sens., vol. 142, pp. 222–242, Aug. 2018. doi: 10.1016/j.isprsjprs.2018. 06.009. [138] J. S. Beis and D. G. Lowe, “Shape indexing using approximate nearest-neighbour search in high-dimensional spaces,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), 1997, vol. 97, pp. 1000–1006. doi: 10.1109/ CVPR.1997.609451. [139] Y. Ma et al., “Remote sensing big data computing: Challenges and opportunities,” Future Gener. Comput. Syst., vol. 51, pp. 47– 60, Oct. 2015. doi: 10.1016/j.future.2014.10.029. [140] G. Stockman, S. Kopstein, and S. Benett, “Matching images to models for registration and object detection via clustering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-4, no. 3, pp. 229–241, 1982. doi: 10.1109/TPAMI.1982.4767240. [141] G. Borgefors, “Hierarchical chamfer matching: A parametric edge matching algorithm,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 10, no. 6, pp. 849–865, 1988. doi: 10.1109/ 34.9107. [142] L. Livi and A. Rizzi, “The graph matching problem,” Pattern Anal. Appl., vol. 16, no. 3, pp. 253–283, 2013. doi: 10.1007/ s10044-012-0284-8. [143] L. Torresani, V. Kolmogorov, and C. Rother, “Feature correspondence via graph matching: models and global optimization,” in Proc. Eur. Conf. Comput. Vision (ECCV), Berlin, Heidelberg, 2008, pp. 596–609. [144] Z. Liu, J. An, and Y. Jing, “A simple and robust feature point matching algorithm based on restricted spatial order constraints for aerial image registration,” IEEE Trans. Geosci. Remote Sens., vol. 50, no. 2, pp. 514–527, 2012. doi: 10.1109/ TGRS.2011.2160645. [145] Z. Xiong and Y. Zhang, “A novel interest-point-matching algorithm for high-resolution satellite images,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 12, pp. 4189–4200, 2009. doi: 10.1109/ TGRS.2009.2023794. [146] H. Chang, G. Wu, and M. Chiang, “Remote sensing image registration based on modified SIFT and feature slope grouping,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 9, pp. 1363–1367, 2019. doi: 10.1109/LGRS.2019.2899123. [147] S. Zhili and Z. Jiaqi, “Image registration approach with scaleinvariant feature transform algorithm and tangent-crossingpoint feature,” J. Electron. Imag., vol. 29, no. 2, pp. 1–14, Mar. 2020. doi: 10.1117/1.JEI.29.2.023010. [148] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analyDECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE sis and automated cartography,” Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981. doi: 10.1145/358669.358692. [149] K. Zhang, X. Li, and J. Zhang, “A robust point-matching algorithm for remote sensing image registration,” IEEE Geosci. Remote Sens. Lett., vol. 11, no. 2, pp. 469–473, 2014. doi: 10.1109/ LGRS.2013.2267771. [150] Y. Wu, W. Ma, M. Gong, L. Su, and L. 
Jiao, “A novel pointmatching algorithm based on fast sample consensus for image registration,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 1, pp. 43–47, 2015. doi: 10.1109/LGRS.2014.2325970. [151] B. Li and H. Ye, “RSCJ: Robust sample consensus judging algorithm for remote sensing image registration,” IEEE Geosci. Remote Sens. Lett., vol. 9, no. 4, pp. 574–578, 2012. doi: 10.1109/ LGRS.2011.2175434. [152] H. Zhang et al., “Remote sensing image registration based on local affine constraint with circle descriptor,” IEEE Geosci. Remote Sens. Lett., early access, 2020. doi: 10.1109/ LGRS.2020.3027096. [153] F. Ye, Y. Su, H. Xiao, X. Zhao, and W. Min, “Remote sensing image registration using convolutional neural network features,” IEEE Geosci. Remote Sens. Lett., vol. 15, no. 2, pp. 232–236, 2018. doi: 10.1109/LGRS.2017.2781741. [154] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915–1929, 2013. doi: 10.1109/ TPAMI.2012.231. [155] W. Huang, L. Xiao, Z. Wei, H. Liu, and S. Tang, “A new pansharpening method with deep neural networks,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 5, pp. 1037–1041, 2015. doi: 10.1109/LGRS.2014.2376034. [156] Y. Xing, M. Wang, S. Yang, and L. Jiao, “Pan-sharpening via deep metric learning,” ISPRS J. Photogram. Remote Sens., vol. 145, pp. 165–183, Nov. 2018. doi: 10.1016/j.isprsjprs.2018. 01.016. [157] G. J. Scott, M. R. England, W. A. Starms, R. A. Marcum, and C. H. Davis, “Training deep convolutional neural networks for land–cover classification of high-resolution imagery,” IEEE Geosci. Remote Sens. Lett., vol. 14, no. 4, pp. 549–553, 2017. doi: 10.1109/LGRS.2017.2657778. [158] L. Ma, Y. Liu, X. Zhang, Y. Ye, G. Yin, and B. A. Johnson, “Deep learning in remote sensing applications: A meta-analysis and review,” ISPRS J. Photogram. Remote Sens., vol. 152, pp. 166–177, June 2019. doi: 10.1016/j.isprsjprs.2019.04.015. [159] Y. Liu, D. Minh Nguyen, N. Deligiannis, W. Ding, and A. Munteanu, “Hourglass-shapenetwork based semantic segmentation for high resolution aerial imagery,” Remote Sens., vol. 9, no. 6, p. 522, 2017. doi: 10.3390/rs9060522. [160] H. Zhang et al., “Registration of multimodal remote sensing image based on deep fully convolutional neural network,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 12, no. 8, pp. 3028–3042, 2019. doi: 10.1109/JSTARS.2019. 2916560. [161] N. Merkle, W. Luo, S. Auer, R. Müller, and R. Urtasun, “Exploiting deep matching and SAR data for the geo-localization accuracy improvement of optical satellite images,” Remote Sensing, vol. 9, no. 6, p. 586, 2017. doi: 10.3390/rs9060586. 141
[162] H. He, M. Chen, T. Chen, and D. Li, “Matching of remote sensing images with complex background variations via Siamese convolutional neural network,” Remote Sens., vol. 10, no. 3, p. 355, 2018. doi: 10.3390/rs10020355. [163] L. H. Hughes, M. Schmitt, L. Mou, Y. Wang, and X. X. Zhu, “Identifying corresponding patches in SAR and optical images with a pseudo-Siamese CNN,” IEEE Geosci. Remote Sens. Lett., vol. 15, no. 5, pp. 784–788, 2018. doi: 10.1109/LGRS.2018. 2799232. [164] S. Wang, D. Quan, X. Liang, M. Ning, Y. Guo, and L. Jiao, “A deep learning framework for remote sensing image registration,” ISPRS J. Photogram. Remote Sens., vol. 145, pp. 148–164, Nov. 2018. doi: 10.1016/j.isprsjprs.2017.12.012. [165] R. Fan, B. Hou, J. Liu, J. Yang, and Z. Hong, “Registration of multi-resolution remote sensing images based on L2-Siamese model,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 1–1, Nov. 19, 2020. doi: 10.1109/JSTARS.2020. 3038922. [166] N. Merkle, S. Auer, R. Müller, and P. Reinartz, “Exploring the potential of conditional adversarial networks for optical and SAR image matching,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 11, no. 6, pp. 1811–1820, 2018. doi: 10.1109/ JSTARS.2018.2803212. [167] H. L. Hughes, M. Schmitt, and X. X. Zhu, “Mining hard negative samples for SAR-optical image matching using generative adversarial networks,” Remote Sens., vol. 10, no. 10, p. 1552, 2018. doi: 10.3390/rs10101552. [168] J. Zhang, W. Ma, Y. Wu, and L. Jiao, “Multimodal remote sensing image registration based on image transfer and local features,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 8, pp. 1210– 1214, 2019. doi: 10.1109/LGRS.2019.2896341. [169] N. Girard, G. Charpiat, and Y. Tarabalka, “Aligning and updating cadaster maps with aerial images by multi-task, multi-resolution deep learning,” in Proc. Asian Conf. Comput. Vision (ACCV 2018), Cham, 2019, pp. 675–690. [170] L. Li, L. Han, M. Ding, Z. Liu, and H. Cao, “Remote sensing image registration based on deep learning regression model,” IEEE Geosci. Remote Sens. Lett., early access, 2020. doi: 10.1109/ LGRS.2020.3032439. [171] F. Liu, F. Bi, L. Chen, H. Shi, and W. Liu, “Feature-area optimization: A novel SAR image registration method,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 2, pp. 242–246, 2016. doi: 10.1109/LGRS.2015.2507982. [172] X. Huang, Y. Sun, D. Metaxas, F. Sauer, and C. Xu, “Hybrid image registration based on configural matching of scale-invariant salient region features,” in Proc. IEEE Comput. Society Conf. Comput. Vis. Pattern Recognit. (CVPR), Washington, D. C., 2004, pp. 167–167. doi: 10.1109/CVPR.2004.362. [173] G. Hong and Y. Zhang, “Combination of feature-based and area-based image registration technique for high resolution remote sensing image,” in Proc. Int. Geosci. Remote Sens. Symp. (IGARSS), Barcelona, Spain, 2007, pp. 377–380. [174] N. E. Mekky, F. E.-Z. Abou-Chadi, and S. Kishk, “Waveletbased image registration techniques: A study of performance,” Int. J. Comput. Sci. Netw. Security, vol. 11, no. 2, pp. 188–196, 2011. 142 [175] S. Suri, P. Schwind, P. Reinartz, and J. Uhl, “Combining mutual information and scale invariant feature transform for fast and robust multisensor SAR image registration,” in Proc. 75th Annu. ASPRS Conf., 2009. [176] Y. S. Heo, K. M. Lee, and S. U. Lee, “Joint depth map and color consistency estimation for stereo images with different illuminations and cameras,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 5, pp. 1094–1106, 2013. 
doi: 10.1109/ TPAMI.2012.167. [177] R. Feng, Q. Du, H. Shen, and X. Li, “Region-by-region registration combining feature-based and optical flow methods for remote sensing images,” Remote Sens., vol. 13, no. 8, p. 1475, 2021. doi: 10.3390/rs13081475. [178] C. Xing, J. Wang, and Y. Xu, “A method for building a mosaic with UAV images,” Int. J. Inform. Eng. Electron. Bus., vol. 2, no. 1, pp. 9–15, 2010. doi: 10.5815/ijieeb.2010.01.02. [179] Z. Kang, L. Zhang, S. Zlatanova, and J. Li, “An automatic mosaicking method for building facade texture mapping using a monocular close-range image sequence,” ISPRS J. Photogram. Remote Sens., vol. 65, no. 3, pp. 282–293, 2010. doi: 10.1016/j. isprsjprs.2009.11.003. [180] S. R. Lee, “A coarse-to-fine approach for remote-sensing image registration based on a local method,” Int. J. Smart Sens. Intell. Syst., vol. 3, no. 4, 2010. [181] K. Sharma and A. Goyal, “Very high resolution image registration based on two step Harris-Laplace detector and SIFT descriptor,” in 2013 4th Int. Conf. Comput., Commun. Netw. Technol. (ICCCNT), pp. 1–5. doi: 10.1109/ICCCNT.2013.6726632. [182] W. Ma, J. Zhang, Y. Wu, L. Jiao, H. Zhu, and W. Zhao, “A novel two-step registration method for remote sensing images based on deep and local features,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 7, pp. 4834–4843, 2019. doi: 10.1109/ TGRS.2019.2893310. [183] S. Li, L. Yuan, J. Sun, and L. Quan, “Dual-feature warpingbased motion model estimation,” in Proc. IEEE Int. Conf. Comput. Vision (ICCV), 2015, pp. 4283–4291. [184] S. Nag, “Image registration techniques: A survey,” Nov. 28, 2017, arXiv:1712.07540. [185] J. S. Bhatt and N. Padmanabhan, “Image Registration for meteorological applications: Development of a generalized software for sensor data registration at ISRO,” IEEE Geosci. Remote Sens. Mag. (replaces Newslett.), vol. 8, no. 4, pp. 23–37, 2020. doi: 10.1109/MGRS.2019.2949382. [186] A. Sedaghat and N. Mohammadi, “High-resolution image registration based on improved SURF detector and localized GTM,” Int. J. Remote Sens., vol. 40, no. 7, pp. 2576–2601, Apr. 2019. doi: 10.1080/01431161.2018.1528402. [187] Y. Ma, J. Wang, H. Xu, S. Zhang, X. Mei, and J. Ma, “Robust image feature matching via progressive sparse spatial consensus,” IEEE Access, vol. 5, pp. 24,568–24,579, Oct. 2017. doi: 10.1109/ ACCESS.2017.2768078. [188] H. Goncalves, J. A. Goncalves, and L. Corte-Real, “Measures for an objective evaluation of the geometric correction process quality,” IEEE Geosci. Remote Sens. Lett., vol. 6, no. 2, pp. 292– 296, 2009. doi: 10.1109/LGRS.2008.2012441. GRS IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
Deep Learning Meets SAR
Concepts, models, pitfalls, and perspectives
XIAO XIANG ZHU, SINA MONTAZERI, MOHSIN ALI, YUANSHENG HUA, YUANYUAN WANG, LICHAO MOU, YILEI SHI, FENG XU, AND RICHARD BAMLER

Deep learning in remote sensing has received considerable international hype, but it is mostly limited to the evaluation of optical data. Although deep learning has been introduced in synthetic aperture radar (SAR) data processing, despite successful first attempts, its huge potential remains locked. In this article, we provide an introduction to the most relevant deep learning models and concepts, point out possible pitfalls by analyzing special characteristics of SAR data, review the state of the art of deep learning applied to SAR, summarize available benchmarks, and recommend some important future research directions. With this effort, we hope to stimulate more research in this interesting yet underexploited field and to pave the way for the use of deep learning in big SAR data processing workflows.

MOTIVATION
In recent years, deep learning [1] has been developed at a dramatic pace, achieving great success in many fields. Unlike conventional algorithms, deep learning-based methods commonly employ hierarchical architectures, such as deep neural networks, to extract feature representations of raw data for numerous tasks. For instance, convolutional neural networks (CNNs) are capable of learning low- and high-level features from raw images with stacks of convolutional and pooling layers and then applying the extracted features to various computer vision tasks, such as large-scale image recognition [2], object detection [3], and semantic segmentation
[4]. Inspired by numerous successful applications in the computer vision community, the use of deep learning in remote sensing is now receiving significant attention [5]. As first attempts at SAR applications, deep learning-based methods have been adopted for a variety of tasks, including terrain surface classification [6], object detection [7], parameter inversion [8], despeckling [9], specific functions in interferometric SAR (InSAR) [10], and SAR–optical data fusion [11]. For terrain surface classification from SAR and polarimetric SAR (PolSAR) images, effective feature extraction is essential. These features are extracted based on expert domain knowledge and are usually applicable to a small number of cases and data sets. Deep learning feature extraction has, however, proved to overcome, to some degree, both of the aforementioned issues [6]. For SAR target detection, conventional approaches mainly rely on template matching, where specific templates are manually created [12] to classify different categories, and the use of traditional machine learning (ML) methods, such as support vector machines (SVMs) [13], [14]; in contrast, modern deep learning algorithms aim at applying deep CNNs to automatically extract discriminative features for target recognition [7]. For parameter inversion, deep learning models are employed to learn the latent mapping function from SAR images to estimated parameters, e.g., sea ice concentration [8]. Regarding despeckling, conventional methods often rely on artificial filters and may suffer from the improper elimination of sharp features when denoising. Furthermore, the development of SAR and optical image joint analysis has been motivated by the capacities of extracting features from both types of images. For applications in InSAR, only a few studies have been carried out, such as the work described in [10]. However, these algorithms neglect the special characteristics of phase and simply use an out-of-the-box deep learning-based model. Despite initial successes, and unlike the evaluation of optical data, the huge potential of deep learning in SAR and InSAR remains locked. For example, to the best knowledge of the authors, there is no example of deep learning in SAR that has been developed for the operational processing of big data or integrated into the production chain of any satellite mission. This article aims at stimulating more research in this interesting yet underexploited research field.

INTRODUCTION TO RELEVANT DEEP LEARNING MODELS AND CONCEPTS
In this section, we briefly review relevant deep learning algorithms that were originally proposed for visual data processing and that are widely used for state-of-the-art research into deep learning in SAR. In addition, we mention the latest deep learning developments, which are not yet widely applied to SAR but may help create the next generation of its algorithms. Figure 1 gives an overview of the deep learning models we discuss in this section.

FIGURE 1. A selection of relevant deep learning models. (a) The Visual Geometry Group Network. (Source: [15].) (b) The residual neural network (ResNet) block. (Source: [16].) (c) The U-Net. (Source: [17].) (d) The long short-term memory unit. (Source: [18].) (e) The variational autoencoder. (Source: [19].) (f) The recurrent neural network (RNN). (Source: [20].) (g) The generative adversarial network. (Source: [21].) (h) The convolutional graph neural network (GNN).
(Source: [22].) (i) The recurrent GNN. (Source: [23].) (j) Neural architecture search using deep reinforcement learning (RL). (Source: [24].) ReLU: rectified linear unit; GRU: gated recurrent unit.
Before discussing deep learning algorithms, we would like to stress that the importance of high-quality benchmark data sets in deep learning research cannot be overstated. Especially in supervised learning, the knowledge that can be gained by a model is bounded by the information present in the training data set. For example, the Modified National Institute of Standards and Technology (MNIST) [25] data set played a key role in LeCun’s seminal paper about CNNs and gradient-based learning [26]. Similarly, there would be no AlexNet [27], the network that kickstarted the current deep learning renaissance, without the ImageNet [28] data set, which contains more than 14 million images and 22,000 classes. ImageNet has been such an important part of deep learning research that, more than 10 years after it was published, it is still used as a standard benchmark to evaluate the performance of CNNs for image classification.

DEEP LEARNING MODELS
The main principle of deep learning models is to encode input data into effective feature representations for target tasks. To exemplify how a deep learning framework works, we take the autoencoder as an example: it first maps input data to a latent representation via a trainable nonlinear mapping and then reconstructs inputs through reverse mapping. The reconstruction error is usually defined as the Euclidean distance between inputs and reconstructed inputs. Parameters of autoencoders are optimized by gradient descent-based optimizers, such as stochastic gradient descent, root mean square propagation [29], and Adam [30], during the backpropagation step.

CNNs
With the success of AlexNet in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where the network scored a top-five test error of 15.3%, compared to the second-best test error of 26.2%, CNNs have attracted worldwide attention and are now used for many image understanding tasks, such as image classification, object detection, and semantic segmentation. AlexNet consists of five convolutional layers, three maximum pooling layers, and three fully connected layers. One of the key AlexNet innovations was the use of graphics processing units (GPUs), which made it possible to train such large networks with huge data sets without using supercomputers. In just two years, the Visual Geometry Group Network [2] overtook AlexNet in performance by achieving a 6.8% top-five test error at ILSVRC-2014; the main difference was that it used only 3 × 3-sized convolutional kernels, which enabled it to have more channels and, in turn, capture more diverse features. The residual neural network (ResNet) [31], U-Net [32], and DenseNet [33] were the next major CNN architectures. Their main feature concerned the idea of connecting not only neighboring layers but any two layers in the network by using skip connections. This helped reduce information loss across networks, mitigated the problem of vanishing gradients, and facilitated the design of deeper networks. U-Net is one of the most commonly used image segmentation networks. It has an autoencoder-based architecture that uses skip connections to concatenate features from the first layer to the last, the second to the second last, and so on; this way, it can get fine-grained information from the initial layers to the end layers. U-Net was initially proposed for medical image segmentation, where data labeling is a big problem.
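As a rough illustration of the two flavors of skip connection just described, the following sketch (assuming PyTorch; the module names and layer sizes are our own illustrative choices, not the published ResNet or U-Net designs) builds a residual block with an identity shortcut and a toy encoder–decoder that concatenates early features into the decoder:

```python
# Minimal sketch of skip connections; sizes are arbitrary and for illustration only.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """ResNet-style skip: the block's input is added to its output."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        y = torch.relu(self.conv1(x))
        y = self.conv2(y)
        return torch.relu(x + y)          # identity shortcut eases gradient flow

class TinyUNet(nn.Module):
    """U-Net-style skip: early encoder features are concatenated into the decoder."""
    def __init__(self, in_ch=1, base=16):
        super().__init__()
        self.enc = nn.Conv2d(in_ch, base, kernel_size=3, padding=1)
        self.down = nn.MaxPool2d(2)
        self.mid = ResidualBlock(base)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec = nn.Conv2d(2 * base, in_ch, kernel_size=3, padding=1)  # 2*base after concat

    def forward(self, x):
        e = torch.relu(self.enc(x))       # fine-grained features kept for later
        m = self.mid(self.down(e))        # coarser features
        u = self.up(m)
        return self.dec(torch.cat([e, u], dim=1))   # skip concatenation

x = torch.randn(1, 1, 64, 64)             # e.g., a 64 x 64 image patch
print(TinyUNet()(x).shape)                # -> torch.Size([1, 1, 64, 64])
```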
The U-Net authors employed heavy data augmentation techniques on input data, making it possible to learn from only a few hundred annotated samples. In ResNet, skip connections were incorporated within individual blocks, not across the whole network. Since its initial proposal, ResNet has undergone many architectural tweaks, and, even after five years, its variants are always among the top scorers on ImageNet. In DenseNet, all the layers were attached to all preceding layers, reducing the size of the network, albeit at the cost of memory usage. For a more detailed explanation of different CNN models, interested readers are referred to [34]. These CNN models have also proved their worth in SAR processing tasks; e.g., see [35]–[37]. For more examples and details of CNNs in SAR, see the “Recent Advances in Deep Learning Applied to SAR” section.

RECURRENT NEURAL NETWORKS
Besides CNNs, recurrent neural networks (RNNs) [38] are a major class of deep networks. Their main building blocks are recurrent units, which take the current input and the output of the previous state as input. They provide state-of-the-art results for processing data of variable lengths, including text and time-series information. Their weights can be replaced with convolutional kernels for visual processing tasks, such as image captioning and predicting future frames/points in visual time-series data. Long short-term memory (LSTM) [39] is one of the most popular RNN architectures: its cells can store values from any past instances and are not severely affected by the problem of diminishing gradients. As for any other time-series data, RNNs from the deep learning tool kit are natural choices for processing SAR time series; e.g., see [40].

GENERATIVE ADVERSARIAL NETWORKS
Proposed by Ian Goodfellow et al. [41], generative adversarial networks (GANs) are among the most popular and exciting inventions in the field of deep learning. Based on game-theoretic principles, they consist of two networks called a generator and a discriminator. The generator’s objective is to learn a latent space through which it can create samples from the same distribution as the training data, while the discriminator tries to learn to distinguish whether a sample is from the generator or the training data. This very simple mechanism is responsible for most cutting-edge algorithms for various applications, e.g., generating artificial photorealistic images and videos, superresolution, and text-to-image synthesis. For example, in the SAR domain, GANs
have already been successfully used in cloud removal applications [42], [43]. See the “Recent Advances in Deep Learning Applied to SAR” section for more examples.

SUPERVISED, UNSUPERVISED, AND REINFORCEMENT LEARNING

SUPERVISED LEARNING
Most popular deep learning models fall under the category of supervised deep learning; i.e., they need labeled data sets to learn objective functions. One big challenge of supervised learning is generalization, i.e., how well a trained model performs on test data. Therefore, it is vital that training data truly represent the actual data distribution so that they can handle all the unseen information. If a model fits well on training data and fails on test data, overfitting occurs. In the deep learning literature, there are several techniques that can be used to avoid overfitting, e.g., dropout [44].

UNSUPERVISED LEARNING
Unsupervised learning refers to the class of algorithms where the training data do not contain labels. For instance, in classical data analysis, principal component analysis [45] can be used to reduce the data dimension, followed by a clustering algorithm to group similar data points. In deep learning generative models, autoencoders, variational autoencoders (VAEs) [46], and GANs [41] are some of the popular techniques that can be used for unsupervised learning. Their primary goal is to generate output data from the same distribution as the input data. Autoencoders consist of an encoder that finds a compressed, latent representation of the input and a decoder that translates a representation back to the original input. VAEs take autoencoders to the next level by learning a whole distribution instead of just a single representation at the end of the encoder; this, in turn, can be used by the decoder to generate the whole distribution of outputs. The trick to learning this distribution is to also acquire the variance along with the mean of the latent representation at the encoder–decoder meeting point and to add a Kullback–Leibler divergence-based loss term to the standard reconstruction loss function of the autoencoders.

DEEP REINFORCEMENT LEARNING
Reinforcement learning (RL) tries to mimic human learning behavior, i.e., taking actions and then adjusting them for the future, according to feedback from the environment. For example, young children learn to repeat or not repeat their actions based on the reaction of their parents. The RL model consists of an environment with states, actions to transition between those states, and a reward system for ending up in different states. The objective of the algorithm is to learn the best actions for given states using a feedback–reward system. In classical RL algorithms, function approximators are used to calculate the probability of different actions in different states. Deep RL employs different types of neural networks to create these functions [47], [48]. Recently, deep RL received particular attention and achieved popularity due to the success of Google DeepMind’s AlphaGo [49], which defeated the Go board game world champion. This task was considered impossible for computers until a few years ago.

RELEVANT DEEP LEARNING CONCEPTS

AUTOMATIC ML
Deep networks have many hyperparameters to choose from, for example, the number of layers, kernel sizes, types of optimizers, skip connections, and the like.
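To get a feel for how quickly such choices multiply, the toy sketch below enumerates a small, made-up search space and draws one random configuration; the option lists are purely illustrative, no training or evaluation is performed, and this is not any real AutoML tool:

```python
# Toy hyperparameter space: count the configurations and sample one at random.
import random

search_space = {
    "num_layers":    [4, 8, 16, 32],
    "kernel_size":   [3, 5, 7],
    "optimizer":     ["sgd", "rmsprop", "adam"],
    "skip_connect":  [True, False],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

n_configs = 1
for options in search_space.values():
    n_configs *= len(options)
print(n_configs, "configurations from only five design choices")   # 360

# Plain random search: sample candidates instead of enumerating them all.
random.seed(0)
candidate = {name: random.choice(options) for name, options in search_space.items()}
print(candidate)
```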
There are billions of possible combinations of these parameters, and, given high computational time and energy costs, it is hard to find the best-performing network, even from among a few hundred candidates. In the case of deep learning, the objective of automatic ML (AutoML) is to find the most efficient and high-performing deep network for a given data set and task. The first major attempt in this field was by Zoph et al. [24], who used deep RL to find the optimum CNN for image classification. In the system, an RNN creates CNN architectures and, based on their classification results, proposes changes to them. This process continues to loop until the optimum architecture is found. This algorithm was able to find networks competitive with the state of the art, but it required more than 800 GPUs, which is unrealistic for practical applications. Recently, there have been many developments in the AutoML field, and they have made it possible to perform such tasks in more intelligent and efficient ways. More details about the field of network architecture search can be found in [51]. Furthermore, AutoML has already been successfully applied to SAR for PolSAR classification [52]. The method shows great potential for segmentation and classification tasks, in particular.

GEOMETRIC DEEP LEARNING: GRAPH NEURAL NETWORKS
Apart from well-structured image data, there is a large amount of unstructured data in real life, e.g., knowledge graphs and social networks, that cannot be directly processed by a deep CNN. Usually, these data are represented in the form of graphs, where each node indicates an entity and edges delineate mutual relations. As an approach to learning from unstructured data, geometric deep learning has been attracting increasing attention; the most common architecture is the graph neural network (GNN), which has also proved to be successful in dealing with structured data. Specifically, using the terminology of graphs, nodes of a graph can be regarded as feature descriptions of entities, and their edges are established by measuring their relations and distances and encoded in an adjacency matrix. Once a graph is constructed, messages can be propagated among nodes by simply performing matrix multiplication. Accordingly, [53] proposed graph convolutional networks (GCNs), which are characterized by utilizing graph convolutions; the authors of [45] accelerated the process. Moreover, the
units in recurrent GNNs (RGNNs) [23], [55] have been shown to achieve promising results in learning from graphs. The usefulness of GNNs in SAR is still to be properly explored, and [56] is one of the only attempts to do so.

POSSIBLE PITFALLS
To develop tailored deep learning architectures and prepare suitable training data sets for SAR and InSAR tasks, it is important to understand that SAR data are different from optical remote sensing data, not to mention images downloaded from the Internet. In this section, we discuss the special characteristics (and possible pitfalls) encountered while applying deep learning to SAR. What makes SAR data and SAR data processing by neural networks unique? SAR data are substantially different from optical imagery in many respects. The following points should be considered when transferring CNN experience and expertise from optical to SAR data:
◗ Dynamic range: Depending on the spatial resolution, the dynamic range of SAR images can be up to 90 dB (TerraSAR-X high-resolution spotlight data with a resolution of roughly 1 m). Moreover, the distribution is extremely asymmetric, with the majority of the pixels in the low-amplitude range (distributed scatterers) and a long tail representing bright discrete scatterers, in particular, in urban areas. Standard CNNs are not able to handle such dynamic ranges, and hence most approaches feature dynamic compression as a preprocessing step. In [57], the authors first take only amplitude values from zero to 255 and then subtract the mean values of each image. In [11] and [58], normalization is performed as a preprocessing step, which significantly compresses the dynamic range.
◗ Signal statistics: To retrieve features from SAR (amplitude and intensity) images, speckle statistics must be considered. Speckle is a multiplicative, rather than an additive, phenomenon. This has consequences: while the optimum estimator of the radar brightness of a homogeneous image patch under speckle is a simple moving-average operation (i.e., a convolution, such as in the additive noise case), detectors of edges and low-level features that are optimum under additive Gaussian noise may no longer be optimum in the case of SAR. A popular example is Touzi’s constant false alarm rate edge detector [59] for SAR images, which uses the ratio of two spatial averages across adjacent windows. This operation cannot be emulated by the first layer of a standard CNN. Some studies use a logarithmic mapping of the SAR images prior to feeding them into a CNN [9], [60]. This turns speckle into an additive random variable and, as a side effect, reduces the dynamic range. But, still, a single convolutional layer can emulate only approximations to optimum SAR feature estimators. It could be valuable to supplement the original logarithmic SAR image by a few low-pass-filtered and logarithmized versions as input to a CNN. Another approach is to apply a sophisticated speckle reduction filter before entering a CNN, e.g., nonlocal averaging [61]–[63].
◗ Imaging geometry: SAR image coordinates’ range and azimuth are not arbitrary, such as east and north or x and y, but, rather, reflect the peculiarities of the image generation process. Layover always occurs at the near range of an object, and shadow always results at the far range. That means data augmentation by SAR image rotation would lead to nonsense imagery that would never be generated by a SAR.
◗◗ The complex nature of SAR data: The most valuable SAR data information lies in its phase. This applies to SAR image formation, which takes place in the complex signal domain, as well as to PolSAR, InSAR, and tomographic SAR data processing, meaning that an entire CNN must be able to handle complex numbers. For the convolution operation, this is trivial. The nonlinear activation function and the loss function, however, require thorough consideration. Depending on whether the activation function independently acts on the real and imaginary parts of the signal, or only on its magnitude, and where bias is added, the phase will be distorted to different degrees. If we use PolSAR data for land cover and target classification, a nonlinear processing of the phase is even desirable because the phase between different polarimetric channels has physical meaning and hence contributes to the classification process. In SAR interferometry and tomography, however, the absolute phase has no meaning; i.e., the CNN must be invariant to an arbitrary phase offset. Assume some interferometric input signal x to a CNN and the output signal CNN(x) with phase

φ_t = ∠CNN(x). (1)

Any constant phase offset φ_0 does not change the meaning of the interferogram. Thus, we require an invariance that we refer to as phase linearity (which is valid at least in the expectation):

CNN(x e^{jφ_0}) = CNN(x) e^{jφ_0}. (2)

This linearity is violated, for example, if the activation function is separately applied to real and imaginary parts and if a bias is added to the complex numbers. Another point to consider in regression-type InSAR CNN processing (e.g., for noise reduction) is the loss function. If the quantity of interest is not the complex number itself but its phase, the loss function must be able to handle the cyclic nature of phases. It may also be advantageous that the loss function is independent, at least to a certain degree, of the signal magnitude to relieve a CNN from modeling the magnitude. A loss function that meets these requirements is, for example,

L = |E[e^{j(∠CNN(x) − ∠y)}]|, (3)

where y is the reference signal.
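A minimal numerical sketch of the phase-linearity requirement in (2) and the loss in (3) is given below (Python/NumPy; the toy interferogram and the magnitude-only placeholder "network" are our assumptions, used only to show the bookkeeping):

import numpy as np

rng = np.random.default_rng(1)

def phase_loss(pred, ref):
    """Loss of (3): magnitude of the mean phasor of the phase difference.
    It is cyclic in the phase and independent of the signal magnitude;
    it equals 1 when the phases agree up to a constant offset, so in
    practice one would minimize 1 - L."""
    return np.abs(np.mean(np.exp(1j * (np.angle(pred) - np.angle(ref)))))

# Toy complex interferogram patch and a placeholder "network" that only
# rescales the magnitude, so it trivially satisfies the linearity of (2).
x = rng.normal(size=(32, 32)) + 1j * rng.normal(size=(32, 32))
cnn = lambda z: 0.5 * z

y = x                                    # reference signal of (3)
phi0 = 1.2                               # arbitrary constant phase offset
out_plain = cnn(x)
out_shift = cnn(x * np.exp(1j * phi0))

# Phase linearity, (2): shifting the input phase shifts the output phase identically.
print(np.allclose(out_shift, out_plain * np.exp(1j * phi0)))            # True

# The loss of (3) is unaffected by the common offset (both values are 1.0 here).
print(phase_loss(out_plain, y), phase_loss(out_shift, y * np.exp(1j * phi0)))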
Some authors use the magnitude and phase, rather than the real and imaginary parts, as input to a CNN. This approach is not invariant to phase offset, either. The interpretation of a phase function as a real-valued image also confronts a CNN with the sharp discontinuities at the ±π transitions, whose positions carry no information. A standard CNN would pounce on these, interpreting them as edges.
◗◗ Simulation-based training and validation data: The prevailing lack of ground truth for regression-type tasks, such as speckle reduction and InSAR denoising, might tempt us to use simulated SAR data for the training and validation of neural networks. However, this bears the risk that our networks will learn models that are far too simplified. Unlike optical imaging, where highly realistic scenes can be simulated, e.g., by PC games, the simulation of SAR data is more of a scientific topic that lacks the power of commercial companies and a huge market. SAR simulators focus on specific scenarios, e.g., vegetation (only distributed scatterers are considered) and persistent (point) scatterers. The most advanced simulators are probably the ones for computing the radar backscatter signatures of single military objects, such as vessels. To our knowledge, though, there is no simulator available that can, e.g., generate realistic interferometric data of rugged terrain with layover, spatially varying coherence, and diverse scattering mechanisms. Often, simplified scattering assumptions are made, e.g., that speckle is multiplicative. Even this is not always true; pure Gaussian scattering can be found only for quite homogeneous surfaces and low-resolution SARs. As soon as the resolution increases, the chances of having a few dominating scatterers in a resolution cell increase, and the statistics become substantially different from those of fully developed speckle.

RECENT ADVANCES IN DEEP LEARNING APPLIED TO SAR

In this section, we provide an in-depth review of deep learning methods applied to SAR data from six perspectives: terrain surface classification, object detection, parameter inversion, despeckling, InSAR, and SAR–optical data fusion. For each, we state notable developments in chronological order and report their advantages and disadvantages. Finally, each section concludes with a brief summary. It is worth mentioning that the application of deep learning to SAR image formation is not explicitly treated here. For SAR focusing, we have to distinguish between general-purpose focusing and the imaging of objects with a priori known properties, such as sparsity. General-purpose algorithms produce data for applications including land use and land cover classification, glacier monitoring, biomass estimation, and interferometry. These are complex-valued, focused data that retain all the information contained in the raw data. General-purpose focusing has a well-defined system model and requires a sequence of fast Fourier transforms (FFTs) and phasor multiplications, i.e., linear operations, such as matrix–vector multiplications. For decades, optimal algorithms have been developed to perform these operations at the highest possible speeds and with diffraction-limited accuracy. There is no reason that deep neural networks should perform better or faster than this gold standard. If we want to introduce prior knowledge about imaged objects, however, specialized focusing algorithms may be beneficially learned by neural networks.
But, even then, it might make sense to focus raw data first through a standard algorithm and apply deep learning for postprocessing. In [64], a CNN is trained to focus sparse military targets. Nevertheless, in this approach, the raw data are partially focused by an FFT before entering the CNN.

TERRAIN SURFACE CLASSIFICATION

As an important direction for SAR applications, terrain surface classification using PolSAR images is rapidly advancing with the help of deep learning. Regarding feature extraction, most conventional methods rely on exploring physical scattering properties [65] and texture information [66] in SAR images. However, these features are mainly human-designed based on specific problems and characteristics of the data sources. Compared to conventional methods, deep learning is superior in terrain surface classification due to its capability of automatically learning discriminative features. Moreover, deep learning approaches, such as CNNs, can effectively extract not only polarimetric characteristics but also spatial patterns of PolSAR images [6]. Some of the most notable deep learning techniques for PolSAR image classification are reviewed in the following.

Xie et al. [67] first applied deep learning to terrain surface classification using PolSAR images. They employed a stacked autoencoder (SAE) to automatically learn deep features from PolSAR data and then fed the data to a Softmax classifier. Remarkable improvements in both the classification accuracy and the visual effect proved that this method could effectively learn a comprehensive feature representation for classification purposes. Instead of simply applying an SAE, Geng et al. [70] proposed a deep convolutional autoencoder (DCAE) for automatically extracting features and performing classification. The first layer of the DCAE is a handcrafted convolutional layer, where filters are predefined, such as gray-level co-occurrence matrices and Gabor filters. The second layer performs a scale transformation, which integrates correlated neighbor pixels to reduce speckle. Following these two layers, a trained SAE, which is similar to [67], is attached for learning more abstract features. Tested on high-resolution, single-polarization TerraSAR-X images, the method achieved remarkable classification accuracy. Building on the DCAE, Geng et al. [68] proposed a framework for SAR image classification, called the deep supervised and contractive neural network (DSCNN), which introduces histogram-of-oriented-gradient descriptors. In addition, a supervised penalty was designed to capture relevant
information between features and labels, and a contractive restriction, which can enhance the local invariance, was employed in the following trainable autoencoder layers. An example of applying the DSCNN to TerraSAR-X data from a small area in Norway appears in Figure 2. Compared to other algorithms, the ability of the DSCNN to achieve a highly accurate and noise-free classification map is observed.

In addition to the aforementioned methods, many studies integrate SAE models with conventional classification algorithms for terrain surface classification. Hou et al. [73] proposed an SAE combined with superpixels for PolSAR image classification. Multiple layers of the SAE are trained on a pixel-by-pixel basis. Superpixels are formed based on Pauli-decomposed pseudocolor images. Outputs of the SAE are used as features in the final step of k-nearest-neighbor superpixel clustering. Zhang et al. [74] applied a sparse SAE to PolSAR image classification by taking into account local spatial information. Qin et al. [75] applied adaptive restricted Boltzmann machine boosting to PolSAR image classification. Zhao et al. [76] proposed a discriminant deep belief network for SAR image classification, in which discriminant features are gleaned by combining ensemble learning with a deep belief network in an unsupervised manner.

Moreover, taking into account that most current deep learning methods aim at exploiting features from PolSAR image polarization information and spatial information, Gao et al. [72] proposed a dual-branch CNN to learn features from both perspectives for terrain surface classification. This method is built on two feature extraction channels: one to extract polarization features from the six-channel real matrix and the other to extract the spatial features of a Pauli decomposition. Next, the extracted features are combined using two parallel, fully connected layers, and they are finally fed to a Softmax layer for classification. The detailed architecture of this network is illustrated in Figure 3.

Different variations of CNNs have been used for terrain surface classification, as well. In [77], Zhou et al. first extracted a six-channel covariance matrix and then fed it to a trainable CNN for PolSAR image classification. Wang et al. [78] proposed a fully convolutional network (FCN) integrated with sparse and low-rank subspace representations for classifying PolSAR images. Chen et al. [79] improved CNN performance by incorporating expert knowledge of target scattering mechanism interpretation and polarimetric feature mining. In a more recent work [80], He et al. proposed the combination of features learned from nonlinear manifold embedding and applying an FCN to input PolSAR images; the final classification was carried out in an ensemble approach by an SVM. In [81], the authors focused on the computational efficiency of deep learning methods, proposing the use of lightweight 3D CNNs. They showed that a classification accuracy comparable to other CNN methods was achievable while significantly reducing the number of learned parameters and therefore gaining computational efficiency.

FIGURE 2. Classification maps obtained from a TerraSAR-X image of a small area in Norway [68].
(a)–(f) depict the results of classification using (a) an SVM (accuracy = 78.42%), (b) a sparse representation classifier (SRC) (accuracy = 85.61%), (c) a random forest (accuracy = 82.20%) [69], (d) an SAE (accuracy = 87.26%) [67], (e) a DCAE (accuracy = 94.57%) [70], and (f) a contractive autoencoder (accuracy = 88.74%). (g)–(i) show the combination of a DSCNN with (g) an SVM (accuracy = 96.98%), (h) an SRC (accuracy = 92.51%) [71], and (i) a random forest (accuracy = 96.87%). (j) and (k) represent the classification results of (j) a DSCNN (accuracy = 97.09%) and (k) a DSCNN followed by spatial regularization (accuracy = 97.53%), which achieves higher accuracy than the other methods.
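As an illustration of the dual-branch idea of [72] shown in Figure 3, the following PyTorch sketch builds one convolutional stream for the six-channel real matrix and one for the Pauli RGB decomposition and fuses them with parallel fully connected layers before the classification output; the layer widths, patch size, and number of classes are our assumptions rather than the exact configuration of [72]:

import torch
import torch.nn as nn

class DualBranchCNN(nn.Module):
    """Two-stream CNN: polarimetric (six-channel) and spatial (Pauli RGB) branches."""
    def __init__(self, num_classes=15, patch=16):
        super().__init__()
        def branch(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Flatten(),
            )
        self.pol_branch = branch(6)       # six-channel real matrix derived from the coherency matrix
        self.pauli_branch = branch(3)     # Pauli RGB decomposition
        feat = 64 * (patch // 4) ** 2
        self.fc_pol = nn.Linear(feat, 128)
        self.fc_pauli = nn.Linear(feat, 128)
        self.classifier = nn.Linear(256, num_classes)   # Softmax is applied inside the loss

    def forward(self, x_pol, x_pauli):
        f1 = torch.relu(self.fc_pol(self.pol_branch(x_pol)))
        f2 = torch.relu(self.fc_pauli(self.pauli_branch(x_pauli)))
        return self.classifier(torch.cat([f1, f2], dim=1))

model = DualBranchCNN()
logits = model(torch.randn(4, 6, 16, 16), torch.randn(4, 3, 16, 16))
print(logits.shape)   # (4, 15)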
Apart from these single-image classification schemes using CNNs, the use of SAR image time series for crop classification has been shown in [40] and [82]. The authors of both papers experimented with RNN-based architectures to exploit the temporal dependency of multitemporal SAR images to improve classification accuracy. A unique approach for tackling PolSAR classification was recently proposed in [52], where, for the first time, the authors utilized an AutoML technique to find the optimum CNN architecture for each data set. The approach takes into account the complex nature of PolSAR images, is cost effective, and achieves high classification accuracy [52].

Most of the aforementioned methods rely primarily on preprocessing and transforming raw, complex-valued data into features in the real domain and then feeding the data into a common CNN, which constrains the possibility of directly learning features from the raw information. To tackle this problem, Zhang et al. [83] proposed a novel complex-valued CNN (CV-CNN) specifically designed to process complex values in PolSAR data, i.e., the off-diagonal elements of a coherency or covariance matrix. The CV-CNN not only takes complex numbers as inputs but also employs complex weights and complex operations throughout the different layers. A complex-valued backpropagation algorithm was also developed for CV-CNN training. Other notable complex-valued deep learning approaches for classification using PolSAR images can be found in [84]–[86]. Differing from the previously mentioned works, which exploit the complex-valued nature of SAR images in PolSAR image classification, Huang et al. [87] recently proposed a novel deep learning framework called the Deep SAR-Net for land use classification focusing on feature extraction from single-polarimetric complex SAR images. The authors perform a feature fusion based on spatial features learned from intensity images and time–frequency features extracted from the spectral analysis of complex SAR images. Since the time–frequency features are highly relevant for distinguishing different backscattering mechanisms within SAR images, they gain accuracy in classifying man-made objects compared to the use of typical CNNs, which focus only on spatial information.

FIGURE 3. The architecture of the dual-branch deep CNN (the Dual-CNN) for PolSAR image classification proposed in [72]. FC: fully connected; RGB: red–green–blue.

Although not completely related to terrain surface classification, it is also worth mentioning that the combination of SAR and PolSAR images with feed-forward neural networks has been extensively used for sea ice classification. This topic is not treated any further in this section, and the interested reader is referred to [88]–[92] for more information. Similar to the polarimetric signature, InSAR coherence provides information about physical scattering properties. In [35], interferometric volume decorrelation is used as a feature for forest/nonforest mapping together with radar backscatter and the incidence angle. The authors used bistatic TerraSAR-X Add-On for Digital Elevation Measurement data, where temporal decorrelation can be neglected. They compared different architectures and concluded that CNNs outperformed the random forest and that the U-Net [32] proved best for this segmentation task.

To summarize, it is apparent that deep learning-based SAR and PolSAR classification algorithms have advanced considerably in the past few years.
Although, at first, the emphasis was on low-rank representation learning using SAEs [67] and its modifications [70], later research focused on a multitude of issues relevant to SAR imagery, such as accounting for speckle [68], [70], spatial structures [72], and the complex nature of the data [83]–[85], [87]. It can also be seen that the labeled data scarcity challenge has driven researchers to use semisupervised learning algorithms [86], although weakly supervised methods
for semantic annotation, which have been proposed for high-resolution optical data [93], have not been explicitly explored for classification tasks using SAR data. Furthermore, specific metric-learning approaches to enhance class separability [94] can be adopted for SAR imagery to improve overall classification accuracy. Finally, one of ML's important fields, AutoML, which had not been extensively exploited by the remote sensing community, has found an application in PolSAR image classification [52].

OBJECT DETECTION

Although various characteristics distinguish SAR images from optical red–green–blue (RGB) images, the SAR object detection problem is still analogous to optical image classification and segmentation in the sense that feature extraction from raw data is always a prior and crucial step. Hence, given the success in the optical domain, there is no doubt that deep learning is one of the most promising ways to develop state-of-the-art SAR object detection algorithms. The majority of the earlier work related to SAR object detection using deep learning consists of taking successful deep learning methods for optical object detection and applying them with minor tweaks to military vehicle detection [the Moving and Stationary Target Acquisition and Recognition (MSTAR) data set] and ship detection with custom data sets. Even small networks are easily able to achieve more than 90% test accuracy for most of these tasks.

The first attempt at military vehicle detection can be found in [7], where Chen et al. used an unsupervised sparse autoencoder to generate convolution kernels from random patches of a given input for a single-layer CNN, which generated features to train a Softmax classifier for categorizing military targets in the MSTAR data set [96]. The experiments in [7] showed great potential for applying CNNs to SAR target recognition. With this discovery, Chen et al. [97] proposed A-ConvNets, a simple five-layer CNN that was able to achieve state-of-the-art accuracy of approximately 99% on MSTAR. Following this trend, more and more authors applied CNNs to MSTAR [37], [98], [99]. Morgan [37] successfully applied a modestly sized, three-layer CNN to MSTAR, and, building on that work, Wilmanski et al. [100] investigated the effects that initialization and optimizer selection had on the final results. Ding et al. [98] investigated the capabilities of a CNN model combined with domain-specific data augmentation techniques (e.g., pose synthesis and speckle adding) in SAR object detection. Furthermore, Du et al. [99] proposed a displacement- and rotation-insensitive CNN and claimed that data augmentation using training samples is necessary and critical during the preprocessing stage. On the same data set, instead of treating a CNN as an end-to-end model, Wagner [101] and, similarly, Gao [102] integrated a CNN and an SVM by first using a CNN to extract features and then feeding the features to an SVM for final prediction. Specifically, Gao et al. [103] added class-separation information to the cross-entropy cost function as a regularization term, which they showed explicitly facilitated intraclass compactness and separability and improved the quality of the extracted features. More recently, Furukawa [104] proposed VersNet, an encoder–decoder-style segmentation network, to not only identify but also localize multiple objects in an input SAR image. Moreover, Zhang et al.
[95] proposed an approach based on multiaspect image sequences as a preprocessing step. They accounted for backscattering signals from different viewing geometries, followed by feature extraction through Gabor filters and dimensionality reduction; they eventually fed the results to a bidirectional LSTM model for the joint recognition of targets. This SAR automatic target recognition (ATR) framework is presented in Figure 4.

Ship detection is another prominent SAR object detection task. Early studies of applying deep learning models to ship detection [105]–[109] mainly consisted of two stages: first, cropping patches from the whole SAR image and then identifying whether cropped patches belonged to target objects by using a CNN. Because of fixed patch sizes, these methods were not robust enough to accommodate variations in ship geometry, such as size and shape. This problem was overcome by using region-based CNNs [110], [111], with the creative use of skip connections and feature fusion techniques in later literature. For example, Li et al. [112] fused features of the last three convolution layers before feeding them to a region proposal network (RPN). Kang et al. [113] introduced a contextual region-based network that fused features from different levels. Meanwhile, to make the most use of features of different resolutions, Jiao et al. [114] densely connected each layer to subsequent ones and fed features from all the layers to a separate RPN to generate proposals; in the end, the best proposal was chosen based on an intersection-over-union score.

In more recent works on SAR object detection, scientists have tried to explore many other interesting ideas to complement current efforts. Dechesne et al. [115] proposed a multitask network that simultaneously learned to detect, classify, and estimate the length of ships. Mullissa et al. [84] showed that CNNs can be trained directly with complex-valued SAR data; Kazemi et al. [117] performed object classification using an RNN-based architecture directly on received SAR signals instead of processed SAR images; and Rostami et al. [118] and Huang et al. [119] explored knowledge transfers and transfer learning from other domains to the SAR arena for object detection. Perhaps one of the more interesting recent works in this application area relates to building detection, by Shahzad et al. [120]. The authors tackle the problem of very-high-resolution (VHR) SAR building detection using an FCN [121] architecture for feature extraction, followed by a conditional random field RNN [122], which helps give similar weights to neighboring pixels. This architecture produced building segmentation masks with up to 93% accuracy. An example of the detected buildings can be seen in Figure 5, where Figure 5(a) is the amplitude of
the input TerraSAR-X image of Berlin and Figure 5(b) is the predicted building mask.

FIGURE 4. A flowchart of the multiaspect-aware bidirectional approach for SAR automatic target recognition proposed in [95]. TPLBP: three-patch local binary pattern.

Another major contribution made in that paper addresses the lack of training data by introducing an automatic annotation technique, which annotates the SAR tomography data using Open Street Map (OSM) data. As an extension of the preceding work, Sun et al. [123] tackled the problem of individual building segmentation in large-scale urban areas. They proposed a conditional geographic information system (GIS)-aware network (CG-Net) that learns multilevel visual features and employs building footprint data to normalize these features for predicting building masks. Thanks to the novel network architecture and the large number of building labels automatically generated from an accurate digital elevation model (DEM) and GIS building footprints, this network achieves an F1 score of 75.08% for individual building segmentation. With the predicted building masks, large-scale level-of-detail 1 building models are reconstructed, with a mean height error of 2.39 m.

Overall, deep learning has shown very good performance in existing SAR object detection tasks. There are two main challenges that the algorithm designer needs to keep in mind when tackling any SAR object detection task. The first relates to identifying characteristics of SAR imagery, such as imaging geometry, the size of objects, and speckle noise. The second and bigger difficulty concerns the lack of good-quality, standardized data sets. As we observed, the most popular data set, MSTAR, is too easy for deep nets, and, for ship detection, the majority of authors create their own data sets, which makes it very hard to judge the quality of the proposed algorithms and even harder to compare different algorithms. An example of a difficult-to-create data set can be found in global building detection. The shape, size, and style of buildings change quite drastically from region to region, and so a good data set for this purpose requires training examples taken from buildings from around the world, a task that requires significant effort to produce high-quality annotations of enough structures that deep nets can learn from them.

PARAMETER INVERSION

Parameter inversion from SAR images is a challenging field in SAR applications. As one important branch, ice concentration estimation is now attracting great attention due to its importance to ice monitoring and climate research [124]. Since there are complex interactions between SAR signals and sea ice [125], empirical algorithms face difficulties with interpreting SAR images for accurate ice concentration estimation. Wang et al. [8] resorted to a CNN for generating ice concentration maps from dual-polarized SAR images. Their method takes image patches of intensity-scaled dual-band SAR images as inputs and directly outputs ice concentrations. In [126] and [127], Wang et al. employed various CNN models to estimate ice concentrations from SAR images during the melt season.
Labels
were produced by ice experts via visual interpretation. The algorithm was tested on dual-polarization RadarSat-2 data. Since the problem under consideration concerns the regression of a continuous value, the mean square error is selected as the loss function. Experimental results demonstrate that CNNs can offer more accurate results than comparable operational products. In a different application, Song et al. [130] used a deep CNN, including five pairs of convolutional and maximum pooling layers followed by two fully connected layers, for inverting rough surface parameters from SAR images. The network training was based on simulated data, due solely to the scarcity of real training material. The method was able to invert the desired parameters with reasonable accuracy, and the authors showed that training a CNN for parameter inversion purposes could be done quite efficiently. Furthermore, Zhao et al. [131] designed a CV-CNN to directly learn physical scattering signatures from PolSAR images. The authors notably proposed a framework to automatically generate labeled data, which led to a supervised learning algorithm for the aforementioned parameter inversion. The approach is similar to the study presented in [132], where the authors used deep learning for SAR image colorization and for learning a full PolSAR image from single-polarization data. Another interesting application of deep learning in parameter inversion was recently published in [133]. The authors propose a deep neural network architecture containing a CNN and a GAN to automatically learn SAR image simulation parameters from a small number of real SAR images. They later feed the learned parameters to a SAR simulator, such as RaySAR [134], to generate a wide variety of simulated SAR images, which can increase training data production and improve the interpretation of SAR images that have complex backscattering scenarios.

On the whole, deep learning-based parameter estimation for SAR applications has not yet been fully exploited. Unfortunately, most of the remote sensing community's focus has been devoted to classical problems, which overlap with computer vision tasks, such as classification, object detection, segmentation, and denoising. One reason for this might be that, since parameter estimation usually requires the incorporation of appropriate physical models and tackles the problem at hand as regression rather than classification, domain knowledge is quite essential for applying deep learning to such tasks, especially for SAR images, with their peculiar physical characteristics. One interesting study [87], described in detail in the "Terrain Surface Classification" section, designs discriminative features through the spectral analysis of complex-valued SAR data and is an important work toward including deep learning in parameter inversion studies using SAR data. We hope that, in the future, more studies will be carried out in this direction.

DESPECKLING

Speckle, which is caused by the coherent interaction among scattered signals from subresolution objects, often makes processing and interpreting SAR images difficult. Therefore, despeckling is a crucial procedure before applying SAR images to various tasks. Conventional methods aim at removing speckle either spatially, where local spatial filters, such as Lee [135], Kuan [136], and Frost filters [137], are employed, or by using wavelet-based methods [138]–[140]. For a full overview of these techniques, the reader is referred to [141].
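Because speckle is multiplicative, training pairs for learning-based despeckling are frequently generated by corrupting a clean reference with simulated speckle. Under the common fully developed speckle assumption, L-look intensity speckle is gamma distributed with unit mean, as in the following minimal sketch (Python/NumPy; the flat test image and look numbers are arbitrary):

import numpy as np

def add_speckle(intensity, looks=1, rng=None):
    """Multiply a clean intensity image with fully developed, L-look speckle.
    For L looks, the speckle follows a Gamma(shape=L, scale=1/L) law: unit mean,
    variance 1/L, so more looks means weaker speckle."""
    rng = np.random.default_rng() if rng is None else rng
    speckle = rng.gamma(shape=looks, scale=1.0 / looks, size=intensity.shape)
    return intensity * speckle

clean = np.ones((64, 64))                      # flat, homogeneous scene
noisy_1look = add_speckle(clean, looks=1)
noisy_4look = add_speckle(clean, looks=4)
print(noisy_1look.std(), noisy_4look.std())    # roughly 1.0 and 0.5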
FIGURE 5. (a) A VHR TerraSAR-X image of Berlin and (b) the predicted building mask [120].

During the past decade, patch-based methods for speckle reduction have gained popularity due to their ability to preserve spatial features while not sacrificing image resolution [142]. Deledalle et al. [143] proposed one of the first nonlocal patch-based methods applied to speckle reduction by taking into account the statistical properties of speckle, combined with the original nonlocal image denoising algorithm introduced in [144]. A vast number of variations of the nonlocal method for SAR despeckling have been proposed, with the most notable ones included in [145] and [146].
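The nonlocal averaging idea behind [143] and [144] can be sketched in a few lines: every pixel is replaced by a weighted mean of pixels whose surrounding patches look similar. The brute-force example below (Python/NumPy) uses the generic Gaussian patch-distance weighting of [144]; the SAR-specific speckle statistics of [143] are not reproduced, and the filter parameter h is tuned only to this toy example:

import numpy as np

def nonlocal_means(img, patch=3, search=7, h=2.0):
    """Naive nonlocal means, O(N * search^2 * patch^2); for illustration only."""
    pr, sr = patch // 2, search // 2
    padded = np.pad(img, pr + sr, mode="reflect")
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            ic, jc = i + pr + sr, j + pr + sr
            ref = padded[ic - pr:ic + pr + 1, jc - pr:jc + pr + 1]
            weights, values = [], []
            for di in range(-sr, sr + 1):
                for dj in range(-sr, sr + 1):
                    cand = padded[ic + di - pr:ic + di + pr + 1,
                                  jc + dj - pr:jc + dj + pr + 1]
                    d2 = np.mean((ref - cand) ** 2)        # patch distance
                    weights.append(np.exp(-d2 / h ** 2))   # similar patches weigh more
                    values.append(padded[ic + di, jc + dj])
            weights = np.array(weights)
            out[i, j] = np.sum(weights * np.array(values)) / np.sum(weights)
    return out

rng = np.random.default_rng(0)
noisy = np.ones((32, 32)) * rng.gamma(1.0, 1.0, (32, 32))   # speckled flat scene
print(noisy.std(), nonlocal_means(noisy).std())             # the filtered image is smoother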
However, on the one hand, the manual selection of appropriate parameters for conventional algorithms is not easy and is sensitive to reference images. On the other hand, it is difficult to achieve a balance between preserving distinct image features and removing artifacts through empirical despeckling methods. To address these limitations, methods based on deep learning have been developed. Inspired by the success of image denoising using a residual learning network architecture in the computer vision community [147], Chierchia et al. [60] first introduced a residual learning CNN for SAR image despeckling by presenting a 17-layer CNN for learning to subtract speckle components from noisy images. Considering that speckle noise is assumed to be multiplicative, a homomorphic approach with coupled logarithmic and exponential transformations is performed before and after feeding images to the network. In this case, multiplicative speckle noise is transformed into an additive form and can be recovered by residual learning, where the logarithmic speckle noise is regarded as the residual. As shown in Figure 6, an input logarithmic noisy image is identically mapped to a fusion layer via a shortcut connection and then added elementwise with the learned residual image to produce a logarithmic clean image. Afterward, denoised images can be obtained by an exponential transformation.

Wang et al. [9] proposed an image-despeckling CNN that can directly learn denoised images via a componentwise division-residual layer with skip connections. In other words, homomorphic processing is not introduced for transforming multiplicative noise into additive noise, and, at the final stage, the noisy image is divided by the learned noise to yield a clean image. As a step forward with respect to the two aforementioned residual-based learning methods, Zhang et al. [148] employed a dilated residual network (DRN), SAR–DRN, instead of simply stacking convolutional layers. Unlike [60] and similar to [9], SAR–DRN is trained in an end-to-end fashion using a combination of dilated convolutions and skip connections with a residual learning structure, which indicates that prior knowledge, such as a noise description model, is not required in the workflow.

In [149], Yue et al. proposed a novel deep neural network architecture specifically designed for SAR despeckling. It used a CNN to extract image features and reconstruct a discrete radar cross section (RCS) probability density function (PDF). It was trained by a hybrid loss function that measured the distance between the actual SAR image intensity PDF and the estimated one derived from convolution between the reconstructed RCS PDF and a prior speckle PDF. Experimental results demonstrated that the proposed despeckling neural network could achieve performance comparable to nonlearning state-of-the-art methods. The unique distribution of SAR intensity images was also taken into account in [150]. The authors proposed a different loss function, which contained three terms between the true and reconstructed images: the common L2 loss, the L2 difference between the gradients of the two images, and the Kullback–Leibler divergence between the distributions of the two images. The three terms are designed to emphasize spatial details, the identification of strong scatterers, and speckle statistics, respectively. Experiments in [150] show improved performance compared to the SAR–block-matching 3D algorithm (BM3D) [128] and SAR–DRN [148].
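The homomorphic residual strategy of [60], summarized in Figure 6, can be captured compactly: take the logarithm of the noisy image, let a CNN predict the now-additive log-speckle residual, subtract it, and exponentiate. The PyTorch sketch below uses a deliberately small stand-in network; the 17-layer architecture and training procedure of [60] are not reproduced:

import torch
import torch.nn as nn

class ResidualDespeckler(nn.Module):
    """Homomorphic despeckling: log -> CNN residual -> subtract -> exp."""
    def __init__(self, channels=32, depth=4):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.residual_cnn = nn.Sequential(*layers)   # predicts log-domain speckle

    def forward(self, noisy_intensity, eps=1e-6):
        log_noisy = torch.log(noisy_intensity + eps)    # multiplicative -> additive noise
        log_clean = log_noisy - self.residual_cnn(log_noisy)
        return torch.exp(log_clean)                     # back to the intensity domain

net = ResidualDespeckler()
restored = net(torch.rand(2, 1, 64, 64) + 0.1)
print(restored.shape)   # (2, 1, 64, 64)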
FIGURE 6. The CNN architecture for SAR image despeckling [60]. BN: batch normalization.

In [57], the problem of despeckling was tackled using a time series of images. Employing a stack of images for despeckling is not unique to deep learning-based methods, as recently demonstrated in [151]. In [57], the authors utilized a multilayer perceptron with several hidden layers to learn nonlinear intensity characteristics of training image patches. This approach showed promising results and comparable performance with state-of-the-art despeckling algorithms. Again using single images instead of time series, in [36], the authors proposed a deep encoder–decoder CNN architecture with a focus on feature preservation, which is a weakness of CNNs. They modified the U-Net [32] to accommodate speckle statistical features. Another notable CNN approach was introduced in [129], where the authors employed a nonlocal structure, while the weights for pixelwise similarity measures were assigned using a CNN. The results of this approach, called CNN–nonlocal means (NLM), are reported in Figure 7, where the superiority of the method with respect to both feature preservation and speckle reduction is clearly observed.

One of the drawbacks of the aforementioned algorithms is the requirement of noise-free and noisy image pairs for training. Often, those training data are simulated using optical images with multiplicative noise. This is, of course, not
ideal for real SAR images. Therefore, one elegant solution is the noise-to-noise framework [152], where the network requires only two noisy images of the same area. The authors of [152] prove that the network is able to learn a clean representation of the image, given that the noise distributions of the two noisy images are independent and identical. This idea was employed for SAR despeckling in [153]. The authors made use of multitemporal SAR images of the same area as the input to the noise-to-noise network. To mitigate the effect of the temporal change between the input SAR image pairs, they multiplied the original loss function by a patch similarity term.

FIGURE 7. A comparison of speckle reduction among SAR–BM3D [128], SAR–CNN [60], and CNN–NLM applied to a small strip of Constellation of Small Satellites for Mediterranean Basin Observation–SkyMed data above Caserta, Italy, where the reference clean image has been obtained by temporal multilooking applied to a stack of SAR images [129]. (a) The clean image. (b) The noisy image. (c) SAR–BM3D is applied. (d) SAR–CNN is applied. (e) CNN–NLM is applied.

From the deep learning-based despeckling methods reviewed in this section, it can be observed that most methods employ CNN-based architectures with single images of a scene for training; they either output clean images in an end-to-end fashion or propose residual-based techniques to learn the underlying noise models. With the availability of large archives of time series thanks to the Sentinel-1 mission, an interesting direction is to exploit the temporal correlation of speckle characteristics for despeckling applications. One critical issue is oversmoothing in despeckling, and it needs to be addressed. Many of the CNN-based methods perform well in terms of speckle removal but are not able to preserve sharp edges. This is quite problematic in despeckling high-resolution SAR images of urban areas, in particular. Another problem in supervised deep learning-based despeckling techniques concerns the lack of ground truth
data. In many studies, the training data set is built by corrupting optical images through multiplicative noise. This is far from realistic for despeckling applied to real SAR data. Therefore, despeckling in an unsupervised manner would be highly desirable and worth attention.

InSAR

InSAR is one of the most important SAR techniques, and it is widely used in reconstructing the topography of the Earth's surface, i.e., DEM generation [65], [154], [155], and in detecting topographical displacements, e.g., monitoring volcanic eruptions [156]–[158], earthquakes [159], [160], land subsidence [161], and urban areas by using time-series methods [162]–[164]. The principle of InSAR is to first measure the interferometric phase between signals received by two antennas located at different positions and then extract topographic information from the obtained interferogram by unwrapping and converting the absolute phase to height. However, an actual interferogram often suffers from a large number of singular points, which originate from interference distortion and noise in radar measurements. These points result in unwrapping errors and, consequently, low-quality DEMs. To tackle this problem, Ichikawa and Hirose [165] applied a complex-valued neural network (CV-NN) in the spectral domain to restore singular points. With the help of the complex Markov random field filter [166], they aimed at learning ideal relationships between the spectrum of neighboring pixels and that of the center pixels via a one-hidden-layer CV-NN. Notably, the center pixels of each training sample are supposed to be ideal points, which indicates that singular points are not fed to the network during the training procedure. Similarly, Oyama and Hirose [167] restored singular points with a CV-NN in the spectrum domain. Related to topography extraction, Costante et al. [169] proposed a fully CNN-based encoder–decoder architecture for estimating DEMs from single-pass image acquisitions. They demonstrated that this model was capable of extracting high-level features from input radar images using an encoder section and then reconstructing full-resolution DEMs via a decoder section. Moreover, the network can potentially resolve the layover phenomenon in a single-look SAR image from its contextual features.

In addition to reconstructing DEMs, Schwegmann et al. [170] presented a CNN-based technique to detect subsidence deformations from interferograms. They employed a nine-layer network to extract salient information from interferograms and displacement maps for discriminating deformation targets from deformation-like targets. Furthermore, Anantrasirichai et al. [10], [171], [172] used a pretrained CNN to automatically detect volcanic ground deformation through InSAR images. They divided each image into patches, relabeled the patches with binary labels, i.e., "background" and "volcano," and finally fed them to the network to predict volcano deformation. In [173], they further improved their method to be able to detect slow-moving volcanoes using a time series of interferograms. In another study related to automatic volcanic deformation detection, Valade et al. [168] designed and trained a CNN from scratch to learn a decorrelation mask from input wrapped interferograms; the CNN then was used to detect volcanic ground deformation. A flowchart of this approach appears in Figure 8. The training in both [168] and [173] was based on simulated data.
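For reference, converting an unwrapped interferometric phase into a line-of-sight displacement map, as in the workflow of Figure 8, requires only the radar wavelength; the sketch below (Python/NumPy; the wavelength and the synthetic phase bowl are illustrative assumptions) also derives a simple deformation score as the standard deviation of the map:

import numpy as np

def phase_to_los_displacement(unwrapped_phase, wavelength=0.0555):
    """Repeat-pass InSAR: displacement = phase * wavelength / (4 * pi).
    The default wavelength (~5.55 cm) roughly corresponds to C-band sensors."""
    return unwrapped_phase * wavelength / (4.0 * np.pi)

# Synthetic unwrapped phase with a smooth subsidence bowl (values in radians).
y, x = np.mgrid[-64:64, -64:64]
phase = -6.0 * np.exp(-(x ** 2 + y ** 2) / (2 * 30.0 ** 2))

deformation = phase_to_los_displacement(phase)   # meters, along the line of sight
score = np.std(deformation)                      # crude scalar "deformation score"
print(deformation.min(), score)                  # about -0.026 m peak subsidence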
FIGURE 8. The workflow of the volcano deformation (DEF) detection proposed in [168]. The CNN is trained on simulated data and later used to estimate phase gradients and a decorrelation mask from the input wrapped interferograms to locate ground deformation caused by volcanoes. (a) The CNN training. (b) The phase gradient detection. (c) The phase unwrapping and score computation. (d) The dissemination.

Another geophysics-motivated example of using deep learning on InSAR data, which was actually proposed earlier than the previously mentioned CNN-based studies, can be found in [174]–[176], where the authors used simple feed-forward shallow neural networks for seismic event characterization and automatic seismic source parameter inversion by exploiting the power of neural networks in solving nonlinear problems. Recently, deep learning has been utilized for tomographic processing, as well. An unfolded deep network that involves vector-approximate message-passing algorithms was proposed in [177]. Experiments with simulated and real data were performed, showing spectral estimation gains and achieving competitive performance. In [178], a real-valued deep neural network was applied for multiple-input, multiple-output SAR 3D imaging. It displayed better superresolution power than other compressive sensing-based methods.

In summary, it can be concluded that the use of deep learning methods in InSAR is still at a very early stage. Although deep learning has been incorporated in different applications combined with InSAR, the full potential of interferograms has not been fully exploited, except in the pioneering work of Hirose [179]. Many applications treat interferograms and deformation maps obtained from interferograms as images similar to RGB and gray-scale ones, and therefore the complex nature of interferograms has remained unnoticed. Apart from this issue, as with deep learning-based SAR despeckling, the lack of ground truth data for detection and image restoration problems motivates the development of semisupervised and unsupervised algorithms that combine deep learning and InSAR. Otherwise, a training database consisting of interferograms for different scenarios and different phase contributions could be beneficial for supervised learning applications. Simulation-based interferogram generation for the latter was recently proposed in [180].

SAR–OPTICAL DATA FUSION

The fusion of SAR and optical images can provide complementary information about targets. However, considering the two different sensing modalities, the prior identification and the coregistration of corresponding images are challenging [181] but compulsory. For the purpose of identifying and matching SAR and optical images, many current methods resort to deep learning, given its powerful capabilities of extracting effective features from complex images. In [58], the authors proposed a CNN for identifying corresponding image patches of VHR optical and SAR
imagery of complex urban scenes. Their network consists of two streams: one designed for extracting features from optical images and one responsible for learning features from SAR images. Next, the extracted features are fused via a concatenation layer for further binary prediction of their correspondence. A selection of true positives, false positives, false negatives, and true negatives of SAR–optical image patches from [58] is presented in Figure 9. Similarly,
Hughes et al. [11] proposed a pseudo-Siamese CNN for learning a multisensor correspondence predictor for SAR and optical image patches. Notably, the networks in [11] and [58] are trained and validated on the SARptical data set [182], [183], which is specifically built for the joint analysis of VHR SAR and optical images in dense urban areas. In [184], the authors proposed a deep learning framework that can obtain an end-to-end mapping between image patch pairs and their matching labels. An image pair is first transformed into two 1D vectors and then concatenated to build a large 1D vector as the network input. Then, hidden layers are stacked for learning the mapping between input vectors and output binary labels, which indicate their correspondence. For the purpose of matching SAR and optical images, Merkle et al. [185] presented a CNN that consists of a feature extraction stage (a Siamese network) and a similarity measure stage (a dot product layer). Specifically, features of input optical and SAR images are extracted via two separate nine-layer branches and then fed to a dot product layer for predicting the shift of the optical image within the larger SAR reference patch. Experimental results indicate that this deep learning-based method outperforms state-of-the-art matching approaches [186], [187]. Furthermore, Abulkhanov et al. [188] successfully trained a neural network to build feature point descriptors to identify corresponding patches among SAR and optical images and match the detected descriptors using the random sample consensus algorithm [189].

In contrast to training a model to identify corresponding image patches, Merkle et al. [190] first employed a conditional GAN (cGAN) to generate artificial SAR-like images from optical images and then matched them with real SAR images. The authors demonstrated that the matching accuracy and precision improved through the proposed strategy. Inspired by that study, more researchers resorted to using GANs for the purpose of SAR–optical image matching (see [191] and [192] for a review). With respect to applications of SAR and optical image matching, Yao et al. [193] aimed at applying SAR and optical images to semantic segmentation with deep neural networks. They collected optical patches from Google Earth corresponding to TerraSAR-X patches and built ground truths using data from OSM. Then, SAR and optical images were separately fed to different CNNs to predict semantic labels (buildings, natural areas, land use, and water). Despite the fact that their experimental results did not outperform the state of the art [194], likely because of the network design or the training strategy, they deduced that introducing advanced models and simultaneously using both data sources can greatly improve the performance of semantic segmentation. Another application, mentioned in [195], demonstrated that standard fusion techniques for SAR and optical images require data from both sources, which indicates that it is still not easy to interpret SAR images without the support of optical ones. To address this issue, Schmitt et al. [195] proposed an automatic colorization network composed of a VAE and a mixture density network [196] to predict artificially colored SAR images (i.e., Sentinel-1 images). These images proved to disclose more information to human interpreters than the original SAR data did.
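A minimal PyTorch sketch of the two-stream correspondence idea used in [11] and [58] follows: separate convolutional streams for the SAR and optical patches (without shared weights, hence pseudo-Siamese), feature concatenation, and a binary matched/unmatched prediction; the layer sizes and patch dimensions are our assumptions, not the published configurations:

import torch
import torch.nn as nn

def conv_stream(in_ch):
    # Independent feature extractor for one modality (weights are not shared).
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
    )

class PseudoSiameseMatcher(nn.Module):
    def __init__(self, patch=32):
        super().__init__()
        self.sar_stream = conv_stream(1)      # single-channel SAR amplitude patch
        self.opt_stream = conv_stream(3)      # RGB optical patch
        feat = 64 * (patch // 4) ** 2
        self.head = nn.Sequential(nn.Linear(2 * feat, 128), nn.ReLU(),
                                  nn.Linear(128, 1))   # logit: matched vs. unmatched

    def forward(self, sar_patch, opt_patch):
        fused = torch.cat([self.sar_stream(sar_patch),
                           self.opt_stream(opt_patch)], dim=1)
        return self.head(fused)

model = PseudoSiameseMatcher()
logit = model(torch.randn(8, 1, 32, 32), torch.randn(8, 3, 32, 32))
print(torch.sigmoid(logit).shape)   # (8, 1) correspondence probabilities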
In [42], the authors tackled the problem of cloud removal from optical imagery. They introduced a cGAN architecture to fuse SAR and cloud-corrupted multispectral data for generating cloud- and haze-free multispectral optical information. Experiments proved the effectiveness of the proposed network for removing clouds from multispectral data with auxiliary SAR data. Extending previous multimodal networks for cloud removal, the authors of [43] proposed a cycle-consistent GAN architecture [197] that utilizes an image forward–backward translation consistency loss. Cloud-covered optical information is reconstructed via SAR data fusion, while changes to cloud-free areas are minimized through the cycle consistency loss. The cycle-consistent architecture facilitates training without pixelwise correspondences between cloudy input and cloud-free target optical imagery, relaxing the requirements for the training data set.

FIGURE 9. Randomly selected patches obtained from the testing phase of the network for SAR–optical image patch correspondence detection proposed in [11]. (a) True positives. (b) False positives. (c) False negatives. (d) True negatives.

In summary, it can be seen that the utilization of deep learning methods for SAR–optical data fusion has been a hot topic in the remote sensing community. Although a handful of data sets consisting of corresponding optical and SAR image patches is available for different terrain types and applications, one of the biggest problems remains the scarcity of high-quality training data. Semisupervised methods, as proposed in [198], seem to be a viable option
to tackle the issue. A great challenge in SAR–optical image matching concerns the extreme difference between the two sensors’ viewing geometries. For this, it is important to exploit auxiliary 3D data to assist the training data generation. EXISTING BENCHMARK DATA SETS AND THEIR LIMITATIONS To train and evaluate deep learning models, large data sets are indispensable. Unlike RGB images in the computer vision community, which can be easily collected and interpreted, SAR images are much more difficult to annotate due to their complex properties. Our research shows that big SAR data sets created for the primary purpose of deep learning investigations are nearly nonexistent in the community. In recent years, only a few SAR data sets have been made public for training and assessing deep learning models. In the following, we categorize those data sets according to their bestsuited deep learning problem and focus on openly accessible and well-curated large data sets (see Table 1 for summaries the open SAR data sets). In particular, we consider the following categories of deep learning problems in SAR: ◗◗ Image classification: Each pixel or patch in one image is classified into a single label. This is often the case in typical land use/land cover classification problems. TABLE 1. AVAILABLE OPEN SAR DATA SETS. NAME DESCRIPTION SUITABLE TASKS RELATED WORK So2Sat LCZ421 [200], TensorFlow 2 This data set contains 400,673 pairs of corresponding Sentinel-1 dual-polarity image patches, Sentinel-2 multispectral image patches, and manually labeled LCZ classes across 42 urban agglomerations (plus 10 additional smaller areas) around the globe. It is the first Earth observation data set that provides a quantitative measure of the label uncertainty, achieved by having a group of domain experts cast 10 independent votes for 19 cities in the data set. Image classification Data fusion Quantification of uncertainties [201] OpenSARUrban3 [199] This data set includes 33,358 Sentinel-1 dual-polarity image patches covering 21 major cities in China, labeled with 10 classes of urban scenes. Image classification SEN12MS 4 [202] In this data set, there are 180,748 corresponding image triplets containing Sentinel-1 dual-polarity SAR data, Sentinel-2 multispectral imagery, and MODIS-derived land cover maps, covering all inhabited continents during all meteorological seasons. Image classification Semantic segmentation Data fusion MSAW5 [204] This data set contains quad-polarity X-band SAR imagery from Capella Space, with a 0.5-m spatial resolution, which covers 120 km2 in the area of Rotterdam. A total of 48,000 unique building footprints are labeled with associated height information curated from the 3D Basis Registratie Adressen en Gebouwen data set. Semantic segmentation PolSAR Image Data Set on San Francisco6, Label7 [205] The data set includes PolSAR images of San Francisco from five different sensors. Each image was densely labeled to five or six classes, such as mountain, water, high-density urban, low-density urban, vegetation, developed, and bare soil. Image classification Semantic segmentation Data fusion [206] MSTAR8 [207] This data set contains 17,658 X-band VHR SAR images chips (patches) of 10 classes of different vehicles plus one class of a simple geometricshaped target. SAR images of pure clutter are also included. 
Object detection Scene classification [97], [98], [208] OpenSARShip 2.0 9 [209] This data set includes 34,528 Sentinel-1 SAR image chips of ships, with ship geometric information, types, and corresponding AIS information. Object detection Scene classification [210] SAR-Ship data set10 [211] Here, there are 43,819 Gaofen-3 and Sentinel-1 image chips of different ships. Each image chip has a dimension of 256 × 256 pixels in range and azimuth. Object detection Scene classification SARptical11 [212] The SARptical data set includes 10,108 coregistered pairs of TerraSAR-X VHR spotlight image patch and UltraCam aerial RGB image patches for Berlin. The coregistration is defined by the matching of the 3D position of the center of the image pair. Image matching [11], [183] SEN1-212 [203] This data set contains 282,384 pairs of corresponding Sentinel-1 singlepolarization-intensity and Sentinel-2 RGB image patches collected across the globe. The patches are 256 × 256 pixels. Image matching Data fusion [202] [203] 1https://doi.org/10.14459/2018mp1483140. 2https://www.tensorflow.org/datasets/catalog/so2sat. 3https://doi.org/10.21227/3sz0-dp26. 4https://mediatum.ub.tum.de/1474000. 5https://spacenet.ai/sn6-challenge/. 6https://www.ietr.fr/polsarpro-bio/san-francisco/. 7https://github.com/liuxuvip/PolSF. 8https://www.sdms.afrl.af.mil/index.php?collection=mstar. 9http://opensar.sjtu.edu.cn/Data/Search?key=OpenSARShip. 10https://github.com/CAESAR-Radi/SAR-Ship-Dataset. 11https://syncandshare.lrz.de/getlink/figixjRV9idETzPgG689dGB/SARptical_data.zip. 12 https://mediatum.ub.tum.de/1436631. DECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE 159
◗◗ Scene classification: Similar to image classification, one image or patch is classified into a single label. However, one scene is usually much larger than an image patch. Hence, a different network architecture is required. ◗◗ Semantic segmentation: One image or patch is segmented to a classification map of the same dimension. Training such neural networks also requires densely annotated data. ◗◗ Object detection: This is much like scene classification. However, detection often requires the estimation of the object location. ◗◗ Registration/matching: This provides binary classification (matched and unmatched) and estimates the translation between two image patches. Such tasks require that pairs of two different image patches be matched as training data. IMAGE/SCENE CLASSIFICATION So2Sat LCZ42 So2Sat LCZ42 [200] follows the local climate zones (LCZs) classification scheme. The data set consists of 400,673 pairs of dual-polarity Sentinel-1 and multispectral Sentinel-2 image patches from 42 urban agglomerations, plus 10 additional smaller areas, across five continents. The image patches are hand labeled into one of the 17 LCZ classes [213]. The Sentinel-1 image patches contain a geocoded, single-look complex image as well as a despeckled Lee-filtered variant. In particular, it is the first Earth observation data set that provides a quantitative measure of the label uncertainty, achieved by letting a group of domain experts cast 10 independent votes covering 19 cities. It therefore can be considered a large-scale data fusion and classification benchmark data set for cutting-edge ML methodological developments, such as automatic topology learning, data fusion, and the quantification of uncertainties. OpenSARUrban OpenSARUrban [199] consists of 33,358 patches of Sentinel-1 dual-polarity images covering 21 major cities in China. The data set was manually annotated according to a hierarchical classification scheme, with 10 classes of urban scenes at its finest level. Each image patch has a dimension of 100 × 100 pixels, with a pixel spacing of 10 m [the Sentinel-1 ground-range-detected (GRD) product]. This data set can support deep learning studies of urban target characterization and content-based SAR image queries. Figure 10 shows samples. expect this data set to support the community in developing sophisticated deep learning-based approaches for common tasks, such as scene classification and semantic segmentation for land cover mapping. MULTISENSOR ALL-WEATHER MAPPING The Multisensor All-Weather Mapping (MSAW) [204] data set includes high-resolution SAR data, which covers 120 km2 in the area of Rotterdam, The Netherlands. The quad-polarized X-band SAR imagery from Capella Space, with a 0.5-m spatial resolution, was used for the SpaceNet 6 Challenge. In total, 48,000 unique building footprints have been labeled with additional building heights. PolSAR IMAGE DATA SET ON SAN FRANCISCO This data set [205] consists of PolSAR images of San Francisco from eight different sensors, including Airborne SAR, Advanced Land Observing Satellite (ALOS)-1, ALOS-2, RadarSat-2, Sentinel-1A, Sentinel-1B, Gaofen-3, and Radar Imaging Satellite (data compiled by E. Pottier of the Institute of Electronics and Telecommunications of Rennes). Five of the eight images were densely labeled to five or six land use land cover classes in [205]. These densely annotated images correspond to roughly 3,000 training patches of 128 × 128 pixels. 
OBJECT DETECTION

MSTAR
MSTAR [207] is one of the earliest data sets for SAR target recognition. It consists of 17,658 X-band SAR image chips (patches) of 10 classes of vehicles plus one class of simple geometric-shaped targets. The collected SAR image patches are 128 × 128 pixels, with a resolution of 1 ft in range and azimuth. In addition, 100 SAR images of clutter are provided. In our opinion, the number of image patches is relatively low for deep learning models, especially considering the number of classes. In addition, this data set represents a rather ideal and unrealistic scenario: vehicles are centered in the patch, and the clutter is quite homogeneous, without disturbing signals. However, considering the scarcity of such data sets, MSTAR is a valuable source for target recognition.

OpenSARShip 2.0
This data set [209] is based on its previous version, OpenSARShip [210]. It contains 34,528 Sentinel-1 SAR image patches of different ships, with automatic identification system (AIS) information. For each SAR image patch, the creators manually extracted the ship length, width, and direction as well as the vessel type by verifying the data on the Marine Traffic website [209]. Roughly one-third of the patches are extracted from Sentinel-1 GRD products, and the other two-thirds are from Sentinel-1 single-look complex products. OpenSARShip 2.0 is one of the handful of SAR data sets suitable for object detection.
SAR-SHIP DATA SET
This data set [211] was created using 102 Gaofen-3 and 108 Sentinel-1 images. It consists of 43,819 ship chips of 256 × 256 pixels in range and azimuth. The ships appear at distinct scales and against varied backgrounds; therefore, this data set can be employed for developing multiscale object detection models.

FUSAR–SHIP
The FUSAR–Ship data set [214] was created using space–time matched-up sets of Gaofen-3 SAR images and ship AIS messages. It consists of more than 5,000 ship chips with corresponding vessel information extracted from AIS messages, which can be used to trace back to the unique ship behind any particular chip.

AIR–SARShip 1.0/2.0
The AIR–SARShip data set [215] contains 31 (version 1.0) and 300 (version 2.0) SAR images from the Gaofen-3 satellite, including 1- and 3-m-resolution imagery acquired in different imaging modes, such as spotlight and stripmap. There are more than 10 object categories, including ships, tankers, fishing boats, and others. The scene types include ports, islands, reefs, and sea surfaces of different levels.

REGISTRATION/MATCHING

SARptical
The SARptical data set [183], [212] was designed for interpreting VHR spaceborne SAR images of dense urban areas. It consists of 10,108 pairs of corresponding VHR SAR and optical image patches whose locations are precisely coregistered in 3D. The patches are extracted from TerraSAR-X VHR spotlight images with a resolution better than 1 m and from UltraCam aerial optical images with a 20-cm pixel spacing, respectively. Unlike low- and medium-resolution images, high-resolution SAR and optical images of dense urban areas have very distinct geometries. Therefore, in the SARptical data set, the center points of each image pair are matched in 3D space via sophisticated 3D reconstruction and matching algorithms. The universal transverse Mercator coordinates of the center pixel of each pair are also made publicly available. This data set supports applications in multimodal data classification and SAR–optical image coregistration. However, we believe more training samples are required for learning the complicated SAR–optical image-to-image mapping.
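As a concrete illustration of the matched/unmatched formulation used with SARptical-style and SEN1-2-style patch pairs, the following is a minimal pseudo-Siamese sketch: two unshared convolutional branches (one per modality) whose features are fused for a binary decision. All layer sizes, patch dimensions, and names are hypothetical placeholders; this is not the architecture of [11] or of any other reviewed method.

import torch
import torch.nn as nn

def branch(in_ch):
    # Small convolutional encoder; one copy per modality with unshared
    # weights (hence "pseudo-Siamese"). Layer sizes are illustrative only.
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
    )

class PseudoSiamese(nn.Module):
    def __init__(self):
        super().__init__()
        self.sar_branch = branch(1)    # single-channel SAR intensity patch
        self.opt_branch = branch(3)    # RGB optical patch
        self.head = nn.Linear(64, 1)   # matched / unmatched logit

    def forward(self, sar, opt):
        f = torch.cat([self.sar_branch(sar), self.opt_branch(opt)], dim=1)
        return self.head(f)

model = PseudoSiamese()
logit = model(torch.randn(4, 1, 112, 112), torch.randn(4, 3, 112, 112))
loss = nn.functional.binary_cross_entropy_with_logits(
    logit.squeeze(1), torch.ones(4))   # dummy "matched" labels

In practice, the translation between the two patches can be estimated on top of such features, for example by scoring shifted candidate windows, as described in the registration/matching task definition above.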
SEN1-2
The SEN1-2 data set [203] includes 282,384 pairs of corresponding Sentinel-1 single-polarization-intensity and Sentinel-2 RGB image patches collected from across the globe and in all meteorological seasons. The patches are 256 × 256 pixels, and their distribution over the four seasons is roughly even. SEN1-2 is the first large open data set of this kind. We believe it will support further developments in the field of deep learning for remote sensing as well as multisensor data fusion, such as SAR image colorization and SAR–optical image matching.

OTHER DATA SETS

SAMPLE PolSAR IMAGES FROM THE EUROPEAN SPACE AGENCY
These data sets (https://earth.esa.int/web/polsarpro/data-sources/sample-datasets) include, for example, the Flevoland PolSAR data set, which several works use for agricultural land use/land cover classification. The authors of [216]–[218] manually labeled it according to different classification schemes.

SAR IMAGE LAND COVER
This data set [219] is not publicly available. Readers should contact the creator.

AIRBUS SHIP DETECTION CHALLENGE
This data set can be accessed at https://www.kaggle.com/c/airbus-ship-detection.

CONCLUSION AND FUTURE TRENDS
This article reviewed the state of the art of an important and underexploited research field: deep learning in SAR. Relevant deep learning models were introduced, and their applications in six application fields—terrain surface classification, object detection, parameter inversion, despeckling, InSAR, and SAR–optical data fusion—were analyzed in depth. Existing benchmark data sets and their limitations were discussed. In summary, despite early successes, the full exploitation of deep learning in SAR is mostly limited by 1) the lack of large and representative benchmark data sets and 2) deficiencies in tailored deep learning models, which make it difficult to fully account for SAR signal characteristics.

Looking forward, the years ahead will be exciting. Next-generation spaceborne SAR missions will simultaneously provide high resolution and global coverage, which will enable novel applications, such as monitoring the dynamic Earth. To retrieve geoparameters from these data, the development of new analytics methods is warranted, and deep learning is among the most promising. To fully unlock its potential in SAR/InSAR applications in this big SAR data era, there are several promising future directions, including the following:
◗◗ Large and representative benchmark data sets: As summarized in this article, there are only a handful of SAR benchmarks, in particular when multimodal ones are excluded. For instance, in SAR target detection, methods are mainly tested on a single benchmark data set, MSTAR, where only several thousand target samples (several hundred for each class) are provided for training. With respect to InSAR, due to the lack of ground truth, data sets are extremely deficient or nearly nonexistent. Large and representative expert-annotated benchmark data sets are in high demand in the SAR community and deserve more attention.
◗◗ Unsupervised deep learning: To bypass the deficiencies in annotated SAR data, unsupervised deep learning is a promising direction. These algorithms derive insights directly from the data themselves and serve feature learning, representation learning, and clustering, which could be further used for data-driven analytics. Autoencoders and their extensions, such as VAEs and deep embedded clustering algorithms, are popular choices. With respect to despeckling, the high complexity of SAR images and the lack of ground truth make it infeasible to produce appropriate benchmarks from real data. Noise2noise [152] is an elegant example of unsupervised denoising, in which the authors learn to denoise without clean data. Despite the nice visual appearance of the results, preserving details is a must for SAR applications.
◗◗ Interferometric data processing: Since deep learning methods were initially applied to perception tasks in computer vision, many methods resort to transforming SAR images, e.g., PolSAR images, into RGB-like images in advance, or they focus only on intensities. In other words, the most essential component of an SAR measurement—the phase information—is not appropriately considered. Although CV-CNNs are capable of learning phase information and show great potential for processing CV-SAR images, only a few such attempts have been made [83]. Extending CNNs to the complex domain, while preserving the precious phase information, would enable networks to learn features directly from raw data and would open up a wide range of SAR/InSAR applications.
◗◗ Quantification of uncertainties: Generally speaking, geoparameter estimates without uncertainty measures are considered invalid in remote sensing. Appropriately trained deep learning models can achieve highly accurate predictions, yet they fail to quantify the uncertainty of these predictions. Here, giving a statement about the predictive uncertainty, while considering both aleatoric uncertainty and epistemic uncertainty, is of crucial importance. The Bayesian deep learning community has developed a model-agnostic and easy-to-implement methodology to estimate both the data uncertainty and the model uncertainty within deep learning models [54], which is awaiting exploration by the SAR community (a minimal sketch is given at the end of this section).
◗◗ Large-scale nonlinear optimization problems: The development of inversion algorithms should keep pace with data growth. Fast solvers are demanded for many
advanced parameter inversion models, which often involve nonconvex, nonlinear, and complex-valued optimization problems, such as compressive sensing-based tomographic inversion and low-rank complex tensor decomposition for InSAR time series analysis. In some cases, the iterations of the optimization algorithm perform computations similar to those of the layers in a neural network, that is, a linear step followed by a nonlinear activation (see, for example, the iteratively reweighted least-squares approach). It is thus meaningful to replace computationally expensive optimization algorithms with unrolled deep architectures that can be trained from simulated data [50].
◗◗ Cognitive sensors: Radars—and SARs, in particular—are very complex and versatile imaging machines. A variety of modes (stripmap, spotlight, ScanSAR, terrain observation with progressive scans, and so on), swath widths, incidence angles, and polarizations can be programmed in near real time. Cognitive radars go a giant step further: they autonomously adapt their operational modes to the environment to be imaged through an intelligent interplay of transmit waveforms, adaptive signal processing on the receiver side, and learning. Cognitive SARs are still in their conceptual and experimental phase and are often justified by the stunning capabilities of the echolocation system of bats. In his pioneering article [116], Haykin defines three ingredients of a cognitive radar: "1) intelligent signal processing, which builds on learning through interactions of the radar with the surrounding environment; 2) feedback from the receiver to the transmitter, which is a facilitator of intelligence; and 3) preservation of the information content of radar returns, which is realized by the Bayesian approach to target detection through tracking." Such a SAR could, e.g., perform low-resolution yet wide-swath surveillance of a coastal area and, in a first step, detect objects of interest, such as ships, in real time. Based on such detection, the transmit waveform could be modified, for instance, by zooming into the region of interest, enabling a close-up look at an object and possibly classifying or even identifying it. Reinforcement (online) learning is part of the concept, as are fast and reliable detectors and classifiers (trained offline), e.g., based on deep learning. All of this is edge computing; the learning algorithms have to perform in real time and with the limited compute resources onboard the satellite or airplane.

Last but not least, technology advances in deep learning in remote sensing will be possible only if experts in remote sensing and ML work closely together. This is particularly true when it comes to SAR. Thus, we encourage more joint initiatives to work collaboratively toward deep learning-powered, explainable, and reproducible big SAR data analytics.

ACKNOWLEDGMENTS
The work of Xiao Xiang Zhu is jointly supported by the European Research Council, under the European Union's Horizon 2020 research and innovation program (grant ERC-2016-StG-714087); the Helmholtz Association, through Helmholtz AI, Munich Unit at Aeronautics, Space, and Transport, and through the Helmholtz Excellent Professorship Data Science in Earth Observation: Big Data Fusion for Urban Research; and the German Federal Ministry of Education and Research, through the international Future AI Lab AI4EO (grant 01DD20001).
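As promised in the quantification-of-uncertainties item above, the following is a minimal illustration of one simple member of the Bayesian deep learning toolbox referenced in [54]: Monte Carlo dropout. The model, tensor shapes, and sample count are hypothetical placeholders; the sketch only shows how repeated stochastic forward passes yield a prediction together with a spread that can serve as a crude measure of epistemic uncertainty.

import torch
import torch.nn as nn

# Toy regressor with dropout; keeping dropout active at inference time gives
# a crude Monte Carlo approximation of the predictive (epistemic) uncertainty.
model = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 1),
)

def mc_predict(model, x, n_samples=50):
    model.train()                      # keep dropout stochastic at inference
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)   # prediction and spread

x = torch.randn(16, 8)                 # placeholder input features
mean, std = mc_predict(model, x)

A full treatment would also model the aleatoric (data) uncertainty, e.g., by predicting a per-sample variance, as discussed in [54].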
AUTHOR INFORMATION Xiao Xiang Zhu (xiaoxiang.zhu@dlr.de) received her M.Sc., Dr.-Ing., and habilitation degrees in signal processing from the Technical University of Munich (TUM), Munich, Germany, in 2008, 2011, and 2013, respectively. She is a professor of data science in Earth observation at TUM and the head of the Department of Earth Observation Data Science, Remote Sensing Technology Institute, German Aerospace Center, Wessling, 82234, Germany. Since 2019, she has been a co-coordinator of the Munich Data Science Research School and the head of the aeronautics, space, and transport research field at the Helmholtz Association, Bonn, Germany. She has directed the Future Lab AI4EO: Artificial Intelligence for Earth Observation: Reasoning, Uncertainties, Ethics and Beyond, Munich, since 2020 and she serves on the board of directors of the Munich Data Science Institute, TUM. She was a guest scientist or visiting professor at the Italian National Research Council, Naples, Italy; Fudan University, Shanghai, China; the University of Tokyo, Tokyo, Japan; and the University of California, Los Angeles, Los Angeles, California, USA, in 2009, 2014, 2015, and 2016, respectively. Her research interests include remote sensing and Earth observation, signal processing, machine learning, and data science, with a special focus on global urban mapping. She is a member of the Junge Akademie/ Junges Kolleg, Berlin–Brandenburg Academy of Sciences and Humanities; the German National Academy of Sciences Leopoldina; and the Bavarian Academy of Sciences and Humanities. She is an associate editor of IEEE Transactions on Geoscience and Remote Sensing and a Fellow of IEEE. Sina Montazeri (sina.montazeri@dlr.de) received his B.Sc. degree in geodetic engineering from the University of Isfahan, Isfahan, Iran, in 2011; his M.Sc. degree in geomatics from Delft University of Technology, Delft, The Netherlands, in 2014; and his Ph.D. degree in radar remote sensing from the Technical University of Munich (TUM), Munich, Germany, in 2019, with a dissertation on geodetic synthetic aperture radar (SAR) interferometry. In 2012, he spent two weeks with the Laboratoire des Sciences de l’Image, de l’Informatique et de la Télédétection, University of Strasbourg, Strasbourg, France, as a junior researcher working on thermal remote sensing. From 2013 to 2015, he was a research assistant at the Remote Sensing Technology Institute (IMF), German Aerospace Center (DLR), Wessling, 82234, Germany, where he was involved in the absolute localization of point clouds obtained from SAR tomography. From 2015 to 2019, he was a research associate with the Signal 163
Processing in Earth Observation research group, TUM, and IMF–DLR, working on the automatic positioning of ground control points from multiview radar images. He is currently a senior researcher in the Department of Earth Observation Data Science, IMF–DLR, focused on developing machine learning algorithms applied to radar imagery. His research interests include advanced interferometric SAR techniques for the deformation monitoring of urban infrastructure, image and signal processing relevant to radar imagery, and applied machine learning. He received the DLR Science Award and the IEEE Geoscience and Remote Sensing Society Transactions Prize Paper Award, in 2016 and 2017, respectively, for his work on geodetic SAR tomography. Mohsin Ali (syed.ali@dlr.de) received his B.S. degree in computer engineering from the National University of Science and Technology, Islamabad, Pakistan, in 2013 and his M.S. degree in computer science from the University of Freiburg, Freiburg, Germany, in 2018. He is a Ph.D. degree candidate at the Earth Observation Center, German Aerospace Center, Wessling, 82234. Germany, supervised by Prof. Xiao Xiang Zhu. His research interests include uncertainty estimation in deep learning models for remote sensing applications. Yuansheng Hua (yuansheng.hua@dlr.de) received his B.S. degree in remote sensing science and technology from Wuhan University, Wuhan, China, in 2014 and his M.S. degree in Earth-oriented space science and technology from the Technical University of Munich (TUM), Munich, Germany, in 2018. He is pursuing his Ph.D. degree at the German Aerospace Center, Wessling, 82234, Germany, and at TUM. In 2019, he was a visiting researcher at Wageningen University and Research, Wageningen, The Netherlands. His research interests include remote sensing, computer vision, and deep learning, especially their applications in remote sensing. He is a Student Member of IEEE. Yuanyuan Wang (y.wang@tum.de) received his B.Eng. degree, with honors, in electrical engineering from Hong Kong Polytechnic University, Hong Kong, China, in 2008, and his M.Sc. and Dr. Ing. degrees from the Technical University of Munich (TUM), Munich, Germany, in 2010 and 2015, respectively. In June and July 2014, he was a guest scientist at the Institute of Visual Computing, ETH Zürich, Zürich, Switzerland. He is currently with the Department of Earth Observation Data Science, Remote Sensing Technology Institute, German Aerospace Center, Wessling, 82234, Germany, where he leads the Big SAR Data working group. He is also a guest member of the Professorship of Data Science in Earth Observation, TUM, where he supports the scientific management of European Research Council projects So2Sat and AI4SmartCities. His research interests include optimal and robust parameter estimation in multibaseline interferometric synthetic aperture radar (SAR), multisensor fusion algorithms of SAR and optical data, nonlinear optimization with complex numbers, machine learning in SAR, and high-performance computing for big data. He serves as a reviewer for multiple IEEE Geoscience and Remote 164 Sensing Society and other remote sensing journals, and he was named one of the best reviewers of IEEE Transactions on Geoscience and Remote Sensing, in 2016. He is an associate editor of the Royal Meteorological Society’s Geoscience Data Journal. He is a Member of IEEE. Lichao Mou (lichao.mou@dlr.de) received his B.S. degree in automation from the Xi’an University of Posts and Telecommunications, Xi’an, China, in 2012; his M.S. 
degree in signal and information processing from the University of the Chinese Academy of Sciences, Beijing, China, in 2015; and his Dr.-Ing. degree from the Technical University of Munich (TUM), Munich, Germany, in 2020. He is a guest professor at the Munich AI Future Lab AI4EO, TUM, and the head of the Visual Learning and Reasoning team, Department of Earth Observation Data Science, Remote Sensing Technology Institute (IMF), German Aerospace Center (DLR), Wessling, 82234, Germany. Since 2019, he has been an artificial intelligence consultant for the Helmholtz Artificial Intelligence Cooperation Unit of the Helmholtz Association of Germany. In 2015, he spent six months at the Computer Vision Group, University of Freiburg, Freiburg, Germany. In 2019 he was a visiting researcher at the Cambridge Image Analysis Group, University of Cambridge, Cambridge, U.K. From 2019 to 2020, he was a research scientist at IMF–DLR. He was the first-place winner of the 2016 IEEE GRSS Data Fusion Contest and a finalist for the Best Student Paper Award at the Joint Urban Remote Sensing Event, in 2017 and 2019. He is a Member of IEEE. Yilei Shi (yilei.shi@tum.de) received his Dipl.-Ing. degree in mechanical engineering and his Dr.-Ing. degree in engineering from the Technical University of Munich (TUM), Germany. In April and May 2019, he was a guest scientist in the Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, U.K. He is currently a senior scientist with the Chair of Remote Sensing Technology, TUM, Munich, 82024, Germany. His research interests include computational intelligence; fast-solver and parallel computing for large-scale problems; advanced methods for synthetic aperture radar (SAR) and interferometric SAR processing; machine learning and deep learning for a variety data sources, such as SAR, optical images, medical images, and so on; and partial differential equation-related numerical modeling and computing. He is a Member of IEEE. Feng Xu (fengxu@fudan.edu.cn) received his B.E. degree, with honors, in information engineering from Southeast University, Nanjing, China, in 2003 and his Ph.D. degree, with honors, in electronic engineering from Fudan University, China, in 2008. From 2008 to 2010, he was a postdoctoral fellow at the National Oceanic and Atmospheric Administration Center for Satellite Application and Research, Camp Springs, Maryland, USA. From 2010 to 2013, he was with Intelligent Automation, Rockville, Maryland, USA, and with the NASA Goddard Space Flight Center, Greenbelt, Maryland, USA, as a research scientist. In 2012, he was selected for China’s Global Experts Recruitment Program and subsequently returned to Fudan IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
University, Shanghai, 200433, China, in 2013, where he is currently a professor in the School of Information Science and Technology and the vice director of the Ministry of Education Key Laboratory for Information Science of Electromagnetic Waves. He has authored more than 30 papers in peer-reviewed journals, coauthored two books, and written many conference papers, and he holds two patents. His research interests include electromagnetic scattering modeling, synthetic aperture radar information retrieval, and radar system development. He was a recipient of the second-class National Nature Science Award and the 2014 SUMMA graduate fellowship in advanced electromagnetics. He serves as an associate editor of IEEE Geoscience and Remote Sensing Letters. He is the founding chair of the IEEE Geoscience and Remote Sensing Society Shanghai Chapter and a Senior Member of IEEE. Richard Bamler (richard.bamler@dlr.de) received his Dipl.-Ing. degree in electrical engineering, Dr.-Ing. degree in engineering, and habilitation degree in signal and systems theory, in 1980, 1986, and 1988, respectively, from the Technical University of Munich, Germany. He worked at the university, from 1981 to 1989, on optical signal processing, holography, wave propagation, and tomography. He joined the German Aerospace Center (DLR), Wessling, 82234, Germany, in 1989, where he is currently the director of the Remote Sensing Technology Institute. In early 1994, he was a visiting scientist at the NASA Jet Propulsion Laboratory in preparation of the Spaceborne Imaging Radar-C/X-band Synthetic Aperture Radar (SIR-C/X-SAR) missions, and, in 1996, he was a guest professor at the University of Innsbruck. Since 2003, he has held a full professorship in remote sensing technology at the Technical University of Munich, Munich, 80333, Germany, as a double appointment with his DLR position. His teaching activities include university lectures and courses covering signal processing, estimation theory, and synthetic aperture radar (SAR). Since he joined the DLR, his team has worked on SAR and optical remote sensing, image analysis and understanding, stereo reconstruction, computer vision, ocean color, passive and active atmospheric sounding, and laboratory spectrometry. His team is responsible for the development of the operational processors for SIR-C/X-SAR, the Shuttle Radar Topography Mission, TerraSAR-X, TerraSAR-X Add-On for Digital Elevation Measurement, the Tandem-L mission, the Second European Remote Sensing Satellite Global Ozone Monitoring Experiment (GOME), Environmental Satellite Scanning Imaging Absorption Spectrometer for Atmospheric Cartography, Meteorological Operational Satellite/ GOME-2, Sentinel-5 Precursor, Sentinel-4, DLR Earth Sensing Imaging Spectrometer, the Environmental Mapping and Analysis Program mission, and others. His research interests include algorithms for optimum information extraction from remote sensing data, with an emphasis on SAR. This involves new estimation algorithms, such as sparse reconstruction, compressive sensing, and deep learning. He is a Fellow of IEEE. DECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE REFERENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. doi: 10.1038/nature14539. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556. Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. 
Wu, “Object detection with deep learning: A review,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11, pp. 3212–3232, 2019. doi: 10.1109/ TNNLS.2018.2876865. Y. Guo, Y. Liu, T. Georgiou, and M. S. Lew, “A review of semantic segmentation using deep neural networks,” Int. J. Multimedia Inf. Retrieval, vol. 7, no. 2, pp. 87–93, 2018. doi: 10.1007/s13735017-0141-z. X. X. Zhu et al., “Deep learning in remote sensing: A comprehensive review and list of resources,” IEEE Geosci. Remote Sens. Mag., vol. 5, no. 4, pp. 8–36, 2017. doi: 10.1109/MGRS.2017.2762307. H. Parikh, S. Patel, and V. Patel, “Classification of SAR and PolSAR images using deep learning: A review,” Int. J. Image Data Fusion, vol. 11, no. 1, pp. 1–32, 2020. doi: 10.1080/19479832.2019.1655489. S. Chen and H. Wang, “SAR target recognition based on deep learning,” in Proc. Int. Conf. Data Sci. Adv. Anal. (DSAA), 2014, pp. 541–547. doi: 10.1109/DSAA.2014.7058124. L. Wang, A. Scott, L. Xu, and D. Clausi, “Ice concentration estimation from dual-polarized SAR images using deep convolutional neural networks,” in IEEE Trans. Geosci. Remote Sens., vol. 11, no. 1, pp. 1–32, 2014. doi: 10.1109/TGRS.2016.2543660. P. Wang, H. Zhang, and V. Patel, “SAR image despeckling using a convolutional neural network,” IEEE Signal Process. Lett., vol. 24, no. 12, pp. 1763–1767, 2017. doi: 10.1109/LSP.2017.2758203. N. Anantrasirichai, J. Biggs, F. Albino, P. Hill, and D. Bull, “Application of machine learning to classification of volcanic deformation in routinely generated InSAR data,” JGR, Solid Earth, vol. 123, no. 8, pp. 6592–6606, 2018. doi: 10.1029/2018JB015911. L. Hughes, M. Schmitt, L. Mou, Y. Wang, and X. X. Zhu, “Identifying corresponding patches in SAR and optical images with a pseudo-Siamese CNN,” IEEE Geosci. Remote Sens. Lett., vol. 15, no. 5, pp. 784–788, 2018. doi: 10.1109/LGRS.2018.2799232. K. Ikeuchi, T. Shakunaga, M. Wheeler, and T. Yamazaki, “Invariant histograms and deformable template matching for SAR target recognition,” in Proc. CVPR IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 1996, pp. 100–105. doi: 10.1109/ CVPR.1996.517060. Q. Zhao and J. Principe, “Support vector machines for SAR automatic target recognition,” IEEE Trans. Aerosp. Electron. Syst., vol. 37, no. 2, pp. 643–654, 2001. doi: 10.1109/7.937475. M. Bryant and F. Garber, “SVM classifier applied to the MSTAR public data set,” in Proc. Algorithms Synth. Aperture Radar Imag., 1999, pp. 355–360. doi: 10.1117/12.357652. M. Ferguson, R. Ak, Y.-T. T. Lee, and K. H. Law, “Automatic localization of casting defects with convolutional neural networks,” in Proc. IEEE Int. Conf. Big Data (Big Data), Dec. 2017, pp. 1726–1735. doi: 10.1109/BigData.2017.8258115. K. Chen, K. Chen, Q. Wang, Z. He, J. Hu, and J. He, “Short-term load forecasting with deep residual networks,” IEEE Trans. Smart Grid, vol. 10, no. 4, pp. 3943–3952, July 2019. doi: 10.1109/ TSG.2018.2844307. 165
[17] Y. Han and J. C. Ye, “Framing U-Net via Deep Convolutional Framelets: Application to Sparse-View CT,” IEEE Transactions on Medical Imaging, vol. 37, no. 6, pp. 1418–1429, Jun. 2018, doi: 10.1109/TMI.2018.2823768. [18] “Long short-term memory.” Wikimedia. https://upload.wiki​ media.org/wikipedia/commons/thumb/3/3b/The_LSTM_ cell.png/1280px-The_LSTM_cell.png (accessed May 27, 2020). [19] Y. Yang, K. Zheng, C. Wu, and Y. Yang, “Improving the Classification Effectiveness of Intrusion Detection by Using Improved Conditional Variational AutoEncoder and Deep Neural Network,” Sensors, vol. 19, no. 11, p. 2528, Jun. 2019. doi: 10.3390/ s19112528. [20] W. Feng, N. Guan, Y. Li, X. Zhang, and Z. Luo, “Audio visual speech recognition with multimodal recurrent neural networks,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), May 2017, pp. 681–688. doi: 10.1109/IJCNN.2017.7965918. [21] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, “Generative Adversarial Networks: An Overview,” IEEE Signal Process. Mag., vol. 35, no. 1, pp. 53– 65, Jan. 2018. doi: 10.1109/MSP.2017.2765202. [22] M. Zitnik, M. Agrawal, and J. Leskovec, “Modeling polypharmacy side effects with graph convolutional networks,” Bioinformatics, vol. 34, no. 13, pp. 457–466, 2018. doi: 10.1093/bioinformatics/bty294. [23] B. Huang and K. M. Carley, “Residual or gate? Towards deeper graph neural networks for inductive graph representation learning,” Aug. 2019, arXiv: 1904.08035. [24] M. Alioscha-Perez, A. D. Berenguer, E. Pei, M. C. Oveneke, and H. Sahli, “Neural architecture search under black-box objectives with deep reinforcement learning and increasingly-sparse rewards,” in 2020 Int. Conf. Artificial Intelligence in Information and Communication (ICAIIC), Fukuoka, Japan, Feb. 2020. pp. 276–281. doi: 10.1109/ICAIIC48513.2020.9065031. [25] Y. LeCun, C. Cortes, and C. Burges, “Mnist handwritten digit database,” 2010. [Online]. Available: http://yann.lecun. com/exdb/mnist/ [26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. doi: 10.1109/5.726791. [27] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105. [28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255. doi: 10.1109/CVPR.2009.5206848. [29] T. Tieleman and G. Hinton, “Lecture 6.5-Rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Netw. Machine Learn., vol. 4, no. 2, pp. 26–31, 2012. [30] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980. [31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90. [32] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proc. Int. 166 [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [48] Conf. Med. Image Comput. Comput.-Assisted Intervention, 2015, pp. 234–241. doi: 10.1007/978-3-319-24574-4_28. G. Huang, Z. Liu, K. Weinberger, and L. Maaten, “Densely connected convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. 
(CVPR), 2017, pp. 2261–2269. doi: 10.1109/ CVPR.2017.243. T. Hoeser and C. Kuenzer, “Object detection and image segmentation with deep learning on earth observation data: A review-Part I: evolution and recent trends,” Remote Sens., vol. 12, no. 10, p. 1667, 2020. doi: 10.3390/rs12101667. A. Mazza, F. Sica, P. Rizzoli, and G. Scarpa, “TanDEM-X forest mapping using convolutional neural networks,” Remote Sens., vol. 11, no. 24, p. 2980, Jan. 2019. doi: 10.3390/rs11242980. F. Lattari, B. Gonzalez Leon, F. Asaro, A. Rucci, C. Prati, and M. Matteucci, “Deep learning for SAR image despeckling,” Remote Sens., vol. 11, no. 13, p. 1532, 2019. doi: 10.3390/rs11131532. D. Morgan, “Deep convolutional neural networks for ATR from SAR imagery,” in Proc. SPIE, vol. 9475, May 13, 2015. doi: 10.1117/12.2176558. B. A. Pearlmutter, “Learning state space trajectories in recurrent neural networks,” Neural Computat., vol. 1, no. 2, pp. 263–269, 1989. doi: 10.1162/neco.1989.1.2.263. S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computat., vol. 9, no. 8, pp. 1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735. E. Ndikumana, D. Ho Tong Minh, N. Baghdadi, D. Courault, and L. Hossard, “Deep recurrent neural network for agricultural classification using multitemporal SAR sentinel-1 for Camargue, France,” Remote Sens., vol. 10, no. 8, p. 1217, 2018. doi: 10.3390/rs10081217. I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680. C. Grohnfeld, M. Schmitt, and X. X. Zhu, “A conditional generative adversarial network to fuse SAR and multispectral optical data for cloud removal from Sentinel-2 images,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2018, pp. 1726–1729, doi: 10.1109/IGARSS.2018.8519215. P. Ebel, M. Schmitt, and X. Zhu, “Cloud removal in unpaired sentinel-2 imagery using cycle-consistent GAN and SAR-optical data fusion,” in Proc. IGARSS 2020 IEEE Int. Geosci. Remote Sens. Symp. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014. doi: 10.5555/2627435.2670313. K. Pearson, “LIII. On lines and planes of closest fit to systems of points in space,” London, Edinburgh, Dublin Philosoph. Mag. J. Sci., vol. 2, no. 11, pp. 559–572, 1901. doi: 10.1080/14786440109462720. D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” 2013, arXiv:1312.6114. V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015. doi: 10.1038/nature14236. H. Mao, M. Alizadeh, I. Menache, and S. Kandula, “Resource management with deep reinforcement learning,” in Proc. 15th ACM Workshop Hot Topics Netw., 2016, pp. 50–56. doi: 10.1145/3005745.3005750. IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
[49] D. Silver et al., “Mastering the game of go with deep neural networks and tree search,” nature, vol. 529, no. 7587, p. 484, 2016. doi: 10.1038/nature16961. [50] X. Chen, J. Liu, Z. Wang, and W. Yin, “Theoretical linear convergence of unfolded ista and its practical weights and thresholds,” 2018. [51] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,” 2018, arXiv:1808.05377. [52] H. Dong, B. Zou, L. Zhang, and S. Zhang, “Automatic design of CNNs via differentiable neural architecture search for PolSAR image classification,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 9, pp. 1–14, 2020. doi: 10.1109/TGRS.2020.2976694. [53] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” 2016, arXiv:1609.02907. [54] A. Kendall and Y. Gal, “What uncertainties do we need in Bayesian deep learning for computer vision?” in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 5580–5590. doi: 10.5555/3295222.3295309. [55] Y. Shi, Q. Li, and X. X. Zhu, “Building segmentation through a gated graph convolutional neural network with deep structured feature embedding,” ISPRS J. Photogram. Remote Sens., vol. 159, pp. 184–197, Jan. 2020. doi: 10.1016/j.isprsjprs. 2019.11.004. [56] F. Ma, F. Gao, J. Sun, H. Zhou, and A. Hussain, “Attention graph convolution network for image segmentation in big SAR imagery data,” Remote Sens., vol. 11, no. 21, p. 2586, 2019. doi: 10.3390/rs11212586. [57] X. Tang, L. Zhang, and X. Ding, “SAR image despeckling with a multilayer perceptron neural network,” Int. J. Digit. Earth, vol. 12, no. 3, pp. 1–21, 2018. doi: 10.1080/17538947.2018. 1447032. [58] L. Mou, M. Schmitt, Y. Wang, and X. X. Zhu, “A CNN for the identification of corresponding patches in SAR and optical imagery of urban scenes,” in Proc. Urban Remote Sens. Event (JURSE), 2017, pp. 1–4. doi: 10.1109/JURSE.2017.7924548. [59] R. Touzi, A. Lopes, and P. Bousquet, “A statistical and geometrical edge detector for SAR images,” IEEE Trans. Geosci. Remote Sens., vol. 26, no. 6, pp. 764–773, 1988. doi: 10.1109/36.7708. [60] G. Chierchia, D. Cozzolino, G. Poggi, and L. Verdoliva, “SAR image despeckling through convolutional neural networks,” 2017, arXiv:1704.00275. [61] Y. Shi, X. X. Zhu, and R. Bamler, “Optimized parallelization of non-local means filter for image noise reduction of InSAR image,” in Proc. IEEE Int. Conf. Inf. Automat., 2015, pp. 1515–1518. doi: 10.1109/ICInfA.2015.7279525. [62] X. X. Zhu, R. Bamler, M. Lachaise, F. Adam, Y. Shi, and M. Eineder, “Improving TanDEM-X DEMs by non-local InSAR filtering,” in Proc. Euro. Conf. Synth. Aperture Radar (EUSAR), 2014, pp. 1–4. [63] L. Denis, C.-A. Deledalle, and F. Tupin, “From patches to deep learning: Combining self-similarity and neural networks for SAR image despeckling,” in Proc. IGARSS 2019 - 2019 IEEE Int. Geosci. Remote Sens. Symp., pp. 5113–5116. doi: 10.1109/ IGARSS.2019.8898473. [64] J. Gao, B. Deng, Y. Qin, H. Wang, and X. Li, “Enhanced radar imaging using a complex-valued convolutional neural netDECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE [65] [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] [77] [78] work,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 1, pp. 35–39, 2019. doi: 10.1109/LGRS.2018.2866567. A. Moreira, P. Prats-Iraola, M. Younis, G. Krieger, I. Hajnsek, and K. P. Papathanassiou, “A tutorial on synthetic aperture radar,” IEEE Geosci. Remote Sens. Mag., vol. 1, no. 1, pp. 6–43, 2013. doi: 10.1109/MGRS.2013.2248301. C. He, S. Li, Z. Liao, and M. 
Liao, “Texture classification of PolSAR data based on sparse coding of wavelet polarization textons,” IEEE Trans. Geosci. Remote Sens., vol. 51, no. 8, pp. 4576– 4590, 2013. doi: 10.1109/TGRS.2012.2236338. H. Xie, S. Wang, K. Liu, S. Lin, and B. Hou, “Multilayer feature learning for polarimetric synthetic radar data classification,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2014, pp. 2818–2821. doi: 10.1109/IGARSS.2014.6947062. J. Geng, H. Wang, J. Fan, and X. Ma, “Deep supervised and contractive neural network for SAR image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 4, pp. 2442–2459, 2017. doi: 10.1109/TGRS.2016.2645226. S. Uhlmann and S. Kiranyaz, “Integrating color features in polarimetric SAR image classification,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 4, pp. 2197–2216, 2014. doi: 10.1109/TGRS. 2013.2258675. J. Geng, J. Fan, H. Wang, X. Ma, B. Li, and F. Chen, “High-resolution SAR image classification via deep convolutional autoencoders,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 11, pp. 2351–2355, 2015. doi: 10.1109/LGRS.2015.2478256. B. Hou, B. Ren, G. Ju, H. Li, L. Jiao, and J. Zhao, “SAR image classification via hierarchical sparse representation and multisize patch features,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 1, pp. 33–37, 2016. doi: 10.1109/LGRS.2015.2493242. F. Gao, T. Huang, J. Wang, J. Sun, A. Hussain, and E. Yang, “Dual-branch deep convolution neural network for polarimetric SAR image classification,” Appl. Sci., vol. 7, no. 5, p. 447, 2017. doi: 10.3390/app7050447. B. Hou, H. Kou, and L. Jiao, “Classification of polarimetric SAR images using multilayer autoencoders and superpixels,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 9, no. 7, pp. 3072–3081, 2016. doi: 10.1109/JSTARS.2016.2553104. L. Zhang, W. Ma, and D. Zhang, “Stacked sparse autoencoder in PolSAR data classification using local spatial information,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 9, pp. 1359–1363, 2016. doi: 10.1109/LGRS.2016.2586109. F. Qin, J. Guo, and W. Sun, “Object-oriented ensemble classification for polarimetric SAR imagery using restricted Boltzmann machines,” Remote Sens. Lett., vol. 8, no. 3, pp. 204–213, 2017. doi: 10.1080/2150704X.2016.1258128. Z. Zhao, L. Jiao, J. Zhao, J. Gu, and J. Zhao, “Discriminant deep belief network for high-resolution SAR image classification,” Pattern Recognit., vol. 61, pp. 686–701, 2017. doi: 10.1016/j.patcog.2016.05.028. Y. Zhou, H. Wang, F. Xu, and Y. Jin, “Polarimetric SAR image classification using deep convolutional neural networks,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 12, pp. 1935–1939, 2016. doi: 10.1109/LGRS.2016.2618840. Y. Wang, C. He, X. Liu, and M. Liao, “A hierarchical fully convolutional network integrated with sparse and low-rank subspace 167
[79] [80] [81] [82] [83] [84] [85] [86] [87] [88] [89] [90] [91] 168 representations for PolSAR imagery classification,” Remote Sens., vol. 10, no. 2, p. 342, 2018. doi: 10.3390/rs10020342. S. Chen and C. Tao, “PolSAR image classification using polarimetric-feature-driven deep convolutional neural network,” IEEE Geosci. Remote Sens. Lett., vol. 15, no. 4, pp. 627–631, 2018. doi: 10.1109/LGRS.2018.2799877. C. He, M. Tu, D. Xiong, and M. Liao, “Nonlinear manifold learning integrated with fully convolutional networks for PolSAR image classification,” Remote Sens., vol. 12, no. 4, p. 655, 2020. doi: 10.3390/rs12040655. H. Dong, L. Zhang, and B. Zou, “PolSAR image classification with lightweight 3D convolutional networks,” Remote Sens., vol. 12, no. 3, p. 396, 2020. doi: 10.3390/rs12030396. N. Teimouri, M. Dyrmann, and R. N. Jørgensen, “A novel spatio-temporal FCN-LSTM network for recognizing various crop types using multi-temporal radar images,” Remote Sens., vol. 11, no. 8, p. 990, 2019. doi: 10.3390/rs11080990. Z. Zhang, H. Wang, F. Xu, and Y. Jin, “Complex-valued convolutional neural network and its application in polarimetric SAR image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 12, pp. 7177–7188, 2017. doi: 10.1109/TGRS.2017.2743222. A. G. Mullissa, C. Persello, and A. Stein, “PolSARNet: A deep fully convolutional network for polarimetric SAR image classification,” IEEE J. Select. Topics Appl. Earth Observ. Remote Sens., vol. 12, no. 12, pp. 5300–5309, 2019. doi: 10.1109/ JSTARS.2019.2956650. L. Li, L. Ma, L. Jiao, F. Liu, Q. Sun, and J. Zhao, “Complex contourlet-CNN for polarimetric SAR image classification,” Pattern Recognit., vol. 100, p. 107,110, Apr. 2020. doi: 10.1016/j. patcog.2019.107110. W. Xie, G. Ma, F. Zhao, H. Liu, and L. Zhang, “PolSAR image classification via a novel semi-supervised recurrent complex-valued convolution neural network,” Neurocomputing, vol. 388, pp. 255–268, May 2020. doi: 10.1016/j.neucom. 2020.01.020. Z. Huang, M. Datcu, Z. Pan, and B. Lei, “Deep SAR-Net: Learning objects from signals,” ISPRS J. Photogram. Remote Sens., vol. 161, pp. 179–193, Mar. 2020. doi: 10.1016/j.isprsjprs.2020.01.016. R. Ressel, A. Frost, and S. Lehner, “A neural network-based classification for sea ice types on x-band SAR images,” IEEE J. Select. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 7, pp. 3672–3680, 2015. doi: 10.1109/JSTARS.2015.2436993. R. Ressel, S. Singha, and S. Lehner, “Neural network based automatic sea ice classification for CL-pol RISAT-1 imagery,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2016, pp. 4835–4838. doi: 10.1109/IGARSS.2016.7730261. R. Ressel, S. Singha, S. Lehner, A. Rosel, and G. Spreen, “Investigation into different polarimetric features for sea ice classification using x-band synthetic aperture radar,” IEEE J. Select. Topics Appl. Earth Observ. Remote Sens., vol. 9, no. 7, pp. 3131–3143, 2016. doi: 10.1109/JSTARS.2016.2539501. S. Singha, M. Johansson, N. Hughes, S. M. Hvidegaard, and H. Skourup, “Arctic sea ice characterization using spaceborne fully polarimetric L-, C-, and X-band SAR with validation by airborne measurements,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 7, pp. 3715–3734, 2018. doi: 10.1109/TGRS.2018.2809504. [92] N. Zakhvatkina, V. Smirnov, and I. Bychkova, “Satellite SAR data-based sea ice classification: An overview,” Geosciences, vol. 9, no. 4, p. 152, 2019. doi: 10.3390/geosciences9040152. [93] X. Yao, J. Han, G. Cheng, X. Qian, and L. 
Guo, “Semantic annotation of high-resolution satellite images via weakly supervised learning,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 6, pp. 3660–3671, 2016. doi: 10.1109/TGRS.2016.2523563. [94] G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han, “When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 5, pp. 2811–2821, 2018. doi: 10.1109/TGRS.2017.2783902. [95] F. Zhang, C. Hu, Q. Yin, W. Li, H. Li, and W. Hong, “SAR target recognition using the multi-aspect-aware bidirectional LSTM recurrent neural networks,” 2017, arXiv:1707.09875. [96] E. Keydel, S. Lee, and J. Moore, “MSTAR extended operating conditions: A tutorial,” in Proc. SPIE, vol. 2757, pp. 228–242, 1996. doi: 10.1117/12.242059. [97] S. Chen, H. Wang, F. Xu, and Y. Jin, “Target classification using the deep convolutional networks for SAR images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 8, pp. 4806–4817, 2016. doi: 10.1109/TGRS.2016.2551720. [98] J. Ding, B. Chen, H. Liu, and M. Huang, “Convolutional neural network with data augmentation for SAR target recognition,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 3, pp. 364–368, 2016. doi: 10.1109/LGRS.2015.2513754. [99] K. Du, Y. Deng, R. Wang, T. Zhao, and N. Li, “SAR ATR based on displacement-and rotation-insensitive CNN,” Remote Sens. Lett., vol. 7, no. 9, pp. 895–904, 2016. doi: 10.1080/2150704X.2016.1196837. [100] M. Wilmanski, C. Kreucher, and J. Lauer, “Modern approaches in deep learning for SAR ATR,” in Proc. SPIE 9843, Algorithms for Synthetic Aperture Radar Imagery XXIII, vol. 9843, May 14, 2016, p. 98430N. doi: 10.1117/12.2220290. [101] S. Wagner, “SAR ATR by a combination of convolutional neural network and support vector machines,” IEEE Trans. Aerosp. Electron. Syst., vol. 52, no. 6, pp. 2861–2872, 2016. doi: 10.1109/ TAES.2016.160061. [102] F. Gao, T. Huang, J. Sun, J. Wang, A. Hussain, and E. Yang, “A new algorithm for SAR image target recognition based on an improved deep convolutional neural network,” Cogn. Computat., vol. 11, no. 6, pp. 809–824, 2019. doi: 10.1007/s12559-018-9563-z. [103] F. Gao, T. Huang, J. Wang, J. Sun, E. Yang, and A. Hussain, “Combining deep convolutional neural network and SVM to SAR image target recognition,” in Proc. IEEE Int. Conf. Internet of Things (iThings) IEEE Green Comput. Commun. (GreenCom) IEEE Cyber, Phys. Soc. Comput. (CPSCom) IEEE Smart Data (SmartData), 2017, pp. 1082–1085. doi: 10.1109/iThings-GreenComCPSCom-SmartData.2017.165. [104] H. Furukawa, “Deep learning for end-to-end automatic target recognition from synthetic aperture radar imagery,” 2018, arXiv:1801.08558. [105] D. Cozzolino, G. D Martino, G. Poggi, and L. Verdoliva, “A fully convolutional neural network for low-complexity singlestage ship detection in Sentinel-1 SAR images,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2017, pp. 886–889. doi: 10.1109/IGARSS.2017.8127094. IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE DECEMBER 2021
[106] C. Schwegmann, W. Kleynhans, B. Salmon, L. Mdakane, and R. Meyer, “Very deep learning for ship discrimination in synthetic aperture radar imagery,” in Proc. IEEE Int. Geosci. Remote Sens. Symp. (IGARSS), 2016, pp. 104–107. doi: 10.1109/ IGARSS.2016.7729017. [107] C. Bentes, A. Frost, D. Velotto, and B. Tings, “Ship-iceberg discrimination with convolutional neural networks in high resolution SAR images,” in Proc. Euro. Conf. Synth. Aperture Radar (EUSAR), 2016, pp. 1–4. [108] N. Ødegaard, A. Knapskog, C. Cochin, and J. Louvigne, “Classification of ships using real and simulated data in a convolutional neural network,” in Proc. IEEE Radar Conf. (RadarConf), 2016, pp. 1–6. doi: 10.1109/RADAR.2016.7485270. [109] Y. Liu, M. Zhang, P. Xu, and Z. Guo, “SAR ship detection using sea-land segmentation-based convolutional neural network,” in Proc. Int. Workshop Remote Sens. Intell. Process. (RSIP), 2017, pp. 1–4. doi: 10.1109/RSIP.2017.7958806. [110] R. Girshick, “Fast R-CNN,” 2015, arXiv:1504.08083. [111] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017. doi: 10.1109/TPAMI.2016.2577031. [112] J. Li, C. Qu, and J. Shao, “Ship detection in SAR images based on an improved faster R-CNN,” in Proc. SAR Big Data Era: Models, Methods Appl. (BIGSARDATA), 2017, pp. 1–6, doi: 10.1109/ BIGSARDATA.2017.8124934. [113] M. Kang, K. Ji, X. Leng, and Z. Lin, “Contextual region-based convolutional neural network with multilayer fusion for SAR ship detection,” Remote Sens., vol. 9, no. 8, p. 860, 2017. doi: 10.3390/rs9080860. [114] J. Jiao et al., “A densely connected end-to-end neural network for multiscale and multiscene SAR ship detection,” IEEE Access, vol. 6, pp. 20,881–20,892, Apr. 2018. doi: 10.1109/ACCESS.2018.2825376. [115] C. Dechesne, S. Lefèvre, R. Vadaine, G. Hajduch, and R. Fablet, “Multi-task deep learning from sentinel-1 SAR: Ship detection, classification and length estimation,” presented at the Conf. Big Data from Space, 2019. [116] S. Haykin, “Cognitive radar: A way of the future,” IEEE Signal Process. Mag., vol. 23, no. 1, pp. 30–40, 2006. doi: 10.1109/ MSP.2006.1593335. [117] S. Kazemi, B. Yonel, and B. Yazici, “Deep learning for direct automatic target recognition from SAR data,” in Proc. IEEE Radar Conf. (RadarConf), 2019, pp. 1–6. doi: 10.1109/RADAR.2019. 8835492. [118] M. Rostami, S. Kolouri, E. Eaton, and K. Kim, “Deep transfer learning for few-shot SAR image classification,” Remote Sens., vol. 11, no. 11, p. 1374, 2019. doi: 10.3390/rs11111374. [119] Z. Huang, Z. Pan, and B. Lei, “What, where, and how to transfer in SAR target recognition based on deep CNNs,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 4, 2019. doi: 10.1109/ TGRS.2019.2947634. [120] M. Shahzad, M. Maurer, F. Fraundorfer, Y. Wang, and X. X. Zhu, “Buildings detection in VHR SAR images using fully convolution neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 2, pp. 1100–1116, 2019. doi: 10.1109/TGRS.2018.2864716. DECEMBER 2021 IEEE GEOSCIENCE AND REMOTE SENSING MAGAZINE [121] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 3431–3440. doi: 10.1109/CVPR.2015.7298965. [122] S. Zheng et al., “Conditional random fields as recurrent neural networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1529–1537. doi: 10.1109/ICCV.2015.179. 
Forward-Looking Ground-Penetrating Radar
Subsurface target imaging and detection: A review
DAVIDE COMITE, FAUZIA AHMAD, MOENESS G. AMIN, AND TRAIAN DOGARU

Detection of shallow-buried, in-road threats using a forward-looking (FL) ground-penetrating radar (GPR) system has attracted significant research interest in the last decade. An FL-GPR mounted on a moving platform can provide standoff target detection and imaging. This enables real-time sensing and situation awareness over large ground areas. The main challenge facing this sensing technology is high false-alarm rates due to scattering arising from air–ground interface roughness and subsurface clutter. In this article, we present a comprehensive review of the state-of-the-art techniques that address the unique challenges associated with FL-GPR technology. Specifically, we focus on array-based FL-GPR systems and consider both electromagnetic modeling and signal processing for problem formulation and solutions. Image formation methods and target detection approaches are discussed, highlighting their offerings and shortcomings in providing reliable system performance.

Digital Object Identifier 10.1109/MGRS.2020.3048368
Date of current version: 9 February 2021

THE CHALLENGES OF FORWARD-LOOKING GROUND-PENETRATING RADAR
In recent years, radar imaging and detection of shallow-buried targets have garnered much interest due to the need for reliable subsurface investigations in a variety of applications, including real-time security, military situational awareness, and humanitarian demining of unexploded ordnance over large areas [1]–[8]. Although a broad class of sensing modalities, including seismic and radiometric, have been proposed in the literature for the detection of buried targets [9]–[11], electromagnetic waves remain a viable option (see, e.g., [12]) owing to their various attributes, such as superior ground penetration, sensitivity to arbitrarily shaped plastic targets, and robustness to different soil conditions. In particular, the FL-GPR technology is gaining impetus as it enables sensing from a standoff distance.
A major motivation for the development of early FL-GPR systems has been their terrain-mapping capabilities, used to clear roads from explosive hazards. Vehicle-borne, down-looking radar systems previously employed in this application lacked the standoff detection range that would enable
spotting of the hazard before the vehicle drove over it. By pointing the antenna array to look ahead of the vehicle, FL-GPR systems are able to achieve a reasonable lead detection time before reaching the actual explosive hazard location. However, performance of an FL-GPR system is highly impacted by rough surface clutter (see, e.g., [13]). Depending on the soil conditions and degree of surface roughness, the returns from the ground interface can dominate the radar measurements and obscure the target response. This leads to significant uncertainty in the assessment and interpretation of the attained radar images.
Compared to its downward-looking counterpart, wherein the antennas are either coupled or very close to the ground surface (see, e.g., [3], [4], and [14]–[16]), an FL-GPR system employs oblique and near-grazing incidence sensing to enable target detection from a safe standoff distance. In this case and depending on the roughness profile of the illuminated surface, most of the energy would be forward-scattered along the specular direction, yielding reduced returns from the air–ground interface.
In practice, however, even if the backscattered echo from the rough surface is relatively weak, the intensity of the signal returns from concealed targets can also be quite low. This renders target detection and localization challenging, especially in the case of nonmetallic objects. Therefore, proper design of both imaging and detection approaches becomes fundamental to improving the performance of the FL configuration. To compensate for the loss of energy due to the signal bounce at the ground interface, synthetic aperture radar (SAR)-based focusing is typically employed [4], [17]–[21], wherein coherently combining the returns at multiple antenna positions focuses the energy to an image pixel, thereby improving weak target representations.
Several approaches based on electromagnetic modeling and statistical detection analysis have been proposed for FL-GPR (see, e.g., [19], [22], and [23] and the references therein). In this article, we focus on array-based FL-GPR systems and provide a comprehensive review of the state-of-the-art radar imaging and detection methods, highlighting their advantages and limitations. We attempt to group advances in FL-GPR based on the nature of the data employed, system prototyping, properties of the imaging scene assumed, and principal signal processing algorithms undertaken.

SOLUTION TO THE FORWARD PROBLEM
A controlled solution for the forward-scattering problem can be a powerful tool to assess and characterize the ground interface contributions, predict the target signature, and design and validate image formation methods, including clutter mitigation approaches. This would require determining the scattered field from the illuminated scene that essentially comprises the targets buried in a dielectric half-space with a rough surface profile. For simplicity and without loss of generality, the involved media are assumed homogeneous. Under known materials and imaging geometry, in the presence of a flat ground interface and considering targets represented by canonical simple shapes, the underlying scattering problem can be analytically characterized by solving Maxwell's equations (see, e.g., [12], [24], and [25]).
However, in most cases of practical interest, those assumptions and prior information do not hold, and more realistic and flexible approaches are needed. These approaches call for implementing a full-wave solution of the scattering problem numerically (see, e.g., [26] and the references therein) or collecting experimental data. In the next sections, we summarize key FL-GPR systems used in data collection and also discuss the numerical approaches used for data modeling.

FORWARD-LOOKING GROUND-PENETRATING RADAR PROTOTYPES AND EXPERIMENTAL APPROACHES
Approaches based on controlled experiments, though complicated and costly, are valuable in understanding phenomenology and can provide real scattering data. This, however, requires the availability of specific facilities and preparation of experimental campaigns. Toward this end, different prototypes and radar systems have been developed for data measurements, which are summarized as follows.
In [27]–[29], a prototype of a high-resolution GPR system was designed and deployed by SRI International. This system is a stepped-frequency, fully polarimetric radar, operating over the 0.3–3-GHz frequency band. The prototype was conceived to operate as an FL SAR system, providing a ground–surface resolution of about 5 cm. The experimental activity was originally designed to define optimal FL-GPR parameters and support image processing for the standoff target detection of concealed antitank mines. Reference [27] was the first publication reporting experimental data collection and processing with an FL SAR system for GPR applications. It provided insights into the signal-to-clutter ratio (SCR) of shallow-buried targets as well as key features of clutter statistics.
Time-frequency analysis was applied in [30] using an FL-GPR system to detect plastic targets buried under a rough ground surface. Different quadratic time-frequency distributions were considered to characterize and interpret the scattering from both the targets and the rough surface. This work employed experimental data described in [27] and proposed a target detector based on the signal ambiguity function, which showed superior detection performance over a conventional detector.
In [31], an FL-GPR operating from 0.76 to 3.8 GHz was developed by Planning System Incorporated (PSI). This FL-GPR system is a broadband, stepped-frequency, continuous wave (CW) system performing digital phase detection of the CW echoes on a fixed number of receiving channels.
Experimental campaigns were carried out to collect data in the field, accounting for both metallic and plastic objects. A near-field delay-and-sum beamforming algorithm (more details are provided in the "Image Formation" section) was implemented to provide focused images of the considered area. To meet the system bandwidth requirements, the antennas were constituted by Archimedean spirals, and each antenna was housed in a cavity-backed structure.
The U.S. Army Combat Capabilities Development Command Army Research Laboratory (ARL) FL-GPR prototype, called the synchronous impulse reconstruction (SIRE) radar [see Figure 1(a)], is an ultrawideband (UWB) radar based on the transmission of short pulses [32]. For the imaging and detection of buried targets, the system employs a physical array of 16 receiving antennas, which provide a long aperture for high cross-range resolution. The transmitted pulse has a 0.3–3-GHz bandwidth, which represents a tradeoff between fine down-range resolution and the ability to penetrate soil depths of a few centimeters. To increase the signal-to-noise ratio, the baseband receiver integrates radar returns from multiple pulses prior to processing for target detection. The system hardware was based on commercially available integrated circuits, which provided a low-cost and lightweight digitizing scheme. In [32], both simulations and measurements in the field were conducted considering on-surface metallic targets; the possibility of penetrating foliage and weather was experimentally assessed.
Following the design and testing of SIRE, ARL researchers proposed a new UWB radar system, called the spectrally agile frequency-incrementing reconfigurable (SAFIRE) radar system [34]. SAFIRE was designed to provide an unprecedented capability of adapting the operating frequency to the surrounding electromagnetic environment, thereby lowering the susceptibility of the system to radio-frequency (RF) interference. To this end, SAFIRE employed stepped-frequency waveforms and sought to eliminate system transmissions that are likely to cause interference to nearby sources of disturbance [35]. Currently, such a feature is considered essential for FL-GPRs operating in congested RF environments. The SAFIRE operating band ranges from 300 to 2,000 MHz, with a minimum frequency step-size of 1 MHz. The SAFIRE system can be configured in either an FL or side-looking orientation and is equipped with a uniform linear array made of 16 Vivaldi receiving antennas and two quad-ridge horn transmit antennas. The latter are placed above the ends of the receiver array. The sequential firing of the two transmitters provided orthogonal waveforms, which established a multiple-input, multiple-output (MIMO) configuration with an extended virtual aperture for improved cross-range resolution.
Experimental FL-GPR data, collected by the Army Look Ahead Radar Impulse (for) Countermine (ALARIC) vehicle-borne UWB impulse radar system, were used in [36] to provide the first assessment of coherent integration through exploitation of the platform movement. The system employs an impulse generator at approximately 950 MHz and has a 300–3,000-MHz bandwidth (down-range resolution of about 5 cm). A pair of transverse electromagnetic horn transmit antennas, placed at two ends of a 2-m-wide receiver array, was considered to provide good pulse fidelity while minimizing the reflected power of the transmitter.
The receiver array comprised 16 identical Vivaldi notch antennas, which were selected because of their compact size and low cross coupling between elements. Using physical array measurements from multiple platform positions, it was shown that conventional synthetic aperture processing can be used to form FL-GPR images of good quality, though at the expense of the lateral position estimates of the targets within the illuminated scene. The preliminary results also demonstrated the possibility of successfully detecting metallic targets buried near the surface.

FIGURE 1. (a) The system and field test of the U.S. ARL. (Source: [32].) (b) The test facilities at Ingegneria dei Sistemi S.p.A., Italy. (Source: [33].)
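The prototypes above share two quantitative design drivers: the transmitted bandwidth sets the achievable down-range resolution, and the two-transmitter, 16-receiver arrangement of the SIRE/SAFIRE-class systems forms a virtual MIMO aperture that improves cross-range resolution. The short Python sketch below illustrates both relationships; the element spacing and array width are illustrative assumptions, not values taken from the cited systems.

```python
import numpy as np

C0 = 299_792_458.0  # speed of light (m/s)

def downrange_resolution(bandwidth_hz):
    """Down-range resolution of a wideband radar, c / (2B)."""
    return C0 / (2.0 * bandwidth_hz)

def virtual_mimo_aperture(tx_positions_m, rx_positions_m):
    """Virtual element locations of a MIMO array (midpoints of every Tx-Rx pair)."""
    tx = np.asarray(tx_positions_m)[:, None]
    rx = np.asarray(rx_positions_m)[None, :]
    return np.sort(((tx + rx) / 2.0).ravel())

# Bandwidths quoted for the SIRE/ALARIC (0.3-3 GHz) and SAFIRE (0.3-2 GHz) systems.
print(f"0.3-3 GHz -> {100 * downrange_resolution(2.7e9):.1f} cm down-range resolution")
print(f"0.3-2 GHz -> {100 * downrange_resolution(1.7e9):.1f} cm down-range resolution")

# Hypothetical geometry: 16 receivers spread over a 2-m aperture with one
# transmitter above each end of the array (positions are assumptions).
rx = np.linspace(-1.0, 1.0, 16)
tx = np.array([-1.0, 1.0])
virtual = virtual_mimo_aperture(tx, rx)
print(f"{virtual.size} virtual elements spanning {virtual.max() - virtual.min():.2f} m")
```

With these assumed positions, the 2 × 16 transmit–receive pairs yield 32 virtual phase centers over the same physical width, which is the mechanism behind the "extended virtual aperture" noted for SAFIRE.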
More recently, the authors in [33] proposed an experimental test for the assessment of imaging and detection performance by means of an FL-GPR under realistic conditions. Test facilities at Ingegneria dei Sistemi S.p.A., headquartered in Italy [see Figure 1(b)], are equipped with a moving platform that can support two or more antennas, pointing toward a test site that comprises several resolution cells of the FL-GPR system. The test field allows the inclusion of heterogeneous soils. Data were gathered in the frequency range from 0.4 to 2 GHz using a transmit and a receive horn antenna, both connected to a network analyzer. The antennas, spaced 93.5 cm apart and tilted at a 45° angle, were mounted on the moving platform at a distance of 1.42 m from the air–soil interface. The platform was moved along straight tracks with a constant spatial step of size 0.02 m. Scanning lines of about 8 m were used to collect data over a sandy portion of the test site. The experiments were performed after intense rainfall, which reproduced challenging operational conditions. This resulted in a nonhomogeneous background medium that consisted of two layers: the upper layer, with thickness of a few centimeters, was dry sand, while the deeper layer comprised wet sand. This work also discussed the performance achievable with a conventional microwave tomographic approach to focus FL-GPR data.

NUMERICAL APPROACHES
Numerical data obtained by means of a finite-difference time-domain (FDTD) method, modeling an FL-GPR system with two transmitters and 16 receivers, were presented in [18]. The modeled system has a frequency bandwidth from 0.3 to 1.5 GHz. In particular, a near-field Army FDTD software package was developed at the ARL for synthesizing FL-GPR numerical data accounting for realistic sensing conditions. More information on the modeling approach and targets used for the analysis can be found in [37] and [38]. An example of FL-GPR-focused numerical data from [38] is reproduced in Figure 2; the images are formed over a horizontal plane in front of the transmitting and receiving antennas considering both metallic and plastic targets whose locations are specified in Figure 2(a). Both the flat ground interface [Figure 2(a)] and the rough surface [Figure 2(b)] are simulated. The latter was generated by assuming a random process model described by Gaussian statistics.
A 3D full-wave approach, based on a finite-difference frequency-domain method and optimized to provide real-time solutions, was proposed in [20] to model an FL-GPR on a moving platform and calculate the scattering from rough terrains located at large electrical distances from the antennas. For a synthetic aperture, the computational domain was reduced to a small subset of the observed region, and the surface clutter was determined by performing the simple multiplication of a precomputed impulse response matrix of the rough profile with a matrix characterizing the FL-GPR transmitted signal. This approach significantly reduced the complexity through an efficient use of computational resources, thereby permitting the representation of lossy/frequency-dispersive soils and target-detection processing in real time. This is especially useful in scenarios where an experimental performance validation may incur a high cost and/or require significantly more resources.
The authors in [39] extended the real-time 3D simulation to the multiview case, considering a realistic velocity of the moving platform. The matrix-multiplication-based surface clutter computation in this case required an additional precomputed correction matrix of the moving platform measurement steps along the direction of motion. The method was tested via Monte Carlo simulations. In practice, the proposed simulation-based approach can be used to estimate the scattering from the rough surface profile, which can then be subtracted from the actual FL-GPR measurements. The resulting difference signals can then be processed for image formation and target detection.

FIGURE 2. The focused numerical data (in decibels) for a scene with size equal to 9 × 19 m: ground with (a) a flat surface and (b) a randomly rough surface characterized by a root mean square (rms) surface height equal to 0.8 cm and correlation length of 14.93 cm. Further details can be found in [38]. Rx: receive; Tx: transmit. (Source: [38].)
IMAGE FORMATION
Once the scattering data have been collected or generated by solving the forward problem, a postprocessing procedure is needed to produce an image of the illuminated scene [12], [40]. In most cases, when the detection of concealed targets is of interest, the image is 2D and formed over a horizontal plane within an area ahead of the moving FL system [see Figures 2(a) and (b)]. Although the height of the 2D image can be arbitrarily chosen, the capability of penetrating lossy soils at microwave frequencies is on the order of 2–10 cm, which is comparable to the achievable resolution. Therefore, varying the height by a few centimeters will not significantly affect the image quality and target-detection capability.
Several image formation approaches have been proposed in the literature, with a majority being simple adaptations of conventional algorithms used for focusing SAR data. More involved strategies based on electromagnetic formulations of the problem have been presented to account for the presence of the dielectric interface and near-field conditions arising due to shorter distances between the antennas and imaging region of interest. In the following sections, we give an overview of these methods.

MIGRATION
Among the most well-known image formation algorithms, migration has been broadly used to focus GPR data. Migration is a family of imaging techniques that originate from the seismic literature [41], [42]. Over the years, this class of algorithms has been adapted within radar imaging frameworks, including SAR and GPR (see, e.g., [43] and [44]). From a practical viewpoint, the algorithm essentially operates on the scattered field at the receiver to compensate for the different delays encountered by the signal generated by point-like scatterers, which are illuminated within a certain time interval during the movement of the system (the FL-GPR platform in this case).
In radar imaging, the migration algorithm is sometimes assimilated to beamforming approaches since both essentially compensate for the hyperbolic patterns representing raw data in a time-range scattering diagram. Image reconstruction in terms of migrated data can be achieved by numerically implementing a double integral function of time and range, which includes the scattered field and migration operator (see, e.g., [45] and [46]). A number of contributions on the application of migration and beamforming algorithms to GPR data have been proposed. A comprehensive review of these approaches can be found in [17].

MICROWAVE TOMOGRAPHY
GPR imaging methods based on an electromagnetic formulation constitute a so-called inverse problem [12], [47]. Mathematically, a solution to the direct problem exists that is unique and has a continuous dependence on the data (see, e.g., [40] and [48]). The problem becomes ill posed when the uniqueness of the solution and/or continuity of its data dependence do not hold.
BY SOLVING THE FORWARD Imaging procedures based PROBLEM, A on a linear solution of the POSTPROCESSING scattering equation have been PROCEDURE IS NEEDED TO shown to be simple and particPRODUCE AN IMAGE OF THE ularly suitable for the processILLUMINATED SCENE. ing of GPR data [4], [49]–[51], including FL configurations [23], [33]. These procedures are mainly based on the Born approximation (BA) (see e.g., [40] and [49] and the references therein), which essentially approximates the internal field of a dielectric object with the incident field; the latter being a known term. By suitably defining the Green’s function of the problem [12], the electromagnetic formulation of the scattering field based on a linear solution allows near-field consideration consistent with the nature of the illumination. We can also describe in the formulation a flat air–soil interface by defining a Green’s function for multilayered media. Methods based on the inversion of the linear scattering equation are often referred to as microwave tomography approaches [49], which essentially consist of retrieving the unknown profile of a dielectric object, i.e., the contrast function, from the knowledge of the scattered field collected at the receiving antenna. The contrast function is defined as the relative difference between the (complex) permittivity of the target and that of the reference propagation scenario (free space, in the case at hand). By modeling the transmitting antennas as vertically oriented Hertzian dipoles and measuring only the VV-polarization scattered field from the investigation domain D, the linear relationship under BA for shallow-buried targets can be expressed as [12], [23] E s ^rr, ~h = - jk b2 ~n 0 z 0 $ ##D G^r, rr, ~h $ 6G ^r, rt, ~h $ z 0@ | (r) dr, (1) where E s is the VV-polarized scattered field corresponding to angular frequency ~ collected at point rr, | is the unknown scene reflectivity, G is the free-space dyadic Green’s function, k b = f r k 0 is the wavenumber in the medium, and k 0 = ~ f 0 n 0 is the free-space wavenumber. The vectors rr and rt represent the positions of the receive and transmit antennas, respectively; r denotes a generic point 177
in the image area; and ẑ_0 is the unit vector along the vertical direction. The operator "·" in (1) represents the dyadic product and is implemented as the usual product between a 3 × 3 matrix and a 3 × 1 vector.
To generate the image, (1) is discretized by means of a conventional method-of-moments approach (i.e., implementing a point-matching procedure) [12]. To limit the computational burden, the linear problem can be simply solved by applying the adjoint operator, which is also known as the backpropagation algorithm (BPA) [52], and solving for the unknown scene reflectivity. That is,

\chi = L_{zz}^{*} E_{s,z},  (2)

where L_{zz}^{*} is the adjoint of the discretized linear operator in (1), and E_{s,z} and χ are stacked vectors representing the collected scattered field data and discretized version of the unknown scene reflectivity, respectively. The spatial map defined by the magnitude of χ is the tomographic image of D.
An alternative, computationally more demanding approach to regularize and solve (1) can be implemented using truncated singular value decomposition (TSVD) [48], [54]. To achieve robustness of the solution against noise and the uncertainties of the parameters of the reference scenario, the inversion is performed by implementing the following equation [23], [48]:

\chi = \sum_{n=1}^{N}\frac{1}{\sigma_n}\left\langle E_{s,z}, u_n\right\rangle v_n,  (3)

where ⟨·,·⟩ denotes the inner product, σ_n denotes the singular values (sorted in a decreasing order) of the linear operator L_{zz}, u_n and v_n are the singular vectors of L_{zz}, and N is the truncation index, whose choice should ensure a compromise between resolution and smoothness of the reconstruction and the stability of the solution against noise. TSVD belongs to the class of inverse filtering methods [54] and has frequently been applied to process GPR data [33], [49].
A performance comparison between TSVD and BPA for an FL-GPR was conducted in [23]. In Figure 3, we reproduce the images generated by the two methods using the near-field numerical data in [23]. The reconstruction capabilities of both schemes were investigated by analyzing the achievable resolution limits and considering the impact of rough surface clutter on image quality. It was shown that the two methods provide comparable imaging capabilities with few differences. More specifically, a microwave inverse imaging approach can provide improved target reconstructions over BPA, specifically enhancing the response of weak targets. On the other hand, BPA provides smoother and cleaner images that are less affected by environmental clutter. In general, BPA is preferred in the case of a large investigation domain and when implementing multiview and multiaperture strategies [19], [36], [53] or multilook incoherent processing [44]. An example of an FL-GPR image achieved with BPA based on a multiaperture strategy, i.e., integration of a certain number of FL-GPR scans selected along the track of the sensor platform [53], is shown in Figure 4. An FL-GPR image based on real data, described in [28] and [29], is depicted in Figure 5. The crosses with label P10 point to the nominal locations of plastic mines buried at a depth of 10 cm. The image is strongly cluttered with contributions from the rough surface.
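To make the discretized adjoint solution of (2) concrete, the sketch below builds a toy single-frequency-per-row, scalar version of the linear operator (a matrix of two-way propagation phase terms between each antenna pair and each pixel) and forms the backpropagation image by applying its conjugate transpose to the measured data. It is a minimal scalar illustration under assumed geometry, frequencies, and scatterer positions, not the dyadic, multilayer formulation of (1).

```python
import numpy as np

C0 = 3e8  # speed of light (m/s)

def forward_operator(tx, rx_positions, pixels, freqs):
    """Scalar Born-type operator: rows index (frequency, receiver), columns index pixels."""
    rows = []
    for f in freqs:
        k = 2 * np.pi * f / C0
        for r in rx_positions:
            d = np.linalg.norm(pixels - tx, axis=1) + np.linalg.norm(pixels - r, axis=1)
            rows.append(np.exp(-1j * k * d))   # two-way propagation phase to each pixel
    return np.array(rows)

# Assumed geometry: one transmitter, 16 receivers on a 2-m line, 30 x 30 pixel scene.
rng = np.random.default_rng(1)
tx = np.array([0.0, -5.0])
rx = np.stack([np.linspace(-1, 1, 16), -5.0 * np.ones(16)], axis=1)
gx, gy = np.meshgrid(np.linspace(-2, 2, 30), np.linspace(0, 4, 30))
pixels = np.stack([gx.ravel(), gy.ravel()], axis=1)
freqs = np.linspace(0.5e9, 1.5e9, 16)

L = forward_operator(tx, rx, pixels, freqs)

# Synthetic scene with two point scatterers, plus measurement noise.
chi_true = np.zeros(pixels.shape[0], dtype=complex)
chi_true[[200, 650]] = 1.0
data = L @ chi_true + 0.1 * rng.standard_normal(L.shape[0])

# Backpropagation (adjoint) image, as in (2): chi = L^* data.
chi_bpa = L.conj().T @ data
image = np.abs(chi_bpa).reshape(30, 30)
print("brightest pixel index:", int(np.argmax(image)))
```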
FIGURE 3. Reconstructed images for plastic and metallic targets, both on top of and buried below a rough surface with an rms height equal to 0.8 cm and correlation length of 14.93 cm [23]. The amplitude is normalized to the maximum and expressed in decibels over the interval [–50, 0]. (a) TSVD inversion with a truncation index of N = 180. (b) Adjoint inversion. The range between the strongest and weakest target is around 45 dB for the TSVD and nearly 48 dB for the adjoint method. (Source: [23].)

FIGURE 4. The normalized BPA tomographic reconstruction, on the decibel scale, achieved by the integration of the sets of eight FL-GPR apertures. The processed numerical data are described in [37]. The true target positions are indicated with red crosses. (Source: [53].)
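The TSVD inversion in (3) can be prototyped in a few lines once a discretized operator is available; the snippet below applies it to a small random operator so that it stays self-contained. The operator, data, and truncation indices are placeholders chosen for illustration; in practice L would be the discretized operator of (1) and the truncation index N would be tuned as discussed above.

```python
import numpy as np

def tsvd_inversion(L, data, n_trunc):
    """Truncated SVD solution: sum over n < N of (1/sigma_n) <data, u_n> v_n."""
    U, s, Vh = np.linalg.svd(L, full_matrices=False)
    coeffs = (U[:, :n_trunc].conj().T @ data) / s[:n_trunc]   # <data, u_n> / sigma_n
    return Vh[:n_trunc, :].conj().T @ coeffs                   # combine right singular vectors

# Toy problem: random complex operator, sparse reflectivity, noisy data (all assumptions).
rng = np.random.default_rng(0)
L = rng.standard_normal((120, 400)) + 1j * rng.standard_normal((120, 400))
chi_true = np.zeros(400, dtype=complex)
chi_true[[50, 300]] = 1.0
data = L @ chi_true + 0.05 * (rng.standard_normal(120) + 1j * rng.standard_normal(120))

for N in (10, 60, 120):
    chi_hat = tsvd_inversion(L, data, N)
    err = np.linalg.norm(chi_hat - chi_true) / np.linalg.norm(chi_true)
    print(f"N = {N:3d}: relative error {err:.2f}")
```

Sweeping N in this way mirrors the compromise noted above: too small a truncation index over-smooths the reconstruction, while too large a value amplifies noise through the small singular values.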
The aforementioned tomographic imaging methods are based on free-space approximation, neglecting the presence of the air-to-ground interface and assuming the propagation as occurring in a homogeneous dielectric medium. The performance of the approximate free-space tomographic imaging was contrasted in [55] with that of a tomographic algorithm that accounts for the presence of the actual half-space geometry. The latter implements the spectral representation of the dyadic Green's function. Using numerical electromagnetic FL-GPR data, the authors in [55] demonstrated that a free-space approximation can lead to a loss of imaging resolution and degradation in the SCR, as compared to its half-space counterpart. The impact of the lower resolution was also observed in the estimated target statistics [53].

FIGURE 5. An FL-GPR image of plastic mines. The crosses point to the nominal buried locations of the targets. The label P10 denotes a plastic mine buried at a depth of 10 cm. As expected, the image is strongly cluttered by the contribution from the rough surface. The blank region indicates where a strong stake (fiducial) return has been masked from the image to make the mine returns more visible. (Source: [30].)

DATA-ADAPTIVE AND COMPRESSIVE SENSING METHODS
A data-adaptive approach for FL-GPR image formation was proposed in [44] (Figure 6). It is based on amplitude and phase estimation and rank-deficient robust Capon beamforming. There were 12 evenly spaced scans (each scan covering 2 m in the down range) used to form the entire image, covering 24 m in total. The amplitude- and phase-estimation algorithm in conjunction with the robust Capon beamformer provided a significantly enhanced image quality compared to BPA.
Compressive sensing (CS) methods can also be applied to exploit the intrinsic sparsity of the illuminated scene in terms of the number of buried targets. A CS approach was employed in [56] for scene reconstruction using measurements from a MIMO FL-GPR system. Assuming a linear model relating the measured data and the unknown scene reflectivity, the image formation can be posed as a solution to an inverse problem regularized by a sparsity-inducing norm. This framework permits scene reconstruction with spatial and temporal sampling at sub-Nyquist rates. In real environments, even with few targets, there exists strong clutter that populates and subsequently degrades the quality of the reconstructed image. This is because the rough surface clutter in the FL-GPR can be distributed over the entire region. An FL-GPR image from [56], generated by processing real data of a shallow-buried metallic antitank landmine using the CS technique, is depicted in Figure 7. Clearly, without clutter suppression, it is difficult to distinguish the target from the clutter in the reconstructed image.

FIGURE 6. Real-data-based, single-look imaging results: (a) a BPA imaging result and (b) the results of a hybrid of amplitude- and phase-estimation algorithm and robust Capon beamformer. (Source: [44].)

FIGURE 7. A CS image of sparse data without clutter suppression. (Source: [56].)
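As an illustration of sparsity-regularized image formation of the kind used in [56], the sketch below solves a small l1-regularized least-squares problem with the iterative soft-thresholding algorithm (ISTA). The operator, data, regularization weight, and iteration count are illustrative assumptions; this is not the actual MIMO FL-GPR measurement model or the parameter-selection rule of [56].

```python
import numpy as np

def soft_threshold(x, t):
    """Complex soft-thresholding operator used in l1-regularized problems."""
    mag = np.abs(x)
    return np.where(mag > t, (1 - t / np.maximum(mag, 1e-12)) * x, 0.0)

def ista(L, data, lam, n_iter=200):
    """Minimize ||L x - data||_2^2 + lam * ||x||_1 by iterative soft thresholding."""
    step = 1.0 / (2.0 * np.linalg.norm(L, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    x = np.zeros(L.shape[1], dtype=complex)
    for _ in range(n_iter):
        grad = 2.0 * L.conj().T @ (L @ x - data)
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Toy sparse-scene problem (assumed sizes and noise level).
rng = np.random.default_rng(2)
L = rng.standard_normal((100, 400)) + 1j * rng.standard_normal((100, 400))
x_true = np.zeros(400, dtype=complex)
x_true[[40, 220, 390]] = [1.0, 0.7, 0.5]
data = L @ x_true + 0.05 * (rng.standard_normal(100) + 1j * rng.standard_normal(100))

x_hat = ista(L, data, lam=2.0)
print("pixels with |x| > 0.1:", int(np.count_nonzero(np.abs(x_hat) > 0.1)))
```

The regularization weight lam plays the same role as the parameter that [56] tunes iteratively: larger values suppress weak clutter-like pixels at the risk of erasing weak targets, smaller values retain more of the scene but keep more clutter.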
CLUTTER-MITIGATION STRATEGIES
The standoff-sensing capability of FL-GPR comes at the expense of the energy backscattered by the illuminated targets. The weak target responses are vulnerable to interference scattering arising from the air–ground interface roughness and subsurface clutter. Therefore, it is imperative to eliminate or significantly reduce the clutter for effective and reliable target detection.
Over many years, considerable attention has been devoted to the suppression of clutter generated by the ground bounce in the down-looking (DL) configuration, wherein the detection of objects buried at large depths (on the order of tens of centimeters) is possible [54], [57]–[63]. Since the ground bounce in DL is typically from a fixed range and has the highest strength, it is conventionally removed by estimating and subtracting the ground return from the measured signals or via time gating [64]. In FL-GPR sensing, however, the rough air–ground interface creates clutter that is essentially distributed over the entire area illuminated by the sensor. Additionally, reflections generated by rocks and other objects lying on the surface above the targets can be the source of strong clutter or false alarms (see, e.g., [53], [65], and [66]). Since the illuminated area in the FL-GPR usually extends beyond the image region where the targets reside, strong clutter can also derive from nearby shrubbery, rocks, and other objects lying on the surface. Because of these factors, clutter-suppression approaches devised for the DL configuration may not directly apply to FL-GPR.
Figure 8 shows an image from [56] obtained by applying BPA to real data corresponding to a shallow-buried landmine in a road 6 m wide, with rocks and shrubs populating the roadside. The various types of clutter are clearly visible in the image. More specifically, in addition to the clutter in the image region, strong azimuth clutter and short-range clutter are also visible. The former is due to large shrubs and on-surface rocks on the side of the road, while the latter is associated with ranges adjacent to the radar system that cause returns with small propagation delays.

FIGURE 8. An FL-GPR image showing different types of clutter. The data are acquired by a vehicle-mounted stepped-frequency FL-GPR virtual aperture radar, and BPA is used to generate the image. (Source: [56].)

Some research efforts have been devoted to rough surface clutter characterization and reduction in FL-GPR (see, e.g., [20], [56], and [67] and the references therein). One of the first attempts to characterize rough surface clutter in FL-GPR is documented in [68], where plane-wave time-domain scattering from a fixed target in the presence of a rough surface was numerically solved by means of an FDTD algorithm. The authors examined the statistics of the pulse scattered from the surface and applied conventional matched filtering for target detection.
A method based on the scattering solution through physical optics was proposed in [69]. The authors demonstrated that, by analyzing both scattering amplitude and phase as well as employing time-frequency signal representations, it is possible to suppress clutter and improve target-detection performance over conventional approaches based on background subtraction or parameter analysis [70].
An analytical approach was developed in [71] to examine the impact of the rough surface on the detection of buried targets in FL-GPR. This approach quantified the coherent and incoherent components of the cross section of buried targets using physical-optics approximation. The total received signal from the targets and surrounding clutter was determined to consist of three components: a coherent signal (whose phase is well defined and can be tracked) corresponding to the target, an incoherent signal (whose phase is random)
generated by the target, and an incoherent clutter contribution. As such, the problem of subsurface target detection can rely on the identification of a partially coherent broadband signal in the presence of noise. This approach, however, would require the design of a coherent system, which is complicated and expensive. Further, it could fail not only in the presence of strong surface roughness profiles or inhomogeneities but also under weak target response (i.e., dielectric) when the useful signal can lose its partially coherent nature. Nonetheless, the main analytical approaches are based on physical-optics scattering and a Gaussian representation of the correlation function of the rough soil; these assumptions, however, may not represent all possible realistic conditions.
To overcome some intrinsic limitations of the analytical approaches and provide a more realistic prediction of backscattering in FL-GPR systems, in both the presence and absence of buried targets, a full-wave solution based on an FDTD modeling of dispersive soil (i.e., described by a frequency-dependent permittivity) was proposed in [72]. This work also developed a statistical analysis of the rough-surface scattering, constituting one of the first attempts at the application of optimum hypothesis testing to solve the problem of the detection of radar returns from buried mines in the presence of rough surface clutter.
The effects of surface clutter on time-reversal-based FL-GPR imaging were investigated numerically in [73], where a large realistic scene consisting of landmines buried under a rough surface was considered. This work emphasized the role of the polarization of the incidence wave and impact of the surface parameters on the dynamic range of the radar images comprising both clutter and metallic/dielectric targets. The impact of target orientation was also considered therein.
Following a similar full-wave approach, the authors in [74] characterized clutter in the image domain and proposed a statistical polarimetric approach for the reduction of the rough surface clutter to improve the signal-to-background ratio. Specifically, the method was based on the analysis of the polarimetric coherence of the backscattered signal, which is assumed to be zero for the rough surface clutter and nonzero for human-made discrete targets.
A synthetic aperture near-field beamforming approach was used to reduce clutter for antitank mine detection in [31], which also proposed a statistical analysis of the signal and clutter contributions based on real data from metallic and plastic landmines. This work provided useful insights into the relative intensity of clutter and targeted echoes, discussing the challenging nature of the detection of plastic materials.
In [36], [37], [44], and [53], the authors demonstrated the advantage of coherently integrating measurements corresponding to multiple consecutive platform positions for rough surface clutter reduction. The approach simply takes advantage of the coherent and static nature (when observed from different spatial positions) of the scattering generated by human-made targets with respect to the rough surface contribution.
In [67] and [75], an alternative approach, based on the coherence factor (CF), was proposed for clutter reduction. The performance of the CF-based approach was quantified in terms of the SCR in the image domain. The approach leveraged the matched filtering formulation of microwave near-field tomographic imaging to define the CF for a multiantenna FL-GPR system. The CF was used to generate a coherence map of the region of interest, which was then applied as a mask to the original tomographic image. Since the CF map assumes small values for low-coherence image regions, which correspond to strong rough surface clutter contributions, the final image has significantly reduced clutter and is more amenable to the implementation of a subsequent target-detection procedure. A comparative example is shown in Figure 9.

FIGURE 9. The CF-based imaging results using FL-GPR numerical data of plastic and metallic landmines: images (a) before clutter suppression and (b) after CF-based enhancement. (Source: [67].)

In [56], a clutter-suppression method, in conjunction with CS imaging, described in the "Image Formation" section, was designed for a MIMO, array-based FL-GPR. A preprocessing method was proposed for reducing the azimuth and short-range clutter localized in specific regions outside of the image area, as depicted in Figure 8. This was achieved by implementing azimuth filtering on sparse-array data and range-profile domain suppression via an inverse Fourier transform. The clutter-suppressed version of the CS-based image in Figure 7 is depicted in Figure 10, where the impact of the clutter reduction method is clearly visible. The clutter
in the region containing the targets, on the other hand, can be reduced by fine-tuning the regularization parameter associated with the sparsity-based inverse problem. Toward this end, an iterative procedure was implemented in [56] to estimate an optimum regularization parameter in the presence of rough surface clutter, based on the ratio of clean areas within the image with respect to cluttered regions.

FIGURE 10. A CS image of Figure 7 after clutter suppression. (Source: [56].)

TARGET DETECTION
The presence of rough surface clutter in FL-GPR imagery renders the detection of on-surface and buried targets challenging. Owing to the oblique illumination in the FL configuration, only a small fraction of the transmitted energy is backscattered from the target and collected by the radar receiver. The deeper the burial depth of the target, the weaker the signal return. More importantly, due to the similar dielectric features of plastic targets and the surrounding soil (permittivity on the order of 3–4 for a dry background medium), the scatterers cannot be easily differentiated from clutter in both the spectral and image domains [11]. To this end, innovative statistical and spectral approaches have been devised in the literature to offer reliable target detection in FL-GPR applications. In the following sections, we group these approaches based on statistical and spectral methods.

STATISTICAL DETECTORS
In [28] and [29], the effectiveness of two statistical signal-processing techniques, namely, the polarimetric whitening filter and generalized likelihood ratio test (LRT), was investigated for different types of targets buried at various depths below the interface. The capability of these methods to detect metallic targets with high confidence was illustrated. However, an unsatisfactory detection performance for plastic mines was observed due to 1) a mismatch between the ground truth and assumed target and clutter statistics as well as 2) an incomplete exploitation of the target signatures.
A locally adaptive detection method that adjusted the detection criteria automatically and dynamically across different spatial regions of the FL-GPR image was proposed in [76]. In this work, an FL-GPR image was processed with a locally adaptive standard deviation filter to compute the standard deviation of a small neighborhood around each pixel of interest in the image. More specifically, prior to performing target detection, each image pixel value was replaced by the maximum pixel value within a rectangular neighborhood of dimensions equal to 3 m in the cross range and 1.5 m in the down range. Potential targets within the image were identified by performing the following operation:

A = \arg_{u,v}\left\{ G_f(u,v) \geq \min\left\{ O_f(u,v), -60 \right\} \right\},  (4)

where O_f(u,v) denotes the filtered image, A is the set of local-maxima locations, and G_f denotes the FL-GPR image. An empirical value of –60 dB was selected as the threshold. An example from [76] is depicted in Figure 11, where the associated false-alarm locations are indicated with white crosses. Expectedly, because of the nonoptimal choice of the threshold, the processed image still exhibits a considerable number of false alarms.
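A minimal version of this locally adaptive peak-picking step can be written with a sliding maximum filter, as sketched below. The neighborhood size, pixel spacing, and threshold follow the description above but are otherwise illustrative assumptions, and the detection rule is a common local-maximum-plus-floor reading of (4) rather than the exact implementation of [76].

```python
import numpy as np
from scipy.ndimage import maximum_filter

def locally_adaptive_detections(image_db, pixel_size_m=(0.1, 0.1),
                                window_m=(1.5, 3.0), threshold_db=-60.0):
    """Flag pixels that are local maxima of the image and exceed a global floor.

    image_db : 2D image in decibels, axes ordered (down range, cross range).
    window_m : neighborhood extent in meters (down range, cross range).
    """
    size = (max(1, int(round(window_m[0] / pixel_size_m[0]))),
            max(1, int(round(window_m[1] / pixel_size_m[1]))))
    local_max = maximum_filter(image_db, size=size)        # O_f: max-filtered image
    mask = (image_db >= local_max) & (image_db >= threshold_db)
    return np.argwhere(mask)                                 # set A of candidate locations

# Toy image (in dB): background clutter around -70 dB with two bright spots.
rng = np.random.default_rng(3)
img = -70.0 + 5.0 * rng.standard_normal((80, 160))
img[30, 40] = -20.0
img[55, 120] = -35.0
print(locally_adaptive_detections(img))
```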
An image-domain LRT-based detection strategy was proposed in [53], which exploits the intrinsic multiview nature of the FL-GPR configuration. The multiple views of the scene correspond to measurements from different positions along the platform trajectory. For an LRT detector, the exact statistics of the targets and clutter in the FL-GPR images need to be known a priori. To this end, clutter and target pixel sets, obtained from the training data, were used to determine the target and clutter statistics. The targets were represented by a three-component Gaussian mixture model, whereas the clutter was found to be Rayleigh distributed. Two different LRT detection strategies were employed for fusion of the multiview images. The first performed simultaneous detection and fusion within the LRT framework under the assumption of independent target and clutter statistics from one viewpoint to another. Mathematically, the pixelwise LRT applied on N_im images, {X_n(i, j); n = 1, 2, ..., N_im}, is given by

\mathrm{LR}(i,j) = \prod_{n=1}^{N_{\mathrm{im}}} \frac{p(X_n(i,j)\,|\,H_1)}{p(X_n(i,j)\,|\,H_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \gamma,   (5)

where p(X_n(i,j) | H_0) and p(X_n(i,j) | H_1) are the conditional probability density functions of the nth image under the
null (target absent) and alternative (target present) hypotheses, respectively, and data independence across the multiple views is assumed. By comparing the likelihood ratio with a threshold γ determined using the Neyman–Pearson theorem [77], a fused binary image F_f can be defined as

F_f(i,j) = \begin{cases} 1 & \text{if } \mathrm{LR}(i,j) > \gamma \\ 0 & \text{if } \mathrm{LR}(i,j) \leq \gamma. \end{cases}   (6)

The second method applied the LRT detector to individual images, followed by fusion of the detected binary images through a pixel-by-pixel multiplication. Since the clutter generates different image-domain signatures when observed from different viewpoints, both strategies take advantage of the clutter diversity provided by the multiple views, though the latter scheme does not require the data independence assumption across the multiple views.

An adaptive version of the LRT detector of [53] was proposed in [78] to allow enhanced multiview detection of low-signature targets in a rough surface clutter environment. To achieve a more accurate estimation of the image-domain statistics, the target and clutter distributions were iteratively adjusted by means of a two-step procedure. The first step aimed at separating the image into target and clutter regions, whereas the second step used the extracted target and clutter regions to update the target and clutter statistics. This process was repeated until convergence was achieved. A binary image from [78], corresponding to the image presented in Figure 4, is reported in Figure 12. In [78], it was shown that the adaptive detector can outperform its nonadaptive counterpart in terms of the false-alarm rate while providing comparable detection performance.

FIGURE 11. An FL-GPR-processed image in [76]. The × symbol indicates false alarms, + indicates a fiducial alarm, and a circle indicates a target. The panels correspond to two different regions along a track that have slightly different lengths: (a) –285 to –275 m and (b) –216 to –206 m. (Source: [76].)

FIGURE 12. The detection result for the image presented in Figure 4 obtained through the adaptive procedure [78]. The red areas indicate the detected target regions, while the black areas represent false alarms. (Source: [78].)

A robust LRT detector, under the assumption of independent views, was proposed in [79] for multiview FL-GPR imaging. Instead of modeling the distributions of the target and clutter pixels with parametric families, a band of feasible probability densities under each hypothesis was constructed using training data. The detector was then designed such that it minimized the maximum error probability over all feasible density pairs within the two bands. This relaxed the strong assumptions about the clutter and noise distributions, rendering the detector robust against statistical model deviations. The minimax approach is critical in cases where accurate estimation of the distribution of the background clutter may be challenging. A binary image from [79], corresponding to the image of Figure 4, is depicted in Figure 13. It was demonstrated that, as compared to detectors based on parametric models, robust detectors can lead to significantly reduced false-alarm rates, particularly in cases where there is a mismatch between the assumed model and the true distributions. Both the robust and parametric detectors were extended to incorporate statistical dependence between multiview images via a copula-based function in [80].

FIGURE 13. The detection result for the image presented in Figure 4 obtained using the robust LRT detector. (Source: [79].)
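The pixelwise fusion rule in (5) and (6) can be sketched as follows. The Gaussian mixture parameters for the target class, the Rayleigh scale for the clutter class, and the threshold γ are placeholders here; in [53] they are estimated from training pixels and from the Neyman–Pearson criterion, respectively.

```python
import numpy as np
from scipy.stats import norm, rayleigh

def lrt_fusion(images, gmm_w, gmm_mu, gmm_sigma, clutter_scale, gamma):
    """Multiview LRT fusion of (5)-(6).

    images : (N_im, rows, cols) stack of image intensities, one per view.
    The target density p(x|H1) is modeled as a Gaussian mixture (three
    components in [53]); the clutter density p(x|H0) is Rayleigh.
    """
    p_h1 = sum(w * norm.pdf(images, loc=m, scale=s)
               for w, m, s in zip(gmm_w, gmm_mu, gmm_sigma))
    p_h0 = rayleigh.pdf(images, scale=clutter_scale)
    lr = np.prod(p_h1 / (p_h0 + 1e-30), axis=0)   # product over views, (5)
    return (lr > gamma).astype(np.uint8)          # fused binary image F_f, (6)
```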
SPECTRAL APPROACHES
To improve the detection performance for plastic objects, a time-frequency approach was proposed in [30]. The detection problem was conventionally formulated by considering a signal corrupted by interference. To deal with the nonstationary nature of both the signal and clutter, the authors employed a time-frequency distribution to provide temporal localization of the signal spectral components. The detector considered the signal time-frequency representation based on the Choi–Williams distribution or, equivalently, the ambiguity function and applied discriminant features extracted using principal component analysis plus the linear discriminant method [81], [82]. The effectiveness of this approach and the employed detector was demonstrated using experimental results.

Frequency subband processing was used in [83], together with co- and cross-polarized signals, for enhanced target-detection performance in FL-GPR sensing. Images were formed using one wide subband and four narrow-frequency subbands within a 2.5-GHz signal bandwidth to analyze the frequency dependency of landmines and clutter. An FL-GPR image, corresponding to the copolarized (VV) signal over multiple subbands, is shown in Figure 14. It is evident that the clutter is particularly strong, but its distribution changes over the frequency bands considered.

FIGURE 14. An example of images achieved with the FL-GPR system in [83], keeping the illuminated scene and polarization (VV) unchanged but exploiting different frequency subbands: (a) 0.8–2.8, (b) 0.75–1.35, (c) 1.25–1.85, (d) 1.75–2.35, and (e) 2.25–2.85 GHz. The black circle indicates the true target location. (Source: [83].)

On this basis, a number of features, including the magnitude, local contrast, ratio between copolarized and cross-polarized signals, and features of polarimetric decompositions, were extracted at each narrow-frequency subband and employed to assess their role in target detection. For this purpose, the authors considered a Fisher's linear discriminant (FLD)-based classifier, which can be mathematically described as follows. With C_1 and C_2 representing the "false-alarm" and "target" classes, respectively, and given N training feature vectors {x_n, n = 1, ..., N} that have been labeled as C_1 or C_2, FLD finds the projection direction in the feature space that maximizes the class separation:

y_{\mathrm{FLD}} = (m_2 - m_1)\, S_w^{-1},   (7)

where m_i, i = 1, 2, is the mean feature vector of the ith class and S_w is the within-class scatter matrix, defined as

S_w = \frac{1}{N} \Big[ \sum_{x_n \in C_1} (x_n - m_1)(x_n - m_1)^T + \sum_{x_n \in C_2} (x_n - m_2)(x_n - m_2)^T \Big].   (8)

When an unlabeled testing data point is collected, its confidence value is determined by the projection of its feature vector on y_FLD, and it is classified by means of simple thresholding.
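A compact implementation of the FLD projection in (7) and (8) might look as follows; the feature vectors, labels, and decision threshold used here are placeholders.

```python
import numpy as np

def fld_train(X, y):
    """Fisher's linear discriminant of (7)-(8).

    X : (N, d) training feature vectors; y : labels, 0 for the
    'false-alarm' class C1 and 1 for the 'target' class C2.
    """
    N = X.shape[0]
    m1, m2 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    d1, d2 = X[y == 0] - m1, X[y == 1] - m2
    S_w = (d1.T @ d1 + d2.T @ d2) / N            # within-class scatter, (8)
    return (m2 - m1) @ np.linalg.pinv(S_w)       # projection direction, (7)

def fld_classify(X_test, y_fld, threshold=0.0):
    """Project unlabeled features on y_FLD and threshold the confidence."""
    return (X_test @ y_fld > threshold).astype(int)
```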
In [84], the authors performed target detection using space–wavenumber processing and a feature-based method, employing data measured by means of a vehicle-mounted FL-GPR equipped with a MIMO array. The approach was applied in the image domain and relied on the definition of a bistatic scattering function associated with selected pixels. A set of images achieved with different incident and scattering angles was used to estimate the bistatic scattering function. Experimental results demonstrated that the proposed method can offer an efficient feature vector for landmine discrimination.

An original approach to processing measured data collected at the U.S. Army test site using the radar system developed by PSI was proposed in [85]. The method relied on the definition of a set of spatial lanes in the radar image. The identification of potential targets was first performed independently in each subregion, and the targets were then tracked across the subregions. Weighted averages of the corresponding geometrical features were evaluated, and the target persistence across the spatial regions was used to reduce the false-alarm rate. Targets appearing in only a limited number of lanes were removed as part of the detection scheme.

An analysis of the spectral features extracted from the scattered signal, with the goal of improving the performance of buried-explosive-hazard detection, was provided in [86]. Natural resonant frequency and polarization features of improvised explosive devices were examined in [87] for FL-GPR. In [76], a spectrum-based classifier was proposed that rejected false alarms by classifying each potential target based on its spatial frequency spectrum.

A method based on the use of narrow-band and full-band radar processing, coupled with a classifier exploiting complex-valued Gabor filter responses, was proposed in [66]. Full-band radar images yielded high spatial resolution, while narrow-band images provided the means to detect targets with unique signatures. A composite confidence map was implemented to detect local maxima and isolate potential target pixels.

FUTURE TRENDS
A completely different radar-based approach from FL-GPR to road mapping and clearing is to employ a traditional airborne, side-looking SAR system flying on a track parallel to the road and imaging the ground area of interest. This approach has the advantages of a high coverage rate as well as the fact that the platform does not come in contact with the in-road hazard. Nevertheless, these radar systems operate at relatively long ranges (at least 1 km) and, consequently, require larger transmitted power and longer coherent integration intervals to achieve high image resolution. Both modeling and experimental studies have demonstrated the difficulty of detecting weak buried targets (such as plastic landmines) by side-looking GPR systems, even in mild clutter conditions [21], [88].

Recent advances in radar sensors based on unmanned aerial vehicle (UAV) platforms promise to bring together the advantages of both types of the aforementioned imaging systems [89]–[96]. Thus, a UAV-based SAR system can operate at small elevations and ranges, requiring a small amount of power and a short synthetic aperture length. At the same time, as the flying platform does not come in contact with the ground, the standoff range requirement relevant to ground-vehicle-borne systems does not apply in this case. A UAV-mounted radar system is likely to be significantly less costly to build and operate than any of the current airborne or FL-GPR systems, while the excellent control of UAV flight trajectories makes motion compensation easier to accomplish. The radar antenna can be readily configured as down-looking, side-looking, or FL, depending on the imaging application, while antenna arrays can be combined with the synthetic aperture created by platform motion. Another possible scenario is using a network of distributed UAV-based SAR imaging sensors working cooperatively to map a ground area.
Vehicle-borne radar systems will still have a role in explosive hazard detection; however, we envision that future trends will favor employing unmanned ground vehicle platforms for this application.

One potential capability of FL-GPR systems that has rarely been exploited in practice thus far is the creation of 3D images of the scene under investigation. By adding the height dimension to a radar map, one can infer the depth of a buried target and partially mitigate the surface clutter, which now affects only a limited part of the image volume. Additional target features inferred from the 3D map can also be useful in target-classification applications. One example of a GPR system operating on the FL SAR principle and capable of creating 3D images is the iRadar, developed by the Lawrence Livermore National Laboratory [97]. While the iRadar antenna array is mounted close to the ground and provides only a modest standoff range, one can envision a system equipped with a similar array mounted on a UAV flying at a height of 1–2 m over the road and mapping the area of interest, including the underground volume [96]. Although the surface clutter becomes less of an issue in the detection of buried targets in 3D images, underground inhomogeneities created by different soil layers, rocks, roots, and so on represent a new source of clutter that may degrade the detection and classification performance.

In addition to buried object detection, FL radar technology is finding use in other emerging applications. One example is attempting to exploit the 3D imaging capability of an FL radar to assist helicopter landing in degraded visual environments (DVEs), such as those created by brownout conditions. A prototype radar system based on this principle is currently being developed at ARL [98]. To achieve an angular resolution of 0.1–0.2°, comparable to optical sensors such as lidar, this radar system must operate in the millimeter-wave regime (Ka band). The wave attenuation through dust, sand storms, or other DVE conditions at these frequencies (less than 1 dB/km one way) is still low enough to provide a see-through capability, which is not available in infrared, optical, or lidar sensors. The 3D map of the landing zone obtained by the FL SAR would be interpreted in terms of natural or human-made terrain features, and this information would be passed to the pilot via
a helmet-mounted display to assist in deciding whether the landing zone is safe.

The principle of the helicopter-mounted FL SAR system for 3D landing zone mapping is explained in Figure 15. The system is equipped with a 2-m-wide, front-bumper-type linear antenna array, which provides resolution in the azimuth direction, while the forward motion of the platform at constant height creates a synthetic aperture with sufficient elevation look-angle diversity to offer resolution in the vertical direction. The radar waveform bandwidth (between 0.5 and 1 GHz) provides resolution in the down-range direction. To date, several studies based on computer simulations have demonstrated the feasibility of this concept and emphasized some of the major challenges associated with it.

FIGURE 15. (a) A schematic representation of the helicopter-borne FL SAR system for 3D landing-zone imaging, showing the proposed configuration involving a linear antenna array. (b) The equivalence of this imaging system with a 2D antenna array.

CONCLUSIONS
In this article, we presented an overview of image formation and subsurface target-detection techniques using FL-GPR. This emerging technology has gained increasing interest due to its humanitarian and military applications while maintaining operator safety. We provided a balanced account of existing methods and discussed their respective advantages and limitations. The presented image formation approaches included conventional back-projection, microwave tomographic techniques, and CS-based methods, with the last of these assuming the underlying scene to be sparse. We also outlined different approaches to deal with clutter arising from the rough ground interface. Finally, we detailed statistical and spectral techniques for landmine detection in FL-GPR applications.

While a broad range of imaging, target-detection, and clutter-suppression techniques have been proposed in the literature, there are still open issues, particularly associated with the detection of plastic targets and real-time operation under various challenging realistic conditions, that require further investigation. New machine learning algorithms could also be devised for target classification, especially in the presence of strong clutter. The future trend of radar deployment on unmanned platforms (both aerial and ground based) brings forth new challenges. From an implementation perspective, the antenna design is a critical issue, especially when using antenna arrays within the limited space available on an unmanned aerial platform. At the preferred operational frequency range of 0.3–3 GHz, depending on the radiation performance, the antennas can be quite bulky and heavy. Compact designs using metamaterial-based UWB conformal antenna technology are a promising potential solution to the implementation challenges. From an algorithmic perspective, multiplatform data fusion strategies under both communication and computation constraints could be devised to achieve enhanced performance using a distributed network of unmanned platforms. In short, research into devising effective solutions for addressing the aforementioned challenges is critical to providing performance guarantees.
AUTHOR INFORMATION
Davide Comite (davide.comite@uniroma1.it) received his master's degree (cum laude) in communications engineering and Ph.D. degree in electromagnetics and mathematical models for engineering from the Sapienza University of Rome, Rome, Italy, in 2011 and 2015, respectively. He was a visiting Ph.D. student with the Institute of Electronics and Telecommunications of Rennes, University of Rennes 1, Rennes, France, from March to June 2014 and a postdoctoral researcher with the Center for Advanced Communications, Villanova University, Villanova, Pennsylvania, USA, in 2015. He is currently a postdoctoral researcher with the Sapienza University of Rome, Rome, 00184, Italy. His research interests include the study of scattering from natural surfaces as well as GNSS reflectometry over land, microwave imaging and object detection performed through ground-penetrating radar, modeling of the radar signature in forward-scatter radar systems, the study and design of leaky-wave antennas, and the generation of nondiffracting waves and pulses. He has been a recipient of a number of awards at national and international conferences. Most recently, he received a Young Scientist Award for the General Assembly and Scientific Symposium of the International Union
of Radio Science (URSI) 2020. In 2019 and 2020, the IEEE Antennas and Propagation Society recognized him as an Outstanding Reviewer for IEEE Transactions on Antennas and Propagation, and he was honored as the best reviewer for IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing in 2020. He is an associate editor of the Journal of Engineering and of Microwaves, Antennas, and Propagation, both published by the Institution of Engineering and Technology, as well as of IEEE Access. He is a Senior Member of IEEE and of URSI.

Fauzia Ahmad (fauzia.ahmad@temple.edu) received her Ph.D. degree in electrical engineering from the University of Pennsylvania in 1997. She is an associate professor in the Department of Electrical and Computer Engineering, Temple University, Philadelphia, Pennsylvania, 19122, USA. Prior to joining Temple University in 2016, she was a research professor and the director of the Radar Imaging Lab at the Center for Advanced Communications, Villanova University. She has more than 250 publications in the areas of array and statistical signal processing, computational imaging with applications in radar and ultrasonics, compressive sensing, machine learning, radar signal processing, and structural health monitoring. She is a Fellow of IEEE and of the Society of Photo-Optical Instrumentation Engineers (SPIE). She is the past chair of the IEEE Dennis J. Picard Medal for Radar Technologies and Applications Committee and of the SPIE Compressive Sensing Conference series. She currently chairs the SPIE Big Data Conference series. She is a member of the Sensor Array and Multichannel Technical Committee of the IEEE Signal Processing Society, a member of the Computational Imaging Technical Committee of the IEEE Signal Processing Society, and a member of the Electrical Cluster of the Franklin Institute Committee on Science and the Arts. She also serves as an associate editor of IEEE Transactions on Computational Imaging and IET Radar, Sonar, & Navigation.

Moeness G. Amin (moeness.amin@villanova.edu) received his Ph.D. degree in 1984 from the University of Colorado, Boulder. Since 1985, he has been on the faculty of the Department of Electrical and Computer Engineering at Villanova University, Villanova, Pennsylvania, 19085, USA, where he is now a professor and director of the Center for Advanced Communications. He is a Fellow of IEEE, the Society of Photo-Optical Instrumentation Engineers, the Institution of Engineering and Technology (IET), and the European Association for Signal Processing (EURASIP). He is a recipient of the U.S. Fulbright Distinguished Chair in Advanced Science and Technology, the Alexander von Humboldt Research Award, the IET Achievement Medal, the IEEE Warren D. White Award for Excellence in Radar Engineering, the IEEE Signal Processing Society Technical Achievement Award, the NATO Scientific Achievement Award, the EURASIP Technical Achievement Award, and the IEEE Third Millennium Medal. He was a Distinguished Lecturer of the IEEE Signal Processing Society. He has more than 850 journal and conference publications in the areas of wireless communications, time–frequency analysis, sensor array processing, satellite navigation, ultrasound imaging, and radar signal processing. He is a recipient of 12 best paper awards. He is the editor of three books from CRC Press: Through-the-Wall Radar Imaging (2011), Compressive Sensing for Urban Radar (2014), and Radar for Indoor Monitoring (2017). He serves on the editorial board of Proceedings of the IEEE.
Traian Dogaru (traian.v.dogaru.civ@mail.mil) received his degree in electrical engineering from the Polytechnic University of Bucharest, Bucharest, Romania, in 1990 and his M.S. and Ph.D. degrees in electrical engineering from Duke University, Durham, North Carolina, USA, in 1997 and 1999, respectively. He was a research associate with Duke University, developing algorithms for electromagnetic field modeling, between 1999 and 2001. He has been with the U.S. Army Research Laboratory, Adelphi, Maryland, 20783, USA, since 2001. His research interests include radar signature modeling, computational electromagnetics, signal processing, radar imaging and detection of concealed targets, sensing through the wall, foliage penetration, and ground-penetrating radar, as well as applying advanced computational modeling techniques to the analysis of complex sensing scenarios. He is a Member of IEEE.

REFERENCES
[1] A. P. Annan, “GPR—History, trends, and future developments,” Subsurface Sens. Technol. Appl., vol. 3, no. 4, pp. 253–270, 2002. doi: 10.1023/A:1020657129590.
[2] D. J. Daniels, “A review of GPR for landmine detection,” Sens. Imag., vol. 7, no. 3, p. 90, 2006. doi: 10.1007/s11220-006-0024-5.
[3] M. Sato, “Principles of mine detection by ground-penetrating radar,” in Anti-personnel Landmine Detection for Humanitarian Demining. Berlin: Springer-Verlag, 2009, pp. 19–26.
[4] I. Catapano, G. Gennarelli, G. Ludeno, F. Soldovieri, and R. Persico, “Ground-penetrating radar: Operation principle and data processing,” in Wiley Encyclopedia Elect. Electron. Eng. Hoboken, NJ: Wiley, 2019, pp. 1–23.
[5] L. Robledo, M. Carrasco, and D. Mery, “A survey of land mine detection technology,” Int. J. Remote Sens., vol. 30, no. 9, pp. 2399–2410, 2009. doi: 10.1080/01431160802549435.
[6] W. R. Scott, K. Kim, G. D. Larson, A. C. Gurbuz, and J. H. McClellan, “Combined seismic, radar, and induction sensor for landmine detection,” J. Acoust. Soc. Amer., vol. 123, no. 5, pp. 3042–3042, 2008. doi: 10.1121/1.2932726.
[7] C. R. Ratto, P. A. Torrione, and L. M. Collins, “Exploiting ground-penetrating radar phenomenology in a context-dependent framework for landmine detection and discrimination,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 5, pp. 1689–1700, 2010. doi: 10.1109/TGRS.2010.2084093.
[8] M. G. Fernández et al., “Synthetic aperture radar imaging system for landmine detection using a ground penetrating radar on board an unmanned aerial vehicle,” IEEE Access, vol. 6, pp. 45,100–45,112, 2018.
[9] S. Vitebskiy and L. Carin, “Resonances of perfectly conducting wires and bodies of revolution buried in a lossy dispersive half-space,” IEEE Trans. Antennas Propag., vol. 44, no. 12, pp. 1575–1583, 1996. doi: 10.1109/8.546243.
[10] I. J. Gupta, A. van der Merwe, and C.-C. Chen, “Extraction of complex resonances associated with buried targets,” in Proc. SPIE Detection Remediation Technol. Mines Minelike Targets III, 1998, vol. 3392, pp. 1022–1032. doi: 10.1117/12.324149.
[11] L. Carin, N. Geng, M. McClure, J. Sichina, and L. Nguyen, “Ultrawide-band synthetic-aperture radar for mine-field detection,” IEEE Antennas Propag. Mag., vol. 41, no. 1, pp. 18–33, 1999. doi: 10.1109/74.755021.
[12] R. Persico, Introduction to Ground Penetrating Radar: Inverse Scattering and Data Processing. Hoboken, NJ: Wiley, 2014.
[13] M. El-Shenawee and C. M. Rappaport, “Monte Carlo simulations for clutter statistics in minefields: AP-mine-like-target buried near a dielectric object beneath 2-D random rough ground surfaces,” IEEE Trans. Geosci. Remote Sens., vol. 40, no. 6, pp. 1416–1426, 2002. doi: 10.1109/TGRS.2002.800275.
[14] A. C. Gurbuz, J. H. McClellan, and W. R. Scott, “A compressive sensing data acquisition and imaging method for stepped frequency GPRs,” IEEE Trans. Signal Process., vol. 57, no. 7, pp. 2640–2650, 2009. doi: 10.1109/TSP.2009.2016270.
[15] D. Comite, A. Galli, I. Catapano, and F. Soldovieri, “Advanced imaging for down-looking contactless GPR systems,” Appl. Comput. Electromagn. Soc. J., vol. 33, no. 7, pp. 1–4, 2017.
[16] G. Ludeno, G. Gennarelli, S. Lambot, F. Soldovieri, and I. Catapano, “A comparison of linear inverse scattering models for contactless GPR imaging,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 10, pp. 7305–7316, 2020. doi: 10.1109/TGRS.2020.2981884.
[17] R. Solimene, I. Catapano, G. Gennarelli, A. Cuccaro, A. Dell’Aversano, and F. Soldovieri, “SAR imaging algorithms and some unconventional applications: A unified mathematical overview,” IEEE Signal Process. Mag., vol. 31, no. 4, pp. 90–98, 2014. doi: 10.1109/MSP.2014.2311271.
[18] T. Dogaru, “NAFDTD—A near-field finite difference time domain solver,” Army Research Lab., Sensors and Electronic Devices Directorate, Adelphi, MD, Tech. Rep. ARL-TR-6110, 2012.
[19] D. Comite, F. Ahmad, M. Amin, and T. Dogaru, “Multi-aperture processing for improved target detection in forward-looking GPR applications,” in Proc. Eur. Conf. Antennas Propag., 2016, pp. 1–3.
[20] M. M. Tajdini, B. Gonzalez-Valdes, J. A. Martinez-Lorenzo, A. W. Morgenthaler, and C. M. Rappaport, “Real-time modeling of forward-looking synthetic aperture ground penetrating radar scattering from rough terrain,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 5, pp. 2754–2765, May 2019. doi: 10.1109/TGRS.2018.2876808.
[21] L. Nguyen, K. Ranney, K. Sherbondy, and A. Sullivan, “Detection of buried in-road IED targets using airborne ultra-wideband (UWB) low-frequency SAR,” in Proc. 60th MSS Tri-Service Radar Symp., 2014.
[22] J. A. Camilo, J. M. Malof, P. A. Torrione, L. M. Collins, and K. D. Morton Jr., “Clutter and target discrimination in forward-looking ground penetrating radar using sparse structured basis pursuits,” in Proc. SPIE Detection Sens. Mines, Explosive Objects, and Obscured Targets XX, 2015, vol. 9454, p. 94540V. doi: 10.1117/12.2176491.
[23] F. Soldovieri, G. Gennarelli, I. Catapano, D. Liao, and T. Dogaru, “Forward-looking radar imaging: A comparison of two data processing strategies,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 2, pp. 562–571, 2016. doi: 10.1109/JSTARS.2016.2543840.
[24] A. W. Morgenthaler and C. M.
Rappaport, “Scattering from lossy dielectric objects buried beneath randomly rough ground: Validating the semi-analytic mode matching algorithm with 2-D FDFD,” IEEE Trans. Geosci. Remote Sens., vol. 39, no. 11, pp. 2421–2428, 2001. doi: 10.1109/36.964978.
[25] J. T. Johnson and R. J. Burkholder, “Coupled canonical grid/discrete dipole approach for computing scattering from objects above or below a rough interface,” IEEE Trans. Geosci. Remote Sens., vol. 39, no. 6, pp. 1214–1220, 2001. doi: 10.1109/36.927443.
[26] D. Comite, A. Galli, I. Catapano, and F. Soldovieri, “The role of the antenna radiation pattern in the performance of a microwave tomographic approach for GPR imaging,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 10, pp. 4337–4347, 2017. doi: 10.1109/JSTARS.2016.2636833.
[27] J. Kositsky and P. Milanfar, “Forward-looking high-resolution GPR system,” in Proc. SPIE Detection and Remediation Technol. Mines Minelike Targets IV, 1999, vol. 3710, pp. 1052–1062.
[28] J. Kositsky and C. A. Amazeen, “Results from a forward-looking GPR mine detection system,” in Proc. SPIE Detection and Remediation Technol. Mines and Minelike Targets VI, 2001, vol. 4394, pp. 700–711.
[29] J. Kositsky, R. Cosgrove, C. A. Amazeen, and P. Milanfar, “Results from a forward-looking GPR mine detection system,” in Proc. SPIE Detection and Remediation Technol. Mines Minelike Targets VII, 2002, vol. 4742, pp. 206–217.
[30] Y. Sun and J. Li, “Time–frequency analysis for plastic landmine detection via forward-looking ground penetrating radar,” Inst. Elect. Eng. Proc. Radar, Sonar, Navigation, vol. 150, no. 4, pp. 253–261, 2003.
[31] M. R. Bradley, T. R. Witten, M. Duncan, and R. McCummins, “Anti-tank and side-attack mine detection with a forward-looking GPR,” in Proc. SPIE Detection and Remediation Technol. Mines Minelike Targets IX, 2004, vol. 5415, pp. 421–432.
[32] M. Ressler, L. Nguyen, F. Koenig, D. Wong, and G. Smith, “The Army Research Laboratory (ARL) synchronous impulse reconstruction (SIRE) forward-looking radar,” in Proc. SPIE Unmanned Systems Technology IX, 2007, vol. 6561, pp. 35–46.
[33] I. Catapano, A. Affinito, A. Del Moro, G. Alli, and F. Soldovieri, “Forward-looking ground-penetrating radar via a linear inverse scattering approach,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 10, pp. 5624–5633, 2015. doi: 10.1109/TGRS.2015.2426502.
[34] B. R. Phelan, K. D. Sherbondy, K. I. Ranney, and R. M. Narayanan, “Design and performance of an ultra-wideband stepped-frequency radar with precise frequency control for landmine and IED detection,” in Proc. SPIE Radar Sensor Technology XVIII, 2014, vol. 9077, pp. 53–64.
[35] B. R. Phelan, K. I. Ranney, K. A. Gallagher, J. T. Clark, K. D. Sherbondy, and R. M. Narayanan, “Design of ultrawideband stepped-frequency radar for imaging of obscured targets,” IEEE Sensors J., vol. 17, no. 14, pp. 4435–4446, 2017. doi: 10.1109/JSEN.2017.2707340.
[36] T. Ton, D. Wong, and M. Soumekh, “ALARIC forward-looking ground penetrating radar system with standoff capability,” in Proc. IEEE Int. Conf. Wireless Information Technol. Syst., 2010, pp. 1–4.
[37] D. Liao, T. Dogaru, and A. Sullivan, “Large-scale, full-wave-based emulation of step-frequency forward-looking radar imaging in rough terrain environments,” Sens. Imag., vol. 15, no. 1, p. 88, 2014.
[38] D. Liao and T. Dogaru, “Full-wave characterization of rough terrain surface scattering for forward-looking radar applications,” IEEE Trans. Antennas Propag., vol. 60, no. 8, pp. 3853–3866, 2012. doi: 10.1109/TAP.2012.2201076.
[39] M. M. Tajdini, A. W. Morgenthaler, and C. M. Rappaport, “Multiview synthetic aperture ground-penetrating radar detection in rough terrain environment: A real-time 3-D forward model,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 5, pp. 3400–3410, 2019. doi: 10.1109/TGRS.2019.2954776.
[40] M. Pastorino, Microwave Imaging, vol. 208. Hoboken, NJ: Wiley, 2010.
[41] G. A. McMechan, “A review of seismic acoustic imaging by reverse-time migration,” Int. J. Imag. Syst. Technol., vol. 1, no. 1, pp. 18–21, 1989. doi: 10.1002/ima.1850010104.
[42] C. Özdemir, Ş. Demirci, E. Yiğit, and B. Yilmaz, “A review on migration methods in B-scan ground penetrating radar imaging,” Math. Problems Eng., vol. 2014, pp. 1–17, June 2014. doi: 10.1155/2014/280738.
[43] J. M. Lopez-Sanchez and J. Fortuny-Guasch, “3-D radar imaging using range migration techniques,” IEEE Trans. Antennas Propag., vol. 48, no. 5, pp. 728–737, 2000. doi: 10.1109/8.855491.
[44] Y. Wang, Y. Sun, J. Li, and P. Stoica, “Adaptive imaging for forward-looking ground penetrating radar,” IEEE Trans. Aerosp. Electron. Syst., vol. 41, no. 3, pp. 922–936, 2005. doi: 10.1109/TAES.2005.1541439.
[45] J. Gazdag, “Wave equation migration with the phase-shift method,” Geophysics, vol. 43, no. 7, pp. 1342–1351, 1978. doi: 10.1190/1.1440899.
[46] I. Catapano, F. Soldovieri, G. Alli, G. Mollo, and L. A. Forte, “On the reconstruction capabilities of beamforming and a microwave tomographic approach,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 12, pp. 2369–2373, 2015. doi: 10.1109/LGRS.2015.2476514.
[47] W. C. Chew, Waves and Fields in Inhomogeneous Media. Piscataway, NJ: IEEE Press, 1995.
[48] M. Bertero and P. Boccacci, Introduction to Inverse Problems in Imaging. Boca Raton, FL: CRC Press, 1998.
[49] G. Leone and F. Soldovieri, “Analysis of the distorted Born approximation for subsurface reconstruction: Truncation and uncertainties effects,” IEEE Trans. Geosci. Remote Sens., vol. 41, no. 1, pp. 66–74, 2003. doi: 10.1109/TGRS.2002.806999.
[50] P. Meincke, “Linear GPR inversion for lossy soil and a planar air-soil interface,” IEEE Trans. Geosci. Remote Sens., vol. 39, no. 12, pp. 2713–2721, 2001. doi: 10.1109/36.975005.
[51] T. B. Hansen and P. M. Johansen, “Inversion scheme for ground penetrating radar that takes into account the planar air-soil interface,” IEEE Trans. Geosci. Remote Sens., vol. 38, no. 1, pp. 496–506, 2000. doi: 10.1109/36.823944.
[52] A. Ben-Israel and T. N. Greville, Generalized Inverses: Theory and Applications, vol. 15. New York: Springer Science & Business Media, 2003.
[53] D. Comite, F. Ahmad, D. Liao, T. Dogaru, and M. G. Amin, “Multiview imaging for low-signature target detection in rough-surface clutter environment,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 9, pp. 5220–5229, 2017. doi: 10.1109/TGRS.2017.2703820.
[54] R. Solimene, A. Cuccaro, A. Dell’Aversano, I. Catapano, and F. Soldovieri, “Ground clutter removal in GPR surveys,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 3, pp. 792–798, 2013.
doi: 10.1109/JSTARS.2013.2287016.
[55] D. Comite, F. Ahmad, and T. Dogaru, “Performance of free-space tomographic imaging approximation for shallow-buried target detection,” in Proc. IEEE 7th Int. Workshop Comput. Adv. Multi-Sensor Adaptive Process., 2017, pp. 1–4.
[56] J. Yang, T. Jin, X. Huang, J. Thompson, and Z. Zhou, “Sparse MIMO array forward-looking GPR imaging based on compressed sensing in clutter environment,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 7, pp. 4480–4494, 2013. doi: 10.1109/TGRS.2013.2282308.
[57] L. M. van Kempen, H. Sahli, J. Brooks, and J. P. Cornelis, “New results on clutter reduction and parameter estimation for land mine detection using GPR,” in Proc. 8th Int. Conf. Ground Penetrating Radar, 2000, vol. 4084, pp. 872–879.
[58] F. Abujarad, A. Jostingmeier, and A. Omar, “Clutter removal for landmine using different signal processing techniques,” in Proc. Int. Conf. Ground Penetrating Radar, 2004, pp. 697–700.
[59] R. Wu et al., “Adaptive ground bounce removal,” Electron. Lett., vol. 37, no. 20, pp. 1250–1252, 2001. doi: 10.1049/el:20010855.
[60] R. Wu, J. Liu, Q. Gao, H. Li, and B. Zhang, “Progress in the research of ground bounce removal for landmine detection with ground penetrating radar,” PIERS Online, vol. 1, no. 3, pp. 336–340, 2005. doi: 10.2529/PIERS041130195615.
[61] G. Nadim, “Clutter reduction and detection of landmine objects in ground penetrating radar data using likelihood method,” in Proc. IEEE Int. Symp. Commun., Control Signal Process., 2008, pp. 98–106.
[62] F. Abujarad, G. Nadim, and A. Omar, “Clutter reduction and detection of landmine objects in ground penetrating radar data using singular value decomposition (SVD),” in Proc. Int. Workshop Adv. Ground Penetrating Radar, 2005, pp. 37–42.
[63] O. Lopera, N. Milisavljević, and S. Lambot, “Clutter reduction in GPR measurements for detecting shallow buried landmines: A Colombian case study,” Near Surface Geophys., vol. 5, no. 1, pp. 57–64, 2007. doi: 10.3997/1873-0604.2006018.
[64] D. J. Daniels, “Ground penetrating radar,” in Encyclopedia of RF and Microwave Engineering. Hoboken, NJ: Wiley, 2005.
[65] T. C. Havens et al., “Improved detection and false alarm rejection using FLGPR and color imagery in a forward-looking system,” in Proc. SPIE Detection and Sensing Mines, Explosive Objects, and Obscured Targets XV, 2010, vol. 7664, p. 76641U. doi: 10.1117/12.852274.
[66] T. C. Havens, J. M. Keller, K. Ho, T. T. Ton, D. C. Wong, and M. Soumekh, “Narrow-band processing and fusion approach for explosive hazard detection in FLGPR,” in Proc. SPIE Detection and Sensing Mines, Explosive Objects, and Obscured Targets XVI, 2011, vol. 8017, p. 80171F. doi: 10.1117/12.884610.
[67] D. Comite, F. Ahmad, T. Dogaru, and M. Amin, “Coherence-factor-based rough surface clutter suppression for forward-looking GPR imaging,” Remote Sens., vol. 12, no. 5, p. 857, 2020. doi: 10.3390/rs12050857.
[68] T. Dogaru and L. Carin, “Time-domain sensing of targets buried under a rough air-ground interface,” IEEE Trans. Antennas Propag., vol. 46, no. 3, pp. 360–372, 1998. doi: 10.1109/8.662655.
[69] H. Jin-feng and Z. Zheng-ou, “A novel method for clutter reduction in the FLGPR measurements,” in Proc. IEEE Int. Conf. Commun., Circuits Syst., 2004, vol. 2, pp. 896–900.
[70] L. Van Kempen and H. Sahli, “Signal processing techniques for clutter parameters estimation and clutter removal in GPR data for landmine detection,” in Proc. IEEE Signal Process. Workshop Stat. Signal Process., 2001, pp. 158–161.
[71] K. F. Casey, “Rough-surface effects on subsurface target detection,” in Proc. SPIE Detection and Remediation Technol. Mines Minelike Targets VI, 2001, vol. 4394, pp. 754–763.
[72] G. A. Tsihrintzis, C. M. Rappaport, S. C. Winton, and P. M. Johansen, “Statistical modeling of rough surface scattering for ground-penetrating radar applications,” in Proc. SPIE Detection and Remediation Technol. Mines Minelike Targets III, 1998, vol. 3392, pp. 735–744.
[73] D. Liao and T. Dogaru, “Full-wave-based emulation of forward-looking radar target imaging in rough terrain environment,” in Proc. IEEE Int. Symp. Antennas Propag., 2011, pp. 2107–2110.
[74] D. Liao, “Ground surface scattering and clutter suppression in ground-penetrating radar applications,” in Proc. IEEE Int. Symp. Antennas Propag., 2012, pp. 1–2.
[75] D. Comite, F. Ahmad, T. Dogaru, and M. G. Amin, “Coherence factor for rough surface clutter mitigation in forward-looking GPR,” in Proc. IEEE Radar Conf., 2017, pp. 1803–1806.
[76] T. C. Havens et al., “Locally adaptive detection algorithm for forward-looking ground-penetrating radar,” in Proc. SPIE Detection and Sensing Mines, Explosive Objects, and Obscured Targets XV, 2010, vol. 7664, p. 76642E. doi: 10.1117/12.851512.
[77] S. M. Kay, Fundamentals of Statistical Signal Processing. Englewood Cliffs, NJ: Prentice Hall, 1993.
[78] D. Comite, F. Ahmad, T. Dogaru, and M. Amin, “Adaptive detection of low-signature targets in forward-looking GPR imagery,” IEEE Geosci. Remote Sens. Lett., vol. 15, no. 10, pp. 1520–1524, Oct. 2018.
[79] A. D. Pambudi, M. Fauß, F. Ahmad, and A. M. Zoubir, “Minimax robust landmine detection using forward-looking ground-penetrating radar,” IEEE Trans. Geosci. Remote Sens., vol. 58, no. 7, pp. 1–10, 2020. doi: 10.1109/TGRS.2020.2971956.
[80] A. D. Pambudi, F. Ahmad, and A. M. Zoubir, “Copula-based robust landmine detection in multi-view forward-looking GPR imagery,” in Proc. IEEE Radar Conf., 2020, pp. 1–6.
[81] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Hoboken, NJ: Wiley, 2001.
[82] C. M. Bishop, Pattern Recognition and Machine Learning. Berlin: Springer-Verlag, 2006.
[83] T. Wang, J. M. Keller, P. D. Gader, and O. Sjahputera, “Frequency subband processing and feature analysis of forward-looking ground-penetrating radar signals for land-mine detection,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 3, pp. 718–729, 2007. doi: 10.1109/TGRS.2006.888142.
[84] T. Jin, J. Lou, and Z. Zhou, “Extraction of landmine features using a forward-looking ground-penetrating radar with MIMO array,” IEEE Trans. Geosci. Remote Sens., vol. 50, no. 10, pp. 4135–4144, 2012. doi: 10.1109/TGRS.2012.2188803.
[85] T. Wang, O. Sjahputera, J. M. Keller, and P. D. Gader, “Landmine detection using forward-looking GPR with object tracking,” in Proc. SPIE Detection and Remediation Technol. Mines Minelike Targets X, 2005, vol. 5794, pp. 1080–1088.
[86] J. Farrell et al., “Evaluation and improvement of spectral features for the detection of buried explosive hazards using forward-looking ground-penetrating radar,” in Proc. SPIE Detection and Sensing Mines, Explosive Objects, and Obscured Targets XVII, 2012, vol. 8357, p. 83571C. doi: 10.1117/12.918779.
[87] H.-S.
Youn et al., “Feasibility study for IED detection using forward-looking ground penetrating radar integrated with target features classification,” in Proc. IEEE Antennas Propag. Soc. Int. Symp., 2010, pp. 1–4.
[88] T. Dogaru and C. Le, “Polarization differences in airborne ground penetrating radar performance for landmine detection,” in Proc. SPIE Radar Sensor Technology XX, 2016, vol. 9829, pp. 85–97.
[89] M. Garcia-Fernandez, Y. Alvarez-Lopez, and F. Las Heras, “Autonomous airborne 3D SAR imaging system for subsurface sensing: UWB-GPR on board a UAV for landmine and IED detection,” Remote Sens., vol. 11, no. 20, p. 2357, 2019. doi: 10.3390/rs11202357.
[90] A. Alzeyadi, J. Hu, and T. Yu, “Electromagnetic sensing of a subsurface metallic object at different depths,” in Proc. SPIE Nondestructive Characterization and Monitoring Adv. Mater., Aerosp., Civil Infrastructure, and Transp. XIII, 2019, vol. 10971, p. 1097105.
[91] M. González-Díaz, M. García-Fernández, Y. Álvarez-López, and F. Las-Heras, “Improvement of GPR SAR-based techniques for accurate detection and imaging of buried objects,” IEEE Trans. Instrum. Meas., vol. 69, no. 6, pp. 3126–3138, 2019. doi: 10.1109/TIM.2019.2930159.
[92] D. Šipoš and D. Gleich, “A lightweight and low-power UAV-borne ground penetrating radar design for landmine detection,” Sensors, vol. 20, no. 8, p. 2234, 2020. doi: 10.3390/s20082234.
[93] I. Catapano et al., “Small multicopter-UAV-based radar imaging: Performance assessment for a single flight track,” Remote Sens., vol. 12, no. 5, p. 774, 2020. doi: 10.3390/rs12050774.
[94] T. Dogaru, “Imaging study for small unmanned aerial vehicle (UAV)-mounted ground-penetrating radar: Part I – Methodology and analytic formulation,” Army Res. Lab., Sensors and Electronic Devices Directorate, Adelphi, MD, Tech. Rep. ARL-TR-8645, 2019.
[95] T. Dogaru, “Imaging study for small unmanned aerial vehicle (UAV)-mounted ground-penetrating radar: Part II – Numeric examples and performance analysis,” Army Res. Lab., Sensors and Electronic Devices Directorate, Adelphi, MD, Tech. Rep. ARL-TR-8725, 2019.
[96] T. Dogaru, “Imaging study for small unmanned aerial vehicle (UAV)-mounted ground-penetrating radar: Part III – A multistatic approach,” Army Res. Lab., Sensors and Electronic Devices Directorate, Adelphi, MD, Tech. Rep. ARL-TR-8773, 2019.
[97] D. W. Paglieroni, D. H. Chambers, J. E. Mast, S. W. Bond, and N. Reginald Beer, “Imaging modes for ground penetrating radar and their relation to detection performance,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 3, pp. 1132–1144, 2015. doi: 10.1109/JSTARS.2014.2357718.
[98] T. Dogaru, “Synthetic aperture radar for helicopter landing in degraded visual environments,” Army Res. Lab., Sensors and Electronic Devices Directorate, Adelphi, MD, Tech. Rep. ARL-TR-8595, 2018.
Gaussianizing the Earth
Multidimensional information measures for Earth data analysis

J. EMMANUEL JOHNSON, VALERO LAPARRA, MARÍA PILES, AND GUSTAU CAMPS-VALLS

Digital Object Identifier 10.1109/MGRS.2021.3066260. Date of current version: 6 May 2021.

Information theory (IT) is an excellent framework for analyzing Earth system data because it enables us to characterize uncertainty and redundancy and is universally interpretable. However, accurately estimating information content is challenging because spatiotemporal data are high-dimensional and heterogeneous and have nonlinear characteristics. In this article, we apply multivariate Gaussianization for probability density estimation, which is robust to dimensionality, comes with statistical guarantees, and is easy to apply. In addition, this methodology enables us to estimate information-theoretic measures to characterize multivariate densities: information, entropy, total correlation, and mutual information (MI). We demonstrate how IT measures can be applied in various Earth system data analysis problems. First, we show how the method can be used to jointly Gaussianize radar backscattering intensities, synthesize hyperspectral data, and quantify information content in aerial optical images. We also quantify the information content of several variables that describe the soil–vegetation status in agroecosystems and investigate the temporal scales that maximize their shared information under extreme events, such as droughts. Finally, we measure the relative information content of space and time dimensions in remote sensing products and model simulations involving long records
of key variables, such as precipitation, sensible heat (SH), and evaporation. Results confirm the validity of the method, for which we anticipate wide use and adoption. Code and demonstrations of the implemented algorithms and IT measures are provided.

EARTH DATA AND INFORMATION DELUGE
Understanding the spatiotemporal dynamics of Earth system models and observational data is fundamental to monitoring our planet and understanding climate change [1]–[4]. We now face an information deluge from remote sensing platforms that continuously increase the spatial, temporal, and spectral resolution of data sources. Earth system data come in high volumes, are heterogeneous, and are riddled with uncertainty [5], which poses important challenges in analysis, modeling, and understanding. The statistical analysis of remote sensing data and model simulations requires dealing with this large amount of heterogeneous, multivariate, and spatiotemporal material.

Copious amounts of data do not necessarily mean large quantities of information. For example, it is now widely acknowledged that models are often correlated and share common traits, features, and information content. Which features are the most appropriate and representative? How can we best quantify their information content in meaningful units? Essential Earth variables and data products exhibit high levels of redundancy in space and time. So, what is the most appropriate space, time, or spatiotemporal scale one should look at? The same questions arise when trying to assess and choose the most adequate observational variable and biogeophysical parameter for Earth monitoring.

From a purely statistical standpoint, information quantification for Earth and climate data is difficult. IT is the appropriate framework to study information content, uncertainty, and redundancy [6]. The estimation of entropy and MI for discrete and continuous random variables has been addressed through different approaches in the statistics literature [7]–[10]. But the estimation of IT measures for multivariate data is problematic. Some methods, such as using histograms [6], [11] and nearest neighbors [8]–[10], can be very limiting, as they do not scale well, do not converge to the true measure, and show a high estimation bias [12]. However, in the remote sensing and geosciences community, there have been many successful application-driven approaches to overcome this challenge. Examples include studying feature redundancy in image classifiers [13], assessing the maximum number of parameters that can be estimated given a set of observations [14], remote sensing feature extraction and weighting [15], [16], data fusion [17], image registration [18]–[20], synthetic aperture radar (SAR) data characterization [21], [22], and quantifying uncertainty in models and observations [23]. However, again, these methods are application-driven, and none have been tested in very-high-dimensional scenarios, which is crucial for data characterization.

All information quantification metrics require a good multivariate density estimator. This is especially problematic in Earth observation (EO) data with moderate- to high-dimensional problems and nonlinear feature relations. These issues affect the classic parametric density estimators based on the exponential family and mixture distributions as well as nonparametric methods based on histograms, kernel density estimation (KDE), and k-nearest neighbors (kNNs).
As an alternative to these traditional methods, there is a new class of techniques called neural density estimators [24], which are parameterized neural networks that approximate densities. They use the "change-of-variables" formula to estimate the densities of inputs and enable one to draw samples of input data. They have promise, as they have been successfully used in applications related to Earth system sciences, including inverse problems [25] and density estimation [26].

In this article, we look at a particular class of models in the neural density estimation family. In particular, we introduce the Gaussianization method [27] and a generalized algorithm called rotation-based iterative Gaussianization (RBIG) [28]. This uses a repeated sequence of simpler feature-wise Gaussian transformations and orthogonal rotations until convergence. In each iteration, the total correlation and the non-Gaussianity are reduced and converge toward zero, that is, toward full independence. The learned transformation toward the Gaussian domain is invertible, which enables us to easily synthesize data by inverting samples drawn from the Gaussian domain. The approach is also advantageous because it enables us to estimate IT measures, such as entropy, total correlation, non-Gaussianity, and MI in high-dimensional data. It is fast and easy to apply and has links to deep neural networks [28]–[30].

MULTIVARIATE GAUSSIANIZATION

PROBABILITY DENSITY FUNCTION ESTIMATION
Most problems in signal and image processing, IT, and machine learning involve the challenging task of multidimensional probability density function (PDF) estimation. A PDF, or simply a density p(·), takes an input x ∈ X and outputs a density value satisfying the properties 1) that p(x) ≥ 0, ∀x ∈ R^D, and 2) that it integrates to one, ∫_X p(x) dx = 1. In practice, we usually do not have access to the PDF p(·), but we do have a set of (multivariate) samples drawn from the generating process, x = {x_1, x_2, ..., x_N}, from which to estimate the PDF. Accurate PDF estimation is important because it enables us to 1) calculate the probability of any arbitrary input data point, which accounts for the relative likelihood that the value of the random variable will equal the sample; 2) generate samples x′ ∼ p(x) from this distribution, thus facilitating data synthesis, background and support estimation, and anomaly detection; and 3) calculate expectations for functions (or transformations) of arbitrary form f(x) given p(x), i.e., E_x[f(x)], which enables us to, e.g., characterize a system.
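The three uses listed above can be made concrete with a toy density. In the snippet below, a two-dimensional Gaussian simply stands in for a fitted estimate of p(x); the choice of density, sample size, and test function f are arbitrary assumptions made for the illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

p = multivariate_normal(mean=np.zeros(2), cov=np.eye(2))   # stand-in for p(x)

x = np.array([0.3, -1.2])
likelihood = p.pdf(x)                          # 1) relative likelihood of a point
samples = p.rvs(size=10_000, random_state=0)   # 2) data synthesis, x' ~ p(x)
f = lambda v: np.linalg.norm(v, axis=-1)       # an arbitrary transformation f(x)
expectation = f(samples).mean()                # 3) Monte Carlo estimate of E_x[f(x)]
```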
Having access to all these properties gives us the ability to tackle long-standing problems in machine learning and statistics. With accurate PDF estimates, one can model the conditional densities of data generated from a prior distribution, develop accurate and efficient compression schemes, and use principled objective functions, such as the maximum likelihood. In addition, having access to an accurate density estimator can be useful in many hybrid applications to deal with out-of-sample and out-of-distribution problems [31].

The problem is, therefore, to estimate the density p(x) given a set of samples from X. The simplest approach to PDF estimation assumes that the density has a parametric functional form defined by a fixed number of tunable parameters. The Gaussian assumption is the most widely adopted for unimodal distributions, parameterized by a mean μ and a covariance matrix Σ. If more than one mode is assumed, a mixture of Gaussians (MoG) generally leads to better fits. However, finding a parametric form for the distribution that fits particular data properly is very difficult in most cases.

The alternative comes from nonparametric models, which do not assume a specific form for the distribution and are learned from data. The simplest nonparametric method estimates the PDF by partitioning the data space into nonoverlapping bins, where the density is estimated as the fraction of data points in the bin divided by the volume of the bin. This estimator runs the risk of overfitting or underfitting, depending upon how the bins are selected. Thus, there are several rule-of-thumb estimators with a wide range of guidelines for choosing the most appropriate bin size: 1) Sturges's rule [32], an overall good estimator; 2) the Freedman–Diaconis method [33], an estimator that is better for a larger number of samples and is more robust to outliers; and 3) more Bayesian approaches [34]. However, histogram-based PDF estimation methods are affected by the curse of dimensionality, so they cannot be applied to a large number of features. Alternative parametric estimates that follow probability estimation schemes with the optimal bin width determined by the maximum likelihood have been introduced [24]. However, they are very rigid and lead to extremely rough density functions.

To achieve smoother PDF estimates, KDE is popular. It places a nonlinear kernel function, with a varying bandwidth parameter to control the degree of smoothness, on top of each example. Unfortunately, a bias–variance tradeoff will result in over/underfitting the PDF, especially in moderate- to high-dimensional problems. In these approaches, the bandwidth is typically fixed a priori following heuristics in the literature [35], and it rarely accounts for the concentration of points, i.e., that smaller bins should be placed in regions with a higher concentration of points, in the form of an adaptive bit allocation scheme. This can be addressed by using kNNs, which have one adaptive bandwidth per location and depend on the number of available training points. However, all the preceding density estimators suffer from the curse of dimensionality: as the dimensionality increases, the space becomes sparser, and density estimates become unreliable.
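The rule-of-thumb histogram estimators and the KDE alternative discussed above are readily available in standard Python tooling; the snippet below is only a one-dimensional illustration with simulated data.

```python
import numpy as np
from scipy.stats import gaussian_kde

x = np.random.default_rng(0).standard_normal(500)

# Histogram density estimates with rule-of-thumb bin selection:
# 'sturges' is a good general-purpose default, while 'fd'
# (Freedman-Diaconis) suits larger samples and is more robust to outliers.
pdf_sturges, edges_sturges = np.histogram(x, bins="sturges", density=True)
pdf_fd, edges_fd = np.histogram(x, bins="fd", density=True)

# A smoother alternative: Gaussian KDE with a heuristic bandwidth
# (Scott's rule is the scipy default).
kde = gaussian_kde(x)
pdf_kde = kde(np.linspace(-4, 4, 200))
```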
GAUSSIANIZATION FOR PDF ESTIMATION
An alternative way to estimate a PDF from observational data is to employ a data transformation to a convenient domain instead of working explicitly in the high-dimensional input domain. The question of what constitutes a convenient domain is a long-standing one. Ideally, the domain should have independent components so that one can work in each dimension independently to get rid of the curse of dimensionality. It should enable one to perform operations and compute quantities therein, and it should be invertible so that one can express these quantities in meaningful units of the input domain. The Gaussian distribution has the desirable properties of showing independent components and being mathematically tractable and is thus a good candidate for density estimation.

A class of Gaussianization methods [28], [30] looks for transforms to a multivariate Gaussian domain. These transforms are related to the projection pursuit transformations introduced in [42] and seek to transform a multivariate distribution $p(x)$, where $x \in \mathcal{X} \subseteq \mathbb{R}^d$, into a standardized multivariate Gaussian distribution [27], [28]:

$G_\theta : x \in \mathbb{R}^d \mapsto z \in \mathbb{R}^d$, with $x \sim p(x)$ and $z \sim \mathcal{N}(0, I_d)$,  (1)

where $\theta$ are the parameters learned to Gaussianize the data $x$, $0$ is a vector of zeros (for the means), and $I_d$ is the identity matrix (for the covariance). By construction, the Gaussianization transform is a parameterized function $G_\theta$ consisting of a sequence of $L$ iterations (or layers), each performing an orthogonal rotation of the data and a marginal Gaussianization transformation to every feature. The transformation $G_\theta$ in each iteration $\ell$ is defined as

$x_{\ell+1} = R_\ell \Psi_\ell(x_\ell), \quad \ell = 0, 1, \ldots, L,$

where $x_0$ corresponds to the original data $x$, $\Psi_\ell$ is the marginal Gaussianization of each dimension of $x_\ell$ for the iteration $\ell$, and $R_\ell$ is a rotation matrix for the marginally Gaussianized variable $\Psi_\ell(x_\ell)$. After convergence in $L$ iterations, the transformation contains all the needed information to convert data coming from the original density into a multivariate Gaussian. Here, $\theta$ collectively groups all parameters: those from the rotation matrix $R$ and the marginal transformation $\Psi$. For example, one could use a principal component analysis (PCA) transformation for the rotation matrix $R$ and a histogram transformation for the marginal Gaussianization transformation $\Psi$. Then, the eigenvectors obtained from PCA describing $R$ and the parameterizations of $\Psi$ would define $\theta$. See Table 1 for more details on the decomposition of this formula and Figure 1 for a full decomposition of a toy data set.
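Before turning to Table 1, the following Python sketch illustrates the forward pass of one such iteration, using an empirical-CDF (histogram-like) marginal Gaussianization and a PCA rotation from the options in Table 1; the function names and toy data are ours, and the sketch omits the bookkeeping needed to invert the transform or evaluate its Jacobian (see the reference implementations linked later in the article).

import numpy as np
from scipy.stats import norm

def marginal_gaussianization(x):
    # Psi: map each feature to a standard Gaussian via its empirical CDF
    # followed by the inverse Gaussian CDF (the probit function).
    n = x.shape[0]
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1.0   # ranks 1..n per column
    return norm.ppf(ranks / (n + 1.0))

def rbig_iteration(x):
    # One Gaussianization block: marginal Gaussianization followed by a
    # PCA rotation R (any orthogonal rotation converges).
    g = marginal_gaussianization(x)
    g = g - g.mean(axis=0)
    _, _, vt = np.linalg.svd(g, full_matrices=False)           # right singular vectors = PCA axes
    return g @ vt.T

# Toy run on a noisy sine wave, in the spirit of Figure 1.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0 * np.pi, 2000)
x = np.column_stack([t, np.sin(t) + 0.25 * rng.standard_normal(2000)])
for _ in range(20):                                            # L iterations
    x = rbig_iteration(x)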
TABLE 1. A SUMMARY OF THE COMPONENTS OF THE GAUSSIANIZATION ALGORITHM.
DESCRIPTION | NOTATION | TRANSFORMATION | DOMAIN
Marginal uniformization | $U$ | Histogram [28], mixture CDF [36], KDE [30], Lambert [37], spline [38], Box–Cox [39] | $\mathbb{R} \rightarrow [0, 1]$
Inverse CDF | $\mathrm{CDF}^{-1}$ | Inverse Gaussian CDF, logistic, inverse Cauchy CDF | $[0, 1] \rightarrow \mathbb{R}$
Marginal Gaussianization | $\Psi = \mathrm{CDF}^{-1} \circ U$ | Marginal uniformization + inverse CDF | $\mathbb{R} \rightarrow \mathbb{R}$
Rotation | $R$ | PCA [28], independent component analysis [27], random rotations [28], Householder transformations [40], [41] | $\mathbb{R}^d \rightarrow \mathbb{R}^d$
Gaussianization block | $G_\ell = R\,[\Psi_1 \cdots \Psi_d]$ | Composition of rotation + marginal Gaussianization | $\mathbb{R}^d \rightarrow \mathbb{R}^d$
Gaussianization transform | $G = G_1 \circ \cdots \circ G_L$ | Composition of Gaussianization blocks | $\mathbb{R}^d \rightarrow \mathbb{R}^d$
CDF: cumulative distribution function.

We can use the change-of-variables formula to calculate the PDF of $x$ as

$p_x(x) = p_z(G_\theta(x))\,|\nabla_x G_\theta(x)|$,  (2)

where $|\nabla_x G_\theta(x)|$ is the determinant of the Jacobian of $G_\theta$ with respect to $x$. Generally, any unknown PDF of $x$ can be estimated as long as we have the transformation $G_\theta$ along with its Jacobian. Intuitively, this transformation essentially converts the density of $\mathcal{X}$ into unstructured noise (often Gaussian or normal) [24], [26], [43]. There is no limit to the number of composite transformations $G_\Theta = G_{\theta_1} \circ G_{\theta_2} \circ \cdots \circ G_{\theta_L}$ that can be used to sufficiently converge to the Gaussian distribution. In addition, because $G_\Theta$ is invertible, we can sample points in the original domain $x' \in \mathcal{X}$ by generating samples in the transformed Gaussian domain and propagating these through the inverse transformation $G_\Theta^{-1}$. Because the transform is a product of linear and marginal operations, the Jacobian and the inverse transform can be easily computed [28], [44].

The original Gaussianization algorithm [27] worked by applying an orthogonal rotation matrix via independent component analysis and an MoG for the marginal Gaussian transformation. After enough repetitions L, it was shown that this converged to a multivariate Gaussian distribution [27]. In [28], we extended Gaussianization by realizing that the method will converge with any orthogonal rotation matrix R, and we named the algorithm RBIG.

FIGURE 1. A complete Gaussianization of a noisy sine wave to a marginally and jointly Gaussian distributed one. We use PCA for the rotation matrix and a histogram cumulative distribution function estimator for the marginal transformation. Plots were generated with seaborn [73].
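Because each block composes marginal (feature-wise) operations with an orthogonal rotation, whose Jacobian determinant has unit magnitude, the logarithm of (2) separates into purely marginal terms. The following is a sketch of that bookkeeping in the notation of the preceding subsection, with $\psi_{\ell,j}$ denoting the $j$th marginal Gaussianization at iteration $\ell$ (the decomposition itself, rather than its exact indexing, is the point):

$\log p_x(x) = \log \mathcal{N}\big(G_\Theta(x); 0, I_d\big) + \sum_{\ell=0}^{L} \sum_{j=1}^{d} \log \left| \frac{\partial \psi_{\ell,j}(x_{\ell,j})}{\partial x_{\ell,j}} \right|, \qquad |\det R_\ell| = 1.$

Only the derivatives of the one-dimensional marginal transformations contribute, which is why the Jacobian (and, by the same argument, the inverse) is easy to compute.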
This facilitated simpler and faster algorithms, such as PCA, and even randomly generated orthogonal rotation matrices. In addition, much simpler univariate estimators, such as histograms, were used to significantly speed up the algorithm. Meng et al. [30] coined the term Gaussianization flows and extended the iterative algorithm to be fully parameterized and trainable by incorporating a mixture of logistics as the marginal Gaussianization layer and a sequence of Householder flows [40], [41] as the rotation layer. They also proved that this is a universal approximator and showed convincing results that Gaussianization is comparable to other classes of methods specifically designed for density estimation and sampling [30]. All transformations and example variants can be found in Table 1. For details about the theoretical convergence properties of Gaussianization flows, see [27], [28], and [30].

Regardless of the chosen method, to find the parameters $\theta$ for the transformation $G_\theta$, we minimize the following cost function with respect to $\theta$:

$\mathcal{L}(\theta) = D_{\mathrm{KL}}\big[ p_z(G_\theta(x)) \,\|\, \mathcal{N}(0, I_D) \big]$,  (3)

which is the Kullback–Leibler (KL) divergence between the estimated Gaussian distribution and the true multivariate Gaussian distribution of mean 0 and covariance I; in other words, this is a measure of how non-Gaussian our distribution is after transformation. This reveals a direct relationship with information-theoretic concepts and measures. Chen [27], [46] showed that (3) can be decomposed as

$\mathcal{L}(\theta) = T(x) + J_m(x)$,  (4)

where $T(x)$ is the total correlation (T) (as well as multi-information and multivariate MI) between all the marginal distributions and $J_m(x)$ is the KL divergence between the marginal distributions and the standard Gaussian normal distribution. Intuitively, this cost function is trying to minimize the information shared among each of the marginal distributions and ensure that they follow a normal Gaussian distribution.

We want to highlight the fact that RBIG vastly transforms and simplifies the PDF estimation problem, from directly estimating the density of the high-dimensional multivariate distribution in $\mathcal{X}$ to doing it indirectly through a transformation to a Gaussian domain, all by using a series of marginal transformations, which are straightforward and fast. An example of how RBIG works on a simple 2D toy data set is provided in Figure 2. We transform a non-Gaussian 2D data set into a 2D marginal and jointly Gaussian distribution along with the inverse transformation (first row). The second row demonstrates how we can use RBIG to synthesize points in the data domain by using the inverse transformation. Figure 2(f) shows the evolution through iterations of the final total correlation (as a measure of redundancy) and the non-Gaussianity (as a measure of the distance to a Gaussian). See the RBIG site, https://ipl-uv.github.io/rbig_jax/, for a working implementation of the RBIG algorithm in Python and MATLAB.

FIGURE 2. The density estimation of a sinusoid with heteroscedastic noise, using RBIG. The original data distribution $\mathcal{X}$ is mapped to a Gaussian domain $\mathcal{Z}$, with the transform $G_\theta$ parameterized by a set of rotations and marginal Gaussianizations collectively denoted as $\theta$, which has an analytic inverse transformation, $x = G_\theta^{-1}(\hat{z})$, to recover the original data. One can sample random data from the Gaussian in domain $\mathcal{Z}$ and use the inverse transformation of $z$ to $\hat{x}$ for data synthesis. We also demonstrate the losses: the equivalence of the change in the total correlation between layers, $\Delta T$, and the KL divergence between transformed data and a multivariate Gaussian (non-Gaussianity). (a) $\mathcal{X}$. (b) $\hat{z} = G_\theta(x)$. (c) $x = G_\theta^{-1}(\hat{z})$. (d) $\mathcal{Z}$. (e) $\hat{x} = G_\theta^{-1}(z)$. (f) $\Delta T$ and non-Gaussianity. Plots were generated with matplotlib [74].

IT MEASURES USING THE RBIG TRANSFORM
RBIG was designed for density estimation but was inspired by, and has connections to, IT [6]. The series of transformations learned by RBIG converts data from the original domain to a standard multivariate Gaussian one. The features are marginally independent, which is important for determining information-theoretic measures using the Gaussianization scheme. This reduction in redundancy is iteratively achieved and can be explicitly computed by summing up all the layer redundancy reductions. This metric is known as the total correlation, and computing it enables us to derive information-theoretic measures from data.

INFORMATION
Shannon information I [47] is based on the idea that a sample, $x_i$, is more interesting (it carries more information) if it is less probable. The formal definition of information is

$I(x_i) = -\log(p_x(x_i))$.  (5)
It can be used, for instance, to highlight regions of more interest in a data set. Information can be computed for each sample in our data set by using RBIG and (2). The expected value of the information provided by a complete data set, $x$, is called entropy:

$H(x) = \mathbb{E}_x[-\log(p_x(x))]$.  (6)

While entropy could be computed by estimating the information of each sample in a data set using (5) and averaging, computing it using the ability of RBIG to calculate the total correlation is more convenient, as we will see in the following section.

TOTAL CORRELATION
The total correlation, T, accounts for the information shared among the dimensions of a multidimensional random variable [48], [49]. Details of how to compute T using RBIG can be found in [28]; here, we sketch the main idea. Given data $x \in \mathbb{R}^D$, we first learn the Gaussianization transform with $L$ iterations and compute the cumulative reduction in the total correlation in each iteration as

$T(x) = \sum_{\ell=1}^{L} \sum_{d=1}^{D} \big( H(\mathcal{N}(0,1)) - H(x_{\ell,d}) \big)$.  (7)

The number of layers $L$ will be determined by the reduction in the total correlation with each transformation. If there is no change in the total correlation after some threshold number of layers, we can assume that the $x_d$ are completely independent. It is important to note that all entropy calculations involve only marginal operations, which are simple and fast, enabling RBIG to be used on large data sets that have a high number of dimensions.

JOINT ENTROPY
While the concept of information is attached to a particular sample, entropy is used in different fields to characterize how unpredictable a complete process is. Entropy can be easily computed from the learned RBIG transformation by

$H(x) = \sum_{d=1}^{D} H(x_d) - T(x)$,  (8)

where $\sum_{d=1}^{D} H(x_d)$ are marginal entropy estimations and $T(x)$ also involves marginal estimations [see (7)].

MULTIVARIATE MI
Multivariate MI accounts for the information shared by two data sets [6]. Estimating MI can be very challenging when working with high-dimensional data. Our approach is based on the invariance property of MI to reparameterizations of the space of each variable [8]. Therefore, we essentially Gaussianize the two data sets, $X$ and $Y$, with corresponding transforms that remove their total correlations. Then, the total correlation remaining between both Gaussianized data sets is equivalent to the MI between the original data sets:

$\mathrm{MI}(X, Y) = T([G_{\theta_x}(X), G_{\theta_y}(Y)])$,  (9)

which again implies only marginal operations [see (7)]. Figure 3 includes a Venn diagram illustrating the different IT measures used in this article, and Table 2 demonstrates how they compare to the popular Pearson correlation coefficient for different toy data sets.

FIGURE 3. A Venn diagram of the relationships of all IT measures used in this article. The solid-colored circles represent marginal variables, and the intersection regions with bold lines show regions for IT measures, such as MI and total correlation, T.

TABLE 2. A COMPARISON OF DIFFERENT IT MEASURES AND THE POPULAR PEARSON CORRELATION COEFFICIENT, $\rho$, FOR SIX TOY DATA SETS (ONE PER COLUMN; THE SCATTERPLOTS ARE NOT REPRODUCED HERE).
Correlation $\rho(x, y)$ | Low | Medium | Low | Low | Medium | High
MI $\mathrm{MI}(x, y)$ | Low | Medium | High | High | High | High
Marginal entropy $H(x)$, $H(y)$ | High | High | High | High | High | High
Joint entropy $H(x, y)$ | High | Medium | Medium | Low | Low | Low
This table is also a visual demonstration of how to interpret MI and its relationship to marginal entropy and joint entropy; $\mathrm{MI}(x, y) = H(x) + H(y) - H(x, y)$.
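As a rough illustration of how (7)–(9) reduce to marginal operations, the following Python sketch estimates T, the joint entropy, and MI using histogram-based marginal entropies and the same kind of Gaussianization block sketched earlier. The function names, bin count, and fixed number of layers are our choices (in practice, one would monitor the per-layer reduction in T for convergence, as described above), so this is a didactic sketch rather than the reference RBIG implementation.

import numpy as np
from scipy.stats import norm

H_GAUSS = 0.5 * np.log2(2.0 * np.pi * np.e)          # entropy of N(0, 1), in bits

def marginal_entropy(v, bins=100):
    # Plug-in (histogram) estimate of the differential entropy of a 1D sample, in bits.
    counts, edges = np.histogram(v, bins=bins)
    p = counts / counts.sum()
    w = np.diff(edges)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz] / w[nz]))

def gaussianization_layer(x):
    # One RBIG-style block (rank-based marginal Gaussianization + PCA rotation),
    # mirroring the earlier sketch.
    n = x.shape[0]
    u = (np.argsort(np.argsort(x, axis=0), axis=0) + 1.0) / (n + 1.0)
    g = norm.ppf(u)
    _, _, vt = np.linalg.svd(g - g.mean(axis=0), full_matrices=False)
    return g @ vt.T

def total_correlation(x, n_layers=50):
    # Eq. (7): accumulate the marginal non-Gaussianities removed by each layer.
    x = (x - x.mean(axis=0)) / x.std(axis=0)           # standardize first (a practical choice)
    T = 0.0
    for _ in range(n_layers):
        T += sum(H_GAUSS - marginal_entropy(x[:, d]) for d in range(x.shape[1]))
        x = gaussianization_layer(x)
    return T

def joint_entropy(x, n_layers=50):
    # Eq. (8): sum of marginal entropies minus the total correlation.
    return sum(marginal_entropy(x[:, d]) for d in range(x.shape[1])) - total_correlation(x, n_layers)

def mutual_information(x, y, n_layers=50):
    # Eq. (9): Gaussianize X and Y separately, then measure the total correlation
    # remaining between the two Gaussianized blocks.
    for _ in range(n_layers):
        x, y = gaussianization_layer(x), gaussianization_layer(y)
    return total_correlation(np.hstack([x, y]), n_layers)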
ILLUSTRATIVE EXPERIMENTS
In this section, we explore the information content, redundancy, and relations in a selection of Earth data analysis problems involving remote sensing data and models, using RBIG. First, we illustrate the method's ability to analyze standard remote sensing settings involving total correlation estimation in hyperspectral, radar, and very-high-resolution imagery. Second, we quantify the information content of several variables that describe the soil–vegetation status and investigate the temporal scales leading to the maximum shared information for the detection and precursors of anomalies, such as droughts. Finally, we explore the challenging problems of IT measurement estimates and the quantification of the spatiotemporal information tradeoff in global Earth products. Table 3 summarizes the experiments in terms of measures, applications, and data/simulations.

GAUSSIANIZATION IN REMOTE SENSING DATA
This first set of experiments considers the use of RBIG for standard remote sensing image processing. We show the performance of RBIG in hyperspectral, very-high-resolution, and radar imagery and for several applications: joint (multivariate) Gaussianization, data synthesis, and information estimation.

GAUSSIANIZATION OF RADAR IMAGES
The first part of the experiment focuses on the analysis of radar imagery. The data were collected in the Urban Expansion Monitoring (UrbEx) project, a part of the European Space Agency's European Space Research Institute Data User Program [51]. Results from the UrbEx project were used to perform the analysis of the selected test sites and for validation purposes. We consider a European Remote Sensing Satellite 2 (ERS-2) SAR pair selected with perpendicular baselines between 20 and 150 m to obtain the interferometric coherence. The corresponding pair $(I_1, I_2)$ of SAR backscattering intensities (0–35 days) was stacked for analysis; see Figure 4. The relation between the intensity features is strongly nonlinear and non-Gaussian and shows a large dispersion; see Figure 4(a). The total correlation computed with RBIG for the original domain is T = 0.0929 b. A standard approach in SAR image (pre)processing consists of noise removal and marginal Gaussianization, which can address these problems only partially. This marginal Gaussianization cannot deal with the saturation for high and low signal values [Figure 4(b)]. A multivariate Gaussianization leads to a fully Gaussian density [Figure 4(c)]. This is confirmed by the estimated total correlation of T = 0.0095 b, which is lower than that of the marginally Gaussianized data.

SYNTHESIZING HYPERSPECTRAL IMAGES
To show the ability of the method to deal with high-dimensional data, we consider hyperspectral image processing. We took the standard Airborne Visible/Infrared Imaging Spectrometer Indian Pines data set [52], where the data have spectral redundancy and complex joint distributions. The images contain 200 spectral channels, constituting the (very high) input dimensionality. We learned a Gaussianization transform that led to a multivariate Gaussian domain of 200-dimension spectral bands. Then, we drew n = 10^6 samples of 200 dimensions from a multivariate Gaussian and inverted them back to the spectral domain. RBIG can be used this way to easily generate synthetic spectra. Figure 5(a) presents the original and synthesized spectra.
It shows how the proposed method enables us to generate/synthesize seemingly real spectral distributions, even in such a high-dimensional setting. Figure 5(b) and (c) gives corner plots illustrating joint distributions among various spectral bands (10, 20, 50, 100, and 150). We see that the marginal and joint distributions for the RBIG-generated spectra in Figure 5(c) are very similar to those of the real data in Figure 5(b) across all pairwise band combinations. It is important to note that some of the most widely used methods, such as PCA, could replicate Figure 5(a) with a good approximate mean and standard deviation, but they would not be able to duplicate Figure 5(d), where all joint distributions are approximately Gaussian.

FIGURE 4. Radar image processing. We illustrate the Gaussianization of 2D radar data comprised of a pair $(I_1, I_2)$ of ERS-2 SAR backscattering intensities. (a) The joint distribution is non-Gaussian, and preprocessing before applying any algorithm is generally convenient. The (b) standard marginal Gaussianization does not achieve a full spherical (joint) Gaussian, unlike (c) the RBIG transformation [75], [76].

FIGURE 5. The Gaussianization and synthesis of hyperspectral data, using RBIG. In (a), we show the mean and standard deviation spectrum for the 21,000 real pixels (mean = black; standard deviation = darker shade) and the 1 million pixels generated synthetically (mean = red; standard deviation = lighter shade) using RBIG. In (b) and (c), we show the marginal and joint distributions of spectral bands 10, 20, 50, 100, and 150 for the real data and for data generated with RBIG, respectively. Plots were generated with corner [77].

TABLE 3. A SUMMARY OF EXPERIMENTS, WITH DETAILS OF THE DATA SETS, CONFIGURATIONS, APPLICATIONS, AND MEASURES EMPLOYED.
EXPERIMENT | DATA SET | CHARACTERISTICS | REFERENCE | CONFIGURATION | APPLICATION | MEASURES
1 | SAR: European Remote Sensing Satellite 2 | 26 m, backscatter intensity | [51] | Pixel-wise | Gaussianization | T
1 | Hyperspectral: Airborne Visible/Infrared Imaging Spectrometer | 30 m, 224 channels | [52] | Pixel-wise | Synthesis | T
1 | Airborne camera: red-green-blue images | 10 cm, 21 classes, 100 images/class | [53] | Spatial | I quantification | T
2 | Optical: Moderate Resolution Imaging Spectroradiometer land surface temperature, normalized-difference vegetation index | 0.05º, 5.5 years, 14 days | [54] | Temporal | I quantification, PDF comparison | H, MI
2 | Passive microwave: Soil Moisture and Ocean Salinity (SMOS) soil moisture, vegetation optical depth | 25 km, 5.5 years, daily | [55] | Temporal | I quantification, PDF comparison | H, MI
3 | Observed and simulated: evaporation, SH, precipitation | 0.083º, 10 years, monthly, global | [56] | Spatiotemporal | I quantification | I, H

INFORMATION IN HIGH-RESOLUTION IMAGES
Very-high-resolution images are constantly acquired by the new generation of sensors on airborne and spaceborne platforms. A systematic analysis of the images is necessary. Machine learning and, in particular, deep learning have led to an important leap in classification accuracy. However, owing to the wealth of data and their diversity, it becomes necessary to design algorithms that exploit most of the images' information content in terms of relevant features and examples. We validate RBIG to estimate the total correlation (multi-information) in a set of aerial scenes collected in the University of California, Merced, data set [53], which contains manually extracted images from the United States Geological Survey's National Map Urban Area Imagery collection, from 21 aerial scene categories, with a 1-ft/pixel resolution. The data set contains highly overlapping classes and has 100 images per class; examples appear in Figure 6(a). We extracted 3 × 3 × 3 color patches from each image, which yielded 6,499,950 27-dimension feature vectors per class. Then, we developed a Gaussianization transformation for each class and computed the (spatiospectral) T using RBIG; see Figure 6(d). We show in Figure 6(e) the average and standard deviation of the T evolution through 50 iterations for the 21 classes (note the log scale) and the total correlation per class. More textured classes, such as runways, freeways, buildings, and intersections, lead to higher T, while rather homogeneous/flat classes, including chaparral, fields, and forests, have little information content.

FIGURE 6. The estimation of the total correlation, T, in very-high-resolution aerial imagery. (a) Images for each of the 21 classes in the database, ranked according to their estimated T. (b) Each image is decomposed in 3 × 3 patches with three channels (red-green-blue), making samples of 27 dimensions. (c) We measure how much overlap there is between the information content (i.e., the total correlation) of the 27 dimensions for each class. We show a Venn diagram to illustrate the measured information, following the same criteria as in Figure 3. (d) The average total correlation is iteratively computed for the 21 class-specific RBIG models through 50 iterations, with the mean T (solid) and the T standard deviation (shaded) across all models. Convergence is achieved very rapidly for all classes (note the log scale). (e) The ranked T per class computed from the RBIG models [78].
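As a concrete sketch of the patch-based setup just described, the following shows one way to extract the 3 × 3 × 3 color patches and flatten them into 27-dimension feature vectors; the helper name and the stand-in usage are ours (hypothetical), and total_correlation refers to the estimator sketched earlier, not the exact configuration used for Figure 6.

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rgb_patches(images, size=3):
    # Extract all size x size x 3 patches from a list of RGB images and flatten
    # each patch into a (3 * size * size)-dimension feature vector.
    feats = []
    for img in images:                                  # img: (H, W, 3) array
        win = sliding_window_view(img, (size, size), axis=(0, 1))
        feats.append(win.reshape(-1, 3 * size * size))  # (n_patches, 27) for size = 3
    return np.vstack(feats)

# Hypothetical per-class usage, to be ranked by total correlation:
# images = [load_image(path) for path in one_class_paths]   # the 100 images of one class
# X = rgb_patches(images)                                    # (n_patches, 27)
# T_class = total_correlation(X)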
INFORMATION QUANTIFICATION OF TERRESTRIAL BIOSPHERE VARIABLES IN TIME
According to climate projections, extreme events are likely to intensify and become more frequent during the coming years [59]. The effects of extreme events (such as droughts) are prevalent not only in the biosphere and atmosphere but also in the anthroposphere. Drought is a major cause of limited agricultural productivity, which accounts for a large proportion of the crop losses and annual yield variations throughout the world [60]. Droughts are also direct contributors to social conflicts, migration, and political unrest (e.g., [61]). There are many studies that show the value of incorporating EO data for global agricultural systems and applications [62], [63]. Variables such as the land surface temperature (LST) and the normalized-difference vegetation index (NDVI), derived from optical satellites, and, more recently, soil moisture (SM) and the vegetation optical depth (VOD), derived from passive microwave sensors, are just a few of the many features that can potentially be key to the early detection of droughts [54], [55], [64]. The Soil Moisture Agricultural Drought Index (SMADI) was proposed in [65] to integrate SM with the LST and NDVI, showing good agreement with other indices and documented events worldwide [54].

In this experiment, we quantify the information in and between the LST, NDVI, SM, and VOD variables for a study area in California (only agricultural fields); see Figure 7. The LST and NDVI are descriptors of the surface temperature and vegetation chlorophyll content, whereas SM and the VOD characterize the water content in soils and vegetation [55], [65]. We also use information measures to evaluate whether it would be worthwhile to include the VOD as an additional variable in the SMADI ensemble to characterize droughts. Prior to the analyses, the variables are resampled onto a common 0.05º grid and a biweekly temporal resolution. Details of the data sets are provided in Table 3. Measures are computed for 2010 and 2011 and for 2014–2016, which are representative of conditions with and without droughts (see Figure 7).

We focus on computing multivariate IT measures in a temporal feature setting, where previous time steps are included as input features. For example, one input feature includes the current time stamp, two input features include the current time stamp and a time stamp from 14 days earlier, and so on. This enables us to investigate the temporal scales that maximize the shared information among remotely sensed variables. This is particularly relevant for droughts since there is a time lag between soil/climatic conditions (e.g., represented by SM and the LST) and plant responses (e.g., described by the NDVI and VOD), which varies in the literature from two or three weeks to three months [66].

The amount of expected information H for each of the four variables, and how it changes as we include more temporal dimensions, is analyzed in Figure 8(a). Entropy will always increase with more features. The entropy shown here has been normalized by the total number of features, which enables us to quantify the amount of entropy per feature. It can be seen that the amount of entropy for the VOD is the highest in all temporal settings, closely followed by the LST. All variables decrease in entropy as we add more temporal features. The NDVI saturates at roughly 1.5 b, whereas the other variables have a steady, smooth decline.
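The temporal feature setting described above can be made concrete with a small sketch that stacks the current time stamp and a chosen number of earlier (14-day-spaced) time stamps into a feature matrix; the function and variable names are ours, and mutual_information refers to the estimator sketched earlier rather than the exact configuration used for Figures 8 and 9.

import numpy as np

def temporal_embedding(series, n_lags, step=1):
    # Stack the current time stamp and n_lags earlier time stamps (each `step`
    # samples apart) as input features; returns an array of shape (n_valid, n_lags + 1).
    t0 = n_lags * step
    cols = [series[t0 - k * step : len(series) - k * step] for k in range(n_lags + 1)]
    return np.column_stack(cols)

# Hypothetical usage with biweekly series (one sample = 14 days):
# lst_feats = temporal_embedding(lst, n_lags=3)    # current + 3 previous time stamps
# ndvi_feats = temporal_embedding(ndvi, n_lags=3)
# mi = mutual_information(lst_feats, ndvi_feats)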
We can also see that the LST and VOD show the largest difference between years with and without droughts and that the difference grows as we increase the temporal dimension. This result suggests that the LST and VOD observed during longer periods could be more useful in detecting droughts. Figure 8(b) demonstrates that the VOD increases the amount of expected information when added to the SMADI variable ensemble in all the considered temporal settings, suggesting that it would be worthwhile to include the VOD in agricultural drought studies. The results indicate that operational vegetation monitoring settings could benefit from synergistic approaches that facilitate including multisensor, multidimensional variables, in particular, under stress and during disturbances, such as agricultural droughts.

The MI of every pair of multidimensional variables was analyzed to investigate the pairs' relationships and redundancies as well as the optimal time scales for combining them. Note that standard measures for pairwise comparison, such as Pearson's correlation, are restricted to one temporal dimension and hence do not facilitate exploring these scales. The MI scores obtained for the LST relations are given in Figure 9. Interestingly, the figure shows that the LST–NDVI and LST–VOD pairs show an MI increase up to approximately two to four temporal dimensions and then saturate. This result suggests that a period of about one or two months is needed to capture the soil–plant status through the remotely sensed variables analyzed in our study region. The curves are relatively similar regardless of whether there is a drought or not, and the value spread for drought years is considerably reduced for all variables, especially for the VOD. This could be related to reduced variability (a limited range of values) during droughts, but further studies are needed to confirm this. We also observed that the MI is consistently low between SM and all the other variables for any number of temporal dimensions, and it is also low between the NDVI and VOD, highlighting the value of combining optical and microwave variables for vegetation/land monitoring.

INFORMATION IN SPATIAL–TEMPORAL EARTH DATA

DATA
For our experiments, we used observational and model-simulated variables from the Earth System Data Lab [56] (https://www.earthsystemdatalab.net/), which is a platform that provides a