8th COMPUTATIONAL ARCHIVAL SCIENCE (CAS) WORKSHOP
Sunday Dec. 17, 2023
Part of: 2023 IEEE Big Data Conference (IEEE BigData 2023) — http://bigdataieee.org/BigData2023/ (Sorrento, Italy) – Dec. 15-18, 2023
All times are in Central European Time (UTC+1).
Boardroom #6
- Workshop Chairs:
Mark Hedges 1, Victoria Lemieux 2, Richard Marciano 3
1 King’s College London UK / 2 U. British Columbia CANADA / 3 U. Maryland USA
9:00 – 10:00 SESSION 1: Classification & Annotation
- 9:00-9:20 #1: The Sequel: The Development of a Novel Context Capturing Method for the Functional Auto Classification of Records (S01209)
Nathaniel Payne [School of Library, Archival, and Information Studies (iSchool) University of British Columbia, CANADA]
PAPER — VIDEO ABSTRACT:
- 9:20-9:40 #2: Specimen Outlining: A Computational Archival Science Approach (S01216)
David Breen, Andrew Senin, Ajani Levere, Joel Pepper, Jane Greenberg [Department of Computer Science Drexel University, Philadelphia, PA, USA / Department of Information Science Drexel University, Philadelphia, PA, USA]
PAPER — VIDEO ABSTRACT: Computational archival science (CAS) provides new pathways for research. Biologists, for example, can perform scientific studies by applying AI/ML to digital biological specimen collections and explore questions that were not possible in the analog world. One such approach is the application of computational methods for specimen outlining to assist with specimen identification, morphometry, and other scientific questions. The challenge is to determine how to computationally generate and represent a specimen’s outline. The research presented in this paper addresses this challenge through the creation of elliptical Fourier descriptors (EFDs). The paper describes the image processing pipeline for extracting fish outlines, a key morphological feature, and representing the outlines using EFDs. In addition, our research presents the application of machine learning classification on the EFDs. The resulting dataset is well suited for a variety of machine learning-based downstream analyses, including classification by genus and species. Overall, the classification tests produced a 96.3% accuracy, demonstrating the distinguishing nature of the EFDs, and by proxy, the fish outlines as a whole. Broadly, these results indicate the effectiveness of archival specimen usage in machine learning applications, and demonstrate specimen outlining via Fourier descriptors as a computational archival science approach.
- 9:40-10:00 #3: Who’s in My Archive? An End-to-End Framework for Automatic Annotation of TV Personalities (S01206)
Maurizio Montagnuolo, Fulvio Negro, Alberto Messina, Angelo Bruccoleri, Roberto Iacoviello [Centre for Research, Technological Innovation and Experimentation Rai Radiotelevisione Italiana Turin, ITALY]
PAPER — VIDEO ABSTRACT: Knowledge about the presence of people in a video is a valuable source of information in many applications, such as video annotation, retrieval and summarisation. This paper aims to demonstrate how AI-based face processing technologies can be profitably used to annotate television content. To validate our vision, we developed the Face Management Framework (FMF), which implements an end-to-end pipeline for face analysis and content annotation based on few-shot or zero-shot face embedding extraction models. The results of the test campaign show that the system exceeded the key performance indicators we defined by a wide margin, demonstrating how media workflows could greatly benefit from the tool and the efficiency improvements it brings.
10:00 – 10:30 COFFEE BREAK
10:30 – 11:30 SESSION 2: Authenticity & Trust
- 10:30-10:50 #4: Authenticating Citizen Journalism by Incorporating the View of Archival Diplomatics into the Verification of Open-source Investigators (S01211)
Hoda Hamouda [School of Information (iSchool UBC) University of British Columbia, CANADA]
PAPER — VIDEO ABSTRACT: Can archival science and diplomatics enhance our ability to authenticate YouTube citizen journalism videos captured in conflict-affected regions? This research explores the possibility of expanding the current process of human rights open-source investigators in verifying online videos by integrating authentication measures of archival diplomatics into the workflow of open-source investigators.
- 10:50-11:10 #5: Will Blockchain Technology Change How Well National Archives Preserve the Trustworthiness of Digital Records?: Preliminary Results of a Survey (S01205)
Özhan Saglik, Victoria Lemieux [Bursa Uludag University, Türkiye / University of British Columbia Vancouver, CANADA]
PAPER — VIDEO ABSTRACT: The purpose of this study is to examine the viewpoint of national archives on blockchain and distributed ledger technologies, discover their activities in relation to the application of these technologies, and analyse their thoughts on how these technologies can play a role in the preservation of records’ trustworthiness. A survey method was adopted in the study. The survey consisted of 18 questions about national archives’ attitudes and actions in relation to the application of blockchain and distributed ledger technologies. The survey was sent to the 194 national archives listed in the Directory of National Archives. Eighteen responses were received; while this number is low, it provides initial insights into how national archives are responding to these technologies. This study has three hypotheses. The first one is “blockchain technology will change archiving practices”, the second one is “the trustworthiness of digital records can be preserved better with blockchain technology”, and the last one is “national archives are reluctant to implement blockchain networks that use tradable crypto-assets”. According to the results obtained from the survey, the first hypothesis has not been verified. The second hypothesis appears likely among national archives that are keen to adopt blockchain and distributed ledger technologies, but a majority of the archives are hesitant to adopt these technologies for archiving, suggesting that the third and final hypothesis might also be true, though the reasons for national archives’ reluctance to adopt these technologies could be more varied than originally hypothesized. This study is one of the first systemic analyses of the viewpoint and activities of national archives on blockchain and distributed technologies.
- 11:10-11:30 #6: Analogous Analogues: Digital Twins and Hardware Tracking in GLAM Collections (S01208)
Dian Ross, Edmond Cretu, Victoria Lemieux [Electrical and Computer Engineering University of British Columbia Vancouver, CANADA / School of Information University of British Columbia Vancouver, CANADA]
PAPER — VIDEO ABSTRACT: Galleries, Libraries, Archives, and Museums (GLAMs) are host to cultural treasures and historic records but face inherent challenges maintaining accessibility and traceability in their legacy collections. Rolling COVID-19 lockdowns over the past three years (2020-2023) have limited access to primary materials while user expectation of digital access to collections has grown. With renewed digital access, however, come new challenges in authentication and provenance tracking: collection digitization and monitoring of cultural artefacts introduce new lines of work for institutions already constrained by budgets and staffing. Building upon our previous exploration of this topic, “NFTs: Tulip Mania or Digital Renaissance?”, we present a design solution for tracking and monitoring GLAM collection objects via a hardware controller with a Trusted Execution Environment (TEE) that interfaces with a trusted and flexible digital twin ledger architecture, selected from our analysis of database and private ledger technologies. We conclude by outlining the physical threat model for this design: future work will expand this model to include digital (cyber) threats to GLAM collection objects and investigate credentialed queries.
11:30 – 12:30 SESSION 3: Emerging Challenges & Opportunities
- 11:30-11:50 #7: Critical Community-Centeredness: Ethical Considerations for Computational Archival Studies (S01203)
Madelynn Dickerson, Audra Eagle Yun [Digital Scholarship Services University of California, Irvine Libraries Irvine, USA / Special Collections & Archives University of California, Irvine Libraries Irvine, USA]
PAPER — VIDEO ABSTRACT: In this paper, we call for computational archival studies to prioritize social justice and community-centeredness. Our initial research findings, as well as the work of community archives, provide evidence of the need to elevate and truly center the voices of those depicted (or underrepresented) in large-scale digital archives, leveraging the power of computational thinking with the transformative experience of seeing oneself represented (or representing oneself) in digital collections.
- 11:50-12:10 #8: Accelerating Precision Research and Resolution Through Computational Archival Science Pedagogy (S01204)
Sarah A. Buchanan, Jennifer L. Wachtel, Jennifer A. Stevenson [University of Missouri Columbia, USA / University of Maryland and National Archives and Records Administration, Washington, D.C., USA / Defense Threat Reduction Agency Fort Belvoir, USA]
PAPER — VIDEO ABSTRACT: Use of archival collections is accelerated by the presence of finding aids, which communicate the arrangement and description of collection contents. To arrive at the optimal arrangement of a collection, archivists rely on some item-level processing or knowledge gained by exploring and manipulating digital reproductions of the contents. In this paper we consider archival student and instructor perspectives from hands-on course experiences directly with two distinct collections: one pertaining to the development, 2017 transfer and launch, and ongoing maintenance of the International Research Portal for Records Related to Nazi-Era Cultural Property (IRP2), and one a selection of unclassified catalog entries about digitized nuclear science reports. Visualizing is a data practice that permits the discovery of key content patterns, identification of computational models to be carried out to aid further analysis, and query-resolution for subject experts with precise – and historically significant – research questions. While archival data visualizations have previously been implemented as an extension of descriptive work including finding aid element counts, here we connect visualization to the work of archival outreach and access. We study how visualizations generated by groups of students working with textual and numerical dataset portions can ultimately accelerate time-sensitive uses of collections.
- 12:10-12:30 (virtual) #9: The Utility of Standards and Good Practice Guidelines for Records Professionals: Comparing Apples, Oranges, and Other Fruits (S01215)
Shadrack Katuu [University of South Africa, Pretoria, SOUTH AFRICA]
PAPER — VIDEO ABSTRACT: The perceived usefulness of standards and good practice guidelines (S&GPG) for records professionals is often seen as ambiguous. Many professionals find the abundance of options overwhelming and confusing. Even after selecting seemingly suitable S&GPG, their direct benefits may not always be evident and can potentially restrict professional autonomy in certain situations. This article explores various approaches employed by records professionals to understand the complexity of S&GPG, such as simple listing or ontological representation. However, each approach has its own set of constraints. The article proposes an initial meta-framework that draws from insights of successful frameworks to provide preliminary categories. The purpose of this conceptual proposal is to help records professionals understand the connections between S&GPG.
12:30 – 2:00 LUNCH BREAK
2:15 – 3:00 KEYNOTE: “Archival-informed AI”
Dr. Emanuele Frontoni [University of Macerata, ITALY]
ABSTRACT:
BIO:
3:00 – 4:00 SESSION 4: Generative AI and LLMs
- 3:00-3:20 #10: Can GPT-4 Think Computationally about Digital Archival Practices? (S01213)
William Underwood, Joan Gage [College of Information Studies, University of Maryland, College Park, MD, USA / Paul D West Middle School, Fulton County Schools, East Point, GA, USA]
PAPER — VIDEO ABSTRACT: This paper describes an investigation of GPT-4’s knowledge in some areas of archival practice, and its ability to think computationally about archival tasks. It is demonstrated that GPT-4 has shown an understanding of ten among the twenty-two distinct forms of computational thinking. When GPT-4 is combined with plugins, it is able to apply some of these methods and tools to digital archival tasks.
- 3:20-3:40 (virtual) #11: Exploring the Application of Large Language Models in Detecting and Protecting Personally Identifiable Information in Archival Data: A Comprehensive Study (S01207)
Jianliang Yang, Xiya Zhang, Kai Liang, Yuenan Liu [School of Information Resource Management Renmin University of China Beijing, CHINA / Digital Archives Management Office Hangzhou Archives Zhejiang, CHINA]
PAPER — VIDEO ABSTRACT: This comprehensive study investigates the application of Large Language Models (LLMs) for detecting and protecting Personally Identifiable Information (PII) in archival data, a pressing concern for archives under the mandate to increase public access while safeguarding personal privacy. The paper juxtaposes traditional supervised learning methods against LLMs’ unsupervised capabilities in PII detection, unveiling LLMs as viable alternatives capable of achieving satisfactory performance levels without the need for extensive training datasets. Through empirical analysis, the study validates the feasibility of LLMs in identifying sensitive information within large volumes of archival material. The findings highlight LLMs’ significant interpretability, providing understandable rationale behind PII identification—a feature that not only enhances trust in AI applications but also aids archival staff in the review process. This research contributes novel insights into the intersection of AI and archival science, presenting LLMs as powerful tools for addressing the twin challenges of data accessibility and privacy.
- 3:40-4:00 (virtual) #12: AI-Generated Images as an Emergent Record Format (S01212)
Jessica Bushey [School of Information San José State University San José, USA]
PAPER — VIDEO ABSTRACT: AI-generated images are disrupting existing approaches to verifying the trustworthiness of visual media. The application of generative AI in fields in which images are trusted visual evidence of persons, actions and events is drawing the attention of archival scientists and AI researchers. A literature review of AI-generated images as an emergent record format identified an absence of archival and recordkeeping knowledge. Analysis of the results revealed six thematic categories: authenticity and verifiability; manipulation and misinformation; bias and representation; attribution and intellectual property; transparency and explainability; and ethical considerations. These themes inform the development of research questions and the next phase of the study that includes the application of theory and methods of archival diplomatics and computational archival science.
4:00 – 4:30 COFFEE BREAK
4:30 – 5:30 SESSION 5: Discussion and closing
IMPORTANT DEADLINES:
- Monday, Nov. 6, 2023 (final): Due date for full workshop papers submission
- Wednesday, Nov 15, 2023: Notification of paper acceptance to authors
- Wednesday, Nov 22, 2023 (hard deadline): Camera-ready of accepted papers
- Sunday, Dec 17, 2023: Day-long CAS workshop (in person) in Sorrento, IT
- If you are planning on attending the workshop, please contact organizers for registration details!
PAPER SUBMISSION:
- Please submit a full-length paper (up to 10 pages in IEEE 2-column format; reference pages don’t count toward the 10 pages) through the online submission system at: https://wi-lab.com/cyberchair/2023/bigdata23/scripts/ws_submit.php
- Formatting Instructions: Papers should be formatted to IEEE Computer Society Proceedings Manuscript Formatting Guidelines: https://www.ieee.org/conferences/publishing/templates.html
COMPUTATIONAL ARCHIVAL SCIENCE: digital records in the age of big data
INTRODUCTION TO WORKSHOP [also see our CAS Portal]:
The large-scale digitization of analogue archives, the emerging diverse forms of born-digital archive, and the new ways in which researchers across disciplines (as well as the public) wish to engage with archival material, are resulting in disruptions to traditional archival theories and practices. Increasing quantities of ‘big archival data’ present challenges for the practitioners and researchers who work with archival material, but also offer enhanced possibilities for scholarship, through the application both of computational methods and tools to the archival problem space and of archival methods and tools to computational problems such as trusted computing, as well as, more fundamentally, through the integration of computational thinking with archival thinking.
Our working definition of Computational Archival Science (CAS) is:
- A transdisciplinary field that integrates computational and archival theories, methods and resources, both to support the creation and preservation of reliable and authentic records/archives and to address large-scale records/archives processing, analysis, storage, and access, with the aim of improving efficiency, productivity and precision, in support of recordkeeping, appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.
OBJECTIVES
This workshop will explore the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice (including record keeping) and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. We aim to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we will address the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality–meaning, knowledge and value–from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.
This will be the 8th workshop at IEEE Big Data addressing Computational Archival Science (CAS), following on from workshops in 2016, 2017, 2018, 2019, 2020, 2021 and 2022. It also builds on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland.
All papers accepted for the workshop will be included in the Conference Proceedings published by the IEEE Computer Society Press. In addition to standard papers, the workshop (and the call for papers) will incorporate a student poster session for PhD and Master’s level students.
RESEARCH TOPICS COVERED:
Topics covered by the workshop include, but are not restricted to, the following:
- Application of analytics to archival material, including AI, ML, text-mining, data-mining, sentiment analysis, network analysis.
- Analytics in support of archival processing, including e-discovery, identification of personal information, appraisal, arrangement and description.
- Scalable services for archives, including identification, preservation, metadata generation, integrity checking, normalization, reconciliation, linked data, entity extraction, anonymization and reduction.
- New forms of archives, including Web, social media, audiovisual archives, and blockchain.
- Cyber-infrastructures for archive-based research and for development and hosting of collections
- Big data and archival theory and practice
- Digital curation and preservation
- Crowd-sourcing and archives
- Big data and the construction of memory and identity
- Specific big data technologies (e.g. NoSQL databases) and their applications
- Corpora and reference collections of big archival data
- Linked data and archives
- Big data and provenance
- Constructing big data research objects from archives
- Legal and ethical issues in big data archives
PROGRAM CHAIRS:
Dr. Mark Hedges
Department of Digital Humanities (DDH)
King’s College London, UK
Prof. Victoria Lemieux
School of Information
University of British Columbia, CANADA
Prof. Richard Marciano
Advanced Information Collaboratory (AIC)
College of Information Studies
University of Maryland, USA
PROGRAM COMMITTEE MEMBERS:
Dr. Bill Underwood
Advanced Information Collaboratory (AIC)
College of Information Studies
University of Maryland, USA
Dr. Jane Greenberg
Alice B. Kroeger Professor and Director, Metadata Research Center
College of Computing & Informatics
Drexel University, USA
Mark Conrad
Advanced Information Collaboratory (AIC)
College of Information Studies
University of Maryland, USA
Gregory Jansen
Advanced Information Collaboratory (AIC)
College of Information Studies
University of Maryland, USA
Rajesh Kumar Gnanasekaran
Advanced Information Collaboratory (AIC)
College of Information Studies
University of Maryland, USA
Lori Perine
Advanced Information Collaboratory (AIC)
College of Information Studies
University of Maryland, USA
Jennifer Proctor
Advanced Information Collaboratory (AIC)
College of Information Studies
University of Maryland, USA