9th COMPUTATIONAL ARCHIVAL SCIENCE (CAS) WORKSHOP
Tuesday, Dec. 17, 2024
Location: Congressional D (Lobby)
Hyatt Regency Washington on Capitol Hill
400 New Jersey Avenue, NW
Washington, D.C. 20001 United States
Part of: 2024 IEEE Big Data Conference (IEEE BigData 2024)
https://www3.cs.stonybrook.edu/~ieeebigdata2024
Dec. 15-18, 2024
SCHEDULE: 8:15 to 5:10 with 6 Sessions
– 8:15 – 8:20 WELCOME
– 8:20 – 8:40 SESSION 1: Trends in Computational Archival Science (CAS) [1 talk]
– 8:40 – 10:00 SESSION 2: Exploring and Using Archives [5 talks]
** 10:00 – 10:30 COFFEE BREAK **
– 10:30 – 10:50 SESSION 2 continued…
– 10:50 – 12:30 SESSION 3: AI for Archival Functions [5 talks]
** 12:30 – 2:00 LUNCH BREAK **
– 2:00 – 3:00 SESSION 4: Computer Vision & Video [3 talks]
– 3:00 – 3:40 SESSION 5: Ethical Considerations [2 talks]
– 3:40 – 4:00 * Discussion & Additional Questions before coffee break *
** 4:00 – 4:30 COFFEE BREAK **
– 4:30 – 5:10 SESSION 6: Archival Education & Training [2 talks]
8:15 – 8:20 WELCOME
- Workshop Chairs:
Victoria Lemieux 1, Richard Marciano 2, Mark Hedges 3
1 U. British Columbia CANADA / 2 U. Maryland USA / 3 King’s College London UK
42 attendees & 18 papers from 6 countries: Canada (1), USA (13) (North America) / Germany (3), Ireland (1), Switzerland (2) (Europe) / Japan (1) (Asia), and 21 distinct institutions.
8:20 – 8:40 SESSION 1: Trends in Computational Archival Science (CAS)
- 8:20-8:40 #1-1: A Computational Review of the Literature of Computational Archival Science (CAS): Advancing Archival Theory in the Age of the Digital Tsunami and the Vanishing Box Problem (S01204)
Jennifer Proctor & Richard Marciano [U. Maryland / USA]-

PAPER — VIDEO — SLIDES 
ABSTRACT: This paper examines literature from the field of Computational Archival Science (CAS) to track efforts to address the challenges of the digital age in archives. The Digital Tsunami presented archives with a problem of scale, challenges with Digital Fragility, and changing modes of access, which CAS sought to address with computational methods. The born-digital revolution presents the challenge of the Vanishing Box: the loss of topical and temporal structure in records created by the digital workforce, leaving collections of Virtual Machines that contain chaotic virtual drifts of items with limited metadata and suffer from delays in digital preservation. This raised further issues of scale and required a fundamental rethinking of foundational theories of archival science, which CAS sought to address with changes to the Appraisal, Records Management, Description, and Preservation practices previously developed for analog and digitized records.
-
8:40 – 10:00 SESSION 2: Exploring and Using Archives
- 8:40-9:00 #2-1: Exploring Large Language Models for Analyzing Changes in Web Archive Content: A Retrieval-Augmented Generation Approach (S01205)
Jhon G. Botello, Lesley Frew, Jose J. Padilla & Michele C. Weigle [Old Dominion U. / USA]-

PAPER — VIDEO — SLIDES ABSTRACT: Websites typically display only their most recent content. However, the dynamic nature of the web leads to frequent updates and deletions. Web archives preserve snapshots of earlier versions for those interested in tracking changes over time. Analyzing these changes often requires a manual process that relies on traditional methods focused on term- or phrase-level differences. This study explores the capability of Large Language Models (LLMs), specifically GPT-4o, through a Retrieval-Augmented Generation (RAG) approach for detecting changes in archived web pages. Using WARC-GPT, a RAG pipeline for interacting with Web ARChive (WARC) files, we identify and analyze changes across a small set of U.S. federal environmental web pages that changed between 2016 and 2020. Our findings show that GPT-4o can effectively detect inconsistencies in web archive content, including the nature of each change and the semantic context in which it occurred. Our exploration represents an initial step toward using Artificial Intelligence (AI) for deeper and more scalable web change analysis.
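The retrieval step of such a pipeline can be illustrated with a minimal sketch: isolate the passages that differ between two archived snapshots, since only those passages need to be passed as context to the LLM. This sketch is not part of WARC-GPT; the snapshot texts and the `changed_passages` helper are invented for illustration, and WARC parsing and the GPT-4o call are omitted.

```python
import difflib

def changed_passages(old_text: str, new_text: str) -> list[tuple[str, str]]:
    """Return (old, new) sentence pairs that differ between two snapshots.

    In a RAG pipeline, these pairs would be handed to an LLM as the
    retrieval context for semantic change analysis."""
    old_sents = [s.strip() for s in old_text.split(".") if s.strip()]
    new_sents = [s.strip() for s in new_text.split(".") if s.strip()]
    matcher = difflib.SequenceMatcher(a=old_sents, b=new_sents)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replaced/inserted/deleted spans
            pairs.append((" ".join(old_sents[i1:i2]),
                          " ".join(new_sents[j1:j2])))
    return pairs

# Toy 2016 vs. 2020 snapshots of a hypothetical agency page.
snap_2016 = "The agency studies climate change. Contact us by mail."
snap_2020 = "The agency studies environmental trends. Contact us by mail."
diffs = changed_passages(snap_2016, snap_2020)
```

Restricting the LLM's context to the changed passages, rather than whole pages, is what makes this kind of analysis scale.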
-
- 9:00-9:20 #2-2: Historic Black Lives Matter: Recovering Hidden Knowledge in Archives Through Interactive Data Visualization (S01217)
Lori Perine [U. Maryland / USA]-

PAPER — VIDEO — SLIDES ABSTRACT: This paper presents the Historical Black Lives Matter (HBLM) case study, an exploratory application of interactive data visualization to a collection of manumissions documents in the Legacy of Slavery (LoS) project at the Maryland State Archives, with the goal of enhancing discovery and recovering hidden knowledge. The case study extends prior interdisciplinary research on applying computational treatments to LoS collections and contributes to research in Computational Archival Science (CAS), computational thinking, and data visualization to enhance access to archival collections. Three design objectives are addressed: representation of people, user experience, and facilitation of knowledge discovery. The paper is organized to demonstrate a customizable workflow for the process of formulating design based on data visualization principles, implementing designs with open-source tools, and incorporating user evaluation in service to successfully fulfilling the design objectives and related functionality in the final implemented design. Examples of hidden knowledge recovered using the visualizations are presented, providing new insights into Maryland’s antebellum Black population. The data visualization design methods and practices permitted investigation at a more granular level, and enabled communication of a richer narrative. Use of open-source software makes these methods accessible to archivists, information professionals, and researchers, and supports creation of artifacts for research, teaching and learning. Future extensions could incorporate advanced computational techniques to enable map features, network and textual analysis, and dynamic query-based composition of visualizations.
-
- 9:20-9:40 #2-3: Can Generative AI Uncover Hidden Patterns in Historical Domestic Traffic Ads Through Data Analysis? A ChatLoS-DTA Exploration (S01207)
Mariia Vetluzhskikh, Rajesh Kumar Gnanasekaran & Richard Marciano [U. Maryland / USA]-

PAPER — VIDEO — SLIDES ABSTRACT: This paper presents ChatLoS-DTA, a custom Generative Pre-trained Transformer (GPT) model specifically developed for data analysis on the Domestic Traffic Ads (DTA) Legacy of Slavery dataset. The DTA dataset consists of numerous historical newspaper advertisements from 1824 to 1864 for buying and selling enslaved individuals across Maryland. This dataset, digitized and curated by the Maryland archives, offers valuable insights into patterns within the domestic slave trade. However, certain accessibility challenges exist for non-technical users, including the descendants of enslaved individuals or cultural researchers. ChatLoS-DTA, built on OpenAI’s ChatGPT-4 and Python libraries, was designed to allow such users to query the dataset using natural language without requiring any technical expertise. This paper discusses ChatLoS-DTA’s architecture, ethical framework, performance, and limitations, highlighting the model’s potential as a template for applying generative AI in cultural and historical research. Future work includes refining the tool’s accuracy to broaden dataset compatibility and further enhance ethical safeguards.
-
- 9:40-10:00 (virtual) #2-4: Ontology-driven knowledge base for digital humanities: Restructuring knowledge organization at the library of the Folkwang University of the Arts (S01210)
Andrea Linxen (1), Vera-Maria Schmidt (2), Harald Klinke (3) & Christian Beecks (1) [(1) FernUniversität Hagen, (2) Folkwang U., (3) LMU München / GERMANY]-

PAPER — VIDEO — SLIDES ABSTRACT: Academic libraries are increasingly challenged by the need to efficiently manage and analyse vast collections of data and knowledge. The diverse formats and organisation methods of these collections, ranging from traditional print media to digital archives and multimedia assets, can hinder researchers’ ability to easily access and retrieve relevant information. This paper introduces an ontology-driven knowledge base to address this issue by enabling efficient access to knowledge in the application domain and enhancing semantic search capabilities in the field of Digital Humanities. Our approach focuses on the development of an ontology-driven knowledge base for semantic search in academic libraries, using the example of the library of the Folkwang University of the Arts, that captures the knowledge concepts present in the library’s archival collections. The resulting ontology framework provides a structured representation of domain knowledge, facilitating the integration of diverse data sources, including structured, semi-structured, and unstructured data from the application domain, into a triple store knowledge base. By leveraging SPARQL queries generated from Large Language Model (LLM) prompts, we aim to facilitate more intuitive and effective knowledge retrieval. This approach allows users to express their information needs in a more natural and flexible way, leading to more accurate and relevant search results. We evaluate the proposed ontology-driven knowledge base in terms of its integrity, consistency, flexibility, relevance, and scalability. Our evaluation methodology combines verification and validation techniques, including automated reasoners and query results based on competence questions. Our findings demonstrate the potential of ontology engineering to enhance complex information retrieval in academic libraries.
However, we also identify limitations related to processing speed for complex queries and the quality of search results. This research contributes to the field of computational archival science by providing a novel approach to semantic search in academic libraries. By enabling more precise and efficient access to knowledge, our ontology-driven knowledge base has the potential to enrich the academic and Digital Humanities landscape, empowering researchers to delve deeper into the vast resources available within these institutions.
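The LLM-to-SPARQL idea can be sketched without a real triple store: an in-memory set of triples stands in for the knowledge base, and a basic-graph-pattern matcher stands in for the SPARQL engine. All URIs, predicates, and the example query below are invented for illustration and do not come from the Folkwang ontology.

```python
# Hypothetical triples such as an ontology-driven library KB might hold.
TRIPLES = {
    ("ex:Score42", "rdf:type", "ex:MusicalScore"),
    ("ex:Score42", "ex:composer", "ex:Humperdinck"),
    ("ex:Photo7", "rdf:type", "ex:Photograph"),
    ("ex:Photo7", "ex:depicts", "ex:Humperdinck"),
}

def match(pattern, triples=TRIPLES):
    """Match one (s, p, o) pattern; terms starting with '?' are variables.
    A SPARQL engine evaluates conjunctions of such patterns."""
    bindings = []
    for s, p, o in triples:
        env = {}
        ok = True
        for term, value in zip(pattern, (s, p, o)):
            if term.startswith("?"):
                env[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            bindings.append(env)
    return bindings

# An LLM prompt like "find all musical scores" might yield the SPARQL
#   SELECT ?s WHERE { ?s rdf:type ex:MusicalScore }
# which corresponds to this single basic graph pattern:
scores = [b["?s"] for b in match(("?s", "rdf:type", "ex:MusicalScore"))]
```

The point of the paper's design is that users write the natural-language prompt, not the pattern; the LLM produces the SPARQL, and the ontology constrains what the query can mean.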
-
10:00 – 10:30 COFFEE BREAK
- 10:30-10:50 #2-5: Myanmar Law Cases and Proceedings Retrieval with GraphRAG (S01208)
Shoon Lei Phyu, Jaman Shuhayel, Murataly Uchkempirov & Parag Kulkarni [Tokyo International U. / JAPAN]-

PAPER — VIDEO — SLIDES ABSTRACT: Legal document retrieval poses various challenges due to diverse linguistic and domain-specific complexities. The GraphRAG approach represents a significant advance in retrieving and summarizing archival case documents. It deals with the difficulties of accessing relevant legal information amid these inherent complexities. Further, it improves the efficiency of information retrieval by using graphical representations of legal texts. It enables lawyers to navigate the complex relationships between cases, statutes, and legal principles. The framework facilitates extracting relevant information and incorporates advanced natural language processing techniques for efficient summarization. It enables users to understand key legal concepts quickly. By fostering interdisciplinary collaboration and focusing on user-centered design, GraphRAG can significantly improve access to legal information, thereby meeting the growing needs of the legal community. This paper proposes a GraphRAG-based approach for multilingual legal information retrieval (ML2IR), focusing on the Burmese language. Our GraphRAG-based approach addresses the hallucination problem, which is crucial in legal information retrieval. Additionally, our work identifies important nodes and establishes contextual relationships, leading to higher accuracy and effective information retrieval.
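The core GraphRAG intuition, retrieving by expanding graph neighborhoods rather than by text match alone, can be sketched with a toy citation graph. The nodes, edges, and document texts below are invented for illustration; a real index would be built from the case corpus itself.

```python
# Toy citation graph: case documents linked to statutes and precedents.
GRAPH = {
    "case:A": ["statute:S1", "case:B"],
    "case:B": ["statute:S1"],
    "statute:S1": [],
}
DOCS = {
    "case:A": "land dispute ruling citing statute S1",
    "case:B": "earlier land dispute precedent",
    "statute:S1": "statute governing land registration",
}

def graph_retrieve(query: str, hops: int = 1) -> list[str]:
    """Keyword-match seed nodes, then expand along graph edges.
    The neighbor expansion is what lets a graph-based retriever
    surface related statutes and precedents, not just text matches."""
    seeds = [n for n, text in DOCS.items() if query in text]
    result = set(seeds)
    frontier = list(seeds)
    for _ in range(hops):
        frontier = [m for n in frontier for m in GRAPH.get(n, [])]
        result.update(frontier)
    return sorted(result)

hits = graph_retrieve("precedent")
```

Grounding the generation step in this retrieved subgraph, instead of free-form model memory, is also what mitigates hallucination: the model can only summarize nodes that were actually retrieved.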
-
10:50 – 12:30 SESSION 3: AI for Archival Functions
- 10:50-11:10 #3-1: Maturity Assessment of Appraisal Processes in the AI Age: Ongoing Framework and Measuring Method (S01212)
Basma Makhlouf Shabou [HES-SO Geneva / SWITZERLAND]-

PAPER — VIDEO — SLIDES ABSTRACT: The increasing volume of generated data and archives raises pertinent questions regarding the effectiveness of traditional archival appraisal methods, which largely depend on human expertise. As automation and artificial intelligence (AI) become prevalent in various sectors, the field of archiving stands on the brink of significant transformation. This paper explores the integration of AI within archival appraisal processes, framed within the context of the Maturity Assessment for Appraisal (MAA) project (2023-2025). The MAA seeks to evaluate the defensibility, stability, and appropriateness of current appraisal practices while assessing the readiness of records for automated appraisal. Employing exploratory qualitative research, the study outlines a systematic approach that includes a literature review, testing of a maturity model, and consultations with archival professionals. The MAA encompasses six key dimensions: principles, vision/strategic framework, compliance, methodology, tools, and criteria, providing a structured framework for assessing the maturity of appraisal practices in the AI age. Preliminary results highlight the potential benefits of AI in enhancing appraisal efficiency and effectiveness, paving the way for more informed and defensible archival decisions. Some applied use cases have already started, and results derived from them will be highlighted.
-
- 11:10-11:30 (virtual) #3-2: Collaborating for Change? Assessing Metadata Inclusivity in Digital Collections with Large Language Models (LLMs) (S01209)
Giulia Osti (1) & Elizabeth Russey-Roke (2) [(1) University College Dublin & (2) Emory U. / IRELAND & USA]-

PAPER — VIDEO — SLIDES ABSTRACT: This research explores how Large Language Models (LLMs) can complement human expertise and ethical judgment to support reparative archival description practices. We assess the interpretive abilities of three LLMs (Command R+, GPT-3.5 Turbo, and GPT-4o Mini) in assisting humans with metadata inclusivity evaluations. Our testbed comprises a small metadata subset (369 records) from the Robert Langmuir African-American Photograph Collection at Emory University. Despite limited task-specific training and no access to the digital objects associated with the metadata, the LLMs demonstrated notable capacity in identifying gaps, harmful language, and latent contextual elements. By integrating computational methods into descriptive workflows, this research advances Computational Archival Science (CAS), demonstrating how LLMs can connect computational techniques with archival practices to tackle complex, human values-driven challenges.
-
- 11:30-11:50 #3-3: AI-Ready Data: Knowledge Extraction from Archival Lab Notebooks (S01213)
Joel Pepper (1), Elizabeth Jones (2) , Xintong Zhao (1), Jacob Furst (3), Kyle Langlois (3), Fernando Uribe-Romo (3), David Breen (1) & Jane Greenberg (1) [(1) Drexel, (2) Northeastern, (3) U. Central Florida / USA]-

PAPER — VIDEO — SLIDES ABSTRACT: Collections of analog lab notebooks are an invaluable source of data about research conditions, steps, and outcomes, and in aggregate have the potential to provide new insights into the successes, failures and pedagogy of research laboratories. Unfortunately, these artifacts are increasingly at risk of being lost from the historical scientific record, given limited archiving and an absence of computational and AI readiness. This paper reports on research addressing this challenge by testing mechanisms for transforming digital scans of analog lab notebooks into AI-ready data resources. The research being pursued is framed by the field of computational archival science (CAS) and the aim to utilize analog, research lab notebook data for scientific study. The paper presents background context on archival lab notebooks and CAS, discusses MOF (metal organic framework) and COF (covalent organic framework) synthesis (the scientific domain of the lab notebooks under study), and details our research methods. We demonstrate a promising approach that automatically segments pages into discrete entry types, extracts the contents of those entries, refines the output and assesses the automated results. These efforts represent a first step towards developing a framework for both improving the usability of archival lab notebooks, and enabling their contents to be used in subsequent scientific inquiry.
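The entry-type segmentation step can be caricatured in a few lines on plain text: classify each line of an already-transcribed notebook page by cue words. The entry labels and cue words are assumptions for illustration; the paper's pipeline operates on scanned page images, not clean text.

```python
# Cue words per entry type -- illustrative, not from the paper.
CUES = {
    "reagents": ("ml", "mg", "mmol"),
    "procedure": ("stir", "heat", "wash"),
    "observation": ("color", "precipitate", "yield"),
}

def classify_line(line: str) -> str:
    """Assign a notebook line to the first entry type whose cue words match."""
    low = line.lower()
    for label, words in CUES.items():
        if any(w in low for w in words):
            return label
    return "other"

# A toy transcribed page from a hypothetical MOF synthesis run.
PAGE = [
    "Added 5 ml DMF and 20 mg linker",
    "Stir at 120 C for 12 h",
    "Yellow precipitate observed",
]
entries = [(classify_line(line), line) for line in PAGE]
```

A production system would replace these heuristics with learned models, but the output shape, typed entries per page, is what makes the notebook contents queryable as data.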
-
- 11:50-12:10 #3-4: Automating Chapter-Level Classification for Electronic Theses and Dissertations (S01214)
Bipasha Banerjee, William A. Ingram & Edward A. Fox [Virginia Tech / USA]-

PAPER — VIDEO — SLIDES ABSTRACT: Traditional archival practices for describing electronic theses and dissertations (ETDs) rely on broad, high-level metadata schemes that fail to capture the depth, complexity, and interdisciplinary nature of these long scholarly works. The lack of detailed, chapter-level content descriptions impedes researchers’ ability to locate specific sections or themes, thereby reducing discoverability and overall accessibility. By providing chapter-level metadata information, we improve the effectiveness of ETDs as research resources. This makes it easier for scholars to navigate them efficiently and extract valuable insights. The absence of such metadata further obstructs interdisciplinary research by obscuring connections across fields, hindering new academic discoveries and collaboration. In this paper, we propose a machine learning and AI-driven solution to automatically categorize ETD chapters. This solution is intended to improve discoverability and promote understanding of chapters. Our approach enriches traditional archival practices by providing context-rich descriptions that facilitate targeted navigation and improved access. We aim to support interdisciplinary research and make ETDs more accessible. By providing chapter-level classification labels and using them to index in our developed prototype system, we make content in ETD chapters more discoverable and usable for a diverse range of scholarly needs. Implementing this AI-enhanced approach allows archives to serve researchers better, enabling efficient access to relevant information and supporting deeper engagement with ETDs. This will increase the impact of ETDs as research tools, foster interdisciplinary exploration, and reinforce the role of archives in scholarly communication within the data-intensive academic landscape.
-
- 12:10-12:30 (virtual) #3-5: Model Selection for HERITAGE-AI: Evaluating LLMs for Contextual Data Analysis of Maryland’s Domestic Traffic Ads (1824–1864) (S01218)
Rajesh Kumar Gnanasekaran, Lori Perine, Mark Conrad & Richard Marciano [U. Maryland / USA]-

PAPER — VIDEO — SLIDES ABSTRACT: HERITAGE-AI (Harnessing Enhanced Research and Instructional Technologies for Archival Generative Exploration using AI), part of the IMLS grant initiative GenAI-4-Archive, aims to analyze sensitive historical datasets ethically using advanced AI technologies. One of the key tasks of this project focuses on selecting the most suitable Large Language Model (LLM) for analyzing the Domestic Traffic Ads (DTA) published in Maryland between 1824 and 1864 by slave traders, a dataset rich in historical significance yet fraught with ethical considerations. Analyzing sensitive historical datasets presents unique ethical and technical challenges. This paper presents a comparative evaluation of leading LLMs to identify the optimal model to meet HERITAGE-AI’s objectives. We survey contemporary models, including OpenAI’s GPT-4o, Anthropic’s Claude Sonnet, Meta’s Llama 3.2, and Google’s Gemini, to identify the most suitable model for Generative AI-based analysis of the DTA dataset. The objective is to select an LLM that can handle the sensitive nature of the data responsibly while providing accurate and insightful analysis. Three critical evaluation criteria, among others, are established for this purpose: Sensitivity to Historical Context, Privacy and Security, and Customizability. Our analysis follows a three-step approach: evaluating free versions, paid versions, and enterprise-grade cloud-based implementations of these LLMs. Our findings reveal that while free and paid versions offer varying degrees of accessibility, they fall short in providing the privacy, security, multi-user access, and customization required for analyzing sensitive historical data like the DTA dataset. In the third step, by comparing the cloud-based implementations of Azure OpenAI’s GPT-4o, AWS Bedrock’s Claude, and AWS Bedrock’s Llama 3.2, Azure OpenAI GPT-4o emerges as the most suitable option for this project.
Although GPT-4o and Claude were close contenders, GPT-4o stood out for its high accuracy, ethical sensitivity, robust privacy controls, and scalability in a cloud-based environment. It also offers extensive customizability, allowing for effective integration of the DTA dataset and alignment with the project’s ethical standards. Future work will involve domain experts and community members in implementing Azure OpenAI GPT-4o for the DTA dataset analysis.
-
12:30 – 2:00 LUNCH BREAK
2:00 – 3:00 SESSION 4: Computer Vision & Video
- 2:00-2:20 #4-1: Sifting U. S. Census Records with Computer Vision and Machine Learning (S01216)
Gregory Jansen [U. Maryland / USA]-

PAPER — VIDEO — SLIDES ABSTRACT: This paper shares the culmination of my work to computationally enhance researcher access to U.S. Census records by targeting manual transcription labor on those document pages that are most likely to contain relevant information. Much research on the United States population over time concerns demographic groups that may be identified, for example, through the race column on census population schedules, the handwritten forms on which census takers would record household information. This project was created to support the research efforts of Dr. Richard Marciano and the study of the community impact of the forced relocation of Japanese American households during the Second World War, in particular through a detailed comparison of the Japanese American households and people recorded in the 1940 and 1950 censuses of Sacramento, California. While the census forms have a different layout in each decade, the general design is tabular, with rows and columns that may be used to visually segment the document. This paper, and the code notebooks published along with it, demonstrate a computer vision technique for segmenting population schedules to extract the individual cell images from the race column. The individual cell images are then cleaned up and fed into two different neural network models to identify the handwritten race code within them. Finally, we created a user interface that allows a researcher to perform a visual review of uncertain results from the above process and thereby create a reliable dataset containing only those population schedule pages that pertain to their research. The Python code notebooks used to perform this analysis and the review process are linked within the paper and are freely available for reuse under a Creative Commons share-alike license.
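Once column boundaries have been detected, extracting the race column reduces to a per-row slice. The sketch below hard-codes the boundaries and works on text rows in place of scanned images; the offsets, names, and race codes are invented for illustration and are not the paper's pipeline.

```python
# Column boundaries as (start, end) character offsets -- in the paper
# these come from computer vision on the scanned form, not constants.
COLUMNS = {"name": (0, 20), "race": (20, 24), "age": (24, 28)}

# Fixed-width toy rows built with format specs so the offsets line up.
PAGE_ROWS = [
    f"{'Sato, Kenji':<20}{'Jp':<4}{'34':<4}",
    f"{'Miller, Anne':<20}{'W':<4}{'41':<4}",
    f"{'Tanaka, Yuki':<20}{'Jp':<4}{'29':<4}",
]

def extract_column(rows, column):
    """Slice one column's cell out of every row of the schedule."""
    start, end = COLUMNS[column]
    return [r[start:end].strip() for r in rows]

def flag_rows(rows, column, code):
    """Indices of rows whose cell matches a target code -- the per-page
    signal used to route pages to human transcription and review."""
    return [i for i, v in enumerate(extract_column(rows, column)) if v == code]

japanese_rows = flag_rows(PAGE_ROWS, "race", "Jp")
```

In the real pipeline the extracted cells are images rather than strings, and the matching step is performed by neural network classifiers with a human-review interface for uncertain predictions.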
-
- 2:20-2:40 #4-2: Video Content Summarization with Large Language-Vision Models (S01215)
Kelley Lynch, Bohan Jiang, Ben Lambright, Kyeongmin Rim & James Pustejovsky [Brandeis U. / USA]-

PAPER — VIDEO — SLIDES ABSTRACT: We present a modular pipeline for summarizing broadcast news videos using large language and vision models, specifically integrating Whisper for ASR, TransNetV2 for shot segmentation, LLaVA for image captioning, and LLaMA for generating structured summaries. Implemented within the CLAMS platform using the Multimedia Interchange Format (MMIF) for component interoperability, our approach combines ASR transcriptions and image captions to enhance metadata extraction. We evaluated our pipeline with automated metrics based on user-generated YouTube video descriptions as well as human assessments. Our analysis highlights challenges with automated metrics and emphasizes the value of human evaluation for nuanced assessment. This work demonstrates the effectiveness of multimodal summarization for video metadata extraction and paves the way for enhanced video accessibility.
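The modular, interoperable design can be sketched as a chain of stage functions that read from and write to a shared record, loosely mimicking how MMIF lets components interoperate. The stage outputs below are stubs invented for illustration; a real pipeline would call Whisper, TransNetV2, LLaVA, and LLaMA.

```python
# Each stage enriches a shared document record, so stages can be
# swapped or reordered without changing the others.
def asr(doc):
    doc["transcript"] = "anchor introduces the evening news"  # stub
    return doc

def shot_segmentation(doc):
    doc["shots"] = [(0.0, 12.5), (12.5, 30.0)]  # stub shot boundaries
    return doc

def caption_shots(doc):
    doc["captions"] = ["news anchor at desk", "field reporter outdoors"]  # stub
    return doc

def summarize(doc):
    # A real system would prompt an LLM with transcript + captions.
    doc["summary"] = f"{len(doc['shots'])} shots: " + "; ".join(doc["captions"])
    return doc

PIPELINE = [asr, shot_segmentation, caption_shots, summarize]

def run(video_id):
    doc = {"video": video_id}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc

result = run("ep-001")
```

The shared-record pattern is the key design choice: because every component consumes and produces the same interchange structure, the summarizer can draw on both the transcript and the shot captions without being coupled to either producer.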
-
- 2:40-3:00 (recorded) #4-3: Beyond Essentials: Nuanced and Diverse Text-to-video Retrieval (S01203)
Yuchen Yang [École Polytechnique Fédérale de Lausanne / SWITZERLAND]-

PAPER — VIDEO — SLIDES ABSTRACT: The field of text-to-video retrieval has advanced significantly with the evolution of language models and large-scale pre-training on generated caption-video pairs. Current methods predominantly focus on visual and event-based details, making retrieval largely reliant on tangible aspects. However, videos encompass more than just “seen” or “heard” elements, containing diverse, nuanced layers that are often overlooked. This work addresses this gap by introducing a method that incorporates audio, style, and emotion considerations into text-to-video retrieval through three key components. First, an augmentation block is implemented to generate additional textual information on a video’s audio, style, and emotional aspects, supplementing the original caption. Second, a cross-modal audio-visual attention block fuses visual and audio data within the video, aligning it with this enriched textual information. Third, hybrid space learning is applied, using multiple latent spaces to align textual and video data, which minimizes potential conflicts between various information sources. In standard evaluations, models are often tested on benchmark datasets that emphasize simple, short, visual and event-based queries. To more accurately assess model performance under diverse query conditions that capture the nuanced dimensions of video content, we developed a new evaluation dataset. Our results demonstrate that, while our method performs comparably with state-of-the-art models on conventional test sets, it surpasses non-pre-trained models when addressing more complex queries, as evidenced by this novel test dataset.
-
3:00 – 3:40 SESSION 5: Ethical Considerations
- 3:00-3:20 #5-1: Computational Archival Processes & Assessable Sustainability: Challenges and Opportunities (S01211)
Aurèle Nicolet & Basma Makhlouf Shabou [HES-SO Geneva / SWITZERLAND]-

PAPER — VIDEO — SLIDES ABSTRACT: This article highlights the environmental impacts associated with information and communication technologies (ICT) used for data storage and processing, emphasizing the significant emissions generated during the lifecycle of electronic devices. It addresses the challenges of assessing these environmental impacts using methodologies like the GHG Protocol and Life Cycle Assessment (LCA). It also explores opportunities for mitigating these impacts through better data governance, techniques for reducing digital waste, and sustainability initiatives such as the Arch’Eco project, which aims to assess environmental impacts across the entire lifecycle of data and identify best practices for managing data in an environmentally friendly manner.
-
- 3:20-3:40 #5-2: An Ethical Reflection Aid for Responsible AI in Computational Archival Science (S01202)
Sara Mannheimer (1), Jason A. Clark (1), Scott W. H. Young (1), Bonnie Sheehey (1), Natalie Bond (2), Doralyn Rossmann (1), Hannah Scates Kettler (3) & Yasmeen Shorish (4) [(1) Montana State U., (2) U. of Montana, (3) Iowa State U. & (4) James Madison U. / USA]-

PAPER — VIDEO — SLIDES ABSTRACT: AI implementations continue to grow in cultural heritage settings and have deep connections to how computational archival science is conducted. This paper reviews how AI is being implemented in computational archival science. It then provides an overview of ethical issues and considerations when implementing AI for computational archival science. Our research team has developed an evidence-based ethical reflection aid to guide library and archives practitioners through responsible implementations of AI. When using the ethical reflection aid, practitioners consider a specific AI implementation scenario, examine how the different values held by different stakeholders may align or come into conflict, and outline potential actions for responsible AI. We present an example of using the ethical reflection aid to examine a scenario related to computational archival science. Ultimately, this paper suggests that careful examination of stakeholder values can support a more responsible computational archival science practice.
-
- 3:40-4:00 * Discussion & Additional Questions before coffee break *



4:00 – 4:30 COFFEE BREAK
4:30 – 5:10 SESSION 6: Archival Education & Training
- 4:30-4:50 #6-1: Can GPT-4 Think Computationally about Digital Archival Tasks? – Part 2 (S01206)
William Underwood (1) & Joan Gage (2) [(1) University of Maryland & (2) Paul D. West Middle School, Fulton County Schools, East Point, GA / USA]-

PAPER — VIDEO — SLIDES ABSTRACT: This study examines the computational problem-solving capabilities of GPT-4, focusing on its knowledge of machine learning, email categorization, and computational problem solving, alongside its proficiency in Python programming, computational abstraction, and program debugging. The aim of these investigations is to evaluate whether the capabilities of Large Language Models (LLMs), as demonstrated by GPT-4, can support Master of Library and Information Science (MLIS) graduate students in developing computational thinking skills relevant to digital archival tasks.
-
- 4:50-5:10 (virtual) #6-2: Training in Computational Archival Science: Do CAS Educational Frameworks meet Professional Expectations? (S01201)
Victoria Lemieux & Richard Arias-Hernandez [U. British Columbia / CANADA]-

PAPER — VIDEO — SLIDES ABSTRACT: This paper explores the evolving landscape of training for archival professionals in the context of big data and emerging technologies. By comparing two educational frameworks, the CAS framework (developed from computational thinking research and CAS research papers) and the InterPARES framework (based on empirical studies with archivists working with AI/ML), we identify areas of alignment and divergence. While both frameworks share significant concordance, suggesting a growing consensus on integrating computing into archival work, key differences in their approaches (learning outcomes vs. competencies) and focus areas (such as work practices, systems thinking, and cybersecurity) highlight the need for further discourse among archival scholars, educators, and practitioners. These distinctions must be addressed before formalizing CAS educational frameworks. This paper also initiates efforts to integrate emerging technological competencies by bridging the CAS and InterPARES frameworks, emphasizing the value of complementary perspectives from both professional practice and academic research. We argue that such integration is essential for developing robust competency frameworks in archival education, particularly within higher education’s professional programs.
-
5:10 – Discussion and Closing
6:30-9:00 National Museum of the American Indian: Banquet Award Ceremony Social Program
- In Memoriam: remembering our friend and CAS collaborator Michael Kurtz, who passed away two years ago:
“One of the pulls to the bright side is our CAS initiative. Not only is it intellectually compelling to me, but I feel I am part of an endeavor that will help others in the archival space and beyond. To be even more blunt, I am so curious to see what happens next as it makes me want to push the boundaries of the time that I have left!”

- Michael launched the CAS initiative in 2016, with Victoria Lemieux, Mark Hedges, Maria Esteva, William Underwood, Mark Conrad, and Richard Marciano [LINK], and co-founded the AI-Collaboratory in January 2020, while in London at the Alan Turing Institute at the British Library, with Victoria Lemieux, Mark Hedges, Bill Underwood, Jane Greenberg, Mark Conrad, Greg Jansen, Lyneise Williams, Eirini Goudourali, and Richard Marciano [LINK].


- IMPORTANT DEADLINES:
- Saturday, Nov. 9, 2024 (final; extended from Monday, Nov. 4): Due date for full workshop paper submissions
- Saturday, Nov. 16, 2024 (extended from Friday, Nov. 15): Notification of paper acceptance to authors
- Saturday, Nov. 23, 2024 (hard deadline): Camera-ready of accepted papers
- Tuesday, Dec 17, 2024: Day-long CAS workshop (in person) in Washington DC, USA
- If you are planning on attending the workshop, please contact mark.hedges at kcl.ac.uk for registration details!
PAPER SUBMISSION:
- Please submit a full-length paper (up to 10 pages in IEEE 2-column format; reference pages do not count toward the 10-page limit) through the online submission system at: https://wi-lab.com/cyberchair/2024/bigdata24/scripts/submit.php?subarea=BigD
- Formatting Instructions: Papers should be formatted to IEEE Computer Society Proceedings Manuscript Formatting Guidelines: https://www.ieee.org/conferences/publishing/templates.html
COMPUTATIONAL ARCHIVAL SCIENCE: digital records in the age of big data
INTRODUCTION TO WORKSHOP [also see our CAS Portal]:
The large-scale digitization of analogue archives, the emerging diverse forms of born-digital archive, and the new ways in which researchers across disciplines (as well as the public) wish to engage with archival material, are resulting in disruptions to traditional archival theories and practices. Increasing quantities of ‘big archival data’ present challenges for the practitioners and researchers who work with archival material, but also offer enhanced possibilities for scholarship, through the application both of computational methods and tools to the archival problem space and of archival methods and tools to computational problems such as trusted computing, as well as, more fundamentally, through the integration of computational thinking with archival thinking.
Our working definition of Computational Archival Science (CAS) is:
- A transdisciplinary field that integrates computational and archival theories, methods and resources, both to support the creation and preservation of reliable and authentic records/archives and to address large-scale records/archives processing, analysis, storage, and access, with the aim of improving efficiency, productivity and precision, in support of recordkeeping, appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.
OBJECTIVES
This workshop will explore the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice (including record keeping) and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. We aim to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we will address the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality (meaning, knowledge and value) from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.
This will be the 9th workshop at IEEE Big Data addressing Computational Archival Science (CAS), following on from workshops in 2016, 2017, 2018, 2019, 2020, 2021, 2022 and 2023. It also builds on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland.
All papers accepted for the workshop will be included in the Conference Proceedings published by the IEEE Computer Society Press.
RESEARCH TOPICS COVERED:
Topics covered by the workshop include, but are not restricted to, the following:
- Application of analytics to archival material, including AI, ML, text-mining, data-mining, sentiment analysis, network analysis.
- Analytics in support of archival processing, including e-discovery, identification of personal information, appraisal, arrangement and description.
- Scalable services for archives, including identification, preservation, metadata generation, integrity checking, normalization, reconciliation, linked data, entity extraction, anonymization and reduction.
- New forms of archives, including Web, social media, audiovisual archives, and blockchain.
- Cyber-infrastructures for archive-based research and for development and hosting of collections.
- Big data and archival theory and practice.
- Digital curation and preservation.
- Crowd-sourcing and archives.
- Big data and the construction of memory and identity.
- Specific big data technologies (e.g. NoSQL databases) and their applications.
- Corpora and reference collections of big archival data.
- Linked data and archives.
- Big data and provenance.
- Constructing big data research objects from archives.
- Legal and ethical issues in big data archives.
PROGRAM CHAIRS:
Dr. Mark Hedges
Department of Digital Humanities (DDH)
King’s College London, UK

Prof. Victoria Lemieux
School of Information
University of British Columbia, CANADA

Prof. Richard Marciano
Advanced Information Collaboratory (AIC)
College of Information Studies
University of Maryland, USA
PROGRAM COMMITTEE MEMBERS:
Dr. Sarah Buchanan
Library and Information Science
iSchool
University of Missouri, USA

Mark Conrad
Advanced Information Collaboratory (AIC)
College of Information
University of Maryland, USA

Dr. Anne J. Gilliland
Center for Information as Evidence (CIE)
School of Education and Information Science
UCLA, USA

Dr. Jane Greenberg
Alice B. Kroeger Professor and Director, Metadata Research Center
College of Computing & Informatics
Drexel University, USA

Gregory Jansen
Advanced Information Collaboratory (AIC)
College of Information
University of Maryland, USA

Rajesh Kumar Gnanasekaran
Advanced Information Collaboratory (AIC)
College of Information
University of Maryland, USA

Dr. Nathaniel Payne
Advanced Information Collaboratory (AIC)
Dygital9 and NOQii & Contivos
University of British Columbia, CANADA

Lori Perine
Advanced Information Collaboratory (AIC)
College of Information
University of Maryland, USA

Jennifer Proctor
Advanced Information Collaboratory (AIC)
College of Information
University of Maryland, USA

Dr. Bill Underwood
Advanced Information Collaboratory (AIC)
College of Information
University of Maryland, USA