10th COMPUTATIONAL ARCHIVAL SCIENCE (CAS) WORKSHOP
Tuesday December 9, 2025 (Online)
Part of: 2025 IEEE Big Data Conference (IEEE BigData 2025)
https://conferences.cis.um.edu.mo/ieeebigdata2025/
Dec. 8-11, 2025
SCHEDULE: 9:00 a.m. to 4:40 p.m. w. 6 Sessions & 18 papers
— all times in EST (New York) —
COMPUTATIONAL ARCHIVAL SCIENCE (CAS): digital records in the age of big data
– Workshop objectives
– Workshop research topics covered
————————–
– 9:00 – 9:10 WELCOME
– 9:10 – 9:30 KEYNOTE: Dr. Phang Lai Tee, National Archives of Singapore
– 9:30 – 10:10 SESSION 1: Blockchain & Archives [2 talks]
** 10:10 – 10:20 COFFEE BREAK **
– 10:20 – 11:40 SESSION 2: Processing Analog Archives [4 talks]
** 11:40 – 12:40 LUNCH BREAK **
– 12:40 – 1:40 SESSION 3: Retrieval-augmented Generation [3 talks]
** 1:40 – 1:50 COFFEE BREAK **
– 1:50 – 3:10 SESSION 4: Archival Theory & Computational Practice [4 talks]
** 3:10 – 3:20 COFFEE BREAK **
– 3:20 – 4:00 SESSION 5: Knowledge Organization & Retrieval [2 talks]
– 4:00 – 4:30 SESSION 6: Web Archiving [3 papers: 1 regular & 2 lightning talks]
– 4:30 – 4:40 WRAP UP
————————–
Close to 70 online participants, keynote from Dr. Phang Lai Tee, National Archives of Singapore and Chair of the UNESCO Memory of the World Preservation Sub-Committee on Artificial Intelligence, and 18 papers from 27 institutions in 8 countries spanning 5 continents:
Canada, USA (North America) / Brazil (South America) / Scotland, Spain, Switzerland (Europe) / South Africa (Africa) / Korea (Asia).
COMPUTATIONAL ARCHIVAL SCIENCE (CAS): digital records in the age of big data
1. INTRODUCTION [also see our CAS Portal at https://ai-collaboratory.net/cas/]:
The large-scale digitization of analog archives, the emerging diverse forms of born-digital archives, and the new ways in which researchers across disciplines (as well as the public) wish to engage with archival materials, are resulting in disruptions to traditional archival theories and practices. Increasing quantities of ‘big archival data’ present challenges for the practitioners and researchers who work with archival material, but also offer enhanced possibilities for scholarship, through the application both of computational methods and tools to the archival problem space and of archival methods and tools to computational problems such as trusted computing, as well as, more fundamentally, through the integration of computational thinking with archival thinking.
2. CAS [refined by Nathaniel Payne (2018)]:
- A transdisciplinary field grounded in archival, information, and computational science that is concerned with the application of computational methods and resources, design patterns, sociotechnical constructs, and human-technology interaction, to large-scale (big data) records/archives processing, analysis, storage, long-term preservation, and access problems, with the aim of improving and optimizing efficiency, authenticity, truthfulness, provenance, productivity, computation, information structure and design, precision, and human technology interaction in support of acquisition, appraisal, arrangement and description, preservation, communication, transmission, analysis, and access decision.
– OBJECTIVES:
This workshop will explore the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice (including recordkeeping) and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. We aim to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we will address the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality–meaning, knowledge and value–from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.
This will be the 10th workshop at IEEE Big Data addressing Computational Archival Science (CAS), following on from workshops in 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, and 2024. It also builds on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland.
All papers accepted for the workshop will be included in the Conference Proceedings published by the IEEE Computer Society Press.
– RESEARCH TOPICS COVERED:
Topics covered by the workshop include, but are not restricted to, the following:
- Application of analytics to archival material, including AI, ML, text-mining, data-mining, sentiment analysis, network analysis.
- Analytics in support of archival processing, including e-discovery, identification of personal information, appraisal, arrangement and description.
- Scalable services for archives, including identification, preservation, metadata generation, integrity checking, normalization, reconciliation, linked data, entity extraction, anonymization and reduction.
- New forms of archives, including Web, social media, audiovisual archives, and blockchain.
- Cyber-infrastructures for archive-based research and for development and hosting of collections
- Big data and archival theory and practice
- Digital curation and preservation
- Crowdsourcing and archives
- Big data and the construction of memory and identity
- Specific big data technologies (e.g. NoSQL databases) and their applications
- Corpora and reference collections of big archival data
- Linked data and archives
- Big data and provenance
- Constructing big data research objects from archives
- Legal and ethical issues in big data archives
9:00 – 9:10 WELCOME
- Workshop Chairs:
Mark Hedges 1, Victoria Lemieux 2, Richard Marciano 3
1 King’s College London UK / 2 U. British Columbia CANADA / 3 U. Maryland USA
VIDEO: Link
- Program Committee Members:
9:10 – 9:30 KEYNOTE: Dr. Phang Lai Tee, National Archives of Singapore:
Applications and Challenges for Archives and Documentary Heritage in the Age of AI: Some Reflections
Dr. Phang Lai Tee
Senior Deputy Director
Senior Principal Archivist
Chair of the UNESCO Memory of the World Preservation Sub-Committee on AI
National Archives of Singapore
SLIDES: Link
BIO: Dr Phang Lai Tee is the Senior Principal Archivist and Senior Deputy Director overseeing the Audio Visual Archives, the Oral History Centre, and the Sound and Moving Image Laboratory at the National Archives of Singapore, an institution of the National Library Board. She chairs the Preservation Sub-Committee of the UNESCO Memory of the World (MoW) International Advisory Committee, which is responsible for providing advice on matters relating to the selection, preservation, and accessibility of documentary heritage in all its forms and its supporting technologies. She led the setup of the UNESCO MoW working group on AI in May 2025. She also played a key role in the optimal preservation and digitization of Singapore’s 20th-century audiovisual heritage to facilitate current and future accessibility.
9:30 – 10:10 SESSION 1: Blockchain & Archives
- 9:30-9:50 #1-1: Blockchain and Responsible AI: Enhancing Transparency, Privacy, and Accountability through Blockchain Hackathon (S13207)
Jiho LEE (1), Jaehyung JEONG (1), Victoria LEMIEUX (2), Tim WEINGÄRTNER (3), and JaeSeung SONG (1) [(1) Sejong U. / REPUBLIC of KOREA, (2) U. British Columbia / CANADA, (3) Hochschule Luzern / SWITZERLAND]
PAPER — VIDEO — SLIDES ABSTRACT: This paper discusses a Computational Archival Science curriculum initiative, the 2025 Blockathon for Social Good at the University of British Columbia’s Blockchain Summer Institute (BSI), which challenged participants to implement a data privacy-enforcing LLM-based chatbot using the Clio-X fair-data ecosystem (integrated with Pontus-X). Teams successfully specified, implemented, and verified privacy-preserving mechanisms (filtering and masking) against sensitive datasets. The results demonstrate that integrating blockchain and distributed ledger technologies (DLTs) with AI systems, specifically using Clio-X’s Trusted Execution Environment (TEE), significantly strengthens data privacy and accountability by ensuring all workflow actions are traceable via the blockchain. This work offers practical insights into combining blockchain and Responsible AI for protecting personal information.
Keywords—blockchain, responsible artificial intelligence, privacy, digital archives, hackathons
- 9:50-10:10 #1-2: Cryptographic Provenance and AI-generated Images (S13212)
Jessica BUSHEY (1), Nicholas RIVARD (2), and Michel BARBEAU (2)
[(1) San Jose State U. / USA, (2) Carleton U. / CANADA]
PAPER — VIDEO — SLIDES ABSTRACT: In a world of proliferating synthetic multimedia content, it is increasingly important to develop the ability to trace the origin of digital assets and verify their authenticity. A bold initiative is the Coalition for Content Provenance and Authenticity (C2PA), which proposes a data model for associating authenticated provenance information, known as content credentials, with multimedia assets. This paper situates C2PA within the field of Computational Archival Science (CAS), examining how cryptographic provenance and authenticity frameworks can operationalize archival trustworthiness in big data environments. With a focus on digital images (including AI-generated ones), this paper explores the creation of a C2PA-based pipeline for digital asset preservation informed by archival science, computer science and cybersecurity. Using an analytical framework derived from archival diplomatics and computational provenance modeling, the study maps authenticity metadata for digital images. The role of C2PA in records management and archival preservation of AI-generated images is introduced. The use of emerging blockchain technology to support provenance of multimedia content is discussed in detail. Alternative solutions are discussed. C2PA security risks are reviewed. The findings demonstrate that C2PA represents a computational model of provenance, transforming authority-based trust into trust by design. This contribution addresses a key CAS research challenge of establishing provenance for digital assets across distributed systems.
Keywords—provenance, C2PA, blockchain, computational archival science, AI-generated Images.
10:10 – 10:20 COFFEE BREAK
10:20 – 11:40 SESSION 2: Processing Analog Sources
- 10:20-10:40 #2-1: Using an Ensemble Approach for Layout Detection and Extraction from Historical Newspapers (S13209)
Aditya JADHAV, Bipasha BANERJEE, and Jennifer GOYNE
[Virginia Tech / USA]
PAPER — VIDEO — SLIDES ABSTRACT: Digitized newspapers are archival resources that contain complex and inconsistent layouts. Pages often include multiple columns, large headlines, images, and advertisements, along with scanning issues such as noise or uneven lighting. Cloud services can provide a quick baseline, yet per-page pricing and proprietary models limit full-collection processing and transparent evaluation. We present an open-source workflow that approaches layout understanding as an essential step for newspaper digitization. The pipeline runs in iterative stages: it first applies OpenCV rules to extract high-confidence text panels, then uses the Newspaper Navigator model to remove images and advertisements, and finally applies a fine-tuned Detectron2 model to recover remaining text regions. Each detection is saved as a crop with its coordinates and metadata recorded in a manifest file. Text crops are transcribed using a vision-based multi-modal large language model (LLM). This open-source service is scalable, supports consistent reprocessing, and produces useful outputs that support archival goals of access and preservation.
Keywords—Document Layout Analysis, Historical Newspapers, Text Region Detection, Optical Character Recognition, Vision Large Language Models.
- 10:40-11:00 #2-2: PARDES: Automatic Generation of Descriptive Terms for Logical Units in Historical Handwritten Collections (S13218)
Pepita RAVENTÓS-PAJARES (1), Joan Andreu SÁNCHEZ (2), and Enrique VIDAL (2)
[(1) U. of Lleida / SPAIN, (2) U. Politècnica de València / SPAIN]
PAPER — VIDEO — SLIDES ABSTRACT: The immense volume of digitized handwritten documents in archives presents a significant accessibility challenge, as manual archival description at the item level is prohibitively costly. This paper presents the research of the PARDES project, which investigated the automatic generation of semantically relevant descriptive terms for logical units within a historical handwritten collection. The methodology leveraged pre-existing word distributions from a Handwritten Text Recognition (HTR) system applied to the “Escola Normal” diary collection (1931-1932). We implemented a term extraction technique based on expected frequency. The approach was subjectively evaluated with end-users. Preliminary results demonstrate that the proposed technique effectively identifies key concepts and named entities. End-user feedback confirmed the utility of the automatically generated term lists for accelerating cataloging tasks and enhancing content discovery. This work validates a scalable, cost-effective framework for unlocking the “hidden” knowledge within large-scale manuscript collections, with high potential for generalization across archival systems.
Keywords—Digital Archives, Handwritten Text Recognition, Probabilistic Indexing, Archival Description, Natural Language Processing
- 11:00-11:20 #2-3: From Analog Records to Computational Research Data: Building the AI-Ready Lab Notebook (S13217)
Joel PEPPER (1), Zach SIAPNO (1), Jacob FURST (2), Fernando URIBE-ROMO (2), David BREEN (1), and Jane GREENBERG (1)
[(1) Drexel U. / USA, (2) U. Central Florida / USA]
PAPER — VIDEO — SLIDES ABSTRACT: Scientific laboratory notebooks, particularly those in analog, handwritten form, represent a significant yet underutilized data source for computational studies. This paper reports on our research to further develop a pipeline for transforming analog lab notebooks to AI-Ready digital archives. The research is conducted within the framework for Computational Archival Science (CAS), extending CAS principles, drawing from archival practice and computational thinking. We provide background context on laboratory notebook history and current day use, explore CAS as a framework for study, followed by our research goals and methods. Automated extraction results for table records found in the notebooks have an error rate under 5% on a per cell basis. The framework, methods, and our findings seek to advance pipelines for making analog records, both historical and current, accessible and curated for computational research. The findings presented underscore both the accelerating pace of extraction technologies and the importance of more structured, consistent analog documentation practices to support computational transformation and AI-readiness. The conclusion summarizes results and identifies next steps.
Keywords—Computational archival science, AI-ready data, lab notebooks, digital collections
- 11:20-11:40 #2-4: Classification of Paper-based Archival Records Using Neural Networks (S13202)
Jussara TEIXEIRA (1), Juliana ALMEIDA (2), Tânia GAVA (3), Raphael DALL’ORTO (1), and José DORIGUETO (1)
[(1) Institute of Information and Communication Technology / BRAZIL, (2) State Public Archives APEES / BRAZIL, (3) Federal U. / BRAZIL]
PAPER — VIDEO — SLIDES ABSTRACT: This paper presents the application of Artificial Intelligence (AI) techniques to support the classification of paper-based archival records managed in the Electronic Process System (SEP) of the State of Espírito Santo, Brazil. Originally implemented in 1986 on a mainframe platform and migrated to a web-based environment in 2010, SEP contained more than 4.3 million unclassified records. To address this backlog, different supervised learning algorithms were evaluated, including Neural Networks, Decision Tree, Random Forest, and SGD Classifier, using TF–IDF vectorization. The experiments compared training sets of 20,000 and 200,000 examples and a test set consisting of 50,208 manually classified records. Results demonstrated that the Multilayer Perceptron (Neural Network) achieved an accuracy of 97.12% with low computational cost, outperforming traditional classifiers in scalability and adaptability. The implementation of a modular, container-based machine learning stack enabled large-scale automation, supporting the classification of more than 1.2 million records in 2025, with human supervision in specific stages of the process. During model application, clusters of processes with similar textual and contextual characteristics were identified, allowing a single classification to be assigned to entire groups, thus reducing manual effort and increasing consistency. This work contributes to the field of Computational Archival Science by demonstrating how AI can enhance functional classification while preserving archival principles of provenance and organicity.
Keywords—Archival Science, Artificial Intelligence, Computational Archival Science, Neural Network, Record Classification.
11:40 – 12:40 LUNCH BREAK
12:40 – 1:40 SESSION 3: Retrieval-augmented Generation
- 12:40-1:00 #3-1: Developing a Smart Archival Assistant with Conversational Features and Linguistic Abilities: the Ask_ArchiLab Initiative (S13203)
Basma MAKHLOUF SHABOU (1), Lamia FRIHA (2), and Wassila RAMLI (2)
[(1) University of Applied Sciences and Arts of Western Switzerland (HES-SO) / SWITZERLAND, (2) U. Geneva / SWITZERLAND]
PAPER — VIDEO — SLIDES ABSTRACT: This article describes the Ask_ArchiLab project, recently conducted at ArchiLab_Geneva School of Business Administration. The project aims to modernize archival practices using conversational AI. It addresses challenges in digital archiving through a multi-agent system that enables fast, contextual queries. The current focus is on delivering professional archival knowledge via semantic technologies like RDF using advanced RAG techniques. The project fosters international collaboration to enhance access and usability in archival science.
Keywords— Generative AI, RDF, Open Linked Data, GraphRAG, Agentics, Knowledge engineering, Advanced archival knowledge, PIAf, ICA
- 1:00-1:20 #3-2: Index-aware Knowledge Grounding of Retrieval-Augmented Generation in Conversational Search for Archival Diplomatics (S13210)
Qihong ZHOU (1), Binming LI (2), and Victoria LEMIEUX (1)
[(1) U. British Columbia / CANADA, (2) Simon Fraser U. / CANADA]
PAPER — VIDEO — SLIDES ABSTRACT: This paper discusses a novel index-aware method of semantically grounding chunking in the preprocessing phase of a conversational search pipeline. It outlines the chunking strategy, explains the setup for an experimental evaluation, and concludes with a discussion of the experimental results. The results indicate that using index-aware knowledge grounding in the conversational search pipeline can help reduce computational costs, processing resource demands, and hallucinations, and improve the precision of answers.
Keywords— conversational search, retrieval-augmented generation, computational archival science
- 1:20-1:40 #3-3: Retrieval-augmented LLMs for ETD Subject Classification (S13211)
Hajra KLAIR, Fausto GERMAN, Amr ABOELNAGA, Bipasha BANERJEE, Hoda ELDARDIRY, and William INGRAM
[Virginia Tech / USA]
PAPER — VIDEO — SLIDES ABSTRACT: Electronic Theses and Dissertations (ETDs) constitute a vital part of the global scholarly record, but require accurate subject classification for effective archival management and retrieval. The computational challenges of processing thousands of ETDs annually and implementing effective subject classification for them exemplify one of the core issues in modern digital repositories. Current practices rely on author-assigned labels that need to be manually verified, while automated approaches using LLMs struggle with document length constraints and interdisciplinary research categorization. This paper presents a two-stage pipeline that addresses these challenges. First, we develop a novel summarization framework that extracts keywords from abstracts, generates targeted questions, and synthesizes answers from full document text through BM25 retrieval and retrieval-augmented generation (RAG). This approach helps create enriched representations that capture methodological nuances beyond what abstracts alone provide. Second, we implement reasoning-guided prompt engineering that embeds explicit disambiguation logic for frequently misclassified categories, particularly at disciplinary boundaries. Our approach demonstrates consistent improvements over baseline methods, with particular gains for interdisciplinary research and commonly confused category pairs. Analysis reveals that systematic misclassification patterns where application domains overshadow methodological contributions can be effectively addressed through structured prompting that prioritizes disciplinary markers and methodological approaches. This work provides both a practical solution for ETD classification and valuable insights to help improve LLMs’ ability to distinguish between similar academic fields.
Beyond classification, these generated summaries could prove valuable for enhancing search and indexing in digital repositories, supporting literature reviews, and enabling quick understanding of lengthy dissertations.
Keywords— archival records, subject classification, natural language processing, large language models, summarization, prompt engineering, natural language generation, digital libraries, computational archival science
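For readers unfamiliar with the BM25 retrieval step named in the abstract above, the following is a minimal sketch of Okapi BM25 ranking over candidate passages. It is not the authors' implementation: the function names, the whitespace tokenizer, the parameter defaults (`k1=1.5`, `b=0.75`), and the sample passages are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per tokenized document in `docs`."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_passages(question, passages, k=2):
    """Rank passages for a question using naive lowercase whitespace tokens."""
    tokenized = [p.lower().split() for p in passages]
    scores = bm25_scores(question.lower().split(), tokenized)
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [passages[i] for i in ranked[:k]]


# Hypothetical passages standing in for chunks of a dissertation.
passages = [
    "we train a graph neural network on citation data",
    "the survey was distributed to archivists in three countries",
    "results are reported for each training split",
]
best = top_passages("graph neural network training", passages, k=1)
```

In a retrieval-augmented setup, the top-ranked passages would then be passed to a language model as context for answering the generated question; that step is omitted here.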
1:40 – 1:50 COFFEE BREAK
1:50 – 3:10 SESSION 4: Archival Theory & Computational Practice
- 1:50-2:10 #4-1: Archival Research Theory: Putting Smart Technology to Work for Researchers (S13213)
Kenneth THIBODEAU (1), Alex RICHMOND (2), and Mario BEAUCHAMP (3)
[(1) NARA (retired) / USA, (2) Bank of Canada / CANADA, (3) Carleton U. / CANADA]
PAPER — VIDEO — SLIDES ABSTRACT: This paper describes a research project that aims at extending the domain of archival theory and science so that it can actively support researchers attempting to exploit the informative potential of archives. It builds on the conceptual foundation of Constructed Past Theory, semiotics, type theory and the polymorphic entity relation attributed data model.
Keywords— archival bond, archival theory, Constructed Past Theory, semiotics, type theory.
- 2:10-2:30 #4-2: Systems Thinking, Management Standards, and the Quest for Records and Archives Management Relevance (S13206)
Shadrack KATUU
[U. South Africa / SOUTH AFRICA]
PAPER — VIDEO — SLIDES ABSTRACT: Computational Archival Science (CAS) integrates archival, information, and computational sciences to address large-scale records and archives challenges. This paper explores how systems thinking practices within CAS can reposition records and archives management (RAM) programs from overt support services to covertly embedded, value-adding components within institutions. By aligning RAM with ISO Management System Standards (MSS), including the Management System for Records (MSR), RAM programs can enhance their strategic relevance, optimize resources, and mitigate risks. This approach ensures measurable benefits, and sustainable operations, reducing the risk of marginalization while increasing organizational value.
Keywords—High-Level Structure, ISO standards, Management System for Records, Management System Standards, Plan-Do-Check-Act, Systems Thinking
- 2:30-2:50 #4-3: Can GPT-4 Think Computationally about Digital Archival Practices? – Part 3 (S13214)
William UNDERWOOD (1) and Joan GAGE (2)
[(1) U. Maryland / USA, (2) Fulton County Schools / USA]
PAPER — VIDEO — SLIDES ABSTRACT: The flood of digital records into 21st century archives has made it essential to develop new digital archival methods to replace those used for traditional paper documents, film, and tapes. Additionally, it has become necessary to integrate computational thinking skills into the graduate training of upcoming archival professionals. This paper presents findings from investigations into the computational thinking abilities of GPT-4o regarding its understanding of digital archival practices. GPT-4o demonstrates knowledge of computational modeling and simulation thinking techniques. It also exhibits familiarity with PC and MS-DOS emulation, the simulation of costs and risks associated with archival media selections, the evaluation of the empirical validity of preservation models, and the creation of a model to evaluate the trustworthiness of an archival system.
Keywords—computational thinking, GPT-4, modeling and simulation, large language models, MLIS education
- 2:50-3:10 #4-4: Algorithm Auditing for Reliable AI Authenticity Assessment of Digitized Archival Objects (S13201)
Daniel FONNER
[Southern Methodist U. / USA]
PAPER — VIDEO — SLIDES ABSTRACT: Digital archives play a critical role in enabling trustworthy digital authentication efforts and, by extension, provenance efforts, especially as diverse fields of study increasingly depend on large-scale digital records and computational analysis. Using case studies of such efforts in the field of art authentication, this study demonstrates how authenticity classifications can vary dramatically with minor changes to input image resolution. Such vulnerabilities reveal how archival truth, provenance, and legitimacy may be manipulated when algorithmic systems are deployed without transparency or stress testing. By embedding algorithm auditing within computational archival science practices, this research proposes a methodological safeguard that improves the reliability, reproducibility, and accountability of computational models applied to cultural heritage archives. Algorithm auditing and its documentation related to the applied analysis of archival material enable improved curation and trustworthy engagement with digital archives in the age of big data.
Keywords—Digital Curation, Algorithm Auditing, Fine Art.
3:10 – 3:20 COFFEE BREAK
3:20 – 4:00 SESSION 5: Knowledge Organization & Retrieval
- 3:20-3:40 #5-1: Ontologies Applied to Archival Records: a Preliminary Proposal for Information Retrieval (S13208)
Thiago Henrique BRAGATO BARROS (1)(3), Maurício COELHO da SILVA (1), Rafael Rodrígo do CARMO BATISTA (2), Frances RYAN (3), and David HAYNES (3)
[(1) U. Federal do Rio Grande do Sul / BRAZIL, (2) U. Federal de Santa Catarina / BRAZIL, (3) Edinburgh Napier U. / SCOTLAND]
PAPER — VIDEO — SLIDES ABSTRACT: Archives preserve records that document actions, rights, and memory. Yet, queries against archival catalogues often underperform when faced with term ambiguity, complex provenance, multi-level description, and evolving institutional contexts. This paper proposes a preliminary, ontology-driven approach to improve information retrieval (IR) over archival descriptions and digital objects. We review relevant literature from ontology engineering and information science, outline design principles aligned with archival theory (provenance, original order, context), and present a modular ontology pattern, ARCO (Archival Records, Contexts & Operations), covering Records, Agents, Functions, Activities, Mandates, Places, Events and Concept Schemes. We define competency questions for retrieval and describe indexing and reasoning workflows. We close with implementation considerations for public-sector environments and future work on authority control, multilingual access, and alignment to domain vocabularies and linked open data. Background claims draw from handbooks on ontologies and ontology-driven information systems, visual knowledge modelling, and e-government data publishing.
Keywords—Archives; Ontologies; Information Retrieval; Provenance; Authority Control
- 3:40-4:00 #5-2: Operationalizing Context: Contextual Integrity, Archival Diplomatics, and Knowledge Graphs (S13216)
Jim SUDERMAN (1), Frederic SIMARD (2), Nicholas RIVARD (2), Iori KHUHRO (3), Erin GILMORE (4), Michel BARBEAU (2), Darra HOFMAN (2), and Mario BEAUCHAMP (2)
[(1) Consultant / CANADA, (2) Carleton U. / CANADA, (3) U. British Columbia / CANADA, (4) San Jose State U. / USA]
PAPER — VIDEO — SLIDES ABSTRACT: Protecting privacy has become a pressing problem for archives, which are mandated with providing access to enormous volumes of records, both analogue and digital, with limited resources. Content-driven solutions, including anonymization and pseudonymization as well as many automated solutions, have delivered unsatisfactory results, with many records being broadly restricted. In this paper, we lay out the theoretical framework for a context-based AI privacy solution for archival records, combining three approaches to “context” (contextual integrity, archival diplomatics, and knowledge graphs). This framework lays the groundwork for a GraphRAG workflow that identifies critical contextual information about privacy in records collections. By centering context and making it machine-legible, this solution operationalizes contextual integrity, allowing archivists and other records professionals to make informed decisions about privacy in a resource-efficient manner.
Keywords—computational archival science, privacy, knowledge graph, machine learning, retrieval augmented generation, large language model
4:00 – 4:30 SESSION 6: Web Archiving
- #6-1: Arabic News Archiving is Catching Up to English: A Quantitative Study (S13205)
Hussam HALLACK & Michael NELSON
[Old Dominion U. / USA]
PAPER — VIDEO — SLIDES ABSTRACT: Arabic is the sixth most spoken language in the world, but it is largely underrepresented on the Internet with only 0.5% of web pages. In this paper, we present a quantitative study for the archival rate of news web pages published in Arabic compared to news pages published in English by Arabic news outlets that span 23 years. Our findings reveal that, contrary to the general conjecture that web archives favor English web pages, the archival rate of Arabic web pages increases more rapidly than the archival rate for English web pages in our dataset. This surprising trend indicates that although only 53% of Arabic news pages were archived compared to 58% of English pages, the archival rate for Arabic content increased from 24% to 53% in the last decade whereas English archiving only increased from 47% to 58%. Furthermore, we found that 99.6% of archived Arabic content only exists in the Internet Archive (IA), compared to 87.7% for English web pages. Our memento analysis shows that losing the IA would render 99.6% of archived Arabic news in our dataset permanently inaccessible if they were removed from the live web or their content had changed since they were last archived (a risk 14% greater than it is for English web pages). Our study exposes a dangerous centralization in web archiving at the IA. We also found that the rapid growth of Arabic content archiving is due to IA’s improvements rather than broader preservation efforts by all web archives. The union of all web archives excluding the IA only contributed 0.4% of Arabic archived web pages in our dataset. Our work sounds an urgent call for decentralized preservation strategies to protect the rare and sparse Arabic digital heritage.
Keywords—Arabic News Archiving, English News Archiving, Web Archiving, The Wayback Machine, The Internet Archive, News Archiving, News Preservation
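The "only exists in the Internet Archive" figure in the abstract above is a set-membership statistic. As a rough sketch of how such a share could be computed, the snippet below takes a mapping from each URI to the set of archives holding a copy. The function name, the archive labels, and the sample data are hypothetical and are not taken from the paper.

```python
def exclusive_share(holdings, archive):
    """Fraction of archived URIs whose only holder is `archive`.

    `holdings` maps each URI to the set of archives with at least one copy;
    URIs with an empty set were never archived and are excluded.
    """
    archived = [held for held in holdings.values() if held]
    if not archived:
        return 0.0
    only_here = sum(1 for held in archived if held == {archive})
    return only_here / len(archived)


# Hypothetical sample: four stories, one never archived.
sample = {
    "https://news.example/a": {"ia"},
    "https://news.example/b": {"ia", "other"},
    "https://news.example/c": set(),
    "https://news.example/d": {"ia"},
}
share = exclusive_share(sample, "ia")  # 2 of the 3 archived URIs are IA-only
```

A share close to 1.0 for a single archive means nearly all archived copies would be lost if that archive became unavailable.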
with 2 ADDITIONAL PAPER lightning talks:
#6-2 & #6-3: VIDEO — SLIDES
- #6-2: The Gap Continues to Grow Between the Wayback Machine and All Other Web Archives (S13204)
Hussam HALLACK & Michael NELSON
[Old Dominion U. / USA]
PAPER ABSTRACT: We studied the archival rate of 4,116 Arabic and English news stories published by Arabic news outlets between 1999 and 2022. We found that 45% of news stories were never archived. Among the archived stories, 99.74% were archived by the Internet Archive (IA), while all other web archives combined only preserved 6.74% of the sample. Our results contradict a 2013 study on a different sample that found redundancy across the IA and the union of other public web archives. This paper highlights the unparalleled growth of the Internet Archive and the shrinkage of all other web archives in the last decade. We demonstrate how, if the IA were to become unavailable, 95.24% of archived content in this dataset would be irretrievable if they were removed from the live web or their content had changed since they were last archived. Our quantitative study underscores the critical importance of the IA and the urgent need to reinforce diversity in web archiving.
Keywords—Web Archiving, The Wayback Machine, The Internet Archive, News Archiving, News Preservation
- #6-3: Collecting and Archiving 1.5 Million Multilingual News Stories’ URIs from Sitemaps (S13215)
Hussam HALLACK & Michael NELSON
[Old Dominion U. / USA]
PAPER ABSTRACT: We introduce JANA1.5, a dataset of 1.5 million news stories’ URIs (Uniform Resource Identifiers). We explain our approach for collecting and archiving 1.5 million Arabic and English news stories’ URIs from Sitemaps for four major news websites (Aljazeera, Aljazeera English, Alarabiya, and Arab News) which are based in Arabic countries and geared towards Arabs, English speaking Arabs, Arabic speakers around the world, and English speakers who are interested in the Arabic narratives of world news. Two of these news websites (Arab News and Aljazeera English) publish news in English while the other two (Aljazeera and Alarabiya) publish news in Arabic. Our method can be generalized to all news websites regardless of the language in which they publish. The purpose of this study is to provide an annotated and categorized dataset of archived and unarchived news stories’ URIs to be used in cross-language information retrieval (CLIR) and web archiving research. We studied multiple ways to collect news stories’ URIs including Sitemaps, RSS feeds, X posts (tweets), news aggregators, and web scraping. We chose to use Sitemaps because it offered the largest amount of news stories’ URIs in the shortest amount of time using the fewest resources and produced the least noise in the results. Such a collection can be used to conduct research on web archiving, news similarity, machine translation, CLIR, and Information Retrieval.
Keywords—Arabic News Dataset, Arabic News Stories, Cross-Language News Dataset, Cross-Language News Stories, English News Dataset
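To illustrate the Sitemaps-based collection described in the abstract above, here is a minimal sketch that extracts URIs from a sitemap document using the standard Sitemaps XML namespace. It is not the authors' tooling: the function name and the sample document are assumptions, and fetching, pagination through sitemap index files, and deduplication are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(sitemap_xml):
    """Return the text of every <loc> element in a sitemap or sitemap index."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content; a real one would be downloaded from the site.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://news.example/story-1</loc><lastmod>2022-01-01</lastmod></url>
  <url><loc>https://news.example/story-2</loc></url>
</urlset>"""

uris = extract_locs(SAMPLE)
```

Because the same `<loc>` element appears in sitemap index files, the function can also be pointed at an index to list the child sitemaps to fetch next.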
4:30 – 4:40 WRAP UP: All
In Memoriam, Dec. 17, 2022… to our friend and CAS collaborator Michael Kurtz:
“One of the pulls to the bright side is our CAS initiative. Not only is it intellectually compelling to me, but I feel I am part of an endeavor that will help others in the archival space and beyond. To be even more blunt, I am so curious to see what happens next, as it makes me want to push the boundaries of the time that I have left!”
Photo taken on Friday, Dec. 16, 2022, Annapolis, MD.
Michael launched the CAS initiative in 2016, with Victoria Lemieux, Mark Hedges, Maria Esteva, William Underwood, Mark Conrad, and Richard Marciano [LINK].
He also co-founded the AI-Collaboratory in January 2020, while in London at the British Library’s Alan Turing Institute, with Victoria Lemieux, Mark Hedges, Bill Underwood, Jane Greenberg, Mark Conrad, Greg Jansen, Lyneise Williams, Eirini Goudarouli, and Richard Marciano [LINK].











