Part of: 2022 IEEE Big Data Conference (IEEE BigData 2022) — http://bigdataieee.org/BigData2022/index.html (Osaka, Japan)

Monday Dec 19, 2022: CAS#7 Workshop (all times are in Mountain Standard Time, UTC-7)

9:00 – 9:15 WELCOME

  • Workshop Chairs:
    Mark Hedges 1, Victoria Lemieux 2, Richard Marciano 3

    1King’s College London UK  /  2U. British Columbia CANADA  /  3U. Maryland USA

    Introductions: VIDEO

9:15 – 10:00 KEYNOTE: “Archives and AI: An Overview of Current Debates and Future Perspectives”, Dr. Giovanni Colavizza

ABSTRACT: Artificial Intelligence (AI) is increasingly necessary in support of archival processes and records management decisions. The scale, rapidity and complexity required in such operations all contribute to make AI increasingly relevant to practitioners. In this talk, I will discuss recent developments at the intersection of archives and AI, as well as highlight some of the challenges still lying ahead of us. Promising avenues for future work will be proposed for discussion.

BIO: Giovanni is Assistant Professor of Digital Humanities at UvA, visiting researcher at The Alan Turing Institute and at the Centre for Science and Technology Studies (CWTS), Leiden University. He did his PhD at the Digital Humanities Laboratory of the EPFL in Lausanne, working on methods for text mining and citation analysis of scholarly publications, and is co-founder of Odoma, a start-up offering customized machine learning techniques in the cultural heritage domain. Giovanni was also a Co-investigator on the Living with Machines project and convenes the AI for Arts interest group at the Turing.

10:00 – 11:00 SESSION 1: Computational approaches to archival practice

  • 10:00-10:20 #1: An Intelligent Class: The Development of a Novel Context Capturing Method for the Functional Auto Classification of Records
    Nathaniel Payne


      ABSTRACT: The need to accurately classify records is a core problem in many domains. Historically, the classification of records was done manually, with those records “read” as they were received and categorized. Unfortunately, due to significant growth in the volume of records, the need for robust auto-classification methods that can effectively “read” and classify records is high. Today, significant challenges remain in the literature and in practice relating to the development of effective auto-classification processes. This is because functional classification is a challenge for both humans and machines, with little research on the steps needed to effectively classify a record by function. To move research forward, this paper will address both challenges. Firstly, this paper will evaluate the efficacy of manual classifiers on a classification task, using knowledge from this process to articulate a process for functional classification that utilizes a record’s archival diplomatic context. Secondly, this paper will compare the efficacy of manual versus auto-classification on a record set of over 500,000 records, using a novel auto-classification approach that leverages a record’s archival diplomatic context, and not just its content, to improve classification accuracy. As this paper will discuss, there is significant variance among records managers during manual classification, with statistically significant differences in their ability to accurately classify both administrative and operational records. Moreover, this paper will demonstrate that an auto-classifier, when trained using key elements of archival diplomatic context, can statistically outperform a group of expert manual classifiers on a classification task.

  • 10:20-10:40 #2: A Data-Driven Approach to Reparative Description at the University of Chicago
    Ashley Gosselar


      ABSTRACT: Reparative description of collections is a growing element of diversity, equity, and inclusion efforts at cultural heritage institutions. However, the scale and complexity of the work can be overwhelming in practice. I demonstrate that computational methodologies and data analytics can be used to kickstart the planning stage for reparative description of archival finding aids. I discuss auditing and analyzing finding aids at the University of Chicago Library’s Hanna Holborn Gray Special Collections Research Center for potentially problematic language utilizing Python, Trifacta, Tableau, and Neo4j. I describe insights gained by treating finding aids as data, and I share recommendations for structuring reparative description work in a logical and attainable way.

  • 10:40-11:00 #3: Metadata Verification: A Workflow for Computational Archival Science
    Joel Pepper, Andrew Senin, Dom Jebbia, David Breen, and Jane Greenberg


      ABSTRACT: Researchers seeking to apply computational methods are increasingly turning to scientific digital archives containing images of specimens. Unfortunately, metadata errors can inhibit the discovery and use of scientific archival images. One such case is the NSF-sponsored Biology Guided Neural Network (BGNN) project, where an abundance of metadata errors has significantly delayed development of a proposed, new class of neural networks. This paper reports on research addressing this challenge. We present a prototype workflow for specimen scientific name metadata verification that is grounded in Computational Archival Science (CAS), and report on a taxonomy of specimen name metadata error types along with preliminary solutions. Our 3-phased workflow includes tag extraction, text processing, and interactive assessment. A baseline test with the prototype workflow identified at least 15 scientific name metadata errors out of 857 manually reviewed, potentially erroneous specimen images, corresponding to a ∼0.2% error rate for the full image dataset. The prototype workflow minimizes the amount of time domain experts need to spend reviewing archive metadata for correctness and AI-readiness before these archival images can be utilized in downstream analysis.

11:00 – 11:20 COFFEE BREAK: virtual!

11:20 – 12:40 SESSION 2: Working with Archival Materials

  • 11:20-11:40 #4: AI and Archive Handwritten Text Recognition Applied to Patrimonial Holdings: An Example of 10 Diaries Written by Spanish Republican Teachers in 1932
    Pepita Raventos, Celio Hernandez, and Meritzell Simon


      ABSTRACT: Archival research based on records, data, and information, in interdisciplinary collaboration with Organizational Sociology and Information Science research applied to information architecture, is essential for the effective implementation of systems that contribute to the governance of organizations and for the promotion of transparency and accountability in the organization (Raventós, 2020, p. 309). Likewise, in line with the ISO 30301 standard, sustainability as a challenge is useful to obtain a systematic and verifiable approach to the management support system (MSS) for records. This includes AI principles, whose forms have a robust and effective impact on the business architecture of the organization, following ISO/TR 21965. In addition, the MSS promotes transparency and governance for the direction of the institution. With the aim of advancing and enriching this methodological perspective, this article explores how the contribution of Computational Archival Science (CAS) can encourage a paradigm shift within Archival Science: from working with records to manipulating the data they contain. To achieve this, the article presents a case study carried out by the Archive and Records Management Service of the University of Lleida (Spain), which consists of applying a form of handwritten text recognition technology (Transkribus) to part of the Lleida Teacher Training College holding, an archival heritage with unique historical value hosted by the University of Lleida (UdL).

  • 11:40-12:00 #5: CensusIRL: Historical Census Data Preparation with MDD Support
    Adam Doherty, Rachel Murphy, Alexander Schieweck, Stuart Clancy, Ciara Breathnach, and Tiziana Margaria


      ABSTRACT: Census returns are a critical source of information for governments globally. They underpin a wide spectrum of public planning including health, housing, work and education. Historically, census forms have captured names, places, dates, age, occupation, family structure, and religion. In more recent times, sexual orientation and ethnicity, queries that can be intrusive to vulnerable communities, have been added to the criteria, and for such reasons data security is of paramount importance. Most governments restrict access to individual census returns, presenting the data in aggregate report format. The Irish government is particularly strict, enforcing a statutory closure period of 100 years. An exception was made for the Irish 1911 census, which was digitised and released for free online consultation in 2009 [1]. These returns are an excellent source for genealogists and historians alike but exist as separate digital siloes. This project uses an eXtreme Model-Driven Development (XMDD) environment to create linkages between both datasets. We discuss the development process of the CensusIRL application and the process used in developing the matching algorithm. We discuss the census records and the data cleansing process used in creating the initial proof-of-concept application. We detail the different approaches to the development life-cycle of the application and describe the different utilities used in the sanitation of data points in the records and in the match-making process.

  • 12:00-12:20 #6: The Archive of Tatuoca Magnetic Observatory, Brazil: from Paper to Intelligent Bytes
    Cristian Berrío-Zapata, Ester Ferreira da Silva, Mayara Costa Pinheiro, Cristiano Mendel Martins, Vinicius Augusto Carvalho de Abreu, Mario Augusto Góngora, and Kelso Dunman


      ABSTRACT: The Magnetic Observatory of Tatuoca (TTB) was installed by Observatório Nacional (ON) in 1957, near Belém city in the state of Pará, Brazilian Amazon. Its history goes back to 1933, when a Danish mission used this location to collect data, due to its privileged position near the terrestrial equator. Between 1957 and 2007, TTB produced 18,000 magnetograms on paper using photographic variometers, and other associated documents like absolute value forms and yearbooks. Data was obtained manually from these graphs with rulers and grids, taking 24 average readings per day, that is, one per hour. In 2017, the Federal University of Pará (UFPA in the Portuguese acronym) and ON collaborated to rescue this physical archive. In 2022 UFPA took a step forward and proposed not only digitizing the documents but also developing an intelligent agent capable of reading and extracting the information of the curves with a resolution better than an hour, this being the central goal of the project. If the project succeeds, it will rescue 50 years of data imprisoned in paper, increasing measurement sensitivity far beyond what these sources used to provide. This will also open the possibility of applying the same AI to similar documents in other observatories or disciplines like seismography. This article recaps the project, and the complex challenges faced in articulating Archival Science principles with AI and Geoscience.

  • 12:20-12:40 #7: Applications of Data Analysis on Scholarly Long Documents
    Bipasha Banerjee, William A. Ingram, Jian Wu, and Edward A. Fox

      ABSTRACT: Theses and dissertations record the work of graduate students and are typically a requirement at the culmination of the graduate degree. Thus, they contain important information that reflects a graduate student’s exploration of their research topic. Although print submission was commonplace early on, most universities now require students to submit an electronic version. The electronic document, referred to henceforth as an ETD, has become the primary way of submitting, storing, and distributing graduate work. Millions of such documents have been created in the past two decades. They are maintained and stored by university libraries, digital repositories, and academic publishing companies. These online repositories have increased access to such documents. Nonetheless, these documents fail to meet the needs of researchers, who find it challenging to find and access knowledge from such long documents. The worldwide ETD collection has increased in volume to become what is known as ‘scholarly big data’. Apart from the text body, these documents contain a myriad of other pieces of knowledge like tables, figures, definitions, literature reviews, and references. There is a growing demand amongst researchers across various domains to make this collection of scholarly documents more computationally driven. We use ideas from natural language processing, information retrieval, and machine learning to excavate knowledge from this rich information source. In this paper, we examine some of the challenges we face, identify some key areas of exploration, and discuss our methods to mitigate the challenges.

12:40 – 1:00  Discussion

1:00 – 2:00 LUNCH BREAK: Virtual!

2:00 – 2:40 SESSION 3: CAS and Education

  • 2:00-2:20 #8: Computational Thinking Integration into Archival Educators’ Networked Instruction
    Sarah Buchanan, Karen Gracy, Joshua Kitchens, and Richard Marciano


      ABSTRACT: This paper discusses the use of Computational Thinking (CT) in Archival Educators’ instruction towards enhancing the training and professional development of the library and archival workforce to meet the needs of their communities, and enhancing digital collection management and access to information and resources through retrospective and born-digital content. Four educators share their teaching strategies aimed at modernizing the way digital LIS and computational education are conducted. Their goal is to create an active and engaged community of future archival practitioners, ready to tackle the digital records and archives future.

  • 2:20-2:40 #9: Teachable Insights from Working with The Mary Eliza Project for Gastronomy Students Working with Data 
    Laura Kitchings


      ABSTRACT: This paper discusses the use of a Boston City Archives dataset for a Data Analysis Course Completion Certificate from General Assembly. It documents insights gained while doing computational archival science using visualization tools designed for business use, insights shared as part of a textual analysis workshop in October 2022 for graduate students in Boston University’s Gastronomy Program. The workshop focuses less on computational tools, although students will use Google Sheets, and more on the challenges and considerations of working on digital humanities projects.

2:40 – 3:20 SESSION 4: New forms of archives

  • 2:40-3:00 #10: A Technical Assessment of Blockchain in Healthcare with a Focus on Big Data
    Ghassan Al-Sumaidaee, Anastasios Alexandridis, Rami Alkhudary, and Zelijko Zilic


      ABSTRACT: New healthcare record management (HRM) systems have been introduced as technology has evolved to provide more efficient care. Since medical data is usually sensitive and must be protected from unauthorized access, attention must be paid to data integrity, patient privacy, and storage. Blockchain technology has been proposed in the literature to integrate healthcare information systems through a decentralized and unified network. However, the literature on blockchain in healthcare is full of promises that may not be true under certain conditions. In our paper, we evaluate the veracity and sophistication of some of the claims made in the literature. We go beyond performing a literature review and shed light on the weak technical aspects claimed about blockchain. In addition, we benefit from our technical assessment and suggest some future research directions to improve healthcare systems that use blockchain and big data solutions.

  • 3:00-3:20 #11: Evolutionary Archives: The Unlikely Comparison of GenBank and Know Your Meme
    Sarah Bratt and Alexander O. Smith

      ABSTRACT: Digital trace data in archives offer novel sources for examining “evolutionary” social and cultural phenomena. Yet, few studies have formally examined the features of archives that can be of use for scholars taking evolutionary perspectives in archival science. To address this gap, we compare the design features, metadata, and affordances of two repositories – GenBank and Know Your Meme – leveraging longstanding evolutionary analogies between genes and memes to identify trace data useful for evolutionary analyses. Our empirical analysis reveals the opportunities and limitations in using networked and longitudinal data contained in repositories. Repositories, here, are analyzed as trace data. We argue that archival system designers and CAS research should be aware of how archives represent data and how the archival features influence or limit CAS research. We conclude with a discussion of the challenges associated with archival (meta)data structures offering “big data” (and “big metadata”). In examining these repositories, we speculate about computational concerns in archiving evolutionary data. Doing so moves towards a principled approach for informing how evolutionary archives could be designed.

3:20 – 3:40 COFFEE BREAK: virtual!

3:40 – 4:20 SESSION 4 (continued): New forms of archives

  • 3:40-4:00 #12: LABDRIVE, a Petabyte-Scalable, OAIS/ISO 16363-Conformant System for Scientific Research Organizations to Preserve Documents, Processed Data, and Software
    David Giaretta, Teófilo Redondo, Antonio G. Martinez, and María Fuertes


      ABSTRACT: Vast amounts of scientific, cultural, social, business, government, and other information are being created every day. There are billions of objects, in a multitude of formats, semantics and associated software. Much of this information is transitory, but there is still an immense amount which should be preserved for the medium and long term, even indefinitely. Preservation requires that the information continues to be usable, not simply to be printed or displayed. Of course, the digital objects (the bits) must be preserved, as must the “metadata” which enables the bits to be understood, including the software. Before LABDRIVE no system could adequately preserve such information, especially in such gigantic volume and variety. In this paper we describe the development of LABDRIVE and its ability to preserve and to scale up to tens or hundreds of Petabytes in a way which is conformant to the OAIS Reference Model and capable of being ISO 16363 certified.

  • 4:00-4:20 #13: Open Science and Research Data Management: A FAIR European Postgraduate Programme
    Horacio Gonzalez-Velez, Ciprian Dobre, Barbara Sanchez-Solis, Giulia Antinucci, Dave Feenan, and Dana Gheorghe


4:20 – 4:40 SESSION 2 (continued): Working with Archival Materials

  • 4:20-4:40 #14: Event Time Extraction from Japanese News Archives
    Siqi Peng, Akihiro Yamamoto, Shinsuke Mori, and Tatsuki Sekino   


      ABSTRACT: This paper proposes an integrated method for extracting the time information of events from Japanese news archives. We first utilize a new pattern-based method named TRE/ERT, combined with a neural-based model, to extract all temporal expressions possibly related to an event. Then, we apply a simple but efficient clustering and narrowing process to summarize these temporal expressions into a small time frame for events lasting shorter than a day, or into time frames for the beginning and end days for events spanning multiple days. We conducted two experiments whose results show that, when working with one-day events, our system has a precision of up to 57%, and the rate at which the actual date of the event falls within our extracted time frame reaches 100% as long as the event name is found in the archive. The results also show that our system works with multiple-day events, but needs further improvements to get better results.

4:40 – 5:00 Discussion and closing


IMPORTANT DATES

  • Friday Oct 28, 2022 (final): Due date for full workshop papers submission
  • Friday Nov 4, 2022: Notification of paper acceptance to authors
  • Tuesday Nov 27, 2022 (hard deadline): Camera-ready of accepted papers


COMPUTATIONAL ARCHIVAL SCIENCE: digital records in the age of big data


The large-scale digitization of analogue archives, the emerging diverse forms of born-digital archive, and the new ways in which researchers across disciplines (as well as the public) wish to engage with archival material, are resulting in disruptions to traditional archival theories and practices. Increasing quantities of ‘big archival data’ present challenges for the practitioners and researchers who work with archival material, but also offer enhanced possibilities for scholarship, through the application both of computational methods and tools to the archival problem space and of archival methods and tools to computational problems such as trusted computing, as well as, more fundamentally, through the integration of computational thinking with archival thinking.

Our working definition of Computational Archival Science (CAS) is:

    • A transdisciplinary field that integrates computational and archival theories, methods and resources, both to support the creation and preservation of reliable and authentic records/archives and to address large-scale records/archives processing, analysis, storage, and access, with the aim of improving efficiency, productivity and precision, in support of recordkeeping, appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.


This workshop will explore the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice (including record keeping) and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. We aim to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we will address the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality (meaning, knowledge and value) from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.

This will be the 7th workshop at IEEE Big Data addressing Computational Archival Science (CAS), following on from workshops in 2016, 2017, 2018, 2019, 2020 and 2021. It also builds on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland.

All papers accepted for the workshop will be included in the Conference Proceedings published by the IEEE Computer Society Press. In addition to standard papers, the workshop (and the call for papers) will incorporate a student poster session for PhD and Master’s level students.

Topics covered by the workshop include, but are not restricted to, the following:

    • Application of analytics to archival material, including AI, ML, text-mining, data-mining, sentiment analysis, network analysis.
    • Analytics in support of archival processing, including e-discovery, identification of personal information, appraisal, arrangement and description.
    • Scalable services for archives, including identification, preservation, metadata generation, integrity checking, normalization, reconciliation, linked data, entity extraction, anonymization and reduction.
    • New forms of archives, including Web, social media, audiovisual archives, and blockchain.
    • Cyber-infrastructures for archive-based research and for development and hosting of collections.
    • Big data and archival theory and practice.
    • Digital curation and preservation.
    • Crowd-sourcing and archives.
    • Big data and the construction of memory and identity.
    • Specific big data technologies (e.g. NoSQL databases) and their applications.
    • Corpora and reference collections of big archival data.
    • Linked data and archives.
    • Big data and provenance.
    • Constructing big data research objects from archives.
    • Legal and ethical issues in big data archives.

Dr. Mark Hedges
Department of Digital Humanities (DDH)
King’s College London, UK

Prof. Victoria Lemieux
School of Information
University of British Columbia, CANADA

Prof. Richard Marciano
Advanced Information Collaboratory (AIC)
College of Information Studies
University of Maryland, USA

Dr. Bill Underwood
Advanced Information Collaboratory (AIC)
College of Information Studies
University of Maryland, USA

Dr. Tobias Blanke
Distinguished Professor in AI and Humanities
University of Amsterdam, THE NETHERLANDS