Friday Dec. 11, 2020, 9:00 a.m. – 5:20 p.m. — Atlanta, GA (now virtual)
PART OF: IEEE Big Data 2020 — http://bigdataieee.org/BigData2020/index.html

COMPUTATIONAL ARCHIVAL SCIENCE: digital records in the age of big data


9:00 – 9:15 WELCOME

  • Workshop Chairs:
    Mark Hedges 1, Victoria Lemieux 2, Richard Marciano 3

    1 KCL, 2 UBC, 3 U. Maryland

9:15 – 10:00 KEYNOTE: 
Music in the Archives: Digital Musicology as a Case Study in Computational Archival Science (David De Roure)

  • Professor of e-Research, Oxford e-Research Centre, Department of Engineering Science
  • Director, Digital Humanities at Oxford, The Oxford Research Centre in the Humanities (TORCH)
  • Turing Fellow, Humanities and Data Science Interest Group, The Alan Turing Institute
  • Honorary Visiting Professor, Royal Northern College of Music

10:00 – 10:40 SESSION 1: Analytics for Archival Processing

  • 10:00-10:20 #1: Automatic Extraction of Dublin Core Metadata from Presidential e-Records
    [W. Underwood — U. of Maryland, USA] (Metadata, Computational Thinking)
    ABSTRACT: This paper describes how methods of natural language processing, grammatical description, and parsing can be used to recognize the document types of records distributed by the White House Press Office. It also describes how Dublin Core metadata can be extracted from these records to improve access to them via faceted search. The applications being developed have broad potential use in the Presidential Libraries. Research issues being explored include automatic induction of grammars for defining document types and automatic methods for identifying related presidential records.
  • 10:20-10:40 #2: Multi-label Classification of Chinese Judicial Documents based on BERT
    [M. Dai, C.L. Liu — National Chengchi U., TAIWAN] (NLP, Classification)
    ABSTRACT: Judicial decisions are an important part of modern democratic societies. In this paper, we present results of multi-label classification of Chinese judicial documents. The experiments employ the same corpus that was used in the Chinese AI & Law Challenge (CAIL) 2018.
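Paper #1's pipeline relies on grammatical description and parsing; as a much simpler hedged illustration of the general idea of pulling Dublin Core-style fields out of press-release text (not the authors' method — the patterns, field choices, and sample text below are invented for illustration):

```python
import re

def extract_dc(text):
    """Pull minimal Dublin Core-style fields (dc:title, dc:date, dc:type)
    from the heading of a press release. Illustrative only: the field
    names follow Dublin Core, but the patterns are ad hoc."""
    fields = {}
    # Release date typically follows a fixed banner line.
    m = re.search(r"^FOR IMMEDIATE RELEASE\s+(\w+ \d{1,2}, \d{4})", text, re.M)
    if m:
        fields["dc:date"] = m.group(1)
    # Document type is often the first word of the heading line.
    m = re.search(r"^(STATEMENT|REMARKS|EXECUTIVE ORDER)[^\n]*", text, re.M)
    if m:
        fields["dc:type"] = m.group(1).title()
        fields["dc:title"] = m.group(0).title()
    return fields

sample = """FOR IMMEDIATE RELEASE  March 4, 1997

STATEMENT BY THE PRESS SECRETARY
"""
print(extract_dc(sample))
```

Real press-release layouts vary widely, which is why the paper induces grammars for document types rather than relying on fixed patterns like these.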

10:40 – 11:00 COFFEE BREAK: virtual!

11:00 – 11:40 SESSION 2: Analytics for Archival Processing (cont.)

  • 11:00-11:20 #3: A Computational Approach to Historical Ontologies
    [M. Kelly, J. Greenberg, C.B. Rauch, S. Grabus, J.P. Boone, J.A. Kunze, P. Melville Logan — Drexel U., U. California, Temple U., USA] (Metadata, Ontologies)
    ABSTRACT: This paper presents a use case exploring the application of the Archival Resource Key (ARK) persistent identifier for promoting and maintaining ontologies. In particular, we look at improving computation with an in-house ontology server in the context of temporally aligned vocabularies. This effort demonstrates the utility of ARKs in preparing historical ontologies for computational archival science.
  • 11:20-11:40 #4: Digital Curation and Machine Learning Experimentation in Archives
    [T. Randby, R. Marciano — U. North Carolina at Chapel Hill & U. Maryland, USA] (Digital Curation, Machine Learning)
    ABSTRACT: In this paper, we present a series of experiments conducted over the summer of 2020 with the FDR Morgenthau Holocaust Collections at the FDR Presidential Library and Museum, in order to unlock hard-to-reach information in the collections and improve access for the public and researchers. We extract detailed Subject Index metadata from Table of Contents images towards creating better finding aids. We demonstrate how digital curation of archival collections is a necessary preparation step for use with supervised Machine Learning algorithms. Finally, we introduce the notion of historical contextualization of Machine Learning models in order to create culturally aware training models.
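Paper #4 extracts Subject Index metadata from Table of Contents images. A minimal sketch of the kind of post-OCR parsing step such a pipeline might include (hypothetical, not the authors' code; the dot-leader line format is an assumption):

```python
import re

def parse_toc_line(line):
    """Split an OCR'd table-of-contents line like
    'Refugees, Jewish .......... 23' into (subject, page).
    Returns None for lines that don't fit the pattern."""
    # Lazy subject match, then a run of dots/spaces, then a trailing page number.
    m = re.match(r"\s*(.+?)[\s.]{2,}(\d+)\s*$", line)
    if not m:
        return None
    return m.group(1).strip(), int(m.group(2))

entries = [parse_toc_line(l) for l in [
    "Refugees, Jewish .......... 23",
    "War Refugee Board   41",
]]
print(entries)
```

Entries recovered this way could then feed a finding aid or serve as labels for the supervised learning step the paper describes.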

11:40 – 12:40 SESSION 3: Analyzing Historical Data and Documents

  • 11:40-12:00 #5: Curation of Historical Arabic Handwritten Digital Datasets from Ottoman Population Registers: A Deep Transfer Learning Case Study
    [Y. Said Can and M. Erden Kabadayi — Koç U., TURKEY] (Deep Learning, Image Segmentation)
    ABSTRACT: With the increasing number of digitization efforts of historical manuscripts and archives, automated information retrieval systems need to extract meaning quickly and reliably. Historical archives pose more challenges for these systems than modern manuscripts. More advanced algorithms, archive-specific methods, and preprocessing techniques are needed to retrieve information. Cutting-edge machine learning algorithms should also be applied to retrieve meaning from these documents. One of the most important research issues in historical document analysis is the lack of public datasets. Although there are plenty of public datasets for modern document analysis, the number of publicly annotated historical archives is limited. Researchers can test novel algorithms on these modern datasets and infer some results, but their performance is unknown without testing them on historical datasets. In this study, we created a historical Arabic handwritten digit dataset by combining manual annotation and automatic document analysis techniques. The dataset is open to researchers and contains more than 6,000 digits. We then tested deep transfer learning algorithms and various machine learning techniques to recognize these digits, achieving promising results.
  • 12:00-12:20 #6: HRCenterNet: An Anchorless Approach to Chinese Character Segmentation in Historical Documents
    [C-W Tang, C-L Liu, P-S Chiu — National Chengchi U., TAIWAN] (Processing archival documents)
    ABSTRACT: The information provided by historical documents has always been indispensable in the transmission of human civilization, but these books are also susceptible to damage from various factors. Thanks to recent technology, the automatic digitization of these documents is one of the quickest and most effective means of preservation. Automatic text digitization can be divided into two main stages, character segmentation and character recognition, where the recognition results depend largely on the accuracy of segmentation. Therefore, in this study, we focus only on the character segmentation of historical Chinese documents. We propose a model named HRCenterNet, which combines an anchorless object detection method with a parallelized architecture. The MTHv2 dataset consists of over 3,000 Chinese historical document images and over 1 million individual Chinese characters; on this large dataset, our model achieves an average IoU of 0.81 with the best speed-accuracy trade-off compared to the others. Our source code is available at https://github.com/Tverous/HRCenterNet.
  • 12:20-12:40 #7: Towards Automatic Data Cleansing and Classification of Valid Historical Data: An Incremental Approach Based on MDD
    [E. O’Shea, R. Khan, C. Breathnach, T. Margaria — U. Limerick, IRELAND] (Digital Curation, Classification)
    ABSTRACT: The project Death and Burial Data: Ireland 1864-1922 (DBDIrl) examines the relationship between historical death registration data and burial data to explore the history of power in Ireland from 1864 to 1922. Its core Big Data arises from historical records from a variety of heterogeneous sources; some aspects are pre-digitized and machine readable. A huge data set (over 4 million records in each source) and its slow manual enrichment (ca. 7,000 records processed so far) pose issues of quality and scalability, and create the need for a quality assurance technology that is accessible to non-programmers. An important goal for the researcher community is to produce a reusable, high-level quality assurance tool for the ingested data that is domain-specific (historic data) and highly portable across data sources, thus independent of storage technology. This paper outlines the step-wise design of the finer-granular digital format, intended for storage and digital archiving, and the design and testing of two generations of the techniques used in the first two data ingestion and cleaning phases.
    The first, small-scale phase was exploratory, based on metadata enrichment transcription to Excel, and conducted in parallel with the design of the final digital format and the discovery of all the domain-specific rules and constraints for the syntactic and semantic validity of individual entries. Excel-embedded quality checks or database-specific techniques are not adequate due to the technology-independence requirement. This first phase produced a Java parser with an embedded data cleaning and evaluation classifier, continuously improved and refined as insights grew. The next, larger-scale phase uses a bespoke Historian Web Application that embeds the Java validator from the parser, as well as a new Boolean classifier for valid and complete data assurance, built using a Model-Driven Development technique that we also describe. This solution enforces property constraints directly at data capture time, removing the need for additional parsing and cleaning stages. The new classifier is built in an easy-to-use graphical technology, and the ADD-Lib tool it uses is a modern low-code development environment that auto-generates code in a large number of programming languages. It thus meets the technology-independence requirement, and historians are now able to produce new classifiers themselves without needing to program. We aim to infuse the project with computational and archival thinking in order to produce a robust data set that is FAIR compliant (Findable, Accessible, Interoperable, and Reusable).
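Paper #6 in this session reports segmentation quality as an average IoU of 0.81. For readers unfamiliar with the metric, a minimal intersection-over-union computation for axis-aligned character boxes (the standard definition, not code from HRCenterNet):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (may be empty).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # half-overlapping boxes -> 1/3
```

An IoU of 1.0 means a predicted character box coincides exactly with the ground-truth box; 0.81 on average indicates tight but not perfect localization.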

12:40 – 1:40 LUNCH: Virtual!

1:40 – 2:00 Discussion and Questions for Keynote Speaker

2:00 – 2:40 SESSION 4: Representation in Archives

  • 2:00-2:20 #8: Computational Treatments to Recover Erased Heritage: A Legacy of Slavery Case Study (CT-LoS)
    [L. Perine, R.K. Gnanasekaran, P. Nicholas, A. Hill, R. Marciano — U. Maryland, USA] (Computational thinking, Digital Curation, Ethics)
    ABSTRACT: Graduate students at the University of Maryland’s College of Information Studies (UMD iSchool) collaborated in interdisciplinary teams on a case study to explore the application of computational methodologies to datafied collections related to slavery in the Maryland State Archives (MSA). Two research questions were examined: (1) What are the opportunities and limitations of using computational methods and open source tools to characterize data encoded within records of enslavement and to discover new patterns and relationships in that data? (2) How does knowledge of social and cultural systems impact those opportunities and limitations? Computational methods and tools were most effectively used when socio-cultural contextualization and technology’s role as a mediator of representation were taken into account. Three additional technical research areas are identified to enhance recovery of heritage hidden in records of enslavement: visualization, graph databases, and ontologies and metadata.
  • 2:20-2:40 #9: Elevating “Everyday” Voices and People in Archives through the Application of Graph Database Technology
    [M. Conrad, L. Williams — U. Maryland & U. of North Carolina at Chapel Hill, USA] (Representation, Ethics)
    ABSTRACT: In a simple experiment using a graph database, we demonstrate that it is possible to increase the number of access points to individual items in archival collections. We do this by leveraging existing machine-readable and searchable data and metadata to identify and display relationships between persons, places, dates, events, etc. across items and collections. We discuss some of the financial, ethical, and representational implications of decisions made in applying technology to archival holdings. Many such decisions are made without considering these implications; our experiment illustrates several of them.
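Paper #9 multiplies access points by recording entity-item relationships in a graph database. A toy in-memory stand-in conveys the idea (the entities and items below are invented for illustration; the paper's experiment uses a real graph database over actual collection metadata):

```python
from collections import defaultdict

# Undirected links between entities (person, place, date, ...) and items.
edges = defaultdict(set)

def link(entity, item):
    """Record that an archival item mentions an entity."""
    edges[entity].add(item)
    edges[item].add(entity)

# Hypothetical sample data.
link("person:Jane Doe", "item:photo-017")
link("person:Jane Doe", "item:letter-203")
link("place:Durham, NC", "item:photo-017")

def access_points(item):
    """Every entity linked to an item becomes a search access point."""
    return sorted(edges[item])

def items_for(entity):
    """All items reachable from a given entity."""
    return sorted(i for i in edges[entity] if i.startswith("item:"))

print(access_points("item:photo-017"))
print(items_for("person:Jane Doe"))
```

Even in this toy form, a single photograph becomes findable through every person and place linked to it, rather than only through its catalog title.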

2:40 – 3:20 SESSION 5: Visual and Audio Archives

  • 2:40-3:00 #10: A Study of Spoken Audio Processing Using Machine Learning for Libraries, Archives and Museums (LAM)
    [W. Xu, M. Esteva, P. Cui, E. Castillo, K. Wang, H-R Hopkins, T. Clement, A. Choate, R. Huang — UT Austin, USA] (Machine Learning, Audio archives)
    ABSTRACT: As the need to provide access to spoken word audio collections in libraries, archives, and museums (LAM) increases, so does the need to process them efficiently and consistently. Traditionally, audio processing involves listening to the audio files, conducting manual transcription, and applying controlled subject terms to describe them. This workflow takes significant time for each recording. In this study, we investigate if and how machine learning (ML) can facilitate processing of audio collections in a manner that corresponds with LAM best practices. We use the StoryCorps collection of oral histories “Las Historias,” and fixed subjects (metadata) that are manually assigned to describe each of them. Our methodology has two main phases. First, audio files are automatically transcribed using two automatic speech recognition (ASR) methods. Next, we build different supervised ML models for label prediction using the transcription data and the existing metadata. Throughout these phases the results are analyzed quantitatively and qualitatively. The workflow is implemented within the flexible web framework IDOLS to lower technical barriers for LAM professionals. By allowing users to submit ML jobs to supercomputers, reproduce workflows, change configurations, and view and provide feedback transparently, the workflow stays aligned with LAM professional values. The study has several outcomes, including a comparison of the quality of the different transcription methods and of the impact of that quality on label prediction accuracy. The study also unveiled the limitations of using manually assigned metadata to build models, for which we suggest alternative strategies for building successful training data.
  • 3:00-3:20 #11: From Computational De-Morphogenesis to Contaminated Representation for the Contemporary Digital Tectonics and Lexicon
    [A. De Masi — Brera Academy of Fine Arts Milan, ITALY] (Recognition and Representation of patterns)
    ABSTRACT: The study illustrates a research project on “Digital Tectonics” and a “Digital Lexicon” for the recognition and representation of patterns relating to architecture and Cultural Heritage in a Web-Oriented Platform – Building Information Modeling (BIM) / Generative Design of the Modifications. The research objectives include: a) shape (through the dialogue between different 3D modeling and smart BIM tools) and the representation of processes of codification and valorization of the architecture; b) cooperation and information sharing (through an advanced 3D semantic ontological model of goods) and the monitoring of knowledge (through typologies of representation). These are pursued through a methodology defined by a Digital Layout of libraries of parametric objects divided by category. The study highlighted: a) a hierarchy of digital levels and multimedia contents; b) the creation of libraries of parametric objects; c) a semantic level of models at the level of detail; d) the exchange of multidimensional information; e) the transition from parametric representations to objects integrated into the 3D Web.
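Paper #10 compares the quality of two ASR transcription methods. Word error rate (WER) is the standard metric for such comparisons; a self-contained sketch of it (illustrative of the kind of comparison the study performs, not its actual evaluation code):

```python
def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by reference length,
    comparing an ASR hypothesis against a trusted transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / len(ref)

# One substituted word out of five -> WER 0.2
print(word_error_rate("the story begins in texas", "the story began in texas"))
```

Running each ASR method's output through a metric like this against a manually transcribed sample gives the quantitative side of the quality comparison; the downstream question is how that error rate affects label prediction accuracy.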

3:20 – 3:40 COFFEE BREAK: virtual!

3:40 – 4:20 SESSION 6: Web and Social Media Archives

  • 3:40-4:00 #12: Modeling Updates of Scholarly Webpages Using Archived Data
    [Y. Jayawardana, A.C. Nwala, G. Jayawardena, J. Wu, S. Jayarathna, M.L. Nelson, C. Lee Giles — Old Dominion U., Indiana U., Penn State U., USA] (Web archives)
    ABSTRACT: The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the scholarly web using 19,977 seed URLs of authors’ homepages obtained from their Google Scholar profiles. We first obtain archived copies of these webpages from the Internet Archive (IA), and estimate when their actual updates occurred. Next, we apply maximum likelihood to estimate their mean update frequency (λ) values. Our evaluation shows that λ values derived from a short history of archived data provide a good estimate of the true update frequency in the short term, and that our method provides better estimates of updates at a fraction of the resources required by the baseline models. Based on this, we demonstrate the utility of archived data to optimize the crawling strategy of web crawlers, and uncover important challenges that inspire future research directions.
  • 4:00-4:20 #13: Using a Three-step Social Media Similarity (TSMS) Mapping Method to Analyze Controversial Speech Relating to COVID-19 in Twitter Collections
    [Z. Yin, L. Fan, H. Yu, A. Gilliland — UCLA, USA]
    ABSTRACT: Addressing increasing calls to surface hidden and counter-narratives from within archival collections, this paper reports on a study that provides proof-of-concept of automatic methods that could be used on archived social media collections. Using a test collection of 3,457,434 unique tweets relating to COVID-19, China, and Chinese people, it sought to identify instances of hate speech as well as hard-to-pinpoint trends in anti-Chinese racist sentiment. The study, part of a larger archival research effort investigating automatic methods for appraisal and description of very large digital archival collections, used a Three-step Social Media Similarity (TSMS) mapping method that aggregates hashtag mapping, TF-IDF Similarity Selection, and Emotion Similarity Calculation on the test collection. Compared to using a purely lexicon-based method to identify and analyze controversial speech, this method successfully expanded the amount of controversial content detected from 21,050 tweets to 212,605, and the detection rate from 0.6% to 6.1%. We argue that the TSMS method could be similarly applied by archives to automatically identify, analyze, and describe other controversial content on social media and in other rapidly evolving and complex contexts, in order to increase public awareness and facilitate public policy responses.
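Paper #12 estimates a page's mean update frequency λ by maximum likelihood from archived snapshots. Under a Poisson change model, and assuming captures are frequent enough that at most one change falls between consecutive snapshots, the estimate reduces to observed changes divided by observation time (a simplified sketch of the idea, with invented data; not the paper's estimator):

```python
def estimate_update_rate(snapshots):
    """Naive maximum-likelihood estimate of a page's mean update
    frequency (lambda) under a Poisson change model: observed changes
    divided by observation time. `snapshots` is a chronological list of
    (timestamp_days, content_hash) pairs from an archive, assumed dense
    enough that at most one change falls between consecutive captures."""
    changes = sum(1 for (t0, h0), (t1, h1) in zip(snapshots, snapshots[1:])
                  if h0 != h1)
    span = snapshots[-1][0] - snapshots[0][0]
    return changes / span  # updates per day

# Four hypothetical captures over 30 days with two observed content changes:
caps = [(0, "aaa"), (10, "aaa"), (20, "bbb"), (30, "ccc")]
print(estimate_update_rate(caps))  # 2 changes / 30 days
```

A crawler can then prioritize pages with high estimated λ, which is how archived data helps optimize a crawl frontier without repeatedly fetching every page.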

4:20 – 5:00 SESSION 7: Discussion

5:00 – 5:20 CLOSING REMARKS (Organizers)


The large-scale digitization of analogue archives, the emerging diverse forms of born-digital archive, and the new ways in which researchers across disciplines (as well as the public) wish to engage with archival material are resulting in disruptions to traditional archival theories and practices. Increasing quantities of ‘big archival data’ present challenges for the practitioners and researchers who work with archival material, but also offer enhanced possibilities for scholarship, through the application both of computational methods and tools to the archival problem space and of archival methods and tools to computational problems such as trusted computing, as well as, more fundamentally, through the integration of ‘computational thinking’ with ‘archival thinking’.

Our working definition of Computational Archival Science (CAS) is:

    • A transdisciplinary field that integrates computational and archival theories, methods, and resources, both to support the creation and preservation of reliable and authentic records/archives and to address large-scale records/archives processing, analysis, storage, and access, with the aim of improving efficiency, productivity, and precision, in support of recordkeeping, appraisal, arrangement and description, preservation, and access decisions, and of engaging in and undertaking research with archival material.


This workshop will explore the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice (including record keeping) and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. We aim to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we will address the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality–meaning, knowledge and value–from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.

This will be the 5th workshop at IEEE Big Data addressing Computational Archival Science (CAS), following on from workshops in 2016, 2017, 2018, and 2019. It also builds on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland.

All papers accepted for the workshop will be included in the Conference Proceedings published by the IEEE Computer Society Press. In addition to standard papers, the workshop (and the call for papers) will incorporate a student poster session for PhD and Master’s level students.

Topics covered by the workshop include, but are not restricted to, the following:

    • Application of analytics to archival material, including text-mining, data-mining, sentiment analysis, and network analysis
    • Analytics in support of archival processing, including e-discovery, identification of personal information, appraisal, arrangement and description
    • Scalable services for archives, including identification, preservation, metadata generation, integrity checking, normalization, reconciliation, linked data, entity extraction, anonymization and reduction
    • New forms of archives, including Web, social media, audiovisual archives, and blockchain
    • Cyber-infrastructures for archive-based research and for development and hosting of collections
    • Big data and archival theory and practice
    • Digital curation and preservation
    • Crowd-sourcing and archives
    • Big data and the construction of memory and identity
    • Specific big data technologies (e.g. NoSQL databases) and their applications
    • Corpora and reference collections of big archival data
    • Linked data and archives
    • Big data and provenance
    • Constructing big data research objects from archives
    • Legal and ethical issues in big data archives

Dr. Mark Hedges
Department of Digital Humanities (DDH)
King’s College London, UK

Prof. Victoria Lemieux
School of Information
University of British Columbia, CANADA

Prof. Richard Marciano
Advanced Information Collaboratory (AIC)
College of Information Studies
University of Maryland, USA

Dr. Bill Underwood
Advanced Information Collaboratory (AIC)
College of Information Studies
University of Maryland, USA

Dr. Tobias Blanke
Distinguished Professor in AI and Humanities
University of Amsterdam, THE NETHERLANDS

Dr. Kristen Schuster
Lecturer in Digital Curation
King’s College London, UK