6th COMPUTATIONAL ARCHIVAL SCIENCE (CAS) WORKSHOP — now taking place virtually

Friday Dec. 17, 2021 
PART OF: IEEE Big Data 2021 — http://bigdataieee.org/BigData2021/index.html

REGISTRATION: https://web.cvent.com/event/122a6d98-2b39-4b98-9536-6680548b454f/summary [see “Fees” tab & “Register Now” tab]

PRELIMINARY PROGRAM (Link: .PDF version — see pages 31,32)

9:00 – 9:15 WELCOME

  • Workshop Chairs:
    Mark Hedges (King’s College London, UK), Victoria Lemieux (U. British Columbia, CANADA), Richard Marciano (U. Maryland, USA)


9:15 – 10:00 KEYNOTE: 
The American Archival Experience: From Colonial Times to the 21st Century: Michael Kurtz

  • SLIDES — VIDEO

The ideas and themes in the keynote address are drawn, in part, from the research done by Dr. Michael J. Kurtz for his forthcoming book, “The American Archival Experience: From Colonial Times to the 21st Century.”

Dr. Kurtz served at the National Archives and Records Administration (NARA) for 37 years, during which time he held significant leadership positions, supervised hundreds of staff in multiple locations, and led national efforts in electronic records preservation and management, declassification, and transparency of government records. For the 15 years prior to his retirement from the National Archives (1996-2011), he served as Assistant Archivist for Records Services.

While at NARA, Dr. Kurtz initiated and implemented several national initiatives that have had a lasting impact on access to and preservation of government records, including creation of the National Declassification Center; implementation of the 2002 E-Government Act; declassification and release of some 8 million pages documenting U.S. government involvement with war criminals (as chair of the Interagency Working Group on Nazi and Imperial Japanese War Crimes); and creation of the International Research Web Portal relating to Nazi-era looted cultural assets.

He has distinguished himself as the author of several highly cited publications in the areas of archives management and administration, including Managing Archival and Manuscript Repositories (Society of American Archivists, Archival Fundamentals II Series, 2004), which has guided the management practices of hundreds of archivists. His more recent book, America and the Return of Nazi Contraband: The Recovery of Europe’s Cultural Treasures (Cambridge University Press, 2006) not only is a respected scholarly work on the return of cultural property following World War II, but was also a key inspiration for the 2014 film Monuments Men.


10:00 – 10:40 SESSION 1: Infrastructures and Services

  • 10:00-10:20 #1:Computational Archival Science is a Two-Way Street
    B. Ambacher, M. Conrad     [PTAB – Primary Trustworthy Digital Repository Authorisation Body Ltd USA & Advanced Information Collaboratory (AIC) USA]

    • long-term preservation, OAIS, ISO 16363, information management, trustworthy digital repository, provenance, integrity, understandability, evidence of authenticity, reproducibility, reuse
  • PAPER — SLIDES
    ABSTRACT: Since its definition in 2018, much of the literature written about CAS has been about archives adopting computational theories, methods and resources. Very little has been written about computational professionals adopting archival theories, methods, or resources. The authors believe that CAS could be substantially enriched if some archival theories, methods, and resources were adopted by computational professionals in developing the systems that create and store vast troves of data. For the purposes of this paper, we will focus on two archival resources – ISO 14721 (OAIS) and ISO 16363 (Trustworthy Digital Repositories). These two resources offer recommendations for long-term preservation of digital assets, maintaining understandability of those assets through time, and building trustworthy digital repositories to maintain the provenance and integrity of the repository’s collections in such a manner as to provide substantial evidence of the authenticity of the data that it provides to its consumers.
  • 10:20-10:40 #2:Managing Records in Enterprise Resource Planning Systems
    S. Katuu    [Department of Information Science University of South Africa SA]     

    • Archival Diplomatics, Enterprise Resource Planning System, United Nations
  • PAPER — VIDEO
    ABSTRACT: Enterprise resource planning (ERP) systems are increasingly being used for the management of business processes and to integrate tasks within institutions in real time. While managing and integrating processes, ERP systems generate and are expected to manage enormous amounts of data and information that should be managed in a trustworthy manner. This article draws from a multi-year ERP implementation project by the United Nations to highlight some recordkeeping challenges.

10:40 – 11:00 COFFEE BREAK: virtual!


11:00 – 12:40 SESSION 2: Applying Analytics to Archival Materials

  • 11:00-11:20 #3:Using Transfer Learning to contextually Optimize Optical Character Recognition (OCR) output and perform new Feature Extraction on a digitized cultural and historical dataset
    A. Inbasekaran, R. Kumar Gnanasekaran, R. Marciano  [Chalmers U. of Technology SWEDEN & School Info. Studies at U. Maryland  USA]

    • BERT, RoBERTa, transfer learning, natural language processing, spelling correction, entity recognition
  • PAPER — VIDEO
    ABSTRACT: Understanding handwritten and printed text is easy for humans, but computers do not reach the same level of accuracy. Many Optical Character Recognition (OCR) tools, such as PyTesseract and Abbyy FineReader, extract text as digital characters from images of handwritten or printed text, but none of them is free of unrecognizable characters or misspelled words. Spelling correction is a well-known task in Natural Language Processing. An individual word can be corrected with existing tools; however, correcting a word based on the context of the sentence is a challenging task that requires a human-level understanding of the language. In this paper, we introduce a novel experiment that applies Natural Language Processing, using a machine learning concept called Transfer Learning, to the text extracted by OCR tools, thereby improving the output text by reducing misspelled words. The experiment is conducted on the OCR output of a sample of newspaper images published between the late 18th century and the 19th century. These images were obtained from the Maryland State Archives digital archives project named the Legacy of Slavery. The approach uses pre-trained transformer language models such as BERT and RoBERTa as word-prediction software for spelling correction based on the context of the words in the OCR output. We compare the performance of BERT and RoBERTa on the output of two OCR tools, PyTesseract and Abbyy FineReader. A comparative evaluation shows that both models work fairly well at correcting misspelled words, considering the irregularities in the text data from the OCR output. Additionally, the Transfer Learning output text is processed with Spacy’s Entity Recognizer (ER) to create a new feature that did not exist in the original dataset, and the newly extracted values are added to the dataset as that feature. Finally, an existing feature’s values are compared with Spacy’s ER output and the original hand-transcribed data.
  • 11:20-11:40 #4:Using AI/Machine Learning to Extract Data from Japanese American Confinement records
    M. Friedman, C. Ford, M. Elings, V. Singh, T. Tan   [The Bancroft Library, U. California Berkeley USA, Doxie.AI USA]

    • machine learning, artificial intelligence, digital collections, community engagement, archival ethics, Japanese American WWII incarceration
  • PAPER — VIDEO
    ABSTRACT: As part of a Japanese American Confinement Sites (JACS) grant-funded project, supported by the National Park Service, the Bancroft Library at UC Berkeley is digitizing nearly 210,000 pages of War Relocation Authority (WRA) Form 26 individual records of Japanese Americans incarcerated during WWII. The library has partnered with Doxie.AI to develop a customized machine learning pipeline for extracting structured data from these records. A number of challenges have arisen due to variability in content, structure, and placement of text on the page, and the presence of a wide variety of characteristics in the archival records that produce visual noise. This has prompted an iterative and dynamic approach to process records by camp. By blending library staff’s content and domain knowledge with the technical expertise of Doxie.AI, the project team is meeting or exceeding baseline targets for accuracy rates across the majority of fields. Additionally, ethical issues pertaining to the digital release of this data and the digitized records in their entirety will be explored with a Community Advisory Group, as the library seeks to establish a responsible digital curation plan for these resources. In alignment with Collections as Data principles that encourage the responsible computational use of special collections, this project represents a crucial opportunity to explore new methods for enhancing access to our growing digital special collections.
  • 11:40-noon #5:A Framework for Unlocking and Linking WWII Japanese American Incarceration Biographical Data
    L. Beltran, E. Ping O’Brien, G. Jansen, R. Marciano        [W. Madison Randall Library, U. North Carolina Wilmington USA, George C Gordon Library, Worcester Polytechnic Institute USA, College of Information Studies, U. Maryland USA]     

    • Computational Archival Science (CAS), Japanese American WWII Incarceration Camps, Entity Resolution, Digital Infrastructure, Linked Archives
  • PAPER — VIDEO
    ABSTRACT: Entity Resolution (ER) is increasingly being used to identify and link names across archival collections. We describe a framework for unlocking and linking biographical data from WWII Japanese American Incarceration Camps using Entity Resolution and other computational approaches. We demonstrate the construction of social graphs that link people, places, and events and which support further scholarship and reveal hidden stories in historical events, especially given contested archival sources. Finally, we show the power of computational analysis to recreate event networks and represent movement of people using maps. This type of modeling is captured through interactive Jupyter Notebooks that integrate these various elements and document our interpretation of Japanese American experiences and events at the Tule Lake concentration camp.

  • noon-12:20 #6:Computational Curation and the Application of Large-Scale Vocabularies
    S. Grabus, J. Greenberg     [MRC, College of Computing & Informatics, Drexel U., USA]

    • controlled vocabularies, stemming, lemmatization, natural language processing (NLP), automatic curation
  • PAPER — VIDEO
    ABSTRACT: This paper presents an exploratory case study comparing stemming and lemmatization results for the automatic application of large-scale controlled vocabularies processed against archival encyclopedia entries. The results report relative recall and precision evaluations for both approaches. Research shows that while stemming has a higher relative recall, lemmatization results in a higher relevance score and eliminates the over-stemming challenges. Results provide insight into improving automatic curation workflows for archival resources.

  • 12:20-12:40 #7:Inference of Absolute Time Value from Temporal Expressions
    J. Sung, S. Mori, H. Kameko, A. Kubo, T. Sekino    [Individual, Academic Center for Computing and Media Studies, Kyoto U., Graduate School of Arts and Sciences, The Open U. of Japan, The International Research Center for Japanese Studies, National Institutes for the Humanities, Kyoto, JAPAN]     

    • temporal expression, absolute time value, named entity recognition
  • PAPER — VIDEO
    ABSTRACT: In this paper, we explore and discuss a way to extract temporal information from natural language texts. The suggested method is divided into two parts: temporal expression recognition and temporal value inference. The former employs the conventional NER approach, using a BiLSTM-CRF architecture. The latter is implemented with a rule-based algorithm, which can be further developed in later work for better coverage of various temporal expressions. In terms of the corpus, we have selected 200 articles from one of the major Japanese newspaper companies to create an annotated corpus, classifying temporal expressions into five different types. As for the performance, we have achieved 0.866 in F-measure for the recognition of temporal expressions and 0.920 in accuracy for the inference of the absolute temporal values of the expressions. Combining the two modules and running them as an end-to-end system, we have attained 0.891 of F-measure.
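
    As a rough illustration of the rule-based value inference step mentioned in abstract #7, the Python sketch below resolves a few relative expressions against a document date. The rule table, the function name, and the English expressions are all assumptions made for illustration; the paper works on Japanese newspaper text, and its actual rules are not described in this program.

```python
import re
from datetime import date, timedelta

# Hypothetical rule table: fixed day offsets for a few relative expressions.
FIXED_OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1}

# Hypothetical pattern for expressions such as "3 days ago".
DAYS_AGO = re.compile(r"^(\d+) days? ago$")


def infer_absolute_date(expression, document_date):
    """Return the absolute date for an expression, or None if no rule applies."""
    text = expression.strip().lower()
    if text in FIXED_OFFSETS:
        return document_date + timedelta(days=FIXED_OFFSETS[text])
    match = DAYS_AGO.match(text)
    if match:
        return document_date - timedelta(days=int(match.group(1)))
    # Expressions outside the rule table stay unresolved.
    return None
```

    In a pipeline of this shape, a recognizer would first mark the temporal expressions, and each one would then be passed to a function like `infer_absolute_date` together with the publication date of the article.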

12:40 – 1:40 LUNCH: Virtual!


1:40 – 2:00  Discussion


2:00 – 3:20 SESSION 3: Analytics to Support Archival Processing

  • 2:00-2:20 #8:EMCODIST: A Context-based Search Tool for Email Archives
    S. Kuppili Venkata, S. Decker, D. Kirsch, A. Nix    [Digital Archiving Department, The National Archives UK, School of Management and Department of Economic History, U. of Bristol and U. of Gothenburg UK & SWEDEN, Robert H. Smith School of Business and College of Information Studies, U. Maryland USA, Birmingham Business School, U. of Birmingham UK]

    • Contextualisation, Email archives processing, Content analysis, Natural Language Processing
  • PAPER — VIDEO
    ABSTRACT: Preservation of emails poses particular challenges to future discovery as alternative historical sources. Emails represent communications between individuals and contain a wealth of information when viewed as an organisation-wide collection. Existing search tools can extract named entities and keyword searches but are less effective when it comes to extracting patterns and contextual information across multiple custodians. To address this, we present EMCODIST, a discovery tool for searching the contextual information across emails using attention-based models of Natural Language Processing (NLP). EMCODIST aims to steer end-users to personalise their searches towards a concept. In this paper, we explain the definition of the ‘context’ for emails which is also suitable for object-oriented computational modelling. The tool is evaluated based on the relevancy of the emails extracted.
  • 2:20-2:40 #9:An AI-Assisted Framework for Rapid Conversion of Descriptive Photo Metadata into Linked Data
    J. Proctor, R. Marciano    [Advanced Information Collaboratory (AIC), U. Maryland USA]

    • Computational Archival Science (CAS), Photograph archives, Digital curation, Metadata and the Semantic Web, Artificial Intelligence (AI)
  • PAPER — VIDEO
    ABSTRACT: This paper proposes, tests, and evaluates an innovative Computational Archival Science (CAS) framework to enhance the ability to link people, places, and events depicted in historical photography collections. The protocol combines elements of computer vision with natural language processing, entity extraction, and metadata linking techniques to transform and connect existing archival metadata. Development of the framework is built upon a case study based on the Spelman College Archives Photograph Collection and provides background information, reports on the text processing, image analysis, semantic linking, and evaluation aspects associated with the design and use of the AI-supported framework.
  • 2:40-3:00 #10:Organizing a Content Profile for a Large, Heterogeneous Collection of Interactive Projects
    E. Kaltman, R. Lorelli, A. Larson, E. Wolfe     [Department of Computer Science, California State U. Channel Islands USA]

    • computational archival science, computer games, software development, content analysis, history
  • PAPER — VIDEO
    ABSTRACT: This paper details the organization of a “content profile” of a large longitudinal collection of interactive project prototypes of singular provenance. A content profile aims to analyze and summarize aggregate file metadata associated with a collection to aid in digital preservation strategies. Here, we detail the qualitative and quantitative methods used to organize a profile of a 14TB data set containing around 10.5 million files and 5,000 file extensions. The work extends the use of a content profile toward the historical characterization and interpretation of software development records. Additionally, the work prefigures further challenges associated with historical analysis of large, interdisciplinary data sets.
  • 3:00-3:20 #11:Constructing Archives of Population Dynamics and Migration Network As A Way to Access Hard-to-reach Populations: A Research on Taiwan Indigenous Peoples
    Ji-Ping Lin     [Academia Sinica TAIWAN]   

    • hard-to-reach population, migration network, population dynamics, ethnic lineage, TICD, TIPD
  • PAPER — VIDEO
    ABSTRACT: This paper highlights research on constructing big computational archives of hard-to-reach populations (HRPs), using Taiwan Indigenous Peoples (TIPs) as an example. The research uses archives of (1) anonymous individual-level migration flows computed from population dynamics data and (2) Taiwan indigenous community data (TICD) to illustrate characteristics of HRPs which were unknown before. The research suggests that computational HRP networks (e.g., migration networks) help overcome barriers to accessing HRPs and promote mutual understanding. The archives of Taiwan Indigenous Peoples Open Research Data (TIPD) are a research data source, with archives of address geocoding, population dynamics, and indigenous communities being most relevant to TIPs network systems. The migration flows are computed at the individual level and have unveiled various dimensions of HRP networks that were invisible before. The newly computed TICD archives enable us to trace migration flows of TIPs within and between indigenous communities and urban localities at the individual level in the context of ethnic lineages. The research findings suggest that strengthening intra- and inter-ethnic network connections serves as an effective measure to get deep insights into HRPs.

3:20 – 3:40 COFFEE BREAK: virtual!


3:40 – 4:20 SESSION 4: New Forms of Archives

  • 3:40-4:00 #12:NFTs: Tulip Mania or Digital Renaissance?
    D. Ross, E. Cretu, V. Lemieux     [Electrical and Computer Engineering, U. British Columbia CANADA, School of Information, U. British Columbia CANADA]     

    • nonfungible tokens, blockchain, smart contract, distributed ledger, digital art, provenance, tokenomics
  • PAPER — VIDEO
    ABSTRACT: Galleries, Libraries, Archives and Museums (GLAM) institutions have begun to sell non-fungible tokens (NFTs) of works from their collections following the $69.3 M USD record sale of Beeple’s Everydays: The First 5000 Days at Christie’s auction house on March 11, 2021. But many open questions exist about whether NFTs are beneficial or harmful for such institutions from financial, regulatory, and environmental perspectives. In this paper, we aim to unpack what NFTs are within the context of tokenomics and Ethereum standards development by providing an overview of notable NFTs and selling platforms before discussing the pros and cons of their use in GLAM institutions and exploring open research challenges through the lens of Computational Archival Science. Methodologies for the creation (minting) and purchase of NFTs are provided, emphasizing NFTs’ record keeping abilities, while also highlighting their inherent vulnerabilities, particularly with regard to the now-infamous broken link problem and its implications for provenance tracking and authenticity.
  • 4:00-4:20 #13:Making Case for Using RAFT in Healthcare Through Hyperledger Fabric
    A. Alexandridis, G. Al-Sumaidaee, R. Alkhudary, Z. Zilic     [UECE, McGill U. CANADA, LARGEPA, Universite Paris II Pantheon-Assas FRANCE]

    • blockchain, RAFT, consensus algorithm, Hyperledger Fabric, value chain, healthcare
  • PAPER — VIDEO
    ABSTRACT: Blockchain technology is enabled by consensus algorithms to manage the relationships among several economic or business operators without human intervention. With the help of consensus algorithms, distributed systems can reliably reach agreement even if part of the system is faulty. Blockchain yields many benefits, among others, traceability, transparency, and security. We consider using the RAFT consensus algorithm to achieve robust and scalable decentralized applications, with focus on healthcare. We propose a stylized healthcare network, enabled by RAFT and built upon Hyperledger Fabric to showcase the use of RAFT in healthcare blockchain. However, RAFT is by no means limited to healthcare record systems, and can be applied to any other record system and value chain. Our paper offers several insights to those working in value chains and information management-related fields. In addition, we end our study with some future research avenues that may inspire managers and scholars to build or refine new decentralized systems in healthcare and other related fields.
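
    To make the fault-tolerance claim in abstract #13 concrete, the Python sketch below works through the standard majority-quorum arithmetic behind Raft: a cluster stays available as long as a majority of its nodes is reachable, and the faults tolerated are crash faults, not Byzantine ones. The function names are illustrative assumptions and are not taken from the paper or from Hyperledger Fabric.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a majority quorum."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """Largest number of crashed nodes that still leaves a majority running."""
    return (cluster_size - 1) // 2


# Show why odd cluster sizes are the usual choice: going from 3 to 4 nodes
# raises the quorum size without tolerating any additional failure.
for size in (3, 4, 5):
    print(size, majority(size), tolerated_failures(size))
```

    Under these assumptions, a five-node ordering cluster needs three nodes to agree and can lose two, while a four-node cluster also needs three but can lose only one.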

4:20 – 5:00 SESSION 5: Discussion


5:00 – 5:20 CLOSING REMARKS (Organizers)

IMPORTANT DEADLINES: final dates updated on Mon. Nov. 8

      • Electronic submission of workshop papers: Oct 31, 2021 (originally Oct 22, 2021)
      • Notification of paper acceptance: Fri. Nov 12, 2021 (originally Nov 8, 2021)
      • Registration for Workshop: Mon. Nov 15, 2021
      • Camera-ready of accepted papers: Mon. Nov 22, 2021 (originally Nov 15, 2021)

PAPER SUBMISSION:


    • COMPUTATIONAL ARCHIVAL SCIENCE: digital records in the age of big data

      INTRODUCTION TO WORKSHOP [also see our CAS Portal]:

      The large-scale digitization of analogue archives, the emerging diverse forms of born-digital archive, and the new ways in which researchers across disciplines (as well as the public) wish to engage with archival material, are resulting in disruptions to traditional archival theories and practices. Increasing quantities of ‘big archival data’ present challenges for the practitioners and researchers who work with archival material, but also offer enhanced possibilities for scholarship, through the application both of computational methods and tools to the archival problem space and of archival methods and tools to computational problems such as trusted computing, as well as, more fundamentally, through the integration of computational thinking with archival thinking.


      Our working definition of Computational Archival Science (CAS) is:

        • A transdisciplinary field that integrates computational and archival theories, methods and resources, both to support the creation and preservation of reliable and authentic records/archives and to address large-scale records/archives processing, analysis, storage, and access, with the aim of improving efficiency, productivity and precision, in support of recordkeeping, appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.

      OBJECTIVES

      This workshop will explore the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice (including record keeping) and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. We aim to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we will address the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality–meaning, knowledge and value–from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.

      This will be the 6th workshop at IEEE Big Data addressing Computational Archival Science (CAS), following on from workshops in 2016, 2017, 2018, 2019, and 2020. It also builds on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland.

      All papers accepted for the workshop will be included in the Conference Proceedings published by the IEEE Computer Society Press. In addition to standard papers, the workshop (and the call for papers) will incorporate a student poster session for PhD and Master’s level students.


      RESEARCH TOPICS COVERED:
      Topics covered by the workshop include, but are not restricted to, the following:

        • Application of analytics to archival material, including AI, ML, text-mining, data-mining, sentiment analysis, network analysis.
        • Analytics in support of archival processing, including e-discovery, identification of personal information, appraisal, arrangement and description.
        • Scalable services for archives, including identification, preservation, metadata generation, integrity checking, normalization, reconciliation, linked data, entity extraction, anonymization and reduction.
        • New forms of archives, including Web, social media, audiovisual archives, and blockchain.
        • Cyber-infrastructures for archive-based research and for development and hosting of collections
        • Big data and archival theory and practice
        • Digital curation and preservation
        • Crowd-sourcing and archives
        • Big data and the construction of memory and identity
        • Specific big data technologies (e.g. NoSQL databases) and their applications
        • Corpora and reference collections of big archival data
        • Linked data and archives
        • Big data and provenance
        • Constructing big data research objects from archives
        • Legal and ethical issues in big data archives

      PROGRAM CHAIRS:
      Dr. Mark Hedges
      Department of Digital Humanities (DDH)
      King’s College London, UK

      Prof. Victoria Lemieux
      School of Information
      University of British Columbia, CANADA

      Prof. Richard Marciano
      Advanced Information Collaboratory (AIC)
      College of Information Studies
      University of Maryland, USA


      PROGRAM COMMITTEE MEMBERS:
      Dr. Bill Underwood
      Advanced Information Collaboratory (AIC)
      College of Information Studies
      University of Maryland, USA

      Dr. Tobias Blanke
      Distinguished Professor in AI and Humanities
      University of Amsterdam, THE NETHERLANDS