4th COMPUTATIONAL ARCHIVAL SCIENCE (CAS) WORKSHOP

Wednesday, Dec. 11, 2019, Los Angeles, CA
LOCATION: San Fernando room @ the Westin Bonaventure Hotel & Suites, 404 South Figueroa Street, Los Angeles, CA
PART OF: IEEE Big Data 2019 — http://cci.drexel.edu/bigdata/bigdata2019/

8:45 – 9:00 WELCOME: in the San Fernando room

Workshop Chairs:
Mark Hedges ¹, Victoria Lemieux ², Richard Marciano ³
¹ KCL, ² UBC, ³ U. Maryland

9:00 – 9:40 SESSION 1: Computational Thinking in Archival Science

9:00-9:20 #1:Computational Thinking in Archival Science Research and Education
[William Underwood and Richard Marciano — University of Maryland, USA]

SLIDES — PAPER
ABSTRACT: This paper explores whether the computational thinking practices of mathematicians and scientists in the physical and biological sciences are also the practices of archival scientists. It is argued that these practices are essential elements of an archival science education in preparing students for a professional archival career.

9:20-9:40 #2:Reframing Digital Curation Practices through a Computational Thinking Framework
[Richard Marciano, Sarah Agarrat, Hannah Frisch, Margaret Rose Hunt, Kanishka Jain, Genevieve Kocienda, Hannah Krauss, Chenxi Liu, Mary McKinley, Danish Mir, Connor Mullane, Emery Patterson, Debashish Pradhan, James Santos, Britton Schams, Hilary Szu Yin Shiue, Andy Jose Silva, Mayhah Suri, Tahura Turabi, Mirielle Vasselli, and Jiale Xu — University of Maryland, USA]

SLIDES — PAPER
ABSTRACT: We describe the value of reframing digital curation practices through a computational thinking (CT) framework. Using a case study that demonstrates computational treatments of World War II Japanese-American Incarceration Camp Records, we demonstrate the applicability of CT with respect to: (1) Detecting personally identifiable information, (2) Developing name registries, (3) Integrating vital records, (4) Designing controlled vocabularies, (5) Mapping events and people, and (6) Connecting events and people through networks. The work was carried out by 5 teams of students in an 8-week digital curation exploration and development sprint.

9:45 – 10:05 COFFEE BREAK: California Foyer

10:05 – 11:05 SESSION 2: Archival Thinking in Computational Science

10:05-10:25 #3:An Intelligent Class: The Development of a Novel Context Capturing Framework For The Functional Classification Of Records
[Nathaniel Payne — University of British Columbia, Canada]

SLIDES — PAPER
ABSTRACT: The need to accurately classify records is a core problem in many domains. Current methods for auto-classification focus on a record’s content and not its context. As a result, current auto-classification methods are unable to achieve the levels of precision, accuracy, and recall that match or exceed the levels generated by human classifiers. In order to address this challenge, a new methodology is needed that specifies how to extract contextual features from a record in order to improve the auto-classification accuracy, precision, and recall of records at scale. This paper closes this gap, using the diplomatic definition of context to specify a mapping that will operationalize the capturing of context from a record. This mapping makes it possible to continue developing a formal method for functional auto-classification and contextual feature extraction that will utilize a record’s context to improve functional auto-classification accuracy, precision, and recall.

10:25-10:45 #4:Extending the Scope of Computational Archival Science: A Case Study on Leveraging Archival and Engineering Approaches to Develop a Framework to Detect and Prevent “Fake Video”
[Hoda Hamouda, Victoria Lemieux, Corinne Rogers, Ken Thibodeau, Jessica Bushey, James Stewart, James Cameron, and Chen Feng — University of British Columbia (UBC), UBC, Artefactual Systems, Fordham University, North Vancouver Museum and Archives, Patriot One Technologies, Patriot One Technologies, UBC, Canada & USA]

SLIDES — PAPER
ABSTRACT: Thousands of videos are posted online every day. The affordability of video editing tools and social networks has facilitated the creation and spread of videos carrying disinformation, i.e. fake videos. Previous attempts to categorize disinformation have focused on content analysis and ascertaining the intention of creators. To extend these approaches, it is beneficial to incorporate the perspective of other fields that study the trustworthiness of records, such as archival science, to help detect and categorize fake videos. This paper proposes to leverage archival science in combination with computer engineering to devise a new framework for detecting and categorizing fake videos. In doing so, the paper offers a case study of the way in which Computational Archival Science, which blends archival and computational thinking, can be used to contribute to a novel approach towards solving the problem of fake videos.

10:45-11:05 #5:ArchContract: using smart contracts for disposition
[Danielle Batista and Tim Weingärtner — University of British Columbia (UBC), Canada & Lucerne University of Applied Sciences and Arts, Switzerland]

SLIDES — PAPER
ABSTRACT: Disposition is one of the consequences of the appraisal archival function. It is true that appraisal is a function that no technology could execute, but disposition has already been supported by different tools. In this paper we propose a blockchain-based application for disposition, a smart contract called ArchContract, using two different repositories. We discuss appraisal and disposition on blockchain systems, the use of smart contracts as a disposition tool and present the model of ArchContract. We conclude that blockchain and smart contracts have the potential to support some of the records management functions such as disposition.

11:05 – 11:45 SESSION 3: Knowledge Organization

11:05-11:25 #6:Automated interpretability of linked data ontologies: an evaluation within the cultural heritage domain
[Nuno Freire and Sjors de Valk — INESC-ID, Portugal, and Dutch Digital Heritage Network, Netherlands]

SLIDES — PAPER
ABSTRACT: Publication and usage of linked data has been highly pursued by cultural heritage institutions and service providers in this domain. Much research and cooperation are taking place in adapting and improving cultural heritage data models for linked data and in defining ontologies and vocabularies, as well as the setting up of services based on linked data. This article presents an evaluation of ontologies and vocabularies published as linked data, which originate from the cultural heritage domain, or are frequently used and linked to in this domain. Our study aims to evaluate their usability by crawlers operating on the web of data, according to specifications and practices of linked data, the Semantic Web and ontology reasoning. We evaluate them with a view to the use case of general data consumption applications based on RDF, RDF Schema, OWL, SKOS and linked data’s guidelines. We have evaluated twelve ontologies and vocabularies and identified that four were not fully compliant, and that alignments between ontologies are not included in the definitions of the ontologies. This study contributes to the research of novel services consuming linked data. It also allows a better assessment of the automation that can be achieved to handle the variety and large volume of linked data, when assessing the viability of new services based on linked data in cultural heritage.

11:25-11:45 #7:Towards a Flexible System Architecture for Automated Knowledge Base Construction Frameworks
[Osman Din — MIT, USA]

SLIDES — PAPER
ABSTRACT: Although knowledge bases play an important role in many domains (including in archives, where they are sometimes used for entity extraction and semantic annotation tasks), it is challenging to build knowledge bases by hand. Recent advances in the field of automated knowledge base construction (AKBC) offer a promising alternative. A knowledge base construction framework takes as input source documents (such as journal articles containing text, figures, and tables) and produces as output a database of the extracted information. An important motivation behind these frameworks is relieving domain experts from worrying about the complexity of building knowledge bases. Unfortunately, such frameworks fall short when it comes to scalability (ingesting and extracting information at scale), extensibility (ability to add or modify functionality), and usability (ability to easily specify information extraction rules).
The contributions presented in this short paper can shed light on the suitability of using AKBC frameworks for computational use cases in our domain and provide future directions for building improved AKBC frameworks.

11:45 – 12:05 SESSION 4: CAS and the Representation of Objects (Part 1)

11:45-12:05 #8:What Computational Archival Science Can Learn from Art History and Material Culture Studies
[Lyneise Williams — University of North Carolina at Chapel Hill, USA]

SLIDES — PAPER
ABSTRACT: I discuss the significance of considering aesthetic aspects, as practiced in Art History, regarding representations in reproductive technology used by archives and libraries. Reproductive technologies like microfilming and digitizing shape how we view and remember history. Exploring a case study of newspaper representations of Panamanian Welterweight World Champion Boxer (1929-1936; 1938-1941), Alfonso Brown, I demonstrate how the absence of attention to aesthetic aspects has led to erasure and distortion of already marginalized communities of color and other underrepresented populations in the historical record. Material Culture Studies’ conceptualization of reproductive technology as a medium of representation, and as such, a component of representations warranting deep and rigorous consideration, is useful for computational archival science (CAS) as we move towards completely digital-based archives. Aesthetic components of representation in archival material are critical to representations of historical and current marginalized fully accessing data of all kinds.

12:10 – 1:30 LUNCH: San Francisco/San Jose, Sacramento

1:30 – 1:50 SESSION 4: CAS and the Representation of Objects (Part 2)

1:30-1:50 #9:Digital Legacies on Paper: Reading Punchcards with Computer Vision
[Greg Jansen — University of Maryland, USA]

SLIDES — PAPER
ABSTRACT: We describe the development of a computer vision-based workflow for normalizing images of the legacy punchcard data format (IBM 029 – 80 column punchcard standard) and then reading the encoded data. We show the role of a newly developed Punchcard Extractor Tool within the Brown Dog service API. We also point to our showcase of these same computer vision techniques in a Jupyter notebook system.

1:50 – 2:30 SESSION 5: CAS Architecture

1:50-2:10 #10:Enterprise Architecture – A Value Proposition for Records Professionals
[Shadrack Katuu — University of South Africa, South Africa]

SLIDES — PAPER
ABSTRACT: Modern institutions operate hundreds of business systems or applications to support their institutional activities. Among the key players or actors within any institution are records professionals, whose mandate is the management of records/archives or potential records/archives generated by these hundreds of business systems or applications. Records professionals need to make sense of the vast array of software applications and technological infrastructure, as well as how they relate to one another in supporting the institution’s functions and activities. Unfortunately, there is often misalignment between various institutional actors, including business actors, information technology (IT) actors, and records professionals. Enterprise architecture (EA) proponents see it as a promising concept to address this fundamental issue. This article draws from a research study exploring the utility of EA frameworks for records professionals.

2:10-2:30 #11:Using Data Partitions and Stateless Servers to Scale Up Fedora Repositories
[Greg Jansen and Richard Marciano — University of Maryland, USA]

SLIDES — PAPER
ABSTRACT: We describe the development and testing of the next-generation Trellis Linked Data Platform with Memento versioning support. In addition to highlighting several features that set this system apart from others, we elaborate on the extensive testing and compatibility work that was done in order to align this system with the Fedora 5.0 specification. We draw attention to the performance and scaling features provided by the Trellis Linked Data Platform in general and by the Cassandra database back end. We review the profound impact that such a system can have on demanding, next generation use cases, such as crowdsourcing, machine learning, and direct file access by desktop applications.

2:30 – 3:10 SESSION 6: Media Archives

2:30-2:50 #12:Preliminary Analysis of a Large-Scale Digital Entertainment Development Archive: A Case Study of the Entertainment Technology Center’s Projects
[Eric Kaltman — California State University Channel Islands, USA]

SLIDES — PAPER
ABSTRACT: This paper describes a research plan for the investigation of the project archive from Carnegie Mellon University’s Entertainment Technology Center, an interdisciplinary professional Master’s program in interactive entertainment and game design. Representing nearly 20 years of project design in the interactive arts, the ETC’s ad hoc archive provides a prospective template collection for the analysis of entertainment software projects through historical, archival, and computational methods. The work-in-progress described here is based on a preliminary analysis of four early projects from the archive, and provides a guide to potential synchronic and diachronic investigations of software development methodology in a chronological collection of unified provenance. Access to software process documentation is difficult to come by and this “working” collection provides a significant resource for approximating the organizational state of future software development collections to be ingested in future archives.

2:50-3:10 #13:Building the National Radio Recordings Database: A Big Data Approach to Documenting Audio Heritage
[Emily Goodmann, Mark A. Matienzo, Shawn VanCour, and William Vanden Dries — Clarke University, Stanford University, UCLA, and Indiana University, USA]

SLIDES — PAPER
ABSTRACT: This paper traces strategies used by the Radio Preservation Task Force of the Library of Congress’s National Recording Preservation Board to develop a publicly searchable database documenting extant radio materials held by collecting institutions throughout the country. Having aggregated metadata on 2,500 unique collections to date, the project has encountered a series of logistical challenges that are not only technical in nature but also institutional and social, raising critical issues involving organizational structure, political representation, and the ethics of data access. As the project continues to expand and evolve, lessons from its early development offer valuable reminders of the human judgment, hidden labor, and interpersonal relations required for successful big data work.

3:10 – 4:10 SESSION 7: Open Mic Updates

3:10-3:30 International CAS Network
[Mark Hedges — King’s College London, UK]

SLIDES

3:30-3:50 Computational Thinking Practices in the Workshop Papers
[Bill Underwood — University of Maryland, USA]

SLIDES

3:50-4:10 A Brief Word About the NARA 2022 Deadline and The Importance of CAS
[Jason Baron — Drinker Biddle, USA]

4:10 – 4:30 COFFEE BREAK: California Foyer

4:30-5:00 CLOSING REMARKS

7:00-9:00 BANQUET
San Francisco/San Jose, Sacramento

Best Paper/Best Application Paper/Best Student Papers Awards (PC Chairs, I&G Program Chair)
IEEE Brain Data Bank Challenges and Competitions (Chair: N. Nan Chu)

Computational Archival Science Workshop at IEEE Big Data 2019 – call for papers

This is the 4th workshop at IEEE Big Data addressing Computational Archival Science (CAS), following on from workshops in 2016, 2017, and 2018. All papers accepted for the workshop will be included in the Conference Proceedings published by the IEEE Computer Society Press, made available at the conference, which takes place Dec. 9 – 12, 2019 in Los Angeles, CA, USA.

It also builds on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland.

The workshop will explore the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. We aim to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we will address the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality – meaning, knowledge and value – from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.
IMPORTANT DATES:

Oct 18, 2019 [updated from Oct. 11]: Due date for full workshop papers submission
Oct 28, 2019 [updated]: Notification of paper acceptance to authors
Nov 10, 2019: Camera-ready version of accepted papers [hard conference deadline]
Wednesday, Dec. 11, 2019: Workshop

SUBMISSION DETAILS:
Go to: https://wi-lab.com/cyberchair/2019/bigdata19/scripts/submit.php?subarea=S01&undisplay_detail=1&wh=/cyberchair/2019/bigdata19/scripts/ws_submit.php
The formatting instructions are at: https://wi-lab.com/cyberchair/2019/bigdata19/scripts/submit.php?subarea=BigD:

Papers should be formatted to 10 pages following the IEEE Computer Society Proceedings Manuscript Formatting Guidelines (https://www.ieee.org/conferences/publishing/templates.html). We also accept short papers of up to 6 pages; these are particularly useful for work in progress.
Although we accept submissions in the form of PDF, PS, and DOC/RTF files, you are strongly encouraged to generate a PDF version for your paper submission if your paper was prepared in Word.

RESOURCES and EXAMPLES of CAS can be found at the “COMPUTATIONAL ARCHIVAL SCIENCE” CAS Portal. Also:

Join our Google Group at: computational-archival-science@googlegroups.com
Foundational Paper: “Archival records and training in the Age of Big Data”, Marciano, R., Lemieux, V., Hedges, M., Esteva, M., Underwood, W., Kurtz, M. & Conrad, M. In J. Percell, L. C. Sarin, P. T. Jaeger, J. C. Bertot (Eds.), Re-Envisioning the MLS: Perspectives on the Future of Library and Information Science Education (Advances in Librarianship, Volume 44B, pp. 179-199). Emerald Publishing Limited. May 17, 2018.
- 8 topics: (1) Evolutionary prototyping and computational linguistics, (2) Graph analytics, digital humanities and archival representation, (3) Computational finding aids, (4) Digital curation, (5) Public engagement with (archival) content, (6) Authenticity, (7) Confluences between archival theory and computational methods: cyberinfrastructure and the Records Continuum, and (8) Spatial and temporal analytics.

Recommended Research topics for the CAS#4 Workshop:
Topics covered by the workshop include, but are not restricted to, the following:

Application of analytics to archival material, including text-mining, data-mining, sentiment analysis, network analysis.
Analytics in support of archival processing, including e-discovery, identification of personal information, appraisal, arrangement and description.
Scalable services for archives, including identification, preservation, metadata generation, integrity checking, normalization, reconciliation, linked data, entity extraction, anonymization and reduction.
New forms of archives, including Web, social media, audiovisual archives, and blockchain.
Cyber-infrastructures for archive-based research and for development and hosting of collections
Big data and archival theory and practice
Digital curation and preservation
Crowd-sourcing and archives
Big data and the construction of memory and identity
Specific big data technologies (e.g. NoSQL databases) and their applications
Corpora and reference collections of big archival data
Linked data and archives
Big data and provenance
Constructing big data research objects from archives
Legal and ethical issues in big data archives

Program Chairs:
Dr. Mark Hedges
Department of Digital Humanities (DDH)
King’s College London, UK

Prof. Victoria Lemieux
School of Library, Archival and Information Studies
University of British Columbia, Canada

Prof. Richard Marciano
Digital Curation Innovation Center (DCIC)
College of Information Studies
University of Maryland, USA

Program Committee Members:
The program chairs will serve on the Program Committee, as will the following:

Dr. Maria Esteva
Data Intensive Computing
Texas Advanced Computing Center (TACC), USA

Dr. Bill Underwood
Digital Curation Innovation Center (DCIC)
College of Information Studies
University of Maryland, USA

Prof. Michael Kurtz
Emeritus Associate Director of the Digital Curation Innovation Center (DCIC)
College of Information Studies
University of Maryland, USA

Mark Conrad
National Archives and Records Administration (NARA)

Dr. Tobias Blanke
Distinguished Professor in AI and Humanities
University of Amsterdam