Workshop Title: 2nd Computational Archival Science (CAS) workshop
Wednesday, December 13, 2017
Westin Copley Plaza
10 Huntington Avenue, Boston, MA 02116, USA

PART OF: IEEE Big Data 2017
*** There is a 1-day registration option ***

14 presentations from France, Netherlands, UK, Canada, US, Taiwan; 2 demos from GE, US; Student panel on new curricula.


9:00 – 9:15 Welcome

    • Workshop Chairs:
      Mark Hedges (KCL), Victoria Lemieux (UBC), Richard Marciano (U. Maryland)


      • Foundational Paper: Dec. 2017, “Archival records and training in the Age of Big Data”, Marciano, Lemieux, Hedges, Esteva, Underwood, Kurtz, Conrad, accepted for publication. See: LINK.
        • 8 topics: (1) Evolutionary prototyping and computational linguistics, (2) Graph analytics, digital humanities and archival representation, (3) Computational finding aids, (4) Digital curation, (5) Public engagement with (archival) content, (6) Authenticity, (7) Confluences between archival theory and computational methods: cyberinfrastructure and the Records Continuum, and (8) Spatial and temporal analytics.
        • [In: “Advances in Librarianship – Re-Envisioning the MLIS: Perspectives on the Future of Library and Information Science Education”, Editors: Lindsay C. Sarin, Johnna Percell, Paul T. Jaeger, & John Carlo Bertot.]

9:15 – 10:35 Session 1: Exploring Archival Data (talks: 20 mins each)

    • #1: Building new knowledge from distributed scientific corpus; HERBADROP & EUROPEANA: two concrete case studies for exploring big archival data
      [Pascal Dugenie, Nuno Freire, Daan Broeder — CINES, FR & MEERTENS Institut, NL & INESC-ID/Europeana DSI, NL]

      • Computational Methods: EUDAT automated scalable e-infrastructure, integrated computation services
      • Archival Concepts: Trusted digital repositories (TDR), OCR, cultural heritage platforms
    • #2: An Infrastructure and Application of Computational Archival Science to Enrich and Integrate Big Digital Archival Data: Using Taiwan Indigenous Peoples Open Research Data (TIPD) as Example
      [Ji-Ping Lin, Academia Sinica, TW]

      • Computational Methods: Topic Modelling for concept extraction from large EC archival holdings
      • Archival Concepts: Support accessibility to large historical European Commission archival holdings
    • #3: Computational Curation of a Digitized Record Series of WWII Japanese-American Internment
      [William Underwood, Richard Marciano, Sandra Laib, Carl Apgar, Luis Beteta, Waleed Falak, Marisa Gilman, Riss Hardcastle, Keona Holden, Yun Huang, David Baasch, Brittni Ballard, Tricia Glaser, Adam Gray, Leigh Plummer, Zeynep Diker, Mayanka Jha, Aakanksha Singh, and Namrata Walanj — University of Maryland, USA]

      • Computational Methods: Topic Modelling for concept extraction from large EC archival holdings
      • Archival Concepts: Support accessibility to large historical European Commission archival holdings
    • #4: The Cybernetics Thought Collective Project: Using Computational Methods to Reveal Intellectual Context in Archival Material
      [Bethany Anderson, Christopher Prom, Kevin Hamilton, James Hutchinson, Mark Sammons, and Alex Dolski — University of Illinois at Urbana-Champaign, USA]

      • Computational Methods: Annotation, entity extraction, NLP, machine learning
      • Archival Concepts: Contextual discovery of archival materials

10:35 – 10:45 Questions and Discussion

10:45 – 11:05 Coffee break

11:05 – 12:25 Session 2: Curation and Appraisal (talks: 20 mins each)

    • #5: Towards Automated Quality Curation of Video Collections from a Realistic Perspective
      [Todd Goodall, Maria Esteva, Sandra Sweat, and Alan Bovik — University of Texas, USA]

      • Computational Methods: Feature computing from video records, automated quality prediction, scalable HPC
      • Archival Concepts: Collection assessment, quality-aware metadata for video collections to inform appraisal, preservation, and access decisions, quality detection in videos
    • #6: Line Detection in Binary Document Scans: A Case Study with the International Tracing Service Archives
      [Benjamin Lee, United States Holocaust Memorial Museum, USA]

      • Computational Methods: Line detection, image segmentation
      • Archival Concepts: Classification of archival images
    • #7: Auto-Categorization & Future Access to Digital Archives
      [Nathaniel Payne and Jason Baron, University of British Columbia, CAN & Of Counsel, Drinker Biddle & Reath LLP, USA]

      • Computational Methods: Auto-categorization, auto-classification, e-discovery, machine learning
      • Archival Concepts: Recordkeeping
    • #8: Heuristics for Assessing Computational Archival Science (CAS) Research: The Case of the Human Face of Big Data Project
      [Myeong Lee, Yuheng Zhang, Shiyun Chen, Edel Spencer, Jhon Dela Cruz, Hyeonggi Hong, and Richard Marciano — University of Maryland, USA]

      • Computational Methods: Heuristics for CAS research
      • Archival Concepts: Iterative design, value-sensitive design

12:25 – 12:45 Session 3: CAS Methods (talk: 20 min)

    • #9: What Can a Knowledge Complexity Approach Reveal About Big Data and Archival Practice?
      [Nicola Horsley, The Netherlands Institute for Permanent Access to Digital Research Resources, NL]

      • Computational Methods: Digital narrative with big data
      • Archival Concepts: Knowledge complexity in archives

12:45 – 2:00 Lunch

2:00 – 3:00 Session 3: CAS Methods, cont. (talks: 20 mins each)

    • #10: Protecting Privacy in the Archives: Preliminary Explorations of Topic Modeling for Born-Digital Collections
      [Tim Hutchinson, University of Saskatchewan Library, CAN]

      • Computational Methods: NLP, NER, sentiment analysis
      • Archival Concepts: PII
    • #11: Identifying Epochs in Text Archives
      [Tobias Blanke and Jon Wilson — King’s College London, UK]

      • Computational Methods: Cultural analytics, topic modeling
      • Archival Concepts: Classification of time-coded textual collections into epochs and periods
    • #12: GraphQL for Archival Metadata: An Overview of the EHRI GraphQL API
      [Mike Bryant, King’s College London, UK]

      • Computational Methods: APIs for cultural heritage materials, graph databases
      • Archival Concepts: Structured data interfaces to archival materials

3:00 – 3:40 Session 4: Creation and Management of Current Records (talks: 20 mins each)

    • #13: The Blockchain Litmus Test
      [Tyler Smith, Adventium Labs, USA]

      • Computational Methods: Blockchain, secure computing
      • Archival Concepts: Decentralized recordkeeping
    • #14: A Typology of Blockchain Recordkeeping Solutions and Some Reflections on their Implications for the Future of Archival Preservation
      [Victoria Lemieux, University of British Columbia, CAN]

      • Computational Methods: Blockchain, computational validation, distributed ledger, computational trust
      • Archival Concepts: Recordkeeping, digital preservation, archival trust

3:40 – 4:05 Questions and Discussion

4:05 – 4:25 Coffee break

4:25 – 4:55 Demos

    • Helge Holzmann, L3S Research Center, Hannover, GE
      ArchiveSpark: Efficient Web Archive Access, Extraction, and Derivation of smaller datasets.
      See: LINK. The original ArchiveSpark paper (with a focus on Web archives only) is available here: LINK
    • Greg Jansen, University of Maryland, USA
      DRAS-TIC for Linked Data and Memento
      We will showcase the next phase of DRAS-TIC software development and scalability testing. Digital Repository At Scale That Invites Computation (DRAS-TIC) was funded through the NSF Brown Dog project (see: LINK). The next phase of DRAS-TIC development is funded by a two-year grant from the IMLS as the “DRAS-TIC Fedora” project. This will see our horizontally scaling NoSQL digital repository grow to support the Linked Data Platform (LDP) and Memento APIs for versioned linked data. We aim to meet the stringent LDP requirements and continue to support distributed compute on the Cassandra back-end.

4:55 – 5:15 Student Session:

    • Moderator: Michael Kurtz
      Students: Jennifer Proctor, Claire McDonald, Will Thomas
      Seven graduate students at U. Maryland participated in a fall 2017 seminar exploring the eight case studies proposed in the 2017 Foundational Paper: “Archival records and training in the Age of Big Data”, Marciano, Lemieux, Hedges, Esteva, Underwood, Kurtz, Conrad, LINK, to be published in “Advances in Librarianship – Re-Envisioning the MLIS: Perspectives on the Future of Library and Information Science Education”, Editors: Lindsay C. Sarin, Johnna Percell, Paul T. Jaeger, & John Carlo Bertot. Students offered to discuss educational takeaways and methods of incorporating CAS into Master of Library and Information Science (MLIS) education, in order to better address the needs of today’s MLIS graduates looking to employ ‘traditional’ archival principles in conjunction with computational methods.

5:15 Closing Remarks


Introduction to workshop:
The large-scale digitization of analog archives, the emerging diverse forms of born-digital archives, and the new ways in which researchers across disciplines (as well as the public) wish to engage with archival material are resulting in disruptions to traditional archival theories and practices. Increasing quantities of ‘big archival data’ present challenges for the practitioners and researchers who work with archival material, but also offer enhanced possibilities for scholarship through the application of computational methods and tools to the archival problem space and, more fundamentally, through the integration of ‘computational thinking’ with ‘archival thinking’.

Our working definition of Computational Archival Science (CAS) is:
Contributing to the development of the theoretical foundations of a new trans-discipline of computer and archival science

This workshop will explore the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. We aim to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we will address the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality – meaning, knowledge and value – from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.

This is the 2nd workshop at IEEE Big Data addressing Computational Archival Science (see: 1st CAS workshop). It builds on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland.

Research topics covered:
Topics covered by the workshop include, but are not restricted to, the following:

    • Application of analytics to archival material, including text-mining, data-mining, sentiment analysis, network analysis
    • Analytics in support of archival processing, including e-discovery, identification of personal information, appraisal, arrangement and description
    • Scalable services for archives, including identification, preservation, metadata generation, integrity checking, normalization, reconciliation, linked data, entity extraction, anonymization and reduction
    • New forms of archives, including Web, social media, audiovisual archives, and blockchain
    • Cyber-infrastructures for archive-based research and for development and hosting of collections
    • Big data and archival theory and practice
    • Digital curation and preservation
    • Crowd-sourcing and archives
    • Big data and the construction of memory and identity
    • Specific big data technologies (e.g. NoSQL databases) and their applications
    • Corpora and reference collections of big archival data
    • Linked data and archives
    • Big data and provenance
    • Constructing big data research objects from archives
    • Legal and ethical issues in big data archives

Program Chairs:

    • Dr. Mark Hedges, Department of Digital Humanities (DDH), King’s College London, UK
    • Prof. Victoria Lemieux, School of Library, Archival and Information Studies, University of British Columbia, Canada
    • Prof. Richard Marciano, College of Information Studies, University of Maryland, USA

Program Committee Members:
The program chairs will serve on the Program Committee, as will the following:

    • Dr. Maria Esteva, Data Intensive Computing, Texas Advanced Computing Center (TACC), USA
    • Dr. Bill Underwood, College of Information Studies, University of Maryland, USA
    • Prof. Michael Kurtz, College of Information Studies, University of Maryland, USA
    • Mark Conrad, National Archives and Records Administration (NARA)
    • Dr. Tobias Blanke, Department of Digital Humanities, King’s College London, UK