Workshop Title: Computational Archival Science: digital records in the age of big data
Thursday, December 8, 2016
Hyatt Regency Washington on Capitol Hill
400 New Jersey Avenue, NW
Washington, D.C., USA, 20001
PART OF: IEEE Big Data 2016
http://cci.drexel.edu/bigdata/bigdata2016/
*** There is a 1-day registration option ***
FINAL PROGRAM:
Keynote, 10 presentations from Belgium, Germany, UK, Canada, USA (universities, government agencies, companies), Panel, Breakout sessions.
8:45 – 9:00 Welcome
- Workshop Organizers:
Mark Hedges (1), Richard Marciano (2), Victoria Lemieux (3), Maria Esteva (4), Bill Underwood (2), Michael Kurtz (2), Myeong Lee (2), and Mary Kendig (2)
(1) KCL, (2) U. Maryland, (3) UBC, (4) TACC
9:00 – 9:45 Keynote (30 min + 15 min discussion)
- “Collaboration is the Thing”, Mark Conrad [Archives Specialist, National Archives and Records Administration (U.S.A.)]
Slides
9:45 – 10:45 Session 1 (3 talks: 20 mins each)
- #1: Exploring Archives with Probabilistic Models: Topic Modelling for the Valorisation of Digitised Archives of the European Commission
[Simon Hengchen, Mathias Coeckelbergs, Seth van Hooland, Ruben Verborgh, Thomas Steiner; U. Libre de Bruxelles, Ghent U. (Belgium), Google Germany]
Slides | Paper
- Computational Method: Topic Modelling for concept extraction from large EC archival holdings
- Archival concept: Support accessibility to large historical European Commission archival holdings
- #2: Traces Through Time: A Probabilistic Approach to Connected Archival Data
[Sonia Ranade; The UK National Archives]
Slides | Paper
- #3: Opening Up Dark Digital Archives Through The Use of Analytics to Identify Sensitive Content
[Jason Baron, Bennett Borden; Drinker Biddle & Reath LLP (Washington D.C.)]
Slides | Paper
10:45 – 11:05 Coffee break
11:05 – 12:45 Session 2 (5 talks: 20 mins each)
- #4: Computational Provenance in DataONE: Implications for Cultural Heritage Institutions
[Robert Sandusky; U. of Illinois at Chicago Library]
Slides | Paper
- #5: Content-based Comparison for Collections Identification
[Weijia Xu, Ruizhu Huang, Maria Esteva, Jawon Song, Ramona Walls; U. Texas at Austin, TACC]
Slides | Paper
- #6: Breaking Down the Invisible Wall to Enrich Archival Science and Practice
[Kenneth Thibodeau; US National Archives (retired)]
Slides | Paper
- #7: Mind the explanatory gap: Quality from Quantity
[Jenny Bunn; UCL (UK)]
Slides | Paper
- #8: Understanding Computational Web Archives Research Methods Using Research Objects
[Emily Maemura, Christoph Becker, Ian Milligan; U. of Toronto, U. of Waterloo (Canada)]
Slides | Paper
12:45 – 2:00 Lunch
2:00 – 2:40 Session 3 (2 talks: 20 mins each)
- #9: Appraising Digital Archives with Archivematica
[Michael Shallcross; U. Michigan Bentley Historical Library]
Slides | Paper
- #10: Mining and Analysing One Billion Requests to Linguistic Services
[Marco Büchler, G. Franzini, E. Franzini, T. Eckart; Georg-August U. Göttingen, U. Leipzig (Germany)]
Slides | Paper
2:40 – 3:30 Panel: The future for research and education in CAS
- Panelists: [Bill Underwood (summary & position), Maria Esteva (position), Victoria Lemieux (position), Mark Hedges (position), Richard Marciano (position), Mary Kendig (position)]
3:30 – 4:00 Coffee break & Posters [U. British Columbia & U. Maryland]
4:00 – 5:00 Reporting back and next steps [Led by Maria Esteva and Vicki Lemieux]
Introduction to workshop:
The large-scale digitization of analogue archives, the emerging diverse forms of born-digital archive, and the new ways in which researchers across disciplines (as well as the public) wish to engage with archival material are resulting in disruptions to traditional archival theories and practices. Increasing quantities of ‘big archival data’ present challenges for the practitioners and researchers who work with archival material, but also offer enhanced possibilities for scholarship through the application of computational methods and tools to the archival problem space and, more fundamentally, through the integration of ‘computational thinking’ with ‘archival thinking’.
Our working definition of Computational Archival Science (CAS) is:
An interdisciplinary field concerned with the application of computational methods and resources to large-scale records/archives processing, analysis, storage, long-term preservation, and access, with the aim of improving efficiency, productivity and precision in support of appraisal, arrangement and description, preservation and access decisions, and engaging and undertaking research with archival material.
This workshop will explore the conjunction (and its consequences) of emerging methods and technologies around big data with archival practice and new forms of analysis and historical, social, scientific, and cultural research engagement with archives. We aim to identify and evaluate current trends, requirements, and potential in these areas, to examine the new questions that they can provoke, and to help determine possible research agendas for the evolution of computational archival science in the coming years. At the same time, we will address the questions and concerns scholarship is raising about the interpretation of ‘big data’ and the uses to which it is put, in particular appraising the challenges of producing quality – meaning, knowledge and value – from quantity, tracing data and analytic provenance across complex ‘big data’ platforms and knowledge production ecosystems, and addressing data privacy issues.
This is the first workshop at IEEE Big Data addressing Computational Archival Science, although it builds on three earlier workshops on ‘Big Humanities Data’ organized by the same chairs at the 2013-2015 conferences, and more directly on a symposium held in April 2016 at the University of Maryland.
Research topics covered:
Topics covered by the workshop include, but are not restricted to, the following:
- Application of analytics to archival material, including text-mining, data-mining, sentiment analysis, network analysis
- Analytics in support of archival processing, including appraisal, arrangement and description
- Scalable services for archives, including identification, preservation, metadata generation, integrity checking, normalization, reconciliation, linked data, entity extraction, anonymization and reduction
- New forms of archives, including Web, social media, audiovisual archives, and blockchain
- Cyber-infrastructures for archive-based research and for development and hosting of collections
- Big data and archival theory and practice
- Digital curation and preservation
- Crowd-sourcing and archives
- Big data and the construction of memory and identity
- Specific big data technologies (e.g. NoSQL databases) and their applications
- Corpora and reference collections of big archival data
- Linked data and archives
- Big data and provenance
- Constructing big data research objects from archives
Program Chairs:
Dr. Mark Hedges
Department of Digital Humanities (DDH)
King’s College London, UK
Dr. Tobias Blanke
Department of Digital Humanities (DDH)
King’s College London, UK
Prof. Richard Marciano
College of Information Studies
University of Maryland, USA
Prof. Michael Kurtz
College of Information Studies
University of Maryland, USA
Dr. Bill Underwood
College of Information Studies
University of Maryland, USA
Prof. Victoria Lemieux
School of Library, Archival and Information Studies
University of British Columbia, Canada
Dr. Maria Esteva
Data Intensive Computing
Texas Advanced Computing Center (TACC), USA