Data:

Metadata key

ID: Internet Archive unique identifier
Title
Author
Camera: tech spec of camera used to scan
Contributor: Organization owner who provided book
publish_date: date original work published
language: language of original book
operator: name of worker performing scan
ppi: resolution of scan
repub_state: unknown
scanner: machine used to perform scan
scanningcenter: location scan performed
sponsor: entity paying for scan
scandate: date scanned (deprecated, use date)
imagecount: number of pages
republisher_operator: worker performing republishing (editing, quality control, formatting)
republisher_date: day republishing labor performed
republisher_time: time republishing labor performed
scanfee: amount charged for scanning
sponsordate: unknown
ocr_converted: software used to perform OCR on scan
page_number_confidence: unknown
search_date: date added to IA records
date: cleaned and formatted date

Complete and processed data files are available at this Box site.

READ_ME.boxnote: details about datasets
geocoded-texts-data.csv: combined metadata across all IA centers
scancenter-data: raw metadata from IA
IA-blog: archived HTML pages from IA's blog

The code to create the website is available at our static site repo.

The code for the high-volume API queries and for creating all of our data visualizations is available at this GitHub repo.

Mapping Methods

Geocoding the centers

Geocoding refers to the process through which geographers (or GIS systems) transform language referring to a place into mappable geographic coordinates. GIS systems can often automatically geocode information like addresses, city names, or country names. However, they are incapable of resolving less standardized place references to geographic coordinates. In that case, a human must geocode them.

For the Scanning Labor project, we geocoded the “scanningcenter” metadata field in the downloaded Open Library records. Of the 3 million records we scraped from the Open Library API, only 2.5 million had any information in the “scanningcenter” field. Workers created the remaining 500,000 records mostly from 2001 to 2008 under the purview of the Google Books project. Google, unlike IA, required employees to sign NDAs barring them from revealing the scanning center location. As such, we cannot geocode the centers at which workers scanned these 500,000 books based on the records we have. This geographic opacity is strategic on Google’s part.

The other 2.5 million records contain 93 unique values in the “scanningcenter” field. We attempted to geocode these through the following method:

1. For each value, we entered the following query in the archive.org search bar: “scanningcenter:(value)”. For example, to geocode the value “il”, we searched “scanningcenter:(il)”.
2. Next, we skimmed the books in the result list. If they all appeared to be part of the same collection, we navigated to the collection page. If they were not part of the same collection or series of collections, we skipped down to step 4.
3. If the number of books in the collection was similar to the number of books the search returned, it is likely that the institutional owner of the collection is also the institution that housed the scanning center. According to the director of IA’s scanning partners program, Elizabeth McCleod, most of the scanning centers are located within the partner institution’s library. Therefore, we geocoded these centers to the geographic coordinates of the partner institution’s library.
4. Scanning centers for which the query did not return books belonging to a single collection were probably operated by the Internet Archive to digitize its own collections. We browsed the books in the result list to confirm this. If the “sponsor” field or “owner” field contained “Internet Archive” or “Kahle/Austin Foundation”, we deduced that these were likely IA-owned books. Geocoding these centers was more complicated, as most are not associated with any partner institution and many are contracted business process outsourcing (BPO) organizations in East or Southeast Asia.
5. To geocode them, we searched the Internet Archive blog for the “scanningcenter” value (for example, “cebu”). Sometimes, this returned blog entries announcing a new relationship with a BPO company. Other times, there were no blog entries, and so we searched the Internet Archive’s 990 tax returns for contractor partners that may correspond to “scanningcenter” values. For example, the value “shenzhen” goes unmentioned in any IA blog posts. However, IA’s 990s reveal that the organization began contracting with a Chinese BPO firm, Datum Data Co. Ltd., located in Shenzhen, at the same time the “scanningcenter” value “shenzhen” appeared in the data.
6. We were unable to geocode a few of the “scanningcenter” values. We decided to map these in spite of our inability to accurately geocode them because we did not want to further invisibilize the labor of the workers who scanned the books and created the records. However, we distinguished these from centers whose locations we successfully geocoded by using an aqua dot instead of a red one.
7. The 93 unique values in the “scanningcenter” field corresponded to 64 scanning center locations. For example, “il”, “ill”, and “illinois” all refer to the University of Illinois at Urbana-Champaign. We made the list of tags, coordinates, and names of the scanning centers available in the scanning center location map. We separated the “scanningcenter” tags for each center with “||”.
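
The collapse from 93 tags to 64 centers amounts to a many-to-one lookup. The following is a minimal sketch of how such a lookup could be built from rows whose tags are joined with “||”. The row layout, the field names (name, lat, lon, tags), the function name, and the rounded coordinates are assumptions made for illustration, not the project's actual location table.

```python
# Hypothetical location rows: the field names and the rounded coordinates
# are illustrative assumptions, not values taken from the project's data.
CENTER_ROWS = [
    {
        "name": "University of Illinois at Urbana-Champaign",
        "lat": 40.1,
        "lon": -88.2,
        "tags": "il||ill||illinois",
    },
]


def build_tag_lookup(rows):
    """Map every raw "scanningcenter" tag to the row describing its center."""
    lookup = {}
    for row in rows:
        # Tags for one center are stored in a single string, separated by "||".
        for tag in row["tags"].split("||"):
            tag = tag.strip().lower()
            if tag:
                lookup[tag] = row
    return lookup


tag_lookup = build_tag_lookup(CENTER_ROWS)
print(tag_lookup["ill"]["name"])
```

With a lookup like this, any tag that is missing from the dictionary is easy to spot, which is one way to list the values that still need manual geocoding.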

Data Preparation for mapping

1. We created a Python script to process the downloaded data and count the number of records per “scanningcenter” tag per month. The result is a spreadsheet in which each row represents one month of one year and each column holds the number of records tagged with a given “scanningcenter” value for that month. View the CSV file here.
2. Next, we geocoded the data. To do this, we iterated over every record in the scans/tag/month dictionary and mapped each tag to its corresponding center, summing the record counts along the way. For example, if there were 90 records tagged with “il” and 3 tagged with “ill” in January 2018, the University of Illinois is credited with 93 scans for that time period. If the number of records digitized in a month was 0, we did not include the center on the map (assuming instead that it had closed).
3. We transformed each date into YYYY-MM-DD HH:MM:SS format.
4. We created a CSV file in which each row corresponds to one scanning center in one month of the period between December 2001 and October 2022. See creates_csv_map.py.
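
The steps above can be sketched roughly as follows. This is not the project's creates_csv_map.py: the record layout, the function names, the tag-to-center dictionary, and the sample values are all assumptions chosen to mirror the “il”/“ill” example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical input: one (scanningcenter tag, scan date) pair per record.
records = [("il", "2018-01-04")] * 90 + [("ill", "2018-01-20")] * 3

# Hypothetical tag-to-center lookup (see the "||"-separated tag list).
TAG_TO_CENTER = {
    "il": "University of Illinois at Urbana-Champaign",
    "ill": "University of Illinois at Urbana-Champaign",
    "illinois": "University of Illinois at Urbana-Champaign",
}


def count_by_tag_and_month(records):
    """Count records per (tag, first day of month)."""
    counts = Counter()
    for tag, date_str in records:
        month = datetime.strptime(date_str, "%Y-%m-%d").replace(day=1)
        counts[(tag.lower(), month)] += 1
    return counts


def count_by_center_and_month(tag_counts, tag_to_center):
    """Fold tag-level counts into center-level counts."""
    center_counts = Counter()
    for (tag, month), n in tag_counts.items():
        center = tag_to_center.get(tag)
        if center is not None:  # tags missing from the lookup are skipped here
            center_counts[(center, month)] += n
    return center_counts


def to_rows(center_counts):
    """Emit one row per center per month, leaving out months with no scans."""
    rows = []
    for (center, month), n in sorted(center_counts.items()):
        if n > 0:
            rows.append({
                "center": center,
                "datetime": month.strftime("%Y-%m-%d %H:%M:%S"),
                "count": n,
            })
    return rows


tag_counts = count_by_tag_and_month(records)
rows = to_rows(count_by_center_and_month(tag_counts, TAG_TO_CENTER))
print(rows)
```

In this sketch a center simply has no row for a month in which it has no records, which matches the decision to leave inactive centers off the map. The rows could then be written out with Python's csv.DictWriter.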

Making the KeplerGL map

KeplerGL is an open-source mapping platform developed by Uber. We chose it over Esri's ArcGIS because it is open source and allows for easy export to HTML. We uploaded the CSV file containing the number of scans per center per month to Kepler’s online user interface. From there, we made the radius of each point correspond to the number of scans (the “count” field). Next, we made the color of each point depend on how certain we were that we had geocoded the location properly. Finally, we added a time filter to create the animation.

The Python script we used to create the CSV file, the CSV file itself, and the map as a JSON file are all available on our GitHub.

Oral History Methods

Initially, our goal for this project was to include oral histories of scan operators at each of the scanning centers. To do so, we developed a series of questions found here: DH-IA-oral history questions and planned to conduct one-hour oral interviews to capture the day-to-day experience of scanning for IA.

We scraped over 500 email addresses from the metadata and reached out individually to 25 of those people. Unfortunately, we had trouble connecting with any of them: almost all of the emails bounced back. This highlights the high turnover rate of scanners at IA, which contributes to the invisibility of these workers. We did receive one reply from a scan operator: “Thank you for considering me for this project. Unfortunately I don't feel I would be a good fit at this time. Please contact Chris Freeland at chrisfreeland@archive.org. Chris is our PR representative. I'm sure they will be able to help you.” We reached out to Chris Freeland at the Internet Archive but did not receive a reply.

Pivoting our approach, we created a Google Forms survey to send out the oral history questions via email. We felt a survey would lead to more responses, as it takes less time from interviewees and is completely anonymous. We wanted to be cognizant of the extra time and unpaid labor we would be asking of these workers, so the survey has only 13 questions, none of which are required. We were able to send the survey to all 500 of the email addresses in the metadata, which gave us a broader reach and a higher likelihood of connecting with workers who were interested and whose email addresses were still active. So far we have received only three responses and many bounced emails.

Beyond gathering stories and anecdotes through oral history and the survey, we also sought out pre-existing worker narratives on Glassdoor. From our experience, it seems crucial to approach oral history more intentionally and slowly, and to be realistic about the commitment to relationship building it entails.

Another aspect of oral history in our project was meeting with several non-scanning staff at the Internet Archive (although two of the people we interviewed had started as scanners). We conducted two hour-long interviews to understand the workflow of scanning centers, better understand our findings in the metadata, and get contacts for further interviews.

While IA middle management were initially willing to meet with us and be interviewed for the project, they refused to go forward with it after we sent our survey to IA scanning center workers. After sending the survey, we received this email from an IA staff member: “I'm glad our conversation was helpful for your project. At this point, we have participated as an organization to the extent that we are comfortable. If you have any further inquiries, please direct them to me.”