The Dutch Digital Heritage Network (DDHN)-funded project "Evaluation of Social Media Archiving Tools" has been underway at IISH for the last couple of weeks. While it is still too early for detailed results, this brief report outlines preliminary findings about some of the tools tested so far, specifically those I have found to be the most useful and usable to date.
For the observations made, especially with regard to usability, I assume that the target user of the tools below is someone with at least average or above-average IT proficiency. I follow the definition of average IT proficiency proposed by DIGHUMLAB (Digital Humanities Lab Denmark):
"People in this skill level are experienced with programs they use for daily tasks, often including some advanced functions, but they will usually require some time and help for learning new routines. They usually are aware of the importance of security, and have a basic understanding of some data types and data handling, but when learning new routines and programs they are usually best served with also establishing a number of fixed routines and precautions in how the data should be stored, handled, and protected. Advanced programs and tasks can be learned, but the learning process may prove difficult and time-demanding."
The two main categories of social media archiving tools that we used to prepare our test plan are:
- Tools that capture the content that the end-user of a social media page/account is served on their browser when they access the social media platform, including e.g. hyperlinks, embedded video, audio, and other rich-media content (i.e. the "look and feel" or the "artefactual value")
- Tools that capture the information off a page/account/other seed, as well as metadata about it, in the form of structured data from a social media platform's API (i.e. the "informational content")
Based on these two categories, and taking into account the main software quality attributes we set out, the testing process so far has indicated the following:
1. "Look-and-feel" tools
brozzler is a crawler that uses a browser to fetch, crawl, and capture pages. Unlike more "traditional" crawlers like Heritrix, brozzler is able to access and capture web content as a web browser would display it, using pre-defined behaviors (e.g. scrolling down a page). This allows it to preserve dynamic features and work around the restrictions that, for example, infinite scrolling imposes on archiving social media.
- Installation: Installing brozzler was not particularly difficult when the steps on the tool's GitHub page were followed closely, but it turned out that there were discrepancies between the module versions used by the tool's components: one component required a specific version while another required a different one. While this might be fixed in the future, we decided to work around the problem by installing the application within a Python virtual environment.
- Usage: Because crawls are set up by writing a YAML file, it is advisable to download and use a code editor for this purpose, such as Atom or Visual Studio Code.
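To give an impression of what such a file looks like, here is a minimal sketch of a brozzler job configuration. The job ID, seed URLs, and time limit are illustrative values; the exact set of supported keys should be checked against the job configuration documentation in brozzler's repository.

```yaml
# Hypothetical brozzler job configuration (values are examples only)
id: iish-test-job
time_limit: 3600        # stop the crawl after one hour
seeds:
  - url: https://example.org/
  - url: https://example.org/blog/
```

Assuming the file is saved as `job.yaml`, a job like this would be submitted with brozzler's `brozzler-new-job` command.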
The database that the tool uses to manage crawls, RethinkDB, is essentially no longer developed and supported, so issues with it can be expected in the future (it is of course possible to replace it with a different database if IT resources can be spared for this).
One of the strengths of brozzler seems to be its distributed capability, i.e. the ability to run multiple processes simultaneously. In the simple tests I have run so far, this has not been possible, as the default configuration only allows one process to run at a time; this is something to research further, as it could be beneficial to large-scale archiving.
As it is open-source, modifications can be made to increase security in a production environment.
Webrecorder Desktop, the desktop version of Webrecorder.io, records the user's clicks through a page in order to make a faithful copy of it. It integrates the Autopilot feature (for automatically scrolling pages), as well as a Preview mode that, in the case of password-protected websites, allows the user to log in first and then start recording. There is also an experimental mobile-device emulation mode.
Webrecorder Desktop is very intuitive to install and to use. Some tweaking might be needed if you encounter errors or missing content after you record, e.g. you might want to try recording with one of the alternative virtual browsers that the tool offers instead of your current one, but all in all this is the tool requiring the least specialized skill.
The Autopilot feature, while sometimes very successful in recording e.g. an Instagram page, is not always reliable. It fails repeatedly with Twitter hashtag recordings, as well as Facebook content. As such, there is still a chance that manual recording will be needed.
Additionally, recent experience of recording Twitter indicates that sometimes the content Webrecorder captured will not render properly, or will appear as if it has not been saved. This could be due to the rate limiting introduced by Twitter, to bugs in the software components, or both, but it is an issue to be dealt with in the near future.
The considerable advantage of the desktop app over the hosted version of Webrecorder is the ability to store content locally without limitations on data storage. This means that one can record as much content as they want.
Storing locally also ensures that the content can be stored safely (esp. in the case of sensitive materials containing passwords and personal information).
There is no possibility to use different accounts for different users, but this can be bypassed by setting up the tool on one or more servers and configuring credentials locally.
2. "Structured data" tools
twarc, developed by Documenting the Now, has an avid community of researchers and activists behind it. It is a command-line tool that accesses the public API offered by Twitter to capture public tweets from the last 7 days. It is also able to harvest user and friend metadata, as well as random samples and a live stream of tweets based on keyword, hashtag, and user ID queries.
It requires the user to register for a Twitter Developer account, complete an application form, and create an application that twarc will use to access the API. It is advisable to apply for a personal account, as this makes the process easier.
Because of Twitter restrictions on the reproduction and republication of their data, one of the elements of the application is to assure Twitter that you will only use the content you download for personal research and that you will not share it with any other party. This means that datasets downloaded with twarc cannot be published as they are; they need to be "dehydrated", i.e. reduced to the Twitter IDs corresponding to each user and message. Only such a catalogue of IDs can be published, which someone can later use to "rehydrate" the dataset by accessing the API and fetching the content the IDs point to. This of course means that content that has in the meantime been removed or made private will not be retrieved, an issue to be aware of.
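The dehydration step can be pictured as follows. twarc itself provides a `dehydrate` command for this; the short Python sketch below only illustrates the principle on a toy dataset, with the field names taken from Twitter's JSON format:

```python
import json

def dehydrate(jsonl_lines):
    """Reduce captured tweets (one JSON object per line, as twarc writes
    them) to a publishable list of tweet IDs. Illustrative sketch only."""
    ids = []
    for line in jsonl_lines:
        tweet = json.loads(line)
        # Twitter serves the ID both as an integer ("id") and as a string
        # ("id_str"); the string form avoids precision problems.
        ids.append(tweet["id_str"])
    return ids

# A toy two-tweet dataset standing in for a real twarc capture.
sample = [
    '{"id_str": "1100000000000000001", "full_text": "first tweet"}',
    '{"id_str": "1100000000000000002", "full_text": "second tweet"}',
]
print(dehydrate(sample))  # ['1100000000000000001', '1100000000000000002']
```

Rehydration is the reverse: the published IDs are sent back to the API, which returns the full objects for any tweets that still exist.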
In terms of usage, while the command line is required, one can follow the instructions set out by DocNow and get by fairly well. More advanced operations, such as configuring authentication, using twarc as a library for collecting tweets, or visualizing collected datasets, will present a slight learning curve for the average user.
The tool works almost without errors when used correctly. A realistic bottleneck for its reliable performance is, of course, the rate limits imposed by the Twitter API itself.
If the need arises, it seems possible to increase the number of requests twarc can make to the API by switching from user to app authentication (User Auth to App Auth), especially when using the search option.
The user needs a personal Twitter account and a Twitter Developer account, both of which need to be authenticated. In an institutional setting, more than one account can therefore be created and used for accountability and documentation purposes.