- 0.1 - After online session 9-4-2020
- 0.2 - After online session 16-4-2020
- 0.3 - After online session 30-4-2020
Before reading this FAQ (and please do)
Twarc is a tool for harvesting Twitter data, and it yields structured data. It is a command-line tool, so it has something of a learning curve, but much of it is well documented. The tool was developed by Documenting the Now and has an avid community of researchers and activists behind it. More detail is available in the Preliminary Social Media Archival Tool Report - April 1 2020.
An alternative to twarc is TAGS, which is used via Google Spreadsheets. It is well suited to novice users and lets you harvest API data from Twitter in a similar fashion to twarc, only with a little less flexibility.
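To give an idea of what "structured data" means in practice, here is a minimal sketch of reading a line-oriented JSON harvest (one tweet per line) and counting hashtags. The two sample tweets are made up, and the field names (`entities`, `hashtags`, `text`) are an assumption based on the classic Twitter API tweet format; a harvest made with a different tool or API version may use other names.

```python
import json
from collections import Counter

# Two made-up tweets, one JSON object per line, standing in for a harvest file.
SAMPLE_JSONL = "\n".join([
    json.dumps({
        "id_str": "1",
        "full_text": "First #archive test",
        "user": {"screen_name": "example_a"},
        "entities": {"hashtags": [{"text": "archive"}]},
    }),
    json.dumps({
        "id_str": "2",
        "full_text": "More #Archive and #twitter",
        "user": {"screen_name": "example_b"},
        "entities": {"hashtags": [{"text": "Archive"}, {"text": "twitter"}]},
    }),
])


def count_hashtags(lines):
    """Count hashtags (case-insensitively) across line-oriented JSON tweets."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        for tag in tweet.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
    return counts


hashtag_counts = count_hashtags(SAMPLE_JSONL.splitlines())
print(hashtag_counts.most_common())
```

With a real harvest you would pass an open file object instead of the sample string, since iterating over a file also yields one line at a time.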
How can you choose specific combinations of hashtags, accounts, keywords, etc. to archive (e.g. all the posts with a specific keyword from a Facebook user or all the tweets with a specific hashtag from a specific Twitter user)?
You can use youtube-dl for this purpose. This is a command-line tool (so it has a learning curve) that is pretty reliable. More detail on this tool will follow.
Can you remove unwanted content from inside a WARC?
Yes, you can, for example with a command-line tool called warc-extractor. Given some keywords, this tool searches the URLs stored inside the WARC, filters out what you do not want, and creates a new WARC.
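The filtering idea can be illustrated with a small sketch. This is not warc-extractor's code and does not read real WARC files: the `Record` type, the `drop_matching` function and the example URLs are all hypothetical, and only show the logic of dropping records whose URL contains an unwanted keyword and keeping the rest as a new collection.

```python
from typing import Iterable, List, NamedTuple


class Record(NamedTuple):
    """Hypothetical stand-in for one captured resource in a WARC."""
    url: str
    payload: bytes


def drop_matching(records: Iterable[Record], unwanted_keywords: Iterable[str]) -> List[Record]:
    """Return only the records whose URL contains none of the keywords."""
    lowered = [keyword.lower() for keyword in unwanted_keywords]
    return [
        record
        for record in records
        if not any(keyword in record.url.lower() for keyword in lowered)
    ]


records = [
    Record("https://example.org/page", b"<html>...</html>"),
    Record("https://example.org/ads/banner.js", b"..."),
    Record("https://tracker.example.net/pixel.gif", b"..."),
]

# Keep everything except URLs mentioning "ads" or "tracker".
kept = drop_matching(records, ["ads", "tracker"])
print([record.url for record in kept])
```

Note that plain substring matching is crude: a keyword such as "ads" would also match a URL containing "downloads", so keywords need to be chosen with care and the resulting WARC checked afterwards.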
How difficult is it to harvest the different social media platforms?
In general, none of the platforms makes archiving easy, because this activity doesn't really fit their business models. The rules of use and the software behind a platform also change, so tools that work now can be obsolete tomorrow.
- Twitter doesn't make it easy but with some trouble a lot can be done. You will need to create a developer API account when using tools that follow the structured data approach.
- Instagram is definitely archivable with Webrecorder and tools that work similarly. Some more research is needed into the API harvesting possibilities for it (there are tools available, such as Instaphyte).
- Facebook makes harvesting very difficult, and in many cases the harvest will be incomplete or will not work at all.
Using tools to capture social media content can often result in lower quality captures or rights violations, and that is an accepted reality for this task. An alternative to this is to contact the archival creator/donor and ask them for a copy of the data you want to preserve. Most if not all social media platforms offer this functionality of downloading one's own data, although the output differs in form, structure, and richness. Doing this is of course only possible when you have access to the social media users or creators you want to archive and are able to have/build such a relationship with them, which can often prove to be difficult.
Is information about the tools themselves important to document and preserve together with captured materials?
Yes, it is. In social media archiving, and in web archiving in general, the collected materials are essentially created by the tools as we use them. The WARCs, JSONs and other files we harvest and capture do not exist before we start the harvesting process, unlike a DOC or JPEG file that is selected to be archived, which does. Because of this, and in order to preserve the provenance of social media records, recording information about the specific tools used, their versions, operating systems, specific configurations, commands, etc. could be invaluable for future users of the material.
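One possible way to record this information is a small provenance file stored next to each capture. The sketch below is only an illustration: the function name `build_provenance`, the field names and the placeholder values are assumptions, not an established standard, and should be adapted to your own documentation practice.

```python
import json
import platform
from datetime import datetime, timezone


def build_provenance(tool, tool_version, command, config=None):
    """Collect basic facts about how a capture was made (illustrative fields)."""
    return {
        "tool": tool,
        "tool_version": tool_version,
        "command": command,
        "config": config or {},
        # Details of the machine the capture ran on.
        "operating_system": platform.platform(),
        "python_version": platform.python_version(),
        # Time the provenance record was written, in UTC.
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Placeholder values: fill in what was actually used for the capture.
record = build_provenance(
    tool="twarc",
    tool_version="<version used>",
    command="<exact command line used>",
    config={"note": "hypothetical example"},
)

provenance_json = json.dumps(record, indent=2, sort_keys=True)
print(provenance_json)
```

Saving this output as a JSON file alongside the WARC or JSON harvest keeps the description of the tool together with the material it produced.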