Building Social Media Archives: Collection Development Guidelines

Social media platforms produce and disseminate a record of our cultural heritage. They document the everyday details of individual lives while also representing new modes of expression tied to historic sociopolitical events. As such, both the content and the form of social media may fall within the collecting interests of libraries and other cultural heritage institutions that take responsibility for stewarding the historical records of their local, regional, national, and international communities.

However, this work poses distinct legal, ethical, and technical challenges that have stymied many archival institutions’ ability to create nimble, efficient, and ethically informed approaches to the preservation of the social media historical record.

This document seeks to illuminate issues potentially facing cultural heritage institutions that want to use social media platforms’ APIs to collect, preserve, and disseminate historically significant data.¹ In this document, we discuss the collection and stewarding of social media data as a historical record of potential long-term value. This is distinct from the capture of social media data sets at the request of researchers to support specific research projects.

The issues are presented below in the form of brief discussions followed by a series of questions for organizations to consider and, where appropriate, to develop explicit policies to address. The answers to these questions will depend on factors such as the nature of the content being collected, the organization’s relationship to the content creators, the organization’s level of resources and capacity for experimentation, and the organization’s level of comfort with legal and archival risk. While you should consider these questions before developing a collection, you may find that you are not able to answer all of them without the experience of doing the work. One approach would be to start small and use that as a way to explore your organization’s position with respect to these issues.

1. Ethical considerations

Complex and often conflicting ethical concerns are likely to arise as organizations build social media archiving programs. For example, archivists collecting social media data may find that this work surfaces inherent conflicts between ethical commitments to privacy and respecting record creators’ preferences about whether their utterances are archived, and commitments to preserving a historical record that does not reinforce systematic patterns of exclusion and silencing. Archivists may also find their understandings of privacy and agency challenged by evolving societal expectations around open data and privacy. Restrictions made to an organization’s collecting in order to respect the privacy or agency of record creators and their subjects will likely come at the cost of silences in the historical record, and the two should be thoughtfully weighed against one another.

As your institution articulates its answers to the questions below in policies, consider the undocumented decisions that may be embedded into the technologies used to collect, manage, and make accessible social media content. These decisions may be manifested in technologies out of your institution’s control, such as social media platforms’ APIs, or in the collection-building tools that your institution selects and chooses to adopt. As you inventory these previously undocumented decisions, your institution may be motivated to make changes to your collection procedures or to simply document and make more explicit the decisions that guide your procedures.

1.1 Questions

In the context of community-based archives, cultural heritage organizations are increasingly opting to facilitate record creators’ preservation of their own content, in order to respect their agency and preferences. For the social media content that your institution wants to collect, are there identifiable individuals, institutions, or community groups with whom your institution can collaborate to build and curate the collection? Do they already consider your institution to be a trusted ally? Do they have the interest, resources, stability, and track record to manage the collection without the involvement of your institution? Who will lead this effort, and how could curatorial decision-making power be shared? Which staff at your organization would be involved in this collaboration, and would they have the time available to maintain active and positive relationships with collaborators?

Is it possible and reasonable for your institution to seek the consent of individual records creators to collect and provide access to their content? If you wish to seek consent from records creators, how will you contact them and manage responses? Will consent be active (e.g., a deed of gift or release) or passive (opt-out), and will it be sought prior to collecting or after the fact? Will you seek consent from all types of records creators? For example, will it be required from public figures, members of hate groups, international terrorist groups, and anonymous groups? If you are collecting topically (e.g., by hashtag), will you attempt to seek consent from each account holder? If you pursue consent, what staff would be responsible for seeking, documenting, and managing consent and access restrictions?

Are the creators or subjects of the social media data at risk of being harmed by your collecting and sharing of the data? Alternatively, are the creators or subjects of the social media data at risk of being harmed if the social media data is not recorded and is absent from the historical record? Do the creators or subjects have an obligation to public transparency or an obligation to defend the public good?

Understanding that in many cases it is not desirable or feasible to obtain explicit consent from records creators, will a lack of consent prevent your institution from collecting or providing access to content? If it prevents collection or access, which voices would be missing from the historical record? Would your institution be more comfortable or less comfortable with its decision if that decision were made explicit in public or private documentation?

How will your organization respond to requests for you to not collect or make data accessible? How will your organization deal with items that are deleted from or made private on the social media platform? Would your institution be willing and able to remove data completely from the archive? Would you be able to document reappraisal decisions? How much detail does this documentation require and what is “good enough” documentation?

What is the likelihood that the creators or subjects of the social media data include individuals who are sometimes afforded a special degree of privacy, such as minors or individuals experiencing mental health issues? Can this risk be mitigated, and how does it balance against the risk of not collecting the content?

How do your organization’s answers to the above questions create patterns of representation and silence in the historical record? Which individuals or groups will be systematically excluded from or underrepresented in the historical record? Will future audiences’ ability to understand history be impeded?

2. API terms of service

One of the most obvious hurdles that has slowed the API-based capture and preservation of social media is the set of terms and conditions that social media services routinely place on the use of APIs. Unlike the “crawling” techniques employed in traditional web archiving, harvesting data via APIs subjects organizations to additional terms of service and to the possibility that platforms may enforce these terms by limiting future access to the APIs. These terms of service may have significant impacts on an organization’s ability to collect, preserve, and provide access to the data. For example, see our earlier discussion of the Twitter API’s terms as of mid-2016.

Social media APIs’ terms and conditions appear to have been designed primarily with commercial re-use use cases in mind, rather than preservation of social media’s historical record. These policies and their enforcement can change at any time without notice, and most institutions do not have the resources to make a long-term commitment to routinely review these policies, track them, and engage legal counsel to interpret implications for access and re-use. Thus, cultural heritage institutions are currently faced with four choices, which are not necessarily mutually exclusive:

collect and make data accessible in violation of the platforms’ policies,
collect and make data accessible according to the restrictions imposed by current policies and terms, in the hope of broader access in the future should the platforms liberalize their policies, or shut down so that the restrictions no longer exist or are no longer enforced,
advocate for changes to social media platforms’ policies and terms that would make room for cultural heritage institutions to archive content, or
not collect, preserve, or make accessible social media content.

2.1 Questions

Have you read the agreements and terms of service that apply to your collecting of social media via APIs?

If your institution makes collecting social media an ongoing program, will you monitor the platforms’ terms of service for changes over time? Will you archive copies of the terms of service to refer to later? What staff will be in charge of monitoring, archiving, and responding to changes in terms of service?
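One way to approach the monitoring and archiving described above is to keep dated copies of the terms and record a new copy only when the wording changes. The following is a minimal sketch under stated assumptions: the function names and the record layout are hypothetical, and how the terms text is retrieved is left to your institution.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(terms_text):
    """Return a SHA-256 digest of the terms text, ignoring whitespace differences."""
    normalized = " ".join(terms_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def record_if_changed(history, terms_text, retrieved_at=None):
    """Append a dated copy to history only if the text differs from the latest copy.

    Returns True when a new version was recorded.
    """
    digest = fingerprint(terms_text)
    if history and history[-1]["sha256"] == digest:
        return False
    when = retrieved_at or datetime.now(timezone.utc)
    history.append({
        "retrieved_at": when.isoformat(),
        "sha256": digest,
        "text": terms_text,
    })
    return True


# Illustrative text only; not quoted from any platform's actual terms.
history = []
record_if_changed(history, "You may not redistribute content.")
record_if_changed(history, "You may   not redistribute content.")  # whitespace only, not recorded
record_if_changed(history, "You may share identifiers only.")      # new wording, recorded
```

A digest comparison only flags that something changed. Staff would still need to read the new version to judge what the change means for collecting and access.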

Will your institution attempt to strictly follow the terms of service, take a risk management approach towards interpreting and enforcing the terms of service, or will your institution decide that the terms of service prevent your institution from collecting social media data via APIs? Knowing that there are risks inherent in all three options, which risks does your institution consider to be greatest? What material within your collecting areas may be lost to the historical record? What choice is most in line with your institution’s mission?

Will your institution attempt to contact or reach out to the social media platforms? Is your institution interested in advocating for improved terms of service and policies for archival collecting programs?

Does your institution have legal counsel, and will you engage them to help interpret the terms of service?

What is your institution’s commitment to (or policy on) making content publicly accessible, and how will this be impacted by platforms’ terms of service and other policies? What choices has your institution made in the past towards online access to born-digital and digitized material? If terms of service state that you may not make content publicly accessible, what will your institution’s response be?

If terms of service state that you may only share identifiers or only portions of the data content, what will your institution’s response be? For example, will you share complete tweets rather than only tweet IDs? Will you differentiate level of access based on the audience (e.g., a member of your institution vs. a user at another institution)?

To what extent can you inform users of the limitations on reuse/distribution and request that they abide by platforms’ terms of service?

Will you preserve or provide access to content that has disappeared from or changed on its original platform?

Are you able to meet your collecting goals while still adhering to the agreements and terms of service? What risks in this area are you and/or your institution willing to take?

Will the answers to any of these questions vary based on the nature of the content creator, e.g., content created by your own institution, by donors who have signed deeds of gift, by public figures or by government agencies?

3. Scoping harvests appropriately

Social media APIs provide access to incredible quantities of data. APIs support mechanisms for selecting the data to be retrieved, e.g., by providing a query or selecting a sample. Further, the API may impose rate limits on the number of items that can be retrieved. Harvest tools (like Social Feed Manager) provide an interface for specifying the scoping criteria that define a collection by determining what is retrieved from a social media platform’s API. They may also have scoping decisions “baked in” to their design. For example, a harvest tool may only work with certain APIs or may only be able to collect certain media types within platforms.

Scoping rules can include things such as:

Whether data is collected continuously or at regular intervals, and what those intervals should be
Type of content collected (private messages, posts, account information, images, video, etc.)
Search terms (keywords, hashtags, accounts)
Which API to use (Twitter, for example, offers different APIs that provide access to either current streaming or limited historic data)
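The scoping rules listed above could be written down as a small, explicit record, so that each collection's scope is documented rather than implied by tool settings. This is a sketch only: the class name, field names, and values are hypothetical and do not reflect the configuration format of Social Feed Manager or any other harvest tool.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class HarvestScope:
    platform: str                           # e.g. "twitter"
    api: str                                # hypothetical label, e.g. "search"
    search_terms: List[str] = field(default_factory=list)
    content_types: List[str] = field(default_factory=lambda: ["posts"])
    interval_hours: Optional[float] = None  # None means continuous collection
    rationale: str = ""                     # why this scope was chosen

    def schedule_description(self):
        """Describe the collection schedule in plain words."""
        if self.interval_hours is None:
            return "continuous"
        return f"every {self.interval_hours:g} hours"


scope = HarvestScope(
    platform="twitter",
    api="search",
    search_terms=["#example"],
    interval_hours=24,
    rationale="Illustrative scope for a topical collection.",
)
print(scope.schedule_description())
print(asdict(scope)["content_types"])
```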

These scoping decisions will be the primary means to shape archival collections. Because they are executed by the harvest tool and social media platforms, and because social media data is especially likely to be used for computational analysis, the potential and the power to create patterns of inclusion and exclusion is significant. You should understand the capacities and limitations of these different options so that you are equipped to make thoughtful decisions about the scope and content of the collections you build. Similarly, when planning a social media archiving program, you should determine the amount of staff time available for learning, implementing, maintaining, and monitoring your scoping selections. Such tasks can consume as much or as little time as is budgeted.

Example: Twitter in December 2016

The core scoping options offered by the APIs themselves can be complex and require some investment of time to master. Twitter, for example, provided at least four methods of collecting data as of December 2016, each of which offered different sets of data and scoping options, and none of which offered complete access to all of Twitter’s data:

The Twitter Search API retrieved public tweets from a sampling of tweets from the most recent 7-9 days.
The Twitter filter streaming API allowed for searches to be performed on all tweets from the current time forward, interrupted by rate limits.
Another method allowed for retrieving the most recent 3,200 tweets from an individual Twitter user account.
Lastly, the Twitter sample stream provided a sample estimated to be 0.5 - 1% of the population of public tweets on any given day.

If an archivist wanted to collect data related to a recently emerged and ongoing issue, for example, she might use the search API to collect recent tweets related to designated keywords, while also setting up a collection using the filter streaming API to gather tweets on an ongoing basis, and retrieving the timelines of recent tweets from central individuals. As the issue develops over the course of days and weeks, the archivist may add new terms to her ongoing harvests.
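When terms are added to an ongoing harvest, recording the date of each addition helps later users see why earlier material does not match newer terms. A minimal sketch, assuming a hypothetical `OngoingHarvest` class that only tracks terms and does not contact any API:

```python
from datetime import date


class OngoingHarvest:
    """Tracks the terms of an ongoing harvest and when each term was added."""

    def __init__(self, method, terms, started):
        self.method = method  # hypothetical label, e.g. "filter_stream"
        self.term_log = [(started, term) for term in terms]

    def active_terms(self):
        """Return every term currently in use, in the order added."""
        return [term for _, term in self.term_log]

    def add_term(self, term, added_on):
        """Add a term with its date, ignoring terms already in use."""
        if term not in self.active_terms():
            self.term_log.append((added_on, term))

    def terms_as_of(self, day):
        """Return the terms that were in use on a given day."""
        return [term for added, term in self.term_log if added <= day]


harvest = OngoingHarvest("filter_stream", ["#example"], date(2016, 12, 1))
harvest.add_term("#example2", date(2016, 12, 5))
harvest.add_term("#example", date(2016, 12, 6))  # already present, ignored
print(harvest.terms_as_of(date(2016, 12, 3)))
print(harvest.active_terms())
```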

With some platforms, such as Twitter, the credentials used to harvest tweets will also affect what is collected. If the account whose credentials are used for a harvest has been allowed to follow an account set to private, all tweets from that account will be harvested, including ones not intended for public distribution.
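If private content is harvested unintentionally, it may be possible to separate it out after the fact. The sketch below assumes each harvested tweet is a dictionary whose `user` object carries a boolean `protected` flag. Verify that assumption against the data your tool actually returns. The function name is hypothetical.

```python
def split_by_visibility(tweets):
    """Separate tweets from protected accounts from publicly visible ones.

    Assumes each tweet has a "user" object with a boolean "protected" flag.
    Tweets without the flag are treated as public.
    """
    public, protected = [], []
    for tweet in tweets:
        is_protected = tweet.get("user", {}).get("protected", False)
        (protected if is_protected else public).append(tweet)
    return public, protected


# Made-up sample records for illustration.
sample = [
    {"id_str": "1", "user": {"screen_name": "a", "protected": False}},
    {"id_str": "2", "user": {"screen_name": "b", "protected": True}},
    {"id_str": "3", "user": {"screen_name": "c"}},
]
public, protected = split_by_visibility(sample)
print([t["id_str"] for t in public])
print([t["id_str"] for t in protected])
```

Treating a missing flag as public is a design choice. A more cautious institution might treat it as private until reviewed.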

3.1 Questions

Do you and your staff understand the scoping limitations posed by the APIs? Do these limitations affect the feasibility of your social media collection development goals?

What harvesting application (e.g., Social Feed Manager) will you use to interact with social media APIs? Is that application robustly documented? Is the documentation transparent and easily understandable by you and/or your program’s staff?

Does the documentation indicate the scoping choices that the application builders “baked in” to the software? Which things can you configure, and which can you not?

Do you want to only collect incrementally, that is, to only collect new content with each harvest, or do you want to reharvest the same photo or tweet multiple times, showing edits to the content and changes in its metadata (such as likes or retweets) over time?
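The difference between the two approaches can be illustrated with a small sketch. It assumes each harvested item is a dictionary with a stable `id` key. The function names and the in-memory stores are hypothetical and stand in for whatever storage your tool uses.

```python
def incremental_harvest(store, items):
    """Keep only the first capture of each item. Returns how many were new."""
    added = 0
    for item in items:
        if item["id"] not in store:
            store[item["id"]] = item
            added += 1
    return added


def snapshot_harvest(store, items, harvested_at):
    """Keep every capture of each item so changes over time stay visible."""
    for item in items:
        store.setdefault(item["id"], []).append(
            {"harvested_at": harvested_at, "item": item}
        )


first = [{"id": "1", "likes": 2}]
later = [{"id": "1", "likes": 5}]  # same item, metadata changed

single = {}
incremental_harvest(single, first)
incremental_harvest(single, later)   # ignored, item already held

history = {}
snapshot_harvest(history, first, "2016-12-01")
snapshot_harvest(history, later, "2016-12-08")

print(single["1"]["likes"])
print(len(history["1"]))
```

The incremental store stays smaller but keeps only the earliest state. The snapshot store grows with every harvest in exchange for a record of change.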

How frequently do you need and have resources to harvest? Is it acceptable that content could be created and deleted between your harvests?

Do you want to collect private content?

Harvesting webpages linked in tweets, Tumblr blog posts, etc., as well as embedded media can provide important context to the data, but can also drastically increase the size of a harvest. When would your institution choose to harvest webpages or media?

Who will be responsible for tracking and managing collection configurations? If this responsibility is shared by more than one person, how will they coordinate? If responsibility is open to many people, how will you train those people and maintain consistency in use of the tool? Can training and scoping rules be developed over time within your institution? In the future, do you expect to have more or less staff time available for these tasks?

4. Documenting decisions that affect a collection

The Social Feed Manager (SFM) project has focused special attention on the ability to thoroughly document appraisal decisions and actions taken during the harvest process that would impact the content obtained. This has been done with the intention of providing future researchers and archivists with sufficient provenance information to understand the context of a collection’s creation, interpret its content, and document the data’s authenticity and trustworthiness. However, while some of the decisions embedded into the system’s functions are automatically recorded, much depends upon those who set up and curate collections to manually record relevant rationales and decisions. It is entirely up to users of SFM to determine how much manual metadata is enough to enable reuse of the data they are collecting.
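Manually recorded rationales can be kept in a simple, exportable decision log alongside whatever the harvesting tool records automatically. The sketch below is one possible shape for such a log. The function names and fields are hypothetical and are not SFM's own format.

```python
import json
from datetime import date


def log_decision(log, decision, rationale, recorded_by, recorded_on):
    """Append one manually documented decision to a collection's log."""
    entry = {
        "decision": decision,
        "rationale": rationale,
        "recorded_by": recorded_by,
        "recorded_on": recorded_on.isoformat(),
    }
    log.append(entry)
    return entry


def export_log(collection_name, log):
    """Serialize the log as JSON so it can travel with the collection."""
    return json.dumps(
        {"collection": collection_name, "decisions": log},
        indent=2,
        sort_keys=True,
    )


decisions = []
log_decision(
    decisions,
    decision="Added a second hashtag to the ongoing harvest",
    rationale="Usage shifted to the new hashtag during the event",
    recorded_by="archivist",
    recorded_on=date(2016, 12, 5),
)
print(export_log("Example collection", decisions))
```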

4.1 Questions

Who are your intended users, and what information will they need to understand and reuse the collections? Do your users have established specifications for how this information should be provided? Is it necessary that you attempt to meet those specifications?

What information is collected, maintained, and made available by the harvesting tools that your institution has adopted? How can that information be exported? Do they meet your users’ needs? If not, how much value is lost?

What additional provenance/appraisal information might your institution choose to collect and document outside your harvesting application?

As time passes, the nature of social media and its terminology may become obscured. How will you attempt to support future researchers’ ability to understand and interpret the data?

What standards of provenance documentation does your institution follow for its other materials, digital and non-digital? Is there such a thing as “too much” provenance metadata? Is there such a thing as “good enough” documentation?

5. Providing meaningful access to collected content over time

Keeping in mind the tenet that preservation cannot exist without access, your institution should carefully consider how it plans to provide access to harvested data while navigating terms of service and respecting records creators’ privacy and autonomy. The following are just a few of the options available.

Different levels of access could be provided to users within and outside your organization.
Content can be provided as raw datasets (e.g., JSON or XML file), extracts from data sets (e.g., CSV or Excel files) or an organization can provide an interface for browsing and searching individual items directly.
Access may be mediated, where catalog or finding aid records indicate the presence of the data but users must contact the archive and interact with staff before gaining access.
Access may be provided on-site only, remotely, or on the public web.
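As an illustration of the difference between raw datasets and extracts, the sketch below turns line-delimited JSON records into a CSV extract containing only selected fields. The function name and the field names (`id_str`, `created_at`, `text`) are assumptions for the example. Check them against the data your harvester actually exports.

```python
import csv
import io
import json


def extract_to_csv(json_lines, fields):
    """Build a CSV extract holding only the chosen fields of each JSON record."""
    out = io.StringIO()
    # extrasaction="ignore" drops any keys that are not in the chosen fields.
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for line in json_lines:
        if line.strip():
            writer.writerow(json.loads(line))
    return out.getvalue()


# Made-up records for illustration.
raw = [
    '{"id_str": "1", "created_at": "2016-12-01", "text": "hello"}',
    '{"id_str": "2", "created_at": "2016-12-02", "text": "world"}',
]
print(extract_to_csv(raw, ["id_str", "created_at"]))
```

An extract like this is easier for many users to open, but it discards fields. The raw dataset remains the fuller record.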

The access that you provide today may not be the same as the access you plan to provide in the future. Some institutions may proactively collect content today in anticipation that they may be able to provide more robust access if third party restrictions are lifted. When approaching such decisions, archivists typically make a careful examination of risk, costs, available resources, and value.

Keep in mind that more complex and variable access restrictions may better meet your institution’s needs but will also require more resources and commitment over time.

Decisions about access may come into play as you select your API harvesting tool, or your institution may be faced with these questions after having already amassed large data collections. Harvesting tools may be designed to collect and export content, but may not be intended as long-term preservation or access systems. Social Feed Manager is not intended as a long-term preservation system and currently provides limited ability for browsing and searching individual items. Other harvesters may feed directly into access systems, as is the case with ArchiveSocial, a commercially-available tool.

5.1 Questions

What is your institution’s commitment to (or policy on) making content publicly accessible, and how will this be impacted by platforms’ terms of service and other policies? What choices has your institution made in the past about online access to born-digital and digitized material?

Will your institution take a risk management approach to making content available, including assessment of the legal and ethical risks, and risks to the historical record? In the past, what risks in this area has your institution been willing to take?

Is content in immediate and high demand? Does your institution have designated communities it intends to serve, and are these the communities demonstrating high demand? If the data is in demand by law enforcement, commercial companies, or political groups, will this influence your institution’s decisions around access and data sharing? Does your institution not want to make content available to specific audiences?

Will content be made publicly available, or available only to certain audiences? If terms of service state that you may not make content publicly accessible, what will your institution’s response be? Would this response be in line with your archives’ mission, existing collection development policies, and access procedures? What additional resources would you need to create and enforce that kind of access environment?

Will access be limited to certain access points (e.g., on-site only)? Why, and what risk does this mitigate? Do you expect to receive users if you limit to on-site access? Does this raise issues of exclusion and privilege that conflict with your organization’s mission?

Will access be mediated by staff of your institution? Why, and what risk does this mitigate?

How will users discover that content exists? Will you provide multiple points of discovery for different user types (e.g., catalog records, finding aids, public search engine full text indexing)? Considering that today’s users expect to discover and access content simultaneously through fulltext searching, will these traditional means of archival discovery be suited to meet user needs?

What format will content be available in? Will it be accessible only as datasets? Will an interface be provided for search and browsing?

Will content be made accessible with fulltext searching? Will the fulltext be discoverable and indexable by popular search engines? How would this be technically achieved, and what user needs does it meet? Does it pose risks to make the data so easily discoverable by the public?

Will users be able to query the data (fulltext or otherwise) and export the results of their queries? Or, will data be pre-packaged for export?

When will content be made available? Immediately? After processing? Will embargoes be placed on content? If so, does your organization have resources and past experience managing embargoes?

Does your organization plan to make content available only in the future if social media organizations’ terms of service are no longer enforceable? If you do not make content available now (or make it available in only very limited ways), will you be able to sustain institutional resources to provide mediated access during the interim and preserve content until it can be made more accessible? How does this fit within your institution’s long-term resource planning, including financial and staff resources?

If terms of service state that you may only share content IDs or other portions of the data content, what will your institution’s response be? For example, will you share complete tweets rather than only tweet IDs?
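Sharing identifiers only means reducing each harvested record to its ID before release. The following is a minimal sketch, assuming line-delimited JSON in which each tweet carries an `id_str` field, or a numeric `id` as a fallback. The function name is hypothetical.

```python
import json


def tweet_ids(json_lines):
    """Reduce harvested tweet records to a list of their identifiers."""
    ids = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        # Prefer the string form of the ID; fall back to the numeric form.
        ids.append(record.get("id_str") or str(record["id"]))
    return ids


# Made-up records for illustration.
harvested = [
    '{"id_str": "101", "text": "first"}',
    "",
    '{"id": 102, "text": "second"}',
]
print(tweet_ids(harvested))
```

Users who receive only IDs must retrieve the full content from the platform themselves, so anything since deleted or made private will be missing from what they obtain.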

If terms of service state that you must restrict the amount of data that users can access (for example, up to 50,000 tweets per user per day), will you attempt to enforce this? How? Can users be informed of the limit and required to self-enforce?
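If your institution chose to enforce such a limit itself rather than rely on users, the bookkeeping could look like the sketch below. The class name and its in-memory counter are hypothetical. The 50,000 figure is simply the example limit mentioned above, not a statement of any platform's current terms.

```python
from collections import defaultdict
from datetime import date


class DailyQuota:
    """Tracks how many items each user has received on each day."""

    def __init__(self, limit_per_day):
        self.limit = limit_per_day
        self.used = defaultdict(int)  # (user, day) -> items already granted

    def request(self, user, count, day):
        """Grant as many items as remain in the user's allowance for that day."""
        remaining = self.limit - self.used[(user, day)]
        granted = max(0, min(count, remaining))
        self.used[(user, day)] += granted
        return granted


quota = DailyQuota(50000)
today = date(2016, 12, 1)
print(quota.request("researcher", 30000, today))  # within the allowance
print(quota.request("researcher", 30000, today))  # trimmed to what remains
print(quota.request("researcher", 1, today))      # allowance exhausted
```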

Will you preserve or provide access to content that has disappeared from or changed on its original platform? Will this vary depending on who created the data? For example, what if the data was created by your own institution, by donors with whom you have a deed of gift allowing such access, by political candidates or public figures, by government, or by terrorist groups?

6. Conclusion

You may find that your organization cannot yet answer many of these questions, and that more experience is necessary to fully engage with the ethical, technical, and institutional issues raised above. This document seeks simply to help guide discussions within institutions, and encourage decisions to be made explicitly and with active consideration. Your organization may opt to take an iterative approach, starting small and returning to these questions as the program matures. As you consider available resources and the time it takes to implement, mature, and sustain a program, you may find that it is also beneficial to take a practical approach and consider what “good enough” looks like. Despite the issues involved in their use, social media APIs afford a unique opportunity to document a historic period in global communication, unmatched by alternative web archiving techniques.

¹ Application Programming Interfaces (APIs) provide a mechanism for retrieving data from social media platforms. These APIs return data in a structured text format, typically JSON or XML, that is intended to be machine-readable. Collecting social media data from APIs is related to, but distinct from, the efforts of cultural heritage institutions to collect social media from the websites of social media platforms using traditional web archiving techniques.