About the project
Responsible Sourcing Across the Data Supply Line (Responsible Sourcing) is a multi-stakeholder initiative led by the Partnership on AI. Data labelers, data cleaners, and others who contribute human judgment to artificial intelligence (AI) systems play a critical role in developing this technology. Drawing from a diverse range of perspectives, the Responsible Sourcing project aims to develop recommendations and actionable resources to improve the working conditions of these professionals.
News & Updates:
Models for quality working conditions
In the fall of 2020, the Partnership on AI hosted a Workshop Series on Responsible Sourcing of Data Enrichment Services. To kick off the event, Mary L. Gray (Microsoft Research, Indiana University) led a conversation with Dean Jansen and Aleli Alcala (Amara, a project of the Participatory Culture Foundation) highlighting alternative models for employment in on-demand work that produce better outcomes for workers.
As we continue to develop these recommendations, we hope to connect with data scientists, AI engineers, and product managers who are interested in providing feedback and potentially piloting the recommendations in their workflows.
Anyone interested is also welcome to sign up for updates on this area of work and future calls for participation.
As many businesses pursue automation and personalization through their technology investments, AI applications are becoming an increasingly common feature of industry. This boom has been accompanied by an expansion of data enrichment work.
Despite being an essential component of AI development, data enrichment work has for too long been both out of sight and out of mind for AI developers. Without knowledge of (and appreciation for) how it is produced, enriched data can be too easily treated as a simple commodity. This disconnect leads to a devaluing of data enrichment work, poor working conditions for data enrichment workers, and, often, worse outcomes for AI development itself.
Increasingly, AI practitioners are recognizing the importance of data enrichment work and the people behind this critical enabling step in the AI development process. Unfortunately, too many AI developers still aren’t aware of the ways they are precipitating harmful and precarious working conditions, and those who are aware don’t know what they can do to help. From AI developers we’ve heard sentiments like “We feel we must care about the transparency of our supply chain. But there is no transparency in data labeling. Guidelines on how to navigate this would be very useful.” Similarly, data enrichment providers express that they “would love the buyers [of data labeling] to be more educated and have realistic expectations when they set the price and terms of tasks.”
The Responsible Sourcing initiative responds to these needs by working to provide actionable guidance for data scientists, AI engineers, and product managers, empowering these critical ecosystem players to do their part in ensuring healthy and fair working conditions across the data supply line.
What is data enrichment?
The concepts of machine learning have been around for more than half a century, but most of the major advances have taken place in the last five to ten years. This is thanks to improvements in hardware performance and the affordability of computing power, which have made it possible to collect and analyze data at an unprecedented scale. As Ian Goodfellow, Yoshua Bengio, and Aaron Courville wrote in their 2016 book Deep Learning, “The most important new development is that today we can provide these algorithms with the resources they need to succeed.” Those resources are data.
But today’s AI systems cannot be built with just any data. They require enriched data. Data enrichment is a broadly defined term that encompasses various types of data preparation and cleaning as well as human-review processes. Enriched data is essential for the training and validation of supervised learning models, the dominant form of applied AI. Examples of data enrichment work include:
- Data preparation and cleaning:
  - Data annotation
  - Intent recognition
  - Sentiment analysis
  - Image recognition
  - Speech-to-text validation
- Human-review/human-in-the-loop work, which may include:
  - Content moderation
  - Creating a continuous feedback loop
  - Validating algorithmic outputs and models