To help AI researchers make their tools and processes more inclusive, Meta has released a massive, diverse dataset of face-to-face video clips.
Casual Conversations v2 includes a broad range of diverse individuals, and will help developers assess how well their models work for different demographic groups.
Meta’s VP of civil rights, Roy Austin Jr, said current large language models “lack diversity”. He argues that “the only way to test is to have a diverse model, to have those voices that may not be in the larger models and to be intentional about including them”.
Why it matters
For AI to serve communities fairly, researchers need diverse and inclusive datasets to thoughtfully evaluate fairness in the models they build.
Gathering data that assesses how well a model works for different demographic groups is difficult. That’s due to complex geographic and cultural contexts, inconsistency across data sources and the difficulty of labelling data accurately.
With this new publicly available resource, researchers can better evaluate the fairness and robustness of their AI models.
This dataset is designed to maximise inclusion by giving AI researchers more samples of people from a wide range of backgrounds.
How was the database created?
The v2 database was informed and shaped by a comprehensive literature review around relevant demographic categories.
The dataset includes more than 25,000 videos from more than 5,000 people across seven countries. Rather than rely on algorithms, people self-identify their age, gender, race and other characteristics such as disability and physical adornments.
Trained experts then added additional metadata, including voice and skin tones.
The videos, featuring paid participants who gave their consent to be in the dataset, included both scripted and unscripted monologues. Participants were also given the chance to speak in both their primary and secondary languages.
Unlike Meta’s earlier dataset, which included few categories and only US participants, v2 offers a more granular list of 11 self-provided and annotated categories.
The self-provided categories include age, gender, language/dialect, geolocation, disability, physical adornments and physical attributes.
To further measure algorithmic fairness and robustness in these AI systems, v2 expands its geographic coverage to seven countries: Brazil, India, Indonesia, Mexico, Vietnam, the Philippines and the United States.
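To see how self-identified categories like these support fairness evaluation, here is a minimal sketch of a per-group accuracy breakdown. The field names (`age_group`, `prediction`, `label`) are hypothetical, not the actual Casual Conversations v2 schema:

```python
# Hypothetical sketch: break down model accuracy by a self-identified
# demographic category. Field names are illustrative only.
from collections import defaultdict

def accuracy_by_group(records, group_key):
    """Return per-group accuracy for a list of prediction records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        group = r[group_key]
        totals[group] += 1
        hits[group] += int(r["prediction"] == r["label"])
    return {g: hits[g] / totals[g] for g in totals}

# Toy records standing in for model evaluations on dataset samples.
records = [
    {"age_group": "18-30", "prediction": 1, "label": 1},
    {"age_group": "18-30", "prediction": 0, "label": 1},
    {"age_group": "31-50", "prediction": 1, "label": 1},
]
print(accuracy_by_group(records, "age_group"))
# {'18-30': 0.5, '31-50': 1.0}
```

A large gap between groups in a report like this is the kind of signal the dataset is intended to surface.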
The new dataset will help AI developers address concerns around language barriers and physical diversity, both of which have been problematic in some AI contexts.