Meta releases dataset to maximise inclusivity and diversity
To help AI researchers make their tools and processes more inclusive, Meta has released a massive, diverse dataset of face-to-face video clips.
Casual Conversations v2 features a broad range of individuals, and will help developers assess how well their models work for different demographic groups.
Meta’s VP of civil rights, Roy Austin Jr, said current large language models “lack diversity”. He argues that “the only way to test is to have a diverse model, to have those voices that may not be in the larger models and to be intentional about including them”.
Why it matters
For AI to serve communities fairly, researchers need diverse and inclusive datasets to thoughtfully evaluate fairness in the models they build.
Gathering data that assesses how well a model works for different demographic groups is difficult. That's due to complex geographic and cultural contexts, inconsistency between sources, and the challenge of labelling accurately.
With this new publicly available resource, researchers can better evaluate the fairness and robustness of their AI models.
This dataset is designed to maximise inclusion by giving AI researchers more samples of people from a wide range of backgrounds.
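To see why such a resource matters in practice, here is a minimal sketch of the kind of per-group evaluation it enables: comparing a model's accuracy across self-identified demographic groups. The field names and values below are illustrative, not v2's actual schema.

```python
# Hypothetical sketch: per-group accuracy on a dataset where each sample
# carries self-identified demographic labels (illustrative schema, not
# the actual Casual Conversations v2 format).
from collections import defaultdict

def per_group_accuracy(samples, group_key):
    """samples: dicts with 'label', 'prediction' and demographic keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        group = s[group_key]
        total[group] += 1
        correct[group] += int(s["prediction"] == s["label"])
    return {g: correct[g] / total[g] for g in total}

# Toy evaluation results for two groups
samples = [
    {"gender": "woman", "label": 1, "prediction": 1},
    {"gender": "woman", "label": 0, "prediction": 1},
    {"gender": "man", "label": 1, "prediction": 1},
    {"gender": "man", "label": 0, "prediction": 0},
]
acc = per_group_accuracy(samples, "gender")
# The spread between the best- and worst-served groups is one simple
# fairness signal a researcher might track.
gap = max(acc.values()) - min(acc.values())
```

A large disparity in per-group accuracy is exactly the kind of signal this dataset is meant to surface; without diverse, well-labelled samples, the per-group buckets are too small or too noisy to measure.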
How was the database created?
The v2 database was informed and shaped by a comprehensive literature review around relevant demographic categories.
The dataset includes more than 25,000 videos from more than 5,000 people across seven countries. Rather than rely on algorithms, people self-identify their age, gender, race and other characteristics such as disability and physical adornments.
Trained experts then added additional metadata, including voice and skin tones.
The videos, featuring paid participants who gave their consent to be in the dataset, included both scripted and unscripted monologues. Participants were also given the chance to speak in both their primary and secondary languages.
Diverse improvements
Unlike Meta’s earlier dataset, which included few categories and only US participants, v2 offers a more granular list of 11 self-provided and annotated categories.
The self-provided categories include age, gender, language/dialect, geolocation, disability, physical adornments and physical attributes.
To further measure algorithmic fairness and robustness in these AI systems, v2 expands its geographic coverage to seven countries: Brazil, India, Indonesia, Mexico, Vietnam, the Philippines and the United States.
The new dataset will help AI developers address concerns around language barriers and physical diversity, which have been problematic in some AI contexts.