Meta releases dataset to maximise inclusivity and diversity

To help AI researchers make their tools and processes more inclusive, Meta has released a massive, diverse dataset of face-to-face video clips.

Casual Conversations v2 features participants from a broad range of backgrounds, and will help developers assess how well their models work for different demographic groups.

Meta’s VP of civil rights, Roy Austin Jr, said current large language models “lack diversity”. He argues that “the only way to test is to have a diverse model, to have those voices that may not be in the larger models and to be intentional about including them”.

Why it matters

For AI to serve communities fairly, researchers need diverse and inclusive datasets to thoughtfully evaluate fairness in the models they build.

Gathering data that assesses how well a model works for different demographic groups is difficult. That’s due to complex geographic and cultural contexts, inconsistency between different sources and challenges with accuracy in labelling.

With this new publicly available resource, researchers can better evaluate the fairness and robustness of their AI models.
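As a rough illustration of what such an evaluation can look like, the sketch below breaks a model's accuracy down by a self-reported demographic field and reports the gap between the best and worst served groups. It is a minimal sketch under stated assumptions, not Meta's method: the record layout, the `age_group` field and the helper names `accuracy_by_group` and `largest_gap` are hypothetical, and the toy records are made up for the example.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Return {group: accuracy} for records holding 'label', 'prediction' and group_key."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[group_key]
        total[group] += 1
        # Count a hit when the model's prediction matches the reference label.
        correct[group] += int(record["label"] == record["prediction"])
    return {group: correct[group] / total[group] for group in total}


def largest_gap(scores):
    """Difference between the best and worst per-group score."""
    return max(scores.values()) - min(scores.values())


# Hypothetical toy records; the field names are assumptions, not the dataset's schema.
records = [
    {"age_group": "18-30", "label": 1, "prediction": 1},
    {"age_group": "18-30", "label": 0, "prediction": 0},
    {"age_group": "31-45", "label": 1, "prediction": 0},
    {"age_group": "31-45", "label": 1, "prediction": 1},
]

scores = accuracy_by_group(records, "age_group")
gap = largest_gap(scores)
print(scores, gap)
```

A large gap between groups suggests the model serves some people less well than others, which is the kind of signal a demographically labelled dataset makes visible.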

This dataset is designed to maximise inclusion by giving AI researchers more samples of people from a wide range of backgrounds.

How was the dataset created?

The v2 dataset was shaped by a comprehensive literature review of relevant demographic categories.

The dataset includes more than 25,000 videos from more than 5,000 people across seven countries. Rather than rely on algorithms, people self-identify their age, gender, race and other characteristics such as disability and physical adornments.

Trained experts then added further metadata, including voice and skin tones.

The videos, featuring paid participants who gave their consent to be in the dataset, include both scripted and unscripted monologues. Participants were also given the chance to speak in both their primary and secondary languages.

Diverse improvements

Unlike Meta’s earlier dataset, which included few categories and only US participants, v2 offers a more granular list of 11 self-provided and annotated categories.

The self-provided categories include age, gender, language/dialect, geolocation, disability, physical adornments and physical attributes.

To further measure algorithmic fairness and robustness in these AI systems, v2 expanded its geographic coverage to seven countries: Brazil, India, Indonesia, Mexico, Vietnam, the Philippines and the United States.

The new dataset will help AI developers address concerns around language barriers and physical diversity, which have been problematic in some AI contexts.

Zara Powell
