Meta and Google Introduce New Auto-Curation Method for Self-Supervised Learning Datasets

June 02, 2024 at 6:31:28 AM

TL;DR Meta and Google researchers have developed a new method for automatically curating high-quality datasets for self-supervised learning (SSL). This technique uses embedding models and clustering algorithms to create large, diverse, and balanced datasets without manual annotation. Experiments show that models trained on these auto-curated datasets perform nearly as well as those trained on manually curated ones, reducing costs and effort in dataset curation.

Meta and Google researchers, along with collaborators from INRIA and Université Paris-Saclay, have introduced a new technique for automatically curating high-quality datasets for self-supervised learning (SSL). The method uses embedding models and clustering algorithms to create large, diverse, and balanced datasets without manual annotation.

Balanced Datasets in Self-Supervised Learning

Self-supervised learning, which trains models on unlabeled data, is crucial for modern AI applications such as large language models and visual encoders. However, the quality of the datasets is critical for the performance of SSL models. Randomly assembled datasets from the internet often have skewed distributions, leading to biases in the models.

The researchers emphasize that datasets for SSL should be large, diverse, and balanced. Manual curation, though less time-consuming than labeling, remains a bottleneck in scaling model training.

Automatic Dataset Curation

The proposed automatic curation technique involves:

  1. Feature Extraction: A model computes embeddings, which are numerical representations of the semantic and conceptual features of the data.
  2. Clustering: k-means groups data points by the similarity of their embeddings. On its own, however, k-means tends to over-represent dominant concepts, because it places more clusters where the data is densest.
  3. Hierarchical Clustering: A multi-step hierarchical k-means approach is applied to create balanced clusters. This method builds a tree of data clusters in a bottom-up manner, ensuring well-represented concepts at each level.

This technique is described as a "generic curation algorithm agnostic to downstream tasks," capable of inferring interesting properties from uncurated data sources.
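
The three steps above can be illustrated with a minimal sketch. It is not the researchers' implementation: it assumes only two levels of clustering, uses toy 2-D points in place of real embeddings, and relies on a simple round-robin draw across clusters as a stand-in for their sampling procedure. The function names (sqdist, assign, kmeans, balanced_sample) and all parameter values are hypothetical and chosen for illustration.

```python
import random


def sqdist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sqdist(p, centroids[j]))
            for p in points]


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means on a list of tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k randomly chosen points as starting centroids
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assign(points, centroids)


def balanced_sample(points, k_leaf, k_top, per_top, seed=0):
    """Return indices of points drawn evenly across a two-level cluster tree."""
    rng = random.Random(seed)
    # Level 1: cluster the raw embeddings.
    leaf_centroids, leaf_of_point = kmeans(points, k_leaf, seed=seed)
    # Level 2: cluster the level-1 centroids, building the tree bottom-up.
    _, top_of_leaf = kmeans(leaf_centroids, k_top, seed=seed)

    members = {}  # leaf id -> indices of the points it contains
    for i, leaf in enumerate(leaf_of_point):
        members.setdefault(leaf, []).append(i)
    leaves_in_top = {}  # top-level id -> its non-empty leaves
    for leaf in members:
        leaves_in_top.setdefault(top_of_leaf[leaf], []).append(leaf)

    chosen = []
    for top in sorted(leaves_in_top):
        # Shuffle each leaf, then take turns across leaves until the quota is met.
        pools = [rng.sample(members[leaf], len(members[leaf]))
                 for leaf in leaves_in_top[top]]
        taken = 0
        while taken < per_top and any(pools):
            for pool in pools:
                if pool and taken < per_top:
                    chosen.append(pool.pop())
                    taken += 1
    return chosen


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy "embeddings": one dominant concept and two rare ones (assumed data).
    blobs = [((0.0, 0.0), 240), ((10.0, 10.0), 30), ((-10.0, 10.0), 30)]
    data = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for (cx, cy), n in blobs for _ in range(n)]
    picked = balanced_sample(data, k_leaf=12, k_top=3, per_top=20)
    print(len(picked), "points selected out of", len(data))
```

Each top-level cluster contributes at most per_top points, so a concept that dominates the raw pool cannot dominate the sample simply by being more numerous. How well the clusters line up with actual concepts depends on the embeddings and on the k-means initialization, which this sketch does not tune.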

[Figure: hierarchical k-means sampling]

Evaluating Auto-Curated Datasets

The researchers conducted extensive experiments on computer vision models trained on datasets curated with hierarchical clustering. Key findings include:

  • Improved performance on image classification benchmarks, especially on out-of-distribution examples.
  • Better performance on retrieval benchmarks.
  • Models trained on automatically curated datasets performed nearly on par with those trained on manually curated datasets.

The algorithm was also applied to text data and satellite imagery, leading to significant improvements across all benchmarks. Models trained on well-balanced datasets could compete with state-of-the-art models while using fewer examples.

Implications

The automatic dataset curation technique has significant implications for applied machine learning projects, particularly in industries where labeled and curated data is scarce. It can reduce the costs associated with annotation and manual curation, making model training more scalable and efficient. This method could be especially beneficial for large companies like Meta and Google, which possess vast amounts of raw data.

The researchers believe that automatic dataset curation will become increasingly important in future training pipelines.
