The Cost of Better Data: Rethinking the Data-Centric Debate

Freeman Lewin

Wednesday, June 18, 2025

Data-centric AI is often seen as expensive and slow, but mounting evidence suggests it may be the more effective path forward. Built through a collaboration between Zedge, Perle.ai, and Brickroad, the DSD project used fewer than eight thousand human-verified images to train vision-language models that outperformed their baselines. It offers a powerful case that quality, not quantity, is the real frontier in AI development.

A Paradigm Shift in AI Development

For years, progress in artificial intelligence has been driven by a model-centric approach. The formula has been clear. Improve the architecture, scale the parameters, and train on more data. Researchers poured their energy into designing ever-larger neural networks and refining loss functions, guided by the belief that better models alone would inevitably produce better results.

That belief is now being questioned. A growing number of researchers argue that the true bottleneck is not in the model, but in the data. This view, known as the data-centric approach, treats the dataset not as a fixed input, but as a core lever for performance. According to this philosophy, the structure, accuracy, and richness of the data are just as important as the model that learns from it.

But encouraging developers to shift their focus from models to data is not easy. The model-centric mindset is deeply entrenched. Modern tooling makes it fast and inexpensive to experiment with new architectures or scale existing ones. By contrast, improving a dataset demands time, human labor, domain knowledge, and infrastructure that is rarely reusable. It is slower to iterate, harder to standardize, and more expensive to scale.

Fortunately, empirical evidence is beginning to show that the data-centric approach may prove, despite its costs, to be more effective in the long run. In their 2024 study, Bhatt et al. demonstrated that models trained on cleaner, more structured datasets consistently outperformed those trained on larger but noisier corpora. Their findings suggest that careful data curation can yield gains in accuracy, efficiency, and generalization that rival or exceed the benefits of algorithmic innovation. The study concludes that while data-centric development introduces new frictions, it is increasingly the more durable path to building performant and reliable AI systems.

Why Data Still Lags Behind Models

Despite growing support, the data-centric movement continues to face resistance. Bhatt and his co-authors described several structural challenges that continue to limit its adoption. The most fundamental is cost. High-quality data is expensive to create. It requires annotation teams, subject matter experts, compliance reviews, and validation workflows. For many organizations, these expenses are not easily justified in comparison to the low cost and fast pace of model experimentation.

There is also the matter of access. Some of the most valuable datasets—like proprietary image libraries, medical imaging archives, or transaction records—are controlled by companies, governments, or institutions. Researchers and members of technical staff working outside of these walls often find themselves relying on public benchmarks that are dated, poorly labeled, or riddled with bias.

Finally, the current incentives in AI research still favor model innovation. Leaderboards measure top-line accuracy, not annotation fidelity. Open-source communities are often structured around novel architectures and faster training code, not around better data practices. As a result, even researchers who believe in the promise of data-centric development may struggle to find the funding, tools, or recognition to pursue it.

A Case Study in Curation: The DSD Dataset

The DataSeeds.AI Sample Dataset was created to test whether a smaller, carefully constructed dataset could outperform the sprawling, lightly labeled corpora that dominate AI training today. Launched in June 2025, DSD is a new image dataset built through a collaboration between Zedge's DataSeeds.AI, Perle.ai, and Brickroad. It contains 7,772 images drawn from Zedge’s GuruShots platform—a photography game where millions of users submit and vote on themed photographs.

From the beginning, the DSD was designed around human judgment. The images were not scraped from the web or auto-labeled from social media. Rather, they were selected from a global community of photographers and filtered through a competitive ranking system. Every photo included in the dataset had already been assessed by thousands of human eyes before it ever reached an annotation pipeline.

Once selected, the images were handed off to Perle.ai, an annotation firm that specializes in expert-in-the-loop workflows. Each image was given multiple layers of annotation. These included pixel-level segmentation masks, structured scene descriptions, natural-language captions, stylistic and technical metadata, and categorical tags. A typical image in the DSD might come with three paragraphs of descriptive text, ten segmented regions, and full EXIF metadata capturing lighting, camera, and aperture details.
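
To make that layering concrete, here is a rough sketch of what a single DSD record could look like once loaded into memory. The field names and values below are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical structure of one annotated DSD image record.
# Field names and values are illustrative, not the dataset's real schema.
record = {
    "image_id": "dsd_000123",
    "caption": "A lone hiker crosses a ridge at golden hour, silhouetted against low clouds.",
    "scene_description": {
        "subject": "hiker on a mountain ridge",
        "composition": "rule of thirds, backlit silhouette",
        "mood": "contemplative, expansive",
    },
    "segments": [  # pixel-level regions, e.g. run-length encoded masks
        {"label": "person", "mask_rle": "..."},
        {"label": "sky", "mask_rle": "..."},
    ],
    "tags": ["landscape", "silhouette", "golden hour"],
    "exif": {"camera": "NIKON D750", "aperture": "f/8", "iso": 200, "focal_length_mm": 35},
}
```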

The result is a dataset that is not just richly annotated, but deeply contextual. Each image is treated as a visual story, not a flat object. According to Perle’s research lead, Sajjad Abdoli, this approach enables models to learn not just what is in an image, but how the image was composed, what it evokes, and how it relates to broader visual themes. It offers something that raw pixels and bounding boxes cannot.

The Results: Fewer Images, Better Models

To test its impact, the DSD team fine-tuned two leading vision-language models—LLaVA-NEXT and BLIP2—using the dataset. These models, which convert images into descriptive text, have become key tools for tasks like multimodal search, image captioning, and generative visual reasoning.
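
As a rough illustration of what caption fine-tuning of this kind involves, the sketch below runs a single training step on a BLIP-2 checkpoint using Hugging Face transformers. The checkpoint, optimizer settings, and dummy image are placeholder assumptions and may differ from the setup described in the DSD team's paper.

```python
# A minimal, illustrative caption fine-tuning step for a BLIP-2 checkpoint.
# Checkpoint, hyperparameters, and the dummy image are placeholders only.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One (image, caption) pair standing in for the annotated DSD training data.
image = Image.new("RGB", (224, 224))
caption = "A lone hiker crosses a ridge at golden hour."

inputs = processor(images=image, text=caption, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])  # language-modeling loss
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```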

The fine-tuned versions of both models significantly outperformed their baseline counterparts. LLaVA-NEXT, when trained on DSD, saw a 24 percent improvement in BLEU-4 score, a widely used metric for text quality. Other metrics, including ROUGE-L, CLIPScore, and BERTScore, also improved. BLIP2 showed similar gains, particularly in producing more stylistically accurate and lexically rich captions.
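
For readers who want to run this kind of caption evaluation themselves, here is a minimal sketch using the Hugging Face evaluate library. The example captions are invented, and the scores it prints are not the figures reported above.

```python
# Scoring generated captions against human references with `evaluate`
# (pip install evaluate rouge_score). Captions below are invented examples.
import evaluate

predictions = ["a hiker walks along a mountain ridge at sunset"]
references = [["a lone hiker crosses a ridge at golden hour"]]

bleu = evaluate.load("bleu")    # uses 4-gram precision (BLEU-4) by default
rouge = evaluate.load("rouge")

print(bleu.compute(predictions=predictions, references=references)["bleu"])
print(rouge.compute(predictions=predictions,
                    references=[r[0] for r in references])["rougeL"])
```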

The team also ran a comparative benchmark using AWS Rekognition, a commercial image-tagging API, to test the gap between automated labeling and human annotation. Rekognition achieved an F1 score of just 0.19 against DSD’s human ground truth, highlighting how far generic models still lag behind when it comes to nuanced visual understanding.
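
As a simplified illustration of what a tag-level F1 comparison measures, the snippet below scores one set of predicted labels against a human ground-truth set. The tags are invented, and the DSD benchmark's exact matching rules may differ.

```python
# Set-based F1 between predicted tags and human ground-truth tags.
# Tags are invented for illustration only.
def tag_f1(predicted: set[str], ground_truth: set[str]) -> float:
    true_positives = len(predicted & ground_truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)

predicted = {"outdoor", "person", "nature"}
ground_truth = {"hiker", "mountain ridge", "person", "golden hour"}
print(round(tag_f1(predicted, ground_truth), 2))  # low overlap -> low F1
```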

Taken together, the experiments showed that a dataset of fewer than eight thousand images, when curated with precision, could deliver measurable gains across multiple state-of-the-art models. This directly challenges the idea that larger datasets are always better. It suggests that the future of vision AI may be defined not by the number of images, but by the richness of the annotations that describe them.

Rethinking the Pipeline

The DSD effort also highlights a more practical shift. Rather than treating open-source dataset creation as a sunk cost or an academic exercise, the team designed it as a commercial workflow. Zedge provided the content and the rights. Perle provided the annotation infrastructure. Brickroad served as the project lead and broker, handling the licensing, compliance, and distribution. The dataset is now live on Hugging Face, where it can be downloaded and used freely. A larger commercial version is also immediately available for enterprise use.
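
For anyone who wants to try the public sample, a minimal loading sketch with the Hugging Face datasets library is shown below. The repository id is a placeholder; the actual repo name is listed on the project's Hugging Face page.

```python
# Loading the public DSD sample with the Hugging Face `datasets` library.
# The repository id below is a placeholder; use the actual DSD repo name.
from datasets import load_dataset

dsd = load_dataset("your-org/dataseeds-sample-dataset")
print(dsd)                          # available splits and feature schema

split = next(iter(dsd.values()))    # first available split
example = split[0]                  # one annotated image record
print(example.keys())
```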

We think that this model of public sample and private expansion may be the template for future datasets in a data-centric environment. It allows researchers to validate quality before purchasing, giving suppliers the opportunity to "prove their worth" and price accordingly.

A More Durable Path Forward

The data-centric approach is not a silver bullet. It does not eliminate the need for innovation in model architecture, training algorithms, or compute efficiency. But it reframes the starting point. Instead of optimizing from the top down, it encourages developers to look at the foundation. What if we spent less time tweaking how a model sees, and more time improving what it sees?

That question is no longer rhetorical. The DSD project shows what is possible when data is given the same strategic weight as the model. It demonstrates that annotation quality, dataset structure, and human context can produce gains that scale—and that those gains are more likely to be robust, interpretable, and aligned with human reasoning.

What is left to figure out is whether others will follow.

As of this week, the dataset is available on Hugging Face. Alongside it, we have published the model weights, code, and a full research paper. We invite the community to use it, evaluate it, and improve upon it.






Copyright © 2025 - EmetX Inc. - All rights reserved
