India, June 2,
2025: Yandex has
published Yambda (Yandex Music Billion-Interactions Dataset), the world’s
largest currently available open dataset for recommender systems, containing
nearly 5 billion anonymized user interactions with audio tracks from its music
streaming platform, Yandex Music.
Yambda serves as
a universal benchmark for testing new approaches and algorithms across domains
that rely on recommender systems, such as e-commerce, social networks, and
short-form video platforms.
The dataset
enables researchers to develop and test new recommender algorithms against its
baseline models, accelerating innovation. Startups with limited data can
use Yambda to build and test their systems before scaling.
This accelerates the creation of advanced technologies tailored to business
needs worldwide.
Bridging the research-industry gap
The quality and
scale of training data are critical to delivering relevant recommendations on
platforms like streaming services, social networks, short-form video apps, and
e-commerce marketplaces. However, research in recommender systems has lagged
behind rapidly advancing fields like large language models, largely due to
limited access to large-scale datasets. Effective recommendation models require
terabytes of behavioral data, which commercial platforms possess but rarely
share publicly.
Researchers are
often left with small, outdated datasets that fail to capture the complexity of
modern usage:
- Spotify’s Million Playlist Dataset is too small for
commercial-scale recommender systems.
- The Netflix Prize dataset, with ~17,000 items and
date-only timestamps, limits temporal modeling and large-scale research.
- The Criteo 1TB Click Logs dataset lacks proper
documentation and identifiers, and focuses narrowly on ad clicks.
“Recommender
systems are inherently tied to sensitive data. Companies can only publish
recommender system datasets publicly after exhaustive anonymization, a
resource-intensive process that’s slowed open innovation,” explains Nikolai
Savushkin, Head of Recommender Systems at Yandex.
This data
scarcity creates a gap: models that excel in academic settings often
underperform in real-world applications. Efforts to integrate recommender
systems with advanced architectures are also constrained by the lack of
suitable training data.
About the Yambda dataset
Yambda addresses
this data gap by providing a massive, anonymized dataset from Yandex Music,
a music streaming service with ~28 million monthly users. The dataset
provides insights into how users interact with the content offered by Yandex
Music, which is known for My Wave, a sophisticated recommendation system
that tailors the listening experience to the tastes of each user.
To protect privacy, all user and track data is anonymized, using numeric
identifiers to meet privacy standards.
Key features of the dataset:
- 4.79 billion anonymized user
interactions collected over 10 months.
- Data from 1 million users and
anonymized descriptors for 9.39 million tracks.
- Includes two feedback types: implicit
interactions (listens) and explicit interactions (likes, dislikes, and
their removal).
- Offers audio embeddings (vector
representations generated via convolutional neural networks) and
anonymized information about tracks.
- Features an "is_organic"
flag marking whether users discovered tracks independently or through
recommendations, enabling deeper behavioral analysis.
- All events are timestamped, which
supports the analysis of user behavior over time and allows models to be
evaluated under conditions that closely resemble real-world use.
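As one illustration of the behavioral analysis the "is_organic" flag makes possible, the sketch below computes the share of each user's events that were organic. It is a minimal example on toy data, not code from the Yambda release: the Event record and every field name other than is_organic are assumptions made here for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    # Hypothetical record layout; only "is_organic" is named in the announcement.
    user_id: int
    track_id: int
    timestamp: int
    is_organic: bool


def organic_share_per_user(events):
    """Return {user_id: fraction of that user's events flagged as organic}."""
    totals = defaultdict(int)
    organic = defaultdict(int)
    for e in events:
        totals[e.user_id] += 1
        if e.is_organic:
            organic[e.user_id] += 1
    return {user: organic[user] / totals[user] for user in totals}


# Toy events, invented for this example.
events = [
    Event(1, 10, 100, True),
    Event(1, 11, 105, False),
    Event(1, 12, 110, False),
    Event(2, 10, 101, True),
]
shares = organic_share_per_user(events)
```

A low share for a user would suggest that most of their listening comes through recommendations rather than independent discovery.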
The dataset is
released in Apache Parquet format, compatible with distributed processing
systems such as Spark or Hadoop and analytical libraries like Pandas and
Polars.
“Yambda empowers
researchers to test innovative hypotheses and businesses to build smarter
recommender systems. Ultimately, users benefit — finding the perfect song,
product, or service effortlessly,” notes Nikolai Savushkin.
Dataset versions and evaluation
Available in
three sizes — approximately 5 billion, 500 million, and 50 million events — the
Yambda dataset accommodates researchers and developers with different needs and
computational resource capacities.
The dataset uses
Global Temporal Split (GTS) for evaluation, a method that splits data by
timestamps to preserve event sequences. Unlike Leave-One-Out, which removes the
last positive interaction from each user’s history for testing, GTS avoids
breaking temporal dependencies between training and test sets. This makes
model testing more realistic, mimicking real-world conditions where future
data is unavailable.
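The difference between the two splitting strategies can be sketched on toy data. This is a minimal illustration rather than the evaluation code shipped with Yambda: events are assumed to be (user, item, timestamp) tuples, the cutoff value is arbitrary, and the Leave-One-Out variant here holds out each user's last event of any kind rather than the last positive one.

```python
def global_temporal_split(events, cutoff):
    """Split events at one global cutoff: earlier events train, later ones test."""
    train = [e for e in events if e[2] < cutoff]
    test = [e for e in events if e[2] >= cutoff]
    return train, test


def leave_one_out_split(events):
    """Hold out each user's latest event; all other events go to training."""
    last = {}
    for e in events:
        user = e[0]
        if user not in last or e[2] > last[user][2]:
            last[user] = e
    test = list(last.values())
    held_out = set(test)
    train = [e for e in events if e not in held_out]
    return train, test


# Toy (user, item, timestamp) events, invented for this example.
events = [
    ("a", "x", 1),
    ("b", "x", 2),
    ("b", "y", 4),
    ("a", "y", 5),
    ("a", "z", 10),
]

gts_train, gts_test = global_temporal_split(events, cutoff=5)
loo_train, loo_test = leave_one_out_split(events)

# With GTS, every training event precedes every test event.
gts_is_causal = max(e[2] for e in gts_train) < min(e[2] for e in gts_test)

# With Leave-One-Out, user "a" contributes a training event at t=5,
# which is later than user "b"'s held-out event at t=4.
loo_leaks_future = max(e[2] for e in loo_train) > min(e[2] for e in loo_test)
```

In this toy case the Leave-One-Out training set contains an event that happened after one of the test events, which is the kind of future information a deployed model would never see.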
Baseline
implementations include MostPop, DecayPop, ItemKNN, iALS, BPR, SANSA, and
SASRec, providing benchmarks for comparing new recommender system approaches.
These baselines are evaluated using standard metrics, including:
- NDCG@k (ranking quality)
- Recall@k (retrieval effectiveness)
- Coverage@k (catalog diversity)
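For readers unfamiliar with these metrics, the sketch below shows one common way to compute them with binary relevance. The exact definitions used in the Yambda benchmark may differ (for example, in how Recall@k is normalized), so the function names and formulas here should be read as assumptions. Recommendation lists are assumed to contain no duplicate items.

```python
import math


def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def coverage_at_k(all_recommendations, catalog_size, k):
    """Share of the catalog that appears in at least one user's top-k list."""
    shown = set()
    for recs in all_recommendations:
        shown.update(recs[:k])
    return len(shown) / catalog_size


# Toy inputs, invented for this example.
recs = ["a", "b", "c"]
relevant = {"a", "c", "d"}
recall = recall_at_k(recs, relevant, k=3)
ndcg = ndcg_at_k(recs, relevant, k=3)
coverage = coverage_at_k([["a", "b", "c"], ["a", "d", "e"]], catalog_size=10, k=2)
```

NDCG@k rewards placing relevant items near the top of the list, Recall@k ignores order within the top k, and Coverage@k is computed across all users' lists rather than per user.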
“When industry
leaders share hard-won tools and data, a rising tide lifts all boats:
researchers gain real-world benchmarks, startups access resources once reserved
for tech giants, and users everywhere enjoy greater personalization,” added
Nikolai Savushkin.
Yambda, the
world’s largest open recommender system dataset, is now available on Hugging Face.
About Yandex
Yandex is a
global technology company that builds intelligent products and services powered
by machine learning. The company’s goal is to help consumers and businesses
better navigate the online and offline world. Since 1997, Yandex has been
delivering world-class, locally relevant search and information services and
has also developed market-leading on-demand transportation services, navigation
products, and other mobile applications for millions of consumers across the
globe.
About My Wave
My Wave, a
personalized recommendation system integrated into the multi-million-user music
streaming service, Yandex Music, employs deep neural models and AI algorithms
to analyze over a thousand factors — including user interactions, customizable
mood/language settings, and real-time music analysis of spectrograms, frequency
ranges, rhythm, vocal tone, and genre. By processing listening history and
track sequences, it dynamically adapts to user preferences, identifies audio
similarities, and predicts musical tastes to deliver tailored suggestions.