System Design for Modern Recommendation System

Contents
System Requirement
- Design a recommender system that recommends the top 10 items to each user on an e-commerce platform
- Model retraining pipeline
Define the Business Goal
What is valuable to the e-commerce platform?
- Most of an e-commerce platform's income comes from taking a share of the cash flow between customers and sellers, so increasing the platform's total transaction amount is what we care about.
Setting the Online Evaluation Metric Based on Our Goal
- GMV (Gross Merchandise Value) = Page View * Checkout Conversion Rate * Average Order Value
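The GMV decomposition above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `gmv` and all traffic numbers are made up, and it simply shows that, with page views and average order value held fixed, a relative lift in checkout conversion rate carries over one-to-one into GMV.

```python
def gmv(page_views: int, checkout_cvr: float, average_order_value: float) -> float:
    """GMV = Page View * Checkout Conversion Rate * Average Order Value."""
    return page_views * checkout_cvr * average_order_value


# Hypothetical numbers, chosen only to illustrate the arithmetic.
baseline = gmv(page_views=1_000_000, checkout_cvr=0.020, average_order_value=30.0)
improved = gmv(page_views=1_000_000, checkout_cvr=0.022, average_order_value=30.0)

# Relative GMV lift caused by the higher checkout conversion rate.
relative_lift = improved / baseline - 1.0
print(f"baseline={baseline:.0f} improved={improved:.0f} lift={relative_lift:.1%}")
```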
In the early stage of a recommender system, it is common to track several online evaluation metrics at once, for example:
- checkout CVR (conversion rate)
- add-to-cart CVR
- favorite CVR
Define the Offline Evaluation for the Ranking System
How to define the score for ranking the items?
- score = p1^a * p2^b * p3^c * price of item
  - p1 = checkout conversion rate
  - p2 = add-to-cart conversion rate
  - p3 = click-through rate
  - a, b, c are tunable hyperparameters
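The score formula above translates directly into code. This is a minimal sketch: the function name `ranking_score` and the default exponents of 1.0 are assumptions for illustration, not values given in the text.

```python
def ranking_score(p1: float, p2: float, p3: float, price: float,
                  a: float = 1.0, b: float = 1.0, c: float = 1.0) -> float:
    """score = p1^a * p2^b * p3^c * price of item.

    p1: checkout conversion rate
    p2: add-to-cart conversion rate
    p3: click-through rate
    a, b, c: tunable hyperparameters (the defaults here are placeholders)
    """
    return (p1 ** a) * (p2 ** b) * (p3 ** c) * price


# Hypothetical rates for one item.
score = ranking_score(p1=0.02, p2=0.10, p3=0.30, price=50.0)

# Setting an exponent to 0 removes that signal from the score entirely.
checkout_only = ranking_score(p1=0.02, p2=0.10, p3=0.30, price=50.0, b=0.0, c=0.0)
```

Raising an exponent above 1 penalizes items with a low rate on that signal more strongly, which is one way to weight checkout behavior over clicks.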
Evaluation metrics
- MAP (Mean Average Precision)
- MRR (Mean Reciprocal Rank)
- NDCG (Normalized Discounted Cumulative Gain)
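For reference, the per-user building blocks of these three metrics can be sketched as follows. Each function takes the relevance labels of one ranked list, in ranked order; MAP and MRR are then the means of `average_precision` and `reciprocal_rank` over all users. The function names are illustrative, and `average_precision` assumes every relevant item for the user appears somewhere in the list.

```python
import math
from typing import Sequence


def reciprocal_rank(relevances: Sequence[float]) -> float:
    """1 / rank of the first relevant item, or 0.0 if nothing is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def average_precision(relevances: Sequence[float]) -> float:
    """Mean of precision@k over the positions k that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def dcg(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top k positions."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```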
Define Each Row Structure of the Dataset
- <user profile, item profile, score>
- User profile
  - User ID
  - User preference
  - User behavior
  - Demographic properties
- Item profile
  - Item ID
  - Seller's information
  - Item content
  - Item statistical features
    - user rating
    - like, click, and buying rates over the last 7/30 days
- Score
  - Calculated from the user-item interactions
  - score = p1^a * p2^b * p3^c * price of item
    - p1 = checkout conversion rate
    - p2 = add-to-cart conversion rate
    - p3 = click-through rate
    - a, b, c are tunable hyperparameters
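One way to picture a single row is as three nested records. The sketch below uses dataclasses; every field name and type is an assumption made for illustration, since the text only names the groups of features.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UserProfile:
    user_id: str
    preferences: List[str] = field(default_factory=list)      # e.g. favorite categories
    behavior: Dict[str, int] = field(default_factory=dict)    # e.g. event counts
    demographics: Dict[str, str] = field(default_factory=dict)


@dataclass
class ItemProfile:
    item_id: str
    seller_info: Dict[str, str] = field(default_factory=dict)
    content: str = ""
    # Statistical features: user rating, like/click/buying rates over 7/30 days.
    stats: Dict[str, float] = field(default_factory=dict)


@dataclass
class TrainingRow:
    user: UserProfile
    item: ItemProfile
    score: float  # label computed from the user-item interactions


row = TrainingRow(
    user=UserProfile(user_id="u1", preferences=["shoes"]),
    item=ItemProfile(item_id="i1", stats={"click_rate_7d": 0.3}),
    score=0.03,
)
```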
Baseline Ranking Algorithm
- Is there an existing ranking service already running online?
- Rule based
  - Rank directly by each item's like, click, and buying rates
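A rule-based baseline of this kind needs no model at all. In the sketch below, combining the three rates as a weighted sum is an assumption (the text only says to rank by them directly), and the dictionary keys are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def baseline_rank(items: Sequence[Dict], k: int = 10,
                  weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> List[str]:
    """Return the IDs of the top-k items by a weighted sum of their rates."""
    w_like, w_click, w_buy = weights

    def rule_score(item: Dict) -> float:
        return (w_like * item["like_rate"]
                + w_click * item["click_rate"]
                + w_buy * item["buy_rate"])

    ranked = sorted(items, key=rule_score, reverse=True)
    return [item["item_id"] for item in ranked[:k]]


# Hypothetical item statistics.
items = [
    {"item_id": "a", "like_rate": 0.10, "click_rate": 0.30, "buy_rate": 0.02},
    {"item_id": "b", "like_rate": 0.20, "click_rate": 0.40, "buy_rate": 0.05},
    {"item_id": "c", "like_rate": 0.05, "click_rate": 0.10, "buy_rate": 0.01},
]
top_two = baseline_rank(items, k=2)
```

Because every user sees the same list, this baseline is mainly useful as a reference point that any learned ranking model should beat.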
Ranking Model
- Matrix factorization
- Two-tower model
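To make the simpler of the two options concrete, here is a minimal matrix factorization trained with plain SGD in pure Python. It is a sketch under stated assumptions: the function names, hyperparameters, and toy interactions are all made up, and a production system would use a dedicated library. The two-tower model follows the same "dot product of a user vector and an item vector" idea, but computes each vector with a neural network over the profile features.

```python
import random
from typing import List, Sequence, Tuple


def train_mf(interactions: Sequence[Tuple[int, int, float]], n_users: int, n_items: int,
             dim: int = 4, lr: float = 0.05, reg: float = 0.01,
             epochs: int = 300, seed: int = 0):
    """Learn user factors P and item factors Q so that P[u] . Q[i] ~ score."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, y in interactions:
            err = y - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(dim):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P: List[List[float]], Q: List[List[float]], u: int, i: int) -> float:
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))


# Toy data: user 0 likes item 0, user 1 likes item 1.
data = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 0.0), (1, 1, 1.0)]
P, Q = train_mf(data, n_users=2, n_items=2)
```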
System Design Architecture

- Vector database
  - Stores the item vectors produced by offline model inference
- Kafka
  - Handles user events in real time
Design Deep Dive - Multi-Stage Ranking System
- Relying on the two-tower model alone is likely to hit a performance ceiling
- There is usually a business need to insert particular items into the results

Phase 1: Retrieval
- Purpose
  - First stage: generate the potential candidates
  - Reduce the computational load on the later stages
- Multiple recall channels
  - Item-based collaborative filtering
  - Content-based filtering
    - Good for cold start
- Model
  - Two-tower model
    - Store the item vectors in the vector database
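The retrieval stage can be sketched as two small pieces: a top-k nearest-vector lookup and a merge of the candidates coming from each recall channel. The brute-force `vector_recall` below is only a stand-in for a real vector database query, and the round-robin merge policy in `merge_channels` is an assumption; the text does not say how channels are combined.

```python
import heapq
from typing import Dict, List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def vector_recall(user_vec: Sequence[float],
                  item_vecs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Brute-force stand-in for a vector database top-k inner-product query."""
    return heapq.nlargest(k, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id]))


def merge_channels(channels: Sequence[Sequence[str]], limit: int) -> List[str]:
    """Round-robin over the channels, skipping item IDs already taken."""
    seen, merged = set(), []
    longest = max((len(c) for c in channels), default=0)
    for rank in range(longest):
        for channel in channels:
            if rank < len(channel) and channel[rank] not in seen:
                seen.add(channel[rank])
                merged.append(channel[rank])
                if len(merged) == limit:
                    return merged
    return merged


# Hypothetical item vectors from the item tower and one user vector.
item_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
two_tower_hits = vector_recall([1.0, 0.0], item_vecs, k=2)
candidates = merge_channels([two_tower_hits, ["b", "a"], ["d"]], limit=4)
```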
Phase 2: Filter
- Purpose
  - Filter out items the user has already bought
  - Filter out items with no remaining stock
  - Filter out items the user has scored 0
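The three filter rules above fit into one pass over the candidate list. In this sketch the argument names and data shapes are assumptions, and "empty remaining" is read as "out of stock".

```python
from typing import Dict, List, Sequence, Set


def filter_candidates(candidates: Sequence[str], purchased: Set[str],
                      stock: Dict[str, int], user_scores: Dict[str, float]) -> List[str]:
    """Drop candidates that should never reach the ranking stage."""
    kept = []
    for item_id in candidates:
        if item_id in purchased:            # already bought by the user
            continue
        if stock.get(item_id, 0) <= 0:      # nothing left to sell
            continue
        if user_scores.get(item_id) == 0:   # explicitly scored 0 by the user
            continue
        kept.append(item_id)
    return kept


kept = filter_candidates(
    candidates=["a", "b", "c", "d"],
    purchased={"a"},
    stock={"a": 5, "b": 0, "c": 3, "d": 7},
    user_scores={"d": 0},
)
```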
Phase 3: Rank
- Purpose
  - Use a more complex model to capture the dependencies between the features and the output score, producing a more accurate ranking
- Model
  - GBDT
  - Deep & Cross Network
  - Factorization Machine
- Cache the ranking result
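As an example of how a ranking model captures feature interactions, the sketch below scores one feature vector with a second-order Factorization Machine, using the standard linear-time rewrite of the pairwise term. The parameter values are made up and no training is shown; it only illustrates the scoring step.

```python
from typing import List, Sequence


def fm_score(x: Sequence[float], w0: float, w: Sequence[float],
             V: List[List[float]]) -> float:
    """Second-order FM: w0 + sum_i w_i x_i + sum_{i<j} <V_i, V_j> x_i x_j.

    The pairwise sum is computed per latent factor f as
    0.5 * ((sum_i V[i][f] x_i)^2 - sum_i (V[i][f] x_i)^2).
    """
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


# Hypothetical parameters: 3 features, 2 latent factors.
x = [1.0, 2.0, 0.0]
w0, w = 0.1, [0.1, 0.2, 0.3]
V = [[1.0, 0.0], [0.5, 1.0], [2.0, 2.0]]
score = fm_score(x, w0, w, V)
```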
Phase 4: Filter
- Purpose
  - Apply the filters the user has selected on the page
    - price
    - product type
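Unlike the Phase 2 rules, these filters come from the user's own selections on the page. A minimal sketch, assuming each ranked item is a dictionary with hypothetical `price` and `product_type` keys:

```python
from typing import Dict, List, Optional, Sequence, Set


def apply_page_filters(items: Sequence[Dict], min_price: Optional[float] = None,
                       max_price: Optional[float] = None,
                       product_types: Optional[Set[str]] = None) -> List[Dict]:
    """Keep only items matching the selected filters; None means 'not selected'."""
    kept = []
    for item in items:
        if min_price is not None and item["price"] < min_price:
            continue
        if max_price is not None and item["price"] > max_price:
            continue
        if product_types is not None and item["product_type"] not in product_types:
            continue
        kept.append(item)  # the ranking order is preserved
    return kept


ranked = [
    {"item_id": "a", "price": 120.0, "product_type": "shoes"},
    {"item_id": "b", "price": 40.0, "product_type": "shoes"},
    {"item_id": "c", "price": 35.0, "product_type": "bags"},
]
visible = apply_page_filters(ranked, max_price=100.0, product_types={"shoes"})
```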
Phase 5: Re-rank
- Purpose
  - Avoid placing similar items next to each other
  - Insert a particular item after every N items for business purposes
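Both re-ranking goals can be sketched as two passes over the ranked list. The code below treats "similar" as "same category", which is an assumption, and uses a simple greedy spread; the names `rerank`, `category_of`, and `promoted` are hypothetical.

```python
from typing import Dict, List, Sequence


def rerank(ranked: Sequence[str], category_of: Dict[str, str],
           promoted: Sequence[str], every_n: int) -> List[str]:
    """Spread same-category items apart, then insert promoted items."""
    # Pass 1: greedily pick the best remaining item whose category differs
    # from the previous pick; fall back to the best remaining item.
    remaining, spread = list(ranked), []
    while remaining:
        prev = category_of[spread[-1]] if spread else None
        pick = next((idx for idx, item in enumerate(remaining)
                     if category_of[item] != prev), 0)
        spread.append(remaining.pop(pick))

    # Pass 2: after every `every_n` organic items, insert one promoted item
    # until the promoted list is used up.
    result, promos = [], iter(promoted)
    for pos, item in enumerate(spread, start=1):
        result.append(item)
        if pos % every_n == 0:
            promo = next(promos, None)
            if promo is not None:
                result.append(promo)
    return result


categories = {"a1": "a", "a2": "a", "a3": "a", "b1": "b", "b2": "b"}
final = rerank(["a1", "a2", "b1", "a3", "b2"], categories, promoted=["p1"], every_n=2)
```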
Design Deep Dive - Training Pipeline

- Item index/model update
  - Full update
    - Update the model and the item index every night at 00:00, when system concurrency is lowest
  - Incremental update
    - Moves the system toward online learning
    - Update the model and the item index every 5 minutes or every hour
    - Only update the items with new IDs
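The difference between the two update modes can be sketched on the item index alone. Here the index is a plain dictionary standing in for the vector database, and `embed` is a hypothetical callable standing in for offline inference with the item tower; model retraining itself is not shown.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def full_update(embed: Callable[[Dict], Vector],
                all_items: Dict[str, Dict]) -> Dict[str, Vector]:
    """Nightly job: rebuild the whole item index from scratch."""
    return {item_id: embed(features) for item_id, features in all_items.items()}


def incremental_update(index: Dict[str, Vector], embed: Callable[[Dict], Vector],
                       all_items: Dict[str, Dict]) -> Sequence[str]:
    """Frequent job: embed only the items whose IDs are not in the index yet."""
    new_ids = [item_id for item_id in all_items if item_id not in index]
    for item_id in new_ids:
        index[item_id] = embed(all_items[item_id])
    return new_ids


# Toy embedding function, for illustration only.
def toy_embed(features: Dict) -> Vector:
    return [float(features["price"]), float(len(features["title"]))]


catalog = {"a": {"price": 10, "title": "mug"}}
index = full_update(toy_embed, catalog)

catalog["b"] = {"price": 25, "title": "lamp"}
added = incremental_update(index, toy_embed, catalog)
```

Existing items keep the vectors from the last full update, so their embeddings may drift out of date until the next nightly rebuild.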