twitter-algorithm-ml/projects/twhin/README.md

TwHIN in torchrec

This project contains code for pretraining dense vector embedding features for Twitter entities. Within Twitter, these embeddings are used for candidate retrieval and as model features in a variety of recommender system models.

We obtain entity embeddings based on a variety of graph data within Twitter such as:
  "User follows User"
  "User favorites Tweet"
  "User clicks Advertisement"

While we cannot release the graph data used to train TwHIN embeddings due to privacy restrictions, heavily subsampled, anonymized, open-sourced graph data can be used instead:
https://huggingface.co/datasets/Twitter/TwitterFollowGraph
https://huggingface.co/datasets/Twitter/TwitterFaveGraph

The code expects parquet files with three columns, lhs, rel, and rhs, which refer to the vocab index of the left-hand-side node, relation type, and right-hand-side node of each edge in the graph, respectively.
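
As a rough illustration of that layout, the sketch below turns a few named edges into lhs, rel, rhs integer columns. The edge names, the helper functions, the single shared node vocabulary, and the integer encoding of the relation type are all assumptions made for the example, not the project's actual preprocessing. The `write_parquet` helper additionally assumes the `pyarrow` package is installed.

```python
# Hypothetical toy edges as (lhs_name, relation_name, rhs_name).
RAW_EDGES = [
    ("user:alice", "follows", "user:bob"),
    ("user:bob", "follows", "user:alice"),
    ("user:alice", "favorites", "tweet:42"),
    ("user:bob", "clicks", "ad:7"),
]


def build_index(values):
    """Assign consecutive integer ids to values in first-seen order."""
    index = {}
    for value in values:
        index.setdefault(value, len(index))
    return index


def to_columns(raw_edges):
    """Convert named edges into lhs/rel/rhs integer columns.

    Assumption: one vocabulary shared by all nodes and one for relations.
    """
    nodes = build_index(n for lhs, _, rhs in raw_edges for n in (lhs, rhs))
    rels = build_index(rel for _, rel, _ in raw_edges)
    columns = {"lhs": [], "rel": [], "rhs": []}
    for lhs, rel, rhs in raw_edges:
        columns["lhs"].append(nodes[lhs])
        columns["rel"].append(rels[rel])
        columns["rhs"].append(nodes[rhs])
    return columns, nodes, rels


def write_parquet(columns, path):
    """Write the three columns to a parquet file (requires pyarrow)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.table(columns), path)


columns, nodes, rels = to_columns(RAW_EDGES)
print(columns)
```

Calling `write_parquet(columns, "edges.parquet")` would then produce a file with one row per edge; the file name here is only a placeholder.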

The location of the data must be specified in the YAML configuration files in projects/twhin/configs.


Workflow
========
- Build local development images with `./scripts/build_images.sh`
- Run with `./scripts/docker_run.sh`
- Iterate in image with `./scripts/idocker.sh`
- Run tests with `./scripts/docker_test.sh`
Twitter's Recommendation Algorithm - Heavy Ranker and TwHIN embeddings 2023-03-31 18:05:14 +00:00
