
PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications


Updated on 2023.10.30

Get started

Install PyTorch

See https://pytorch.org/ for installation instructions. (Our setup uses PyTorch 2.1.0 with CUDA 11.8.)
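To confirm your local setup, you can print the installed PyTorch and CUDA versions (a quick sanity check; the values on your machine may differ from ours):

import torch

# Installed PyTorch version, the CUDA version it was built against,
# and whether a CUDA-capable GPU is visible to this process.
print(torch.__version__)          # e.g. 2.1.0
print(torch.version.cuda)         # e.g. 11.8
print(torch.cuda.is_available())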

Installation

pip install transformers==4.34.1
pip install datasets==2.14.6
pip install lightning==2.1.0
pip install wandb

Try our pre-trained models

We release all models and tokenizers through the HuggingFace transformers library. You can use them directly with the library or download them from the model hub.

For example

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("AI4Protein/deep_base")
model = AutoModelForMaskedLM.from_pretrained("AI4Protein/deep_base")

Tokenizing proteins

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("AI4Protein/deep_bpe_3200")
sequence = "MSLGAKPFGEKKFIEIKGRRM"
tokens = tokenizer.tokenize(sequence)
token_ids = tokenizer.encode(sequence)  # token IDs, including start/end special tokens
print(tokens)
# ['M', 'SLG', 'AK', 'PF', 'GE', 'KK', 'FI', 'EI', 'KG', 'RR', 'M']
print(token_ids)
# [1, 16, 331, 95, 197, 107, 56, 109, 180, 124, 48, 16, 2] (1 is the start token, 2 is the end token)
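The encoded IDs can be mapped back to their sub-word strings with convert_ids_to_tokens (a small sketch; the exact special-token strings depend on the tokenizer's configuration):

# Recover the sub-word strings behind each ID; the first and last entries
# are the tokenizer's start and end special tokens.
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(sequence)))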

Generating hidden states for proteins

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("AI4Protein/deep_base")
model = AutoModel.from_pretrained("AI4Protein/deep_base")

sequences = [
    "MSLGAKPFGEKKFIEIKGRRM",
    "MKFLQVLPAL",
    "MKLLVVLSLVAVACNAS",
    "MKIAGID",
]

tensors = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt", max_length=1024)

input_ids = tensors["input_ids"]
attention_mask = tensors["attention_mask"]

outputs = model(input_ids, attention_mask=attention_mask)
hidden_state = outputs.last_hidden_state
print(hidden_state.shape)
# torch.Size([4, 23, 768])
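To get a single fixed-size embedding per protein, one common choice is to mean-pool the hidden states over the non-padding positions using the attention mask (a sketch of one pooling strategy; it is not the pooling heads used by peta/train.py):

# Zero out padded positions, then average over the valid tokens of each sequence.
mask = attention_mask.unsqueeze(-1).float()      # [batch, seq_len, 1]
summed = (hidden_state * mask).sum(dim=1)        # [batch, hidden_size]
counts = mask.sum(dim=1).clamp(min=1.0)          # [batch, 1]
mean_embeddings = summed / counts
print(mean_embeddings.shape)
# torch.Size([4, 768])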

Available Models and Tokenizers

| model_path                   | tokenization type | vocab_size |
| ---------------------------- | ----------------- | ---------- |
| AI4Protein/deep_base         | per-AA            | 33         |
| AI4Protein/deep_bpe_50       | BPE               | 50         |
| AI4Protein/deep_bpe_100      | BPE               | 100        |
| AI4Protein/deep_bpe_200      | BPE               | 200        |
| AI4Protein/deep_bpe_400      | BPE               | 400        |
| AI4Protein/deep_bpe_800      | BPE               | 800        |
| AI4Protein/deep_bpe_1600     | BPE               | 1600       |
| AI4Protein/deep_bpe_3200     | BPE               | 3200       |
| AI4Protein/deep_unigram_50   | Unigram           | 50         |
| AI4Protein/deep_unigram_100  | Unigram           | 100        |
| AI4Protein/deep_unigram_200  | Unigram           | 200        |
| AI4Protein/deep_unigram_400  | Unigram           | 400        |
| AI4Protein/deep_unigram_800  | Unigram           | 800        |
| AI4Protein/deep_unigram_1600 | Unigram           | 1600       |
| AI4Protein/deep_unigram_3200 | Unigram           | 3200       |
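The effect of vocabulary size can be inspected directly: larger BPE/Unigram vocabularies merge amino acids into longer sub-word units, so the same sequence is split into fewer tokens. A small sketch (the exact segmentations produced by each tokenizer are illustrative, not guaranteed):

from transformers import AutoTokenizer

sequence = "MSLGAKPFGEKKFIEIKGRRM"
for name in ["AI4Protein/deep_base", "AI4Protein/deep_bpe_400", "AI4Protein/deep_bpe_3200"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    tokens = tokenizer.tokenize(sequence)
    # Larger vocabularies generally yield fewer, longer tokens for the same protein.
    print(name, len(tokens), tokens)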

Evaluating pre-trained models on the PETA benchmark

Download the benchmark datasets from https://drive.google.com/file/d/1o1yIE18WPOVJ8gBL5xcZEldtLSazRVYb/view?usp=sharing

unzip benchmark_datasets.zip
ls ft_datasets

Evaluation command

Note: PRECISION='bf16' requires a GPU with bfloat16 support (Ampere architecture or newer, e.g. RTX 3090, Tesla A100). If your GPU does not support bf16, use PRECISION='fp16' instead.
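If you are not sure whether your GPU supports bf16, PyTorch can report it directly (a quick check before choosing PRECISION):

import torch

# True on GPUs with native bfloat16 support (Ampere or newer); otherwise fall back to fp16.
print(torch.cuda.is_bf16_supported())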

export PYTHONPATH="$PYTHONPATH:./"

DATASET="gb1"
SPLIT_METHOD="one_vs_rest"
BATCH_SIZE=128
MODEL="AI4Protein/deep_base"
POOLING_HEAD="attention1d"
DEVICES=1
NUM_NODES=1
SEED=3407
PRECISION='bf16'
MAX_EPOCHS=100
ACC_BATCH=1
LR=1e-3
PATIENCE=20
STRATEGY="auto"
FINETUNE="head"

python peta/train.py \
--dataset $DATASET \
--split_method $SPLIT_METHOD \
--batch_size $BATCH_SIZE \
--model $MODEL \
--pooling_head $POOLING_HEAD \
--devices $DEVICES \
--strategy $STRATEGY \
--num_nodes $NUM_NODES \
--seed $SEED \
--precision $PRECISION \
--max_epochs $MAX_EPOCHS \
--accumulate_grad_batches $ACC_BATCH \
--lr $LR \
--patience $PATIENCE \
--finetune $FINETUNE \
--wandb_project ft-$DATASET \
--wandb

Run without wandb logging (optional)

export PYTHONPATH="$PYTHONPATH:./"

DATASET="gb1"
SPLIT_METHOD="one_vs_rest"
BATCH_SIZE=128
MODEL="AI4Protein/deep_base"
POOLING_HEAD="attention1d"
DEVICES=1
NUM_NODES=1
SEED=3407
PRECISION='bf16'
MAX_EPOCHS=100
ACC_BATCH=1
LR=1e-3
PATIENCE=20
STRATEGY="auto"
FINETUNE="head"

python peta/train.py \
--dataset $DATASET \
--split_method $SPLIT_METHOD \
--batch_size $BATCH_SIZE \
--model $MODEL \
--pooling_head $POOLING_HEAD \
--devices $DEVICES \
--strategy $STRATEGY \
--num_nodes $NUM_NODES \
--seed $SEED \
--precision $PRECISION \
--max_epochs $MAX_EPOCHS \
--accumulate_grad_batches $ACC_BATCH \
--lr $LR \
--patience $PATIENCE \
--finetune $FINETUNE

You can find all available datasets and splits in peta/dataset.py.

If you want to use your own dataset, refer to peta/dataset.py and peta/train.py to write your own dataset class. You are welcome to open a pull request to contribute your dataset.
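As a rough starting point, a custom dataset usually only needs to expose sequences and labels. The sketch below is a generic PyTorch Dataset over a CSV file with hypothetical 'sequence' and 'label' columns; it is an illustration, not the interface defined in peta/dataset.py:

import pandas as pd
from torch.utils.data import Dataset

class MyProteinDataset(Dataset):
    # Hypothetical example: reads a CSV with 'sequence' and 'label' columns.
    def __init__(self, csv_path):
        self.df = pd.read_csv(csv_path)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        return row["sequence"], float(row["label"])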

🙋‍♀️ Feedback and Contact

🛡️ License

This project is under the MIT license. See LICENSE for details.

🙏 Acknowledgement

Much of the code is adapted from 🤗 transformers and Lightning-AI.

📝 Citation

If you find this repository useful, please consider citing this paper:

@article{tan2024peta,
  title={PETA: evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications},
  author={Tan, Yang and Li, Mingchen and Zhou, Ziyi and Tan, Pan and Yu, Huiqun and Fan, Guisheng and Hong, Liang},
  journal={Journal of Cheminformatics},
  volume={16},
  number={1},
  pages={92},
  year={2024},
  publisher={Springer}
}
