Interactions between proteins and lipids are crucial for numerous cellular processes. As in protein-protein interactions, the interacting segments may be intrinsically disordered regions (IDRs) that gain secondary structure upon binding. We have collected proteins with experimentally annotated lipid-interacting IDRs, termed membrane molecular recognition features (MemMoRFs), available at https://memmorf.hegelab.org
We use this dataset to develop an accurate sequence-based predictor of MemMoRFs, thereby supporting the tedious and relatively costly experimental identification of membrane-interacting IDRs. We evaluated protein language models (pLMs) for this task and found:
- The Ankh pLM performed better than the ProtT5 pLM.
- Selecting an important subset of features can increase model performance.
- Applying our model to the human proteome yielded plausible MemMoRF predictions.
Our work (BIORXIV URL - TODO) underscores the importance of evaluating multiple pLMs for a specific predictive task and of identifying key embedding features to enhance the performance of pLM-based predictors.
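
The idea that a subset of embedding features can serve a predictor better than the full embedding can be illustrated with a minimal sketch. It assumes per-residue embeddings are available as plain lists of floats with binary MemMoRF labels. The toy data, the univariate separation score, and the helper names (`score_features`, `select_top_k`, `make_toy_embeddings`) are hypothetical and chosen only for illustration; they are not the selection procedure used in this work.

```python
import random
from statistics import mean, pstdev


def score_features(X, y):
    """Score each embedding dimension by how well it separates the two classes.

    Uses |mean_pos - mean_neg| / (std_pos + std_neg + eps), a simple
    univariate score picked for illustration only.
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        p = [row[j] for row in pos]
        n = [row[j] for row in neg]
        scores.append(abs(mean(p) - mean(n)) / (pstdev(p) + pstdev(n) + 1e-9))
    return scores


def select_top_k(X, y, k):
    """Return the indices of the k best-scoring features and the reduced matrix."""
    scores = score_features(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in X]


def make_toy_embeddings(n_residues=200, n_features=16,
                        informative=(2, 7, 11), seed=0):
    """Generate synthetic 'embeddings' (hypothetical stand-in for pLM output).

    Only the dimensions listed in `informative` carry a label-dependent shift;
    all other dimensions are pure Gaussian noise.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_residues):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in informative:
            row[j] += 3.0 * label
        X.append(row)
        y.append(label)
    return X, y


X, y = make_toy_embeddings()
kept, X_reduced = select_top_k(X, y, k=3)
print("kept feature indices:", kept)
print("reduced shape:", len(X_reduced), "x", len(X_reduced[0]))
```

In practice the reduced matrix would be passed to a downstream classifier and compared against the full embedding on held-out proteins. Whether the reduction helps depends on the pLM and the task, which is why each case needs its own evaluation.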