Interactions between proteins and lipids are crucial for numerous cellular processes. As in protein-protein interactions, the interacting segments may be intrinsically disordered regions (IDRs) that gain secondary structure upon binding. We have collected proteins with experimentally annotated lipid-interacting IDRs, termed membrane molecular recognition features (MemMoRFs): https://memmorf.hegelab.org

We use this dataset to develop an accurate sequence-based predictor of MemMoRFs, thereby supporting the tedious and relatively costly experimental identification of membrane-interacting IDRs. We explored the use of protein language models (pLMs) for this task and found:

  • The Ankh pLM performed better than the ProtT5 pLM.
  • Selecting an informative subset of embedding features can increase model performance.
  • Applying our model to predict MemMoRFs in the human proteome yielded plausible results.
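
The feature-selection point can be illustrated with a minimal sketch. This is not the method used in this work: the data are synthetic stand-ins for per-residue pLM embeddings, the ranking criterion (absolute Pearson correlation with a binary MemMoRF label) is an assumption chosen for simplicity, and all function names are hypothetical.

```python
import random
from statistics import mean, pstdev


def feature_label_correlation(column, labels):
    """Absolute Pearson correlation between one feature column and 0/1 labels."""
    sx, sy = pstdev(column), pstdev(labels)
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no signal
    mx, my = mean(column), mean(labels)
    cov = mean((x - mx) * (y - my) for x, y in zip(column, labels))
    return abs(cov / (sx * sy))


def select_top_k_features(embeddings, labels, k):
    """Return the sorted indices of the k features most correlated with the labels."""
    n_features = len(embeddings[0])
    scores = [
        feature_label_correlation([row[j] for row in embeddings], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def make_toy_data(n_residues=400, n_features=16, informative=(2, 7, 11), seed=0):
    """Synthetic data (assumption): Gaussian noise, with a label-dependent
    shift added to a few 'informative' embedding dimensions."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_residues)]
    embeddings = []
    for y in labels:
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in informative:
            row[j] += 2.0 * y
        embeddings.append(row)
    return embeddings, labels


embeddings, labels = make_toy_data()
selected = select_top_k_features(embeddings, labels, k=3)
# Keep only the selected columns; a downstream classifier would train on these.
reduced = [[row[j] for j in selected] for row in embeddings]
print("selected feature indices:", selected)
```

In practice the reduced matrix would replace the full embedding as input to the classifier, and the selection would be done on training data only to avoid leaking label information into the evaluation.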

Our work (BIORXIV URL - TODO) underscores the importance of evaluating various pLMs for specific predictive tasks and of identifying key embedding features to enhance the performance of pLM-based predictors.