Overview

The proteomes of the nine completely sequenced yeast species are classified into families based on all-to-all sequence comparisons and algorithmic consensus clustering, as described in (Nikolski and Sherman, 2007). The raw alignments used in this computation are systematic Smith-Waterman and Blast alignments, both homeomorphic [sharing full-length sequence similarity and similar domain architectures (see Wu et al., 2004)] and nonhomeomorphic. The computed families were systematically compared to external data, namely PIR-SF and PIR-CF families (Wu et al.), Génolevures 2 (GL2) families and Génolevures 3 (GL3) curator-defined homolog groups. The best consensus was chosen using the criteria of coverage (of GL3, GL2 and PIR-SF) and quality metrics internal to the consensus algorithm. These families were further classified into categories, the two principal ones being robust families, which were found using all combinations of statistical parameters and are the most reliable, and consensus families, which were found using a combination of parameters evaluated using a Condorcet election procedure.
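
For readers unfamiliar with the term, a Condorcet election selects the candidate that wins every pairwise majority comparison against the other candidates. The sketch below is a generic illustration of that idea only, not the procedure of Nikolski and Sherman (2007): the function name, the candidate labels (standing in for hypothetical parameter combinations) and the ballots are all invented for the example.

```python
from itertools import combinations

def condorcet_winner(ballots):
    """Return the candidate that wins every pairwise contest, or None.

    Each ballot is a list ranking all candidates, best first.
    """
    candidates = sorted(set(ballots[0]))
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        # Count the ballots that rank a ahead of b.
        a_votes = sum(1 for ranking in ballots if ranking.index(a) < ranking.index(b))
        b_votes = len(ballots) - a_votes
        if a_votes > b_votes:
            wins[a] += 1
        elif b_votes > a_votes:
            wins[b] += 1
    for candidate, count in wins.items():
        if count == len(candidates) - 1:
            return candidate
    return None  # no Condorcet winner (tie or cycle)

# Hypothetical ballots: each ranks three invented parameter combinations.
ballots = [
    ["p2", "p1", "p3"],
    ["p1", "p2", "p3"],
    ["p2", "p3", "p1"],
]
winner = condorcet_winner(ballots)
```

A Condorcet winner does not always exist; when the pairwise preferences form a cycle, the function above returns None.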

Four types of protein families are defined:

  • Robust families GL3R.* were found using all combinations of statistical parameters and are the most reliable.
  • Consensus families GL3C.* were found using a combination of parameters evaluated using a Condorcet election procedure, and in some cases manual curation. They often represent a merge of subfamilies.
  • Multiple choice families GL3M.* have a highly variable composition that depends on the statistical parameters. Many of them concern notoriously complicated families such as polyproteins and repeat domains.
  • Unique families GL3U.* correspond to singletons, i.e. one protein per family.
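
Since the family type is encoded in the identifier prefix, it can be recovered from an identifier with a simple lookup. The following is a minimal sketch; the function name and the example identifiers (in particular the numeric part after the dot) are hypothetical, and only the four prefixes come from the list above.

```python
# Prefixes as listed above; the descriptive labels are for illustration.
FAMILY_TYPES = {
    "GL3R": "robust",
    "GL3C": "consensus",
    "GL3M": "multiple choice",
    "GL3U": "unique",
}

def family_type(family_id):
    """Map a family identifier such as 'GL3R.0001' to its family type."""
    prefix = family_id.split(".", 1)[0]
    return FAMILY_TYPES.get(prefix, "unknown")

# Hypothetical identifiers; the part after the dot is an assumption.
types = [family_type(fid) for fid in ["GL3R.0001", "GL3M.0042", "XYZ.1"]]
```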

Results and Data  

Family identifiers are arbitrary. Each family is associated with a phyletic pattern indicating, for each species, the presence or absence of a protein from that species.
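
As an illustration of what a phyletic pattern encodes, the sketch below derives one from the members of a family. The species codes, the protein names and the encoding as a string of 1 (present) and 0 (absent) are assumptions made for the example; the encoding used in the download files may differ.

```python
# Hypothetical species codes in a fixed order; the real data cover nine species.
SPECIES = ["sp1", "sp2", "sp3", "sp4"]

def phyletic_pattern(members, species=SPECIES):
    """Build a presence/absence string from (species, protein) pairs."""
    present = {sp for sp, _protein in members}
    return "".join("1" if sp in present else "0" for sp in species)

# Hypothetical family with two proteins from sp1 and one from sp3.
members = [("sp1", "protA"), ("sp1", "protB"), ("sp3", "protC")]
pattern = phyletic_pattern(members)
```

Note that the pattern records only presence or absence, so the two proteins from sp1 contribute a single 1.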


Download files

Date        Release of Génolevures  View
2008/01/03  R3C2                    byfamily.txt
2008/01/03  R3C2                    byprotein.txt

Information: data used are Génolevures Release 3 candidate 2 data + Saccharomyces cerevisiae SGD data + Eremothecium gossypii AGD data (2008-01-03).

  • byfamily.txt: Table with one row per family indicating its pattern, its profile, and the names of the genes coding for the proteins.
  • byprotein.txt: Table with one row per protein indicating its family.
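
A per-protein table of this kind can be regrouped by family in a few lines. The sketch below assumes a tab-separated layout with the protein name in the first column and the family identifier in the second; this column layout, the function name and the sample rows are assumptions, so check the actual file before relying on it.

```python
import csv
import io
from collections import defaultdict

def proteins_by_family(handle):
    """Group protein names by family from tab-separated rows.

    Assumed layout: column 1 = protein name, column 2 = family identifier.
    """
    families = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank, short, or comment lines.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        protein, family = row[0], row[1]
        families[family].append(protein)
    return dict(families)

# Hypothetical sample standing in for the contents of byprotein.txt.
sample = io.StringIO("protA\tGL3R.0001\nprotB\tGL3R.0001\nprotC\tGL3U.0002\n")
grouped = proteins_by_family(sample)
```

With a real file, the same function can be called on an open file handle, for example `with open("byprotein.txt", newline="") as fh: proteins_by_family(fh)`.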

References