CD-HIT is a very widely used program for clustering and comparing protein or nucleotide sequences. CD-HIT was originally developed by Dr. Weizhong Li at Dr. Adam Godzik’s Lab at the Burnham Institute (now Sanford-Burnham Medical Research Institute).
CD-HIT is very fast and can handle extremely large databases. CD-HIT helps to significantly reduce the computational and manual effort in many sequence analysis tasks, and aids in understanding the data structure and correcting the bias within a dataset. (from the CD-HIT home page)
Load the module
module load cdhit
Basic usage, with default settings (90% identity threshold)
cd-hit -i someproteins.fasta -o my.out
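Cluster with a minimum of 70% sequence identity (-c 0.7), using word length 5 (-n 5) and writing sequence names up to the first space in the .clstr file (-d 0)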
cd-hit -i all.prot.fasta -o cdhit.70.out -c 0.7 -n 5 -d 0
Split the clusters file into one fasta file per cluster, keeping only clusters with at least 10 proteins, and place the files in the directory clusters70
make_multi_seq.pl all.prot.fasta cdhit.70.out.clstr clusters70 10
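Before splitting, it can help to check how many clusters actually meet the size cutoff. The following is a minimal sketch, not part of CD-HIT: count_clusters is a hypothetical helper name, and it assumes the usual .clstr layout in which each cluster starts with a ">Cluster N" line followed by one line per member sequence.

```shell
# count_clusters: hypothetical helper (not shipped with CD-HIT).
# Prints "<cluster id> <member count>" for every cluster in a .clstr
# file that has at least min_size members (default 1).
# Assumes each cluster begins with a ">Cluster N" header line and that
# every following non-empty line is one member sequence.
count_clusters() {
    clstr_file=$1
    min_size=${2:-1}
    awk -v min="$min_size" '
        /^>Cluster/ {
            # Report the previous cluster before starting a new one.
            if (id != "" && n >= min) print id, n
            id = $2; n = 0; next
        }
        NF { n++ }    # any non-empty, non-header line is a member
        END { if (id != "" && n >= min) print id, n }
    ' "$clstr_file"
}
```

For example, count_clusters cdhit.70.out.clstr 10 lists the clusters that would be written out by the make_multi_seq.pl command above, and piping the result through wc -l gives their number.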
See the user guide for further documentation.