From: softlib.cs.rice.edu
Last mod: January 30, 1995
fastlink 3.0p

unused alleles


Diagnostic for extra alleles

This file explains the diagnostic that states that a pedigree or dataset has unused alleles. This diagnostic was implemented by Chris Hoelscher and is included in FASTLINK 2.3P and later versions. The renumbering is implemented in version 3.0P and later.

The running time of LINKAGE and FASTLINK grows rapidly with the number of alleles specified for each locus used in a run. Therefore, it is important to specify no more alleles than are actually needed for the analysis. Various partial solutions to the "extra allele" problem have been implemented by:

  Ellen Wijsman (in the context of LIPED)
  Jathine Wong and Cathryn Lewis (in the context of LINKAGE/FASTLINK)
  Scott Diehl, Bettie Duke, and Lynn Ploughman (in the context of MENDEL)
  Alan Young (in the context of GAS)

At the end of this essay we briefly describe the partial solution implemented by Wijsman and by Diehl, Duke, and Ploughman. In the context of FASTLINK, their solution is applicable only to the LINKMAP and MLINK programs. We have not yet implemented an extension of their solution in FASTLINK 3.0P.

Extra alleles in symbols and an example

Suppose a locus has n alleles, A1 through An, that occur in the population at large. Suppose that in a population to be studied with linkage analysis, only alleles A1 through Ak, with k < n-1, occur. Then one may combine alleles A(k+1) through An into one "catch-all" allele, unless one is estimating allele frequencies. The frequency of the catch-all allele is the sum of the frequencies of A(k+1) through An.

A concrete FASTLINK example:

Suppose the general population has the possibilities:

  Allele       1     2     3     4     5     6
  Frequency    .3    .2    .15   .1    .22   .03

and this is encoded in the locus file (datain.dat).

Suppose that the pedigree(s) encoded in the pedigree file (pedin.dat) contain only the alleles 2, 4, and 5. LINKAGE and FASTLINK require that the alleles be numbered consecutively starting at 1. Therefore, in the process of reducing from 6 alleles to 4 (the three observed alleles plus one catch-all allele), it is necessary to renumber the alleles.

Renumber old allele 2 to be new allele 1 with frequency .2
Renumber old allele 4 to be new allele 2 with frequency .1
Renumber old allele 5 to be new allele 3 with frequency .22
Create catch-all allele 4 with frequency .48 (sum of frequencies of old 1, old 3, old 6)

No person should have the catch-all allele, but it is absolutely wrong to omit it from the locus file: without it, the remaining allele frequencies would sum to only .52 rather than 1.
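
The renumbering above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the code used inside FASTLINK; the function name renumber_alleles and its arguments are hypothetical, and the frequencies are the ones from the example.

```python
def renumber_alleles(freqs, observed):
    """Renumber observed alleles 1..k and pool the rest into a catch-all.

    freqs    -- dict mapping old allele number -> population frequency
    observed -- iterable of old allele numbers that occur in the pedigrees
    Returns (mapping, new_freqs): mapping sends old -> new allele number,
    and new_freqs[i] is the frequency of new allele i + 1.
    """
    observed = set(observed)
    unknown = observed - set(freqs)
    if unknown:
        raise ValueError(f"alleles not in locus description: {sorted(unknown)}")

    # Observed alleles keep their relative order and are numbered from 1.
    kept = sorted(observed)
    mapping = {old: new for new, old in enumerate(kept, start=1)}
    new_freqs = [freqs[old] for old in kept]

    unused = [a for a in freqs if a not in observed]
    if unused:
        # The catch-all allele is never omitted, so frequencies still sum to 1.
        # round() only hides floating-point noise in the sum.
        new_freqs.append(round(sum(freqs[a] for a in unused), 10))
    return mapping, new_freqs


# Hypothetical encoding of the example locus.
freqs = {1: .3, 2: .2, 3: .15, 4: .1, 5: .22, 6: .03}
mapping, new_freqs = renumber_alleles(freqs, observed={2, 4, 5})
print(mapping)    # {2: 1, 4: 2, 5: 3}
print(new_freqs)  # three kept frequencies followed by the catch-all
```

The pedigree file would then be rewritten using mapping, and the locus file would list the four frequencies in new_freqs.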

Important technical note: the process of renumbering alleles to reduce their number loses no information in a statistical sense, unless one is estimating allele frequencies. Renumbering is distinct from "downcoding", in which multiple alleles that are distinct and do occur in the population are given the same number, in the interest of reducing running time. In general, downcoding loses information, although there are some special situations in which it does not because the frequencies of some different alleles happen to be identical.


Extra alleles and separating pedigrees

Extra alleles often arise when the original data had P pedigrees among which all n alleles occur, but some analysis uses only Q < P of the pedigrees, and those pedigrees contain only k < n-1 of the alleles.

The MLINK and LINKMAP programs analyze the pedigrees one at a time and sum the values of -2*log(likelihood) over the pedigrees. Since allele renumbering makes sense on a per-pedigree basis, it is valid to renumber the alleles for each pedigree in an optimal manner. This requires a different locus file for each pedigree, because the renumbering may assign the same new allele number to different old alleles in different pedigrees. One annoyance of analyzing each pedigree separately is that the output values must be summed. The separation of input pedigrees and the combination of output results were automated for LIPED by Ellen Wijsman and for MENDEL by Scott Diehl, Bettie Duke, and Lynn Ploughman.
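
A minimal Python sketch of the two bookkeeping steps described above: renumbering per pedigree and summing the per-pedigree results. It is not taken from LIPED, MENDEL, or FASTLINK; the function names, pedigree names, allele sets, and score values are all hypothetical placeholders.

```python
def per_pedigree_renumbering(alleles_by_pedigree):
    """Map each pedigree name to its own old -> new allele numbering."""
    return {
        ped: {old: new for new, old in enumerate(sorted(alleles), start=1)}
        for ped, alleles in alleles_by_pedigree.items()
    }


def combine_scores(minus_2_log_like_by_pedigree):
    """Sum the per-pedigree -2*log(likelihood) values into one total."""
    return sum(minus_2_log_like_by_pedigree.values())


# Hypothetical pedigrees and observed alleles, for illustration only.
maps = per_pedigree_renumbering({"ped1": {2, 4, 5}, "ped2": {1, 3}})
# New allele 1 means old allele 2 in ped1 but old allele 1 in ped2,
# which is why each pedigree needs its own locus file.
print(maps["ped1"], maps["ped2"])

# Placeholder values, not output from a real run.
total = combine_scores({"ped1": 41.25, "ped2": 17.5})
print(total)
```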

The above solution does not work for ILINK or LODSCORE.


FASTLINK diagnostic error message

The main programs in FASTLINK do not know about all the loci in the locus file (datain.dat). They only know about the loci that are actually used in a given analysis. For example, if an analysis uses loci 1, 7, and 12, in *any* order, locus 1 will have index 1, locus 7 will have index 2, and locus 12 will have index 3 when reported in the diagnostic.
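
The index convention can be illustrated with a short Python sketch. The function name diagnostic_indices is hypothetical, and the sketch assumes, as the example suggests, that indices follow ascending position in the locus file regardless of the order in which the loci are listed for the analysis.

```python
def diagnostic_indices(loci_used):
    """Map locus-file positions to the 1-based indices used in the diagnostic."""
    # Assumption: indices are assigned in ascending locus-file order.
    return {locus: i for i, locus in enumerate(sorted(set(loci_used)), start=1)}


print(diagnostic_indices([12, 1, 7]))  # {1: 1, 7: 2, 12: 3}
```

Inverting this mapping turns an index reported in the diagnostic back into a position in datain.dat.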