CLUMP documentation


CLUMP is a program designed to assess the significance of the departure of observed values in a contingency table from the expected values conditional on the marginal totals. The present implementation works on 2 x N tables and was designed for use in genetic case-control association studies, but the program should be useful for any 2 x N contingency table, especially where N is large and the table is sparse. The significance is assessed using a Monte Carlo approach, by performing repeated simulations to generate tables having the same marginal totals as the one under consideration, and counting the number of times that a chi-squared value associated with the real table is achieved by the randomly simulated data. This means that the significance levels assigned should be unbiased (with accuracy dependent on the number of simulations performed) and that no special account needs to be taken of continuity corrections or small expected values. The method is described in full in: Sham PC & Curtis D. 1995. Monte Carlo tests for associations between disease and alleles at highly polymorphic loci. Ann Hum Genet. 59: 97-105. Please cite this reference when using the CLUMP program.

The chi-squared value associated with a contingency table is defined in the usual way as the sum over all cells of the squared difference between observed and expected value divided by the expected value. The expected values are calculated conditional on the row and column totals. The expected value for a cell is the total for its row multiplied by the total for its column divided by the overall total number of observations.
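The definition above can be sketched in a few lines of Python. This is only an illustration of the formula, not code taken from CLUMP; the function name chi_squared and the list-of-rows table layout are assumptions made for the example.

```python
def chi_squared(table):
    """Pearson chi-squared for a contingency table given as a list of rows.

    Expected values are conditional on the row and column totals:
    expected = row total * column total / overall total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return chi2


# Hypothetical 2 x 5 table: first row cases, second row controls.
example = [[13, 4, 3, 8, 2], [8, 7, 6, 10, 5]]
print(chi_squared(example))
```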

The original feature of CLUMP is the novel chi-squared value which it derives. This is produced by "clumping" columns together into a new two-by-two table in a way which is designed to maximise the chi-squared value. This is like testing a post-hoc hypothesis: putting all the columns with the first value higher than expected into one group and all those with the second value higher into another and then looking at the difference between the groups. Of course this procedure will produce an inflated chi-square value which will not be distributed as a chi-square statistic with one degree of freedom, but this does not cause any problems in interpretation because the significance of the value is assessed using the Monte Carlo method rather than by making any assumptions about the distribution of the test statistic.

The advantage of this method is that it allows one to gain a robust assessment of the post-hoc hypothesis that certain columns tend to have higher values in one row than the rest. Since there are many ways of selecting a group of columns from N where N is of moderate size, the alternative of carrying out a Bonferroni correction on the significance of the chi-squared value from the "clumped" 2 x 2 table assessed using a chi-squared distribution might be rather conservative.

The method of "clumping" the columns into two groups is slightly more complicated than mentioned above. The procedure does begin by separating the columns with higher than expected values in the first row from those with lower values. However, this will not necessarily yield the maximum chi-squared value. What happens next is that each column in turn is moved into the opposite group to see if this increases the chi-squared value. If it does then the column is assigned to the new group, but if not it is put back into its original group. This process is repeated until no further moves can be found which increase the chi-squared value for the table. Again, there is no guarantee that this procedure will yield the absolute maximum chi-squared value possible, but it does represent a simple and intuitively appealing method for producing a value which seems at any rate likely to be close to the maximum.
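The clumping procedure just described might be sketched as follows. This is a minimal Python illustration written from the description, not the CLUMP source; the names chi2_2x2, clumped_chi2 and clump are hypothetical, and the order in which columns are tried is an assumption.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-squared for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty row or column carries no information
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def clumped_chi2(row1, row2, in_group):
    """Chi-squared of the 2x2 table formed by the flagged columns vs the rest."""
    a = sum(x for x, g in zip(row1, in_group) if g)
    c = sum(y for y, g in zip(row2, in_group) if g)
    return chi2_2x2(a, sum(row1) - a, c, sum(row2) - c)


def clump(row1, row2):
    """Greedy search for a two-group split with a large chi-squared value."""
    n1, n2 = sum(row1), sum(row2)
    total = n1 + n2
    # Initial split: columns whose first-row count exceeds its expected value.
    in_group = [x > n1 * (x + y) / total for x, y in zip(row1, row2)]
    best = clumped_chi2(row1, row2, in_group)
    improved = True
    while improved:
        improved = False
        for j in range(len(row1)):
            in_group[j] = not in_group[j]      # try moving column j
            trial = clumped_chi2(row1, row2, in_group)
            if trial > best:
                best = trial                   # keep the move
                improved = True
            else:
                in_group[j] = not in_group[j]  # put the column back
    return best, in_group


best, groups = clump([13, 4, 3, 8, 2], [8, 7, 6, 10, 5])
print(best, groups)
```

Because a move is kept only when it strictly increases the chi-squared value, the loop must stop, but as noted above it may stop at a local rather than the absolute maximum.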

For convenience the CLUMP program has been written to use Monte Carlo methods also to evaluate the significance of chi-squared values produced by more conventional methods of analysis. In all, chi-squared values are generated for four tables, and their significance is evaluated by seeing how many times the value produced is reached or exceeded by chance. The four tables are as follows.

1) The raw 2-by-m table supplied by the user.

Typically the first row might contain data from cases and the second from controls. There can be any number of columns (greater than two). The chi-squared value of this table is calculated. The statistic produced is referred to as T1. If there are small expected values in some cells then this value might not follow the expected distribution of a chi-squared statistic with m-1 degrees of freedom, but the significance can be reliably assessed using Monte Carlo simulations. Simulated 2-by-m tables are constrained to have the same row and column totals as this table, and the number of times a simulated table yields a chi-squared value greater than or equal to that obtained from the real table is output.
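One way to simulate tables with the same row and column totals is to shuffle the column labels of all observations and deal the first row's share back out. The Python sketch below uses that approach to count how often a simulated chi-squared value reaches the real one. It is an assumption about how such a simulation could be done, not a description of CLUMP's internal generator, and the function names are hypothetical.

```python
import random


def chi_squared(row1, row2):
    """Pearson chi-squared for a 2-by-m table given as two rows."""
    n1, n2 = sum(row1), sum(row2)
    total = n1 + n2
    chi2 = 0.0
    for x, y in zip(row1, row2):
        for observed, row_total in ((x, n1), (y, n2)):
            expected = row_total * (x + y) / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2


def simulate_table(row1, row2, rng):
    """Random 2-by-m table with the same row and column totals."""
    # One column label per observation, shuffled; the first n1 go to row 1.
    labels = [j for j, (x, y) in enumerate(zip(row1, row2))
              for _ in range(x + y)]
    rng.shuffle(labels)
    n1 = sum(row1)
    sim1, sim2 = [0] * len(row1), [0] * len(row1)
    for k, j in enumerate(labels):
        if k < n1:
            sim1[j] += 1
        else:
            sim2[j] += 1
    return sim1, sim2


def monte_carlo_count(row1, row2, n_sims, seed):
    """Return the real chi-squared and how often simulations reach it."""
    rng = random.Random(seed)
    observed = chi_squared(row1, row2)
    hits = 0
    for _ in range(n_sims):
        sim1, sim2 = simulate_table(row1, row2, rng)
        if chi_squared(sim1, sim2) >= observed:
            hits += 1
    return observed, hits


print(monte_carlo_count([13, 4, 3, 8, 2], [8, 7, 6, 10, 5], 100, 1))
```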

2) The original table with columns containing small numbers of cases clumped together.

If any cell has an expected value less than 5 then its column is lumped together with the column with the next smallest total. This process is repeated until every cell has an expected value of 5 or more. This probably represents the most conventional method of dealing with a sparse contingency table. The significance of the chi-squared value obtained from this table using the Monte Carlo method should closely correspond to the nominal significance of the chi-squared statistic with m'-1 degrees of freedom (where m' is the number of columns in the clumped table). The statistic produced is called T2. For the Monte Carlo simulations, tables with 2-by-m' cells are simulated with row and column totals constrained to be the same as for the clumped table (not the original 2-by-m table), and the number of times a simulated table yields a chi-squared value greater than or equal to that obtained from the clumped table is output.
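The lumping step can be read in more than one way, so the Python sketch below fixes one interpretation: the column with the smallest total is merged with the next smallest for as long as it contains a cell with an expected value below 5. That interpretation, the function name lump_small_columns and the threshold parameter are assumptions for illustration and may differ in detail from what CLUMP does.

```python
def lump_small_columns(row1, row2, threshold=5.0):
    """Merge the smallest columns until every expected value >= threshold.

    Columns are returned sorted by total; the chi-squared value does not
    depend on column order.
    """
    n1, n2 = sum(row1), sum(row2)
    total = n1 + n2

    def by_total(col):
        return col[0] + col[1]

    cols = sorted(zip(row1, row2), key=by_total)
    while len(cols) > 1:
        x, y = cols[0]
        # Row totals never change, so the smallest expected value in the
        # table sits in the column with the smallest total.
        smallest_expected = min(n1, n2) * (x + y) / total
        if smallest_expected >= threshold:
            break
        x2, y2 = cols[1]
        cols = sorted([(x + x2, y + y2)] + cols[2:], key=by_total)
    return [c[0] for c in cols], [c[1] for c in cols]


lumped1, lumped2 = lump_small_columns([13, 4, 3, 8, 2], [8, 7, 6, 10, 5])
print(lumped1)
print(lumped2)
```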

3) A 2-by-2 table obtained by comparing one column of the original table against the total of all the other columns.

This represents another fairly conventional way of dealing with a contingency table having several columns, and tests the hypothesis that there is one particular column which has cells deviating from the expected values. For each column in turn all the other m-1 columns are clumped into one to yield a 2-by-2 table and the column producing the maximum chi-squared value is used. The statistic produced is called T3. Columns containing expected values less than 5 are not allowed to be considered on their own, but are clumped together with the other columns. In order to assess the significance of the maximum chi-squared value produced by this means, one could consider it as a chi-squared statistic with one degree of freedom and then apply a Bonferroni correction to the p value obtained. This Bonferroni correction would assume that a number of trials had been performed equal to the number of columns which had been compared against all other columns (i.e. the number of columns having cells with expected values greater than or equal to 5). One could argue that these trials were not in fact independent, in which case the Bonferroni correction would result in a conservative significance value. Assessing the significance using Monte Carlo simulations avoids these problems. These simulations are performed by simulating 2-by-m tables constrained to have the same row and column totals as the original table, and then clumping each into a 2-by-2 table by comparing one column against the total of the others and using the column which produces the maximum chi-squared value. (Again, this must be a column in which the expected values are both at least 5.) The number of times a clumped simulated table yields a chi-squared value greater than or equal to that obtained from the clumped real table is output.
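The one-column-against-the-rest comparison might be sketched in Python as below. This is an illustration built from the description above rather than CLUMP's own code; the function names and the convention of returning None when no column qualifies are assumptions.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-squared for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def t3_statistic(row1, row2, threshold=5.0):
    """Largest chi-squared from comparing one eligible column to the rest.

    A column is eligible only if both of its expected values are at least
    `threshold`. Returns (chi-squared, column index), or (0.0, None) when
    no column is eligible.
    """
    n1, n2 = sum(row1), sum(row2)
    total = n1 + n2
    best, best_col = 0.0, None
    for j, (x, y) in enumerate(zip(row1, row2)):
        # The smaller expected value of the column lies in the smaller row.
        if min(n1, n2) * (x + y) / total < threshold:
            continue
        value = chi2_2x2(x, n1 - x, y, n2 - y)
        if best_col is None or value > best:
            best, best_col = value, j
    return best, best_col


print(t3_statistic([13, 4, 3, 8, 2], [8, 7, 6, 10, 5]))
```

To assess significance by simulation, the same function would be applied to each simulated 2-by-m table and the results compared with the value from the real table.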

4) A 2-by-2 table obtained by clumping the columns of the original table to maximise the chi-squared value.

The columns of the original table are grouped together as described above, using the novel procedure designed to maximise the associated chi-squared value. The statistic produced is called T4. This chi-squared value is inflated and would be expected to be higher than an ordinary chi-squared statistic with one degree of freedom, so its significance is assessed using Monte Carlo simulations. These are performed by simulating 2-by-m tables constrained to have the same row and column totals as the original table and then clumping them into 2-by-2 tables to produce maximal chi-squared values. The number of times a clumped simulated table yields a chi-squared value greater than or equal to that obtained from the clumped real table is output.

Running CLUMP

CLUMP can either receive its input interactively or can read input from a file. Such an input file should contain exactly the same parameters as would be entered interactively. In order to use an input file, specify its name as the first argument to the command to run clump. If a second argument is also given, it will be taken as the name of a file to write output to. Otherwise output will just be displayed on the screen. From the operating system prompt, enter:
clump [ inputfilename [ outputfilename ] ] 

The first line of input consists of the number of columns in the table to be entered. The second line of input consists of the values for the first row of the table, separated by spaces, and the third line of input the values for the second row. The fourth line of input is the number of simulations to be performed to assess significance. The fifth line contains a value to seed the random number generator which CLUMP uses. A typical input file might appear as follows:
  5 
  13 4 3 8 2 
  8 7 6 10 5 
  100 
  1 

This would instruct CLUMP to carry out 100 sets of simulations to assess the significance of the supplied 2-by-5 table.
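For anyone preparing or checking input files with a script, the five-line layout can be read as in the Python sketch below. The function name parse_clump_input is hypothetical, and the sketch assumes that all values are whole numbers and that blank lines can be ignored; CLUMP's own input handling may be stricter or more lenient.

```python
def parse_clump_input(text):
    """Parse the five-line CLUMP input layout into a dictionary."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if len(lines) < 5:
        raise ValueError("expected five lines of input")
    n_cols = int(lines[0][0])
    row1 = [int(v) for v in lines[1]]
    row2 = [int(v) for v in lines[2]]
    if len(row1) != n_cols or len(row2) != n_cols:
        raise ValueError("row length does not match the number of columns")
    return {
        "columns": n_cols,
        "row1": row1,
        "row2": row2,
        "simulations": int(lines[3][0]),
        "seed": int(lines[4][0]),
    }


sample = """  5
  13 4 3 8 2
  8 7 6 10 5
  100
  1
"""
print(parse_clump_input(sample))
```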

The output (which is displayed on screen or sent to the output file if one is specified) consists of the chi-squared value produced by each of the four procedures described above together with the number of times such a value was reached by a simulated table. The proportion of times the chi-squared value produced by the real data is reached yields an estimate of the significance of the departure of the observed data from the expectation under the null hypothesis. The more simulations that are performed, the more accurate this estimate will be. It would be possible to use binomial probabilities to calculate an upper confidence limit for the true significance based on this estimate of the significance, but a simpler approach is to perform a large enough number of simulations to give a reasonably accurate estimate of the significance. As a rough rule of thumb, one might perform as many simulations as are necessary for the real chi-squared value to be reached 20 or more times. One might begin by performing a set of 100 simulations. If the real chi-squared value were reached more than 15 or 20 times the results would clearly be non-significant. Otherwise one might go on to perform a set of 1000 or 2000 simulations, and if only a few of these reached the real chi-squared value one might go on to perform 10000 or more, until a satisfactorily accurate estimate of the true significance was achieved.
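The binomial upper confidence limit mentioned above is not something CLUMP reports, but it can be computed from the output. The Python sketch below finds a one-sided exact (Clopper-Pearson style) upper limit by bisection. The function names and the default level of 0.05 are assumptions chosen for the example.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    total = 0.0
    for i in range(k + 1):
        # Work in logs so that large n does not overflow.
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return total


def upper_confidence_limit(hits, n_sims, alpha=0.05):
    """One-sided upper limit for the true significance.

    Solves P(X <= hits | p) = alpha for p by bisection, where `hits` is
    the number of simulations that reached the real chi-squared value.
    """
    if hits >= n_sims:
        return 1.0
    lo, hi = hits / n_sims, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(hits, n_sims, mid) > alpha:
            lo = mid   # limit lies above mid
        else:
            hi = mid   # limit lies at or below mid
    return hi


# Hypothetical result: 3 of 1000 simulations reached the real value.
print(3 / 1000, upper_confidence_limit(3, 1000))
```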

CLUMP produces a significance value for each of the 4 statistics derived, assessing the strength of evidence for a deviation from the null hypothesis that the underlying population frequencies are the same for the top and bottom rows. From the work presented in the paper, we would recommend that either the normal chi-squared (T1) or the chi-squared for the "clumped" 2x2 table (T4) should be used. The other two statistics lacked power to detect association in the data set which we studied, whereas T1 and T4 seem to perform similarly well. Obviously, which statistic is to be used should be decided on in advance of seeing the data.

As well as producing the output which appears either on screen or in the output file, CLUMP also writes a file called chisq.out which contains the real and simulated tables with their associated chi-squared values. For the real table and each set of simulated data 4 chi-squared values are derived according to the procedures described above. This file can be studied if it is desired to gain a clearer understanding of what CLUMP is doing (and perhaps to make sure it is doing it correctly).

I will aim to keep up-to-date versions of CLUMP at diamond.gene.ucl.ac.uk in /pub/packages/dcurtis.

Please don't hesitate to contact me with any problems or suggestions: dcurtis@hgmp.crc.ac.uk

Dave Curtis, Genetics Section, Institute of Psychiatry, De Crespigny Park, London SE5 8AF. (71) 703 5411

http://www.iop.bpmf.ac.uk/home/depts/psychmed/general/dcurtis/dcurtis.htm

