We are very happy to collaborate in the analysis of data, and we think it can be an enriching experience for all involved. However, to help the process along, maximize its efficiency, rationalize our workload, and minimize misunderstandings, we have prepared these guidelines. We would appreciate it if you could read them before coming to us.
A lot has been written about statistical consulting, and we have read a little bit of it. Just judging by the number of pages this issue still generates (e.g., there is a publication, The Statistical Consultant, devoted solely to this topic), this is a touchy and difficult subject. So please bear with us for a few more paragraphs.
Simplifying a lot, we can think of three different consulting styles at CNIO.
We do not perform analyses before the objectives are spelled out: we require that we all know what you want to do with your data. In other words, that you know what question you are asking. If not, then we will need a first meeting to thoroughly discuss your objectives, hypotheses under consideration, etc. We are glad to help with this step. Please, be ready to provide us with background information, relevant biological details, and to endure thousands of apparently trivial (and silly) questions about procedures, design, previous results, etc. Before helping, we must be sure we understand the problem.
Some studies are clearly testing a very specific hypothesis, other studies are more exploratory, searching, for example, for candidate genes. But, in all cases, there is an explicit question behind the study. We do not like to get involved in studies that try to torture the data till they confess something, whatever that might be.
At the beginning of the project, we will ask you to provide a written description of the objectives of the project. This description will be filed together with the rest of the logs and files of the project.
It is much, much better if you ask us before you start your study. In particular, you will probably want to discuss issues related to study design, the variables to record, and the type of microarray design (controls, dye-swaps, pooling, etc.). Otherwise, we might not be able to help you at all with your data: it might be impossible to analyze data from poorly designed studies. We want to emphasize this: many studies waste time, money, and effort because they do not allow for sensible statistical analyses.
(This is from R. A. Fisher himself, in an address to the First Indian Statistical Congress, in 1938: "To consult a statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.")
We are very happy to discuss with you alternative designs for your prospective studies.
We expect the cDNA microarray data to have been properly normalized. Unless you can justify otherwise, we require the data to have been normalized using print-tip loess. You can use Bioconductor or similar tools, or you can use our dnmad tool. Please, remember not to normalize the data with the GenePix default: save the raw GPR file without normalization.
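Proper print-tip loess normalization is best done with existing tools (e.g. limma's normalizeWithinArrays in Bioconductor, or our dnmad tool). Still, the per-print-tip structure of the computation can be sketched in a few lines of Python. As a simplification, the sketch below subtracts a per-tip median of M instead of fitting a loess curve of M on A, so it only illustrates the bookkeeping (MA transformation, normalization done separately within each print tip), not a substitute for the real method:

```python
# Illustrative sketch of within-print-tip normalization.
# NOTE: a per-tip median of M is subtracted here as a crude stand-in
# for the print-tip loess fit of M on A used in real analyses.
import math
from statistics import median

def ma_values(red, green):
    """M = log2(R/G), A = (log2 R + log2 G) / 2 for one spot."""
    return math.log2(red / green), (math.log2(red) + math.log2(green)) / 2

def normalize_per_tip(spots):
    """spots: list of (tip_id, red, green) tuples.
    Returns (tip_id, normalized_M, A), centring M within each print tip."""
    ma = [(tip,) + ma_values(r, g) for tip, r, g in spots]
    tips = {tip for tip, _, _ in ma}
    centre = {tip: median(m for t, m, _ in ma if t == tip) for tip in tips}
    return [(tip, m - centre[tip], a) for tip, m, a in ma]
```

The point of the per-tip grouping is that spatial and pin-specific artifacts are corrected locally, which is what print-tip loess does properly with an intensity-dependent fit.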
The data should be given to us as plain ASCII files, with columns separated with tabulators (tabs). Do not send us Excel files. If you use Excel and then export as ASCII, please ensure that all rows have the same number of columns.
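A quick way to verify the "same number of columns in every row" requirement before sending us a file is sketched below in Python (the helper name and file path are ours, purely illustrative):

```python
# Check that every row of a tab-separated ASCII file has the same
# number of columns as the header row, before sending the data.
import csv

def check_tab_file(path):
    """Return (ok, bad_lines): ok is True when every row has the same
    number of tab-separated fields as the first row; bad_lines lists
    the 1-based line numbers of offending rows."""
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh, delimiter="\t"))
    if not rows:
        return False, []
    expected = len(rows[0])
    bad = [i + 1 for i, row in enumerate(rows) if len(row) != expected]
    return not bad, bad
```

Exporting from Excel often drops trailing empty cells, which is exactly the kind of problem this check catches.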
Besides the microarray data, you will often have additional phenotypic data. Please, send this as a text file (with tabs), where there is one row per subject (or array, or experimental unit) and one column for each variable. For instance, if you have five subjects, and there are three variables (age, sex, hospital), we expect to receive a matrix of five rows and three columns, with an additional first column for the subject ids and a first row with the column names. (Yes, we are aware this format is the transpose of the Pomelo format, for example.)
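For example, the five-subjects, three-variables case above could be written out as follows (the subject ids and values are made up; only the layout matters):

```python
# Hypothetical phenotypic data, laid out as described: one row per
# subject, one column per variable, tab-separated, with subject ids
# in the first column and variable names in the first row.
header = ["id", "age", "sex", "hospital"]
subjects = [
    ["s1", "54", "F", "H1"],
    ["s2", "61", "M", "H2"],
    ["s3", "47", "F", "H1"],
    ["s4", "58", "M", "H3"],
    ["s5", "63", "F", "H2"],
]
with open("pheno.txt", "w") as fh:
    for row in [header] + subjects:
        fh.write("\t".join(row) + "\n")
```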
You can provide us with the data using:
Please, do not leave the data in the "network drives" (it is very hard for us to access them from GNU/Linux) nor bring us ZIP drives (most of our machines do not have ZIP drives).
We prefer well established, theoretically justified methods, rather than fancy new algorithms that lack statistical justification. We are happy to discuss (and learn!) different methods/approaches, but please, do not try to coax us to use the fancy yy method that such and such just published, if we are telling you that we'd rather use discriminant analysis. Of course, you are free to do the analyses on your own (but we will not do the programming for you).
By now, we expect not to have to convince you that multiple testing must be taken into account when screening large numbers of genes for "significant differences". If you still have doubts, please check the references mentioned in the help of Pomelo.
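As one concrete example of such an adjustment, here is a minimal Benjamini-Hochberg FDR correction in Python; it is one standard procedure for this problem, and the references in the Pomelo help cover it and the alternatives in detail:

```python
# Minimal Benjamini-Hochberg FDR adjustment: each p-value p_(i) of
# rank i (out of m tests) becomes min over j >= i of p_(j) * m / j.
def bh_adjust(pvals):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes are then declared "significant" at FDR level q by keeping those with adjusted p-value below q, rather than thresholding the raw p-values.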
We also expect that you are convinced that it is necessary to obtain honest, unbiased, estimates of the performance of any predictor you build. This, of course, involves including gene selection in the cross-validation (if you have used gene selection). We will soon have a tool to help you with this task. In the meantime, you might want to read Ambroise & McLachlan, 2002 (PNAS, 99: 6562--6566) and Simon et al., 2003 (JNCI, 95: 14--18). If we do help you with the building of a predictor, using cross-validation or bootstrapping to obtain estimates of the error rate will be an integral part of our work.
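The point about including gene selection inside the cross-validation can be sketched as follows. The gene filter (absolute mean difference between classes) and the classifier (nearest centroid) are deliberately simplistic, hypothetical stand-ins for whatever methods are actually used, and classes are assumed to be coded 0/1; what matters is that select_genes is called once per fold, on the training samples only:

```python
# Cross-validation where gene selection is redone inside each fold,
# so the error estimate is not biased by having selected genes on
# the full data set (the point of Ambroise & McLachlan, 2002).
import random

def select_genes(X, y, k):
    """Pick the k genes with the largest absolute difference of class
    means -- a deliberately simple filter; classes are coded 0/1."""
    def score(g):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def nearest_centroid_predict(Xtr, ytr, x):
    """Classify x by the nearer class centroid (squared distance)."""
    dist = {}
    for lab in (0, 1):
        rows = [r for r, l in zip(Xtr, ytr) if l == lab]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist[lab] = sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(dist, key=dist.get)

def cv_error(X, y, k_genes=5, n_folds=5):
    """K-fold CV error; note selection happens per training set."""
    idx = list(range(len(X)))
    random.Random(0).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    errors = 0
    for fold in folds:
        train = [i for i in idx if i not in fold]
        genes = select_genes([X[i] for i in train],
                             [y[i] for i in train], k_genes)
        Xtr = [[X[i][g] for g in genes] for i in train]
        ytr = [y[i] for i in train]
        for i in fold:
            x = [X[i][g] for g in genes]
            errors += nearest_centroid_predict(Xtr, ytr, x) != y[i]
    return errors / len(X)
```

Selecting genes once on all samples and then cross-validating only the classifier would make cv_error look far better than the predictor really is.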
As we mentioned above, some of our choices, options, beliefs and preconceptions are discussed in the CNIO-stats-FAQ. You probably want to look at it.
Some of our recommendations and preferences change over time, because methodology advances, the availability of methods improves, and our own understanding gets better.
If your analyses fall into an area we are not familiar with, they might take longer than usual, and be given a lower priority (see also Scheduling).
As many analyses are rather involved, we might want to discuss coauthorship. But this also means that we take full responsibility for the use of our analysis (see also the comments about the internal logic of the sequence of analyses). Thus, please do not be offended if we ask you to remove our names from the list of coauthors if we disagree with the analyses or their presentation. Likewise, if we think our contribution does not warrant coauthorship, we will ask you not to add us to the list of coauthors. We would also appreciate it if you asked for permission before adding our names to the acknowledgements section.
If, after the analyses, you think that the study could be refocused and reanalyzed in a different way, we can do that, and it will be considered a "new submission" (so we go to step 1 of "General procedure (and scheduling) when getting help …".)
Bross, I.D.J. 1974. The Role of the Statistician: Scientist or Shoe Clerk. The American Statistician, 28: 126--127.
Browne, R. 1996. Tips for Beginning Consultants. The Statistical Consultant, 13 (1): 8--10.
Finch, H. 1999. Client Expectations in a University Statistical Consulting Lab. The Statistical Consultant, 16 (3): 5--9.
Finch, H. 1999. Client Perceived Pitfalls in Statistical Consulting: An Ethnographic Study. The Statistical Consultant, 18 (1): 9--11.
Hunter, W.G. 1981. The Practice of Statistics: the Real World is an Idea Whose Time Has Come. The American Statistician, 35: 72--76.
Ittenbach, R.F., Tsai, Y.-J., and Billingsley, C. 1996. Consultation in the Social Sciences: An Integrated Model for Training and Service. The Statistical Consultant, 13 (3): 2--5.
Kirk, R.E. 1991. Statistical Consulting in a University: Dealing with People and Other Challenges. The American Statistician, 45: 28--34.
Mann, B.L., Quinn, L., Boardman, T., Bishop, T., and Gaydos, B. 1999. What my Mother Never Told Me: Learning the Hard Way. The Statistical Consultant, 16 (3): 2--5.
Strickland, H. 1996. The Nature of Statistical Consulting. The Statistical Consultant, 13 (2): 2--5.
Tweedie, R. and Taylor, S. 1998. Consulting: Real Problems, Real Interactions, Real Outcomes. Statistical Science, 13: 1--3.
Young, S.S. 2001. Industry/Academic Statistics Collaborations. The Statistical Consultant, 18 (1): 2--6.