James W Fairley BSc FRCS MS
Consultant ENT Surgeon
Sandyhurst House
Sandyhurst Lane
Ashford Kent TN25 4NX
www.entkent.com
Ian D B Hore MSc FRCS
Consultant Paediatric ENT Surgeon
St Thomas' Hospital
London SE1 7EH
Sackett (1996) defined Evidence Based Medicine as
"The conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients"
The definition has been broadened to include other areas of practice. Several terms have emerged.
They all carry similar meaning. McMaster University (2000) defined an Evidence Based Approach as follows
"An Evidence Based Approach is one where the clinician is aware of the evidence that bears on their practice and the strength of that evidence"


EBM is a powerful tool. Like a chainsaw, it needs skilled and careful handling. Errors in interpretation and implementation are dangerous. One common error is to conclude that, because there is no evidence of effectiveness for a certain treatment, that treatment is ineffective. It is essential to understand that
No evidence of effectiveness does NOT mean evidence of no effectiveness
Systematic reviews of surgical treatments which use strict methodology suffer from a lack of admissible evidence. This is because very few high quality randomised controlled trials (RCTs) of surgical treatments are done. RCTs of surgical treatments are not done for various reasons:
There are inherent limitations in attempting to apply the RCT methodology in surgery.
Definitions of equipoise vary (Gifford, 2001), but it can be regarded as the condition in which there is genuine uncertainty, with the scales of judgement balanced equally between two possible courses of action. It is only when clinical equipoise exists that it becomes ethical to advise patients that the choice of treatment may reasonably be based on chance alone. But whose equipoise is it? Does it have to exist in the mind of each individual doctor seeing each individual patient, or is it an attribute of the research community as a whole? And if the research community is uncertain about treatment options, but an individual doctor has a preference in a particular case, is it ethical for that doctor to pretend he doesn't really know which is best, and recommend his patient enters a randomised controlled trial? Many doctors are unhappy about sublimating their finely honed instincts, their accumulated experience, their nous, all the unquantifiable nuances that contribute to clinical judgement, to the random results of the toss of a coin. Treatment at random under these circumstances has been described as a betrayal of Hippocrates (Retsas, 2004). Surgeons in particular make crucial decisions with immediate feedback of results, and they need to know what to do. Sitting on the fence, delaying decisions, is not what patients expect of their surgeon. Whether surgeons really do know what is best for their patients, or are merely opinionated and possessed of "indefensible certainty" (Burton, 2007) is of course open to debate.
Even where RCTs have been attempted, many fail to reach successful completion. Often this is because patients in the control group are not happy. They simply go off and get treatment somewhere else.
Parents of children with middle ear effusions agreed to be part of a study where the control group were randomised not to have grommets. When it came to it, most of the control group were unhappy. They were not prepared to accept their child being disadvantaged by continued reduced hearing when others were successfully treated. As soon as the glue ear persisted beyond a time where it may reasonably have been expected to clear by itself, most of the control group elected for surgery. In the study by Maw (1999) 85% in the watchful waiting group had grommets by 18 months. For good methodological reasons, the results were analysed on the basis of the groups as originally randomised. This is known as an Intention to Treat (ITT) analysis. The other options would have been
The first option greatly reduces the power of the study. Both options would introduce potentially large amounts of bias. Yet, by including the hearing results of children who had grommets in the control group, we automatically dilute the estimate of efficacy of grommets in the treatment group. This matters a great deal when the study results are summarised. Commentators who have neglected to consider the detail point out the small effect size and use it as a weapon to attack the "ineffective" operation.
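A toy calculation may help show the dilution. It is only a sketch, under simplifying assumptions that are not taken from the Maw study: every child who receives grommets gains the same fixed hearing benefit, nobody recovers spontaneously, and hearing is compared at a single end point. The 10 dB figure and the function name are invented for illustration; only the 85% crossover rate comes from the text above.

```python
def itt_difference(true_benefit_db, crossover_fraction):
    """Between-group difference seen by an intention-to-treat analysis.

    Toy model: the whole treatment arm gains `true_benefit_db`; a fraction
    of the control arm crosses over to surgery and gains the same benefit.
    """
    treatment_mean = true_benefit_db
    control_mean = crossover_fraction * true_benefit_db
    return treatment_mean - control_mean

true_benefit = 10.0   # hypothetical gain in dB, for illustration only
observed = itt_difference(true_benefit, crossover_fraction=0.85)
print(f"True benefit: {true_benefit} dB, ITT estimate: {observed:.1f} dB")
```

In this simplified model the analysis reports only the share of the benefit that the control group did not also receive, which is why a high crossover rate shrinks the apparent effect.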
Efficacy is not the same thing as effectiveness. Efficacy is a generic property, applied to a treatment. A question of efficacy asks whether the treatment has the desired and expected effect on the organ in question. Effectiveness is more specific. It is grounded in the detailed clinical stage of the condition being studied. A question of clinical effectiveness asks whether, given this stage of the disease, the treatment being studied offers a worthwhile benefit compared to another option. In the case of the grommet study, the other option was to continue watchful waiting. The basis of watchful waiting in glue ear is that 50% of newly diagnosed cases will resolve within 3 months, and 75% within 6 months. If they don't, grommets can be fitted later. But Maw's cases in Bristol were stringently selected, and already had persistent glue. During the study period, 85% of the control group had continuing problems. Paradoxically, a well designed and well executed study gave an artificially low estimate of treatment efficacy. But that was not the question the study was designed to answer. The study was designed to answer a question of considerable relevance to NHS practice, where we see patients who are likely to wait some months before treatment is given. Does it matter if your grommet operation is delayed a few months? You might get better anyway, and avoid the risks of surgery. Most specialists in the field railed at the apparent absurdity of the ITT analysis. But almost everyone forgot the question the trial was designed to answer. The clinical question is whether, given these children with this stage of the disease, there is any advantage to giving the grommets now as opposed to watchful waiting with an option to fit them later. The results showed that there was an advantage, but it was relatively small. Simplistic commentators then pin the low estimate of benefit onto the treatment itself, rather than the more complex and subtle message. 
Grommets are really very good at restoring hearing, but you might not need them. You might get better anyway. The research question then changes into trying to predict the children who are not going to get better, and treat those. This is not a simple concept to put across.
Nearly everyone prefers a clear and simple message. Are grommets any good? I heard on BBC Radio 4's respected Today Programme that they were useless and a waste of taxpayers' money. Unfortunately, real life medicine is rarely that simple. Not many doctors have time to pore over the detail of clinical trials. Even summaries of systematic reviews are beyond most. Can we rely on the medical news? It is very easy to fall prey to the wiles of the spin doctor.
The interpretation of clinical trials is rarely simple. Demand management is a priority for health funding bodies. Rubbishing the product is one method of reducing demand. Deferring to the expertise of specialists in the field is now known as Producer Capture. For politicians and civil servants, allowing your government department to become a victim of Producer Capture is a bad career move. Spin doctors charged with reducing demand can easily cast doubt over the ethics of surgeons, paid to operate. They also trumpet recent pay rises earned by General Practitioners, neglecting to mention that they are being paid extra for meeting government targets. Some of these targets are themselves evidence based, for example the control of blood pressure. Spin doctors are the modern incarnation of propagandists. To influence mass opinion, they reduce complex issues to simple ideas.
Considered, thoughtful opinion is useless to the spin doctor. Catchy slogans are needed. Active news management relies on gaining the attention of editors. Archie Cochrane himself complained about the tendency for newspapers to pander to the public's desire for simplistic answers (Cochrane, 1972). In the era of the sound bite, even university students have the attention span of a gnat. Channel-hopping viewers skip over 24 hour rolling news broadcasts. Unless hooked within 2 or 3 seconds, web users surf on. It becomes almost impossible to put across the balanced, guarded, nuanced messages of EBM. The open admission of uncertainty, which lies at the heart of EBM, is unpalatable and risks ridicule.
Public health policy has always been a political matter. Modern politics in Western democracies consists largely of trying to stay in tune with public opinion. While media headlines determine health policy, the spin doctor over-rules the medical doctor. In many ways, truly scientific evidence-based practice is the antithesis of modern politics. To quote Tallis (2004)
"The commitment to minimizing the role of chance, of bias, or of wishful thinking, is what scientific medicine requires. Avoiding beliefs guided by delusive hope, unfounded authority, superstition and plain stupidity, it cultivates an attitude of healthy skepticism towards itself to prevent its practitioners from misleading themselves or their patients. Its permanent strategy of active uncertainty, and the humility this implies, is the distinctive virtue of scientific medicine. In the world outside of scientific medicine, however, humanity has had little time to adjust to this almost inhuman scrupulousness."
Some surgical trials cannot be done for ethical reasons. The sham cardiac surgery procedures of the past (Cobb et al, 1959) could not be done nowadays. It is very difficult to design control groups for surgical interventions that are both ethical and scientifically valid. An obvious factor is scarring. Patients know whether they have a scar or not. But we can't have surgeons making the mark of Zorro then not doing anything.
Health commissioners have a difficult task. Resources are finite in all healthcare systems. Faced with inexorably rising demand, and fixed budgets, they look to EBM to provide some degree of objective guidance. In deciding between competing demands for funding, it is perfectly reasonable to insist on reliable evidence that
But this is not always easy to prove. And failure to commission treatments, simply on the basis that high level evidence is not available, may well deny patients beneficial treatment.
An evidence based approach starts with a question. You then look for evidence to help answer the question. Critical appraisal means you don't just accept what you are told at face value. You have to use your critical faculties to appraise the evidence. Clearly, you are not going to be able to do this personally, every single time. You need to know where to find evidence that has already been critically appraised, and how to check that it really does apply to the patient in front of you. The steps are as follows:
That final step requires your individual clinical expertise and judgement. Clinical expertise and judgement develop with training and experience. Clinical judgement is the part which is often neglected in discussions of EBM (this article included). Please do not mistake the lack of discussion of clinical expertise and judgement as indicating a neglect of its importance. It is the crucial final link. Without it, the whole exercise fails its purpose. We are in practice to help individual patients. EBM is only a means to an end, not an objective in itself.
You shouldn't believe everything that is written in the papers. That includes medical research papers. The combined vigilance of medical journal editors and the peer review process gives some quality assurance. The ethical policy of the World Association of Medical Editors (WAME, 2008) is all well and good. But if all editors were as selective as those of the major journals, they would quickly run out of stuff to publish.
Traditional academic journals use peer review to quality-control content. The limitations of the process were highlighted on the alternative Web-based academic publishing blog, Scholarship 2.0 (Arms, 2007).
Not all rejected papers are bad. A reviewer may reject a paper for reasons of prejudice, academic jealousy, or findings which cast the reviewer's own work in a poor light. Most papers, however flawed, will get published somewhere if the authors are sufficiently persistent in submitting. Journal editors have space to fill.
Current clinical practice is often based on older research, when publication standards were less rigorous.
Trisha Greenhalgh (1997) wrote an excellent series of articles on how to read a paper, subsequently published as a book. Here is a link to the first paper in the series, content available online from the British Medical Journal.
For those interested in taking the process further, tools are now available online to help interpretation of different types of study.
The NHS Public Health Resources Unit (PHRU) Critical Appraisal Skills Programme (CASP) provides a range of resources. These tools are provided free for download at www.phru.nhs.uk. They are designed to help critical appraisal of papers. The following types of research are covered
Short courses in critical appraisal are available, such as those run by Superego Cafe.
Quantitative studies are done where we have something we can measure, for example blood pressure or rate of stroke in a population. The distinction between quantitative and qualitative studies is not absolute. Some data from qualitative studies - for example a questionnaire survey on patients' attitudes and beliefs - can be processed in a highly quantitative way. The following types of studies are generally considered to be quantitative.
Before getting bogged down in maths, and details of which tests to use and when, one important fact should be understood above all else:
All conventional statistical tests of probability take the form of an if - then statement.
An if - then statement takes the form
If x is true, then y should happen
The if is one of the assumptions underlying the test. All statistical tests of probability rely on assumptions. A common assumption is that the different factors that may influence the result act independently of one another. These assumptions are not always explicitly acknowledged. They tend to be ignored by those who use statistics
like a drunken man leaning on a lamppost, for support rather than illumination
The underlying assumptions must be valid, otherwise the statistical test is not. So,
In fact, you can never prove anything to be true with statistics. You just reach a known level of probability, and even that known level of probability is based on assumptions that always have to be made in designing the hypothesis.
A p-value is the probability that results at least as extreme as those we see would arise by chance alone, if the treatment really had no effect. p is no longer used a great deal in clinical studies. Nevertheless, the p-value is still used in basic sciences and remains an important concept to understand. The design principles of the RCT are more easily appreciated when simplified to hypothesis testing using a p-value. Significance testing using p-values was, until the 1980s, the commonest way of establishing that results were likely to be real, and not due to chance.
A conventional standard is to accept a p-value of 0.05 as statistically significant. p = 0.05 means that, if there were really no association, there would be a 5% chance of observing one at least this strong. Using such a cut-off means we accept that, in 5% of the studies where there is truly no association, we will conclude that there is one. Put more simply, one in twenty tests of a treatment that does nothing will come out positive at the 0.05 level. Being wrong in this way is known as a Type I error, or a false positive finding.
There may be instances when we wish to be more certain. Choosing a smaller p-value of 0.01 as statistically significant means that, when there is no real effect, we will be wrong only one time in a hundred. With a p-value of 0.001 we will be wrong only one time in a thousand, and 0.0001 one in ten thousand. This can go on indefinitely, but we can never be entirely sure that the results have not arisen by chance. The probability of any one ticket winning the lottery jackpot in the UK is around 0.00000007 (one in fourteen million), yet someone wins by chance most weeks. The medical literature is full of Type I errors. For example, if a busy medical journal's output for the year contained 100 papers testing treatments that in truth had no effect, we would expect five of them to report positive findings at the p=0.05 level due to Type I error. And we wouldn't know which were the wrong 'uns. We might have an idea that something doesn't sound very plausible. But the only way we could really find out would be when the study was replicated, and the positive finding was not repeated. Even with a pair of positive studies at the p=0.05 level, there is still a chance that we are observing a chance effect. If there is no real effect and the two studies are independent, the probability that both come out positive is the product of the two significance levels. 0.05 times 0.05 gives 0.0025, a one in four hundred chance. One in four hundred events do happen quite often.
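For readers who like to see the numbers, the following sketch simulates trials of a treatment that truly does nothing and counts how often they reach p < 0.05. Everything in it is an illustrative assumption rather than a description of any real study: outcomes are drawn from a standard normal distribution, the test is a two-sided z-test with known standard deviation, and the lottery is assumed to be a 6-from-49 draw, which is consistent with the one in fourteen million figure.

```python
import math
import random
from statistics import NormalDist, mean

def null_trial_p_value(rng, n_per_group=50):
    """Simulate one two-arm trial in which the treatment truly does nothing.

    Both arms are drawn from the same normal distribution (mean 0, SD 1),
    and a two-sided z-test with known SD gives the p-value.
    """
    treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
    control = [rng.gauss(0, 1) for _ in range(n_per_group)]
    standard_error = math.sqrt(2 / n_per_group)
    z = (mean(treated) - mean(control)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(2024)          # fixed seed so the run is repeatable
n_trials = 4000
false_positives = sum(null_trial_p_value(rng) < 0.05 for _ in range(n_trials))
false_positive_rate = false_positives / n_trials
print(f"Null trials 'significant' at 0.05: {false_positive_rate:.3f}")

# Two independent null trials both reaching p < 0.05
both_positive = 0.05 * 0.05
print(f"Chance of a false positive pair: {both_positive:.4f}")

# Chance of one ticket winning a 6-from-49 jackpot (assumed lottery format)
jackpot = 1 / math.comb(49, 6)
print(f"Jackpot probability: {jackpot:.8f}")
```

The simulated false positive rate should hover around one in twenty, which is the Type I error rate we agreed to accept when we chose the 0.05 cut-off.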
Since computerised records have become routine, vast quantities of clinical data are held on various systems throughout the world. Some researchers make it their business to search through that data, looking for patterns and associations. This is an ideal way to produce the Type I error, to discover associations that are there due to chance. The search is known as data dredging. When the results are presented, the presenter can be described as a Texas sharpshooter. The Texas sharpshooter blasts away with his pistol on a barn door. After the bullets have hit, he walks up and chalks a target circle around each one. He then boasts how good a shot he is. The best defence against the Texas sharpshooter is to make him draw his targets in advance. This usually means a prospective study, with a predefined null hypothesis.
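The arithmetic behind data dredging is simple enough to sketch. Assuming the comparisons are independent and that none of the associations is real (both assumptions are ours, made for illustration), the chance of at least one "significant" result grows quickly with the number of comparisons tried.

```python
def chance_of_false_alarm(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests is 'significant'."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 10, 20, 100):
    print(f"{n:>3} comparisons: {chance_of_false_alarm(n):.2f}")
```

With twenty comparisons the sharpshooter is more likely than not to find a bullet hole worth chalking a target around.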
When using the results of trials to help guide clinical practice, it is generally much more useful to know how much of an effect the treatment had. This is known as the Effect Size (ES). The measured effect size comes directly from the trial data and is known. But that measured effect size would not be exactly the same if we repeated the study again. Nor would it have been exactly the same if we happened to have recruited some slightly different patients to those we actually had. We therefore also need an estimate of how accurate the study was in measuring effect size. The spread of possible values of effect size, up to a certain level of probability, is known as the confidence interval (CI). The 95% CI is often used, by convention, and is analogous to the p-value of 0.05. It means that there is a one in twenty chance that the real effect size is greater than the upper limit, or less than the lower limit, of the confidence interval. A 99% confidence interval means that we would expect only one in a hundred repetitions of the study to give an effect size beyond the limits. Naturally, a 99% confidence interval is wider than a 95% confidence interval for the same data. When we have two sets of observations, we can calculate an effect size and confidence interval at our chosen level. If the selected confidence interval for the effect size includes zero then we cannot say that we have demonstrated a positive effect of treatment.
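As a rough illustration of effect size and confidence interval, the sketch below takes two small sets of made-up symptom scores and calculates the difference in means with 95% and 99% intervals. The data are hypothetical, the function name is ours, and the interval uses a normal approximation; with samples this small a t-distribution would be more appropriate and would give somewhat wider limits.

```python
import math
from statistics import NormalDist, mean, variance

def mean_difference_ci(treated, control, level=0.95):
    """Effect size (difference in means) with a normal-approximation CI."""
    effect = mean(treated) - mean(control)
    standard_error = math.sqrt(variance(treated) / len(treated)
                               + variance(control) / len(control))
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    return effect, effect - z * standard_error, effect + z * standard_error

# Hypothetical symptom-score improvements, for illustration only
treated = [12, 15, 9, 14, 11, 16, 13, 10, 15, 12]
control = [10, 11, 8, 12, 9, 13, 10, 9, 11, 10]

for level in (0.95, 0.99):
    effect, low, high = mean_difference_ci(treated, control, level)
    verdict = "includes zero" if low <= 0 <= high else "excludes zero"
    print(f"{level:.0%} CI: {effect:.1f} ({low:.2f} to {high:.2f}), {verdict}")
```

The point estimate is the same at both levels; only the width of the interval changes, the 99% interval being the wider of the two.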
A Type II error means the study failed to find a difference that really exists. The usual cause of a Type II error is that there weren't enough patients in the trial.
Mostly, we don't know what the true effect size is; that is what we are trying to find out. But we can take an educated guess.
We can then design the study to be able to find an effect of that size. When we say we, we do mean we. Doctors and patients should decide together on what would be the minimum important difference in outcome that would lead to a change in practice. For example, the trialists might interview rhinitis patients to find out how much better they would have to feel to make it worth them taking a twice daily nasal spray.
As well as having an idea of the size of the effect, you need some idea of its variability. How big are the differences between individuals due to natural variation? If there is a lot of variability (noise) in the data, you will need a bigger sample. The conventional parametric statistical method (which assumes a normal distribution - not always the case) is to estimate the standard deviation. Based on these considerations, most study protocols estimate the sample sizes needed to give an 80% chance of picking up the effect, should it exist. If we want to be more sure of finding an effect, we could go for a higher percentage chance, and that would mean more patients.
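The considerations above can be sketched as a simple sample size calculation. This uses the standard textbook normal-approximation formula for comparing two means; it is not the method of any particular trial or of the SISA software, and the half-a-standard-deviation difference is a hypothetical choice. Dedicated software will usually give slightly larger numbers because it allows for the standard deviation being estimated.

```python
import math
from statistics import NormalDist

def patients_per_group(min_important_difference, standard_deviation,
                       alpha=0.05, power=0.80):
    """Sample size per arm for comparing two means (normal approximation).

    n = 2 * (z_{1-alpha/2} + z_{power})^2 * (SD / difference)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    ratio = standard_deviation / min_important_difference
    return math.ceil(2 * (z_alpha + z_power) ** 2 * ratio ** 2)

# Hypothetical: the smallest worthwhile difference is half a standard deviation
n_80 = patients_per_group(0.5, 1.0, power=0.80)
n_90 = patients_per_group(0.5, 1.0, power=0.90)
print(f"80% power: {n_80} per group, 90% power: {n_90} per group")
```

Halving the minimum important difference, or doubling the standard deviation, quadruples the number of patients needed, which is why noisy outcomes and small effects make for large trials.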
Power calculations are done before the study begins, at the design stage. Once the study is under way, if it becomes obvious that we have a bigger beneficial effect than anticipated, it may be ethical to stop the trial before reaching the number of participants estimated by the power calculation. This cannot always be done. If the study is double blind, it will be necessary to break the code to discover in which group the larger than expected effects have occurred.
Free software is available to help with power calculations from the Simple Interactive Statistical Analysis SISA website http://home.clara.net/sisa.
In the recent past, it was common to see studies reported which had inadequate numbers of patients. A Type II error was commonly present, but not always recognised. Often, different studies would give contradictory results. A reviewer would, typically, choose the results of a selection of favoured studies and come up with an overall conclusion. He may well give undue weight to some studies - for example those conducted by people he knew and trusted, studies he had read recently - while ignoring others. He may not have known about some relevant studies, for example in foreign language journals. A better method of reviewing was called for. In a systematic review, the reviewers
These are the principles underlying systematic reviews and meta-analysis. As outlined above, the commonest cause of a Type II error - failure to show a difference when there is one - is inadequate sample size. If the natural variation in the outcome measure is high, and the effect size is small, it will be hidden in the background noise. Background noise, being random, should cancel out with large numbers. By pooling the results of lots of studies we may be able to see the hidden effect more clearly. The commonest method for doing this is to produce a forest plot. The forest plot forms the centre of the Cochrane logo.
The diagram shows the results of a systematic review of seven RCTs of a short, inexpensive course of a corticosteroid given to women about to give birth too early. The first of these RCTs was reported in 1972. The diagram summarises the evidence that would have been revealed had the available RCTs been reviewed systematically a decade later.
The forest plot indicates strongly that corticosteroids reduce the risk of babies dying from the complications of prematurity. By 1991, seven more trials had been reported, and the picture had become still stronger. This treatment reduces the odds of the babies dying from the complications of prematurity by 30 to 50 per cent.
Because no systematic review of these trials had been published until 1989, most obstetricians had not realised that the treatment was so effective. As a result, tens of thousands of premature babies have probably suffered and died unnecessarily (and needed more expensive treatment than was necessary). This is just one of many examples of the human costs resulting from failure to have a structured programme to evaluate new health technologies. Performing systematic, up-to-date reviews of RCTs of health care does, of course, rely on the research studies being done in the first place. The purpose of the Cochrane review is to discover what is already known. It is especially valuable where the effect size is too small, amongst the other variability, to be highly obvious within one caseload.
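To show how pooling can reveal an effect that no single trial demonstrates, here is a minimal sketch of one common approach, fixed-effect inverse-variance pooling of odds ratios. The four trials are entirely hypothetical and are not the corticosteroid trials in the Cochrane logo; the function names are ours, and a real review would also consider heterogeneity between trials and the possibility of a random-effects model.

```python
import math

def log_odds_ratio(events_t, n_t, events_c, n_c):
    """Log odds ratio and its variance from one trial's 2x2 table."""
    a, b = events_t, n_t - events_t      # treated: deaths, survivors
    c, d = events_c, n_c - events_c      # control: deaths, survivors
    return math.log((a * d) / (b * c)), 1/a + 1/b + 1/c + 1/d

def pooled_odds_ratio(trials):
    """Fixed-effect, inverse-variance pooled odds ratio with 95% CI."""
    estimates = [log_odds_ratio(*trial) for trial in trials]
    weights = [1 / var for _, var in estimates]
    pooled = sum(w * y for w, (y, _) in zip(weights, estimates)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return tuple(math.exp(x) for x in (pooled, pooled - 1.96 * se, pooled + 1.96 * se))

# Hypothetical trials: (deaths treated, n treated, deaths control, n control)
trials = [(12, 100, 20, 100), (5, 60, 9, 60), (8, 80, 14, 80), (3, 40, 5, 40)]

for trial in trials:
    y, var = log_odds_ratio(*trial)
    low = math.exp(y - 1.96 * math.sqrt(var))
    high = math.exp(y + 1.96 * math.sqrt(var))
    print(f"single trial OR {math.exp(y):.2f} ({low:.2f} to {high:.2f})")

or_pooled, or_low, or_high = pooled_odds_ratio(trials)
print(f"pooled OR {or_pooled:.2f} ({or_low:.2f} to {or_high:.2f})")
```

In this made-up example each trial on its own has a confidence interval that crosses an odds ratio of 1 (no effect), while the pooled interval does not. Each line printed corresponds to one horizontal bar of a forest plot, with the pooled estimate playing the part of the diamond at the bottom.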
The main reason for doing systematic reviews is to make maximum use of research that has already been done. The Cochrane Collaboration was not founded to encourage more RCT's. It was founded to realise the value of work already done. In business terms, it is sweating the asset. The asset is that gigantic and ever-expanding repository of information contained in the medical literature. Checking what is already known avoids wasteful and pointless duplication of research effort. It minimises delay in getting the benefits of research based knowledge into practice. Identifying gaps in the knowledge base is another important function, to help direct future research efforts.
In 2005, The Lancet announced a policy to tackle unnecessary and badly presented research (Young and Horton, 2005). They stated that unnecessary clinical trials
Authors of clinical trials are now required to include
To judge the results of a trial, we usually (but not always) need statistics. If, let us say, we were conducting a randomised controlled trial of the effectiveness of parachutes on survival when jumping out of an aeroplane at 10,000 feet, we would have the following null hypothesis:
"if parachutes are ineffective, then the mortality rate will be the same, whether or not the parachute is worn"
When the first randomly assigned participant without a parachute hit the ground at terminal velocity, we might decide that we didn't need any statistics, perhaps not even a trial, to decide this question. The first rule of statistics is - or should be - that you don't always need statistics. Statistics is an aid to common sense, not a substitute.
The parachute trial is an extreme example, but surgeons are increasingly told that there is "no evidence base" for the majority of their work. One reason is they don't need a trial to tell them that controlling that bleeding artery is the right thing to do.
In controlling a bleeding artery, the effect size is large, and the time interval between intervention and observable result is very short. The signal to noise ratio is very high. Skill, training and judgement are needed to achieve the result, and none of these are amenable to double blind randomised controlled trial. Although this should be obvious, it has only recently been emphasized in EBM circles that, where the signal to noise ratio is high, an RCT is not necessary (Glasziou, 2007).
The RCT of bleeding arteries, like the RCT of parachutes, will never, ever, be done. If someone was foolish enough to look for, fail to find, then publish the fact that there is no RCT evidence for the benefit of controlling a bleeding artery, Archie Cochrane would turn in his grave. He was a practising doctor, who served his time burying his tuberculous patients as a prisoner in the Second World War.
The sort of cases where statistics are needed are those where the effect size is small, and the time interval between intervention and result is long - like most drug trials. That is what RCTs were designed for, and that is what they are good at.
The RCT model can and should be applied to some surgical interventions. Such interventions are typically for conditions where
These are good grounds for questioning the value of any intervention. Most surgeons would agree that such procedures should be subject to randomised controlled trials. Designing RCT's for surgical interventions is, however, considerably more difficult than for medical interventions. Consequently, few surgical trials are done. Of those that are published, most fail the strict criteria for inclusion in Cochrane systematic reviews. Surgical trials are not as easy to organise as drug trials. In a drug trial, it makes little difference who writes the prescription. In a surgical trial, the skill of the individual operator is a major factor. Randomisation is possible, but concealment of treatment allocation is not. No-one wants a blindfolded surgeon operating on them. It is likely that the Shamanistic rituals of surgery induce a sizeable placebo effect (Green, 2006), yet sham surgery is not ethical as a control group. These factors combine to make the surgical literature a barren landscape when searching for high quality RCT's. That is no reason not to look, and no reason not to try, but the absence of strong RCT-based evidence is to be expected in much of surgical practice.
When we talk about strong evidence, what we mean (in conventional statistical terms) is that
Strong evidence is not the same as a big important effect. You can get strong evidence by having lots of patients in your trial, even though the size of the effect is small. Strong evidence does not mean good medicine. Neither does absence of strong evidence mean bad medicine.
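A short calculation makes the distinction concrete. Assuming a two-sided z-test on a standardised difference (an illustrative simplification, with a hypothetical effect of one tenth of a standard deviation), the same small effect moves from unconvincing to overwhelming purely by adding patients.

```python
import math
from statistics import NormalDist

def two_sided_p(effect_in_sd, n_per_group):
    """p-value of a two-sample z-test for a given standardised difference."""
    z = effect_in_sd / math.sqrt(2 / n_per_group)
    return 2 * (1 - NormalDist().cdf(abs(z)))

small_effect = 0.1   # hypothetical: one tenth of a standard deviation
for n in (50, 500, 5000):
    print(f"n = {n:>4} per group: p = {two_sided_p(small_effect, n):.4g}")
```

The effect is equally small in all three cases; only the strength of the evidence for it has changed.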
A study looking at the effectiveness of a treatment may suffer from bias in many ways.
The doctors carrying out the study could, subconsciously or otherwise, pick patients they thought would do better for the treatment group. Even if some apparently reasonable non-random method of producing a control group were used, there could be selection bias. For example, if patients attending a Monday clinic were allocated to the treatment group, and those attending a Wednesday clinic as controls, it could be that patients attending on Mondays were generally sicker than Wednesday's patients. Randomization protects against unknown as well as known sources of selection bias.
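In its simplest form, random allocation takes the decision out of the doctor's hands, as in the sketch below. The function name and patient numbers are hypothetical, and this is a bare illustration rather than a recipe: real trials generally use centrally concealed allocation, often with blocking or stratification.

```python
import random

def randomise(patient_ids, seed=None):
    """Allocate patients to two equal-sized arms by shuffling the list."""
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

allocation = randomise(range(1, 21), seed=7)   # hypothetical patient numbers
print(allocation)
```

Because the shuffle knows nothing about the patients, it cannot favour the healthier ones, whether the relevant differences are ones we have thought of or not.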
Those charged with observing and recording the results of the study could, subconsciously or otherwise, give an inflated opinion of the results in the favoured treatment group, while minimizing or ignoring any adverse effects. They might, at the same time, be more assiduous in looking for poor results in the control group. This source of bias can be removed by blinding the observer to the treatment group. Blinding can't always be achieved. If patients in the treatment group have a surgical scar and the control group haven't, that could be a bit of a give-away.
The study organizers might assume that patients never came back because all was well, when in fact they didn't come back because they were dissatisfied or had even died.
Patients in the treatment group may feel special and act differently than the control group. For example, they might
Any of these mechanisms could cause effects additional to and separate from the treatment being studied.
The way to avoid participant bias is to blind the patient as to which treatment they are getting. This can't always be done. For example, it may become known that the real medicine has a certain taste, while the placebo doesn't. It is very difficult to blind the participant to a surgical treatment. A scar is a scar. Sham surgery has been conducted in the past, but is now considered unethical.
The choice of outcome measures plays a crucial part in determining the results of clinical trials. The development of the RCT model in medicine was based largely on drug trials in otherwise fatal conditions - especially respiratory infections such as pneumonia and tuberculosis. The outcome measure was simple - the patient was either alive or dead. But the bulk of modern medical and surgical interventions are not to avoid death, they are to improve the quality of life. Until recently, this was thought too difficult to measure, but the application of psychometric techniques to patients' symptoms in the 1980s began to allow a more quantitative approach to soft outcome measurement (Powell, 1989).
Before the 1980's, medical practitioners, and perhaps particularly surgeons, did not like to dwell too much on the subjective symptoms of their patients, particularly any that persisted after operation.
Lavelle and Harrison (1971) reported their results of middle meatal antrostomy purely on a technical measure of success, the continued patency of the surgical opening into the sinus. They deliberately excluded patients' views from their analysis, stating that
"little is achieved by quoting figures and statistics, as the results depend to a great extent on subjective response of a patient"
In the 1970s, the established medical view was that symptoms were important, but mainly as clues in a jigsaw puzzle. The aim was to establish a diagnosis and thereby institute appropriate treatment. In scientific studies of the effectiveness of treatment, most doctors would prefer objective to subjective measures of outcome. The surgeon prefers to know that the patient is cured of the disease, rather than whether he merely feels better. That requirement for objective measures of successful outcome is difficult where
For example, in chronic rhinosinusitis, the correlation between subjective symptom severity and a variety of objective findings is around 7% (Fairley, 1993). The disease itself is now formally defined purely on the basis of persistent symptoms (Fokkens et al, 2007). The pure symptom-based definition is restricted to epidemiologists and general practitioners. Specialists making the diagnosis are required to undertake at least one form of objective diagnostic examination.
All studies which attempt to correlate symptoms with disease severity suffer from a similar philosophical and methodological difficulty. That difficulty lies in the definition of disease, which is often tenuous. If our gold standard for diagnosing and rating disease severity rests on some objective tests, and symptoms are compared against it, we are making an implicit value judgement. Symptoms are somehow less important than signs, and need to be accounted for by physical findings. Radiology, endoscopy, microbiology, surgical exploration and histology are all examples of physical findings. Although it is of course necessary to look carefully for physical findings, especially where these may reveal serious disease or will change management, it is intellectual arrogance to conclude from the absence of detectable pathology that there is nothing wrong with the patient. That is why psychiatry was the first area of medicine to develop a methodology for making reliable and valid outcome measures based on symptom questionnaires. Since they didn't have too many physical findings to distract them, they set about measuring what they could. It now turns out that many of our objective measures are at best loosely correlated with what the patients have come to us about in the first place - symptoms. So, if we want to find out if we are doing any good, just measure the symptoms before and after, and regard what happens in between - the medical intervention - as a black box.
An explosion of research interest in the 1990s and early 21st Century has resulted in hundreds of disease-specific outcome measures, as well as numerous validated general health outcome measures. We are now spoilt for choice. In orthopaedics, the number of symptom / questionnaire based outcome measures available now exceeds the number of joints in the human body. Comparative evaluations are needed to decide which measure to use (Beaton et al, 2001, Roach 2006). Despite their widespread application, there remain significant difficulties in defining suitable outcome measures. Although a great deal of time and effort has gone into developing reliable, valid and responsive outcome measures, the choice of which to use is, in the end, subjective. It is invariably influenced by the sponsors of the trial. It is essential that the outcome measures used in any particular trial are relevant to the clinical question being asked.
Health insurers and governments funding health expenditure worldwide are looking to EBM to cut expenditure on self limiting conditions. They might save money by not paying for crutches for patients with broken legs. How about an RCT of crutches? Of course, the patients denied crutches would not be able to walk for a while, but, once the leg had healed, and certainly by one year, they should be walking again. By choosing an outcome measure
"ability to walk one year following the injury"
and comparing patients randomly allocated either to receive or not receive crutches, the trial would probably conclude "no evidence of benefit" from crutches in the treatment of broken leg, a self limiting condition. But surely no one would take such a trial seriously. Or would they? Look at the outcome measures chosen in trials of grommet insertion, sponsored by the UK Government, for children with hearing loss due to glue ear. Following grommet insertion, most children get a dramatic improvement in hearing. The average grommet lasts nine months, during which hearing remains good. Once the grommets come out, a minority will get further glue ear. Meanwhile, a large proportion of the children who did not receive grommets will slowly clear the fluid and their hearing will improve. Those who don't are often given grommets anyway, but the results are reported on the basis of "intention to treat" - so the benefit accrues to the non-treatment group. The trials report hearing results at one and two years, when most of the grommets have fallen out. Dramatic and consistent short term improvements are ignored in the conclusions.
Conventions have evolved in recent years to improve the reliability and validity of outcome measures. Yet choice of outcome measures for trials remains subjective. Interested parties may well seek to introduce bias at the design stage. A drug company would naturally like to choose an outcome measure which shows a positive effect for their product, even if that measure is only indirectly related to patient-perceived benefit. Increasingly, there is a formal regulatory apparatus with semi-public consultation and justification. An a priori declaration of interests is one way to avoid the bias of only reporting the one outcome measure that shows what you wish to prove.
A good outcome measure will be reliable, valid, and responsive to change in the group to which it is to be applied. The three properties of reliability, validity and responsiveness are related, with some overlap, but distinct.

A 30 cm ruler gives a reliable measurement of the diameter of a thinker's head. We can show that
But it is not a valid measure of what the head is thinking. It is measuring a different domain altogether.
That example may appear trite. A child would spot the error. It is obviously wrong to try and measure what someone is thinking with a ruler. But this type of mistake is rife in clinical research. It is not always so obvious that the chosen outcome measure is invalid. Indeed, the error may be embedded in our collective medical culture, in our limited understanding and flawed concept of the disease.
Berg and Carenfelt (1988) studied 155 patients with suspected acute sinusitis. An algorithm based on symptoms and signs was compared with the "gold standard" of maxillary sinus empyema versus not empyema, as established by antral aspiration (sinus washout). 68 patients were found to have an empyema and 87 not. Purulent flow from the middle meatus seen on rhinoscopy was pathognomonic for empyema when seen, but only occurred in 6 cases. Severe cacosmia was also of high positive predictive value, but only occurred in 12 cases. Of symptoms that occurred frequently, unilateral predominance of pain or purulent rhinorrhoea were found to be strongly predictive of empyema. By combining the analysis of symptoms with a high ESR, they found that "diagnostic reliability" of their algorithm could reach 80%. However, this begs the question as to whether the gold standard they used - namely antral aspiration - was valid. From a perspective of 20 years later, it almost certainly was not.
The misapplication of reliable, yet invalid, outcome measures is increasingly likely as outcome measures proliferate.
A reliable outcome measure will give the same answer when the same thing is measured repeatedly. This is known as test-retest reliability. If the outcome measure is an observer rating, inter-rater reliability is important. Different observers, ostensibly applying the same rules, may well give different ratings - like the judges at a dance competition. Intra-rater reliability (consistency) can also be assessed. For example, in grading the severity of facial palsy using the House-Brackmann scale, the same observer is asked to rate the same series of clinical photographs a few weeks apart.
Another measurement of reliability for summed questionnaire scales is Cronbach's alpha. Cronbach's standardized item alpha coefficient is a generalised measure of reliability, based on the internal consistency of the scale. It is calculated from the average inter-item correlation and the number of items in the scale. Alpha behaves as a squared correlation coefficient and ranges from 0 (none) to 1 (perfect). If the number of items in the scale is large, the inter-item correlations do not have to be so high to obtain high reliability scores. Reliability in this context means the extent to which the total symptom score is likely to give the same result as another similar measurement of symptom severity. If each item on the questionnaire is measuring some part of a related concept (overall severity of the condition) then individual items should be correlated with one another to the extent that they are measuring the common entity. The result can be interpreted as the extent to which the scale tested would be expected to correlate with all other possible k-item scales, constructed from a hypothetical universe of questions on the subject of interest. Another interpretation is that alpha times 100% of the variability in a hypothetical test, composed of all possible questions on the subject of the questionnaire, would be accounted for by the results of the k-item test used. In the senior author's study of the reliability and validity of a symptom scoring scale for rhinosinusitis (Fairley, 1993) Cronbach's alpha was calculated on a series of 411 patients attending ENT out-patient clinics with a variety of conditions and found to be 0.78. This is reasonably good. Alpha should certainly be over 0.5, and ideally over 0.8.
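The dependence of alpha on the number of items can be illustrated with a short calculation. The sketch below assumes the usual formula for the standardized item alpha, alpha = k x r / (1 + (k - 1) x r), where k is the number of items and r is the average inter-item correlation. The function name and the example figures are ours, chosen for illustration only, and are not taken from the rhinosinusitis study.

```python
def standardized_alpha(k, mean_r):
    """Standardized item alpha from the number of items (k) and the
    average inter-item correlation (mean_r).

    Assumed formula: alpha = k * r / (1 + (k - 1) * r).
    """
    if k < 1:
        raise ValueError("a scale needs at least one item")
    return k * mean_r / (1 + (k - 1) * mean_r)


# Hypothetical scales sharing the same modest average correlation of 0.2.
# The longer the scale, the higher the resulting alpha.
for items in (5, 10, 20):
    print(f"{items:2d} items, mean r = 0.2 -> alpha = {standardized_alpha(items, 0.2):.2f}")
```

With an average inter-item correlation of only 0.2, alpha comes out at roughly 0.56 for 5 items, 0.71 for 10 and 0.83 for 20. A long questionnaire can therefore score well on alpha even when its individual questions are only loosely related.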
The first test of validity is simply to look at the questions and consider them at face value, to see whether they make sense. In most clinical studies involving questionnaires, up until the 1990s, this "face validity" was the only kind of validation that took place. Face validity is usually enough to spot a gross error. If you are planning on using information based on questionnaire outcome measures, it is a good idea to read the questions.
The next test of validity is to consider whether questions cover all aspects of the concept being measured. Various points may or may not be important depending on the use to which the scale is to be put. It must be borne in mind that a more complex and time consuming questionnaire is less likely to be of general use. Formal methods of establishing content validity start out with a very large number of questions. These are culled from other studies of the problem, expert opinion, and unstructured interviews with patients of the Grounded Theory type (Glaser and Strauss, 1967). Questionnaires based on these are tested on groups of patients, and by techniques such as cluster analysis (Norusis, 1988) independent dimensions are discerned and redundant questions can be eliminated progressively.
This means testing your proposed outcome measure against another, already known to be valid (Powell, 1989). Unfortunately, the commonest reason to introduce a new measure is precisely because no such "gold standard" is available.
The final and most difficult test of validity is "Construct validity". Simply put, construct validity is whether the measure really measures what it is supposed to measure. Does it do what it says on the tin? That question may be easy to answer for a wood preservative, but not so for a questionnaire. In using an outcome measure that tries to quantify something from the patient's point of view, we are trying to measure an abstract concept or construct. As an example, here is some of the discussion that went into evaluating the validity of an early questionnaire for nasal symptom severity, the subject of a Master's thesis by the senior author (Fairley, 1993)
If the nasal symptom scores really are measuring nasal symptoms, it would be reasonable to expect higher scores in patients suffering from nasal conditions.
A great deal of research effort has gone into developing general measures of Quality of Life (QoL) such as the 36-Item Short Form Health Survey (SF-36® Ware, 2003). Such measures are inherently flawed when applied to individuals, each of whom will have their own specific health problems. If you are deaf, the fact that you can tick a box on a QoL questionnaire that you can climb a set of stairs without getting out of breath doesn't really make your deafness any less of a problem to you. The main use for general QoL measures is in deciding where to place healthcare resources. Insurance-based systems, whether run by the state or private companies, remove the need for individuals to pay for treatment at the time of need. They spread the risk of having some horribly expensive disease. But they also remove the incentive for the individual to seek value for money. The price a scheme member pays for peace of mind, for not having to worry about medical bills, is more than just the premium. The price includes the fact that, when it comes to your claim, somebody else decides what will and will not be covered. And when you have paid into the scheme, but develop an expensive health problem that does not score highly on the general quality of life scale, you might feel somewhat aggrieved to discover you aren't covered. A health commissioning organization, whose job it is to decide priorities for resource allocation, will be biased in favour of general outcome measures. A general measure of benefit helps them compare the value of one treatment with another. But you can't really compare apples with oranges, let alone with a fillet steak. A general measure of food benefit is clearly not very sensible. Each foodstuff has its own contribution to make. Yet we have the SF-36 and similar general outcome measures being used (misused) to show how much benefit we get from a cochlear implant compared to a hip replacement. Dr John E Ware, writing on the SF-36.org website, states that
"clinical trials to date demonstrate that the SF-36 is very useful for descriptive purposes such as documenting differences between sick and well patients and for estimating the relative burden of different medical conditions" (our emphasis)
But everything depends on the questions asked. A cochlear implant will not help you get up the stairs without getting out of breath. Neither will an apple give you your recommended daily allowance of protein.
The biggest difference between condition-specific and general outcome measures is likely to be in their responsiveness. General QoL measures all take the form of a weighted shopping basket, just like the official estimates for economic inflation. If you happen to suffer from / need an item that ain't in the basket, it won't show up in your score. You need a condition specific outcome measure. And don't let them fob you off with the wrong shopping basket.
Most countries have laws restricting extravagant and unfounded claims for medical treatments. But there are always loopholes. Marketing departments of drug companies are very good at finding them. Some pharmaceutical companies spend more money marketing their products than they do on research and development (Gagnon and Lexchin, 2008). There are lots of ways to make your product look good. Glossy ads in medical magazines are just a small part of it. But surely the presentation of the underlying dry figures and statistics can't mislead. Well, yes it can, it does, and it's all legal.
If a drug rep told you their new molecular engineered prostacyclin analogue gave a 25% reduction in the risk of stroke compared to plain old aspirin 75mg, you'd probably be impressed. But you need to know something else, that s/he didn't tell you. The 25% risk reduction in favour of the new (expensive, potential late side effects unknown) drug is true. But that is only part of the story. To decide whether to prescribe, you need to know what is the risk of stroke in the patients you plan on prescribing for - presumably those whom you already had on 75mg aspirin. If their risk of stroke is still high, a 25% difference is impressive. But if their risk is already low, it is less so. A 25% reduction in something that is already very small is something very small indeed.
When you ask to see the data from the trial, you see that 3 percent of the patients on the new drug had a stroke over a five year period, compared with 4 percent of the patients on aspirin.
The Relative Risk (RR) of new vs old is therefore 3/4 = 0.75 = 75%. Therefore the reduction in risk by 25% appears correct.
But the Absolute Risk Reduction (ARR) is only 4% - 3% = 1% = 0.01.
Somehow, that doesn't seem quite so impressive as 25%.
The Number Needed to Treat is the reciprocal of the ARR = 1/0.01 = 100.
That means you would have to treat one hundred patients with the new medication, over the five year period, in order to prevent one stroke. Well, stroke is a very serious disease, and you may well take the view that it is worth it. But don't get taken in by risk reduction figures for uncommon events.
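The arithmetic of the stroke example can be checked with a few lines of Python. This is only an illustrative sketch: the function name risk_summary is ours, and the only data used are the 4 percent and 3 percent five year stroke rates quoted above. Exact fractions are used so that rounding in floating point does not blur the result.

```python
from fractions import Fraction


def risk_summary(control_rate, treatment_rate):
    """Return RR, RRR, ARR and NNT for a control and a treatment event rate."""
    rr = treatment_rate / control_rate      # relative risk, new vs old
    rrr = 1 - rr                            # relative risk reduction
    arr = control_rate - treatment_rate     # absolute risk reduction
    nnt = 1 / arr                           # number needed to treat
    return {"RR": rr, "RRR": rrr, "ARR": arr, "NNT": nnt}


# Five year stroke rates from the example in the text.
aspirin = Fraction(4, 100)
new_drug = Fraction(3, 100)

summary = risk_summary(aspirin, new_drug)
for name, value in summary.items():
    print(f"{name}: {float(value):g}")
```

The same 25% relative reduction applied to a control rate of 40 percent instead of 4 percent would give an ARR of 0.1 and an NNT of 10, which is why the baseline risk has to be known before the headline figure means anything.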
Absolute Risk (AR) = the event rate for a given individual
Relative Risk (RR) = the event rate in the exposed group / event rate in the control group
Absolute Risk Reduction (ARR) = event rate in control group, minus event rate in treatment group
Number Needed to Treat (NNT) = the number needed to treat to avoid 1 bad outcome = 1 / ARR.
The Absolute Risk and Relative Risk terms are particularly relevant to the epidemiology of a disease, and can also be applied to treatment and control groups in studies. The Number Needed to Treat is the reciprocal of the Absolute Risk Reduction. The ARR and NNT are important to know when making evidence based treatment recommendations.
A disease has an Absolute Risk for an individual of 8 in 100. That means 8 out of every 100 get it.
If smoking increases the Relative Risk of this disease by 50% compared to someone who does not smoke, the 50% applies to the 8 percent.
Therefore Absolute Risk for that individual goes up to 8 + 4 = 12 in 100.
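The step from a relative increase to an absolute risk is easy to get wrong, so a small sketch may help. The function name is hypothetical, and the figures are simply those of the example above: a baseline risk of 8 in 100 and a 50% relative increase.

```python
from fractions import Fraction


def absolute_risk_with_exposure(baseline_risk, relative_increase):
    """Apply a relative increase in risk to a baseline absolute risk."""
    return baseline_risk * (1 + relative_increase)


baseline = Fraction(8, 100)            # 8 in 100 for a non-smoker
increase = Fraction(50, 100)           # smoking raises the risk by 50%

smoker = absolute_risk_with_exposure(baseline, increase)
extra = smoker - baseline              # additional cases per person exposed

print(f"Non-smoker: {float(baseline) * 100:g} in 100")
print(f"Smoker:     {float(smoker) * 100:g} in 100")
print(f"Extra risk: {float(extra) * 100:g} in 100")
```

The 50% is applied to the 8, not added to it: the smoker's risk is 12 in 100, not 58 in 100.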
A trial found that 12/100 suffering from a certain disease had a bad outcome in a control group, compared with 10/100 in the treatment group.
The ARR from treatment of that disease would be (12/100) - (10/100) = 0.02. Therefore the number of patients you would need to treat to stop 1 patient getting a bad outcome would be 1/0.02 = 50
You would need to treat fifty patients in order to prevent just one bad outcome. If the treatment was expensive or had a lot of side effects, that may not be a very good deal for the forty nine patients who paid for treatment, and ran the risk of side effects, without gaining any benefit. Deciding on whether a treatment does represent a good deal is a value judgement. It depends on the seriousness of the adverse outcome we are trying to prevent. If the adverse outcome is death or major disability, then a NNT of 50 might be acceptable. If the adverse outcome is a minor nuisance, and the treatment onerous or expensive, you - or your patient - might decide it's not worth the trouble. If you know the NNT, you can offer your patient a better informed choice than if you just know the treatment has some beneficial effect.
Recently published papers in major medical journals usually include the NNT. However, not all do so. Bandolier have an online NNT calculator. Their downloadable worksheet provides a template to calculate NNTs from papers and systematic reviews. It is a useful educational exercise to try filling this in yourself, from the information provided in a published paper. Even if the NNT is already given, you can check the workings for yourself, and thereby have a better understanding of how the figure is arrived at.
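When a paper reports only the raw numbers in each group, the NNT can be worked out directly from the counts. The following is a minimal sketch of that check, not a reproduction of the Bandolier worksheet. The function name is ours, and rounding the result up to the next whole patient is a common convention that we have assumed here.

```python
from fractions import Fraction
import math


def nnt_from_counts(events_control, n_control, events_treated, n_treated):
    """Number needed to treat, from the event counts in each trial arm."""
    control_rate = Fraction(events_control, n_control)
    treated_rate = Fraction(events_treated, n_treated)
    arr = control_rate - treated_rate          # absolute risk reduction
    if arr <= 0:
        # The treatment arm did no better than control, so no NNT exists.
        raise ValueError("no absolute risk reduction; NNT is undefined")
    return math.ceil(1 / arr)                  # assumed: round up to whole patients


# The trial described above: 12/100 bad outcomes on control, 10/100 on treatment.
print(nnt_from_counts(12, 100, 10, 100))
```

The guard clause is deliberate: if the treated group fares no better than the control group there is no risk reduction to invert, and quoting an NNT would be meaningless.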
If a randomised controlled trial shows there does appear to be a positive effect that is not due to chance, or to the way the study was done, can it be applied to your population? This is a crucial question. Even within the same geographical area, different population groups exist. Individuals each have their own specific situation. It is the responsibility of the individual doctor to explore, with the individual patient, whether the evidence really applies in each case.
We cannot measure everything. Some of the things that matter most to patients can't be reduced to figures and statistics. Often, a combination of common sense, empathy and gut feeling provides enough practical guidance. But we cannot always rely on our own perspective. Professionals can, all too easily, fall into a rut of blinkered complacency. If we want to find and apply best evidence in certain areas, we need to look at qualitative studies.
Qualitative studies are valuable in
We have to ask a clear question. Only then do we have any chance of discovering the answer. The research question in clinical medicine is not always obvious. In clinical research, the choice of outcome measures is crucial. Formal qualitative studies, looking at what really matters to patients, are helpful in deciding which outcome measures to use in subsequent quantitative studies of treatment effectiveness.
Qualitative studies can also be valuable in finding out why previous studies have produced contradictory or uninterpretable findings. They can help refocus research.
Areas where qualitative studies are likely to provide the best evidence include
As with any form of evidence, we need to know which qualitative studies give us results we can rely upon.
The methodology for qualitative studies is less well established than for quantitative. Some principles have emerged and become generally accepted in recent years. Various authors (Pope 2000, Cutliffe 1999, Boulton 1996, Popay 1998, Beck 1993) have identified indicators to help distinguish good qualitative studies. To assess the methodological quality of a qualitative study, consider the following
It has become almost de rigueur to rank evidence in Levels. The aim - laudable enough - is to identify the strength of the evidence underlying any given management recommendation. At present, the hierarchy of evidence places the well conducted systematic review of RCTs at the top. However, we have seen that the most appropriate evidence varies, depending on the question being asked. Hierarchies placing systematic reviews of RCTs at the top are specifically about effectiveness of interventions. These levels of evidence should not be regarded as a sole and universal indicator of quality. Other schemes, similar in spirit, but differing in the detailed assignment of numbers, are also based primarily on the study design. Studies which reduce bias the most are placed at the top of the hierarchy. Reduction in bias is an important aspect of the quality of a study, but it is by no means the only thing that matters. In many important areas, we are unlikely ever to have high level evidence. As mentioned earlier, the absence of high level evidence of effectiveness is not evidence of absence of effectiveness. The hierarchy of evidence, and the recommendation gradings, are primarily based on the trial design rather than the clinical importance of the results. Careful thought, with skilled and judicious application to individual questions in individual cases, is essential. Otherwise, bad advice and bad decisions will result.
| Level | Type of Evidence |
|---|---|
| I | Systematic reviews of well controlled Randomized Controlled Trials (meta-analysis) or single RCT with narrow CI (confidence interval) |
| II | Systematic review of cohort studies or lesser quality RCTs |
| III | Case-control studies (non randomized) |
| IV | Case series (no control group) |
| (V) | Expert opinion (GOBSAT - Good Old Boys Sat Around Table) |
Recommendations can also be graded depending on the level of evidence on which they are based. Recommendation grades more or less follow the hierarchy
| Evidence Level | Recommendation Grade |
|---|---|
| I | A |
| II | B |
| III | C |
| IV, V | D |
Using EBM is closely allied with Problem Based Learning (PBL). Problem based learning is now incorporated into many medical student and continued professional development (CPD) courses. Problem based learning (Miller 1966)
Compared to didactic teaching, PBL takes longer and requires more effort. But knowledge obtained this way may be retained for longer. There is some evidence (Shin 1993) it leads to more lifelong use of core EBM skills such as
Modern medicine does not stand still. Continuous advances mean that all healthcare staff need lifelong learning skills.
It is up to individual doctors, in consultation with individual patients, to decide which evidence applies. Sackett (1996) identified the risk that, without clinical expertise, practice could become tyrannized by evidence. Even the best external evidence will not apply to all cases. Worldwide, healthcare commissioners, whether government or insurance based, look to evidence based clinical protocols and pathways to ensure value for money. Deviation from such guidelines can result in financial and even legal penalties for doctors. But strict adherence means a dumbed down, monodimensional, mechanistic and unthinking approach, which has been dubbed cookbook medicine. In 1996, Sackett stated that EBM could not result in a slavish, cookbook approach. But in 2007, the EBM cookbook has become very popular with health commissioners. Its recipes are chosen by committee. Health economists decide what is best for the population. The chosen dishes are then served up franchised, quality controlled to ISO 9000, McDonald's fashion. Newly created regiments of specialist practitioners, unburdened by the skepticism that follows a broad medical education, are taught to follow the rules, follow the guidelines, and all will be well. The public like it because they are getting their treatment quicker. But is this fast medicine good for your health in the long run? The best defence against this misapplication of EBM is to be skilled and confident enough to know when the guidelines apply, and when they do not.
Hampton (2003) wrote that
"Guidelines for medical management are now part of medical life. A fool - loosely defined as someone who does not know much about a particular area of medicine - will do well to follow guidelines when treating patients, but a wise man (again, loosely defined as someone who does know about the disease in question) might do better not to follow them slavishly. The problem is that the evidence on which guidelines are based is seldom very good. Clinical trials have a variety of problems which often make their relevance to 'real world' medicine dubious. The interpretation of trial results depends heavily on opinion, and a guideline that purports to be evidence based is actually often opinion based. A guideline will depend on the opinions of those who wrote it, and the wise man will use his judgement and give due weight to his own opinions and expertise."
A guideline has to be reasonably simple otherwise it is impractical. Cut-off points and categories of patients have to be specified. Management algorithms have to be drawn. A good guideline will cover around 80% of the cases in its area of application. If it tries to cover much more than that, it will become unwieldy to the point of being useless. It will become a textbook. By the time you've learned the textbook, you won't need to look up a guideline. A guideline is like a simple outline drawing, a flat cartoon character, a pixellated simplification of reality. If it has been well designed, it will fit reasonably well over most real life complex patients. If it has not, or if attempts are made to apply it to a population other than that for which it was designed, it will not fit. It will chafe. It will pinch, like a badly fitting shoe, and it will either be discarded or cripple the unfortunate wearer. Statistically, it is inevitable that doctors will not always follow guidelines. But they should be aware of them, and be able to justify deviations. In statistical modelling, we start with as complete and accurate a representation of reality as we can get. All factors and variables are taken into account, and the model is as near to perfect as can be. It is a fully saturated model. But it is far too complex to understand. It hasn't helped simplify anything. We then start removing elements from the model, one at a time, starting with those that least affect the representation of reality. When we reach as simple a model as we can, that doesn't differ too wildly from the fully saturated model, we call it quits. That is how modern evidence based guidelines can be made. Of course, we are deliberately removing complexity. When we come to use the model in practice, we have to accept that we may have to put some complexity back in.
A good doctor is often more concerned about the 20% of difficult cases than the 80% of routine. A good manager will tend to have the opposite priority, especially when that 80% is the main source of income. Very few managers have any real depth of knowledge of the subject of the guideline. In the UK, NHS managers are simply given targets to implement. Central bureaucrats produce a five year plan, much like the former Soviet Union. This explains many of the reservations we have over guidelines, especially when they begin to congeal into enforceable rules. When they become rigid, like the Procrustean bed, the wise will stay away from that institution. Procrustes (Greek mythology) lived in the hills. He tempted passing travellers to lie in his iron bed. But they had to fit exactly. Taller guests would have their protruding extremities cut off. Shorter folk were stretched on the rack to fit the size of the Procrustean bed. As medicine becomes more protocol-driven, we must beware doing the same to patients who don't quite fit the mould.
Some conditions and treatments are difficult to study by RCT. If the level of evidence is used as a criterion of clinical effectiveness, we risk denying patients effective treatments. Governments, health insurers and health maintenance organizations genuinely need to prioritize and ration limited resources. It is just too tempting to misuse EBM to justify rationing.
In the UK, the Department of Health and the Chief Medical Officer (CMO) have rubbished grommet and tonsil operations for years. They point out the lack of high level RCT evidence, and unexplained variations in the numbers of operations done. On 21 July 2006, the CMO's annual report (covering the year 2005) highlighted the clinical waste of unnecessary tonsil operations (Dept of Health, 2006). The point was rather crudely illustrated. A yellow clinical waste bin was shown stuffed with cash. Banknotes were falling over the edges, onto the operating theatre floor. The public were spared an X-rated version of blood-spattered money, soaking in a gory puddle. Spin doctors seem to have been involved in the production of this propaganda. The British Association of Otolaryngologists were not. The results of their national audit of tonsil operations (RCS 2005) - covering over 33,000 operations and the biggest cooperative study of the subject in the world to date - were quoted with extreme selectivity. The CMO's message, that taxpayers' money is wasted on unnecessary tonsil operations, was immediately picked up by the press and widely reported as fact. A letter of protest from the President of the BAOL to the CMO was fruitless. In 2007 the BMJ published the results of one of the first RCTs of tonsillectomy from Finland (Alho et al 2007). The results were clearly in favour of the operated group. The study has been criticised for its short follow up and use of throat swabs as an objective outcome measure. Further randomised controlled trials using longer follow-up and more patient-centred outcome measures will be needed to convince the paymasters. It will be difficult to persuade patients who have suffered severe and frequent tonsillitis that they should not have surgery. By the time UK patients reach an ENT surgeon, they seldom need persuasion of the merits of operation, though some will decline when informed of the risks.
External evidence can only be properly applied if you understand how it relates to your own local population, and to the individual patient.
The final step in critical appraisal of any external evidence is to check whether it is really relevant to your patient.
Where relevant evidence is available, there are many obstacles to getting it into practice (Haynes and Haines, 1998).
Doctors, nurses and allied health care workers train for years to establish their practice. They develop skills and expertise. But they can also get into a comfortable rut. When advances occur, and evidence suggests there is a better way of doing things, it is wasteful to delay. Introducing change takes time. People have to learn new skills. Procedures must be in place to protect patients during the learning curve. There are no magic bullets to bring about change in doctors' behaviour. A mixed approach is often best. Methods that have been shown to work in combination are
There is strong evidence that a didactic lecture on its own does not work. There is only weak evidence that money on its own works (Scott, Sims). A combination of educational initiatives is usually needed.
The CONSORT statement (www.consort-statement.org) is an evidence-based set of recommendations for reporting RCTs. Standardising the format of reports
The CONSORT statement, which is an evolving document, currently comprises a 22-item checklist and a flow diagram, with a brief text description. The checklist items cover
The flow diagram displays the progress of all participants through the trial. The Statement has been translated into several languages.
Researchers are less likely to submit studies for publication when the results are not positive and exciting (Olson 2002). PubMed therefore has an overall positive bias, when in fact negative results may be very important.
Publication bias can result in serious harm to patients, even deaths. Chalmers (2006) points out persistent biased under-reporting of research undertaken by the pharmaceutical industry. The worry is that unwelcome side-effects, picked up during pre-marketing drug trials, can be quietly swept under the carpet, and only the positive findings published.
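The positive bias described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the trials are imaginary, the treatment truly has no effect, and the rule that a trial is "published" only when its result is significantly positive is a deliberately crude stand-in for real publication behaviour. All function names and numbers are hypothetical.

```python
import random
import statistics


def simulate_trials(n_trials, true_effect, standard_error, seed):
    """Return the estimated effect from each of n_trials imaginary trials."""
    rng = random.Random(seed)
    return [rng.gauss(true_effect, standard_error) for _ in range(n_trials)]


def published_only(estimates, standard_error, z_threshold=1.96):
    """Keep only trials whose estimate is 'significantly' positive.

    Assumption: this one-sided rule is a crude model of publication bias.
    """
    return [e for e in estimates if e / standard_error > z_threshold]


# Hypothetical setting: 2000 small trials of a treatment with zero true effect.
estimates = simulate_trials(n_trials=2000, true_effect=0.0,
                            standard_error=1.0, seed=1)
published = published_only(estimates, standard_error=1.0)

print("trials run:              ", len(estimates))
print("trials 'published':      ", len(published))
print("mean effect, all trials: ", round(statistics.mean(estimates), 3))
if published:
    print("mean effect, published:  ", round(statistics.mean(published), 3))
```

A reader who sees only the "published" trials would conclude that a useless treatment works. A reader with access to every registered trial would see an average effect close to zero.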
The most effective way to avoid publication bias is to register studies before they are done. Several databases of protocols for studies, and studies under way, are available. These help researchers to find all the evidence.
The metaRegister of Controlled Trials (mRCT) aims to identify ongoing and unpublished RCTs. It is open to trials of all types of intervention in all healthcare specialties. The mRCT is built by pooling databases of ongoing (and some completed) trials from public, charitable and commercial organisations.
This scheme helps identify which trial is which. Confusion arises because
Multiple reports based on the same patients can cause further confusion.
The solution is a unique number for each RCT: the International Standard Randomised Controlled Trial Number (ISRCTN). The number is issued when the researchers register the trial. This should be done as early as possible, preferably before patients are recruited. The ISRCTN then stays with a trial throughout its life cycle. A searchable database of ISRCTN registered trials is freely available online to clinicians, researchers, patients and public at www.controlled-trials.com. Registering trials helps avoid
The register provides opportunities for
Failure to report RCTs is increasingly seen as scientific and ethical misconduct. There is growing pressure to register all trials. Some countries have legislation requiring registration of trials. Registration is recommended by many funding agencies and official bodies.
Once these registers have been in place for some time, protocol-driven systematic reviews will be able to group together all RCTs on the subject in question, published or unpublished. Unpublished data currently has to be found by painstaking hand-searching, for example of abstracts of meetings, and by making enquiries of researchers known to be active in the field. Identification of such data should, in future, be much easier from prospective trial registers.
These studies would then ideally have their original patient data pooled, effectively making each smaller study into one huge study.
Such a gold standard systematic review of prospectively identified studies using original patient data has been achieved for tamoxifen and breast cancer (Early Breast Cancer Trialists' Collaborative Group 2001) but for few other subjects.
The methodology of the randomised controlled trial was invented in the 1940s out of necessity, when devising a study for TB with limited amounts of streptomycin (Medical Research Council 1948). The technology for pooling trials (in protocol-driven systematic reviews) has been developed since then to a high standard. The Cochrane Collaboration pools evidence about research methodologies. The Cochrane Methodology Register forms part of the Cochrane Library. It provides resources on
Bayesian statistical methods are increasingly being used to make decisions about healthcare, particularly at an economic level (Luce and O'Hagan, 2003). They are also coming into drug trials because they can reach conclusions quicker, with fewer trial participants. Although Bayesian statistics is regarded as something new, the Reverend Thomas Bayes' theorem was first published in 1763. It is quite simple. It is a way of calculating the odds of something happening, if you already know something else. We use the principle, naturally and often subconsciously, all the time in making decisions about what to do.
Bayesian statistics is based on an acknowledgement that we already know, or at least believe, something before we start a trial. This is known as the "prior", expressed as a probability distribution. Probability is defined as a degree of individual belief, ranging from 0 (complete disbelief) to 1 (certainty). Any uncertainty is located philosophically in the human mind, not in the data itself. The trial then provides some new information. Bayes' theorem - a simple mathematical formula - is then used to synthesize the new information with the prior, to form the "posterior", also expressed as a probability distribution. Bayes' theorem uses simple addition and multiplication to operate probability laws familiar to any gambler. Bayesian statistical inferences are based on the posterior, which combines information from both the prior and the trial data.
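The step from prior to posterior can be sketched in a few lines of code. The example below is illustrative only: it assumes an imaginary trial in which 14 of 20 patients respond to a treatment, a flat prior that treats every response rate from 0 to 1 as equally believable, and a coarse grid of candidate response rates. The function name and all numbers are hypothetical.

```python
def posterior_on_grid(prior, successes, failures, grid):
    """Combine a prior with trial data using Bayes' theorem on a grid.

    prior: belief in each candidate response rate (sums to 1)
    grid:  the candidate response rates themselves
    """
    # Likelihood: how probable the observed data are at each candidate rate.
    likelihood = [p ** successes * (1 - p) ** failures for p in grid]
    # Bayes' theorem: multiply prior by likelihood ...
    unnormalised = [pr * lk for pr, lk in zip(prior, likelihood)]
    # ... then divide by the total so the posterior sums to 1.
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Candidate response rates 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]
# Flat prior (an assumption): no response rate is favoured in advance.
flat_prior = [1 / len(grid)] * len(grid)

# Hypothetical trial data: 14 responders, 6 non-responders.
posterior = posterior_on_grid(flat_prior, successes=14, failures=6, grid=grid)

posterior_mean = sum(p * w for p, w in zip(grid, posterior))
most_believable = grid[posterior.index(max(posterior))]
print("posterior mean response rate:", round(posterior_mean, 2))
print("most believable response rate:", most_believable)
```

Only multiplication, addition and one division are involved. With a flat prior the most believable response rate is simply the observed one (14 out of 20). A more opinionated prior would pull the posterior towards the prior belief.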
Bayesian statistics is focussed on answering the question “How does this new evidence change what we believe?” It explicitly allows for all available evidence to be taken into account. Because it acknowledges that prior belief can vary, various forms of evidence can be included or discounted from the prior belief, and given different weightings. Using a variety of different priors means the same data can be interpreted in many ways. This allows numerous models to be run, testing the effects of different assumptions and viewpoints on the observed data.
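The effect of running the same data through different priors can be shown with a short sketch. It assumes the standard Beta-binomial model, in which a Beta(a, b) prior on a response rate, combined with a count of responders and non-responders, gives a Beta(a + responders, b + non-responders) posterior. The three priors, their labels and the trial data are all hypothetical.

```python
def beta_posterior(prior_a, prior_b, successes, failures):
    """Update a Beta(prior_a, prior_b) prior with binomial trial data."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical priors on the response rate, written as Beta(a, b).
priors = {
    "sceptical": (2, 8),      # expects about 20% to respond
    "neutral": (1, 1),        # flat: no rate favoured
    "enthusiastic": (8, 2),   # expects about 80% to respond
}

# Hypothetical trial data: 14 responders, 6 non-responders.
successes, failures = 14, 6

posterior_means = {}
for name, (a, b) in priors.items():
    post_a, post_b = beta_posterior(a, b, successes, failures)
    posterior_means[name] = beta_mean(post_a, post_b)
    print(f"{name:>12}: prior mean {beta_mean(a, b):.2f}, "
          f"posterior mean {posterior_means[name]:.2f}")
```

All three posteriors move towards the observed rate of 0.70, but they do not agree: the sceptic remains less convinced than the enthusiast. With more trial data the gap between them would narrow.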
Although Bayes' theorem itself is simple, the mathematical models built up to incorporate all these different factors quickly become so complex as to be impenetrable. The computations involved in Markov Chain Monte Carlo simulation are such that Bayesian statistics in practice becomes a "black box" exercise. Various inputs result in various outputs, and it is almost impossible to work out what is going on in the middle. This induces a mistrust of the process in those accustomed to working with the relatively simple, transparent and fixed formulae of conventional statistical tests.
First, probability itself. Both Bayesian and conventional statistics consider probability as varying from 0 to 1. In Bayesian statistics,