A personal analysis of the NINDS study using patient-level data
An Adobe PDF version of this manuscript, which is useful for printing purposes, is available at:-
http://jeffmann.net/soapbox/NINDSpersonalanalysis.pdf
Analysis of the 91-180 minute arm of the NINDS study
Additional lessons learnt from studying the patient-level data from the NINDS study
Critical commentary and recommendations
- the comparative efficacy of TPA based on time-to-treatment
- the median - an imperfect tool for demonstrating that two patient groups are balanced
- small sample size
The results of the NINDS study of intravenous thrombolytic therapy (TPA) in acute ischemic stroke were first published in the NEJM [1] in 1995. Since that date there have been endless arguments regarding the validity of the NINDS (National Institute of Neurological Disorders and Stroke) Study Group's claim that the drug TPA (Tissue Plasminogen Activator) has sufficient therapeutic efficacy to warrant its routine use in community practice. When the NINDS investigators presented the results of the NINDS study at an evening dinner presentation at the American College of Emergency Physicians' Scientific Assembly meeting in October 1996, a number of physicians criticised the wisdom of their recommendation that TPA therapy should be used in community practice. The main basis for the criticism was the fact that the NINDS study was the only randomised controlled trial (RCT) that could demonstrate that TPA was effective in patients with acute ischemic stroke. All the other major TPA-for-stroke RCTs have had negative results. It is now eight years later, and the NINDS study is still the only RCT that claims to have unequivocally demonstrated that TPA is effective in acute ischemic stroke.
Different critics of the NINDS study focus on different sub-plot issues relating to the NINDS trial. The two main subplot issues that generate the maximum amount of critical commentary are i) the "true" degree of TPA's efficacy in acute ischemic stroke and ii) the degree of harm that can occur if TPA is used in community hospitals, rather than specialised stroke centers. The major harm that can occur secondary to the use of TPA is symptomatic intracranial hemorrhage (ICH) and the average rate of ICH in specialised stroke centers is approximately 6% (or less). However, the symptomatic ICH rate may be considerably greater than 6% when TPA is used in community hospitals that do not have specialised "stroke teams". To counter the TPA contrarians' concern about a high risk of iatrogenic ICH, many proponents of TPA therapy believe that the greater ICH rate is primarily due to protocol violations, which could theoretically be avoided if clinicians were better educated. Whether the high risk of an ICH is entirely due to protocol violations, or not, many clinicians in community practice are very reluctant to use the drug, because they are inordinately concerned about the high risk of iatrogenic ICH. The validity of a clinician's inordinate concern about a potentially high risk of iatrogenic ICH also depends on how accurately a clinician establishes the risk:benefit ratio of TPA therapy. If the absolute efficacy of TPA therapy is substantial, then the risk:benefit ratio of TPA therapy may still favor therapy despite a high risk of an iatrogenic ICH. To be able to calculate the risk:benefit ratio of TPA therapy with a reasonable degree of accuracy, a clinician needs to know whether the anticipated degree of efficacy of TPA therapy significantly exceeds the anticipated degree of iatrogenic harm that can occur secondary to a TPA-induced ICH. 
Therefore, to calculate a risk:benefit ratio for TPA therapy for a particular stroke patient, a clinician first needs to precisely estimate the absolute degree of benefit that a particular stroke patient can expect from TPA therapy. Because the NINDS study remains the only RCT that has demonstrated efficacy in acute ischemic stroke, proponents of TPA therapy often use the same numerical figures when describing the efficacy-value of TPA therapy in acute ischemic stroke patients. They usually state that TPA increases the likelihood of a favorable stroke outcome by 12% and they also state that this works out to a number-needed-to-treat (NNT) figure of 8. However, those numerical values generally apply to stroke patients treated between 0-180 minutes (< 3 hours) because they were derived from the results of the entire NINDS trial, and they do not specifically apply to the 91-180 minute time period. It would be an extremely rare event in community practice for a stroke patient to be treated with TPA in <90 minutes from the time of stroke onset, so most clinicians in community practice are mainly interested in the exact extent of TPA's efficacy for patients treated between 91-180 minutes. In the original NINDS article, the absolute efficacy of TPA therapy in patients treated between 91-180 minutes was estimated to be 21% (46% treated patients, 25% placebo patients) using the modified Rankin scale (mRS) stroke outcome scoring system. It is my contention that the 21% absolute risk reduction (ARR) figure is inaccurate, and that the "true" efficacy figure for TPA therapy is probably less than 50% of that value. 
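The relationship between an absolute difference in outcome rates and the number-needed-to-treat can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the usual definition NNT = 1 / absolute difference; the helper name `nnt_from_absolute_difference` is hypothetical and is not taken from any NINDS publication. The last figure simply applies the same formula to an efficacy of half the published 21% value, which is the upper bound of what is argued for in this manuscript.

```python
def nnt_from_absolute_difference(absolute_difference: float) -> float:
    """Number-needed-to-treat for a given absolute difference in outcome rates.

    Assumes the standard definition NNT = 1 / absolute difference.
    """
    if absolute_difference <= 0:
        raise ValueError("absolute difference must be positive for a finite NNT")
    return 1.0 / absolute_difference


# Figure usually quoted for the whole 0-180 minute trial: a 12% absolute difference.
nnt_whole_trial = nnt_from_absolute_difference(0.12)

# Figure quoted for the 91-180 minute arm: 46% treated versus 25% placebo (mRS).
nnt_91_180_published = nnt_from_absolute_difference(0.46 - 0.25)

# Illustration only: half of the published 21% difference.
nnt_91_180_halved = nnt_from_absolute_difference(0.21 / 2)

print(f"Whole trial (12%):          NNT = {nnt_whole_trial:.1f}")
print(f"91-180 min, published 21%:  NNT = {nnt_91_180_published:.1f}")
print(f"91-180 min, halved to 10.5%: NNT = {nnt_91_180_halved:.1f}")
```

A 12% difference gives an NNT of about 8.3, which is the rounded figure of 8 quoted by proponents of TPA therapy. Halving an absolute difference doubles the NNT.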
My belief that the "true" efficacy of TPA is probably less than 50% of the ARR (risk difference) figure calculated by the NINDS investigators rests on my argument that the standard interpretation of the results of the NINDS study's 91-180 minute arm is flawed, because the NINDS investigators did not take into full account the marked imbalances in stroke severity that existed between treated and placebo patients, or the effect of chance events that favored the treated patients.
In my journal article [2], which was published in the Western Journal of Medicine in May 2002, I described why I thought that imbalances in baseline stroke severity between treated and placebo patients could markedly affect the internal validity of the NINDS study's 91-180 minute results. At the time I wrote that article, I had incomplete information regarding the raw data of the NINDS trial, so I could not precisely estimate to what degree the imbalance in stroke severity affected the correct interpretation of the NINDS study's results. After Lenzer published her controversial article [3] in the British Medical Journal (BMJ), I wrote a series of rapid response letters to the Rapid Response Section of bmj.com delineating my concerns regarding this problematic issue. The NINDS Study Group posted their reply [4] to bmj.com on June 22nd 2002. In their rapid response letter, they provided further subgroup data on the 91-180 minute arm of the NINDS study. The provision of this additional data enabled me to make more precise estimations of the degree to which imbalances in stroke severity between treated and placebo patients affected the correct interpretation of the NINDS study's results. I subsequently wrote two more rapid response letters to the BMJ [5], and detailed the reasons why I thought that the "true" efficacy of TPA in patients treated between 91-180 minutes was not 21% as the NINDS investigators claimed, but more likely around 8-11%. The NINDS Study Group did not post a further rapid response letter to demonstrate why my logical reasoning and "facts" were flawed, and the subplot issue of the effect of imbalances in stroke severity between treated and placebo patients subsequently became a side-issue of minor importance to most stroke researchers and stroke interventionalists.
However, the Director of the NINDS division of the National Institutes of Health was presumably concerned about any criticism of the internal validity of the NINDS study, because he decided to appoint an Independent Panel to review the NINDS study's results. The Independent Panel submitted its review privately to the NIH-NINDS, and their initial conclusion [6] was posted in the Journal of Cerebrovascular Diseases. The precise details of the Independent Panel's review have not been made public, and I therefore do not have any knowledge regarding their specific opinions with respect to the subplot issue of the "effect of imbalances in stroke severity between treated and placebo patients".
On October 8th 2003, I obtained a set of patient-level raw data relating to the NINDS study. I did not receive the complete set of patient-level data. However, I did receive the complete set of patient-level data relating to the following parameters for each of the 624 patients in the NINDS study: Time-to-treatment; Baseline NIHSS stroke severity score; Rate of favorable (excellent) stroke outcome at 3 months (modified Rankin score ≤1). Armed with this data, I was able to re-analyse the NINDS study's results in greater depth and with greater accuracy. My in-depth analysis has convinced me that my initial hypothesis was correct, and that the standard interpretation of the NINDS study's results, as presented in the original NEJM journal article, was flawed. The major section of this manuscript is a dissection of the results from the 91-180 minute arm of the NINDS study, and I hope that my manuscript presentation will convince interested readers that the NINDS study was not internally valid, and that its published results are probably not reflective of the "true" efficacy of TPA therapy in acute ischemic stroke. Before I present my analysis of the NINDS study's results, I have included a background section that will help interested readers, who are not sufficiently familiar with issues relating to clinical trial design and interpretation, to better understand the foundational basis of my arguments, which are presented in the subsequent section Analysis of the 91-180 minute arm of the NINDS study. If a reader cannot fully understand the direct relationship between confidence in the results of a clinical trial and the trial's signal/noise ratio (which is described in detail in the background section), then it will probably be impossible, or near-impossible, for him to readily understand my interpretative analysis of the NINDS study's results.
Many clinical trialists (and EBM practitioners) regard a randomised controlled trial (RCT) as the "best" type of trial design, and they think that it should routinely be used to determine the true efficacy of an experimental drug in clinical practice. The execution of an RCT involves the randomisation of patients to two patient groups -- a placebo (or control) patient group and a treated patient group. Trial designers make every effort to ensure that the two groups are equally balanced for prognostic variables that may influence the outcome of interest. By ensuring that baseline prognostic variables are equally balanced between treated and placebo patients, the final result of the trial (difference in outcome events between treated and placebo patients) is more likely to be internally valid. Internal validity is a pre-requisite for determining the scientific truth, and any RCT that is not internally valid cannot establish the scientific truth.
David Sackett, one of the original gurus in the field of evidence-based medicine (EBM), wrote an article for the Canadian Medical Association Journal (CMAJ) called "Why randomized controlled trials fail but needn't: 2. Failure to employ physiological statistics, or the only formula a clinician-trialist is ever likely to need (or understand!)" [7]. In that article (which I highly recommend), Sackett analysed the critical trial design issues that need to be fulfilled if a trialist wants to ensure that one can be confident in the legitimacy of a randomised controlled trial's results.
In that article, Sackett stated that the "only" formula of physiological statistics that a clinical trialist needs to know with respect to clinical trials is the following formula:
Confidence = (Signal / Noise) x √(Sample size)
Sackett expressed the above formula in words as follows:- "The confidence in the conclusion of an RCT is the ratio of the magnitude of the signal to the magnitude of the noise times the square root of the sample size".
Confidence
Sackett stated that "confidence describes how narrow the confidence interval is (the narrower the better) around the effect of treatment, whether expressed as an absolute or relative risk reduction, or as some other measure of efficacy". Sackett stated further along in his article that "In order to generate extremely small and highly convincing confidence intervals around moderate but important benefit signals, a very strong case can and has been made for really large, really simple RCTs".
Signal
Sackett stated that "the signal describes the differences between the effects of the experimental and control treatments." For example, if one was evaluating the efficacy of an antibiotic in curing bacterial pneumonia, the signal would be the difference in cure rate between the treated (experimental) and placebo (control) patients. Confidence in a trial's results increases when the signal increases, and confidence decreases when the signal decreases.
Sackett suggested that if a trial designer wanted to increase confidence in a trial's results, then it is important to enroll "high risk" patients in the trial. He specifically stated that "restricting eligibility to patients who are at higher than average "baseline" risk of outcome events leads to higher "Control Event Rates" (CER) among those receiving placebo or standard therapy. Because the absolute risk reduction signal is equivalent to the product of this control event rate and the relative risk reduction from therapy (ARR = CER x RRR) it follows that, if the relative risk reduction achieved by the experimental treatment is both true and constant over different control event rates, the experimental treatment will generate a larger absolute risk reduction signal when the control event rate is high than when it is low." In other words, if a trial designer was enrolling patients into a clinical trial of an antibiotic for the treatment of bacterial pneumonia, then it is important to ensure that the pneumonia patients have a high risk of an outcome control event (failure to clinically recover from the pneumonia), so that the efficacy of the experimental antibiotic is seriously challenged. It would make no sense to recruit pneumonia patients, who have a low likelihood of the outcome event (failure to recover from the pneumonia), because the trial's signal would then be too small. Therefore, according to Sackett, patients with very mild bacterial pneumonia should not be recruited into a clinical trial of antibiotic therapy for bacterial pneumonia, because those recruited patients would dilute the trial's signal. The confidence intervals (C.I.) around the absolute outcome difference between treated and placebo patients would be significantly reduced if a clinical trial contained too many patients with a low risk of the outcome event.
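Sackett's relationship ARR = CER x RRR can be illustrated numerically. The following is a minimal sketch: the control event rates and the constant relative risk reduction of 50% are assumed values chosen purely for illustration, and the function name `absolute_risk_reduction` is hypothetical.

```python
def absolute_risk_reduction(control_event_rate: float,
                            relative_risk_reduction: float) -> float:
    """ARR = CER x RRR, as quoted from Sackett above."""
    return control_event_rate * relative_risk_reduction


# Assumption for illustration: the drug removes half of the bad outcome events,
# whatever the baseline risk (a "true and constant" RRR in Sackett's words).
ASSUMED_RRR = 0.5

# Hypothetical control event rates: low-risk, medium-risk and high-risk patients.
for cer in (0.05, 0.20, 0.50):
    arr = absolute_risk_reduction(cer, ASSUMED_RRR)
    print(f"CER = {cer:.0%}  ->  ARR (signal) = {arr:.1%}")
```

With the relative risk reduction held constant, the absolute signal grows in direct proportion to the control event rate, which is the reason Sackett gives for enrolling higher-risk patients.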
Sackett also stated that if one wanted to maximise a trial's signal, it is equally important to recruit patients who are likely to be highly responsive to the experimental drug's therapeutic effect. So, for example, in a clinical trial of antibiotic therapy for bacterial pneumonia patients, it would not be wise to recruit moribund nursing home patients with severe bacterial pneumonia into the trial, because those patients are unlikely to respond to the antibiotic as well as healthier patients. If many bacterial pneumonia patients with endstage disease (which is highly unlikely to be responsive to antibiotic therapy) are recruited into a trial of antibiotic therapy for bacterial pneumonia, then their presence would significantly dilute the trial's signal. A significant lowering of the signal level would significantly decrease one's confidence in the trial's results by decreasing the trial's signal/noise ratio.
Sackett summarised his advice regarding the recruitment of patients into a RCT as follows in his CMAJ article:-
In summary, note that Sackett specifically advised against the recruitment of two groups of patients -- i) very sick patients, who are too sick to benefit from the therapeutic drug and ii) mildly sick patients, who are too well to need any treatment (especially if they are also unlikely to respond to the experimental drug). Recruitment of substantial numbers of those types of patients would markedly dilute the "true" signal of the clinical trial, and markedly decrease one's confidence in the trial's results.
Noise
Sackett defined "the noise (or uncertainty) in an RCT as the sum of all the factors ("sources of variation") that can affect the absolute risk reduction or absolute difference" (the trial's *signal). Confidence in a trial's results increases when the noise level in a trial decreases, and confidence decreases when the noise level increases (thereby decreasing the calculated signal/noise ratio).
A major source of variation (uncertainty) in a clinical trial occurs when recruited patients are heterogeneous, rather than homogeneous, with respect to disease severity and/or their likely degree of responsiveness to the therapeutic agent. Other sources of uncertainty relate to variations in baseline prognostic variables between treated and untreated patients. For example, in a trial of antibiotic therapy for bacterial pneumonia, it is important to ensure that enrolled patients in the treatment and placebo arms of the trial have the same frequency/level of underlying chronic lung disease (eg. COPD), because any significant imbalance would increase the difference in likely spontaneous recovery rate between treated and untreated patients, and thereby increase the trial's *noise level, which would (according to Sackett's physiological statistics formula for confidence) decrease one's confidence in the trial's results.
Sackett also stated that a clinical trialist could decrease the noise level in a clinical trial by making "sure that every study patient actually has the target condition whose natural history you are attempting to change. Misdiagnoses at patient entry create subgroups of patients with the wrong conditions who may be incapable of responding to your experimental treatment, thus adding noise to the trial." Therefore, the recruitment of viral pneumonia patients into a clinical trial of antibiotic therapy for patients with bacterial pneumonia would increase the *noise level and decrease one's confidence in the trial's results. Likewise, recruitment of TIA patients in TPA-for-stroke trials would increase the trial's *noise level. Note that the NINDS study had a <5% prevalence of TIA patients in the trial.
Another major source of noise occurs when trial results are not counted accurately in one arm of a trial. Sources of inaccuracy could be inadvertent errors in counting outcome events in either the placebo or treated patients, or deliberate errors that occur when a fraudulent investigator, who becomes unblinded to patient-allocation, only counts negative outcome events in placebo patients (or positive outcome events in treated patients).
Chance events are a major source of noise in RCTs. It is always possible that treated patients, or placebo patients, could have a lower number of outcome events due to chance alone. This chance-event phenomenon is much more likely to be evident in RCTs that enroll a small sample of patients. Chance events could also be due to unrecognised prognostic variables that are inadvertently maldistributed between treated and placebo patients, despite an optimised randomisation process. Chance events may cause the "apparent" signal to be much greater than the "true" signal if they occur more frequently in the treated patient group than in the placebo patient group and result in an increased number of favorable outcome events in the treated group. Conversely, they may cause the "apparent" signal to be smaller than the "true" signal if they result in a decreased number of favorable outcome events in the treated patient group compared to the placebo group.
Sample size
Sackett stated the following with respect to sample size "the sample size is the number of patients in the trial. Note that its influence on confidence intervals is as its square root. As you'll see later, this means that, if you want to cut the confidence interval around a study's absolute risk reduction in half by adding more patients to it, you need to quadruple their number."
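The square-root relationship can be checked numerically. The following is a minimal sketch that uses the simple normal-approximation (Wald) interval for a difference between two proportions; that choice of interval, the 75% and 50% outcome rates, and the arm sizes are all assumptions made for illustration and are not taken from Sackett's article.

```python
import math


def risk_difference_half_width(p_treated: float, p_placebo: float,
                               n_per_arm: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% CI for a difference in proportions.

    Uses the normal (Wald) approximation with equal-sized arms.
    """
    variance = (p_treated * (1 - p_treated)
                + p_placebo * (1 - p_placebo)) / n_per_arm
    return z * math.sqrt(variance)


# Hypothetical trial: 75% versus 50% favorable outcomes.
small_trial = risk_difference_half_width(0.75, 0.50, n_per_arm=125)
large_trial = risk_difference_half_width(0.75, 0.50, n_per_arm=500)  # 4x the patients

print(f"125 per arm: +/- {small_trial:.3f}")
print(f"500 per arm: +/- {large_trial:.3f}")
print(f"ratio of widths: {large_trial / small_trial:.2f}")
```

Quadrupling the number of patients in each arm halves the width of the interval, as Sackett describes.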
To be able to understand how I analysed the NINDS trial, you have to understand the following hypothetical trial scenarios.
Trial number 1
Imagine a trial involving 1,000 patients with bacterial pneumonia - 500 placebo patients and 500 antibiotic treated patients. Each arm of the trial has 100 patients with moderate bacterial pneumonia and they have a likelihood of spontaneous recovery from the pneumonia of 60% if untreated. Each arm of the trial has 300 patients with moderate-severe bacterial pneumonia and they have a likelihood of spontaneous recovery from the pneumonia of 50% if untreated. Each arm of the trial has 100 patients with severe bacterial pneumonia and they have a likelihood of spontaneous recovery from the pneumonia of 40% if untreated. The "average" rate of spontaneous recovery for untreated patients would be 50%. This trial can be regarded as having a homogeneous population of pneumonia patients of relatively high potential responsiveness to an antibiotic -- because the recruited patients only have a 40-60% likelihood of spontaneous recovery if not treated with antibiotics and there is much room for improvement, and the scatter of spontaneous recovery rates is small (within 10% of the average of 50%).
Graphic representation -- Number in each box is the number of patients.
Pneumonia severity             Moderate   Moderate-severe   Severe
Placebo group - 500 patients   100        300               100
Treated group - 500 patients   100        300               100
Then presume that the investigational antibiotic cures 50% of the patients who would otherwise not be cured. The cure rate for the moderate pneumonia patients would be 80 out of 100 treated patients (60 get better spontaneously and the antibiotic cures 50% of the remaining 40 patients). The cure rate for the moderate-severe pneumonia patients would be 225 out of 300 treated patients (150 get better spontaneously and the antibiotic cures 50% of the remaining 150 patients). The cure rate for the severe pneumonia patients would be 70 out of 100 treated patients (40 get better spontaneously and the antibiotic cures 50% of the remaining 60 patients).
At the completion of the trial, one could assess the efficacy of the antibiotic by totalling the number of recoveries in the placebo patients and in the treated patients, and then measuring the difference in recovery rate across all the patients.
Table:- Rate of recovery from pneumonia.
Severity of pneumonia   Treated patients   Placebo patients
Moderate                80/100 (80%)       60/100 (60%)
Moderate-severe         225/300 (75%)      150/300 (50%)
Severe                  70/100 (70%)       40/100 (40%)
All patients            375/500 (75%)      250/500 (50%)
Interpretation of the trial's results: The antibiotic produced an absolute risk difference of 25% (75%-50%). The RR is 1.5 (75%/50%). I will designate the absolute size of the risk difference (25%) as representing the "apparent" efficacy of the antibiotic in curing pneumonia as determined by this particular clinical trial. Seeing that I cannot identify any *noise (uncertainty bias) in the trial's design, I would regard the "true" efficacy of the antibiotic as being equal to the "apparent" efficacy of the drug, and both are equal to 25%.
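The arithmetic of trial number 1 can be reproduced in a few lines. The following is a minimal sketch that works with expected counts only (no random variation); the function and variable names are hypothetical, while the stratum sizes, spontaneous recovery rates and the 50% cure fraction are those stated in the scenario above.

```python
def expected_recoveries(n_patients: int, spontaneous_rate: float,
                        cure_fraction: float) -> float:
    """Expected recoveries in one severity stratum.

    Patients either recover spontaneously, or are cured by the drug with
    probability `cure_fraction` if they would otherwise not have recovered.
    """
    spontaneous = n_patients * spontaneous_rate
    cured_by_drug = (n_patients - spontaneous) * cure_fraction
    return spontaneous + cured_by_drug


# Trial number 1: stratum -> (patients per arm, spontaneous recovery rate).
STRATA = {
    "moderate": (100, 0.60),
    "moderate-severe": (300, 0.50),
    "severe": (100, 0.40),
}
CURE_FRACTION = 0.5  # antibiotic cures 50% of those who would not otherwise recover

patients_per_arm = sum(n for n, _ in STRATA.values())
treated = sum(expected_recoveries(n, rate, CURE_FRACTION) for n, rate in STRATA.values())
placebo = sum(expected_recoveries(n, rate, 0.0) for n, rate in STRATA.values())

treated_rate = treated / patients_per_arm
placebo_rate = placebo / patients_per_arm
risk_difference = treated_rate - placebo_rate
rate_ratio = treated_rate / placebo_rate

print(f"Treated: {treated:.0f}/{patients_per_arm} ({treated_rate:.0%})")
print(f"Placebo: {placebo:.0f}/{patients_per_arm} ({placebo_rate:.0%})")
print(f"Risk difference: {risk_difference:.0%}, RR: {rate_ratio:.2f}")
```

Changing the entries in `STRATA` allows the same calculation to be repeated for any other balanced case mix.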
Trial number 2
Imagine another trial of 1,000 pneumonia patients using the same antibiotic, which has exactly the same potency as in the first trial. The only difference is that the trial will have 100 patients with mild pneumonia in each arm of the trial. Presume that 100% of patients with mild pneumonia have a spontaneous recovery with, or without antibiotics. To keep the total number of trial patients constant at 1,000, there will be 100 fewer moderate-severe pneumonia patients enrolled in each arm of the study (200 instead of 300).
Graphic representation -- Number in each box is the number of patients.
Pneumonia severity             Mild   Moderate   Moderate-severe   Severe
Placebo group - 500 patients   100    100        200               100
Treated group - 500 patients   100    100        200               100
Then presume that the antibiotic cures 50% of the patients who would otherwise not be cured (as in trial number 1). The cure rate for the moderate pneumonia patients would be 80 out of 100 treated patients (60 get better spontaneously and the antibiotic cures 50% of the remaining 40 patients). The cure rate for the moderate-severe pneumonia patients would be 150 out of 200 treated patients (100 get better spontaneously and the antibiotic cures 50% of the remaining 100 patients). The cure rate for the severe pneumonia patients would be 70 out of 100 treated patients (40 get better spontaneously and the antibiotic cures 50% of the remaining 60 patients).
100% of the mild pneumonia patients get better spontaneously due to the natural course of the disease -- that is 100 patients in both the placebo and treatment groups.
Table:- Rate of recovery from pneumonia.
Severity of pneumonia   Treated patients   Placebo patients
Mild                    100/100 (100%)     100/100 (100%)
Moderate                80/100 (80%)       60/100 (60%)
Moderate-severe         150/200 (75%)      100/200 (50%)
Severe                  70/100 (70%)       40/100 (40%)
All patients            400/500 (80%)      300/500 (60%)
Interpretation of the trial's results: The antibiotic produced an absolute risk difference of 20% (80%-60%). The RR is 1.33 (80%/60%).
The "apparent" efficacy of the antibiotic as measured in this particular trial is 20%. However, the "true" efficacy of the antibiotic is known to be 25% (which was determined in trial number 1). Why is the "apparent" efficacy of the antibiotic less than the "true" efficacy of the antibiotic? The correct answer is a *low trial signal. By recruiting patients with very mild pneumonia, who get better spontaneously 100% of the time without antibiotic therapy, the trial design is reducing the power of the trial to generate a large *signal. A lower signal means that the signal/noise ratio of the trial is reduced. According to Sackett's formula, confidence in the validity of the trial results will decrease if the signal/noise ratio decreases. In other words, one is no longer confident that the trial is measuring the "true" efficacy of the antibiotic. The "apparent" efficacy of the antibiotic is less than the "true" efficacy of the antibiotic, and it is a result of the poor trial design (allowing pneumonia patients who are too well to be enrolled in the trial). The recruitment of mild pneumonia cases (who cannot respond to the investigational antibiotic because they are going to get spontaneously better with, or without treatment) deflates the "apparent" efficacy of the antibiotic, and makes the drug appear to be less efficacious than it really is.
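The dilution effect in trial number 2 can be expressed as a weighted average. When both arms have the same case mix, the pooled risk difference equals the sum of each stratum's risk difference multiplied by that stratum's share of the patients. The following is a minimal sketch using the figures from the table above; the variable and function names are hypothetical.

```python
# Trial number 2: (stratum, patients per arm, treated rate, placebo rate).
STRATA = [
    ("mild", 100, 1.00, 1.00),
    ("moderate", 100, 0.80, 0.60),
    ("moderate-severe", 200, 0.75, 0.50),
    ("severe", 100, 0.70, 0.40),
]


def pooled_risk_difference(strata):
    """Pooled risk difference for arms with an identical case mix.

    Each stratum contributes (share of patients) x (stratum risk difference).
    """
    total = sum(n for _, n, _, _ in strata)
    return sum(n / total * (treated - placebo) for _, n, treated, placebo in strata)


with_mild = pooled_risk_difference(STRATA)
without_mild = pooled_risk_difference([s for s in STRATA if s[0] != "mild"])

for name, n, treated, placebo in STRATA:
    print(f"{name:16s} share={n / 500:.0%}  stratum difference={treated - placebo:.0%}")
print(f"Pooled difference including mild cases: {with_mild:.0%}")
print(f"Pooled difference excluding mild cases: {without_mild:.0%}")
```

The mild stratum contributes a difference of zero but takes up 20% of the weight, which pulls the pooled figure down from 25% to 20%.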
Trial number 3
Imagine another trial of 1,000 pneumonia patients using the same antibiotic, which has exactly the same potency as in the first trial. The only difference is that the trial will have 100 patients with mild pneumonia in the placebo group and 300 patients with mild pneumonia in the treatment group. Presume that 100% of patients with mild pneumonia have a spontaneous recovery with, or without antibiotics. To keep the total number of trial patients constant at 1,000, there will be 100 fewer moderate-severe pneumonia patients enrolled in the placebo arm, and no moderate-severe pneumonia patients in the treatment arm.
Graphic representation -- Number in each box is the number of patients.
Pneumonia severity             Mild   Moderate   Moderate-severe   Severe
Placebo group - 500 patients   100    100        200               100
Treated group - 500 patients   300    100        0                 100
Then presume that the antibiotic cures 50% of the patients who would otherwise not be cured (as in trial number 1). The cure rate for the moderate pneumonia patients would be 80 out of 100 treated patients (60 get better spontaneously and the antibiotic cures 50% of the remaining 40 patients). The cure rate for the moderate-severe pneumonia patients would be 0 out of 0 treated patients because there are no moderate-severe pneumonia patients in the treatment arm of this trial. The cure rate for the severe pneumonia patients would be 70 out of 100 treated patients (40 get better spontaneously and the antibiotic cures 50% of the remaining 60 patients).
100% of the mild pneumonia patients get better spontaneously due to the natural course of the disease -- that is 100 patients in the placebo group and 300 patients in the treatment group.
Table:- Rate of recovery from pneumonia.
Severity of pneumonia   Treated patients   Placebo patients
Mild                    300/300 (100%)     100/100 (100%)
Moderate                80/100 (80%)       60/100 (60%)
Moderate-severe         0/0                100/200 (50%)
Severe                  70/100 (70%)       40/100 (40%)
All patients            450/500 (90%)      300/500 (60%)
Interpretation of the trial's final results: The antibiotic produced an absolute risk difference of 30% (90%-60%). The RR is 1.5 (90%/60%).
The "apparent" efficacy of the antibiotic as measured in this particular trial is 30%. The "true" efficacy of the antibiotic is known to be 25% (which was determined in trial number 1). Why is the "apparent" efficacy of the antibiotic greater than the "true" efficacy of the antibiotic? The answer is a *low trial signal due to the recruitment of patients with mild pneumonia in the trial, in combination with an increased *noise level due to the differential recruitment of more mild pneumonia patients in the treatment arm compared to the placebo arm. As a result of a decreased signal and increased noise, the trial's signal/noise ratio is decreased and one cannot be confident in the validity of the trial's results. In other words, if one recruits disproportionately more mild patients in the treatment arm (compared to the placebo arm) it inflates the "apparent" efficacy of the antibiotic, and makes the drug appear to be more efficacious than it really is.
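One way to see how much of trial number 3's 30% difference comes from the unequal case mix is to ask what the treated arm's recovery rate would have been if it had enrolled the same mix of patients as the placebo arm. The following is a minimal sketch of that comparison. It relies on the scenario's stated assumption that the antibiotic cures 50% of patients who would not otherwise recover, which is what supplies the 75% treated rate for the moderate-severe stratum (a stratum that has no treated patients in this trial). All names are hypothetical.

```python
# Stratum-specific recovery rates implied by the scenario.
TREATED_RATE = {"mild": 1.00, "moderate": 0.80, "moderate-severe": 0.75, "severe": 0.70}
PLACEBO_RATE = {"mild": 1.00, "moderate": 0.60, "moderate-severe": 0.50, "severe": 0.40}

# Trial number 3 case mix (number of patients per stratum in each arm).
TREATED_MIX = {"mild": 300, "moderate": 100, "moderate-severe": 0, "severe": 100}
PLACEBO_MIX = {"mild": 100, "moderate": 100, "moderate-severe": 200, "severe": 100}


def pooled_rate(rates, mix):
    """Overall recovery rate for an arm with the given case mix."""
    total = sum(mix.values())
    return sum(rates[stratum] * n for stratum, n in mix.items()) / total


placebo_rate = pooled_rate(PLACEBO_RATE, PLACEBO_MIX)

# Difference as observed, with each arm keeping its own case mix.
crude_difference = pooled_rate(TREATED_RATE, TREATED_MIX) - placebo_rate

# Difference if the treated arm had the placebo arm's case mix.
balanced_difference = pooled_rate(TREATED_RATE, PLACEBO_MIX) - placebo_rate

due_to_imbalance = crude_difference - balanced_difference

print(f"Crude difference:            {crude_difference:.0%}")
print(f"Difference with equal mix:   {balanced_difference:.0%}")
print(f"Attributable to imbalance:   {due_to_imbalance:.0%}")
```

With an equal case mix the difference falls back to the 20% seen in trial number 2, so 10 of the 30 percentage points come purely from the extra mild cases in the treatment arm.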
Trial number 4
Imagine another trial of 1,000 pneumonia patients using the same antibiotic, which has exactly the same potency as in the first trial. The only difference from trial number 3 is that there is a computer glitch and 50% of the favorable outcome results in the placebo patients' moderate subgroup are accidentally not counted. That is a loss of 30 positive outcomes in that subgroup.
Graphic representation -- Number in each box is the number of patients.
Pneumonia severity             Mild   Moderate   Moderate-severe   Severe
Placebo group - 500 patients   100    100        200               100
Treated group - 500 patients   300    100        0                 100
Table:- Rate of recovery from pneumonia.
Severity of pneumonia   Treated patients   Placebo patients
Mild                    300/300 (100%)     100/100 (100%)
Moderate                80/100 (80%)       30/100 (30%)
Moderate-severe         0/0                100/200 (50%)
Severe                  70/100 (70%)       40/100 (40%)
All patients            450/500 (90%)      270/500 (54%)
Interpretation of the trial's results: The antibiotic produced an absolute risk difference of 36% (90%-54%). The RR is 1.67 (90%/54%).
The "apparent" efficacy of the antibiotic as measured in this particular trial is 36%. Compare the results to trial number 3. Note that the "apparent" efficacy of the antibiotic has increased by another 6% from 30% to 36%. There are two sources of *noise in this trial -- *noise from a differential recruitment of mild pneumonia cases to the treatment and placebo arms + *noise from a chance-event (failure of the computer to accurately count all the positive results in the placebo arm's moderate subgroup).
The "true" efficacy of the antibiotic is known to be 25%. The "apparent" efficacy of the antibiotic as measured in this trial is 36%. Do you not think that it is appropriate to make adjustments to correct for those highly significant *noise elements, so that one can "best" ascertain the "true" efficacy of the antibiotic? What appropriate adjustments? One possible adjustment could be to not consider the results from the mild cases in the final tally (because those "extra" mild cases in the treatment arm artifactually bias the trial's results in favor of the treatment group). The second adjustment could be to look at the placebo group's results from a pathophysiological perspective, and note that the moderate cases only had a 30% spontaneous recovery rate, which seems disproportionately low when one notes that the placebo group's moderate-severe cases had a 50% recovery rate and the severe cases had a 40% recovery rate. One could then arbitrarily choose a value of 60% as being a reasonably appropriate adjusted recovery rate for the placebo group's moderate cases. This particular adjustment corrects the chance-event bias phenomenon that artifactually favored the treatment group.
Then the adjusted table of results (after making the corrective adjustment and excluding the mild cases from consideration) would look like this:-
Severity of pneumonia | Treated patients | Placebo patients
Moderate | 80/100 (80%) | 60/100 (60%)
Moderate-severe | 0/0 | 100/200 (50%)
Severe | 70/100 (70%) | 40/100 (40%)
All patients | 150/200 (75%) | 200/400 (50%)

Interpretation of the trial's adjusted results: The antibiotic produced an absolute risk difference of 25% (75%-50%). The RR is 1.5 (75%/50%).
These results are identical to the results in trial number 1 and reflect the "true" efficacy of the antibiotic. If one did not make any adjustments, and accepted the unadjusted "apparent" efficacy results, then the absolute "false" efficacy error would be 11% (36%-25%).
In other words, if a rational clinician knew that a clinical trial was poorly designed (imbalance in severity of disease between the treatment and placebo groups) or had significant inadvertent chance-bias (chance-events favoring either the treatment or placebo group), and he wanted to uncover the "true" efficacy of the investigational drug (and differentiate it from the "apparent" efficacy of a drug), then I think that it would be wise to make appropriate adjustments in order to minimise (or eliminate) the likely bias-error due to significant *noise elements, so that he can "best" identify the "true" efficacy of the investigational drug.
This adjustment approach is definitely not the optimum approach. The optimum approach is to have well-designed clinical trials that are not biased (unbalanced) in favor of the treatment or placebo groups. This adjustment approach is a second-class approach that is only necessary if one wants to discover the "true" efficacy of a drug in a clinical trial that can be demonstrated to be significantly biased in favor of the treatment or placebo group.
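The unadjusted and adjusted figures for trial number 4 can be re-checked with a few lines of Python. This is only a sketch: the names `pooled_rate` and `risk_difference_and_rr` and the dictionary layout are illustrative choices, and the 60/100 substituted for the placebo moderate subgroup is the arbitrary adjustment value described above, not a measured result.

```python
# Subgroup results for trial number 4 as (recovered, enrolled) per arm.
trial4 = {
    "mild":            {"treated": (300, 300), "placebo": (100, 100)},
    "moderate":        {"treated": (80, 100),  "placebo": (30, 100)},
    "moderate-severe": {"treated": (0, 0),     "placebo": (100, 200)},
    "severe":          {"treated": (70, 100),  "placebo": (40, 100)},
}

def pooled_rate(table, arm):
    """Recovery rate for one arm after summing all subgroups."""
    recovered = sum(row[arm][0] for row in table.values())
    enrolled = sum(row[arm][1] for row in table.values())
    return recovered / enrolled

def risk_difference_and_rr(table):
    """Absolute risk difference and relative risk (treated vs placebo)."""
    treated = pooled_rate(table, "treated")
    placebo = pooled_rate(table, "placebo")
    return treated - placebo, treated / placebo

ard_raw, rr_raw = risk_difference_and_rr(trial4)

# Adjustment 1: leave the mild cases out of the tally.
adjusted = {name: dict(row) for name, row in trial4.items() if name != "mild"}
# Adjustment 2 (assumed value): replace the miscounted 30/100 with 60/100.
adjusted["moderate"]["placebo"] = (60, 100)
ard_adj, rr_adj = risk_difference_and_rr(adjusted)

print(f"unadjusted: ARD = {ard_raw:.0%}, RR = {rr_raw:.2f}")
print(f"adjusted:   ARD = {ard_adj:.0%}, RR = {rr_adj:.2f}")
```

The unadjusted line corresponds to the 36% and 1.67 reported for this trial, and the adjusted line to the 25% and 1.5 of trial number 1.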
Trial number 5
Imagine another trial of 1,000 pneumonia patients using the same antibiotic, which has exactly the same potency as in the first trial. The only difference from trial number 4 is that the placebo arm contains 100 patients with very severe pneumonia and 100 fewer patients with moderate-severe pneumonia (so that the total number of patients remains at 500 patients). Presume that the very severe pneumonia patients are very sick and that there is a zero percent recovery rate in that subgroup of patients. Presume that there is the same computer glitch in this trial as in trial number 4 and 50% of the favorable outcome results in the placebo patients' moderate subgroup are accidentally not counted.
Graphic representation -- Number in each box is the number of patients.
Pneumonia severity | Mild | Moderate | Moderate-severe | Severe | Very severe
Placebo group - 500 patients | 100 | 100 | 100 | 100 | 100
Treated group - 500 patients | 300 | 100 | 0 | 100 | 0

Then presume that the antibiotic cures 50% of the patients who would otherwise not be cured (as in trial number 1). The cure rate for the moderate pneumonia patients would be 80 out of 100 treated patients (60 get better spontaneously and the antibiotic cures 50% of the remaining 40 patients). The cure rate for the moderate-severe pneumonia patients would be 0 out of 0 treated patients because there are no moderate-severe pneumonia patients in the treatment arm of this trial. The cure rate for the severe pneumonia patients would be 70 out of 100 treated patients (40 get better spontaneously and the antibiotic cures 50% of the remaining 60 patients).
100% of the mild pneumonia patients get better spontaneously due to the natural course of the disease -- that is 100 patients in the placebo group and 300 patients in the treatment group.
100% of the very severe pneumonia patients in the placebo group do not spontaneously recover.
Table:- Rate of recovery from pneumonia.
Severity of pneumonia | Treated patients | Placebo patients
Mild | 300/300 (100%) | 100/100 (100%)
Moderate | 80/100 (80%) | 30/100 (30%)
Moderate-severe | 0/0 | 50/100 (50%)
Severe | 70/100 (70%) | 40/100 (40%)
Very severe | 0/0 | 0/100 (0%)
All patients | 450/500 (90%) | 220/500 (44%)

Interpretation of the trial's results: The antibiotic produced an absolute risk difference of 46% (90%-44%). The RR is 2.05 (90%/44%).
The "apparent" efficacy of the antibiotic as measured in this particular trial is 46%. The "true" efficacy of the antibiotic is known to be 25% (which was determined in trial number 1). Why is the "apparent" efficacy of the antibiotic as measured in this trial so much greater than the "true" efficacy of the antibiotic? The correct answer is that this trial suffered from a low *signal level and a high *noise level. The low *signal level was due to the recruitment of mild pneumonia patients into the trial. The high *noise level was due to the combined effects of i) the differential recruitment of more mild pneumonia cases to the treatment arm than the placebo arm of the trial; ii) the differential recruitment of very severe pneumonia cases to the placebo arm of the trial; and iii) the miscounting of 50% of the positive results in the moderate pneumonia subgroup of the placebo arm of the trial.
Would you accept this trial's "apparent" efficacy figure of 46% as representing the "true" efficacy of the antibiotic? Or, do you think that appropriate post hoc adjustments of the trial's results need to be made to uncover the "true" efficacy of the antibiotic?
If you agree that appropriate post hoc adjustments need to be made in order to uncover the likely "true" efficacy of the antibiotic, then you may agree with my personal approach to analysing the results from the 91-180 minutes arm of the NINDS study, which suffered from all of these distorting elements -- a low *signal level (due to the recruitment of mild stroke patients into both arms of the trial), and a high *noise level (due to the differential recruitment of a greater number of mild stroke patients to the treatment arm, plus the differential recruitment of a greater number of very severe stroke patients to the placebo arm, plus a chance-event phenomenon in the moderate-severe stroke severity group of the placebo arm of the trial that particularly handicaps the placebo group's favorable stroke outcome rate).
Analysis of the 91-180 minute arm of the NINDS study
The NINDS study of TPA therapy for acute ischemic stroke recruited 624 patients into the trial. Approximately half the patients were treated in <90 minutes, and the other half were treated between 91-180 minutes. The 91-180 minutes arm of the trial enrolled 320 patients -- 153 in the treated group and 167 in the placebo group.
In the original NEJM article report [1] on the NINDS study, the NINDS study authors reported the following favorable stroke outcome (mRS ≤1) results for patients treated between 91-180 minutes -- 46% (70/153) for treated patients and 25% (42/167) for placebo patients. The difference between the two results is the absolute risk difference (or ARR) and it works out to a figure of 21% (46%-25%). That absolute risk difference of 21% represents TPA's therapeutic efficacy in the 91-180 minute arm of the study. Is that 21% absolute efficacy figure pure *signal or could it be significantly contaminated by *noise, so that the "true" efficacy of TPA is substantially less than the estimated figure of 21%? To appreciate my answer to that question, the reader needs to understand each element in the following detailed analysis of the NINDS study's results.
The NINDS trialists derived the absolute risk difference figure of 21% by simply summing up the favorable stroke outcome results from all the recruited patients, irrespective of their level of stroke severity. The NINDS study population was very heterogeneous from a stroke severity point of view, and the study population included acute ischemic stroke patients with mild strokes, moderate strokes and severe strokes. The usual classification system used to describe stroke severity in acute ischemic stroke patients is the NIHSS scoring system. The NIHSS score of patients recruited into the NINDS study varied from 1-37. From the perspective of this analysis of the NINDS study, stroke patients with a NIHSS score of <5 can be regarded as having a very mild stroke and those patients have a likely spontaneous recovery rate of >70%. Patients with a NIHSS score >20 have a very severe stroke and their likelihood of a spontaneous favorable stroke outcome is <7%. That wide disparity in favorable stroke outcome rates demonstrates that the NINDS study had a very heterogeneous population of stroke patients whose baseline likelihood of spontaneous recovery varied 10-fold from approximately <7% to >70%. Although the stroke patients recruited into the NINDS trial were very heterogeneous from a stroke severity perspective, the NINDS trialists did not differentially weight the stroke outcome results according to baseline stroke severity. The NINDS trialists divided the recruited stroke patients into five subgroups based on baseline stroke severity (for the purpose of post hoc analysis), and they simply added the favorable stroke results for each stroke severity subgroup using simple arithmetic.
The following table from Grotta's letter to bmj.com [4] demonstrates how the NINDS trialists calculated the absolute efficacy figure of 21% for the 91-180 minute arm of the trial.
Figure 1. Table of favorable stroke outcome rates in patient subgroups from the 91-180 minute arm of the NINDS study.

Patients with Rankin Good Outcome (0,1) at Three months

Baseline NIHSS (Patients treated 91 to 180 minutes) | TPA | Placebo | Relative Risk (95% CI)
1-5 | 24/29 (83%) | 6/7 (86%) | 1.0 (0.7, 1.4)
6-10 | 23/37 (62%) | 23/46 (50%) | 1.2 (0.8, 1.8)
11-15 | 10/26 (38%) | 5/35 (14%) | 2.7 (1.0, 6.9)
16-20 | 9/33 (27%) | 6/33 (18%) | 1.5 (0.6, 3.7)
>20 | 4/28 (14%) | 2/46 (4%) | 3.3 (0.6, 16.8)
All Patients | 70/153 (46%) | 42/167 (25%) | 1.8 (1.3, 2.5)
>5 (All, excluding 1-5) | 46/124 (37%) | 36/160 (23%) | 1.6 (1.1, 2.4)

Note, from the first column, that the NINDS trialists divided their enrolled patients into five stroke severity subgroups according to their baseline NIHSS stroke severity score. Stroke patients with a baseline stroke severity score of 1-5 have a mild stroke. A baseline NIHSS score of 6-10 signifies a mild-moderate stroke, a NIHSS score of 11-15 represents a moderate-severe stroke, a NIHSS score of 16-20 represents a severe stroke, and any NIHSS score >20 represents a very severe stroke.
The second column supplies the rate of favorable stroke outcome results (measured at 3 months) for the treated patients, and the third column applies to the placebo patients. The fourth column gives the relative risk (RR) figures for each subgroup, and one can note a final RR value of 1.8 for all the patients in the 91-180 minutes arm of the trial. The RR figure is derived by dividing the favorable stroke outcome event rate percentage figure for the treated patients by the figure for placebo patients. Note that the RR value for all the patients in the 91-180 minute arm of the trial was 1.8, which basically means that treated patients had a 1.8x greater likelihood of having a favorable stroke outcome at 3 months than untreated patients.
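The "All Patients" row of Grotta's table can be reproduced as a short worked calculation. The sketch below assumes the usual log-scale (Katz) approximation for the 95% confidence interval; the source does not say which method the NINDS trialists used, so the interval is illustrative, and the function name `relative_risk` is hypothetical.

```python
import math

def relative_risk(events_t, n_t, events_p, n_p, z=1.96):
    """Relative risk of a favorable outcome with an approximate 95% CI.

    Uses the standard error of log(RR); this is an assumed method,
    not one stated in the source.
    """
    rr = (events_t / n_t) / (events_p / n_p)
    se_log_rr = math.sqrt(1 / events_t - 1 / n_t + 1 / events_p - 1 / n_p)
    low = math.exp(math.log(rr) - z * se_log_rr)
    high = math.exp(math.log(rr) + z * se_log_rr)
    return rr, low, high

# All patients treated between 91 and 180 minutes: 70/153 TPA vs 42/167 placebo.
rr, low, high = relative_risk(70, 153, 42, 167)
print(f"RR = {rr:.1f} (95% CI {low:.1f}, {high:.1f})")
```

Rounded to one decimal place, this agrees with the 1.8 (1.3, 2.5) shown in the table, which suggests that a log-scale interval of this kind is at least consistent with the published figures.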
As I dissect these results in greater depth, it is important that the reader closely follow the effect of my critical analysis on the NINDS trialists' calculated absolute risk difference value of 21% and the RR value of 1.8.
Mild stroke patients
If you look at Grotta's table, you will note that patients with mild strokes (baseline NIHSS score of 1-5) had a >80% rate of favorable stroke outcome, irrespective of whether they were treated, or not treated, with TPA. According to Sackett's statistical formula for confidence in a clinical trial's results, confidence in a trial's results decreases when the true *signal level decreases and/or the *noise level increases. By recruiting mild stroke patients into the NINDS study, the NINDS trialists decreased the power of the trial's *signal because they broke one of Sackett's cardinal rules of "good" trial design -- the rule which states that a trial designer should not recruit patients with a low likelihood of generating a "true" *signal into a clinical trial. In fact, the NINDS Study Group's own official trial design rules advised their recruiting centers not to recruit patients with a baseline NIHSS score of <4 into the NINDS study. However, this advisory rule was repeatedly broken, especially in the treated patient group, and patients with mild strokes constituted an amazingly high proportion of the treatment arm's stroke population -- 18.5% of the treated patients in the 91-180 minute arm of the trial.
To make matters much worse, there was a significant difference in the number of mild stroke patients recruited into the treated group compared to the placebo group in the 91-180 minutes arm of the NINDS study. That significant difference of 14% (18%-4%) creates significantly increased *noise (source of variation), and the presence of that increased *noise decreases one's confidence in the trial's results. Note that the NINDS trialists simply added the favorable stroke outcome results in mild stroke patients (24 in the treated group and only 6 in the placebo group) to the total favorable stroke outcome figure, without weighting its clinical significance. In other words, by that simple action alone, the NINDS trialists had inadvertently favored the treated patient group in a manner that has nothing to do with the "true" efficacy of TPA.
If you have difficulty understanding the points that I made in the last paragraph, consider what would happen if the treated group only consisted of mild stroke patients (baseline NIHSS score 1-5) and the placebo group had the same balance of patients as described in the Grotta table. Then the rate of favorable outcome in the treated group would be 83% while the placebo group's rate of favorable stroke outcome would be 25%. The absolute risk difference would be 58% and the calculated RR value would be 3.3. That hyper-inflated result would be artifactual and not representative of TPA's true efficacy. The same phenomenon could occur in reverse fashion if all the patients enrolled in the placebo arm had a mild stroke (baseline NIHSS score of 1-5) while the treated group had the same balance of stroke patients as described in the Grotta table. Then the rate of favorable stroke outcome would be 46% in the treated patients and 86% in the placebo patients. That surprising result (absolute risk difference of 40% in favor of the placebo group) would artifactually suggest that TPA markedly decreases the rate of favorable stroke outcome in acute ischemic stroke patients!
To what degree did this differential recruitment phenomenon affect the trial's signal/noise ratio, and one's confidence in the published results of the NINDS trial?
The answer is readily apparent if one examines the NINDS study's 91-180 minutes results -- without including patients with a mild stroke (baseline NIHSS score of 1-5) in the analysis. Look at the last row in Grotta's table and note that the absolute risk difference between treated and placebo patients is reduced to 14% and that the calculated RR value is 1.6 if mild stroke patients are not included in the summation analysis. I think that removing the mild stroke patients' results from consideration when attempting to determine the "true" efficacy of TPA therapy is appropriate, because the additional 7% absolute risk difference (21%-14%) was solely due to the numerical imbalance in mild stroke patients between the treated and placebo groups, and not due to TPA therapy. However, note that this appropriate correction reduces the "relative" efficacy of TPA therapy by 33% (21%-14% = 7%; 7% divided by 21% is 33%).
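The effect of leaving out the mild subgroup can be recomputed directly from the subgroup counts in Grotta's table. This is a minimal sketch; the dictionary layout and the helper name `pooled_rate` are illustrative assumptions rather than anything used by the NINDS trialists.

```python
# (favorable outcomes, patients) per baseline NIHSS band, 91-180 minute arm.
tpa = {"1-5": (24, 29), "6-10": (23, 37), "11-15": (10, 26),
       "16-20": (9, 33), ">20": (4, 28)}
placebo = {"1-5": (6, 7), "6-10": (23, 46), "11-15": (5, 35),
           "16-20": (6, 33), ">20": (2, 46)}

def pooled_rate(arm, exclude=()):
    """Favorable outcome rate summed over every band not in `exclude`."""
    kept = [counts for band, counts in arm.items() if band not in exclude]
    favorable = sum(f for f, _ in kept)
    patients = sum(n for _, n in kept)
    return favorable / patients

for label, exclude in (("all patients", ()),
                       ("excluding NIHSS 1-5", ("1-5",))):
    t = pooled_rate(tpa, exclude)
    p = pooled_rate(placebo, exclude)
    print(f"{label}: TPA {t:.1%}, placebo {p:.1%}, "
          f"ARD {t - p:.1%}, RR {t / p:.2f}")
```

Working from exact fractions gives risk differences of roughly 20.6 and 14.6 percentage points; the 21% and 14% quoted in the text come from subtracting the already rounded percentages in the table.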
Very severe stroke patients
If one examines the favorable stroke outcome results from the very severe stroke patients (baseline NIHSS score >20) in the 91-180 minutes arm of the NINDS study, one notes that the RR value for those patients is 3.3. That RR value is very high, which would suggest that TPA is especially effective in very severe stroke patients. However, experienced trial-interpreters do not consider RR values in isolation. They also study the "actual" patient-results underlying the RR value. Note that the 14% rate of favorable stroke outcome in treated patients with a baseline NIHSS score >20 is derived from the positive results of four patients. Could one-or-more of those favorable outcome events be due to a chance-event? The answer becomes readily apparent when one looks at the favorable stroke outcome results from the very severe stroke patients (baseline NIHSS score >20) in the 0-90 minute arm of the NINDS study.
The favorable stroke outcome rate for the very severe stroke patients (baseline NIHSS score >20) in the 0-90 minute arm of the NINDS study was 6% (2/35) in treated patients and 3% (1/36) in placebo patients (I obtained those results from the patient-level raw data). Those results demonstrate that very severe stroke patients have a very low likelihood of having a favorable stroke outcome -- irrespective of whether they receive TPA therapy, or not.
According to Sackett's rules of "good" trial design, those very severe stroke patients, who are very unlikely to respond to TPA therapy, should not have been enrolled in the NINDS study. The inclusion of patients who are very unlikely to respond to TPA therapy dilutes the "true" signal of the NINDS trial. However, that is not the only problem! The fact that there were far more very severe stroke patients (baseline NIHSS >20) in the placebo group (46 patients) compared to the treated group (28 patients) differentially biased the results in favor of the treated group. If you cannot readily understand the last point, consider the following two hypothetical scenarios. What would happen if all the placebo patients had very severe strokes (baseline NIHSS score >20) and the treated patients had the same balance of stroke severity as the treated patients in the Grotta table? Then the rate of favorable stroke outcome for all the patients in the 91-180 minute arm of the study would be 46% for treated patients and 4% for placebo patients. The calculated absolute risk difference would be 42% and the RR value would be 11.4! If the situation were reversed, and all the treated patients had very severe strokes (baseline NIHSS score >20) and the placebo group had the same balance of stroke severity as the placebo patients in the Grotta table, then the calculated absolute risk difference would be 14%-25% = -11% (a result that would significantly disfavor TPA therapy).
Again, I think that it is appropriate not to consider the favorable stroke results from the very severe stroke patients (baseline NIHSS score >20) in the total count, because the differential recruitment of very severe stroke patients to the treated and placebo groups -- far more very severe stroke patients in the placebo group -- artifactually deflated the possibility of a similar number of favorable stroke outcome results from the placebo group (that would have occurred if the placebo and treated patient groups were perfectly balanced). Because there were so few favorable stroke outcome events in either the treated or placebo patients, removing those results from consideration does not significantly alter the value of the absolute risk difference. After removing those results from consideration, the rate of favorable stroke outcome in treated patients would be 44% (42/96) and it would be 30% (34/114) in placebo patients. However, although the calculated absolute risk difference would still be 14%, the RR value would be reduced to 1.46.
Noise from chance events
Could the differential presence of chance events between treated and placebo patients have significantly biased the NINDS study's 91-180 minute results in favor of TPA therapy?
Look at the NIHSS 11-15 subgroup results in Grotta's table. Note that the placebo group had an inordinately low rate of favorable stroke outcome of 14%. From a purely pathophysiological perspective, one would expect the result-figure to be intermediate between 50% (figure for the NIHSS 6-10 subgroup) and 18% (figure for the NIHSS 16-20 subgroup). It should not be less than 18%, which is the rate of favorable stroke outcome value for patients with more severe strokes (baseline NIHSS score 16-20).
What actually caused the low favorable stroke outcome rate figure of 14% for the NIHSS 11-15 placebo group?
I only discovered the answer to that question when I received the NINDS study's patient-level raw data. So, let's examine the patient-level data more thoroughly.
Figure 2: Rate of favorable stroke outcome (Y axis) plotted for each level of baseline NIHSS stroke severity from 1-21 (X axis). The actual patient numbers for the baseline NIHSS score of 21 include all the stroke patients with a baseline NIHSS score between 21-37.
Note that the percentage rate of favorable stroke outcome (mRS ≤1 at 3 months) is plotted against the baseline NIHSS score for each level of baseline NIHSS score. Each point estimate on the graph was obtained by determining the percentage rate of favorable stroke outcome in all the patients who had a particular NIHSS score. For example, there were 14 placebo patients with a baseline NIHSS score of 7, and 10 of those patients had a mRS ≤1 at 3 months. That works out to a percentage favorable stroke outcome figure of 71%. The plotted figures for a baseline NIHSS score of 21 include all the patients who had a NIHSS score >20.
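Because each plotted point rests on a handful of patients, it may help to see how much uncertainty surrounds a single point estimate such as the 10/14 example. The sketch below uses the Wilson score interval, which is my choice of method for illustration (the source does not compute intervals for individual points), and the function name `wilson_interval` is hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Placebo patients with a baseline NIHSS score of 7: 10 of 14 had a favorable outcome.
point = 10 / 14
low, high = wilson_interval(10, 14)
print(f"point estimate {point:.0%}, interval {low:.0%} to {high:.0%}")
```

With only 14 patients the interval spans roughly 45% to 88%, so a single plotted value of 71% is compatible with a wide range of underlying recovery rates.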
The curves in black and red are "best fit" curves for the wide scattering of point estimates and they were simply derived from Microsoft Excel's polynomial curve function. They may not be the optimum "best fit" curves for the data from a statistician's perspective, but that doesn't matter because I am only going to use the curves in a "relative" sense to make certain educational points.
Before I discuss the issue of chance-events, I would like to point out some interesting facts.
1) Note that the "best fit" curves are very close together in patients with mild strokes (baseline NIHSS score 1-5) and in patients with very severe strokes (baseline NIHSS score >20). This phenomenon simply confirms the fact that TPA does not have any significant clinical efficacy in patients with mild or very severe strokes.
2) Note that the "best fit" curves are continuously changing in slope angle and that they are particularly steep in stroke patients with a baseline NIHSS score <15. This demonstrates that the stroke population is very heterogeneous in terms of their likelihood of a favorable stroke outcome, and that there is no single cut-off point that separates patients with a low likelihood of a favorable stroke outcome from patients with a high likelihood of a favorable stroke outcome.
3) Note that each plotted point estimate is derived from the stroke outcome result-values of very few patients. This is due to the small sample size of the NINDS study. The confidence intervals around the value for each point estimate must therefore be very wide. Also, note the wide scattering of the point estimates. This should make readers appreciate the significant degree of uncertainty that would accompany any attempt to develop a "definitive" interpretation of these stroke outcome results.
4) Note that the maximum separation of the "best fit" curves occurs in the stroke severity range of baseline NIHSS score of 11-17. That fact would suggest that TPA is particularly efficacious for patients who have a stroke in that severity range.
However, is the wide separation of the "best fit" curves in the region of baseline NIHSS scores of 11-15 due to TPA's efficacy or is it due to chance-events?
Consider the following graphic display (a simplified version of the graphs in figure 2).
Figure 3: Plot of statistical outliers.
Note that the placebo patients with a baseline NIHSS score of 12, 13 or 14 had a zero rate of a favorable stroke outcome at 3 months. Does that make sense from a pathophysiological perspective? I believe that those "statistical outlier" points are either due to chance-events in a small study sample and/or due to other prognostic co-variables that I have not studied (e.g. age, high blood pressure, co-morbid diseases) because I do not have access to all the raw data from the NINDS study.
I believe that the three zero rates of favorable stroke outcome in placebo patients with a baseline NIHSS score of 12-14 account for the low favorable stroke outcome figure of 14% for the placebo NIHSS 11-15 subgroup. Using that low figure of 14% for placebo patients artifactually favors TPA therapy and makes the "apparent" efficacy of TPA for the subgroup of patients with a baseline NIHSS score of 11-15 greater than the "true" efficacy of TPA.
How should one correct for that chance-event phenomenon? What is the likely favorable stroke outcome rate for placebo patients with a baseline NIHSS score of 11-15?
In my rapid response letter to the bmj [5b], I initially proposed a figure of 26-30% for the NIHSS 11-15 placebo subgroup's expected rate of a favorable stroke outcome -- based on my examination of the TOAST graph.
Figure 4: TOAST graph plotting the probable rate of excellent stroke outcome using the Barthel Index scoring system (Y axis) against the baseline NIHSS score (X axis) -- from reference number [8].
Note that if one examines the graph for non-lacunar strokes, the expected probability of an excellent stroke outcome at 3 months for stroke patients with a baseline NIHSS score of 12-14 is 32-38%. Although it may therefore make physiological sense to use a value of 32-38% for placebo patients with a baseline NIHSS score of 12-14 in the 91-180 minute arm of the NINDS study, a significant interpretative problem arises if one attempts to use the TOAST graph as a "definitive" comparator. First of all, the TOAST graph used the Barthel Index stroke outcome scoring system, and the relationship between stroke outcome scores as measured by the Barthel Index scoring system and the modified Rankin Scale scoring system is not directly linear. Secondly, the baseline NIHSS scores in the TOAST study were apparently measured at 24 hours and some stroke patients could have improved (or deteriorated) during the first 24 hours.
Is there a better comparator? The answer is obvious -- one merely needs to examine the placebo group's rate of favorable stroke outcome in the 0-90 minute arm of the NINDS study, because there is no pathophysiological reason why there should be a large discrepancy in the rates of favorable stroke outcome between placebo patients enrolled in the 0-90 minutes arm of the study compared to placebo patients enrolled in the 91-180 minutes arm of the study.
Figure 5: Comparison of the rates of favorable stroke outcome at 3 months (Y axis) against the baseline NIHSS scores (X axis) for placebo patients from the 0-90 minutes arm versus the 91-180 minutes arm of the NINDS study.
It is immediately apparent when one looks at the two graphs, that the favorable stroke outcome results from placebo patients in the 91-180 minute arm of the trial, who had a baseline NIHSS score of 12-14, were statistical outliers.
Note that the placebo patients in the 0-90 minutes arm of the study, who had a baseline score of 12, 13, 14, had a favorable stroke outcome rate of 57%, 20%, 27% respectively. The value of 57% is obviously a statistical outlier.
The "average" rate of favorable stroke outcome for the placebo NIHSS 11-15 subgroup patients in the 0-90 minutes arm of the NINDS study was 32.4% (unadjusted for the statistical outlier value at baseline NIHSS score 12). That value is much higher than the value of 14% for the 91-180 minute arm of the study. This simple comparison confirms that using an unadjusted value of 14% for the placebo NIHSS 11-15 subgroup patients in the 91-180 minutes arm of the study artifactually inflates TPA's "apparent" efficacy in the 91-180 minute arm of the NINDS study.
What happens if one plugs in a value of 32% for the placebo NIHSS 11-15 subgroup patients in the Grotta table (instead of using a value of 14%) and then re-calculates the risk difference and RR values?
The answer for all the 91-180 minutes patients (excluding stroke patients with a baseline NIHSS score of 1-5 or >20) would be 44% (42/96) for the treated patients and 35% (40/114) for the placebo patients.
The re-calculated absolute risk difference would be 9% and the RR would be 1.26.
That means that the "true" efficacy of TPA could be much lower than the calculated value that the NINDS trialists presented in their NEJM article (absolute risk difference of 21% and RR of 1.8), and the "true" efficacy of TPA may be less than 50% of the value of the "apparent" efficacy of TPA (the "apparent" efficacy is a calculated value that is unadjusted for the unequal numbers of patients in the mild and very severe stroke subgroups, and the presence of a disproportionate number of statistical outlier results in the placebo patients' NIHSS 11-15 subgroup).
This difference can be vividly portrayed in graphic fashion by plotting a corrected "best-fit" curve.
Figure 6: Projected direction of "best fit" curve if corrected for statistical outliers.
It is important to realise that I am not really trying to be "definitive" in my analysis, and that I am not claiming that the "true" efficacy of TPA is definitely less than 50% of the "apparent" efficacy of TPA (using the NINDS Study Group's absolute risk difference value of 21% from the Grotta table as the "apparent" efficacy value). This analysis is only one way of looking at the raw data. Other trial-interpreters may look at the same data in a different manner and come up with different adjusted values for the absolute risk difference and the RR.
However, there is one major reason why I think that my personal estimate is reasonably accurate. TPA is known to be more effective if given early, and the favorable stroke outcome results from the 0-90 minute arm of the NINDS study demonstrated that TPA produced an absolute risk difference of 12% (mRS ≤1) and a RR of 1.7 for patients treated less than 90 minutes from the time of stroke onset. Therefore, one would expect the favorable stroke outcome results for patients treated between 91-180 minutes to be somewhat less than that value. An absolute risk difference of 9% and RR of 1.3 is a much more physiologically plausible result for patients treated between 91-180 minutes than the unadjusted figures published in the original NEJM article.
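The adjusted figures used in this section can be re-derived step by step. The sketch below starts from the counts left after excluding the NIHSS 1-5 and >20 subgroups (42/96 treated, 34/114 placebo) and then swaps the observed placebo NIHSS 11-15 result (5/35) for an assumed 32% rate. The variable names and the helper `compare` are illustrative, and rounding the expected count to a whole patient is my assumption about how the 40/114 figure was reached.

```python
# Counts after excluding the NIHSS 1-5 and >20 subgroups.
treated_fav, treated_n = 42, 96
placebo_fav, placebo_n = 34, 114

# Placebo NIHSS 11-15 subgroup: 5 favorable outcomes observed in 35 patients.
observed_fav, subgroup_n = 5, 35
assumed_rate = 0.32  # substituted rate, taken from the 0-90 minute placebo arm

# Expected favorable outcomes at the assumed rate (11.2, rounded to 11).
expected_fav = round(assumed_rate * subgroup_n)
placebo_fav_adj = placebo_fav - observed_fav + expected_fav

def compare(t_fav, t_n, p_fav, p_n):
    """Absolute risk difference and relative risk, treated vs placebo."""
    t_rate, p_rate = t_fav / t_n, p_fav / p_n
    return t_rate - p_rate, t_rate / p_rate

ard_before, rr_before = compare(treated_fav, treated_n, placebo_fav, placebo_n)
ard_after, rr_after = compare(treated_fav, treated_n, placebo_fav_adj, placebo_n)

print(f"before substitution: ARD {ard_before:.1%}, RR {rr_before:.2f}")
print(f"after substitution:  ARD {ard_after:.1%}, RR {rr_after:.2f}")
```

Exact fractions give a risk difference of about 8.7 percentage points and an RR of about 1.25 after the substitution, in line with the rounded 9% and 1.26 quoted above. The result depends entirely on the assumed 32% rate, so it is best read as one plausible adjustment rather than a definitive estimate.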
Additional lessons learnt from studying the patient-level data from the NINDS study
In this section, I will be discussing a number of issues relating to the analysis of the NINDS trial. However, I will not be limiting my commentary to the NINDS trial. I will also be discussing a number of issues relating to the overall quality of TPA-for-stroke research.
At the start of this section, I will consider the issue of whether the NINDS study demonstrated that TPA is significantly more efficacious for stroke patients treated <90 minutes from the time of stroke onset compared to stroke patients treated between 91-180 minutes.
The comparative efficacy of TPA based on time-to-treatment
In the previous section, I concluded that if the favorable stroke outcome results for the 91-180 minute group of patients were adjusted for imbalances in baseline stroke severity, then the adjusted values would suggest that patients treated after 90 minutes would have an absolute benefit secondary to TPA therapy that is less than it would be for patients treated < 90 minutes from the time of stroke onset. The estimated difference between the two groups after my corrective "adjustments" was 3%.
Consider the NINDS trialists' official position regarding this time-to-treatment issue.
In the discussion section of the original NEJM article that was published in 1995, the NINDS investigators did not proffer an explanation as to why the NINDS study's results "apparently" demonstrated that TPA was more effective for patients treated later (91-180 minutes) rather than earlier (0-90 minutes). They simply provided the favorable stroke outcome results at 3 months in tabular form in the results section of the paper. With respect to the rate of favorable stroke outcome using the modified Rankin Scale scoring system, they supplied the following figures.
Rate of favorable stroke outcome at 3 months (mRS<1)
<90 minutes treated patients -- 40%
91-180 minutes treated patients -- 45%
In other words, the NINDS study's results appeared to demonstrate that TPA was more effective in patients treated after 90 minutes compared to patients treated < 90 minutes from the time of stroke onset.
The NINDS investigators were presumably nonplussed by these unexpected results and they eventually wrote another article that was specifically targeted at this time-to-treatment issue.
In that article by Marler [9], the authors stated the following with respect to the time-to-treatment issue:-
"In the initial analysis of the results, there was little apparent observed difference between the two time strata in the number of favorable outcomes at 3 months or in the number of intracranial hemorrhages. The apparent lack of additional benefit for earlier treatment was unexpected because pilot studies 2,3 and laboratory research 4 had shown reduced rates of hemorrhage and increased benefit with earlier treatment. A recent study has commented on the strength of the research data supporting the concept that earlier treatment would be expected to produce a better outcome. ------------------In other words, the NINDS investigators initially appeared to be perplexed by the apparent lack of additional benefit for earlier treatment, and when you look at the avenues they were pursuing to explain this conundrum, they appeared to be pursuing some highly unlikely explanations.Seeking to understand the unexpected lack of difference between patient outcomes in the two time strata of these two trials, we proposed several explanations for the apparent lack of effect of time on patient outcome at 3 months: 1) patients treated sooner after stroke onset could have come to medical attention earlier because their presenting symptoms were more severe or noticeable; 2) patients starting treatment sooner after stroke onset may have come at a different time of day and received different care in the emergency department or intensive care unit; 3) patients treated earlier may have ischemic stroke subtypes less responsive to thrombolytic treatment; 4) for patients treated earlier compared to those treated later, baseline characteristics predicting better outcomes may have been distributed differently between the rt-PA and placebo groups."
When I received, and analysed, the NINDS study's patient-level raw data in early October 2003, I could easily establish that TPA was more efficacious in stroke patients treated in <90 minutes, compared to its efficacy in stroke patients treated between 91-180 minutes. Why could I easily make this determination, when the NINDS Study Group team could not readily make this determination? I am not sure why the NINDS investigators could not quickly determine that TPA was more effective if given < 90 minutes from the time since stroke onset -- I presume that it was because of a "fixed" mindset that caused them to be overly fixated on those results presented above.
How did I establish (using the NINDS study's own raw data) that TPA was more effective if given <90 minutes from stroke onset, compared to 91-180 minutes after stroke onset?
I did it by simply plotting the rate of favorable stroke outcome (mRS<1) at 3 months against the baseline NIHSS stroke severity score for both treatment groups.
Figure 7: Comparison of the rate of favorable stroke outcome (mRS<1) at 3 months for patients treated in <90 minutes versus patients treated between 91-180 minutes for each baseline NIHSS stroke severity score level.
It probably only takes the average reader of this manuscript about 5-10 seconds to notice that the 0-90 minutes treated patients had a better rate of favorable stroke outcome at 3 months than the 91-180 minutes treated patients -- simply because the pink plotted results are generally higher than the blue plotted results throughout most of the stroke severity range (from a baseline NIHSS score of 5 through 19).
It should be obvious that it is not going to be easy to quantify the difference in the rate of favorable stroke outcome between the two time-to-treatment groups by measuring the difference in vertical height between these graphs, and that any attempt to accurately quantify the difference is significantly hampered by the wide scattering of the point estimate results. Also, the value of each point estimate result must have a very wide confidence interval because of the small number of patients that constitute the "value" of each point estimate. I would personally approach any "definitive" interpretation of this graph, that depends on specialised statistical techniques, with a great deal of suspicion, because I think that the total sample size is too small and the point estimates too widely scattered to enable me to feel confident about the accuracy of any interpretative conclusions.
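To give a rough sense of how wide the confidence interval around a single plotted point can be, here is a minimal Python sketch using the Wilson score interval. The counts (4 favorable outcomes among 10 patients) are invented purely for illustration and are not taken from the NINDS data, and the choice of the Wilson method is an assumption; any standard binomial interval would make the same point.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Return the (lower, upper) Wilson score interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Hypothetical point estimate: 4 favorable outcomes among 10 patients who
# share one baseline NIHSS score (invented numbers, not NINDS data).
low, high = wilson_interval(4, 10)
print(f"observed 40%, approximate 95% interval {low:.0%} to {high:.0%}")
```

With a denominator this small, the interval is more than 40 percentage points wide, so a vertical gap of a few percentage points between two plotted points carries very little information on its own.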
Is there another post hoc way of looking at those favorable stroke outcome results, so that it could allow a trial interpreter to "semi-accurately" estimate the size of the difference in therapeutic response to TPA between the two time-to-treatment groups?
The second way that I looked at the same raw data was by dividing the patients into five subgroups as Grotta did with the 91-180 minutes groups in his table. The following table presents the favorable stroke outcome results with respect to the five stroke severity subgroups for both the 0-90 minutes and 91-180 minutes treated patients.
Figure 8: Table of favorable stroke outcome rates for treated patients according to stroke severity subgroups from the 0-90 minutes and 91-180 minutes arms of the NINDS study.
Baseline NIHSS | 0-90 minutes arm | 91-180 minutes arm
1-5 | 9/13 (69%) | 24/29 (83%)
6-10 | 23/30 (77%) | 23/37 (62%)
11-15 | 17/39 (44%) | 10/26 (38%)
16-20 | 12/40 (30%) | 9/33 (27%)
>20 | 2/35 (6%) | 4/28 (14%)
All patients (red row) | 63/157 (40%) | 70/153 (46%)
All patients excluding <5 (orange row) | 54/144 (38%) | 46/124 (37%)
All patients excluding <5 and >20 (green row) | 52/109 (48%) | 42/96 (44%)
Red row -- Note that these results were obtained by simply adding the results from the different subgroups (as Grotta did for the 91-180 minute arm of the study). The results for all the patients apparently suggest that TPA is more efficacious for patients treated after 90 minutes. However, note that there were significantly more patients with a NIHSS score of 1-5 in the 91-180 minutes treatment arm (compared to the 0-90 minutes arm) and that they had a disproportionately better rate of favorable stroke outcome (which is not necessarily due to the drug -- remember that I have previously demonstrated that placebo patients in the 91-180 minutes group had an 86% chance of a favorable stroke outcome due to the natural course of the disease while treated patients had an 83% rate of a favorable stroke outcome). That means that the 91-180 minutes treatment group's result is artifactually inflated by including those results in the total figure for all the 91-180 minutes treated patients. By removing those results from consideration, one would be decreasing the *noise level and increasing one's confidence that one was determining the "true" comparative efficacy of TPA for the two treatment groups.
If one includes the unadjusted results from the NIHSS 1-5 subgroups in the total results without making a "corrective" adjustment, then it makes the 91-180 minute treated group appear to have a 6% greater degree of responsiveness to TPA when the "real" truth is that the 6% greater responsiveness is simply due to the combination of two effects:- i) a greater number of patients being present in the 91-180 minute treated arm's NIHSS 1-5 subgroup, and ii) a greater degree of spontaneous stroke recovery in that same subgroup of patients.
Orange row -- The results are for all the patients, excluding the NIHSS 1-5 subgroup of patients. However, note that the very severe stroke patients (NIHSS >20) in the 91-180 minute arm had a much higher rate of favorable stroke outcome (14%) than the same subgroup in the 0-90 minutes arm of the study (6%). That differential response in the 91-180 minutes treated group is probably due to a chance-event (or another unknown prognostic variable). That differential response artifactually inflates the rate of favorable stroke outcome in the 91-180 minute group (as compared to the 0-90 minute group). Note that the increased number of very severe stroke patients in the 0-90 minute treatment arm (compared to the 91-180 minute treatment arm) also artifactually benefits the 91-180 minutes treatment group relative to the 0-90 minutes treatment group.
Green row -- These results are for all the patients, excluding the NIHSS <5 and >20 subgroups. I think that these results are more reflective of the "true" comparative efficacy of TPA based on time-to-treatment, because they have excluded a significant amount of *noise (TPA-unresponsive patients who were not equally balanced between the two treatment groups). Note that the final calculated results (green row) suggest that TPA is "apparently" more efficacious in patients treated <90 minutes from the time of stroke onset.
Do I think that this 4% difference is the "real" difference in TPA's efficacy between stroke patients treated <90 minutes as compared to stroke patients treated between 91-180 minutes? Absolutely not! All these calculations are merely theoretical estimations, which reflect the effect of making "best guess" corrective adjustments for highly significant *noise elements.
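The subgroup arithmetic behind the red, orange and green rows can be re-checked with a few lines of Python. This is only a sketch: the counts are copied from the table above, while the dictionary layout and the function name `pooled_rate` are illustrative choices and not part of the NINDS dataset.

```python
# (favorable outcomes, patients) for TPA-treated patients in each baseline
# NIHSS subgroup, copied from the table above.
ARM_0_90 = {"1-5": (9, 13), "6-10": (23, 30), "11-15": (17, 39),
            "16-20": (12, 40), ">20": (2, 35)}
ARM_91_180 = {"1-5": (24, 29), "6-10": (23, 37), "11-15": (10, 26),
              "16-20": (9, 33), ">20": (4, 28)}

def pooled_rate(arm, exclude=()):
    """Pool the subgroups of one arm, skipping any listed in `exclude`."""
    kept = [counts for band, counts in arm.items() if band not in exclude]
    favorable = sum(f for f, _ in kept)
    patients = sum(n for _, n in kept)
    return favorable, patients, favorable / patients

scenarios = [
    ("all patients", ()),
    ("excluding NIHSS 1-5", ("1-5",)),
    ("excluding NIHSS 1-5 and >20", ("1-5", ">20")),
]
for label, exclude in scenarios:
    f0, n0, r0 = pooled_rate(ARM_0_90, exclude)
    f1, n1, r1 = pooled_rate(ARM_91_180, exclude)
    print(f"{label}: 0-90 min {f0}/{n0} ({r0:.1%}), "
          f"91-180 min {f1}/{n1} ({r1:.1%})")
```

Dropping the NIHSS 1-5 subgroup, and then the >20 subgroup as well, moves the pooled comparison from favoring the 91-180 minutes arm to favoring the 0-90 minutes arm. The sketch only reproduces the subtraction; it says nothing about whether excluding those subgroups is statistically justified.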
To add to the complexity of this "best guess" estimation, one still has to prove that the 0-90 minutes group of treated patients had the same baseline risk of a favorable stroke outcome before treatment as the 91-180 minutes group of treated stroke patients. How does one establish that these two TPA-treated groups were well balanced for the critically important prognostic variable of baseline stroke severity prior to treatment? See the next section.
The median - an imperfect tool for demonstrating that two patient groups are balanced
It is common practice in TPA-for-stroke RCTs for trialists to attempt to demonstrate that the treatment and placebo groups are balanced for the important prognostic variable of baseline stroke severity by demonstrating the fact that the two groups had the same median baseline NIHSS stroke severity score. In the NINDS trial, the trialists stated that the treatment and placebo groups were balanced for the prognostic variable of baseline stroke severity because the median baseline NIHSS score was 14/15 for both groups.
Trialists running post-marketing TPA-for-stroke studies also use the median baseline NIHSS score to claim that their study's results confirm the "true" efficacy of TPA -- simply because their study's stroke population had a median baseline NIHSS score that is identical to the NINDS study's median score of 14 (or within 1-2 points of that value).
I think that using a median score is too crude a measure of the degree of balance in baseline stroke severity between two patient populations, because there is not a linear correlation between baseline stroke severity and the rate of a favorable stroke outcome in untreated stroke patients. The "best fit" curve demonstrating the relationship between baseline stroke severity and the rate of a favorable stroke outcome in untreated stroke patients has a continuously changing slope angle in the middle section of the stroke severity range (NIHSS 6-17), while the curve is very flat at the extreme ends (mild stroke severity range and very severe stroke severity range).
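A small Python sketch may make the point concrete: two groups can share the same median NIHSS score and still have very different expected rates of favorable outcome when the outcome curve is non-linear. The two lists of scores are invented for illustration, and the step-function "curve" borrows the "hypothetical" placebo percentages (82%, 45%, 32%, 19%, 4%) that appear later in this manuscript; neither is a fitted model of the NINDS data.

```python
from statistics import mean, median

def expected_rate(nihss):
    """Assumed step-function outcome curve by baseline NIHSS band."""
    if nihss <= 5:
        return 0.82
    if nihss <= 10:
        return 0.45
    if nihss <= 15:
        return 0.32
    if nihss <= 20:
        return 0.19
    return 0.04

# Two invented groups of nine patients, both with a median NIHSS score of 14.
group_a = [4, 5, 8, 12, 14, 16, 18, 19, 20]
group_b = [9, 11, 12, 13, 14, 21, 22, 24, 25]

for name, scores in [("group A", group_a), ("group B", group_b)]:
    rate = mean(expected_rate(s) for s in scores)
    print(f"{name}: median NIHSS {median(scores)}, "
          f"expected favorable outcome {rate:.0%}")
```

Group A contains two mild strokes and no very severe strokes, while group B contains four very severe strokes, so the expected rates differ substantially even though the medians are identical.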
Consider the distribution of the NINDS study's treated patients according to baseline stroke severity.
Figure 8: Distribution of treated patients according to baseline stroke severity.
I do not know of a precise method of making a "corrective adjustment" for the fact that the 0-90 minutes treated patients had a different baseline stroke severity distribution pattern than the 91-180 minutes treated patients (and therefore a different baseline expectation of a favorable stroke outcome), and I therefore cannot precisely estimate how much more effective TPA could be for patients treated <90 minutes compared to patients treated between 91-180 minutes when using the NINDS study's raw data. Can you?
Although I cannot make precise corrective adjustments for the difference in the distribution pattern of baseline stroke severity between the 0-90 minute arm and the 91-180 minute arm of the NINDS study, I know that one cannot ignore the problem as being without significance. There are two major differences in the stroke severity distribution pattern that significantly favor the 91-180 minute arm of the study -- the relatively increased number of mild stroke patients with baseline NIHSS scores between 4-6 (those patients are likely to have a high rate of spontaneous improvement due to the natural course of the disease and not due to TPA therapy) and the relatively decreased number of very severe stroke patients with baseline NIHSS scores >20 (those patients usually have a very low likelihood of a favorable stroke outcome even with TPA therapy). To ignore that differential distribution in stroke severity between the two treatment arms is to accept an artifactually inflated likelihood of a favorable stroke outcome rate in stroke patients treated between 91-180 minutes (compared to 0-90 minutes) that is not directly related to the therapeutic effect of TPA therapy.
This is a very important issue, and I think that it forms the central focus of the TPA-contrarian argument regarding the NINDS study. Many TPA-contrarians pose their contrarian argument in two parts by first arguing that the primary question is not whether TPA is effective if given in <90 minutes, because even if TPA is effective in patients treated <90 minutes from the time of stroke onset, it is extremely unlikely that many stroke patients in community practice can be treated in <90 minutes. The second part of the contrarian argument is that the only RCT that has demonstrated that TPA therapy is effective in patients treated between 91-180 minutes is the NINDS study, and they assert that the NINDS study was too small in sample size to warrant designation as level I evidence from an EBM-quality perspective.
On the other hand, ardent proponents of TPA therapy take the contrary point of view, and they argue that there is no need to perform another larger TPA-for-ischemic stroke RCT, because they think that i) the NINDS study was sufficiently large in sample size (300+ patients for the 91-180 minute arm of the study) and ii) the trial's signal was sufficiently large (absolute risk difference of 21% for the mRS<1 endpoint for the 91-180 minute arm of the study).
I personally think that the TPA-proponents' argument would only have some merit if it can be demonstrated that the sample size of any positive TPA-for-stroke RCT had at least 300+ stroke patients, who were i) very homogeneous from a stroke severity point of view, so that the stroke patient population only consisted of moderate -- moderate-severe -- severe stroke patients, who are expected to be highly responsive to TPA therapy (so that the trial's *signal is maximized), and ii) there were an equal number of patients in the treated and placebo groups for each stroke severity level (so that a "false" signal would not be generated as a result of a significant imbalance). The NINDS study does not meet those basic requirements.
In the 91-180 minutes arm of the NINDS study, 35% of the recruited stroke patients had baseline NIHSS scores <5 or >20. That means that the sample size of potentially "highly responsive" stroke patients (NIHSS 6-20) was only 65% of a total sample of 320 patients. That works out to 210 patients.
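The figure of roughly 210 patients can be cross-checked against the subgroup denominators given in this manuscript's tables for the 91-180 minutes arm (treated patients in Figure 8, placebo patients in Figure 10). The following Python lines are only a bookkeeping sketch, and the variable names are arbitrary.

```python
# Patients per baseline NIHSS subgroup in the 91-180 minutes arm,
# copied from the treated (Figure 8) and placebo (Figure 10) tables.
BANDS = ["1-5", "6-10", "11-15", "16-20", ">20"]
TREATED = [29, 37, 26, 33, 28]
PLACEBO = [7, 46, 35, 33, 46]
MID_BANDS = {"6-10", "11-15", "16-20"}  # the "potentially responsive" range

total = sum(TREATED) + sum(PLACEBO)
responsive = sum(t + p for band, t, p in zip(BANDS, TREATED, PLACEBO)
                 if band in MID_BANDS)

print(f"all patients: {total}")
print(f"NIHSS 6-20 patients: {responsive} ({responsive / total:.0%})")
```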
How can one justify giving a single RCT an EBM level I evidence designation if the RCT only had a sample size of 210 "potentially responsive" patients?
Critical commentary and recommendations
How does one know whether any TPA-for-stroke RCT's placebo group's results are accurate and truly reflective of what would happen to the "average" stroke patient in community practice if untreated?
Let's start from the beginning. How do we even know that TPA is significantly effective in the treatment of acute ischemic stroke?
Consider the results of a few TPA-for-stroke studies that have calculated the rate of favorable stroke outcome using the same stroke outcome measuring system -- a mRS score of <1 -- for patients treated in < 3 hours.
NINDS -- 39%
ECASS ITT (< 3 hours) -- 40%
ECASS II -- 40%
Can you tell from those values whether TPA is significantly effective in increasing the rate of favorable stroke outcome in patients with acute ischemic stroke? Obviously, one cannot make that judgment without comparing those results to the rate of favorable stroke outcome (mRS<1) results from a comparable group of untreated stroke patients (placebo patients). Therefore, the accuracy of one's estimation of the degree of efficacy of TPA therapy in acute ischemic stroke depends on how accurately one estimates the rate of favorable stroke outcome in placebo patients.
It is interesting to note that when some stroke research experts discuss this particular issue in the medical literature, they only use the favorable stroke outcome results from a single placebo group as a comparator value.
Here is an example of a "comparison" using a single placebo group from a review article on thrombolytic therapy in stroke.
Figure 9: Comparison of rate of favorable stroke outcome (mRS<1) in different TPA-for-stroke studies.
Note that the author has used the NINDS study's placebo group as a comparator group. However, in determining whether that comparison is a valid comparison, one should first note that the NINDS placebo group's results were misrepresented (probably due to a misprint) and the correct figure should be 26%. Even then, is it rational to compare the favorable stroke outcome (mRS<1) results of treated patients from other TPA-for-stroke studies to the NINDS study's placebo group's results -- considering that those other studies have a population of treated stroke patients, whose "average" median baseline stroke severity score of 13 (median baseline NIHSS score of the NINDS treated patients 14, ECASS treated patients 13, Cologne treated patients 12) is very different to the median baseline NIHSS score of 15 for the NINDS study's placebo group? Is that a fair comparison?
What would people think if I used the ECASS II study's placebo group's rate of favorable stroke outcome results (36.6%) as a comparator value? Using a figure of 36.6% as a comparator value for a placebo group's rate of favorable stroke outcome would cause a rational person to conclude that TPA has no significant efficacy in the treatment of acute ischemic stroke. However, note that the median baseline NIHSS stroke severity score for the ECASS II placebo group was 11, so that is not a fair comparison. How does one make a fair comparison? I don't think that the stroke research community has thought about this issue in sufficient depth, and I will now demonstrate how "fluid" and intangible the placebo group's rate of favorable stroke outcome value can be depending on how one structures and interprets a TPA-for-stroke RCT.
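How strongly the "apparent" benefit depends on the choice of comparator can be shown with simple subtraction. The sketch below uses only the percentages quoted above (treated rates of 39-40%, and placebo rates of 26% for NINDS and 36.6% for ECASS II). It is an illustration of the sensitivity, not a valid cross-trial estimate of efficacy, and the names used are arbitrary.

```python
# Rate of favorable outcome (mRS<1, percent) in treated patients, as quoted above.
TREATED_RATES = {"NINDS": 39.0, "ECASS ITT (< 3 hours)": 40.0, "ECASS II": 40.0}

# Two candidate placebo comparator values quoted in the text (percent).
COMPARATORS = {"NINDS placebo": 26.0, "ECASS II placebo": 36.6}

def absolute_difference(treated_pct, placebo_pct):
    """Absolute difference in percentage points between two rates."""
    return treated_pct - placebo_pct

for comparator, placebo_pct in COMPARATORS.items():
    for study, treated_pct in TREATED_RATES.items():
        diff = absolute_difference(treated_pct, placebo_pct)
        print(f"{study} treated vs {comparator}: {diff:+.1f} percentage points")
```

The same treated-patient results look like a difference of 13-14 percentage points against one comparator and about 2-3 percentage points against the other.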
Here are the NINDS study's rate of favorable stroke outcome results for the placebo groups from the 0-90 minutes arm and 91-180 minutes arm -- presented side-by-side in a similar manner to the Grotta table.
Figure 10: Rate of favorable stroke outcome (mRS<1) at 3 months for placebo patients in the 0-90 minutes arm and 91-180 minutes arm of the NINDS study.
Baseline NIHSS | 0-90 minutes arm | 91-180 minutes arm
1-5 | 7/9 (78%) | 6/7 (86%)
6-10 | 15/37 (40%) | 23/46 (50%)
11-15 | 10/31 (32%) | 5/35 (14%)
16-20 | 8/37 (21%) | 6/33 (18%)
>20 | 1/31 (3%) | 2/46 (4%)
All patients | 41/145 (28%) | 42/167 (25%)
Note that the rate of favorable stroke outcome values vary for each stroke severity NIHSS subgroup. Theoretically, if the NINDS study's stroke patient population is representative of the "average" stroke population, then there should not be a wide discrepancy in the stroke outcome results between the two groups of placebo patients in the NINDS study, because they should both have the same (or closely similar) rate of favorable stroke outcome result due to the natural course of the disease. One can note from looking at the TOAST graph and the NINDS placebo patient groups graph that the rate of favorable stroke outcome curve varies throughout the stroke severity range in a continuous curve (flat at both ends and steeper in the intermediate stroke severity range) and that one should be able to estimate what the expected rate of favorable stroke outcome "value" should be for any particular stroke severity subgroup with a reasonable degree of accuracy.
I therefore decided to create a reasonably fair "hypothetical" placebo group for comparison purposes by looking at the NINDS study's placebo groups' graphs for guidance. I also looked at the actual results in the above table. I finally decided to use simple common sense when choosing between the different percentage figures for each stroke severity NIHSS subgroup, and in most cases, I simply decided to split the difference. However, for the "hypothetical" placebo group's NIHSS 11-15 subgroup, I decided to choose an arbitrary value of 32% because it is intermediate in value between 45% (split value for the NIHSS 6-10 subgroup) and 19% (split value for the NIHSS 16-20 subgroup) and I decided to ignore the biased result of 14% from the 91-180 minute arm's NIHSS 11-15 subgroup, which I previously demonstrated was due to chance-events.
Figure 11: Table -- Column 4 -- Rate of favorable stroke outcome results (mRS<1) at 3 months expressed as a percentage for a "hypothetical" placebo group.
Baseline NIHSS | 0-90 minutes arm | 91-180 minutes arm | Hypothetical placebo group
1-5 | 7/9 (78%) | 6/7 (86%) | 82%
6-10 | 15/37 (40%) | 23/46 (50%) | 45%
11-15 | 10/31 (32%) | 5/35 (14%) | 32%
16-20 | 8/37 (21%) | 6/33 (18%) | 19%
>20 | 1/31 (3%) | 2/46 (4%) | 4%
All patients | 41/145 (28%) | 42/167 (25%)
What would the "apparent" efficacy of TPA therapy be for the NINDS study's patients treated between 91-180 minutes if this "hypothetical" placebo group was used as a comparator group, and we presumed that there were the same number of patients in each stroke severity subgroup as occurred in the NINDS study's 91-180 minutes treated stroke patient subgroups?
Figure 12: Hypothetical TPA-for-stroke trial number 1 -- Comparison of the rate of favorable stroke outcome (mRS<1) at 3 months between the NINDS study's 91-180 minute treated stroke patients versus the "hypothetical" placebo group's stroke patients -- using the same patient numbers for the placebo group as was used for the treatment group in the Grotta table.
Baseline NIHSS | Treated patients | "Hypothetical" Placebo Group
1-5 | 24/29 (83%) | 24/29 (82%)
6-10 | 23/37 (62%) | 17/37 (45%)
11-15 | 10/26 (38%) | 8/26 (32%)
16-20 | 9/33 (27%) | 6/33 (19%)
>20 | 4/28 (14%) | 1/28 (4%)
All patients | 70/153 (46%) | 56/153 (37%)
Note that the number of stroke patients in this "hypothetical" TPA-for-stroke trial has been equalised, so that there is zero discrepancy in patient numbers in each stroke severity NIHSS subgroup between the treated patients and placebo patients, and that the patient numbers are identical to the treated group's numbers in the 91-180 minute arm of the NINDS study. In other words, this "hypothetical" TPA-for-ischemic stroke trial has less *noise than the NINDS study because there is no imbalance in stroke severity between the treated and placebo groups.
Note that in this balanced "hypothetical" TPA-for-stroke trial the placebo group has a rate of favorable stroke outcome (mRS<1) value at 3 months of 37%. That value is much higher than the value of 25% obtained in the 91-180 minute arm of the NINDS study. Also, note that the absolute risk difference (which is equivalent to the "apparent" efficacy of TPA) in this "hypothetical" TPA-for-stroke trial would be 9%. That absolute risk difference value is far less than the absolute risk difference value of 21% obtained in the 91-180 minute arm of the NINDS study. The reason for the reduction in the "apparent" efficacy of TPA is that the placebo group's "apparent" rate of favorable stroke outcome goes up from 25% to 37% simply by equalising the number of patients in each stroke severity subgroup, while keeping the favorable response rates unchanged for each stroke severity subgroup.
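The calculation behind hypothetical trial number 1 amounts to re-weighting the "hypothetical" placebo rates by the treated group's subgroup sizes. A minimal Python sketch follows, assuming the subgroup counts and percentages from the tables above; the variable names are illustrative only, and the placebo count is left unrounded rather than rounded to whole patients as in the table.

```python
# 91-180 minutes treated patients per NIHSS subgroup (1-5, 6-10, 11-15, 16-20, >20).
TREATED_FAVORABLE = [24, 23, 10, 9, 4]
TREATED_PATIENTS = [29, 37, 26, 33, 28]

# "Hypothetical" placebo rate of favorable outcome for the same subgroups.
HYPOTHETICAL_PLACEBO_RATES = [0.82, 0.45, 0.32, 0.19, 0.04]

n_total = sum(TREATED_PATIENTS)
treated_rate = sum(TREATED_FAVORABLE) / n_total

# Give the placebo group the same subgroup sizes as the treated group.
placebo_expected = sum(n * rate for n, rate
                       in zip(TREATED_PATIENTS, HYPOTHETICAL_PLACEBO_RATES))
placebo_rate = placebo_expected / n_total

risk_difference = treated_rate - placebo_rate
print(f"treated {treated_rate:.1%}, placebo {placebo_rate:.1%}, "
      f"absolute risk difference {risk_difference:.1%}")
```

The re-weighted placebo rate comes out close to 37% and the absolute risk difference close to 9%, in line with the figures quoted above.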
The estimated "apparent" efficacy of TPA would be very different if we organised this "educational exercise" differently. Instead of changing the number of patients in each placebo subgroup to equal the number of patients in each treated subgroup of the NINDS study, one could theoretically design a "hypothetical" TPA-for-stroke trial that was better balanced.
Consider a trial design for a "hypothetical" TPA-for-stroke trial that deliberately attempts to reduce the *noise level due to imbalances in baseline stroke severity. Consider a hypothetical TPA-for-stroke trial of 2,000 acute ischemic stroke patients -- 1,000 treated patients and 1,000 placebo patients. Presume that 5% of the enrolled patients (50 patients in each arm of the trial) had a stroke severity score of NIHSS 1-5, 5% (50 patients in each arm of the trial) had a stroke severity score of NIHSS >20, and that each of the NIHSS 6-10, 11-15 and 16-20 stroke severity subgroups had exactly 30% of the total number of patients (300 in each subgroup).
Then the revised stroke outcome table would look like this:-
Figure 13: Hypothetical trial number 2 -- Comparison of the rate of favorable stroke outcome (mRS<1) at 3 months between treated patients and placebo patients -- using the same percentage rate of favorable stroke outcome for each "hypothetical" treated subgroup as occurred in each of the stroke severity subgroups of the treated patients arm of the NINDS study (91-180 minutes), and the same percentage rate of favorable stroke outcome for each "hypothetical" placebo subgroup as occurred in the "hypothetical" placebo group from hypothetical trial number 1.
Baseline NIHSS | Treated patients | "Hypothetical" Placebo Group
1-5 | 41/50 (83%) | 41/50 (82%)
6-10 | 186/300 (62%) | 135/300 (45%)
11-15 | 114/300 (38%) | 96/300 (32%)
16-20 | 81/300 (27%) | 57/300 (19%)
>20 | 7/50 (14%) | 2/50 (4%)
All patients | 429/1000 (43%) | 331/1000 (33%)
Note that this "hypothetical" TPA-for-stroke trial is much better balanced than the NINDS study and that it minimised the number of mild stroke and very severe stroke patients enrolled into the trial.
Note that the placebo patients had a 33% rate of favorable stroke outcome (mRS<1) at 3 months. Note the "apparent" efficacy of TPA in this "hypothetical" trial is 10%. This figure is less than 50% of the unadjusted "apparent" efficacy figure of 21% obtained by the NINDS trialists in the 91-180 minute arm of the NINDS study.
Of course, all of these calculated numbers will vary as one changes the number of patients recruited into each of the five stroke severity subgroups, because the balance between the stroke severity subgroups would be altered. This stroke severity heterogeneity problem means that one can never determine the "real" efficacy of TPA for acute ischemic stroke using this method of quantifying any TPA-for-stroke RCT's favorable stroke outcome results -- because the difference between the "apparent" efficacy of TPA and the "real" efficacy of TPA will vary depending on the degree of stroke severity heterogeneity present in any TPA-for-stroke trial. This is a critically important issue that the stroke research community has not adequately considered -- the significance of the fact that acute ischemic stroke patients recruited into TPA-for-stroke RCTs have a wide range of expected stroke recovery due to the natural course of disease, which can vary >10x from <7% to >70%, and that imbalances in baseline stroke severity can obscure the RCT's *signal with a significant amount of *noise, thus decreasing one's confidence in the validity of the RCT's unadjusted results.
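The dependence of the "apparent" efficacy on the severity mix can be sketched in Python by holding the subgroup response rates fixed and changing only the proportion of patients in each NIHSS subgroup. The treated rates are the 91-180 minutes subgroup fractions and the placebo rates are the "hypothetical" placebo percentages used above. The first mix is the 5/30/30/30/5 split of hypothetical trial number 2; the other two mixes are invented purely to show the direction of the effect, and the function names are arbitrary.

```python
# Fixed rates of favorable outcome per NIHSS subgroup (1-5, 6-10, 11-15, 16-20, >20).
TREATED_RATES = [24 / 29, 23 / 37, 10 / 26, 9 / 33, 4 / 28]
PLACEBO_RATES = [0.82, 0.45, 0.32, 0.19, 0.04]

def standardised_rate(rates, mix):
    """Weighted average of subgroup rates for a given severity mix."""
    if abs(sum(mix) - 1.0) > 1e-9:
        raise ValueError("mix proportions must sum to 1")
    return sum(rate * weight for rate, weight in zip(rates, mix))

def apparent_efficacy(mix):
    """Absolute risk difference when both arms share the same severity mix."""
    return (standardised_rate(TREATED_RATES, mix)
            - standardised_rate(PLACEBO_RATES, mix))

MIXES = {
    "hypothetical trial 2 (5/30/30/30/5)": [0.05, 0.30, 0.30, 0.30, 0.05],
    "invented mix weighted to NIHSS 6-10": [0.05, 0.60, 0.20, 0.10, 0.05],
    "invented mix weighted to NIHSS 16-20": [0.05, 0.10, 0.20, 0.60, 0.05],
}

for label, mix in MIXES.items():
    print(f"{label}: placebo {standardised_rate(PLACEBO_RATES, mix):.1%}, "
          f"apparent efficacy {apparent_efficacy(mix):.1%}")
```

Even with perfectly balanced arms and unchanged subgroup response rates, the overall absolute risk difference shifts by several percentage points as the severity mix changes.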
I can think of only one way of avoiding the confounding effects of this stroke severity heterogeneity problem, and that is to plot the favorable stroke outcome results of a TPA-for-stroke RCT for each baseline NIHSS score (as was performed in figure 2). That would eliminate the stroke severity heterogeneity problem entirely. The difference in height between the "best fit" curves for the plotted results would represent the "true" efficacy of TPA for the entire range of stroke severity NIHSS scores. However, that TPA-for-stroke RCT would probably have to be very large (>10,000 enrolled patients) for the results to be statistically valid.
Jeff Mann. MD.
Retired Emergency Physician.
E-mail address: jmannemg@earthlink.net
Date that I received the NINDS study's patient-level data: October 8th 2003.
Date that I completed this personal analysis of the NINDS study: October 19th 2003.
The limitations of a RCT in determining the scientific truth:
I recently read an insightful book called "Fiction and Fantasy in Medical Research: The Large-scale Randomised Trial", which was written by James Penston. The author is a critic of large RCTs and he thinks that they have a limited ability to determine the scientific truth. I am sympathetic to his point of view, and I would highly recommend the book to trial designers, clinical trialists, trial interpreters, and clinicians who base their clinical practice on EBM evidence from RCTs.
Here is a selection of quotes from his book, selectively presented through the prism of my rose-tinted glasses.
"In other words, the fundamental objective of clinical trials is the identification of causal relationships. However, in order to ensure that any difference in outcome is related to the difference in exposure or treatment, the groups must be matched in all all other respects."
"Even if it were theoretically possible for our knowledge to extend to every minute detail of every factor relevant to the outcome of disease, it would, in practical terms, be impossible to create two treatment groups matched in every respect. As a consequence, any claim that a difference in outcome of a clinical trial between one group and another is due to the effect of treatment may be challenged on the grounds that the difference may reflect an unequal distribution of other relevant variables."
"The internal validity of a clinical trial refers to the justification for the inference that the difference in outcome is due to the difference in treatment. This inference is only valid if alternative explanations -- namely, the unequal distribution of other relevant variables among the treatment groups, the presence of bias in the assessment of outcome and the difference being due to chance -- have been excluded by the design, performance and analysis of the trial."
"During recent decades, though, statistics have acquired an exaggerated importance. The clinical relevance of the results now takes second place to the assertion of statistical significance, while imperceptible differences between drug and placebo are obscured by boasts of low P-values. Researchers, editors of journals and the medical profession in general have become mesmerised by statistics while forgetting that statistical analysis is a minor player in medical research. The role of statistics is merely permissive, to nod affirmatively when the arithmetic indicates that chance is unlikely to explain the difference, thus allowing the more important business of judging the rest of the evidence, including the clinical relevance of the findings, to proceed."
"External validity concerns the justification for the assertion that the results of a clinical trial are applicable to a wider population of patients than simply those patients participating in a study. At this point, it is worth noting that if the internal validity is poor, then there is little point in considering the external validity."
"One of the principles of the scientific method is that any reference class -- that is, the group or collection of items about which a general statement is to made -- should be restricted to the most homogeneous class available."
"In theory at least, the randomised controlled trial appears to offer a solution to the problem of heterogeneity. However, the methodology is complex, and, at every stage, errors may occur including faults with the initial random allocation of treatment, the failure to preserve the equal distribution of relevant variables throughout the study period and the distribution of the randomisation process during data analysis."
"Underlying this approach, there seems to be the assumption that the problems of randomised controlled trials are grounded in human error alone, that they may be eradicated by a concerted campaign of education, and that, provided researchers as well as those involved in checking their work behave correctly, all will be well and clinical research with yield reliable generalisations. But this is a one-sided analysis. It is the case of blaming the workman without any regard to the quality of his tools. An alternative interpretation is that the sheer complexity of randomised trials impedes compliance with all of the conditions required for internal validity. Indeed, methodology is so unwieldy that it seems unlikely that full compliance would ever be achieved on a regular basis. And the reliability of the results is so dependent upon compliance at every stage of the process that a single error or omission may bring the whole enterprise into question."
"If the practice of randomised trials so frequently fails to meet the requirements for internal and external validity, then perhaps the fault lies in the methodology itself. But such a conclusion is anathema to those involved in medical research who view the randomised controlled trial as sacrosanct. Indeed, the randomised trial is so entrenched in the minds of researchers that it has become, to quote Kuhns terminology, a paradigm, and, as such, must not be a subject of doubt. Nonetheless, doubts about the methodology cannot be ignored, especially when the focus turns to large-scale randomised trials."
"For all the accolades bestowed upon them, mega-trials are permeated with uncertainty. Wherever we look, we stumble across nothing but a morass of murky data, unsubstantiated claims and dubious inferences."
"Large-scale randomised trials are deeply flawed. They deliver nothing but a deep shadow of therapeutic benefit, a pretense of efficacy to fool the unwary and the means by which those with little interest in either the integrity of medical research or genuine improvements in health care inflate their reputations and maximise their profits. Time is likely to show that what is being built on the flimsy foundations of mega-trials is nothing more than a house of cards."
Definitions of some terms used in this manuscript:
Favorable stroke outcome -- also called excellent stroke outcome. Many different stroke outcome measures can be used to quantify stroke outcome results, and the NINDS trialists used a number of different stroke outcome measures (modified Rankin Scale, Barthel index, Glasgow Outcome Scale, Global Statistic) when describing their study's results. I use the modified Rankin Scale (mRS) almost exclusively because it is the primary endpoint outcome measure in most TPA-for-stroke RCTs, and it is also the most commonly used stroke outcome measure in the published medical literature.
External validity -- The extent to which the conclusions of a clinical study of an investigational drug can be correctly applied to persons beyond those who were investigated. Those persons are generally patients in community clinical practice who are expected to receive the investigational drug when the drug is subsequently marketed.
Internal validity -- The extent to which the conclusions of a clinical study are correct for the subjects under investigation.
Modified Rankin scoring scale for measuring stroke outcome
Level 0 -- No symptoms
Level 1 -- No significant disability, despite symptoms; able to perform all usual duties and activities
Level 2 -- Slight disability; unable to perform all previous activities but able to look after own affairs without assistance
Level 3 -- Moderate disability; requires some help, but able to walk without assistance
Level 4 -- Moderately severe disability; unable to walk without assistance and unable to attend to own bodily needs without assistance
Level 5 -- Severe disability; bedridden, incontinent, and requires constant nursing care and attention
Evidence based medicine (EBM) terminology
Absolute risk reduction = control event rate - treated event rate, expressed as a numerical figure or as a percentage; or the absolute value of the difference between the experimental event rate and the control event rate, |EER - CER|. The absolute risk reduction is also referred to as the absolute risk difference.
Number needed to treat = The number of patients that need to be treated with a proposed therapy in order to prevent an outcome event or cure one individual; 1 divided by (control event rate - treated event rate); or the reciprocal of the absolute risk reduction (1/ARR).
Odds ratio = The odds of the target event in one group divided by the odds of the target event in the other group, where the odds in a group are the proportion of patients with the target event divided by the proportion without the target event.
Relative risk = The risk of a target event in control patients relative to the risk of a target event in treated patients (or vice versa).
Relative risk reduction = (control event rate - treated event rate) divided by control event rate, expressed as a numerical figure or as a percentage; or the absolute value of the difference between the experimental event rate and the control event rate divided by the control event rate, |EER - CER| / CER.
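The definitions above can be illustrated with a minimal Python sketch. The function name `ebm_measures` and the patient counts are hypothetical and are not NINDS data; the sketch assumes the target event is an adverse outcome, so that a lower event rate in treated patients gives a positive ARR, and it reports the relative risk as treated relative to control.

```python
def ebm_measures(control_events, control_total, treated_events, treated_total):
    """Return the basic EBM indices for one two-group comparison."""
    cer = control_events / control_total   # control event rate (CER)
    eer = treated_events / treated_total   # experimental (treated) event rate (EER)
    arr = cer - eer                        # absolute risk reduction
    return {
        "CER": cer,
        "EER": eer,
        "ARR": arr,
        # NNT is the reciprocal of the ARR; undefined (infinite) when ARR is zero
        "NNT": 1 / arr if arr != 0 else float("inf"),
        "RR": eer / cer,                   # relative risk, treated relative to control
        "RRR": arr / cer,                  # relative risk reduction
        # odds ratio: odds of the event in treated divided by odds in control
        "OR": (eer / (1 - eer)) / (cer / (1 - cer)),
    }


# Hypothetical example: 40 of 100 control patients and 30 of 100 treated
# patients have the target event.
measures = ebm_measures(40, 100, 30, 100)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

When the target event is a favorable outcome (as with a favorable mRS score), the treated event rate is expected to be the higher one, and the absolute value of the difference is used instead.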
Many clinicians feel that the absolute risk reduction is a more valuable index than the relative risk reduction because it takes into account baseline risk, and because its reciprocal is the number needed to treat, which is a useful figure to use when comparing the relative value of different treatment options. If a journal article only supplies a numerical figure for the relative risk reduction, it can (fairly easily) be converted to the absolute risk reduction by multiplying the RRR by the control event rate, a value that is almost always supplied in the journal article.
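The conversion described above (ARR = RRR x control event rate, NNT = 1/ARR) can be sketched in a few lines of Python. The figures are hypothetical and chosen only to show how the same relative risk reduction produces very different absolute benefits at different baseline risks; the function names are illustrative.

```python
def arr_from_rrr(rrr, control_event_rate):
    """Absolute risk reduction recovered from a published RRR and the CER."""
    return rrr * control_event_rate


def nnt_from_arr(arr):
    """Number needed to treat is the reciprocal of the ARR."""
    return 1 / arr


# The same hypothetical 25% relative risk reduction at two baseline risks.
rrr = 0.25
for cer in (0.40, 0.04):
    arr = arr_from_rrr(rrr, cer)
    nnt = nnt_from_arr(arr)
    print(f"CER {cer:.0%}: ARR = {arr:.1%}, NNT = {nnt:.0f}")
```

With a control event rate of 40% the ARR is 10 percentage points (NNT of about 10), whereas with a control event rate of 4% the ARR is 1 percentage point (NNT of about 100), even though the RRR is identical.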
The ARR looks at the absolute difference between two groups in the risk of an event occurring, while the odds ratio compares, between the two groups, the odds of having an event versus not having an event (expressed as a ratio).
When events are uncommon, odds and risks are almost equal (risk is equal to odds/(1 + odds), and odds are equal to risk/(1 - risk)). Relative risk (RR) is the preferred term for a ratio of risks, and odds ratio (OR) for a ratio of odds. Although no one intuitively understands a ratio of odds, it has been the predominant measure of association, because it is essentially independent of the arbitrary choice between a comparison of the risks of an event (such as death) or the corresponding non-event (such as survival), which is not true of the relative risk. As clinicians, we would like to be able to substitute the RR (which we subjectively understand) for the OR (which we do not intuitively understand). The validity of this substitution requires that the RR, which is [a/(a + b)] divided by [c/(c + d)], be more or less equal to the OR, which is a/b divided by c/d, where a and b are the numbers of patients with and without the event in one group, and c and d are the corresponding numbers in the other group. For this to be the case, a must be much less than b, and c much less than d; in other words, the outcome must occur infrequently in both the treatment and control groups. For low event rates, common in most randomized trials, the OR and RR are very close. The RR and OR will also be closer together when the magnitude of the treatment effect is small (that is, OR and RR are very close to 1) than when the treatment effect is large.
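The relationship between the RR and the OR can be checked numerically. The following Python sketch uses hypothetical 2x2 tables (not NINDS data), with a and b taken as the numbers of patients with and without the event in one group, and c and d as the corresponding numbers in the other group; this labelling is the conventional one and is an assumption here.

```python
def relative_risk(a, b, c, d):
    """RR = [a/(a + b)] divided by [c/(c + d)]."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """OR = (a/b) divided by (c/d)."""
    return (a / b) / (c / d)


# Hypothetical tables, each with 100 patients per group.
tables = {
    "rare event (2% vs 4%)": (2, 98, 4, 96),
    "common event (30% vs 60%)": (30, 70, 60, 40),
    "common event, small effect (55% vs 60%)": (55, 45, 60, 40),
}

for label, (a, b, c, d) in tables.items():
    rr = relative_risk(a, b, c, d)
    or_ = odds_ratio(a, b, c, d)
    print(f"{label}: RR = {rr:.3f}, OR = {or_:.3f}, gap = {abs(rr - or_):.3f}")
```

In the first two tables the RR is the same (0.5), but the OR stays close to the RR only when the event is rare. The third table shows that, even with a common event, the gap narrows when the treatment effect is small.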
Instructions for printing the large graphs.
Choose "file" on Internet Explorer's tool panel => choose "page setup" => choose "landscape" orientation => set the margins to 0.5" => click OK => choose "print preview" to ensure that the entire graph can fit on the page.
1. The National Institute of Neurological Disorders and Stroke rt-PA Stroke Study Group. Tissue Plasminogen Activator for Acute Ischemic Stroke. N Engl J Med 1995;333:1581-1587.
2. Mann J. Truth about the NINDS study: setting the record straight. West J Med 2002;176:192-194.
Available at http://www.ewjm.com/cgi/content/full/176/3/192
3. Lenzer J. Alteplase for stroke: money and optimistic claims buttress the “brain attack” campaign. BMJ 2002;324:723-729.
Available online at http://bmj.bmjjournals.com/cgi/content/full/324/7339/723
4. Rapid response letter to the bmj -- The NINDS Stroke Study Group Response James C. Grotta, et al. bmj.com, 27 Jun 2002.
Available online at http://bmj.bmjjournals.com/cgi/eletters/324/7339/723#23369
5. Rapid response letters to the bmj -- Mann J.
a. The raw data of the NINDS trial should be made public. Jeffrey Mann. bmj.com 8 July 2002.
Available online at http://bmj.bmjjournals.com/cgi/eletters/324/7339/723#23692
b. The difference between the "apparent" and "true" efficacy of tPA in the NINDS trial. Jeffrey Mann. bmj.com, 14 Jul 2002.
Available online at http://bmj.bmjjournals.com/cgi/eletters/324/7339/723#23927
6. Ingall TJ, O’Fallon WM, Louis TA, et al. Initial findings of the rt-PA acute stroke treatment review panel. Cerebrovasc Dis 2003; 16 Suppl 4: S1-S125.
7. Sackett, David L. Why randomized controlled trials fail but needn't: 2. Failure to employ physiological statistics, or the only formula a clinician-trialist is ever likely to need (or understand!) CMAJ: Canadian Medical Association Journal. 165(9):1226-1237, October 30, 2001.
Available online at http://www.cmaj.ca/cgi/content/full/165/9/1226
8. Adams, H. P. Jr. MD. Davis, P. H. MD. Leira, E. C. MD. Chang, K.-C. MD. Bendixen, B. H. PhD, MD. Clarke, W. R. PhD. Woolson, R. F. PhD. Hansen, M. D. MS. Baseline NIH Stroke Scale score strongly predicts outcome after stroke: A report of the Trial of Org 10172 in Acute Stroke Treatment (TOAST). Neurology. 53(1):126-131, July 13, 1999
9. Marler, J. R. MD. Tilley, B. C. PhD. Lu, M. PhD. Brott, T. G. MD. Lyden, P. C. MD. Grotta, J. C. MD. Broderick, J. P. MD. Levine, S. R. MD. Frankel, M. P. MD. Horowitz, S. H. MD. Haley, E. C. Jr. MD. Lewandowski, C. A. Kwiatkowski, T. P. MD. for the NINDS rt-PA Stroke Study Group. Early Stroke Treatment Associated With Better Stroke Outcome: The NINDS rt-PA Stroke Study. Neurology 55(11):1649-1655, December 12, 2000