As Jed S. Rakoff said in his Timbers’ lecture, science is increasingly important in the courtroom. It has ‘invaded the courtroom to an unparalleled extent’. However, since science and law differ in almost every respect, whether that is declaring what is right or ethical or debating the appropriate length of this thesis, Rakoff notices a ‘love/hate relationship’ between the two. Love, because exact science should be just that: exact. Hate, because often this isn’t the case; while science is associated with certainty, ‘scientists themselves are more comfortable these days with probabilities’. Another difficulty is that some types of science prove to be unreliable. Examples given by Rakoff are polygraphs and psychoanalysis. ‘How to keep pseudoscience out of the courtroom while letting real science in?’ Judges need to decide when science is admissible in court and when it isn’t, and that can prove to be very hard. ‘As now codified in Rule 702 of the Federal Rules of Evidence, a qualified expert can testify as to scientific, technical or other specialized knowledge only if a judge first determines that “(1) the testimony is based upon sufficient facts or data, (2) the testimony is the product of reliable principles and methods, and (3) the witness has applied the principles and methods reliably to the facts of the case.”’
‘Plainly, this rule is far from self-executing.’ An example: neuroscience is seen by Rakoff as being too undeveloped to be admissible – but that may very well change over time. Therefore, admissibility requires reassessment over time because of the dynamics of science. The problem is that science is getting so specialized that even scientists have problems assessing data and articles written by others – let alone jurists. Rakoff: ‘In short, science and law remain uncomfortable bedfellows; but twin beds are not an option. We may expect, therefore, that, jumbled together, they will toss and turn for a long time to come.’
A Leiden University Honours Class was started in the autumn of 2014, focusing on the problem addressed in Rakoff’s speech. Its title is taken from that speech: Science and the Law: Uncomfortable Bedfellows? (acronym: (HC) SLUB)
In this class, we have been looking at various cases to illustrate clashes between sciences. Our main object of study was the case of Lucia de Berk, who was convicted (basically) on statistics. Her case is special because it contains many different types of clashes, both intra- and interdisciplinary. In this Honours Class, we decided there were four types of disciplines: alpha, beta, gamma and delta. I will define these in Section Two. We had lectures on Dutch Law, Medicine Law, Statistics and Law and Engineering and Law to make ourselves familiar with the parts of the Dutch legal system that were most important in both her conviction and her eventual exoneration.
In this thesis, I would like to focus on several clashes in the case of Lucia de B. The first is a beta-beta clash: a discussion between statisticians on how to calculate the probability that Lucia de B. was present at all the deaths without having anything to do with them. In addition, I will look at the behind-the-scenes talk that went on between jurists and doctors. The second clash is an alpha-beta clash on whether the case of Lucia de B. could be reopened. Finally, I am also very interested in the cases of poisoning that allegedly happened during her shifts. There were discussions not only about who did the poisoning but also (alarmingly) about whether there had been a poisoning at all. I find it hard to define this clash, but I think it is a beta-beta-alpha one: was there a poisoning (beta-beta), and whose testimony do we take seriously (beta-alpha)? The problem in all of these clashes is contradictory expert evidence. I regard this as a serious PROBLEM, because when different experts proclaim different truths, who do you listen to? This is one of the problems on which Lucia’s conviction was based: the judges hand-picked the evidence that suited them most when they could, and I don’t think this is how the law is supposed to work. I would define the PROBLEM in the following way:
That experts can give contradictory evidence, resulting in heavily weighing testimonies that can be hand-picked in court
My DREAM would be
That there would be a better way of picking and comparing evidence to decide which one is the most trustworthy.
I shall first describe what tools I will use to analyse this problem. Then I will introduce and describe the aspects of the Lucia de B. case most relevant to my story, which I have briefly mentioned above. I will try to make an engineering analysis to research how my DREAM can help solve the PROBLEM. It is important to know all relevant actors in the aspects of the case mentioned above to see how their roles could be improved. In the last sections, I will revisit the analysis to see how falsifiable and relevant it is.
In the Honours Class, we have learnt about many tools for the analysis of a problem.
I will first give a short overview of the various tools, so that I can refer to them where necessary.
- The lack of rules or law, which leads to ad hoc and inconsistent adjudication.
- Failure to publicize or make known the rules of law.
- Unclear or obscure legislation that is impossible to understand.
- Retrospective legislation.
- Contradictions in the law.
- Demands that are beyond the power of the subjects and the ruled.
- Unstable legislation (ex. daily revisions of laws).
- Divergence between adjudication/administration and legislation.
Disciplines rub because of friction between applied and fundamental knowledge. To bring them closer to each other we can use several heuristics (reasoning shortcuts that are imperfect but easy to compute): crude look at the whole, primacy of parsimony, complete and consistent, quantum mechanics.
There are four different disciplines in science; alpha, beta, gamma and delta. The alpha sciences are literary disciplines like languages, history and law. The beta sciences are the ‘exact sciences’ like maths, statistics, or biochemistry. I will compare between those two, but for the sake of completeness: the gamma discipline contains the social studies like psychology, and the delta discipline uses principles even though they are already falsified.
Visceral creed is an individual’s (or an institution’s) collection of gut-felt convictions or culturally constrained (normed) talents. Knowledge is comprehension that most participants in a group/culture/institution are prepared to act upon (useful comprehension). Conviction is a modality of understanding, of comprehension.
Culturally constrained (normed) talents turn (with situated practice) into niche-dependent capabilities – e.g.:
To survive (Darwin: repr., met. & eco. efficiency)
To group (Sherif: Robbers Cave; Asch’s exp.)
To band against bullies (De Waal)
To worship & cooperate (religions, wars)
To experiment, understand (convictions)
To communicate (imagery, languages)
Models (also: theories) support useful comprehension, through fostering:
Connectedness (pattern recognition),
Unification (diverse application),
Lies are models or stories employed to cheat. Stories are all other forms of imagery.
There is a morality in admitting ignorance; stories presented as models breed lies.
I have read Derksen’s “Reconstructie van een gerechtelijke dwaling” and De Noo’s “Er werd mij verteld, over Lucia de B.”. I will occasionally refer to these books.
I have already discussed Rakoff’s article in my introduction. I would also like to refer to the summary of the articles by Margaret Berger on the Daubert case (The Admissibility of Expert Testimony) and David Goodstein on the workings of science (How Science Works), as I think it illustrates very well the influence the court can have on expert testimony and the difficulties in uniting science and the law. I would also like to refer to my own summary of Kaye and Freedman (Reference Guide on Statistics), because it explains statistical reasoning very well, especially all the mistakes that can be made in it. In a case where an innocent person was basically convicted because of mistakes made in statistical reasoning, I find it very important.
Berger and Goodstein summarized (Freweini’s summary)
Berger described the Daubert trilogy and its impact in order to demonstrate the issues that judges encounter and have to resolve.
When determining whether scientific evidence is admissible, federal circuits refer to the Daubert decision.
Goodstein described how the role of scientific proof has been the subject of discussion. More precisely, judges these days have to decide whether something is scientific or not. Owing to the scientific discoveries made since the 16th century, science has taken a central role in our lives and has inevitably penetrated the courthouses. But what is science? Science is essentially the scientific method, a way of testing and theoretically explaining the natural world to discover important truths about it.
And how does science work? Numerous thinkers have considered this matter. Francis Bacon’s idea that the collecting of observations should proceed without prejudice is disputed by many thinkers, since we all do science on the basis of presumptions. Karl Popper constructed a more sceptical theory, the Falsification Theory, which states that a theory can never be proved right by agreement with observation, but can be proved wrong by disagreement with observation. Good ideas can therefore be replaced by even better ideas. However, the scientific community doesn’t follow the path of Popper, since credit in science is most often given for offering correct theories. Kuhn adds to the description of how science works that contradictions and difficulties arise that cannot be resolved. Scientists won’t acknowledge them until these difficulties accumulate to the point where ignoring them becomes intolerable. This causes a scientific revolution, which replaces the present paradigm (way of thinking) with an entirely new one: a so-called ‘paradigm shift’ has occurred. But the question remains how big the change must be in order to qualify as a paradigm shift.
That science and law might be uncomfortable bedfellows is due to the fact that they differ both in their use of language and in the objectives they seek to accomplish. The same word might have different meanings. For example, the word ‘force’, as used by lawyers, is associated with violence and the domination of one person by another, whereas in science it is associated with speed and direction of motion. Another word, more applicable to the Lucia de B. case, is ‘error’. In law, ‘error’ and ‘mistake’ are used almost synonymously. In science, however, error and mistake have different meanings. Mistakes can be made by everyone and do not have to be reported in the scientific literature. An error, however, is intrinsic to any measurement; it should not be ignored or covered up but carefully analysed in order to put limits on findings.
Furthermore, the objectives of law and science differ significantly: the objective of the law is justice, whereas the objective of science is truth. Consequently, in law a decision has to be made within a reasonable and limited period of time, whereas in science there is no time limit.
Despite these differences, both science and the law seek to arrive at rational conclusions that surpass the prejudices and self-interest of individuals.
Finally, in order to regulate the distinction between science and pseudoscience, the Daubert decision was handed down in 1993 by the U.S. Supreme Court. Daubert says that methods should be judged on the following.
- The theoretical underpinnings of the methods must yield testable predictions by means of which the theory could be falsified.
- The methods should preferably be published in a peer-reviewed journal.
- There should be a known rate of error that can be used in evaluating the results.
- The methods should be generally accepted within the relevant scientific community.
The Daubert decision touches upon Popper’s perspective, but it manages to avoid the notion that scientists must be sceptical of their own ideas. Instead, it is an impressive attempt to serve justice, if not the truth.
Summary Kaye and Freedman
This reference guide describes the elements of statistical reasoning.
Statistical studies will generally be admissible under the Federal Rules of Evidence. Sometimes a study may not fit the issue at hand, but more often the battle over evidence concerns its weight or sufficiency.
Statistics has three subfields: probability theory, theoretical statistics, and applied statistics. Many scholars are exposed to statistical ideas, but experts in statistics are more likely to interpret results correctly. Forensic experts may lack information on how the data were compiled and processed, while statisticians may lack the background information required to frame a problem correctly.
What are procedures that enhance statistical testimony?
- Maintaining professional autonomy. Objectivity is required.
- Disclosing other analyses. When there are several ways of looking at the data, not only the favourable results should be presented.
- Disclosing data and analytical methods before trial. Collection of data is hard, expensive and time-consuming. Pretrial discovery procedures should be used to minimize debates over accuracy and choice in techniques.
How Have the Data Been Collected, Is the Study Designed to Investigate Causation?
Work is needed to bridge the gap between association and causation. Anecdotal evidence can’t be conclusive because there is no control group, so a randomized controlled experiment is needed. Proceed carefully with the control group; often the control group and the group under study differ in ways other than the one examined. In a controlled experiment, the investigators make the groups as equal as possible. In observational studies, however, the subjects ‘choose’ their own group, making them likely to differ in other factors as well.
Inferences based on randomized controlled experiments are generally more secure than inferences based on observational studies. Comparisons with a control group are essential.
Often, an experiment is not possible. Observational studies provide good evidence if the association is seen in different types of groups, if the association holds when the effects of confounding variables are taken into account, and if alternative explanations are less plausible. One should always ask what the differences between the control group and the treatment group are.
Important question: can the results be generalized?
Are the subjects a representation of the outside world? Do different studies point in the same direction?
Descriptive Surveys and Censuses
- What method is used to select the samples in a population (units)? Sometimes it is hard to reach a group that is fitted exactly like the whole population. There may be both selection and non-response bias.
- Of the units selected, which are measured? Surveys should report non-response cases. A good survey defines an appropriate population, uses a probability method for selecting the sample, has a high response rate, and gathers accurate information on the sample units.
- Is the measurement process reliable? Reliability refers to reproducibility of results.
- Is the measurement process valid? Is there a correlation between the measurement process and the thing you want to measure?
- Are the measurements recorded correctly?
- What is random? Random means everyone could be chosen with the same probability, and looser definitions of randomness are inadequate for statistical purposes.
How Have the Data Been Presented?
- Are Rates or Percentages Properly Interpreted?
- Have appropriate benchmarks been provided? The figures must be put into perspective.
- Have the data collection procedures changed? Changes in definitions and collection methods influence numbers.
- Are the categories appropriate?
- How big is the base of a percentage? If the total number is small, changes in percentage are easily made.
- What comparisons are made? Would another comparison give a different view?
Is an Appropriate Measure of Association Used? When a comparison with percentage points is used, the difference may seem small. But when both percentages are low, the relative difference may be big. The odds ratio is more symmetric. Aggregation may distort data.
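The difference between these measures can be made concrete with a small sketch. The two rates below (2% and 1%) are hypothetical and are not taken from Kaye and Freedman or from the case; the function name is likewise only illustrative.

```python
def association_measures(rate_a: float, rate_b: float) -> dict:
    """Compare two rates (fractions between 0 and 1) in three different ways."""
    odds_a = rate_a / (1 - rate_a)
    odds_b = rate_b / (1 - rate_b)
    return {
        "difference_in_points": (rate_a - rate_b) * 100,  # percentage points
        "relative_risk": rate_a / rate_b,                 # ratio of the rates
        "odds_ratio": odds_a / odds_b,                    # ratio of the odds
    }

# Hypothetical rates: the event occurs in 2% of group A and 1% of group B.
m = association_measures(0.02, 0.01)

# The same comparison phrased in terms of the event NOT occurring (98% vs 99%).
flipped = association_measures(0.98, 0.99)

print(m)
print(flipped)
```

With these numbers the difference is only one percentage point, while the relative risk is two. The odds ratio for the non-event is exactly the inverse of the odds ratio for the event, which is one sense in which it is ‘more symmetric’; the relative risk does not have this property.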
Does a Graph Portray Data Fairly?
- How are trends displayed? Pay attention to the scales on the axes.
- How are distributions displayed? Inquire how the analyst chose the bin widths.
Is an appropriate measure used for the Centre of a distribution? Mean, mode and median are different things.
Is an appropriate Measure of Variability used? Look carefully at the spread; the range, interquartile range and standard deviation are different. You should supplement them with a figure that displays much of the data.
What inferences can be drawn from the data? One must watch out for random errors.
In the case of estimation;
- What estimator should be used?
- What is the standard error? The confidence interval?
- How big should the sample be?
- What are the technical difficulties? Confidence intervals are only usable if the standard deviation is small.
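As a rough illustration of how the standard error, the confidence interval and the sample size hang together, the sketch below uses the textbook normal approximation for a sample proportion. The counts are hypothetical, and the approximation is only reasonable for fairly large samples.

```python
import math

def proportion_interval(successes: int, n: int, z: float = 1.96):
    """Return (estimate, standard error, lower, upper) for a sample proportion.

    Uses the normal approximation; z = 1.96 gives an approximate 95% interval.
    """
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, p_hat - z * se, p_hat + z * se

# Hypothetical samples with the same observed proportion (50%).
small = proportion_interval(50, 100)
large = proportion_interval(200, 400)

print(small)
print(large)
```

Quadrupling the sample size halves the standard error, and with it the width of the interval.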
Significance levels and Hypothesis Tests
- What is the p-value? The p-value is the probability of getting data as extreme as, or more extreme than, the actual data—given that the null hypothesis is true. It is not the probability that the null hypothesis is true.
- Is a difference statistically significant? 5% and 1% in p-value are most used.
- Tests or interval estimates? The p-value does not measure the strength or importance of an association.
- Is the sample statistically significant? Samples can’t be statistically significant, they are representative or unrepresentative.
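The definition of the p-value can be illustrated with a minimal sketch. The scenario is hypothetical: 8 out of 10 events fall in a category that, under the null hypothesis, should contain each event with probability 0.5. The one-sided p-value is then the binomial probability of seeing 8 or more.

```python
from math import comb

def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): data as extreme as, or more extreme than, k."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical data: 8 of 10 events, null hypothesis p = 0.5.
p_value = binomial_tail(8, 10, 0.5)
print(p_value)
```

The result (56/1024, about 0.055) is the chance of such data if the null hypothesis is true. It says nothing directly about the chance that the null hypothesis itself is true.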
Evaluating Hypothesis Tests
- What is the power of the test? Power is the chance that a test will declare an effect when there is an effect to be declared. This depends on the size of effect and sample. When no effect is seen in a low power study, the result is inconclusive and not negative.
- What about small samples? Underlying assumptions are hard to validate, confidence intervals are hard to compute and they may be unreliable.
- One tail or two? Either can be appropriate, but they produce different p-values, so it should be made clear which test was used.
- How many tests have been done? When a test is repeated too often, artifacts can come up.
- What are the rival hypotheses? That one proves wrong doesn’t mean the other is right.
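Why repeated testing produces artifacts can be shown with one line of arithmetic. The sketch below assumes that the tests are independent and that each uses the same significance level; both are simplifying assumptions.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 20):
    print(k, round(familywise_error(0.05, k), 3))
```

At the 5% level, twenty independent tests on data with no real effect give a chance of roughly 64% that at least one of them comes out ‘significant’.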
A frequentist statistician will not compute the probability that a hypothesis is correct. A Bayesian will, but this requires assigning a probability to the hypothesis prior to looking at the data.
- Correlation and regression
One could plot two variables in a scatterplot to look at the correlation. This can be quantified with a correlation coefficient, designed for a linear association. The coefficient can be influenced by outliers. Also, association does not mean causation. A regression line can be created, although it becomes less trustworthy as we move away from the bulk of the data. The unit of analysis should be the same as the unit we want to compare. Example: a state doesn’t go to school, people do. We should look at people and not at state scores if we want to look at correlation.
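The sensitivity of the correlation coefficient to outliers is easy to demonstrate. The data points below are invented for the purpose of illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data with a weak upward trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 3, 2, 3]

r_without = pearson_r(xs, ys)
# Add a single extreme point far away from the bulk of the data.
r_with = pearson_r(xs + [20], ys + [20])

print(r_without, r_with)
```

A single extreme point lifts a moderate correlation to one that looks almost perfect, which is why a scatterplot should accompany the coefficient.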
- Frequentists and Bayesians (or: objectivists and subjectivists)
A Bayesian sees probabilities as representing not facts but degrees of belief: Probabilities are subjective. Assessing probabilities is never straightforward, and assumptions should be questioned.
The remainder of the appendix consists of technical backup for the examples in the paper, a glossary of terms, and a reference guide. They are not summarized in this summary.
Probability reasoning was used for the first time when director Smits of the hospital went to the police, saying that one of his nurses had been far too often present at the deaths of his patients. In his own words, he had used ‘rough/rush statistics’ and said ‘nothing has happened with these statistics apart from me going to the police’ (Derksen). Apparently, he had enough faith in his statistics both to go to the police and to give a declaration to Dutch newspaper ‘De Telegraaf’. His calculations aren’t known to us, so unfortunately we know neither ‘the impossibly big number’ he talked about nor the reasoning behind it. However, once in court, the statistics expert Elffers redid the calculations and got the (still) impossibly large number of one in 342 million. This number got into people’s heads, so even once the court acknowledged that ‘one cannot be convicted on statistics’ the number was still in their minds. In the end, Lucia got convicted on implicit statistics, ‘for it was just not probable that she had been present at so many deaths’.
Ton Derksen argues in his book ‘Reconstructie van een gerechtelijke dwaling’ that the probabilities Elffers came up with are wrong. His main argument is that Elffers doesn’t account for a priori probabilities in his calculations. That is, the calculation ignores the question: how big is the chance of there being a serial killer present in a hospital in the Netherlands? When you take into account that this chance is in itself very small, the chance that Lucia is that serial killer becomes much smaller as well. He also argues that Elffers answers the wrong question: he computes the chance that an innocent nurse has shifts during eight incidents, instead of the chance that someone who has shifts during eight incidents is an innocent nurse. There is a subtle difference between those two questions. Furthermore, Elffers multiplies probabilities from different hospitals, which doesn’t make sense: if you follow this reasoning through, nurses who switch hospitals more often become more suspect of being serial killers.
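The role of the a priori probability can be sketched with Bayes’ rule. All three input numbers below are hypothetical and chosen only to show the mechanism; they are not Derksen’s or Elffers’ figures.

```python
def posterior_guilt(prior_guilt: float,
                    p_evidence_if_guilty: float,
                    p_evidence_if_innocent: float) -> float:
    """Bayes' rule: P(guilty | evidence) from a prior and two likelihoods."""
    numerator = prior_guilt * p_evidence_if_guilty
    denominator = numerator + (1 - prior_guilt) * p_evidence_if_innocent
    return numerator / denominator

# Hypothetical inputs:
#   prior chance that a given nurse is a serial killer: 1 in 1,000,000
#   chance of the observed coincidence if she is guilty: 1
#   chance of the observed coincidence if she is innocent: 1 in 10,000
post = posterior_guilt(1e-6, 1.0, 1e-4)
print(post)
```

Under these assumed numbers, a coincidence with a chance of only 1 in 10,000 for an innocent nurse still leaves the probability of guilt at about 1%, because the prior is so small. The chance of the data given innocence and the chance of innocence given the data are different quantities.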
Derksen furthermore states in Chapter Five of his book that even without thinking about a priori chances, Elffers’ number is entirely wrong. He argues that the method of getting all the data is prejudiced. Because the court can reclassify deaths as being natural or suspected, and it can pick its evidence as already mentioned above, the data can’t be objective and therefore, no fair statistical calculations are possible (Derksen Chapter 5II1).
His second argument is that data are used twice, for creating and for testing the hypothesis. In a fair test, there must be separate data for creating and testing. A special series of events doesn’t automatically mean something special is going on; it could be chance. To prove it, you should look at data and see if the special series happens again. (Derksen, Chapter 5II2)
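The danger of using the same data to form and to test a hypothesis can be shown with a small simulation. The setup is an assumption made for illustration only: 20 nurses, 20 incidents, and each incident falling at random on one nurse’s shift, so that nobody is guilty. In every run the nurse with the most incidents is singled out afterwards and then ‘tested’ as if she had been named in advance.

```python
import random
from math import comb

def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def share_of_false_alarms(n_nurses=20, n_incidents=20, n_runs=2000, seed=1):
    """Fraction of runs in which the most 'unlucky' innocent nurse looks significant."""
    rng = random.Random(seed)
    alarms = 0
    for _ in range(n_runs):
        counts = [0] * n_nurses
        for _ in range(n_incidents):
            counts[rng.randrange(n_nurses)] += 1  # incident lands on a random nurse
        worst = max(counts)  # pick the suspect AFTER looking at the data
        # Naive p-value that ignores how the suspect was selected.
        if binomial_tail(worst, n_incidents, 1 / n_nurses) < 0.05:
            alarms += 1
    return alarms / n_runs

share = share_of_false_alarms()
print(share)
```

Although every nurse in the simulation is innocent, the naive test flags the selected nurse far more often than the nominal 5%. A fair test would need fresh data collected after the suspicion arose.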
The data of Lucia’s shifts in other hospitals should be added to the calculation. As it stands, the main data used are from the Juliana Children’s Hospital, and those are the most suspect. When other hospitals are taken into account, the chance becomes less extreme. (Derksen, Chapter 5II3)
Finally, incidents where Lucia wasn’t present should be taken into account as well. As it was, an incident was suspect when Lucia was present, but when it appeared that she wasn’t present after all, the incident automatically became non-suspect. This is ridiculous, because it makes the data look as if many more incidents happened during Lucia’s shifts. Incidents when Lucia wasn’t present aren’t marked as incidents at all. (Derksen Chapter 5II4)
Derksen eventually makes another calculation in which he tries to follow his own advice and arrives at the number 1 in 44. He states that Richard Gill, with his own calculations, arrives at the number 1 in 9. Richard Gill himself argues: “Actually, later I had to withdraw my “one in nine”. It was based on a correction to the data made by Ton Derksen and I only later realised that he had “sinned” by biasing the data heavily in the other direction using a “trick” which the court had earlier authorized (for legal reasons) but which statistically, is inadmissible. Today I would promote the number “one in 26” based on the same methodology as the former “one in 9” but on a less biased data-set. But I also have some other numbers based on several other methodologies and based on different assumptions. The “one in 26” occurs in the preprint http://arxiv.org/abs/1009.0802 which was submitted to the famous journal “Statistica Neerlandica” but rejected on the basis of a single referee report; I rather suspect the single referee was my old friend Henk Elffers.”
He also points out that Elffers made several errors, most of which I have already mentioned above. I quote: “Elffers’ most famous number was “1 in 342 million”. I would say that there are three major errors involved in that number each inflating it by a factor of between 100 and 1000. Two of those errors are IMHO inexcusable.
Secondly there is the possibility that the court misunderstood the number (the so-called prosecutors fallacy). Certainly, most journalists and most readers of newspapers misunderstood the number.
Error 1: nobody checked the data – it had in fact been gathered in a highly biased way.
Error 2: no taking account of confounding factors e.g. weekends, different kinds of nurses *do* have different kinds of shifts…
Error 3: the product of three p-values is not a p-value.
Errors 1 and 3 are, IMHO, inexcusible (ie proof of gross incompetence). Trouble is, the defence did not hire competent statisticians either.
(…) a second nurse had almost the same amount of incidents in her shifts as Lucia (maybe one less) but was never investigated.
Then the issue, whether the probability was interpreted properly. P(A | B) is not the same as P(B | A). The number was not meant to be “the chance Lucia is innocent, given such extreme statistics”, but it was meant to be “the chance of such extreme data, assuming that Lucia is innocent”.
See for instance http://probabilityandlaw.blogspot.nl/2014/11/the-ben-geen-case-another-miscarriage.html for a recent discussion in connection with a similar case in the UK.”
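Gill’s Error 3 can be illustrated with a short calculation. This is not Gill’s own computation; it is a sketch under the assumption that the three p-values are independent and uniformly distributed under the null hypothesis, in which case the chance that their product is at most t equals t(1 + L + L²/2) with L = −ln t (this is equivalent to Fisher’s method for combining p-values). The input values are hypothetical.

```python
from math import log, factorial

def true_p_of_product(product: float, n: int) -> float:
    """P(p1 * ... * pn <= product) for n independent p-values uniform on (0, 1)."""
    neg_log = -log(product)
    return product * sum(neg_log**k / factorial(k) for k in range(n))

# Hypothetical: three separate tests each give p = 0.1.
naive = 0.1 * 0.1 * 0.1            # the product, wrongly read as a p-value
corrected = true_p_of_product(naive, 3)

print(naive, corrected)
```

The product 0.001 looks impressive, but under these assumptions a product that small or smaller occurs by chance about 3% of the time, roughly thirty times more often than the naive reading suggests. The smaller the product, the larger this factor becomes.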
Another problem that was present from the beginning in this case, and that also played a big role in the ‘convict her!’ feeling toward Lucia, was the familiarity with which everyone dealt with each other. Everyone knew everyone: an ‘ons kent ons’ culture. The director of the hospital knew people in court, and decisions were known before they were officially made or made public. People in court could easily approach doctors who would be giving evidence and discuss the proceedings in advance. This problem is clearly addressed in Metta de Noo’s book, in which she describes her involvement in the case from a personal perspective. Metta herself was of course also drawn to Lucia’s case by means of ‘ons kent ons’: her sister-in-law was chief paediatrician at the hospital where Lucia worked. Metta’s first suspicions were based on the fact that her sister-in-law was ill and was perhaps misinterpreting facts (De Noo, Er werd mij verteld). However, Metta couldn’t have become the important figure she was in Lucia’s fight for freedom without her brother at her side. Ton Derksen, philosopher of science, was so incensed on Lucia’s behalf when he found out about everything that had gone wrong in her case that he wrote a book that played an important role in getting Lucia freed: the advantage of knowing academics. Academically educated people live in a small world; they have known each other since university, are married to each other, have friends from their studies, etc. They eat together and talk about work. They are interested in each other’s problems, and of course they discuss how their specialty can help in that case of the serial killer who covered her tracks very well; after all, a serial killer has to be convicted, doesn’t she?
Many people have argued for the reopening of Lucia’s case. At the forefront were Derksen and De Noo, who formed a support committee. They argued that the chain-link proof was weak, that the digoxin measurements weren’t reliable, that the statistical evidence was weak and that not even one murder could be proven on its own. However, the Dutch legal system requires a novum (a new fact) to reopen a case, and none was available. All the data were already there, but experts had drawn the wrong conclusions, causing the court (by picking the experts that said incriminating things about Lucia) to convict her.
The Posthumus II commission looks at closed cases. Derksen submitted his research in spite of there being no new data, and the commission decided (against the rules) to look more closely at the case. It eventually concluded that the case had been biased from the start because jurists and medical officers knew each other well. The independent medical experts were hardly independent, since they had also helped to collect the evidence, and the commission therefore recommended that the case be reopened.
The following three questions were the main reasoning behind this:
- Was there a digoxin poisoning?
- Were the experts independent?
- Was the reasoning behind the statistics (that had induced the tunnel vision of Lucia being guilty) sound?
Baby Amber, whose death set off Lucia’s prosecution, died in September 2001. She was six months old and had a long history in hospital. The court argued that she died of digoxin poisoning. Digoxin tests are known to give false positives, i.e. to report too high a concentration, in young children like Amber. This is because infants have Digoxin-Like Immunoreactive Substances (DLIS) in their blood. To test the digoxin concentration in Amber’s blood, three different methods were used: Emit 2000, IMx and HPLC-MS. According to the experts, HPLC-MS is by far the most specific test and creates the fewest artifacts; it distinguishes better between DLIS and digoxin. Gauzes were found on Amber’s body during a second autopsy. They contained some bloody fluid that was used in the tests. The results of these tests are given in Table 1.
Table 1. Results of different digoxin tests.
| September 5, 2002 | Emit 2000 | IMx | HPLC-MS |
|---|---|---|---|
| Fluid from gauzes | 22 µg/L | 25 µg/L | 7 µg/L |
Even though it is known that HPLC-MS is the most specific, the court decided that the two results that agree most closely are the reliable ones and finally used the average of those two tests, 24µg/L. When one test is first named ‘the gold standard’, why would you eventually exclude it?
However, the acceptable concentrations in blood are 1-2µg/L, so even the HPLC-MS measures an overdose, right? No. Another factor that the court did not take into account is that after death, concentrations of digoxin can become far higher because of leakage from other organs. Note also that there was only ‘bloody fluid’, which may very well have come from organs. According to specialists, a reduction of 5µg/L is appropriate when the blood is extracted after 24 hours. Amber’s ‘bloody fluid’ was extracted only after 48 hours, so a reduction of at least 5µg/L seems in order. However, since the court used the other data, it still concluded that Amber had been poisoned.
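The arithmetic behind this argument can be written out explicitly. The measured values come from Table 1 and the range and correction from the text above; treating the post-mortem correction as a flat subtraction from the HPLC-MS value is a simplifying assumption made here for illustration.

```python
# Measured concentrations in µg/L (Table 1).
EMIT_2000 = 22.0
IMX = 25.0
HPLC_MS = 7.0

# Values quoted in the text.
ACCEPTABLE_RANGE = (1.0, 2.0)     # acceptable blood concentration, µg/L
POST_MORTEM_CORRECTION = 5.0      # minimum reduction suggested by specialists, µg/L

# What the court did: average the two less specific tests.
court_value = (EMIT_2000 + IMX) / 2

# Alternative reading: take the most specific test and apply the correction
# (assumption: the correction is a simple offset).
corrected_hplc = HPLC_MS - POST_MORTEM_CORRECTION
within_range = ACCEPTABLE_RANGE[0] <= corrected_hplc <= ACCEPTABLE_RANGE[1]

print(court_value, corrected_hplc, within_range)
```

The court’s average is 23.5 µg/L (reported as 24), more than ten times the acceptable range, whereas the corrected HPLC-MS value of 2 µg/L falls at the upper edge of that range. The conclusion depends entirely on which measurement is selected.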
The Strasbourg laboratory was asked to measure the digoxin concentration as well and arrived at 7.4µg/L. However, the court received these results too late, when the verdict had already been given, and nothing was done with them. The reason for this is unclear.
In this case it is most clear that, according to the data, no poisoning happened at all, and where the specialists went wrong. When you compare the way the judges collected evidence in the cases of the other alleged poisonings, something very odd catches the eye: the judges quote an expert when he provides evidence against Lucia de B., yet someone can be called an expert in his field in one paragraph, only to be held ‘not expert enough’ two paragraphs later concerning the next ‘murder’. In this way, judges pick evidence like you would pick flowers: only the ones you consider pretty are carried away in your basket.
Let’s revisit the PROBLEM and DREAM statements made at the beginning of this thesis. I stated that the problem drawing my attention in the case of Lucia de B. was that experts can give contradictory evidence, resulting in heavily weighing testimonies that can be hand-picked in court and that that could be solved if there was a better way of picking and comparing evidence to decide which one is the most trustworthy. The question is: would the case of Lucia de B. have progressed differently if there had been a better (in hindsight) choice of experts? I will revisit my three cases where expert judgement might have been crucial to see what would have happened.
The number that was prevalent in the public eye was 1 in 342 million. Even when the court decided not to use statistics in the closing of the case, because it had enough other evidence, this was still the number in everyone’s mind. Lucia had to have done it; the probability was so small. The court was prejudiced because it had only one statistician, Elffers, look at the data. If it had asked the advice of more statisticians, it would have developed more feeling for statistics and hopefully realised that although statistics is a so-called exact science, its results can still vary widely depending on whom you ask. In hindsight, what definitely went wrong was that Elffers was working outside his specialty. He could be called an amateur statistician, and it is terrible that he had such a big influence on the case. According to Richard Gill, it was inexcusable that Elffers multiplied p-values, a mistake that Kaye and Freedman also discuss as a matter of basic statistics. Furthermore, it is painful that the collection of the data was not checked, when in hindsight the data were so biased. If the comparison with another nurse, who had only one incident fewer in the same period, had been made before the case and not after, Lucia’s case could have been put into perspective. Instead, the ‘very big number’ kept echoing in everybody’s mind. Here the parties at fault are Elffers, for making enormous mistakes that a statistician should not make, and the court, for accepting the expert at face value. Testimonies, also (or especially) those from experts, should be checked.
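To see why multiplying p-values is a mistake, a small numerical sketch may help. It compares the naive product with Fisher's method, which is swapped in here as one standard way of combining independent p-values; it is not claimed to be the method used by Gill, Elffers or the court. The input values are hypothetical and the function names are illustrative.

```python
import math


def naive_product(p_values):
    """The incorrect approach: treat the product of p-values as a p-value."""
    return math.prod(p_values)


def fisher_combined(p_values):
    """Fisher's method for k independent p-values.

    Under the null hypothesis, -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom. For an even number of degrees
    of freedom the tail probability has the closed form used below.
    """
    k = len(p_values)
    t = -sum(math.log(p) for p in p_values)  # equals -ln(product of p-values)
    return math.exp(-t) * sum(t**i / math.factorial(i) for i in range(k))


# Hypothetical input: three entirely unremarkable results.
p_values = [0.5, 0.5, 0.5]
print(naive_product(p_values))
print(fisher_combined(p_values))
```

With these hypothetical inputs the naive product is 0.125, while the combined p-value is roughly 0.66. The product shrinks with every result that is added, even when none of them is unusual, which is why it cannot be read as a p-value.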
In the article by Kaye and Freedman I found a very clear (although sometimes quite dull) reference manual on scientific evidence. They talk about practically everything that wasn’t done here. Questions like ‘How are the samples collected? What is the p-value?’ or cautions such as ‘p-values are not a probability’ could have done much good for the use of statistics in this case. I strongly recommend that courts start using this manual.
Of course, the widespread knowledge of the case didn’t help Lucia either. The fact that all the people who played roles in this case knew each other so well in advance played a big part in her conviction. This is something that should have gone differently; the probability calculation ensured that everyone wanted her convicted, and the intimacy between jurists and doctors ensured that they could convict her. Everyone who was involved in ‘achterkamertjespolitiek’ (backroom politics) is at fault here. It is not ethical to discuss a testimony prior to delivering it, or to speak about patients’ dossiers prior to handing them over to the court, etc.
For a case to be reopened, it needs to be severe, there has to be a novum, and there has to be a scientist or some kind of officer (e.g. a police officer with a bad conscience) who can report the case to the committee. In Lucia’s case, there was no novum, which delayed the request for reopening the case. Entire books were written before the request was made, because you want to be sure the request gets granted. Failure is not an option, and when you (strictly speaking) have no novum, the request needs to be even more sound.
I would say the Dutch system is at fault here, for making it so hard to reopen a case. This is not the fault of the experts working within the system. There are many other critical voices regarding the ‘Commissie evaluatie afgesloten strafzaken’. People are asking for a ‘Revisieraad’ so that the threshold to report cases will be lower and the committee will be independent. What would this have changed in the case of Lucia de B.? If there had been a ‘Revisieraad’, would the case have been reported sooner? Maybe, but it remains guesswork whether this ‘Raad’ would have been as thorough as Ton Derksen or as fanatical as all the other people who helped with the objection to Lucia’s conviction. Therefore I find it unclear whether it would have been able to free her.
The digoxin test results were critical in this case. According to the court, they proved that Amber had in fact been poisoned, and they were important pieces of evidence. The problem, of course, is that Amber wasn’t poisoned and that half of the tests showed exactly that. The experts analysing the material and delivering the results followed protocol without fault. The expert providing testimony, however, drew entirely the wrong conclusion and chose the less reliable tests as the main evidence. I find this inexcusable, because an expert on toxicology should know their stuff. It is odd that prior to knowing the results, De Wolff also stated that ‘the HPLC-MS is the best and most reliable test’. After he knew the results, nothing more was heard of that. I find it odd that the judges didn’t question this further. Other experts in poisonings (including some international ones) later stated that the tests didn’t show a poisoning. How come this wasn’t used in court? It looks like the court was out to convict Lucia. If more experts had given evidence in court, the court could have seen that it was questionable whether there had been a poisoning at all.
The problem right now is that experts, even though they are experts in their field, are still only human and can still draw the wrong conclusions. This could be solved by making it a requirement to have at least two experts for evidence that requires it. It would help if they came from different backgrounds and were as independent (also of each other!) as possible. This increases the chance of one picking up on what the other didn’t see. The court would be required to list both statements before deciding which one it considers the best.
This could have been done in the case of Lucia de B. My dream is that there would have been more focus on the statistics. Judges would be more critical if one expert proclaimed a probability a million times bigger than another expert did. I hope this would have planted a seed of doubt: was it really such an unusual occurrence for someone to be present at so many deaths? Furthermore, with more experts looking over dubious cases, in my dream the reopening of the case could have happened sooner. Many years passed between the first actions of Metta de Noo and the final reopening. The toxicologist quoted by the court is Prof. de Wolff. He missed part of the data and also made a crucial mistake: he proclaimed the wrong tests as right and thereby created proof of a poisoning. If more experts had been asked to testify, this fact might have been brought to light.
My plan focuses on more experts in the courtroom. To put this plan into practice, I would require at least two experts on a topic to provide their opinion to the court. If the court needed an explanation of some topic it was not familiar with, one expert could provide it and the other could check whether information was missing and whether the information provided was correct. Maybe it would help if the court didn’t know which person provided which part of the information, because that is more objective and makes it harder to pick your favourite person. However, I doubt this is viable. Experts need to be questioned in court, and that can’t be done anonymously. Another aspect of my dream for a better integration of experts in court is for the court to include all evidence provided, and especially to explain in detail why it decides to listen to one expert and not the other in the case of contradictory evidence. This is already done to some extent, but I find that hand-picking evidence is still easily accepted in our legal system. Perhaps there could be someone in court who oversees the selection of evidence. I am not familiar enough with the legal system to make a clear plan for this, nor do I know how strict the rules for choosing between pieces of evidence are nowadays.
I find the problem of the ‘ons kent ons cultuur’ (a culture in which everybody knows everybody) harder to solve. It is a natural consequence of studying and working together. Lucia de B.’s conviction also happened in Den Haag, a city that is known for its elite. It would be impossible to prohibit people from mingling with whomever they want. Perhaps it would be possible to pay more attention in court to the fact that people know each other. Right now, it is somewhat of a taboo, or in any case unknown. I had never heard of this problem until I read Metta de Noo’s book. Perhaps confidentiality agreements could be drawn up, specifying that it does not do to discuss cases at the dinner table. The problem is that people will not discuss cases unless they benefit from it, but once they think an acquaintance could help them win their next case, they run to him. Officially prohibiting that won’t work; we need a change in mindset. Once it is socially unacceptable, people won’t use ‘achterkamertjespolitiek’ anymore, and if they try, they’ll be corrected by their environment. This should start at university, perhaps by discussing cases like Lucia’s that went wrong because of it.
The solution I proposed is more experts in the courtroom. However, I am not sure that it would make a significant difference in the fight for better specialist evidence. If a court has decided in advance to convict someone, it can still pick the evidence that suits it best instead of the evidence that is most probably correct. In Lucia’s case, if more statisticians had been heard, the first number of one in 342 million would still have been the number that stuck. And since Lucia was ‘guilty in advance’, even far less extreme numbers would still have counted against her.
Since a court wants to show a united front, I am not sure how effective someone in court would be who constantly questions all the evidence that is chosen. After all, this is supposed to be done by the opposing party. I know clearly what the problem is, but I find it difficult to find and present an effective solution. To test my solution, it would be best to use it in the courtroom and see whether cases would proceed differently. However, the vast majority of cases do not involve miscarriages of justice even with only one expert present, and therefore it would take a while before results could be compared.
My other solution was letting judges explain their reasoning behind picking one piece of evidence over another more clearly. The problem with this solution is that they already do that. In a couple of sentences, they can state for example ‘that this isn’t his exact field of study’ even when it is. This should be checked better, but by whom? Maybe a ‘Revisieraad’ could be of help in such cases, but then again, maybe not. After all, a Revisieraad is proposed for closed cases, when the mistake has already been made. And my dream is fewer mistakes with experts in the courtroom itself, not corrections afterwards.
I described the ‘ons kent ons cultuur’ as something negative, and it should be checked whether legal processes are fairer in a culture without a legal elite. To do this, a comparison could be made between various courts to see whether a less elitist culture has a positive effect on the number of cases that have to be revised later. Conclusions should be drawn with care, as fewer revised cases may also mean a less active Revisieraad instead of a better court. If it were indeed proven that elite culture has an effect on judgements, a change in mindset is very much required. However, I think all courts have somewhat of an internal ‘ons kent ons cultuur’, and a control group as required by Kaye and Freedman is in my opinion hard to find.
To conclude: many lawsuits are not proceeding as they should. Experts aren’t always as qualified as they ought to be, and judges pick the evidence that suits them best or misinterpret the expert. Although science and law are uncomfortable bedfellows, they like to flirt, even when they shouldn’t, but as Rakoff states, ‘twin beds are not an option’. Therefore, we should look into methods to improve on the present situation. The aim of academics should be finding the truth, not winning a case. You should try to disprove yourself, as in Popper’s method; however, in neither science nor law is this the main objective.