Sampling bias
One of the most probable causes was that the samples the polls used were off: they missed a certain demographic and therefore underestimated Trump’s chances. In methodological terms, this was a severe case of sampling bias. The hypothesis that sampling bias was indeed the issue seems plausible given that one of the few pollsters that did come up with an accurate prediction, the RAND Corporation, actually tackled this issue. This corporation has the ability to do something few of us can, but all of us should: they can carry out a proper random sample. What they did was to take the telephone book of the USA and randomly select 6000 telephone numbers. Each number was called, and the person who answered was asked to participate in an ongoing online survey about the election. So, you might say, this is what all pollsters do, right? Now here's the kicker. If for some reason the people called had no internet, RAND would provide internet for them. Although costly, this strategy actually ensured that their sample was closer to a proper random sample than those of other pollsters. In addition, using an online survey as opposed to a telephone survey eliminated another source of bias, namely social desirability: “I’m voting for Trump but I’m not going to tell you that over the phone.” RAND’s random sample indicated a clear win for Trump: they captured the responses of a specific subgroup of people who would normally have been left out of the sample, but who overwhelmingly voted for Trump.
Include uncertainty
Having the means to provide internet for your participants if they don’t have it is of course a luxury. So if we don’t have the means to take a random sample in this way, how can we get around this? One option is to incorporate uncertainty into your prediction model. This is the approach of the FiveThirtyEight model. FiveThirtyEight uses a complex model to aggregate and weigh the information from almost all national polls conducted in the USA. But in addition to the outcomes of the polls, it also incorporates information on the quality of these polls: How well did the polls predict previous elections, and what was the sampling strategy of the poll: a telephone or an internet survey? The model even incorporates information on the correlation between states; if one state votes for Clinton, a demographically similar state might also vote the same way. All in all, the most beautiful thing about this model is that it includes uncertainty: it incorporates the fact that the outcome of a poll might be wrong. In statistical jargon we have an errors-in-variables model. FiveThirtyEight runs the model 10,000 times, and for each run it counts the number of electoral votes predicted for each candidate, ending up with a probability that a candidate will win. Eventually the model predicted a 71.4% chance of a Clinton win. But before you start yelling that this prediction was wrong, the confidence intervals of the predicted outcomes show that a Trump win was well within limits. The model was not wrong; real life just caught up with it, and something that had a 28.6% chance of happening (which is, after all, a reasonable chance) did indeed happen.
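To make the idea of "running the model many times" concrete, here is a minimal sketch of such a simulation. It is not FiveThirtyEight's actual model: the five states, their electoral votes, the polled leads, and the sizes of the polling errors are all made-up assumptions. The one feature it borrows from the description above is a shared polling error that hits every state at once, which is what makes state outcomes correlated.

```python
import random

# Hypothetical toy inputs, NOT real 2016 data:
# state name -> (electoral votes, polled lead of candidate A in percentage points)
STATES = {
    "State 1": (20, 4.0),
    "State 2": (29, 1.5),
    "State 3": (16, -2.0),
    "State 4": (10, 3.0),
    "State 5": (18, -0.5),
}
VOTES_TO_WIN = sum(ev for ev, _ in STATES.values()) // 2 + 1


def simulate_once(rng, national_sd=3.0, state_sd=3.0):
    """Return candidate A's electoral votes in one simulated election."""
    # One polling error shared by all states (assumed size): if the polls are
    # off in one state, they tend to be off in the same direction elsewhere.
    national_error = rng.gauss(0.0, national_sd)
    votes = 0
    for electoral_votes, polled_lead in STATES.values():
        # Each state also gets its own independent polling error.
        simulated_lead = polled_lead + national_error + rng.gauss(0.0, state_sd)
        if simulated_lead > 0:
            votes += electoral_votes
    return votes


def win_probability(n_runs=10_000, seed=1):
    """Share of simulated elections in which candidate A reaches a majority."""
    rng = random.Random(seed)
    wins = sum(simulate_once(rng) >= VOTES_TO_WIN for _ in range(n_runs))
    return wins / n_runs


p_win = win_probability()
print(f"Candidate A wins in {p_win:.1%} of simulated elections")
```

The output is a probability rather than a verdict: a candidate who wins in, say, 70% of the runs still loses in the remaining 30%.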
Upcoming Dutch election
Given the unexpected, but in hindsight not so unexpected, outcomes of the last ‘big’ votes, the Brexit referendum and the U.S. elections, I’m really curious, and maybe a bit anxious (with sampling bias in mind, allowing people to directly vote online seems like a very bad idea), how the upcoming Dutch election will turn out. Hopefully the pollsters will be a bit more open about the uncertainty of their predictions. But until then: I’ll trust the polls; I just won’t trust their sample.
The columns of Prof. Ionica Smeets are always a joy to read. Last week, she wrote about the time we spend waiting when we go to the bathroom or the supermarket (Sir Edmund Volkskrant, August 27). She wrote that it would be more efficient if supermarkets combined all the lines into one. This would lower the average waiting time a lot, because nobody would be ‘trapped’ behind someone who was laboriously counting out the dimes for his groceries.
The total waiting time, and therefore the average waiting time remains equal
I found this a marvelous idea! My rebel-communist-with-a-fondness-for-efficiency within was getting enthusiastic, already looking forward to the next visit to the supermarket, where it would call out: “Waiting people of the world, unite! Unite into a single line and everyone will be home earlier!” Unfortunately, after some thinking, I realized that this may be incorrect, as long as no register ever stands available but unused by customers. In practice, people will exit their line whenever another register becomes available. Unless this line switching takes up a substantial amount of time, the total waiting time, and therefore the average waiting time, remains equal if everyone joins a single line. The thing that does get reduced is the variation in waiting time: although there will be fewer customers (or bathroom users) with an extremely long waiting time, there will also be fewer with an extremely short waiting time.
Dice have no memory of previous throws
This was disappointing, but the column also provided me with intellectual consolation: it reminded me of my favorite demonic probability distribution: the exponential, or waiting-time, distribution. I call it demonic because it is a memoryless distribution, which often seems unfair: earlier outcomes do not influence future outcomes. For example, if I have thrown a die ten times without throwing a six, the probability that my next throw will be a six is exactly equal to the probability of throwing a six on the first throw. This is counterintuitive, or maybe almost unfair: if I have not thrown a six for a while, I might think the probability of getting a six on the next throw increases, but it does not. Dice have no memory of previous throws.
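The dice claim can be checked with exact arithmetic. The sketch below assumes a fair six-sided die; the number of throws until the first six then follows a geometric distribution, the discrete counterpart of the exponential distribution. The function names are mine, chosen for illustration.

```python
from fractions import Fraction

P_SIX = Fraction(1, 6)  # assumption: a fair six-sided die


def prob_first_six_on_throw(n):
    """P(the first six appears exactly on throw n): n - 1 misses, then a six."""
    return (1 - P_SIX) ** (n - 1) * P_SIX


def prob_no_six_in(n):
    """P(no six at all in the first n throws)."""
    return (1 - P_SIX) ** n


# P(six on throw 11 | no six in throws 1 to 10), by the definition of
# conditional probability.
conditional = prob_first_six_on_throw(11) / prob_no_six_in(10)

print(conditional)                 # 1/6
print(prob_first_six_on_throw(1))  # 1/6
```

The ten earlier misses cancel out of the fraction completely, which is exactly what "no memory" means.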
Sometimes you win, sometimes you lose
However, I remember talking to a friend who was convinced that dice and slot machines and things you find in a casino do have memory. He used to brag about earning his rent in the casino every month. I had a hard time believing this and told him: “But the house always wins, right?” He told me: “No, your luck always changes, right? Sometimes you win, sometimes you lose. After winning for a while, you are bound to start losing at some point. And after losing for a while, you are bound to start winning at some point. So when I go to the casino, as long as I have lost money that day, I keep on playing. And I only quit when I have more money than I brought with me.” Of course, it is not entirely impossible that with this strategy, he has won more money than he has lost. But it is not very probable either. The reality is that in casinos the house almost always wins, and customers only very rarely.
The slot machine has no memory of your losses
Of course, I cannot be sure if my friend was lying. Maybe he has made money in the casino, overall. But not because his strategy is a winning one. If you lose ten times in a row at a slot machine, the probability of losing the next ten times does not decrease, it remains exactly the same. So, if you are losing money at the casino, you should not be thinking: great, I will soon be winning! The slot machine has no memory of your losses. It only ‘knows’ what your probability of winning is, which is obviously very low because you have been losing. You should get the *beep* out of there!
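My friend's strategy ("keep playing while I am behind, quit as soon as I am ahead") is easy to simulate. The sketch below is only an illustration under assumptions I made up: every bet is one unit, the chance of winning a bet is 45%, and he brings 100 units to the casino. None of these numbers come from a real slot machine.

```python
import random


def play_session(rng, bankroll=100, p_win=0.45, max_bets=100_000):
    """Play one casino visit and return the net result in units.

    One unit is bet per round. The session stops as soon as the player is
    ahead, has lost the whole bankroll, or has placed max_bets bets.
    """
    net = 0
    for _ in range(max_bets):
        if net > 0 or net <= -bankroll:
            break
        # The machine has no memory: p_win is the same on every single bet.
        net += 1 if rng.random() < p_win else -1
    return net


rng = random.Random(2016)
results = [play_session(rng) for _ in range(2_000)]

share_ahead = sum(r > 0 for r in results) / len(results)
average_net = sum(results) / len(results)

print(f"sessions ending ahead: {share_ahead:.0%}")
print(f"average net result per session: {average_net:.1f} units")
```

Under these assumptions most visits do end with a small win, which may be why the strategy feels like it works. The occasional visit in which the whole bankroll is lost outweighs all those small wins, so the average result per visit is negative.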
In the first two weeks of March 2016, three newspapers reported on 'the Curse of London': the strikingly high death rate among athletes who participated in the 2012 Olympic Summer Games in London. The newspapers reported that, of the 10,500 athletes who participated, 18 had died since the Summer Games, and that such a death rate was strikingly high. I thought it was striking, too. That is, the newspaper report, not the number, because the newspapers did not report what a non-striking, or 'normal', death rate among Olympians would be. Does 18 deaths out of 10,500 persons constitute an outrageous death rate, I wondered. I gathered some additional statistics and performed some calculations to test the hypothesis that 18 deaths is indeed a striking number.
The 2012 Olympic Summer Games took place from July 27 until August 12, so 3 years and 7 months have passed since. According to estimates from the US National Vital Statistics System, the average yearly death rates in the US per year, per age group, per 100,000 persons are as follows:
For the purpose of my calculations, I am going to make some assumptions. The numbers above are from 2007, and I am assuming that they are still representative. Also, Olympic athletes may not be the same as the average American, so I use the death rate among 15 to 24 year olds. The athletes of the Summer Games of 2012 were and are probably somewhat older than that, on average, but Olympic athletes may have a somewhat lower death rate than average US citizens of the same age.
Based on these statistics, we would expect that among 10,500 persons, in a period of 3 years and 7 months, the number of deaths, on average, would be (79.9/100,000) × 10,500 × (3 + 7/12) = 30.06 deaths.
Given what would be the ‘normal’ (average) rate, the hypothesis that the number of deaths among the Olympic athletes is a striking number can be statistically tested. To do that, we have to calculate the probability that this, or an even more extreme number of deaths occurs, over this period of time, given the average death rate. If the probability is below a preselected threshold (e.g., 0.05*), we reject the null hypothesis that the number of deaths is equal to the average, and we accept the alternative hypothesis: that the number of deaths is indeed striking.
Death rate distributions are commonly modeled with the Poisson distribution (Zocchetti & Consonni, 1993). Using the Poisson distribution, we can calculate the probability of an event happening a specific number of times within a specific time interval, given the population average. With a population average of 30.06, the probability of observing 18 or fewer deaths is equal to 0.01. The probability of observing 18 or more deaths is 0.99. The probability of observing 18 deaths or fewer is obviously very low, even below the threshold. So, I reject the null hypothesis, and accept the alternative hypothesis that 18 is a striking number: a strikingly low number! I hereby refute the title of the newspaper report. I suggest that from now on we refer to 'The Blessing of London' instead.
* I am going to perform a two-sided test, because I am not sure if the observed number of deaths among the Olympic athletes is greater or smaller than the average. Therefore, I have to divide the threshold for the probability by 2, and use 0.025 as the new threshold.
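For readers who want to redo the calculation, here is a small sketch of the expected number of deaths and the two Poisson tail probabilities. It takes the figures from the text as given (79.9 deaths per 100,000 persons per year, 10,500 athletes, 3 years and 7 months, 18 observed deaths); the helper functions are written out by hand so that nothing beyond the standard library is needed.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam (computed via logs)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson distribution with mean lam."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


rate_per_person_year = 79.9 / 100_000  # death rate used in the text
athletes = 10_500
years = 3 + 7 / 12                     # 3 years and 7 months
expected_deaths = rate_per_person_year * athletes * years

observed = 18
p_at_most = poisson_cdf(observed, expected_deaths)           # P(X <= 18)
p_at_least = 1 - poisson_cdf(observed - 1, expected_deaths)  # P(X >= 18)

print(f"expected deaths: {expected_deaths:.2f}")
print(f"P(18 or fewer):  {p_at_most:.3f}")
print(f"P(18 or more):   {p_at_least:.3f}")
print("below the two-sided threshold of 0.025:", p_at_most < 0.025)
```

Note that the two probabilities add up to slightly more than 1, because the outcome of exactly 18 deaths is counted in both.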
OK, I’ll admit it: I like models. I especially like models because they are predictable and simple. And I like my models to be wrong. This may sound counterintuitive, but when making predictions a model that is wrong is often better. Let me show you why.
Suppose we set out to discover the relationship between age and a certain cognitive ability. We conduct an experiment in which participants of different ages perform a cognitive task measuring this ability. The data, age on the x-axis and ability on the y-axis, are plotted in Figure 1. The gray Xs are the observed data; the red circles show the actual shape of the relationship. Our task is thus to discover the relationship between ability and age. Or, more formally, our task is to find the function f that relates age and ability:
ability = f(age) + ε    (1)
In this equation, f is the true function that defines the shape of this relationship (red circles), and ε is the error term (differences between red circles and gray Xs). Error in this case could come from many sources. It could be measurement error due to the task. It could be error due to participants’ characteristics (e.g. mood, concentration). It could even be error from another confounding variable that influences the relationship (e.g. people with higher intelligence score higher on our ability task). Whatever the source of the error, we don’t know how big this error is: we only know the gray Xs. That is why we call ε irreducible error.
So the data we observed contain the true relationship and some unidentifiable error. How can we now best go about finding this relationship? The most common way is to use a statistical model, like this one:
predicted ability = f̂(age)    (2)
In this formula, f̂ is our prediction for the shape of the relationship and predicted ability contains the estimated ability scores. But how do we decide what shape of f̂ is our best guess for the true relationship? The easiest way to do this is to fit different models and see which models make the best predictions of ability. This is shown in Figures 2a through e. The first three Figures (2a-c) show a linear model, a quadratic model, and a cubic model. With each step in the model we use a more complex function for f̂ (with more parameters), and with each step we see that the fit of the model increases (the line fits the observed points better).
We could even go a step further and fit even more complex models. Figures 3a-b show two LOESS models that use a more complex shape function, the last model being the most complex of all. We see that fit goes up again, until we have even completely reduced the error: the model fits perfectly!
If you were in charge of analysis and if I were to ask you which model you would choose, my guess is that you wouldn’t opt for the LOESS models. You would probably choose one of the first three models. But why? Clearly, these models fit less well than our LOESS model, so why are we still opting for a model that contains more error? Your answer may have something to do with interpretability. While interpretability is a valid reason (we have to write down a description of the shape somewhere, after all), there is another reason why we (should) opt for a more ‘wrong’ model.
The reason has to do with bias and variance. To understand bias and variance we have to look at what happens when we fit increasingly complex models. With increased model complexity our error in predicting the data decreases. With this decrease in error we also decrease bias. Bias is defined as the difference between the shape of the true relationship f and the relationship we estimate from the data. So the more closely a model follows the data points, the closer we are (on average) to the true shape, and the smaller our bias is.
But what happens to our irreducible errors? As the complexity of the model increases, our estimated shape relies more and more on these errors. To see this, imagine what would happen to our estimates of model 2a if we did the experiment again and fitted this model to these new data. Would our estimates differ? They probably would to some degree, but not very much. This is shown in Figure 4a. Our new model estimates (black line) do not differ much from the estimates in our original model (green dotted line). Now imagine what would happen to the estimates of the most complex model (3b). They would probably differ to a greater extent in the new dataset, as these estimates will closely follow our new data points. This is shown in Figure 4b. The estimates from our new dataset (black line) will differ substantially from our original estimates (green line). Actually, with each new dataset, predictions from a complex model will vary to a greater extent than predictions from a simple model. In other words, with increased model complexity comes increased variance.
Figure 5 shows the relationship between bias and variance. On the x-axis is model complexity: left are simple models, right are more complex models. The y-axis indicates the total error, or how ‘wrong’ we are overall. As we can see, when we have a simple model, we have a high total error, containing mostly bias and little variance. Estimates are biased but stable across new datasets. If we opt for more complex models our total error will consist more of variance than of bias. Our predictions are less biased, but will vary substantially across new datasets.
This leads us back to the question of why being wrong is better than being right. If we want to make predictions that generalize across new datasets, we are better off choosing a simple model. Complex models are less wrong on the data at hand, but their predictions will vary substantially across new datasets. So if you want your predictions to be right: “let thy model be wrong.”
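The trade-off described above can be tried out in a small simulation. Everything in this sketch is an assumption made for illustration: the true function, the noise level, and the ages are invented, and instead of the LOESS models from the figures the "most complex" model is simply one that reproduces every observation exactly (a perfect fit, like the last LOESS model). The simple model is a straight line fitted by least squares.

```python
import random
import statistics

AGES = list(range(20, 81, 5))  # hypothetical ages at which ability is measured


def true_f(age):
    """Made-up 'true' relationship between age and ability (mildly curved)."""
    return 60 - 0.005 * (age - 45) ** 2


def fit_line(xs, ys):
    """Simple model: least-squares straight line; returns predictions at xs."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x for x in xs]


def fit_exactly(xs, ys):
    """Most complex model: follows every data point, so zero error on the data."""
    return list(ys)


def bias_and_variance(fit, n_datasets=500, noise_sd=5.0, seed=4):
    """Repeat the experiment n_datasets times; return (squared bias, variance),
    both averaged over the ages."""
    rng = random.Random(seed)
    truth = [true_f(age) for age in AGES]
    predictions = []
    for _ in range(n_datasets):
        ys = [t + rng.gauss(0.0, noise_sd) for t in truth]  # a new noisy dataset
        predictions.append(fit(AGES, ys))
    per_age = list(zip(*predictions))  # all predictions made for each age
    bias_sq = statistics.fmean(
        (statistics.fmean(p) - t) ** 2 for p, t in zip(per_age, truth))
    variance = statistics.fmean(statistics.pvariance(p) for p in per_age)
    return bias_sq, variance


line_bias_sq, line_var = bias_and_variance(fit_line)
exact_bias_sq, exact_var = bias_and_variance(fit_exactly)

print(f"straight line: squared bias {line_bias_sq:.2f}, variance {line_var:.2f}")
print(f"perfect fit:   squared bias {exact_bias_sq:.2f}, variance {exact_var:.2f}")
```

With these made-up settings the straight line is the more biased model, but its predictions barely move from one dataset to the next, and its squared bias plus variance comes out lower than that of the perfect fit. With a much more strongly curved true function or much less noise, the balance could tip the other way.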
When we make decisions in the real world, we are faced with limitations to the information, cognitive resources, and time at our disposal. This is called 'bounded rationality', and as psychologists, we have to find a way to deal with those limitations when we make decisions about clients or patients. At the same time, professional standards require psychologists to take an evidence-based approach: the accuracy of assessments and efficacy of treatments provided should be supported by empirical research. Perhaps even more importantly, reviews and meta-analyses on the accuracy of clinical judgment ('clinical prediction') and empirically derived formulas ('statistical prediction') have shown the latter to be more accurate than the former. So ideally, for psychologists working in practice, empirical studies should supply them with statistical prediction rules that can be easily evaluated.
However, in most empirical studies in psychology, data are analyzed using generalized linear models, or GLMs. With these models, we can predict the value of a variable by adding up the contributions of other predictor variables (often called 'risk' or 'protective' factors). GLMs are powerful models, as they are conceptually simple and have desirable properties in terms of stability and accuracy; this may explain their popularity among social scientists.
As an example of a GLM, let’s take an analysis performed in a paper by Penninx and colleagues, who studied factors that predict whether patients who currently have an anxiety or depressive disorder will still have such a disorder after two years. The researchers found seven variables or factors that predicted the presence of a disorder; these are shown in the table below.
This model offers us some important insights into risk and protective factors for developing a chronic depression or anxiety disorder. However, if we want to use the model to assess a new patient’s risk of developing a chronic disorder, we have to assess the value of all seven variables and calculate a weighted sum of their values. This may require too much time and resources, especially for a psychologist working in clinical practice, where both these commodities are scarce.
Researchers like Gigerenzer and Katsikopoulos have therefore suggested using so-called 'fast and frugal trees' for statistical prediction in clinical practice. A fast and frugal tree is a very simple decision tree, consisting of only one branch. At every level of the tree, the value of only one variable is assessed; on the basis of that value, either the tree is exited and a final decision made, or the value of the next variable in the tree is assessed. An example of such a tree is shown below. It can be used by doctors for deciding whether to prescribe macrolide antibiotics to children with an infection.
These fast and frugal trees seem very helpful for performing statistical predictions when time, information, and resources are limited. However, the trees have to be derived in such a way that they provide accurate decisions, and that the variables are assessed in the most efficient way.
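The structure of a fast and frugal tree, as described above, can be sketched in a few lines of code. The tree below is purely hypothetical: the cue names, cut-off values, and decisions are placeholders made up to show the mechanics, not the antibiotics tree or any tree from the literature.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Level:
    """One level of a fast and frugal tree: a single cue and its exit(s)."""
    cue: str                        # name of the one variable assessed here
    test: Callable[[float], bool]   # yes/no question about that variable
    exit_if_true: Optional[str]     # final decision if the answer is yes (None: go on)
    exit_if_false: Optional[str]    # final decision if the answer is no (None: go on)


# Hypothetical tree with placeholder cues. Every level has one exit, except
# the last level, which has two.
TREE = [
    Level("cue_a", lambda v: v > 10, "high risk", None),
    Level("cue_b", lambda v: v < 3, "low risk", None),
    Level("cue_c", lambda v: v >= 1, "high risk", "low risk"),
]


def decide(tree, case):
    """Walk down the tree; return (decision, number of cues that were assessed)."""
    for cues_used, level in enumerate(tree, start=1):
        answer = level.test(case[level.cue])  # only this one cue is looked up
        decision = level.exit_if_true if answer else level.exit_if_false
        if decision is not None:
            return decision, cues_used
    raise ValueError("the last level of the tree must have two exits")


# For this case only the first cue needs to be assessed at all.
print(decide(TREE, {"cue_a": 12}))
# For this case all three cues are needed.
print(decide(TREE, {"cue_a": 5, "cue_b": 4, "cue_c": 0}))
```

Because a cue is looked up only when its level is reached, a case can often be decided without measuring the later variables at all, which is where the "frugal" comes from.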
This is where rule-based methods may come in handy. Rule-based methods are a relatively new data-analytic tool developed in the field of statistics and data mining. Friedman and Popescu have developed one of the most promising rule-based methods: the RuleFit algorithm. This algorithm derives a so-called prediction rule ensemble, using exactly the same data as are used to fit GLMs. However, the prediction rules in the ensemble can be represented as fast and frugal trees, which may be easier to use in practice than GLMs.
In a paper we published this year, we showed an example of a prediction rule ensemble for predicting the presence of a depressive or anxiety disorder, by applying the RuleFit algorithm to the same data as the original study by Penninx and colleagues. We found a prediction rule ensemble of two simple rules, providing decisions whose accuracy was comparable to that of the original GLM. Below, the two rules in the ensemble are depicted as fast and frugal trees. On average, evaluation of the rules in the ensemble required assessing the value of only three variables, whereas using the GLM for prediction would require assessing the value of seven variables in the table above.
Therefore, we concluded that rule-based methods, and the RuleFit algorithm in particular, are promising methods for creating decision-making tools that are simple and easy to use in psychological practice. In future research, we will work on further improving the applicability, accuracy, and ease of use of rule-based methods.
Many negative consequences followed the discovery of the massive fraud by social psychologist Diederik Stapel in 2011, including damage to the reputation of science in general and psychology in particular. One of the few positive outcomes of his fraud is that it has brought to light the importance of replication in science.
A recent failed replication of a high-impact result in social psychology highlights this problem.
A key finding in social psychology is the phenomenon of unconscious priming. Accumulated evidence has shown that behavior can be unconsciously influenced (primed) by a previous stimulus that activates certain stereotypes, personality traits, or other concepts. A famous example, often included in textbooks, is the study by Dijksterhuis et al. (1998). In this experiment the researchers instructed participants to take a few minutes to write down characteristics of either a typical professor or a typical soccer hooligan. Subsequently, they asked the participants to fill out a general knowledge questionnaire. Participants who wrote about professors scored significantly higher on this test than participants who wrote about soccer hooligans.
These findings are quite spectacular: What if students could improve exam scores simply by thinking about professors for a little while? However, the results also come across as a bit surprising and counterintuitive. This led Shanks et al. (2013) to do a direct replication of this study to further investigate if the findings were, in fact, real. In nine different experiments, using largely the same methods as described by Dijksterhuis et al., they found no evidence that priming with the concept of a professor increases test scores.
There are several explanations for this replication failure, which are given in the Discussion section of the article. These are:
According to Shanks et al., the latter option is the most likely explanation. There are several possible causes of false positives in research, of which arguably the most important are post hoc selection of data or analyses, and publication bias (and of course, fraud as in the case of Diederik Stapel). Publication bias is the phenomenon that it is much easier for a researcher to get ‘sexy’ results published than a boring replication paper. As a result, negative replication attempts often disappear into a scientist’s file drawer.
Thankfully, and partly because of Stapel’s fraud, replication has received a lot more attention lately. For psychologists there is now a virtual file drawer where they can share unpublished replication attempts. Additionally, there is now a journal publishing ‘Registered Replication Reports’. The idea is that several research groups independently replicate an important finding, using preregistered methods. The results are published regardless of the outcome.
In my opinion, this is an important step forwards that helps ensure that scientific results, particularly those from the field of psychology, retain their credibility.
A proper way to uncover fraudulent, flawed, and chance findings is replication. Basically, a replication is nothing more than copying an earlier study, as exactly as possible, to see whether the results come close to those of the original paper. Positive replications, in general, add to the credibility of a finding. Negative replications, in contrast, can indicate fraudulent or flawed studies or behavior, but can also reflect mundane issues such as the use of an underpowered sample. The American psychologist Henry Roediger captured the importance of replication in a short and nowadays famous sentence: one replication is worth at least a thousand t-tests.
Recently, some colleagues and I also entered the arena of replication. To be honest, this arena is not as sexy or exciting as the one in which the initial research findings are generated. So, what then was our drive? Well, we noticed inconsistencies with regard to a research finding that can be regarded as a pillar for the theory in our research niche. We decided that this inconsistency needed to be solved before going a step further. The inconsistency concerns the relation between a variant, val^{66}met, on the gene that codes for BDNF, a neuronal growth factor, and the volume of the hippocampus. Biologically, there are pretty good reasons why such an association could exist. However, no theory here; we are talking methodology today. Besides, it is complicated and may bore you at least a little.
What did we do? We performed a single (replication) study and, beyond that, ran a meta-analysis on the subject matter. Performing a meta-analysis (i.e., an analysis that estimates the strength and the direction of a presumed effect over studies) is a very useful thing to do when faced with non-uniform findings. We included a total of 25 studies (on no fewer than 3,620 individuals) in our analysis and found the effect we were looking for: carriers of a met allele (each individual has either two val alleles on this locus or one or two met alleles) had smaller hippocampal volume. This effect, however, was so small that we were not really impressed by it. What did impress us, though, was a striking correlation between the year in which a paper was published and the magnitude of the effect that the paper reported (see the accompanying figure for a cumulative meta-analysis that shows that the effect sizes steadily decline over time). Our first tentative conclusion from this was that the effect of interest seems to be hard to replicate. Strengthening this tentative conclusion was our finding that small-scale studies, which are likely to be the least reliable, estimated the effect to be very large, whereas the big, and probably more precise, studies estimated the effect to be nonexistent. Altogether we concluded, with a broken heart and tears in our eyes, that the relation between val^{66}met and hippocampal volume probably does not exist. Rather, the presumed association reflects a winner's curse in which early studies report large effects (and get published in top-notch journals), to be followed by increasingly smaller effects in subsequent (and better powered) replication attempts (that get published in lower-ranking journals or, worse, not at all: i.e., the file drawer problem).
Indeed, one replication is worth at least a thousand t-tests, but I would like to add that a couple of replications (let's say about twenty) and well-powered studies may be worth at least a million t-tests.
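To show what a cumulative meta-analysis does, here is a minimal sketch. The five studies below are invented for illustration (they are not the 25 studies from our paper), and the pooling uses simple fixed-effect inverse-variance weights, which is an assumption of this sketch rather than a description of the model we actually used. Each study is given as a publication year, an effect size, and its standard error; a negative effect stands for a smaller volume in met carriers.

```python
# Hypothetical studies: (publication year, effect size, standard error).
# Early studies are small (large standard error) and report large effects;
# later studies are bigger and report effects close to zero.
STUDIES = [
    (2003, -0.60, 0.30),
    (2005, -0.45, 0.25),
    (2007, -0.20, 0.15),
    (2009, -0.10, 0.10),
    (2011, -0.02, 0.06),
]


def pooled_effect(studies):
    """Fixed-effect inverse-variance pooling: precise studies get more weight."""
    weights = [1 / se ** 2 for _, _, se in studies]
    estimate = (sum(w * effect for w, (_, effect, _) in zip(weights, studies))
                / sum(weights))
    standard_error = (1 / sum(weights)) ** 0.5
    return estimate, standard_error


def cumulative_meta_analysis(studies):
    """Pool the studies again each time a new one is published."""
    ordered = sorted(studies)  # tuples sort by year first
    return [(ordered[i][0], *pooled_effect(ordered[: i + 1]))
            for i in range(len(ordered))]


cumulative = cumulative_meta_analysis(STUDIES)
for year, estimate, standard_error in cumulative:
    print(f"up to {year}: pooled effect {estimate:+.2f} (SE {standard_error:.2f})")
```

In this made-up example the pooled effect shrinks towards zero every time a larger, more precise study is added, which is the pattern one would expect to see under a winner's curse.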

Read more: 'A systematic review and meta-analysis on the association between BDNF val66met and hippocampal volume—A genuine effect or a winners curse?' (2012).
One of the goals (and some would say the most important goal) of psychology is to study the contents of the human mind. Normally, getting to know the contents of the mind requires interrogating people: what did you see? how do you feel? what did you think? While this may seem a straightforward way to investigate the mind, it is riddled with problems.
The problem we face when asking people what they think or believe is that people cannot always accurately describe such content. You may know the feeling that you have a hunch, or suspicion, that something is the case, but are not able to express it in words. And maybe you are even completely unaware of some of the knowledge that you have.
Researchers wanting to find out how conscious and unconscious knowledge are used prepared a card game in which participants could pick cards from one of four decks. Most of the time picking a card would result in winning money, but some cards would result in a penalty where the participant would lose money. However, the researchers rigged the game so that two of the decks would provide net winnings, whereas the other two would result in net losses.
As people were playing the game, they slowly started to avoid the two ‘bad’ decks. However, when they were asked if they understood the nature of the game most people could not say why they behaved the way they did, and instead reported having a ‘hunch’ or a ‘gut feeling’. It was not until the 80th card that they could explicitly identify the good and bad decks. Clearly, people have knowledge without being able to report it.
Apparently, by looking at behavior we can see evidence of unconscious knowledge. Some researchers have proposed yet other ways to study the unconscious mind. One of these methods assumes that offering people money makes them more willing to use their knowledge. This is how it works: you force people to make a choice and ask them to wager money on the correctness of their choice, a method known as post-decision wagering. If they are correct, they win money; if not, they lose money.
We decided to put this method to the test, and flashed numbers on a computer screen for a split second. Then, we asked one group of people to wager money on whether or not they saw the numbers. Also, we asked another group of people to just indicate how clearly they saw these numbers, without betting any money. It turns out that people were more accurate in indicating that they saw these numbers if you asked them to wager money on it! Apparently, they have some knowledge that they only use when they can earn money. Maybe it’s time for universities to start paying students for passing exams...
Read more about this research in the publication 'Consciousness of targets during the attentional blink: A gradual or all-or-none dimension?' by Sander Nieuwenhuis and Roy de Kleijn, published in 2011 in Attention, Perception, & Psychophysics, 73, 364–373.
Last weekend NRC, a Dutch newspaper, published an article in bold face on page 3 with the heading ‘Dutch wounded through skiing: a remarkable increase of 14 percent’. Looking for explanations, the newspaper mentions that this change cannot be explained by different snow conditions or by an increase in the number of people having a ski holiday, and concludes that it must be due to personal factors such as taking higher risks and less thorough preparation.
When counting the number of accidents each year we cannot expect this year’s count to equal next year’s. There will always be some random fluctuations. The question arises which fluctuations are merely random and which are systematic. To answer such questions we use statistics: we assume a sampling distribution for the data we collect. In psychological research the normal distribution for continuous variables or the binomial distribution for dichotomous variables is often used. For counts of events in a given time frame, such as the number of accidents in a year, the most natural distribution is the Poisson distribution. The Poisson distribution has a single parameter, λ, which represents both the mean and the variance. That is, a higher mean implies a higher variance. According to the NRC article, last year there were about 700 injuries, this year about 800, so the most likely estimate of λ would be 750. Is the change from 700 to 800 really a remarkable jump or just random sampling from a probability distribution? With the R program it is simple to draw random numbers from a given distribution. Drawing five numbers at random from a Poisson distribution with λ = 750, I obtain 783, 738, 756, 722, and 813.
From this simple sequence of random numbers we can conclude that it is not really strange to observe 722 accidents one year, and 813 a year later; a change similar to the one presented in the newspaper. Such a change can already be expected purely based on chance. In order to obtain a better overview I sampled 10,000 observations from the Poisson distribution, which resulted in the following histogram.
The minimum in this histogram is about 650, while the maximum is about 850. So, it all seems a big fuss instead of a real change. Moreover, the conclusion that we are taking higher risks does not seem to have any foundation at all.
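The sampling described above was done in R; for readers who would like to try something similar, here is a rough equivalent in Python using only the standard library. It is a sketch, not the original analysis: the seed is arbitrary, so the numbers drawn will differ from the ones quoted above, and the distribution is cut off at 1,500 (twice the mean), beyond which the probability is negligible.

```python
import math
import random
from itertools import accumulate

LAM = 750                # estimated mean number of injuries per year
SUPPORT = range(0, 1501)  # possible counts; the mass beyond 1,500 is negligible

# Poisson probabilities P(X = k), computed via logarithms to avoid overflow.
pmf = [math.exp(-LAM + k * math.log(LAM) - math.lgamma(k + 1)) for k in SUPPORT]
cumulative = list(accumulate(pmf))

rng = random.Random(2012)  # arbitrary seed, only to make the run repeatable

# Five random yearly counts, then 10,000 of them for an overview.
five = rng.choices(SUPPORT, cum_weights=cumulative, k=5)
many = rng.choices(SUPPORT, cum_weights=cumulative, k=10_000)

print("five draws:", five)
print("mean of 10,000 draws:", sum(many) / len(many))
print("smallest and largest draw:", min(many), max(many))
```

The smallest and largest of the 10,000 draws give a feel for how far yearly counts can drift apart by chance alone when nothing about the underlying risk has changed.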
Recent incidents in psychology have made me think about the state of the art of psychological research. The quality of psychological research as currently published in APA and related journals is questionable. One of the main reasons in my opinion is that from current papers it is impossible to tell whether the conclusions are derived from predefined hypotheses and honest statistical analysis, or from data dredging, i.e., trying everything to get a significant result out of your data. I think it is time for psychologists to decide on a new publication model that would certainly alleviate this problem.
The practice in current psychological research is that researchers first do their investigation, and only afterwards write a report about the results and try to sell that report to a journal. With this practice we can never be sure whether the results as published derive from predefined hypotheses and preset data analysis schemes, or from data dredging. Therefore, the quality of psychological research as currently published in our main journals is questionable. To make psychological research more reliable I think the current publication model should be trashed and a new one adopted. Here are my thoughts in a few steps: