Kevin Schofield's writings, observations, and other pointless distractions
Over the weekend something that Facebook did exploded all over the Internet, newspapers, and TV news. Over the past two days I’ve spent a bunch of time reading up on the issue and gathering the available information, so I could take a run at explaining the issues involved, why people are upset, and the heart of the issue. There are a lot of strong feelings, and a handful of red herrings, so this is going to take some serious unpacking.
Web services like Facebook, Twitter, Google, Bing, Yahoo, and Yelp all collect and analyze data on how users use their services, nominally so they can learn what works and what doesn’t and make ongoing improvements.
Generally speaking, there are two classes of activities that fall under the rubric of “collect and analyze data”:
1. Collect data on how people use your current production code, and after the fact look for patterns and correlations.
2. Run “A/B tests”, essentially two different versions of your system where “A” is the current version and “B” has some specific modification, users are randomly given one or the other, and the results are compared to see which one works better for a specific goal.
There are actually some other classes of activities (and new ones emerging quickly) but these two are by far the most common, and we can safely ignore the others for our purposes here.
The main difference between these two is that in case 1, everyone sees the same thing and all you can learn is what’s working, and not working, with your current service. Case 2 is where you get to find out if some alternative would be better than the current system. The approach is a very direct outgrowth of classic scientific experimental design, with a “control” group and an “experimental” group. This approach is still widely used not only in the hard sciences (physics, chemistry, biology), but also in computer science and many social sciences, most notably psychology. Those disciplines have honed their methodologies over time, and continue to adapt them as technologies evolve.
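To make case 2 concrete, here is a minimal, hypothetical sketch of how a service might randomly but consistently assign users to the “A” and “B” groups. The hash-based bucketing scheme and the function name are my own illustration, not any particular company’s system:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, b_fraction: float = 0.5) -> str:
    """Deterministically bucket a user into "A" (control) or "B" (treatment).

    Hashing (experiment, user_id) gives a stable, pseudo-random assignment:
    the same user always sees the same variant for a given experiment,
    and different experiments bucket users independently.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "B" if bucket < b_fraction else "A"

# Example: expose a hypothetical "new_feed_ranker" change to ~10% of users.
variant = assign_variant("user_12345", "new_feed_ranker", b_fraction=0.10)
```

The key property is determinism: a user who refreshes the page stays in the same group for the duration of the test, so their aggregate behavior can be cleanly compared across the two groups.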
Even within the computing industry, A/B testing is not a particularly new phenomenon. The “human-computer interaction” or HCI community has used similar methods of testing new technologies and designs since the early 1980’s. And even before then, the military branches and NASA employed specialists to test systems like cockpit and capsule controls in order to understand the limits to which personnel could faithfully control aircraft, spacecraft, tanks, and weapon systems. What is new today is that, with the emergence of large-scale Internet services, it’s possible to run tests on much larger populations. It’s also possible to run a lot of tests, automatically, and often simultaneously. With earlier generations of technologies, it was frequently a challenge to get enough test subjects to achieve statistical significance in your results. That problem has gone away — at least for the organizations that run popular Web services.
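To see why population size was such a constraint, here is a back-of-the-envelope sketch using the standard two-proportion sample-size formula. The scenario and numbers are illustrative, not drawn from any real service:

```python
import math

def sample_size_per_group(p1: float, p2: float,
                          z_alpha: float = 1.96,  # two-sided 95% confidence
                          z_beta: float = 0.84    # 80% statistical power
                          ) -> int:
    """Approximate subjects needed in each arm to reliably detect a shift
    in a conversion-style rate from p1 (control) to p2 (treatment)."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Detecting a tiny shift (10.0% -> 10.1%) takes roughly 1.4 million users
# per arm; a 1-point shift (10% -> 11%) takes only about 15,000.
n_tiny = sample_size_per_group(0.10, 0.101)
n_big = sample_size_per_group(0.10, 0.11)
```

A lab study with a few dozen participants can only detect large effects; a service with hundreds of millions of users can detect effects far too small to matter, which is exactly why the old recruitment problem has evaporated for popular Web services.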
While these methods originated in academic disciplines, today the practice of them for A/B testing of large-scale Internet services has moved significantly beyond the academic practice. Services such as Facebook, Google and Bing are constantly running A/B tests on some small fraction of their users, usually on relatively small feature additions or changes (or page layout modifications) and the underlying services are updated daily — and sometimes hourly — as results are analyzed and decisions are made about specific features.
I can’t emphasize enough that in Web service companies, A/B testing is a core part of the culture. Lots of people design and run A/B tests, and participate in analyzing the results and deciding what action to take. This is another very important part of how industry has moved beyond the standard academic methods: most of the people doing A/B testing have little to no understanding of the origins of this methodology or how it was honed in academia over the years.
One of the most important ways in which the practice was refined by the academic community was in recognizing the responsibilities that come with conducting experiments on human subjects. There is an extensive litany of “academic” psychological and medical experiments gone horribly wrong over the years, including notable, oft-discussed ones such as Milgram’s electric shock experiment and Zimbardo’s prison experiment. And of course we can go farther back to Mengele’s experiments on prisoners under the Nazi regime. Out of this sordid history came an imperative to develop standards for professional ethics when conducting experiments on human subjects. In the United States, the gold standard for this is the Belmont Report, written in 1979, and the APA points to it as their standard.
The Belmont Report begins by identifying three core ethical principles:
Respect for persons incorporates at least two ethical convictions: first, that individuals should be treated as autonomous agents, and second, that persons with diminished autonomy are entitled to protection. The principle of respect for persons thus divides into two separate moral requirements: the requirement to acknowledge autonomy and the requirement to protect those with diminished autonomy.
Persons are treated in an ethical manner not only by respecting their decisions and protecting them from harm, but also by making efforts to secure their well-being. Such treatment falls under the principle of beneficence. The term “beneficence” is often understood to cover acts of kindness or charity that go beyond strict obligation. In this document, beneficence is understood in a stronger sense, as an obligation. Two general rules have been formulated as complementary expressions of beneficent actions in this sense: (1) do not harm and (2) maximize possible benefits and minimize possible harms.
Who ought to receive the benefits of research and bear its burdens? This is a question of justice, in the sense of “fairness in distribution” or “what is deserved.” An injustice occurs when some benefit to which a person is entitled is denied without good reason or when some burden is imposed unduly… the selection of research subjects needs to be scrutinized in order to determine whether some classes (e.g., welfare patients, particular racial and ethnic minorities, or persons confined to institutions) are being systematically selected simply because of their easy availability, their compromised position, or their manipulability, rather than for reasons directly related to the problem being studied. Finally, whenever research supported by public funds leads to the development of therapeutic devices and procedures, justice demands both that these not provide advantages only to those who can afford them and that such research should not unduly involve persons from groups unlikely to be among the beneficiaries of subsequent applications of the research.
It goes on to discuss how to apply these principles, addressing:
1. Informed consent.
Respect for persons requires that subjects, to the degree that they are capable, be given the opportunity to choose what shall or shall not happen to them. This opportunity is provided when adequate standards for informed consent are satisfied.
While the importance of informed consent is unquestioned, controversy prevails over the nature and possibility of an informed consent. Nonetheless, there is widespread agreement that the consent process can be analyzed as containing three elements: information, comprehension and voluntariness.
The assessment of risks and benefits requires a careful arrayal of relevant data, including, in some cases, alternative ways of obtaining the benefits sought in the research. Thus, the assessment presents both an opportunity and a responsibility to gather systematic and comprehensive information about proposed research. For the investigator, it is a means to examine whether the proposed research is properly designed. For a review committee, it is a method for determining whether the risks that will be presented to subjects are justified. For prospective subjects, the assessment will assist the determination whether or not to participate.
Just as the principle of respect for persons finds expression in the requirements for consent, and the principle of beneficence in risk/benefit assessment, the principle of justice gives rise to moral requirements that there be fair procedures and outcomes in the selection of research subjects.
Within academia there exist “Institutional Review Boards” (IRBs) that are responsible for reviewing proposed research projects that involve human subjects. The Belmont Report framework provides the starting point for the review criteria that are used within each institution. Any researcher conducting a research experiment that involves human subjects is required to follow the IRB’s process for review and approval of that project.
It’s significant to note that IRBs are almost universally hated by academic researchers for being slow, overly bureaucratic, poor at understanding the proposed research, and often behind the times in understanding the latest technology. Nevertheless, they play a critical role in ensuring that research on human subjects is conducted ethically. Further, researchers (academic or industry) who receive federal research funding are required to have an appropriate IRB review and approve any research experiments involving human subjects, AND institutions receiving federal research funding are required to have an IRB and to institute a process through which all research experiments involving human subjects are reviewed and approved by the IRB.
Most academic journals and proceedings will not publish submitted research papers that involve human subject trials that have not been reviewed and approved by the relevant IRB.
So as you can see, there is a substantial amount of infrastructure and process invested into oversight of human-subject experiments within academia. You could argue that it’s too much, but its purpose is clear.
Industry organizations, including the vast majority of web service companies, have almost none of this, unless they are accepting federal research funding — and almost none do. Arguably, their A/B tests are human-subject trials; the vast majority are probably harmless and require little in the way of approval, or could be “auto-approved” against an internally-published set of guidelines for what is acceptable without review.
But the ethical standards raise other important issues for the way Web companies run their A/B testing. The first one is the issue of informed consent. Nearly all web services (at least the ones with in-house lawyers) have a “terms and conditions” page which states that the organization is collecting data on its users and using it both for ongoing research and to improve their service. Many explicitly ask you to click an “agree” button at one point or another. But is that “informed consent” for A/B testing?
Actually, this question has been debated for 15+ years. Here’s a pointer to a report from a 1999 workshop in the US that addressed issues around doing research on the Internet. Here’s another from a 2007 workshop in the UK, in which they specifically discuss the problem of whether clicking “agree” on a terms & conditions page is acceptable as “informed consent” and their conclusion is “no” — because it’s well understood that most people don’t read it. Legally, it’s acceptable and you probably couldn’t get sued. Ethically and professionally, it’s not acceptable.
Further, I have yet to see a terms and conditions page that explains that users will be subjects in A/B testing. Their descriptions would adequately account for “case 1” testing as I described it above: we collect data on how you use our service, and we analyze that data. But none of them explain that they may change the system without telling you in order to see how you respond. In no sense is this informed consent. In fact, in order to pass the “informed consent” test, it would need to go even further and explain the kinds of changes that might be made. And even further, special provisions would need to be made for children using the service, as they might not fully appreciate what they are agreeing to — this falls under the “respect for persons” principle above and its provision for people with diminished autonomy.
OK, thanks for bearing with me as I laid all that out. Time to (finally!) dive into the Facebook case.
In 2012, a data scientist at Facebook, Adam Kramer, worked with a Cornell professor, Jeff Hancock, and a Cornell doctoral student, Jamie Guillory (now at UCSF), to design an A/B test to evaluate the conventional wisdom that when Facebook users see their friends posting happy things, it makes them sadder. Their approach to this was to modify the algorithm that selects content for the News Feed feature in the “B” version so that it removes a small number of either positive or negative stories, and then to evaluate the subsequent content posted by that user to see if it skews more positive or negative. They ran the test on over 600,000 Facebook users for a week, analyzed the results, wrote up a paper on the results, and submitted it to the Proceedings of the National Academy of Sciences (PNAS) for peer review and publication. The paper successfully ran the peer review gauntlet and was ushered through editing and publication by one of the PNAS editors, Susan Fiske. It came out last week, and caused significant outrage.
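To make the design concrete, here is a heavily simplified, hypothetical sketch of the two pieces: suppressing a share of emotional stories from the “B” version of the News Feed, and scoring the emotional tone of a user’s own subsequent posts. The actual study classified words using the LIWC lexicon; the tiny word lists and the dropping rule below are placeholders of my own, not Facebook’s code:

```python
# Tiny placeholder word lists; the actual study used the much larger
# LIWC lexicon to classify words as positive or negative.
POSITIVE = {"happy", "great", "love", "wonderful", "excited"}
NEGATIVE = {"sad", "awful", "hate", "terrible", "lonely"}

def filter_feed(stories, suppress, drop_every=10):
    """Hypothetical "B" condition: silently drop roughly 1 in `drop_every`
    stories that contain words from one emotion category."""
    kept, matches = [], 0
    for story in stories:
        words = {w.strip(".,!?").lower() for w in story.split()}
        if words & suppress:
            matches += 1
            if matches % drop_every == 0:
                continue  # this emotional story never reaches the user
        kept.append(story)
    return kept

def emotion_rates(posts):
    """Fraction of a user's own words that are positive / negative --
    the outcome measure compared between the A and B groups."""
    words = [w.strip(".,!?").lower() for post in posts for w in post.split()]
    if not words:
        return 0.0, 0.0
    pos = sum(w in POSITIVE for w in words) / len(words)
    neg = sum(w in NEGATIVE for w in words) / len(words)
    return pos, neg
```

The experimental question then reduces to: do users whose feeds had positive stories suppressed show a lower positive-word rate (and higher negative-word rate) in their own posts than the control group, and vice versa.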
Among the people who got upset about this were the Facebook users who felt violated; there was a “creep factor” for them in having Facebook manipulate their moods without their awareness, and it was amplified by Facebook’s status as a for-profit corporation — in Facebook’s hands, who knows how a tool like this might be used to generate profits and increase their power. The unchecked power of large, faceless corporations with little accountability to their customers or to anyone else is a point of significant concern in the wider population.
There were also the professionals who work in this field who were angry that these researchers would openly flout the ethical standards of their chosen profession with an experiment seemingly constructed and executed in haste and without the required oversight. There have been many concerns raised suggesting that an IRB with full knowledge of this experiment would not have approved it, and that if PNAS had full knowledge of the design, review and approval status, and execution of this experiment it would not have accepted it for publication.
On the other hand, there is another community of people who believe that this is a tempest in a teapot, and that Facebook and the researchers did nothing wrong — or at least nothing that isn’t done hundreds of times every day by Web service companies.
In the back-and-forth between these groups, along with limited comments from the three researchers, key issues have emerged, as well as some more information about what actually transpired.
The details on exactly what reviews and approvals happened before this test was run are very sketchy. Facebook didn’t have an IRB in 2012, and they have not been forthcoming about what internal review and approval, if any, the test was subject to. Both Hancock and Guillory were required to seek IRB approval from their own institution (Guillory was a Cornell student at the time). According to Fiske, both Hancock and Guillory told her that their IRB had approved it because it was using a “pre-existing data set” — which would be the case if it was a “case 1” type of study, or if Kramer had run the A/B test and then involved Hancock. But that doesn’t square, because the paper clearly states that Hancock and Guillory were involved in designing the research experiment; Hancock either misrepresented his involvement in the study to his IRB, or he waited to seek approval until after the experiment was run. Yesterday Cornell made a statement in which it’s made clear that the Cornell IRB did not, in fact, review the experiment, based upon representations from Hancock and Guillory that they were not involved in the actual implementation of the experiment — just the up-front design and the data analysis afterwards. That was a major mistake on the part of the Cornell IRB, and a significant contributor to the end result. It’s unclear whether they made that decision based upon full information.
The paper also claims that because the users have all accepted the Facebook terms and conditions, they have provided informed consent. That is at best a stretch, and at worst a blatant misrepresentation of both “informed” and “consent” for an experiment that was designed to manipulate users’ moods. Many professionals have echoed this concern in the past few days.
There were several failures in this situation. First, there was insufficient oversight before the experiment was run. Second, informed consent was not obtained from the human subjects before the experiment was run. Third, there was a breakdown in communicating the full parameters of the experiment to the relevant IRBs, and in communicating the results and limits of those IRB reviews. Fourth, the human subjects were never debriefed after the experiment, as they were entitled to be under the prevailing ethical standard for this kind of work. This is all very fixable; the underlying problem is that at the interface between academia and industry, we transferred the foundational knowledge of how to do this kind of research — and extended and expanded it — but we never transferred the ethical framework that guides its proper practice.
“All web companies do A/B testing. This is nothing new.” Yes, pretty much all web companies do A/B testing. Almost none of them ever describe to their users what A/B testing actually is, or what their users’ experience will be. Further, designing an A/B test to modify users’ moods is breaking new ground and beyond the kind of feature changes that even experienced users might expect. This completely fails the “informed” test.
“They accepted the Facebook terms and conditions.” The Facebook terms and conditions page only says that they collect data on how you use the service. Accepting that is hardly consenting to have Facebook try to modify your mood without your awareness. It’s probably not illegal, as some have suggested, but it definitely isn’t up to the accepted professional standard for this kind of work.
“The Cornell IRB approved the study.” As I explained above, that is not true. The Cornell IRB did not approve the experiment.
“If you had told the users you were going to do this, you would have biased the study.” Only if the experiment was designed poorly. If you told them it was going to happen over one week in a three month period and you waited a month before you started, the vast majority would not have noticed. And I would humbly suggest that you could probably have derived the same result from a careful “Case 1” analysis — with a log of millions of users’ daily activities, I suspect that you could find the same pattern the researchers found just within typical variation. But further: is this a research result that is so badly needed that it justifies conducting experiments on human subjects without their awareness that they are being experimented on and what the risks to them might be? I have a very difficult time saying “yes” to that.
“The effect they found was minimal — so the harm they inflicted on users was minimal to none.” Perhaps, but they didn’t know that would be the result before they ran the test on real people: adults and children, people with mood disorders, people in highly stressful situations, etc. Kramer, Hancock and Guillory ran an experiment that had the potential to modify Facebook users’ moods — in fact, their experimental hypothesis was that it would modify users’ moods. While they tried to design it to have a minimal effect, they had no a priori knowledge of how much effect it would actually have. The point of getting experiments reviewed and approved ahead of time is to raise questions about what the effect will be, so that there can at least be appropriate monitoring of the test subjects and intervention if necessary, or a test can be stopped entirely if the risks to human subjects are too high. There is no indication that any kind of monitoring regime was in place as part of the test protocol. You don’t make judgments about this after the experiment is run; you make them before you run it.
“All media manipulate emotions. What’s the big deal?” Good writers (of books, of scripts, of editorials, of speeches) try to elicit emotional reactions. Likewise, advertisements are designed to make an emotional impact in order to get us to buy something. But the give-and-take there is clear: we know what the game is when we read something or see an advertisement. We may or may not be skilled at resisting appeals to our emotions, but no one is trying to hide the fact that this is being done. This case is different. Facebook’s entire design intentionally underemphasizes its editorial role in favor of user-supplied content, and in the past users have become upset whenever Facebook interfered with that — including blurring the line between content and advertisements with “sponsored posts.” No one would have expected that Facebook would actively and intentionally try to manipulate users’ moods by filtering that content. This is so far out of the norm for what people think Facebook does with its site, and for how Facebook represents itself to its user community, that to do this without any kind of advance disclosure and/or informed consent is arguably a significant violation of the trust between Facebook and its users.
“You’re just slowing down innovation.” Look, the typical A/B test needs minimal review, if any review at all — so long as some group of people have thoughtfully laid down the boundaries for what kinds of tests are expected to present little or no risk to human subjects. But most IRBs would have serious concerns with a test that intended to modify subjects’ mood, and for very good reasons. Some innovations need to be slowed down. There, I said it.
“By making a fuss about this, companies will just go underground; they will publish less about their internal experiments, and they will collaborate less with academics, which will deprive us all of valuable information and learnings about how these systems work and how people interact with them.” Hardly. The vast majority of A/B test results are uninteresting and not worthy of publishing. The truly interesting ones, from a research perspective — the ones worthy of broad dissemination — are designed by researchers, either those hired by the companies themselves or academic collaborators. I can tell you from personal experience that most of the researchers hired by companies want to publish and want to continue to have ongoing academic collaborations; not only does their work benefit from collaboration and publishing, but their careers benefit, as it gives them more options down the line. Many industry researchers worry that their company will lose interest in doing research and their job will disappear in a budget-tightening exercise some day in the not-too-distant future. A researcher who goes “off the radar” for even 2-3 years has a much more difficult time transitioning from industry back to academia because of the gap in their publishing record. The only result that will come of a company “going dark” will be that it loses its best researchers, who all have plenty of options and no desire to toil in obscurity or put their long-term career trajectory at risk.
It’s pretty clear from Facebook’s more recent “non-statement” statement on the issue that the lawyers have taken over, and I doubt we will get any more substantial information directly from any of the stakeholders. My hope is that there are internal reviews happening at Facebook, Cornell and PNAS, and that processes will be improved so this situation isn’t repeated. There are also important lessons for the web services industry as a whole, and I hope the influential people are paying attention.