We’re Not Done Yet: After Developing the AI

Developing AI is a dynamic, multifaceted process. Even if an AI performs optimally from a technical standpoint, other constraining factors could limit its overall performance and acceptance. Developing an AI to be safe and dependable means stakeholders must learn more about how the AI functions as the risks from its use increase. This section details factors that make that understanding challenging to achieve, and describes how proper documentation, explanations of intent, and user education can improve outcomes.

Explore the Three Fails in This Category:

Testing in the Wild

Test and evaluation (T&E) teams work with algorithm developers to outline criteria for quality control, but they can’t anticipate all algorithmic outcomes. The consequences of (and even blame for) the unexpected results are sometimes transferred onto groups who are unaware of these limitations or have not consented to being test subjects.

Examples

Boeing initially blamed foreign pilots for the 737 MAX crashes, even though a sensor malfunction, faulty software, a lack of pilot training, the decision to make a safety feature an optional purchase, and the omission of the software from the pilot manual were all contributing causes.1

In 2014, UK immigration rules required some foreigners to pass an English proficiency test. A voice recognition system was used as part of the exam to detect fraud (e.g., if an applicant took the test multiple times under different names, or if a native speaker took the oral test posing as the applicant). But because the government did not understand how high the algorithm’s error rate was, and because each flagged recording was checked by undertrained employees, the UK cancelled thousands of visas and deported people in error.2,3 Thus, applicants who had followed the rules suffered the consequences of the algorithm’s shortcomings.

Why is this a fail?

T&E of AI algorithms is hard. Even for AI models that aren’t entirely black boxes we have only limited T&E tools4,5 (though resources are emerging6,7,8). Difficulties for T&E result from:

Uncertain outcomes: Many AI models are complex, not fully explainable, and potentially non-linear (meaning that small or unusual changes in input can produce disproportionate, unexpected changes in output), and we don’t have great tools to help us understand their decisions and limitations.9,10,11

Model drift: Due to changes in data, the environment, or people’s behavior, an AI’s performance will drift, or become outdated, over time.12,13

Unanticipated use: Because AI interacts with people who probably do not share our skills or understanding of the system, and who may not share our goals, the AI will be used in unanticipated ways.

Pressures to move quickly: There is a tension between resolving to develop and deploy automated products quickly and taking time to test, understand, and address the limitations of those products.14

Because of all these difficulties, deployers and consumers of AI models often don’t know the range or severity of consequences of the AI’s application.15
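One way to make the model drift difficulty above concrete is to compare the data a model was built on with the data it sees after deployment. The following is a minimal sketch, not a method taken from the cited sources: it computes a population stability index (PSI) for a single numeric input. The function name, the bin count, and the sample data are all assumptions made for illustration.

```python
import math


def population_stability_index(reference, live, bins=10):
    """Compare two samples of one numeric feature; larger values mean more drift."""
    lo, hi = min(reference), max(reference)
    # Fall back to a width of 1.0 if every reference value is identical.
    width = (hi - lo) / bins or 1.0

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_shares = bin_shares(reference)
    live_shares = bin_shares(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_shares, live_shares))


# Hypothetical data: the live inputs have shifted upward since development.
reference_sample = [i / 100 for i in range(1000)]
shifted_sample = [v + 5 for v in reference_sample]
psi_same = population_stability_index(reference_sample, reference_sample)
psi_shifted = population_stability_index(reference_sample, shifted_sample)
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a sign of substantial shift, but any threshold is a judgment call for the deploying team. A check like this only flags that inputs have changed; it does not say whether the AI's decisions have become worse.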

Jonathan Zittrain, Harvard Law School professor, describes how the issues that emerge from an unpredictable system will become problematic as the number of systems increases. He introduces the concept of “intellectual debt,” which applies to many fields, not only AI. For example, in medicine some drugs are approved for wide use even when “no one knows exactly how they work,”16 but they may still have value. If the unknowns were limited to only a single AI (or drug), then causes and effects might be isolated and mitigated. But as the number of AIs and their interactions with humans grows, performing the number of tests required to uncover potential consequences becomes logistically impossible.

What happens when things fail?

Users are held responsible for bad AI outcomes even if those outcomes aren’t entirely (or at all) their fault. A lack of laws defining accountability and responsibility for AI means that it is too easy to blame the AI victim when something goes wrong. The default assumption in semi-autonomous vehicle crashes, as in the Boeing 737 MAX tragedies, has been that drivers are solely at fault.17,18,19,20,21 Similarly, reports on the 737 crashes showed that “all the risk [was put] on the pilot, who would be expected to know what to do within seconds if a system he didn’t know existed… forced the plane downward.”22 The early days of automated flying demonstrated that educating pilots about the automation capabilities and how to act as a member of a human-machine team reduced the number of crashes significantly.23,24,25

As a separate concern, the individuals or communities subject to an AI can become unwilling or unknowing test subjects. Pedestrians can unknowingly be injured by still-learning, semi-autonomous vehicles;26 oncology patients can be diagnosed by an experimental IBM Watson (Watson is in a trial phase and not yet approved for clinical use);27 Pearson can offer different messaging to different students as an experiment in gauging student engagement.28 As the AI Now Institute at New York University (a research institute dedicated to understanding the social implications of AI technologies) puts it, “this is a repeated pattern when market dominance and profits are valued over safety, transparency, and assurance.”29




Hold AI to a Higher Standard	Involve the Communities Affected by the AI	Make Our Assumptions Explicit	Monitor the AI’s Impact and Establish Layers of Accountability
It’s OK to Say No to Automation	Plan to Fail	Try Human-AI Couples Counseling	Envision Safeguards for AI Advocates
AI Challenges are Multidisciplinary, so They Require a Multidisciplinary Team	Ask for Help: Hire a Villain	Offer the User Choices	Require Objective, Third-party Verification and Validation
Incorporate Privacy, Civil Liberties, and Security from the Beginning	Use Math to Reduce Bad Outcomes Caused by Math	Promote Better Adoption through Gameplay	Entrust Sector-specific Agencies to Establish AI Standards for Their Domains

Government Dependence on Black Box Vendors

Trade secrecy and proprietary products make it challenging to verify and validate the relevance and accuracy of vendors’ algorithms. These examples demonstrate the importance of at least knowing the attributes of the data and processes for creating the AI model.

Examples

COMPAS, a tool that assesses prison inmates’ risk of recidivism (repeating or returning to criminal behavior), produced controversial results. In one case, because of an error in the data fed into the AI, an inmate was denied parole despite having a nearly perfect record of rehabilitation. Since COMPAS is proprietary, neither judges nor inmates know how the tool makes its decisions.1,2

The Houston Independent School District implemented an AI to measure teachers’ performance by comparing their students’ test scores to the statewide average. The teachers’ union won a lawsuit, arguing that the proprietary nature of the product prevents teachers from verifying the results, thereby violating their Fourteenth Amendment rights to due process.3

Why is this a fail?

For government organizations, it’s often cheaper or easier to acquire algorithms from, or outsource algorithm development to, third-party vendors. To verify and validate the delivered technology, the government agency needs to understand the methodology that produced it: from analyzing what datasets were applied, to knowing the objectives of the AI model, to ensuring the operational environment was captured correctly.

What happens when things fail?

Often the problems with the vendors’ models come about because the models’ proprietary nature inhibits verification and validation capabilities. For example, if the vendor modified or added to the training data that the government supplied for the algorithm, or if the government’s datasets and operating environment have evolved from those provided to the vendor, then the AI won’t perform as expected. Unless the contract says otherwise, the vendor keeps its training and validation processes private.

In certain cases the government agency doesn’t have a mature enough understanding of AI requirements and acquisition to prevent mistakes. Sometimes a government agency buys a service rather than a product. For example, since government agencies usually don’t have fully AI-capable workforces, an agency might provide its data to the vendor with the expectation that the vendor’s experts will discover patterns in the data. In some of these instances, agencies have forgotten to hold back some data to serve as a test set, which is necessary because the same data cannot be used for both training and testing the product.
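The held-back test set mentioned above can be sketched in a few lines. This is an illustrative sketch only: the function name, the 20 percent holdout fraction, and the fixed seed are assumptions, not a procedure described by any agency or vendor.

```python
import random


def split_for_vendor(records, holdout_fraction=0.2, seed=0):
    """Return (shared_with_vendor, held_back) with no record in both."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded shuffle makes the split reproducible for later audits.
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical usage: record IDs stand in for the agency's real data.
all_records = list(range(100))
shared, held_back = split_for_vendor(all_records)
```

The agency would give only `shared` to the vendor and keep `held_back` private, then score the delivered model on `held_back` itself. A purely random split is the simplest choice; data that changes over time may call for holding back the most recent records instead.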

These verification and validation challenges will become more important, yet harder to overcome, as vendors begin to pitch end-to-end AI platforms rather than specialized AI models.





Clear as Mud

The technical and operational challenges in creating a perfectly understandable model can dissuade developers from including incomplete, but still helpful, context and explanations. This omission can prevent people from using an otherwise beneficial AI.

Examples

When UPS rolled out a route-optimization AI that told drivers the best route to take, drivers initially rejected it because they felt they knew better. Once UPS updated the system to provide explanations for some of its suggestions, the program had better success.1

A psychiatrist realized that Facebook’s ‘people you may know’ algorithm was recommending her patients to each other as potential ‘friends,’ since they were all visiting the same location.2 Explanations to both users and developers as to why this algorithm made its recommendations could have mitigated similar breaches of privacy and removed those results from the output.

Why is this a fail?

When we introduce an AI into a new system or process, each set of stakeholders – AI developers, operators, decision makers, affected communities, and objective third-party evaluators – has different requirements for understanding, using, and trusting the AI system.3 These requirements are also domain and situation specific.4

Especially as we begin to develop and adopt AI products that enhance or substitute for human judgment, it is essential that users and policymakers know more about how an AI functions and about the intended and unintended uses for the AI. Adding explanations, documentation, and context is so important because it helps calibrate trust in an AI; that is, it helps people figure out how to trust the AI to the extent it should be trusted. Empowering users and stakeholders with understanding can address concepts such as:

Transparency – how does the AI work and what are its decision criteria?
Traceability – can the AI help developers and users follow and justify its decision-making process?
Interpretability – can developers and users understand and make sense of any provided explanations?
Informativeness – does the AI provide information that different stakeholders find useful?
Policy – under what conditions is the AI used and how is it incorporated into existing processes or human decision making?
Limitations – do the stakeholders understand the limits of the AI and its intended uses?5,6,7

Traditionally, the conversation in the AI community has focused on transparency (AI experts refer to it as “explainability” or “explainable AI”). Approaches for generating AI explanations are very active areas of research, but coming up with useful explanations of how the model actually makes decisions remains challenging for several reasons. Technically, it can be hard because certain models are very complex. Current explainer tools can emphasize which inputs had the most influence on an answer, but not why they had that influence, which makes them valuable but incomplete. Finally, early research showed a tradeoff between accuracy and explainability, but this tradeoff may not always exist. Some of us have responded to the myth that there must be a tradeoff by overlooking more interpretable models in favor of more common but opaque ones.8
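To illustrate the point above about explainer tools, here is a minimal sketch of one widely used technique, permutation importance: shuffle one input at a time and measure how much the model's accuracy drops. The toy `black_box` model and the random data are hypothetical stand-ins, and this is not the specific tooling discussed in the cited sources.

```python
import random


def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Accuracy drop when each input column is shuffled; bigger drop = more influence."""
    rng = random.Random(seed)

    def accuracy(data):
        return sum(predict(r) == y for r, y in zip(data, labels)) / len(labels)

    baseline = accuracy(rows)
    drops = []
    for j in range(n_features):
        # Shuffle column j across rows, leaving the other columns untouched.
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled_rows = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(shuffled_rows))
    return drops


# Hypothetical opaque model: the evaluator only sees inputs and outputs.
def black_box(row):
    return int(row[0] > 0.5)


data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [black_box(r) for r in rows]
drops = permutation_importance(black_box, rows, labels, n_features=2)
```

Here the first input shows a large accuracy drop and the second shows none, which tells a reviewer which input matters to the model. It says nothing about why that input matters or whether relying on it is appropriate, which is the gap the paragraph above describes.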

What happens when things fail?

Cognitively, existing explanations can be misleading. Users can be tempted to impart their own associations or anthropomorphize an AI (i.e., attributing human intentions to it). Also, assuming causality when there is only correlation in an AI system will lead to incorrect conclusions.9 If these misunderstandings can cause financial, psychological, physical, or other types of harm, then the importance of good explanations becomes even greater.10

The challenge lies in expanding the conversation beyond transparency and explainability to include the multitude of ways in which AI stakeholders can improve their understanding and choice. If we adopt the mindset that the users, policymakers, auditors, and others in the AI workflow are all our customers, this can help us devote more resources to providing the context that these stakeholders need.




