• Posted on 16 May 2023
  • 60-minute read

Join Sally Cripps as she sheds light on uncertainty in decision making

In this talk, Professor Sally Cripps explores the difficulties of making decisions with limited information and how AI systems aid in decision making.

Uncertainty lies at the heart of AI. Why do I say that? AI systems are designed to aid or autonomously make decisions

Professor Sally Cripps


Descriptive transcript

Oh, hello, everybody. My name is Sally Cripps, and it's my very great pleasure to be speaking to you today. But before we begin, I'd like to acknowledge the Gandangara people, the traditional custodians of the land on which I'm recording this lecture, and pay my respects to their elders past and present.

The title of today's talk is "Uncertainty: Nothing is More Certain." I'm going to talk to you about uncertainty in decision-making, and that seemed a very apt title. It's a phrase used by a Roman statesman, Pliny the Elder, who was born 2,000 years ago this year, in 23 AD.

In a slightly more recent use, it was the title of a paper that I wrote that appeared last year, "Uncertainty: Nothing is More Certain." In it, we look at the difficulty of making decisions under huge amounts of ambiguity, particularly in environmental contexts. When I think about it, I try to think, well, what is certain? Really, the only thing I could come up with was the sun rising and setting, and hence you see this very lovely picture in front of you of the sun setting. That's about the only thing that I can think of that's going to happen reliably over the next few billion years at least, anyway.

Uncertainty is at the heart of AI, which is why that question mark just appeared. Why do I say that? It's because AI is essentially a system or a series of systems that either aid us in decision-making or autonomously make the decision. Examples of autonomous decisions are things like robots moving around, or an automated mine, or even, as we'll see later on, an algorithm which automatically determines whether somebody gets a visa for entry to the UK. There are examples of automated decisions, but of course, a lot of AI is used to aid humans making decisions. There's a human in the loop, which looks at the output of an AI system and decides whether or not to make a particular decision.

I like to think of AI systems as part of a circle or a cycle, because they're not linear. The part that begins the cycle when you build an AI system is what you already know—your knowledge and beliefs. Then you may decide to get some data. That's usually the way with AI systems. You get some data, and that data is there to help you understand the question mark in the middle—what it is you don't know. Then you have a model. The purpose of a model is to connect the data to the question mark, to the issue at hand.

Finally, using the data and the prior knowledge and beliefs, you get updated knowledge in order to make a decision, and so the cycle continues. Having made that decision, you've then got new knowledge and beliefs, and so on and so forth. But at the centre of it is always something I don't know. It's very rare to think of any decision that you make with 100% certainty.

Let's talk about those unknowns and their relationship to decisions. For some reason, lost in the mists of time, mathematicians and statisticians tend to denote things they don't know by a Greek letter, and I've chosen the Greek letter theta here to represent what it is we don't know.

Now I'm going to give you some examples of a decision that we need to make and an example of theta, the thing that we don't know. For example, the decision could be to set the price of a product at a supermarket, and what you don't know is the consumer demand for that product at that price. The decision could be to approve a visa. What you don't know is the riskiness of the applicant. You have a system in place, but you don't actually know for certain what the riskiness of the applicant is.

Another example that we're going to look at is the direction that a drone should go to locate pollution hotspots. In these situations, the decision is typically automated—that is, the drone would autonomously guide itself, but it still has to make a decision as to which way it goes. What we don't know is the pollution levels in the city. That's actually what we're trying to uncover.

The decision could be how to manage the horses in the national park, and indeed, that was one of the examples that we had in the paper, "Uncertainty: Nothing is More Certain." By no means the only unknown, but one of the unknowns is the impact of horses on natural habitat.

Another example could be where to drill for geothermal energy. In this day and age, as we try to cut back on fossil fuels, there's a big push into alternative forms of energy, and one of them is geothermal energy, which happens when we find hot granite. The temperature gradient provides us with energy, and it's a very clean form of energy. The decision facing anybody who wants to produce this type of energy is where the hot granite is—so where to drill. What we don't know is the subsurface geology.

In this slide, I'm going to talk about using that cycle—the elements of a decision problem—because for me, the cycle actually is the elements of the decision problem. The first element is an unknown quantity, something that we are uncertain about, but which will affect our decision.

Then there are data or other sources of information. For some reason, we give data or things we observe Roman letters. So in this case, that's Y. We've got little n observations, Y1 up to Yn. A bold font Y is just my notation for a vector.

Then we've got a probability model that's going to connect theta, the thing I don't know, to the data. That's done in two ways. Firstly, by the model. The model says, if I actually knew what the value of theta—the thing I don't know—is, how likely am I to observe the actual data? Hence, it's called a likelihood function.

Then, to update our prior knowledge with the knowledge in the data, we use Bayes' theorem. It says that the posterior distribution—the updated distribution, this thing P of theta given Y—is equal to the likelihood times the prior divided by a normalising constant, which makes sure things integrate to one.
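Written out, that is the familiar formula:

$$\underbrace{p(\theta \mid \mathbf{Y})}_{\text{posterior}} \;=\; \frac{\overbrace{p(\mathbf{Y} \mid \theta)}^{\text{likelihood}} \;\times\; \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(\mathbf{Y})}_{\text{normalising constant}}}$$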

The last two elements of a decision problem are the set of decisions that you could take, and the utility. I've called the set of decisions an action space and given it the letter capital A, so a decision, D, is in A. We're going to choose the decision, which I've labelled—I've got a bit of a cold—D star, as the decision which will maximise my utility. That's what that equation says.

Now, the important thing about this is, when I think about a utility—a utility, for those who don't know, could be anything. It's very subjective. It's about what it is I want to get out of that decision. The important thing, in relation to uncertainty, is that I must integrate over the unknown values of theta to get my overall utility, because I don't know what theta is. The integration is with respect to the posterior distribution of theta. So this equation here doesn't have theta in it, because I'm integrating over all possible values, which come from the posterior. Hence, this uncertainty around theta is critical in me making a decision, because it affects my utility.
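In symbols, the decision rule and the utility the slide describes are:

$$d^{*} = \arg\max_{d \in \mathcal{A}} U(d), \qquad U(d) = \int_{\Theta} u(d, \theta)\, p(\theta \mid \mathbf{Y})\, d\theta$$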

Okay. So now I'm going to go through two really toy examples to show you those elements of a decision problem, and then go through a real example about where we use this type of thinking to actually monitor pollution levels across a city.

So the example I'm going to start with is geothermal energy, which is clean energy. The hot granite is known to exist in the Cooper Basin in South Australia, but it exists a long way below the surface—four to five kilometres. So we really don't have very much data, and yet we need to make a decision. The decision, of course, is to drill or not to drill. That is the question. The unknown is whether the site has hot granite.

Now, you're told from prior geological data that the probability that any particular site will have granite is about one in 50. So that's our prior belief and knowledge. We're going to call that prior belief theta, and we're going to give it a Bernoulli distribution.

So we've moved away from talking about events. We're now talking about random variables. A Bernoulli random variable is a random variable that takes on the value of one if the event occurs and zero otherwise. So theta here is "the site has hot granite." It's going to equal one with probability one in 50, which is 0.02. It'll equal zero with probability 0.98. So that's what we don't know.
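In symbols, that prior is:

$$\theta \sim \text{Bernoulli}(0.02): \qquad P(\theta = 1) = 0.02, \qquad P(\theta = 0) = 0.98$$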

Then we're told that we have a core sample and it is analysed, and it is known from previous data that the core sample analysis is very accurate. It will correctly identify a site that has hot granite with probability 0.95 and correctly identify a site without hot granite with probability 0.94. Now, I haven't explicitly said it there, but the core sample comes back positive.

So now we're told we've got an unknown, but now we have a piece of data. We're going to use this information to construct a likelihood function—a very simple one, but a likelihood function nonetheless.

So Y is also a Bernoulli random variable. It comes back one if it's positive and zero if it's negative. We're told that, conditional on knowing there is hot granite, it will come back positive with probability 0.95. If there's no hot granite, there's still a chance it will come back positive, and that chance is one minus the probability that if there were no hot granite, it would come back negative. We're told it correctly identifies a site without hot granite—so theta equals zero—it'll come back negative, Y equals zero, with probability 0.94. So the probability that we'll still get a positive reading when the site has no granite is 0.06.

So now we're in a position to calculate the posterior. What we actually want to know is the probability the site has granite, conditional on the fact that I've got a positive core sample. That's equal to my likelihood times my prior divided by the probability that Y equals one—a normalising constant, making sure the probabilities add up to one. That is, the probability that theta equals zero given Y equals one plus the probability that theta equals one given Y equals one—the sum of those two things—is one. Another way to think about it is just as the probability of the evidence—the probability that if I went around taking core samples, I would get a one.

So that's how we're going to calculate it. The probability that I get a positive result is the probability that I get a positive result when the site actually has granite times the probability that the site has granite (0.95 × 0.02), plus the probability that I still get a positive result when the site doesn't have granite times the probability that it doesn't (0.06 × 0.98). This gives a total probability of 0.078, or approximately 8%.
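In numbers:

$$P(Y = 1) = \underbrace{0.95 \times 0.02}_{\text{granite}} + \underbrace{0.06 \times 0.98}_{\text{no granite}} = 0.019 + 0.0588 \approx 0.078$$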

So if we look at this, we should already be getting an idea that perhaps the attributes of the core sample are not quite as good as the 0.95 and 0.94 would seem to imply: the probability that a site has granite is 2%, whereas if we arbitrarily drilled sites, a positive result would come back 8% of the time—four times as often as a site actually has granite.

Therefore, not surprisingly, the probability that we're actually going to find hot granite conditional on our positive sample is about one in four, or about 24%.
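If you want to check the arithmetic yourself, here is a minimal sketch in Python (my own illustration, not code from the talk; the variable names are mine):

```python
# Posterior probability that the site has hot granite, given a
# positive core sample (numbers taken from the talk).
prior = 0.02                   # P(theta = 1): prior chance of hot granite
p_pos_if_granite = 0.95        # P(Y = 1 | theta = 1)
p_pos_if_no_granite = 0.06     # P(Y = 1 | theta = 0) = 1 - 0.94

# Normalising constant: overall chance a core sample comes back positive.
evidence = p_pos_if_granite * prior + p_pos_if_no_granite * (1 - prior)

posterior = p_pos_if_granite * prior / evidence
print(f"P(Y = 1)           = {evidence:.4f}")   # ~0.078
print(f"P(granite | Y = 1) = {posterior:.3f}")  # ~0.244, about one in four
```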

Now we've got almost all the way around that circle, and now we have to figure out what sort of a decision that we want to make based on that.

Here, the decision—again, to drill or not to drill. Our action space is A: D1 being drill, D0 being not drill. We want to choose whichever of D1 or D0 maximises my utility.

Now, the formula for utility that I gave you in the previous slide had an integration sign instead of a summation sign. That's just because in the previous slide, theta was the more general case, which is usually continuous rather than discrete. But in this particular case, theta takes on the values only of zero or one.

So what that statement says is: my overall utility is the utility I get if I take a particular decision and theta comes out as zero, times the probability that theta is zero, plus the utility I get for that decision if theta is one, times the probability that theta is one. Basically, a weighted average of the utilities, where the weights depend on the probability that the site has granite or not.
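That weighted average, written out:

$$U(d) = u(d, \theta = 0)\, P(\theta = 0 \mid Y) + u(d, \theta = 1)\, P(\theta = 1 \mid Y)$$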

So if we decide not to drill (D0), our utility—we're going to define utility here as profit—will be zero if I don't drill. Now, if I do drill, the cost is going to be $100 million. The million-dollar question is whether or not there is granite, conditional on the fact that I've already got a positive sample. Now, the probability that there's no granite is 0.76, and I'll get zero revenue from that, so I'll end up losing the $100 million that I spent in the cost. Or else I'll get a revenue of $700 million, meaning a total profit of $600 million if there is granite, and that happens with probability 0.24.

So I'm just going to take the weighted average of those profits to get an overall utility of $68 million.

So, if my utility is defined as the profit going forward, then obviously I would decide to drill rather than not to drill, because I'm getting $68 million from drilling and zero from not drilling. So that's just a very simple problem about using uncertainty to make a decision.

In this slide, I'd like to point out a very important concept, which is the value of information. So the core sample, I'm saying, cost $10 million—was it worth it?

The way to think of the value of information is that a piece of information has no value in a rational world if I'm not going to do something different contingent upon that information in some circumstances. For example, if I had decided not to drill even when the core sample came back positive, then obviously I could have worked that out in advance, and I should not have gone to the hassle or the expense of getting the core sample.

So let's work through this example. If we didn't take the core sample, then we would clearly decide not to drill. Just based on prior probabilities, we would only get that revenue of $700 million with probability 0.02 and we would lose $100 million with probability 0.98, giving an expected profit of minus $86 million. So we would definitely decide not to drill.

Now, with the core sample, it's going to cost $10 million, and if the sample comes back as positive, we've already shown that we would actually drill. I'm sorry, I've got a cat running around. I'm terribly sorry about that. That will happen with probability 0.078. If the core sample comes back negative, then we won't drill. Why won't we drill? Well, we could go through the maths again. Conditional on Y equals zero, the probability that we're actually going to find granite is very tiny—it's 0.0011—and the overall expected profit is minus $98 million.

So let's just recap and think about how much that core sample was worth. We know it is worth something. Why? Because without the core sample, we would have decided not to drill. With the core sample, we would have decided to drill under some circumstances, and those circumstances are when the core sample comes back positive. How often does the core sample come back positive? 7.8% of the time.

So the expected utility from getting that sample would have been the 7.8% times $68 million, which is, in fact, $5.3 million. So it's not worth the $10 million that we paid for it, but it is worth something.
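Again as a sketch (my own code, using the rounded figures from the slides), the whole value-of-information argument fits in a few lines:

```python
# Expected profits ($ millions) and the value of the core sample,
# using the talk's rounded figures (0.24 and 0.078).
cost, revenue = 100, 700

prior = 0.02        # P(granite) with no core sample
post_pos = 0.24     # P(granite | positive core sample)
p_pos = 0.078       # P(positive core sample)

# Without the sample: expected profit from drilling on the prior alone.
drill_no_sample = prior * (revenue - cost) + (1 - prior) * (-cost)      # -86 -> don't drill

# With the sample: drill only on a positive result (profit 68), else don't (profit 0).
drill_if_pos = post_pos * (revenue - cost) + (1 - post_pos) * (-cost)   # 68
value_of_sample = p_pos * max(drill_if_pos, 0)                          # ~5.3

print(drill_no_sample, drill_if_pos, value_of_sample)
```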

This next slide is just to take a bit of a breather after working through all those calculations. I want to make the point here that making decisions under uncertainty is very difficult, and it is very difficult for human beings to hold in their mind all those myriad options in front of them, thinking about the future and what its consequences are.

There are two views of human capabilities. One view is this lovely piece by Shakespeare: "What a piece of work is man, how noble in reason, how infinite in faculties." This is from Hamlet. Hamlet is feeling a bit miserable and dwelling on why—he's saying how wonderful a human being is, and yet he still feels miserable. As a matter of fact, Hamlet has quite good reason to feel miserable—he thinks his uncle has murdered his dad and married his mum, which is enough really to make anyone feel miserable. But anyway, this is a lovely piece of prose about just how infinite in faculties, how wonderful our capabilities are.

Now, this is a slightly different view of our capabilities of reasoning. This is by Herbert Simon, Nobel laureate. I'll just read it for you: "The capacity of the human mind for formulating and solving complex problems is very small compared to the size of the problems whose solution is required for objectively rational behaviour in the real world, or even for a reasonable approximation to such objective rationality."

I've got to say, I tend to side with Mr Simon on this. I think that we do have a very limited capability for making rational decisions in such a complex world, which is the point of this slide: we need a framework in which to study it. What I did back then may have seemed a complicated way of attacking a very simple problem, but the point is that it generalises to a much larger class of problems and enables us to make rational decisions. You may argue rationality is not all it's cracked up to be, and I think I'd probably agree with you a lot, but there is a place for rationality.

This is another toy example, where I'm going to try to show you how we think about sequentially acquiring information. I've chosen a legal example. In my new role, I've found myself surprisingly surrounded by lawyers, and it occurs to me that one of the things that happens in law is that they make decisions under uncertainty, or in lots of ambiguity, an awful lot of the time.

This is an example of a court case, and we're going to see the value of various pieces of information. It's a real court case, and it's a famous legal case where the perpetrator attempted the assassination of a world leader and was found not guilty on the basis of insanity, specifically that the person had schizophrenia.

There is no doubt that the person did, in fact, attempt the assassination. The defence is that the person was insane, and the defence based their case on the results of a CAT scan, which showed brain atrophy. It was put to the jury that, therefore, this person had schizophrenia and was not legally responsible. In fact, the jury agreed, and there was a big outcry about this particular decision.

Let's go through some of the data to see whether or not we would agree with this decision. The following are relevant pieces of information: schizophrenia is prevalent in 1.5% of the population. So you're hopefully thinking to yourself now, "Aha, that's a prior," and you're right. It's a prior. Again, our unknown here is whether or not the person has schizophrenia. Again, it's a Bernoulli random variable.

So theta is equal to one with probability 1.5%, or equal to zero—in other words, the person doesn't have schizophrenia—with a 98.5% probability. So that's our prior belief.

Then we're told that the person has had a CAT scan and it showed brain atrophy. Our data, which we denote by Y, is equal to one or zero. We're told that 30% of people with schizophrenia show brain atrophy compared with 2% of the non-schizophrenic population. This is telling us about our likelihood.

So we're told that if the person has schizophrenia, the probability that Y will come back as showing that they have brain atrophy is 30%. Again, Y is this Bernoulli random variable equalling one if there's brain atrophy and zero otherwise. If the person doesn't have schizophrenia, then we're told that they will have brain atrophy 2% of the time.

Now, of course, we need to turn the handle. We've got a prior, we've got a likelihood, we can calculate the probability of the evidence—that is, the probability that somebody comes back with brain atrophy—in exactly the same way as we calculated the probability that the site contained granite. We'll get a posterior probability that the person has indeed got schizophrenia of 18.6%.
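Turning the handle, exactly as in the granite example:

$$P(\theta = 1 \mid Y = 1) = \frac{0.30 \times 0.015}{0.30 \times 0.015 + 0.02 \times 0.985} = \frac{0.0045}{0.0242} \approx 0.186$$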

I'm not here to say whether that is beyond reasonable doubt, but 18.6% is certainly not 1% or 2% or even 5%. It is very much non-zero. My own personal view would be that if I thought there was an 18.6% chance, then that would probably constitute reasonable doubt. But there was more information, and we'll go over to that more information on the next slide.

Individuals with a first-degree relative who has schizophrenia have a 10% chance of developing the disorder, as opposed to the usual 1.5% of the population. You're told that the perpetrator has a sibling with schizophrenia. How would you update your belief regarding the insanity plea?

So we've now got two pieces of information: the CAT scan and the fact that the person has a sibling with schizophrenia. I'm now going to subscript that first piece of information regarding the CAT scan by the value one to indicate that that was the first piece of information and it relates to brain atrophy via a CAT scan.

Our new prior belief is, in fact, our old posterior. So, if you think about it, we're going around that cycle again. We found out that the person actually did have brain atrophy. We went from a prior probability of 1.5% to 18.6%. That now becomes our new prior probability as we consider this new piece of data.

The perpetrator has a sibling with schizophrenia, and I'm going to subscript that piece of information by two just to indicate that this is now a different piece of information, and it is the fact that the person has a sibling with schizophrenia. You're told that 10% of people with schizophrenic siblings develop schizophrenia.

So, if we've got our model in our head again, this is a likelihood function. If our perpetrator had schizophrenia—if that were true—then the probability that they'd have a sibling with schizophrenia is 10%. If our perpetrator didn't have it, then the probability that they'd have a sibling with schizophrenia would just be 1.5%. Again, we turn that handle and update our prior with the likelihood, and we now get 60%.
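Here is a minimal sketch of that handle-turning in Python (my own illustration, not code from the talk):

```python
# Sequential Bayesian updating for the two pieces of evidence:
# yesterday's posterior becomes today's prior.
def update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return P(theta = 1 | evidence) via Bayes' theorem."""
    evidence = p_evidence_if_true * prior + p_evidence_if_false * (1 - prior)
    return p_evidence_if_true * prior / evidence

belief = 0.015                         # prior: prevalence in the population
belief = update(belief, 0.30, 0.02)    # CAT scan shows atrophy -> ~0.186
belief = update(belief, 0.10, 0.015)   # sibling with schizophrenia -> ~0.60
print(f"{belief:.2f}")
```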

So, on the balance of probabilities, it is more likely than not that the person has schizophrenia. So this would definitely be, I would think, room for reasonable doubt.

This is an example of a Bayesian learning algorithm. This is how Google learns about you on the internet. You make purchases, you get these sequential pieces of information, and it's updating all the time what it doesn't know about you—theta. That theta might be your propensity to purchase a certain product, or your propensity to take a trip somewhere overseas, or anything of that nature. It is constantly updating and learning each time you do a search or make a purchase. All of these things are fed into something very similar to the algorithm that I've just described.

You've also heard Navinder talking about the DARPA challenge, and the way that Bluey—I can't remember the names of the other robots, but I do remember Bluey, because I've met Bluey—decides which way to go as well. It's part of this Bayesian learning algorithm.

So you may ask yourself, if pieces of information are costly, which piece of information would I collect first? Now I want to talk about the value of information as a form of utility.

Let's think about this particular case. We had our perpetrator. We started out with a prior belief that they had schizophrenia of 1.5%. When we got the piece of information one, we updated that prior belief from 1.5% to 18.6%. We can compute this ratio. What is this ratio? It's just my posterior divided by my prior, and that ratio for the first piece of information is 12.4.

Now, for the second piece of information, we can do the same thing. If we had not observed Y1 but had observed Y2 first, we could calculate the posterior probability that the perpetrator had, in fact, schizophrenia, and that would have been 9.2%. The ratio of the 9.2 to the 1.5 is 6.1. So you can see that the first piece of information shifted our belief by about twice as much as the second piece of information.
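In symbols:

$$\frac{P(\theta = 1 \mid Y_1 = 1)}{P(\theta = 1)} = \frac{0.186}{0.015} \approx 12.4, \qquad \frac{P(\theta = 1 \mid Y_2 = 1)}{P(\theta = 1)} = \frac{0.092}{0.015} \approx 6.1$$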

Now, I'm going to introduce a concept called the mutual information, but before I do, I want to point out that theta can take on the value of one or zero, and so can Y1 or Y2. This is just one particular realisation of theta and Y1. There are three other possible pairs—(0,0), (1,0), and (0,1)—as well as the (1,1) we observed. So this is just one particular realisation of these random variables, which are schizophrenia and brain atrophy in the first case, and schizophrenia and a sibling with schizophrenia in the second.

So, mutual information as a utility function. All I've done here is take this ratio of the posterior to the prior, take the log—there are a lot of ways to explain taking the log, but for now, probably the easiest explanation is to say that taking the log of a ratio is analogous to the percentage change. So you can think of the log of the ratio of the posterior to the prior as the percentage change that I get from moving from the prior to the posterior. Then what I'm doing, I'm weighting that percentage change by the probability that those outcomes occurred.

So, as I said before, here we have theta and Y1 being one, but in fact, both of them could have been zero or one. So I'm just basically taking a weighted average of those rates of change from the prior to the posterior, where that weighted average is determined by the probability that that particular pair of outcomes occurred.

If I do this, I calculate the mutual information for theta and Y1 and I get 0.0038; mutual information for theta and Y2 is 0.0016. So Y1 adds more information than Y2 on average, not just across when they both equal one.

Unfortunately, in the real world things are not just zeros and ones—they're continuous—but that's okay. All we need do is replace those summation signs by integral signs. This last one says that, in general, if we have a pair of random variables that are continuous with values over the space Θ × Y, then the mutual information is just the expected value of that rate of change, effectively.
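For reference, the two forms side by side (the choice of log base only changes the units of information):

$$I(\theta; Y) = \sum_{\theta \in \{0,1\}} \sum_{y \in \{0,1\}} P(\theta, y)\, \log \frac{P(\theta \mid y)}{P(\theta)}$$

$$I(\theta; Y) = \int_{\Theta \times \mathcal{Y}} p(\theta, y)\, \log \frac{p(\theta \mid y)}{p(\theta)} \; d\theta\, dy$$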

Now I'd like to extend what we've done in those two toy examples to how we actually use this in a completely autonomous AI system. In this particular example, we want to dynamically locate the maximum of a function—in this case, pollution hotspots. We have a drone flying around. This is taken from a piece of work that Roman did with drones flying about the city, and the drone has to decide where to go next. We want to discover the hotspots, but we want to do it in an optimal fashion, so we want to take the shortest possible path to find out where those hotspots are.

In this example, there are two forms of acquisition functions. One is entropy, which is analogous to—not exactly the same as—mutual information, but similar. The goal in the entropy function is really to reduce our uncertainty the most. Then there's another acquisition function, which unfortunately can't really be formulated formally as a utility function, but it has very good properties. What this other acquisition function does is try to minimise uncertainty but maximise the probability that I'm going to hit the maximum. So it explores, but it explores around the maximum—it's trying to have a trade-off between exploring and also hitting the maxima.

So, my unknown here—my theta—is now where that pollution hotspot is, and the decision of the drone is where do I go next.

What you're going to see is the path of the drone as it goes around to explore pollution levels, and the path is dictated by one of three acquisition functions: upper confidence bound, entropy, or just randomly going anywhere. This is the mean of the function (mu), this is the variance, and this is the path taken by the drone.

If we look at the entropy one and you follow where it's going, you can see that it's closely following the variance—it's trying to reduce the variance. If we look at the path of the upper confidence bound, you can see that the drone is automatically coming back to these two regions, because these are two regions of high pollution levels—these are the hotspots. It's exploring, but it's exploring around those regions so that it's identifying those regions that are of concern, as opposed to the one at the bottom, which is entirely random. It's really not doing a very good job of uncovering the pollution levels in an optimal fashion.

So we'll just let this run, and it's about to finish, and you can see that's the result of the drone. So that is an AI system where all those notions of the value of information that we've discussed, together with these acquisition functions, show how the drone is guided in a sensible way to get intelligent data acquisition.
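To make the contrast concrete, here is a rough sketch of those three rules (a toy illustration under my own assumptions, not Roman's implementation; in the real system the mean and variance would come from a posterior that is re-fitted after every measurement, and `kappa` is a hypothetical trade-off parameter):

```python
import numpy as np

# Sketch of the three acquisition rules, assuming we already have a
# posterior mean `mu` and standard deviation `sigma` over a grid of
# candidate locations.
rng = np.random.default_rng(0)
mu = rng.normal(size=100)                 # hypothetical posterior means
sigma = rng.uniform(0.1, 1.0, size=100)   # hypothetical posterior std devs

def next_location(mu, sigma, rule, kappa=2.0):
    if rule == "ucb":      # explore, but keep returning to likely maxima
        return int(np.argmax(mu + kappa * sigma))
    if rule == "entropy":  # chase the largest remaining uncertainty
        return int(np.argmax(sigma))
    return int(rng.integers(len(mu)))     # random baseline

for rule in ("ucb", "entropy", "random"):
    print(rule, next_location(mu, sigma, rule))
```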

I'd like to shift gears a little bit further at the moment and talk a little bit more about the notion of uncertainty and how that plays into decision-making. I've called this slide "Uncertainty: A Taxonomy" because I'd like to break down uncertainty into bits that we can do something about and bits that we can't really do anything about.

The first is the bit that we can't do much about, which is the inherent or aleatoric uncertainty. In this case, getting more data will not help. You can think of this as the outcome of a toss of a coin.

Then there's parameter uncertainty. An example of a parameter would be the probability that the coin will come up heads. Although I may not be able to perfectly predict the outcome of a coin flip, whether it comes up heads or tails, the more I flip it, the better my estimate of the probability of heads. So in that case, more data will help.

Then there's model uncertainty. Model uncertainty is harder to get at than parameter uncertainty, but very important nonetheless. Here, a model for coin tossing might be that the data are IID—independent and identically distributed—Bernoulli random variables. As you can see, this is a likelihood function; every time we propose a likelihood function, we're proposing a model. Likelihood functions are not the only sorts of models that we can think about, but they are a sort of model. What this model is effectively saying is that the outcome of one coin toss does not depend on a previous coin toss. That's a pretty good assumption here, but there are lots of cases where we make model assumptions when we could have chosen many others, and there is uncertainty around that choice that's often not appreciated.

Then there's knowledge uncertainty. Knowledge uncertainty I would paraphrase as "I simply don't know." Unfortunately, that also happens quite a bit of the time, but yet we still must make a decision.

So if we think about a future event—Y star here is a future event—and we want to have some idea of the likelihood of that future event occurring, for now I'm going to leave out knowledge uncertainty; I'll come back to it. We do have to take into account the first three types of uncertainty, though, which is what this equation is doing.

The first bit of that equation says that there's uncertainty. Even if I had the correct model, M, and I knew the parameters, there is still uncertainty about the outcome—so that's my inherent uncertainty, that's the outcome of a toss of a coin.

Then, conditional on a model, there's uncertainty around the probability of whether it's going to come up heads or tails. We all know that's about a half, but there are many cases where we don't know the parameters.

Then, lastly, there's uncertainty about the model. The uncertainty of a future event will actually encompass all of those uncertainties. They all must be considered if we're to accurately assess how likely that future event is.
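A standard way to write the equation being described—summing over candidate models M and integrating over parameters theta—is:

$$p(y^{*} \mid \mathbf{Y}) = \sum_{M} \underbrace{p(M \mid \mathbf{Y})}_{\text{model}} \int_{\Theta} \underbrace{p(y^{*} \mid \theta, M)}_{\text{inherent}}\; \underbrace{p(\theta \mid \mathbf{Y}, M)}_{\text{parameter}} \, d\theta$$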

Autonomous decisions can be made by AI. That includes robots moving around, automated mines, and algorithms determining visa approvals. AI is also used to assist humans in decision making. Without proper guardrails this can cause harm. 

The scale and speed of this technological impact is almost certainly unprecedented in human history. Regulation for this new era must be technologically well informed, responding to and where possible anticipating issues related to new technology. The goal should always be to ensure that humans experience the benefits of new technology, while addressing or at least mitigating harmful risks.

I like to think of AI systems as part of a cycle or a circle rather than a linear process. The cycle begins with existing knowledge and beliefs, which are then complemented with data.

Professor Sally Cripps
