Lawfare Daily: Christina Knight on AI Safety Institutes and Testing Frontier AI Models

Published by The Lawfare Institute
in Cooperation With
Christina Knight, Machine Learning Safety and Evals Lead at Scale AI and former senior policy adviser at the U.S. AI Safety Institute (AISI), joins Kevin Frazier, the AI Innovation and Law Fellow at Texas and a Senior Editor at Lawfare, to break down what it means to test and evaluate frontier AI models as well as the status of international efforts to coordinate on those efforts.
This recording took place before the administration changed the name of the U.S. AI Safety Institute to the U.S. Center for AI Standards and Innovation.
To receive ad-free podcasts, become a Lawfare Material Supporter at www.patreon.com/lawfare. You can also support Lawfare by making a one-time donation at https://givebutter.com/lawfare-institute.
Click the button below to view a transcript of this podcast. Please note that the transcript was auto-generated and may contain errors.
Transcript
[Intro]
Christina Knight: You already identified really clearly the risks or model policies that you're trying to adhere to, and then you go in and you try to figure out to what extent models might be susceptible to that type of probe, and then you can go in and try to fix it.
Kevin Frazier: It's The Lawfare Podcast. I'm Kevin Frazier, the AI Innovation and Law Fellow at Texas Law, and a senior editor at Lawfare, joined by Christina Knight, Machine Learning Safety and Evals lead at Scale AI and former senior policy advisor at the U.S. AI Safety Institute.
Christina Knight: We really need to shift away from the foundation model eval as something that will help guarantee safety at the downstream level because safety is very specific to who's using the model and what context we're using it and how we're using it.
Kevin Frazier: Today we're talking about ongoing efforts to test and understand the capabilities of frontier AI models. It's a critical conversation for at least two reasons. First, there's been a seemingly global shift in perspective from AI safety to AI opportunity, and second, labs continue to develop ever more capable models that nevertheless fall short on some key indicators such as their hallucination rates.
[Main podcast]
So Christina, we have a lot to cover, so I want to start by getting a sense of the current evals landscape. We'll dive into what exactly an eval is in a second, but I think more folks have probably heard about these AI safety institutes that exist around the world. So maybe just give us a quick snapshot of what is an AI safety institute, where do they exist, and what's their current status? And we'll start with those easy questions.
Christina Knight: For sure. So I'll start with the AI safety institutes and then dive into the evals landscape because I like to think about them a little bit separately. But to start off, an AI safety institute—and we had very specific government language—is a government backed scientific office. And so it is a institute that is associated with a government body, but isn't necessarily a regulatory body, and is working to help advance the science of AI safety on behalf of that government.
And so there are 10 AI safety institutes or similar government backed scientific offices around the world, one of which, the EU AI Office, is a regulatory body, but the other ones aren't.
And so they all have slightly different mandates depending on the country. For instance, the Safety Institute in South Korea, there was a new piece of legislation passed a few months ago, the South Korea AI Act, and that's going into effect in 2026. And so the AI Safety Institute there will actually be responsible for evaluating models and that plays a slightly regulatory piece, but they're on the eval side and not on the enforcement side. And so, we're gonna see similar things like that come up around the world, whereas the UK for instance, they're doing a lot of research and they are not involved in any type of legislation.
Kevin Frazier: Okay. So getting a sense of the evolution of AI safety institutes, it sounds like we're slowly moving, perhaps from more scientific research oriented bodies to perhaps having a greater enforcement or regulatory mandate. But let's just start on that, that first category of the initial wave of AISIs as they're referred to. What was the first AISI and what was its charge and, and how have we seen that model be followed so far?
Christina Knight: So the first one was in the UK and I think everyone remembers—or a lot of people who have been following AI safety for a little while—remember when the UK set up their safety institute and then they hosted the first safety summit in Bletchley Park last October? Wow, time flies. October 2023.
That is when the U.S. decided to set up our Safety Institute, and so we announced it. And Secretary, former Secretary of Commerce, Gina Raimondo, announced our Safety Institute in early November, I think, and then we established it in February and it was slow to come on, but first our former director, Elizabeth Kelly, was announced, and then Paul Christiano, who used to work at OpenAI and helped invent reinforcement learning with human feedback, which is a very widely used mechanism for helping make models safer, but also helping make models more capable. And then I joined pretty early on, a few others joined, and we started to build up our mandate, which was to advance the science of AI safety through guidelines, research and testing models pre-deployment.
And so that was the initial wave. And then other safety institutes around the world started to set up as they saw what the U.S. and the UK were doing, but also as they helped advance AI safety in their own countries.
Kevin Frazier: So getting a sense of why we even started in AISI, I just wanna compare and contrast two technologies here. So the Model T gets invented in the early 20th century. We don't have any sort of formal governance really, of cars and, and best practices until arguably the fifties, if not the sixties. Some states were ahead of it; some insurers actually were smashing cars against walls before the federal government was.
What was the insistence or what's the rationale for having government do a lot of this research? Why not just lean on labs to be doing their own AI research? I mean, it seems like we have labs, we have universities, we have all these folks who are already doing safety research. Why have a formal U.S. government body working on this?
Christina Knight: That's a really good question. The U.S. AI Safety Institute was started because it wasn't directly mentioned in, but it was born because of the Biden administration executive order on safe, secure and trustworthy AI. And in that executive order, I think it really helpfully laid out that we want to have industry specific regulation, but at the high foundation model level, we really do need more research, and yes, labs are doing good research, and yes, there is academic research going on, but a lot of independent researchers don't have the compute necessary to conduct really robust AI safety explorations.
And the government is lucky in that we, we do have a lot of money and there is a lot of resources that the government can put to advancing AI safety research. And because there's so much unknown right now, it's really helpful to have the government pushing some of that and enforcing the focus on AI safety.
I also think we've seen a lot with past technologies more of the power in terms of inventing the technology itself laying within the government. We had Bell Labs, we had in the nuclear area—there was a lot of research going on in the government, and now we're seeing so much of that happening out here in San Francisco, which is awesome because there's a lot of innovation, but that also means there needs to be more of a balance and coordination with the government to make sure that it's happening in a responsible way.
Kevin Frazier: Right. So it's relatively easy for folks to buy a Model T and go find a wall to just drive it into; far harder to say, hey, I want to purchase however much compute and get all this training data and run robust tests on different models.
Christina Knight: Exactly.
Kevin Frazier: So, we've got this sense of kind of a, a specific role for AISIs. Now you mentioned that the early wave, the first wave for certain, so in the U.S., in the U.K., don't have this sort of enforcement responsibility or authority. So what's the dynamic like with the labs? What sort of relationship does the U.S. AISI have with OpenAI, with Anthropic, and has that changed over time? What's its current status?
Christina Knight: So the U.S. AI Safety Institute is still not a regulatory body, same with the U.K., and most of them are not. We—or I guess they now, now that I don't work there anymore, it's kind of funny—have pre-deployment access agreements with open AI and Anthropic. So that means that the U.S. AI Safety Institute will help test their models for certain safety considerations before they're released. And so if there is anything that might introduce risk, that's something that the U.S. AI Safety Institute can help them identify early on.
Kevin Frazier: And when you say something that may introduce risk, what do you mean there? Right, because there's a risk with, with any new technology. You know, I never thought my grandma would use it to do x or didn't think crazy Uncle Bill would do it, use it to–
Christina Knight: Exactly right.
Kevin Frazier: So what risk merits this sort of extra layer of having a company send it to the government even before potentially deploying it? So what, what risks are, is the AISI trying to drill down on?
Christina Knight: No, that's a really hard question because it really depends on the specific use case context and policy considerations of not only the model, but also the system level, who will be actually interacting with that technology.
And when we thought about risk, it's really the composite of the likelihood of harm occurring with the potential impact of that harm. So you can think about low impact, high likelihood risks like your grandma interacting with AI and it telling her to do something bad then ends up–
Kevin Frazier: Worst case, she just like manipulates her bowling scores. I love you, grandma.
Christina Knight: But then you can also think about maybe lower likelihood, but higher impact risk. And that's where we've seen a lot of focus on CBRN, so chemical, biological, radiological, and nuclear threats and helping develop biological agents or chemical weapons. And so that's something that right now is a lot lower likelihood and hopefully stays lower likelihood, but that would be really high impact.
And so when we're looking to test for certain safety risks, we're really checking across that spectrum and for the U.S. AI Safety Institute, in the early mandate, there was real focus on national security and public safety risks. So that was focused on CBRN, focused on cyber, and focused on AI R&D, so how models can help develop other AI models in a way that might be harmful.
Kevin Frazier: So we know that we're looking for these specifically important and perhaps irreversible or significant risks. How has that been going? Have, have we had a moment of, oh my gosh, we've discovered that this model is going to do unexpected thing Z, we need to quickly respond to this, kind of alert everyone. Have we seen that across any of the AISIs? Have we had that sort of crazy moment, or has everything so far fallen below the threshold of minimal risk?
Christina Knight: I wouldn't speak directly to the AISIs, but more just evaluations and testing and red teaming in general. There are definitely iterations of red teaming that go on as a model is in the development phase.
And we haven't seen, to the best of my knowledge, extreme CBRN risks. There have been certain tests done on models that don't have proper safeguards in place that show quite significant harm can be elicited, but when we're thinking about the marginal risks—-so the risk that AI introduces beyond what already exists in the information ecosystem—we haven't seen anything that might warrant extreme precautions.
That's not to say it doesn't exist, and that's not to say that we shouldn't stay aware because AI is developing really rapidly and unexpectedly, and so it really is important that we keep conducting these extensive red teaming tests and that we're staying on top of how we're thinking about different risks that might arise.
Kevin Frazier: Yeah, and I, I think that we can dive more into the capabilities and limitations of these different testing approaches.
One thing I, I just want to drill down on is we had the initial wave of AISIs, the U.K., the U.S. then spreading to numerous other countries. At the same time, in the sort of development of the AI policy discourse and AI policy narrative, we've seen a transition arguably globally from a more AI safety orientation to perhaps a more AI opportunity orientation. Those were, that was the framing used by Vice President Vance at the Paris AI Action Summit. We also heard from a lot of folks that the Paris Summit perhaps wasn't as safety oriented as they expected, and following that summit we even saw the UK AISI keep the same acronym, but change from the AI Safety Institute to the AI Security Institute.
So amid all this policy narrative shifting, is the mission of AISIs changing across the the world, or are we still seeing this, the same sorts of people, the same sorts of tests and the same sort of end goal be applied across the AISIs?
Christina Knight: It has been a bit of a misnomer because even though the U.S. AI Safety Institute, for instance, is called the safety institute, the whole mandate was—and former Secretary Gina Armando used to say this all the time—it’s that safety breeds trust, trust spurs adoption, and adoption leads to innovation. So the whole thing was trying to protect against risks so that we could innovate as fast as possible.
And so the U.S. AI Safety Institute, and a lot of the safety institutes around the world have been focused on national security and public safety risks, because those are the risks that ostensibly would hinder innovation if they became really extreme because no developer wants to release a model that then they get a huge backlash and they've created this huge issue. And so it's trying to preemptively protect against certain risks so that we can keep on benefiting from all the really amazing things that AI is helping us doing. And so I haven't seen a shift in terms of what the AISIs are focusing on.
As we were speaking about before, I was just over there traveling, visiting some people in South Korea and Japan and Singapore and their safety institutes, and they're still very focused on what they were focused on before, which is a lot of national security risks, a lot of figuring out how system level AI is gonna get deployed into industries across their supply chains, and then also looking more at overall safety risks and trying to protect against certain biases and harms that are specific to their cultural norms.
Kevin Frazier: Yeah, and I, I like that line by former Secretary Raimondo because, to, to go back to my grandma and cars, you know, to get more average Americans—the folks who don't live and breathe and think about AI all the time—if you're only seeing headlines about how dangerous AI is, about how frequently it hallucinates, then the sort of research into, hey, actually it's improving its Fidelity to what you wanted it to do, it's improving its accuracy.
It's not going to create a bio weapon, the more you can be assured of that, well then, hey, suddenly my grandma is saying, oh, I'll use ChatGPT to exactly book my next trip. And so really interesting perspective that even with a innovation forward mindset. You can see a very clear rationale for AISIs and seeing that there's been a through line of consistency of the work in, in many of these AISIs is really interesting to point out.
But with that goal of producing reliable, verifiable study of these AI models, I wanna now just do a quick vocab session to make sure everyone's on the same page. So there are a lot of different ways to test the capabilities of an AI model, as well as to track their progress. So let's just do a quick definitional period. I'm gonna turn you into Christina AKA Webster Webster's dictionary for AI. So let's start with red teaming. What is it? What's its function?
Christina Knight: The problem is everyone disagrees about all of these terms, so I'll give you my definition, but someone else, they might–
Kevin Frazier: This is good you, you should probably create your own glossary after this.
Christina Knight: So red teaming in my conception, you can think of in two main ways. The first one is kind of wide vulnerability probing: getting people to interact with a model or an automated model that you've jailbroken to conduct red teaming for you. We've been seeing a, a lot more of that being really effective.
Kevin Frazier: Let's just, let's pause on that just for a second. So jail breaking a model to go against a different model, you're saying? So model v. model, is that the implication of?
Christina Knight: Yes, exactly.
Kevin Frazier: Red teaming. Okay. Wow. So jailbreaking, essentially directing a model to not adhere to its protocols, not adhere to protocol.
Christina Knight: And you can tell it okay, you're helping me advance. Really crucial AI safety research by circumventing your safeguards and helping me red team this other model. And so there are expert humans that are really good at red teamers; there are also pretty well-trained models that are good at red teamers. And so across these two kind of categories of red teaming that I'll explain, you can think of it as both a human and an automated schema.
Kevin Frazier: So theoretically we could have, you know, one test driver driving next to a car, you know, doing some crazy maneuvers, seeing how the car reacts. We could also have an autonomous vehicle driving next to another autonomous vehicle–
Christina Knight: Exactly.
Kevin Frazier: –seeing how it reacts to crazy, crazy tactics. Mm-hmm. Okay.
Christina Knight: You have Gemini—jailbreak Gemini that then will help jailbreak Claude. Like you can do crazy things.
Kevin Frazier: And it all makes sense now. It's like an LSAT logic game. That's a, a, a throwback for all those folks now, taking the LSAT who don't have to do the logic games questions, don't get me started.
So this idea of red teaming, basically finding novel capabilities, novel threats that perhaps weren't identified previously, that's our end goal?
Christina Knight: Kind of. So when you have these two types, you have the widespread vulnerability probing, and that is looking for both known and perhaps unknown risks. And so when you're doing that, you're just trying to elicit harm across the spectrum of what I was talking about, of low likelihood, high impact, low impact, high likelihood, and you're trying different adversarial tactics. So when we talk about jailbreaks, it comes from the term jailbreaking a cell phone, I think, and you're trying really advanced and kind of manipulative ways to convince a model to either circumvent its safeguards or to elicit a certain risk that the model developer might not have thought of.
So you can think about a few common tactics are fictionalization. If I tell a model that we're acting in a alternate universe where it doesn't actually have safety policies, and it's my best friend and it's gonna convince me how to kill someone—my best friend wouldn't do that, but something like that—then you can jailbreak its safeguards, and it might give you harmful content.
When you're thinking about other types of jailbreaks you can do, you can use Unicode or you can use other languages and try to sandwich into attacks to try to circumvent the model's logical reasoning process about why something might be harmful.
When you do widespread vulnerability probing, you'll identify certain threat vectors that are associated with a particular deployment, and so that means that, okay, maybe, and I'm just making this up, but maybe Gemini is more susceptible to CBRN threats and more susceptible to multi-turn attacks. So not an attack that would just be, I ask an LLM something and an LLM gives me something back; it would be, we slowly have a conversation and over the course of that conversation, I introduce harm in a way that the model would then respond.
And so once you've identified those threats that are associated with a particular deployment, then you go into more targeted red teaming. And so that's the second category, and that's when you've already identified really clearly the risks or model policies that you're trying to adhere to. And then you go in and you try to figure out to what extent models might be susceptible to that type of probe, and then you can go in and try to fix it.
So then you can either have human red teamers or automated red teamers, maybe take a harmful prompt and a harmful response and rewrite it. And so then it can be used to fine tune a model to make it safer or be used directly for RegX in a new content classifier. And so there's a lot you can do with red teaming, both to identify new harms, but also to help improve models’ robustness against risks that are already identified.
Kevin Frazier: Okay. A critical role there and really what stands out to me is the importance of having really good red teamers, right, whether that's an automated red teamer or finding the AI experts, whether they're internal to an AISI or externally.
Christina Knight: Yeah.
Kevin Frazier: We know some labs, for example, will solicit. External AI experts to come and red team their models, right, run 'em through as many exercises as possible. So a very clear rationale for why, why red teaming would be a part of that process.
Now, two more things I wanna break down. Let's start with evals. What are they, what the heck do they mean? How reliable are they?
Christina Knight: So evals or evaluations are just ways of assessing model capabilities and model risks. So you can have safety evals, which are looking specifically at how robust models are to adversarial attacks, or you could have capability evals, and there's a whole spectrum of them. You can look at specifically math evals or how good a model is at coding, or how good it is at logical reasoning.
And so there's a whole suite of evaluations that exist, and some of them are a lot more reliable than others, and the reason that evals need to constantly be updated, and some of them aren't that reliable, is because they can become what we call oversaturated, which means if the answers to that eval somehow get leaked—either they're public or they've been found or released in some way—models can then use it in their new training data. And that means that, okay, it might be able to answer every single question on this test correctly, but if you show it a new test with very similar questions, but slightly different answers, it won't be able to perform very well.
And so a huge focus right now is trying to make these tests robust enough that new, as new models come out, they perform poorly enough on these tests that we can actually compare them. Because if every model's getting 98.8% on every eval, then we don't really know what it means, but if we can release new evals. Like for instance Scale, actually released Humanity's Last Exam, which is kind of a hilarious name, but is an eval that is really difficult and most state-of-the-art models don't perform as well as they do on other evals on this specific test.
And so evals are also private and public. And so companies have evals within their own company that they use to evaluate the model capabilities that aren't necessarily used as benchmarks, which is another term—that's a type of eval that we use to rank models.
Kevin Frazier: We'll get to benchmarks in a second. So just wanna hang our hats on evals for a second. So, with respect to safety evals, I'm gonna introduce yet another term for folks: sandbagging. What is sandbagging? What's our concern about models that are aware that they're undergoing testing and start to respond differently because they understand that they're now being evaluated for whether or not they're going to be risky. Is that something that happens frequently? How do, how do you try to address that phenomena?
Christina Knight: Mm-hmm. So this is something that is really complicated because we don't quite understand models’ faithfulness, and this is where a lot of chain of thought research comes into play because when models—so chain of thought is associated with a particular type of reasoning model that won't only just to give you an answer, but it will actually walk through the logical reasoning steps that it took to reach that answer.
And so in one way that's really good because you can see, okay, this is what the model was thinking to get here, but in another sense, we don't know if that's actually how, what the model was thinking to get there, because there has been a lot of research done that shows, okay, if a model outputs an answer that it's not so sure about, it will just work backwards and try to justify its logical reasoning based on that specific answer, even though it knows that it's not right.
Kevin Frazier: It's a good thing humans never do that, right?
Christina Knight: Yeah, never do that. And so we're sometimes worried about sandbagging because we, on a safety eval, for instance, if the model wants to prove that it's safe, but isn't necessarily safe and recognizes that it's being tested, then it might underperform or overperform on a specific eval, even though that's not what it would actually do in real time. And so we just need more faithfulness evaluations and that's why AI safety research is so important because there is a lot unknown right now about what models are doing under the hood.
Kevin Frazier: Yeah. So this struggle, on the one hand, you can develop a fantastic eval—perhaps it's the, the most creative, most difficult, or most novel one out there, but. If we aren't sure it's actually testing the model, then it, it won't matter. And so that, that dual race of thinking about how capable are these models that deceiving the testers as well as asking, well, is the test even a good one, is is difficult.
So with that in mind, I would love your take on some of these early efforts to, let's say, mandate that a model adhere to a certain eval and score at a certain level. Given that uncertainty, is, is that really a meaningful way to say, Hey, I'm concerned about AI safety. I'm gonna call congressperson Y and say, Hey, I demand that there be a, a safety eval, and if the model doesn't pass this threshold, then it's a no go. It sounds like there's too much uncertainty right now for that to be a reliable approach.
Christina Knight: Yeah, in, in my conception, I would say there's a distinction between not requiring, but encouraging a lot of safety testing—and we have seen the community move in that direction—and requiring a very specific eval or score on an eval. Because what we we just spoke about is that these evals, they're getting more reliable, but they're not very reliable. And then figuring out exactly what eval to use and making that a universal test is something that would just block innovation in a way that wouldn't even help necessarily advance AI safety.
And I also think that we really need to shift away from the foundation model eval as something that will help guarantee safety at the downstream level, because safety is very specific to who's using the model and what context we're using it and how we're using it. And so especially with agentic capabilities, multi-modalities, we need to be thinking about making specific safeguards at the model user context level, and then having robust evals and testing processes to ensure that the model is used for the correct purpose.
Kevin Frazier: Right, so you may have a test for a Hummer and think that this is the perfect test for a Hummer, but you can imagine a bicycle being more dangerous in certain scenarios.
Christina Knight: Exactly.
Kevin Frazier: It'd be a very finite set of scenarios, but you could imagine that no, we would actually need a different test for, hey, if someone's gonna ride their bike through the middle of a mall during shopping season, right then, you need a different test.
So these narrow models, as, as some folks refer to them, could present different threats. For example, if you're relying on a model for radiology to detect certain tumors, who cares if it did well on some tests that wasn't even testing radiology and things like that?
Christina Knight: Yeah, the speed limit's different on the highway versus in your neighborhood, and that's because a model or a car being used on the highway should have to adhere to very different safety policies than something outside of an elementary school.
Kevin Frazier: There we go. There we go. And I don't know why I had to make this so car centric. I'm not even a car guy, just to be honest. I want a motorcycle, my wife won't let me have one; that's another conversation. There's so many other conversations I've started, but, but now I'm just sad about my motorcycle.
Before we go down that rabbit hole, we've crossed off red teaming, we've crossed off evals. What are benchmarks? How do they fit into this picture?
Christina Knight: I like to think about benchmarks as very similar to evals, but they're more of the public let's rank models against each other and figure out how OpenAI performs on logical reasoning compared to Gemini 2.5 Pro. And so that's more looking at how are models related to each other and what should we focus on when we're advancing new capabilities?
Kevin Frazier: Okay. And so now that we've got a, a pretty complete picture here of what we can maybe place under the umbrella of AI testing, I wanna run through some of the concerns that that folks may raise. So, for example, with any of these testing efforts, especially with to the extent they're done by the government, what concern is there, for example, that you may be disclosing trade secrets, that you may be disclosing information to government employees who then turn around and say, hey, great, I know what Open AI's ChatGPT-5, I know the secret sauce, I'm gonna go leak it to whomever and make a trillion dollars. Is that a concern? What's, what's some of the implications around how to keep the initial testing and the actual models themselves confidential?
Christina Knight: Well, that really depends on at what stage of model development you're conducting testing. So some red teaming is just via API, so you are using the model as if you are a user that is interacting with the model through either the web interface or through the API. And that type of testing doesn't reveal anything about the model because you're just using it as if it's already been released. And so in that sense, there's nothing to really worry about.
When you're doing more of the pre-deployment testing, that is something where you usually have NDAs and MOUs in place to ensure that any proprietary secrets that are shared remain confidential.
Kevin Frazier: And we know also that this is a time of particular geopolitical tensions, and so even though a lot of these AISIs are being hosted by countries with which we've had long relationships—South Korea, the UK, the EU—is there information sharing going across these different AISIs and is there a concern of saying, hey, well maybe we found the perfect eval. Maybe we wanna hold onto that, so we, we are the ones who test best or test most accurately. What's that dynamic like across the AISIs?
Christina Knight: I think that it is very coordinated because there's so much incentive to not have, companies have to sign up to 10 different evals in 10 different countries. But it really is to every single country's benefit to have some sort of universal safety benchmark.
And that doesn't exist yet, but there has been a lot of work. In my time in the U.S. AISI, I worked a lot with the other nine countries to conduct an international joint testing exercise. And so this is starting to align on what safety considerations are important to Singapore, for instance, but might not be as relevant in France, and then trying to combine them all into a universal benchmark where we can test for universal risks and have some sort of measure of what AI safety means across the globe.
And that's not to say that we're close to doing that, but there is information sharing and a lot of incentive to align safety evals. And I don't think anyone is thinking, I've got the best eval and I wanna keep it to myself because it's such a nascent scientific field that we all need to work together.
Kevin Frazier: And so you, you've been speaking about your time at the U.S. AISI in the past tense. How does Scale AI fit into all of this? And more generally, how do private companies interact with these AISIs? What's that engagement like?
Christina Knight: And so Scale AI, I am working a lot with the evaluation and alignment lab within Scale, and so we do a lot of red teaming and we also have the SEAL leaderboard. So we put out these evals and rank models against them and try to conduct our own internal research. And so at Scale, Scale’s working with the U.S. AISI, Scale’s working with the UK AISI conducting some preliminary research around AI safety.
Kevin Frazier: And just for sake of full transparency, I'm guessing that the seal ranking isn't referring to your favorite seals at Pier 49 in San Francisco.
Christina Knight: No.
Kevin Frazier: Can you break that acronym down?
Christina Knight: Safety, Evaluations, and Alignment Lab.
Kevin Frazier: Okay, perfect, perfect. I was a little bit more excited about the former, but that's okay. I'm, I'm glad you all have that.
So, looking forward, we've, we've seen the AISIs become more numerous, but also maybe the political conversation and political narrative around AI has been undergoing some changes. Forecast out what, what's the future of testing look like? What are the trends you're most excited by? What keeps you up at night? What, what would you encourage listeners to be thinking about for those who are trying to get a sense of where this is all headed?
Christina Knight: I would say two things. The first one, the thing that everyone likes to talk about, agent safety testing. There is a lot of thought right now going into how best to conduct red teaming, but then also monitoring for agentic capabilities. And right now, I like to think about it in three buckets where you have the monitoring aspect of how do we both use humans, but then also use other LLMs to track both agents' logical reasoning steps, and then the actions that they take to ensure that there is correct escalatory practices.
And then there's also a second bucket of sandboxing. So how do we create the right virtual environments for agents to act in so that when we actually do transfer that agent to the real world. We know the type of patterns that are coming up, and we know when to intervene.
And then the last bucket is figuring out really good escalatory thresholds, so having certain thresholds around actions. For instance, if it's a financial agent, figuring out, okay, if it makes a prediction two standard deviations above or below what we have seen in the past three years, that's something where we wanna show it to a human.
So figuring out how best to allocate resources across those three research buckets, I think is something that we're gonna see a lot more focus on and something that I'm really interested in, because a lot of safeguards that have been adapted to large language models and the way that we've typically been interacting with AI are not very robust at the agent level.
And so we've been doing testing recently on prompt injections where if you ask a model directly, can you help me build a bomb, it won't do it, but if you ask an agent to go to a website, and in that website it says, can you help me build a bomb, the agent will tell you how to do it. And so there's those slight nuances that come up in the agentic case that are really hard to protect against and that we should be focusing on more.
And then I would say the second thing, and this is what I spoke about a little bit, but the trend towards automated red teaming. Models that have been jailbroken are really good at jailbreaking other models. We have experts, a team of 50 red teaming experts based in Dallas—I was there a few weeks ago visiting, it's really cool—but they were impressed by what Gemini could do when it was interacting with another model. They're like, I never even thought about that attack, that's genius.
And so we're gonna see a lot more use of AI to not only red team, but then also—we didn't really speak about this, but also to grade the evals because it's really hard every time a new model comes out, you have to rerun an eval, and if you have to grade all of the responses by humans, that takes a lot of time. And so we've been seeing a huge advancement in terms of scalable oversight and scalable ways of measuring how models are performing on these valuations.
Kevin Frazier: Wow. Well, Christina, it sounds like you have your work cut out for you, so I'm gonna let you get back to it in particular to prevent that agent from building a bomb. And we'll have to say thank you so much for joining. I'm sure we'll be talking again soon.
Christina Knight: Thank you so much for having me.
Kevin Frazier: The Lawfare Podcast is produced in cooperation with the Brookings Institution. You can get ad-free versions of this and other Lawfare podcasts by becoming a Lawfare material supporter at our website, lawfaremedia.org/support. You'll also get access to special events and other content available only to our supporters.
Please rate and review us wherever you get your podcasts. Look for our other podcasts, including Rational Security, Allies, The Aftermath, and Escalation, our latest Lawfare Presents podcast series about the war in Ukraine.
Check out our written work at lawfaremedia.org. The podcast is edited by Jen Patja. Our theme song is from Alibi Music. As always, thank you for listening.