Scaling Laws: AI and Energy: What do we know? What are we learning?
Published by The Lawfare Institute
in Cooperation With
Mosharaf Chowdhury, associate professor at the University of Michigan and director of the ML Energy Lab, and Dan Zhao, AI researcher at MIT, Google X, and Microsoft focused on AI for science and sustainable, energy-efficient AI, join Kevin Frazier, AI Innovation and Law Fellow at the University of Texas School of Law and a Senior Editor at Lawfare, to discuss the energy costs of AI.
They break down exactly how much energy fuels a single ChatGPT query, why this is difficult to figure out, how we might improve energy efficiency, and what kinds of policies might minimize AI's growing energy and environmental costs.
Leo Wu provided excellent research assistance on this podcast.
Read more from Mosharaf:
- https://ml.energy/
- https://www.technologyreview.com/2025/05/20/1116327/ai-energy-usage-climate-footprint-big-tech/
Read more from Dan:
Find Scaling Laws on the Lawfare website, and subscribe to never miss an episode.
This Scaling Laws episode ran as the October 17 Lawfare Daily episode.
To receive ad-free podcasts, become a Lawfare Material Supporter at www.patreon.com/lawfare. You can also support Lawfare by making a one-time donation at https://givebutter.com/
Click the button below to view a transcript of this podcast. Please note that the transcript was auto-generated and may contain errors.
Transcript
[Intro]
Alan Rozenshtein:
When the AI overlords take over, what are you most excited about?
Kevin Frazier: It's
not crazy, it's just smart.
Alan Rozenshtein: And
just this year, in the first six months, there have been something like a
thousand laws.
Kevin Frazier: Who's
actually building the scaffolding around how it's going to work, how everyday
folks are going to use it?
Alan Rozenshtein: AI
only works if society lets it work.
Kevin Frazier: There are so many questions that have to be figured out, and nobody came to my bonus class. Let's enforce the rules of the road.
[Main episode]
Welcome back to Scaling Laws, the podcast brought to you
by Lawfare and the University of Texas School of Law that explores the
intersection of AI, policy, and, of course, the law. I'm Kevin Frazier, the AI
Innovation and Law Fellow at Texas Law and a senior editor at Lawfare.
Artificial Intelligence is sometimes framed as a magic bullet
for solving big problems from discovering new drugs to planning smart cities. But
the infrastructure that powers these models uses electricity and water and a
lot of it.
A December 2024 Department of Energy report found that AI data centers already account for 4.4% of U.S. electricity consumption, a share that is estimated to double or triple by 2028. That's a trend that seems likely to continue even after this decade, given that OpenAI announced the construction of five new data centers as part of Project Stargate and other labs seem poised to follow suit.
So why exactly does AI use so much energy? And is it cause for alarm or merely a fact of technological advance? Today on Scaling Laws.
To explore these questions and more, we have Mosharaf Chowdhury, who's an associate professor at the University of Michigan and one of the directors of the ML Energy Lab, and Dan Zhao, an AI researcher at MIT, Google X, and Microsoft who focuses on AI for science and sustainable, energy-efficient AI.
Giddyup for quite the ride. To get in touch with us, email
scalinglaws@lawfaremedia.org. And with that, we hope you enjoy the show.
Thank you to both of you for coming on.
Mosharaf Chowdhury:
Thank you for having us.
Dan Zhao: Thank you for having me.
Kevin Frazier: Awesome. Mosharaf, what would you say is the common understanding of AI's energy consumption, to the extent there is one?
Mosharaf Chowdhury: So, AI's energy consumption has been in the news a lot for the last three or four years. And at the beginning there were no good tools to precisely measure how much energy these models consume for training and inference.

And as a placeholder, because of the lack of better tools, people were using estimations to get a rough order of magnitude of energy consumption. That essentially means taking how much a GPU consumes at its peak, then how many GPUs you might have, and then assuming that many GPUs are needed to do a training run or for inference. Then you just multiply all these big numbers and assume they are always running, all the time. Then you end up with a very large number, which has been reported as bigger than the Netherlands or Ireland and so on.

So [inaudible] in that sense, there is a lot of overestimation. It used to be, and that led to many news articles, which honestly, I don't know, make sense. If you see those numbers, people would be concerned. So I think that's what the common understanding is referring to.
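For readers who want to see the shape of that back-of-envelope math, here is a minimal sketch; the GPU count, peak power, and duration below are illustrative assumptions, not figures from the episode, and the point is only that multiplying worst-case numbers together produces the kind of headline totals Mosharaf describes.

```python
# Rough "peak power x GPU count x wall-clock time" estimate of the kind
# Mosharaf says was used before better measurement tools existed.
# All inputs below are illustrative assumptions, not reported figures.

NUM_GPUS = 10_000          # assumed size of the training cluster
PEAK_WATTS_PER_GPU = 700   # e.g., an H100-class accelerator at its rated peak
TRAINING_DAYS = 90         # assumed wall-clock duration of the run

hours = TRAINING_DAYS * 24
energy_wh = NUM_GPUS * PEAK_WATTS_PER_GPU * hours   # watt-hours
energy_gwh = energy_wh / 1e9                        # gigawatt-hours

print(f"Estimated training energy: {energy_gwh:.2f} GWh")
# This overestimates, as discussed: real GPUs rarely sit at peak power for
# the whole run, and idle or partially loaded time is counted at full draw.
```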
Kevin Frazier: Right.
So we've seen all these stats, as you mentioned, from the early days in
particular of AI, where seemingly every day we were learning, oh, now it's the energy use of a hundred thousand homes for three months at the height of summer, and now it's the amount of energy of Ireland, and now it's New York during the Super Bowl, and so on and so forth.
Those are great for tweets, and I'm sure they get a lot of viral traction, but maybe they're not the most empirically driven. Mosharaf pointed out the critical difference between training and inference, which comes into this picture and helps us get a sense of where this energy use is actually coming from.
Dan, can you kind of explain for listeners who perhaps aren't
as steeped in AI vocab as the three of us presumably are, what is the
difference between training and inference, and how is that relevant to this
energy conversation?
Dan Zhao: Sure. I
think that's a very good question. So, a while ago––and you know, when I say a while,
I contextualize that in AI space, so just a year or two ago, right––people were
starting to––
Kevin Frazier: So in
AI time, that's at least like 15 years ago. Exactly
Dan Zhao: Right. And
so basically, if you think about it this way, everyone was very occupied with
the energy costs of training, right?
These large behemoth models needed to undergo something called pre-training, which basically means you take a very large model, you feed it tokens, and essentially it trains very slowly, depending on your hardware, the generation of GPUs you have, the networking, the clusters you have. So nothing really accessible to your common folk with a single consumer-grade GPU, for example.
But these sort of large companies, large labs, they're
basically training these models on tons and tons of tokens, right? And
basically getting this sort of large language model out of it. So you can think
of your ChatGPT for example. You can think of your Llama that are released, for
example. And so during these times, you are actually trying to train your model
to get up to a certain level of performance, and then that model should be good
enough for all sorts of downstream tasks, from chatting, tool use, so on and so
forth.
Inference is a little different because inference has kind of
allowed people to think a bit more flexibly about taking these models, so they
don't necessarily need as many computational resources for as long or as
intensively. You basically think about taking a model that's already
pre-trained and then really setting it up for deployment downstream, right?
So when you think about inference, it's basically the thing that people nowadays are much more familiar with. When they engage with chat, for example, you're basically throwing in a query, and a response comes out the other end. It seems easy, but under the hood, there's a whole bunch of complicating factors there.

There are ways to route different requests. If you think about it, millions of requests maybe every five minutes go to ChatGPT. They need to find a way to route them through their servers, make sure their servers don't crash. They need to make sure that things are maintained properly, that quality doesn't degrade.

They need to think about toxic outputs, about harmful content they need to filter out. And so inference is really, you take a trained model, you throw things in, and then you hopefully get desirable things out. And that's much faster and, in some senses of the word, much easier than training, for example, is.
Kevin Frazier: And so
one of the things that I got called out on in preparation for this podcast, and
thanks to excellent research from Leo Wu was the assumption I assume many
people hold, one of those common understandings that the energy use has to be
this training function, right? Whenever we hear about training these large
language models, and in particular the frontier models here, we think about
these massive data centers being built in the middle of nowhere with tens of
thousands of GPUs and even more CPUs. They're running 24/7, 365 days a year and
just, they need water to stay cool. They need tons of energy sources. Surely
it's got to be the training that's driving these energy costs.
Mosharaf, why is that not the case? Where are we seeing the preponderance of energy use coming from?
Mosharaf Chowdhury: That's a very good question. Thank you. Training essentially happens, say, only once. It's not really once, but just to get a sense of it: you train a model once, and then when it gets deployed, as Dan was saying, millions of people, at this point hundreds of millions of people, are using the same model.

And of course you retrain the model, you fine-tune it many different ways, but none of that comes close to how many people are actually interacting with the model. So a single training instance consumes a lot of energy, tens of thousands of watt-hours, based on some of the smaller, open-source models that have been published. We don't really know how much energy is actually consumed by training ChatGPT-scale models.
It looks big because we can point to it: this model took this much energy to train. But it gets much, much bigger when you think about all the millions of people sending, you know, tens of queries per person. Each inference request consumes a small amount of energy, but when you multiply it by hundreds of millions, they add up to a very big number.

And all these big data centers, they are not necessarily just serving training. They're also serving inference requests. So the ratio between the energy consumption of training and inference depends on what company you are talking about and what model, but many of the numbers I have heard from different providers range from 30/70 to 40/60.

I mean, the smaller one being training and the bigger one being inference. So at the beginning, when these models were not very popular, you could easily think that training consumes the lion's share. But as more and more people use it, training easily gets dwarfed by all the people and all of their requests.
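To make that point concrete for readers, here is a small worked example; the per-query figure echoes the roughly 0.24 watt-hour median that Kevin cites from Google later in the episode, while the user and query counts are illustrative assumptions.

```python
# Compare one (small-model) training run against a day of aggregate inference.
# Numbers are illustrative; only the per-query figure echoes one cited later
# in the episode (Google's ~0.24 Wh per median Gemini query).

WH_PER_QUERY = 0.24          # median energy per query, in watt-hours
USERS = 100_000_000          # "hundreds of millions" of users, rounded down
QUERIES_PER_USER_PER_DAY = 10

TRAINING_RUN_WH = 50_000     # "tens of thousands of watt-hours" for a smaller model

daily_inference_wh = WH_PER_QUERY * USERS * QUERIES_PER_USER_PER_DAY
print(f"Daily inference : {daily_inference_wh / 1e6:.0f} MWh")
print(f"One training run: {TRAINING_RUN_WH / 1e6:.3f} MWh")
print(f"Ratio           : {daily_inference_wh / TRAINING_RUN_WH:,.0f}x")
# At this scale, a single day of queries dwarfs the training run, which is
# why the training/inference energy split tilts toward inference over time.
```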
Kevin Frazier: And I
think what's critical to point out about the fact that inference is driving so
much of this energy use, the calls you're making to the LLM, the prompts you're
sending. Presumably this is only going to skyrocket as we see AI agents become
more and more ingrained to our daily lives, these AI tools that are able to act
autonomously on our behalf if they are taking all these actions, pursuing all
these tasks for us.
The amount of inference energy costs should presumably only go
up. Which raises a lot of questions that we'll get to in a second, such as
whether we should say thank you to our AI or not, but I'll leave that aside for
just one second.
Dan, we have talked a little bit about the fact that we need a
lot of water. We need a lot of energy and electricity to make sure that we can
train these models and engage in inference as well. What does it actually look
like in terms of where the water's coming from, what sources of energy we're
relying on? And why might that be a concern for folks from a sustainability
standpoint?
Dan Zhao: Sure. So if you think about it, and I'll probably talk very simply without going into too many of the details, just so folks get the big picture, right?
Kevin Frazier: You’re
just doing it for me, which I appreciate.
Dan Zhao: I mean,
everyone appreciates the finer details, but I think the big picture here is
probably more important.
So if you think about, let's say, when I was at MIT, for example, at Lincoln Labs, at the Supercomputing Center, you walk in and you basically see racks among racks of these GPUs that are basically just sitting there. And when you walk in, the first thing you notice is that it's quite hot, right?

Like these GPUs, people can think about these GPUs as small black boxes. Essentially, they consume power because they need the power to run calculations, to do a huge number of operations. And basically these GPUs kind of run, and you know, there's this common trope for me that GPUs go brr. They do, they indeed do go brr, and to do so basically requires switching power through billions of transistors on these GPUs to carry out the mathematical operations for either training or inference or whatever it might be.
Right. So obviously the GPUs sit at the center of that. Now with the GPUs, even when the GPUs aren't active, when they're idle, these transistors still have some static power leakage. So there's still power being consumed there, even if it's not being used. And one thing I should note is also that not many people know how to get every single drop of efficiency out of GPU usage, right?

When people load a model and run it on a GPU, they don't know things like, oh, I need to basically overlap my communication costs with compute, I need to keep my GPUs busy. If you didn't know what that meant, forget I said it, it's not a big deal. But GPUs are an expensive resource and we need to keep them busy and running.

Other things involve, say, saturating the memory of your GPU; otherwise a lot of the power that's going to be consumed anyway is kind of going to waste. Right? And so there are these considerations that are specific to the GPU as well. So that's one aspect. Then there's also––
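As an aside for readers, "overlapping communication with compute" can be as simple as starting a data transfer on a separate CUDA stream while the GPU keeps doing math on the default stream; the sketch below is a minimal, generic PyTorch illustration of that idea, not code from either guest's lab, and the tensor sizes are arbitrary.

```python
import torch

# Minimal illustration of overlapping a host-to-device copy with compute,
# one of the "keep the GPU busy" habits mentioned above. Requires a CUDA GPU.
device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

x_cpu = torch.randn(4096, 4096, pin_memory=True)   # pinned memory enables async copies
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

with torch.cuda.stream(copy_stream):
    x_gpu = x_cpu.to(device, non_blocking=True)    # copy runs on the side stream

c = a @ b                                          # compute proceeds on the default stream
torch.cuda.current_stream().wait_stream(copy_stream)  # wait before touching x_gpu
d = x_gpu @ c
torch.cuda.synchronize()
print("Overlapped copy and compute finished:", d.shape)
```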
Kevin Frazier: Pause
on that though.
Sorry. So you're essentially saying that because GPU time is so
scarce and so valuable, right? We hear all the time about new startups that are
just renting out GPU space, for example, just so that folks can train new tools
or allow for greater inference. So, the economic incentive here is to run these
as much as possible for as long as possible, rather than saying, oh, you know,
let's take a break and let them breathe and stretch before their next training.
Dan Zhao: There is a
human dynamic here as there is to all human systems, right. And as a former
economist, I tend to appreciate the frustrations that go with that. Let me put
it this way, democratizing AI is great, right? So, more people have access to
resources, to compute, to run experiments, and hopefully innovation will be
born out of those as well.
But on the flip side, you have more and more people using GPUs inefficiently, and those GPUs are going to be run one way or another. It's in some form a tragedy of the commons, if you think about it, right? Because suppose that, and this is what we did at MIT, for example, we came out with a system that basically capped the power draw of GPUs.

So what that basically meant was we limited the amount of power these GPUs could draw. And this is currently in practice at MIT's Supercomputing Center.
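For the curious, a power cap of the kind Dan describes can be set through NVIDIA's management library; the sketch below uses the pynvml bindings to read a GPU's allowed power-limit range and set a lower cap. It is a generic illustration, assuming an NVIDIA GPU and administrative privileges, not the actual system deployed at MIT.

```python
import pynvml

# Illustrative sketch: cap one GPU's power draw with NVML (needs admin rights).
# This mimics the kind of power capping described above, not MIT's exact system.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
print(f"Allowed power limit range: {min_mw/1000:.0f} W to {max_mw/1000:.0f} W")

target_mw = int(max_mw * 0.7)          # e.g., cap at 70% of the rated maximum
target_mw = max(target_mw, min_mw)     # stay within the card's allowed range
pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)
print(f"Power limit set to {target_mw/1000:.0f} W")

pynvml.nvmlShutdown()
```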
Now what this did was save a lot of energy in aggregate, and we found a sweet spot where users don't notice any sort of perceptible difference in their job performance.

The problem there is, well, if people realize that we've capped their power and they were still able to get their jobs done, you know, in short order, and they were still saving energy, well, people might still queue up more jobs to run anyway, just because, right?

And so that energy or that effort you spare might still get used up anyway, because people are going to say, oh, those GPUs are still going to be running, I might as well be running something on them. Right?
And so when more people come into the deep learning world and
they're running GPUs, you know, learning how to run GPUs efficiently is very
important.
And sometimes it's the job of big labs and other places that
kind of do the behind the scenes magic to make sure that you squeeze every drop
of efficiency out, if that makes sense.
Kevin Frazier: Right.
So essentially you could think of a novice farmer going off and trying to
gather all the wheat. They might do it in a horribly inefficient manner. They're
expending all these man hours, they're using the wrong tools, so on and so
forth, way more labor-intensive.
Whereas the trained farmer, the big labs, for example, they know how to harness these resources most efficiently. Is that a fair characterization?
Dan Zhao: It is fair,
but it's also clear that, you know, there's this knowledge and information
asymmetry, right?
Which is natural. You can't really blame one side or the other.
Like for example, even if you had folks who––if you gave folks full access, for
example, to the GPU, to the scheduler behind the GPU resourcing, if you gave
people access to the networking behind the cluster, like, they wouldn't really
know what to do with that anyway, right?
And so it's not necessarily a fault to be assigned to either side. It's just a natural consequence of any new technology that needs to be used, especially if this new technology comes with these sorts of costs that need to be borne by one side, inevitably, right, and that then spill over in aggregate into society.

So it is a bit of a natural circumstance as well.
Kevin Frazier: Right. So you have to learn how to farm before you start planting stuff. That's right. That's going to be a natural process.
So, Dan, I cut you off. You were going, you were mentioning
that we have the GPU kind of incentives in terms of running frequently or using
them to the greatest extent possible.
What's this second kind of driver of energy consumption?
Dan Zhao: Yeah, so when you think about a lot of these things that come up nowadays, and I think Mosharaf mentioned this as well, it's simply a matter of demand as well.

More and more people are wanting to use these things. More and more people might use them inefficiently. And as more people use these things, and as this technology and its span of capabilities grow, the more we can do with these models––I'm not sure if you saw Sora 2 come out the other day, its video generation capabilities are amazing––the more energy that's going to take. And then the capabilities––
I'll give you a more concrete example. So, as capabilities
grow, modalities grow. So for example, video generation is one example that we
didn't really think about as much as text generation back in the day. And
again, back in the day is like two years ago.
And so things are going to grow. And then this also plays into what we were mentioning earlier, Kevin, about agents, right? So for example, most recently, when I was a senior research scientist at Microsoft: when you think about agents, the number of iterations they take for a single action is far more than a single round of chat between you and ChatGPT.

And then you can also use images. Images also consume quite a few tokens, more so than text sometimes, depending on the number of images you're using to do a single task, which is much more complex than answering a single question, for example, on a computer. That's going to drive things.
So the increasing capabilities, the increase in demand, along
with what I mentioned at first, I think those three make a powerful combination.
Kevin Frazier:
Thanks, Dan. And you're affirming one of my bad jokes, and I have many of them,
which is if a picture is worth a thousand words, then a generated image is
worth 10,000 tokens. So you're welcome to steal that from me anytime you want.
Mosharaf, what are you doing about this? What is your team
doing to help solve this issue? Because we mentioned earlier at the outset that
there are these holes in this conversation. And yet, you know, Google released
a report saying that there's about 0.24 watts per median query that gets used
when you're using Gemini, and apparently that's the same as operating a
microwave for a second.
I don't really know what that means in terms of how I should
change or alter my approach to using Gemini or ChatGPT, but the labs are
sharing information. Sam Altman sent out a tweet about some of OpenAI's energy
uses. What gaps are you filling, or how is your team working on this issue?
Mosharaf Chowdhury: So, thank you for asking this question.

So yeah, we have to go back to earlier, when I said that there were gaps a few years ago because people simply didn't have tools. So, for the last almost five years, at the University of Michigan and with many of my colleagues across the U.S., I have been leading the ML Energy Initiative.

And we are trying to build tools to precisely measure and understand and then optimize the energy consumption of AI. So one of those tools is Zeus, which we built from the ground up. It can interact with different kinds of GPUs and CPUs and memory and collect, over time, how much energy is being consumed by the workload that is running, the workload being AI inference or training.
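As a pointer for readers, Zeus is open source (linked via ml.energy in the show notes); a measurement loop with it looks roughly like the sketch below. This is hedged as a sketch written from the project's documented usage pattern, with a stand-in workload, so the current API may differ in detail.

```python
# Rough sketch of measuring a workload's GPU energy with Zeus (ml.energy).
# Hedged: based on the project's documented usage pattern; check the current
# docs, as the API may have evolved.
import torch
from zeus.monitor import ZeusMonitor

monitor = ZeusMonitor(gpu_indices=[0])     # measure GPU 0

monitor.begin_window("matmul_workload")
x = torch.randn(8192, 8192, device="cuda")
for _ in range(100):                       # stand-in for a training/inference step
    x = x @ x
    x = x / x.norm()
torch.cuda.synchronize()
measurement = monitor.end_window("matmul_workload")

print(f"Elapsed time : {measurement.time:.2f} s")
print(f"Total energy : {measurement.total_energy:.1f} J "
      f"({measurement.total_energy / 3600:.3f} Wh)")
```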
And using Zeus, we have measured, as precisely as possible within the GPU, how many watt-hours or joules of energy are being consumed. And we produce this output, which I will refer to as the ML Energy Leaderboard, where we measure the energy consumption of open-source models. So it led to two things.
One is, it gave people a tool, but also a methodology for how to think about energy measurement. At the beginning of the podcast, when you mentioned this MIT Tech Review article, they actually used our tools and worked with us for almost six months to measure all the models we did, plus a couple of other questions they had.

So these tools allow us to actually help journalists collect all these numbers. And that also caught a lot of attention from different companies. For example, if you read this article or white paper from Google, you will see that they also referred to the ML Energy Benchmark, and they talk about how we measured and how they have expanded the methodology to make it even more accurate by considering idle machines in their data centers.

And they came up with this median number, which, by the way, should be watt-hours instead of watts, because it's energy. But that's okay. And so that's one part. What we are doing is making tools available so nobody can say that it's impossible, that it's very hard, that we don't have the means to measure it, that it's difficult.
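To unpack that correction for readers: the Google figure Kevin cited is best read as about 0.24 watt-hours per median query, and the microwave comparison follows from simple arithmetic; the microwave wattage below is an assumed typical value, not a number from the episode.

```python
# Why "0.24 watt-hours per query" lines up with "about a second of microwave time".
# The microwave's power rating is an assumed typical value (roughly 1,000 W).
QUERY_ENERGY_WH = 0.24
MICROWAVE_WATTS = 1000                             # assumed typical microwave power draw

microwave_wh_per_second = MICROWAVE_WATTS / 3600   # 1,000 W for 1 s = ~0.28 Wh
seconds_of_microwave = QUERY_ENERGY_WH / microwave_wh_per_second
print(f"One median query ~= {seconds_of_microwave:.1f} s of microwave use")
# ~0.9 seconds with these assumptions; a watt measures power (a rate),
# while a watt-hour measures energy, which is the distinction Mosharaf draws.
```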
The other part of what we are doing is in the optimization space. So Zeus measures, but Zeus also optimizes. The way it optimizes, part of it is figuring out the precise power cap to set in a job-specific way, unlike the approach mentioned earlier: at MIT, they were capping all of the servers. But Zeus has an optimization method that sets individual GPUs' power in a unique fashion that, together, works well for the training job and actually performs better.

And it can be applied to small models, where the model fits in a single GPU, but it can be applied perhaps even more effectively to bigger models because of the distributed computation structure of the model. So essentially, I will simplify it and give an overview of why it works for distributed, bigger models: because they're too big, they don't fit in a single device.
Oftentimes you need 16 or 32 GPUs just to hold one copy of the model to start training, fewer for inference. But essentially, what happens is that the computation flows through all of the devices. And that leads to multiple computation paths. Some of them, the ones that dictate the runtime, we will call critical, and some are non-critical.

So for all of the computation happening on the critical path, if it slows down even a little bit, your entire training run or your inference request will slow down. But all the computation outside the critical path, the non-critical work, can be slowed down as long as it doesn't become slower than the critical path.

And so what we have built is a tool that automatically finds this critical path and precisely computes how to slow down everything outside the critical path, by setting the precise frequency each GPU should be operating at, at precise points in time. So it's a coordinated dance across thousands of GPUs that happens at tens-of-milliseconds granularity.

And that allows us to save up to 20 or 30% of the energy consumption of training. So when you said 300,000 or however many homes, now you can get the same thing done with 200,000 of however many homes and save all of this energy. Which, as Dan mentioned again, will still be used for doing more training instead of being saved, but at least our work is making sure we are effectively using the energy we are paying for.
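For readers who want a feel for the knob being turned, GPU frequency can be pinned through NVML; the sketch below locks a "non-critical" GPU to a lower clock while leaving a "critical" one at full speed. It is a toy illustration of the general mechanism, with made-up GPU indices and clock values, not the actual critical-path algorithm Mosharaf's team built.

```python
import pynvml

# Toy illustration: slow down a GPU that is off the critical path by locking
# its clocks lower, while a critical-path GPU keeps its maximum clock.
# Indices and target clocks are made up; this is not Zeus's actual algorithm.
pynvml.nvmlInit()

CRITICAL_GPU = 0        # assumed: on the critical path, keep at full speed
NON_CRITICAL_GPU = 1    # assumed: has slack, can run slower without hurting the job

for index, max_mhz in [(CRITICAL_GPU, 1980), (NON_CRITICAL_GPU, 1200)]:
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
    # Lock the GPU's SM clock into [min, max]; needs admin privileges.
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, 210, max_mhz)
    print(f"GPU {index}: SM clock locked to <= {max_mhz} MHz")

pynvml.nvmlShutdown()
```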
Kevin Frazier: Well,
Mosharaf, I'm still trying to get the two-step down here in Austin, so I think
the dance you just described is far beyond my skills. But I'm impressed that
we're seeing just how nuanced we can be in making these training runs more
efficient and making inference more efficient.
And I wonder how we can try to get more transparency around the
fact that these efficiency gains exist and what would it look like to sort of
mandate or encourage the adoption of those mechanisms.
But I'll get there in a second, because I think it's also
important to flag that your benchmarking work is pivotal in terms of getting a
more holistic picture of where and how energy is actually being used by these
models. But you noted something critical, which was that you were analyzing open-source models, and as I'm sure all good Scaling Laws listeners know, the big boys, for lack of a better phrase, with the exception of Llama from Meta, are all closed source. And so Dan, how are we trying to
remedy the lack of information and the lack of transparency around some of the
biggest companies and the biggest models and the energy uses here? And what
work is your team doing to amplify some of the efficiency gains that Mosharaf
was talking about?
Dan Zhao: Yeah, so our work at MIT focused a lot on energy efficiency from a very early time period. But you know, at the time we didn't think that this would actually go anywhere. This was back in 2021, 2022. We were looking at, like, measuring the energy of training and inference for, geez, back in the day, ResNet.

So these were old CNNs that still see some use nowadays for computer vision. We were doing things for simple BERT-like models, GNNs for molecular interatomic potentials, and things like that. We eventually moved on to work where we tried to benchmark the inference costs, or inference energy, for LLMs.
But back then, Llama had just come out and no one really was paying attention to inference energy. The main difficulty is that it's almost near impossible, unfortunately, because there are several factors at play here. The most reasonable way people have gone about it has been trying to find models of a similar size, be it in terms of the number of parameters or FLOPs, comparable to, you know, whispers on the wind as to what you would think the big companies are doing.

Or in some ways tracking alternative data, right, like measurements of certain emissions. But then again, energy does not equal emissions, because where the energy is drawn from will translate into different emissions. So if you're using energy from wind farms and windmills versus energy from coal versus energy from nuclear, those emissions are going to be vastly different, right?
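That distinction between energy and emissions is, for readers, just a multiplication by the grid's carbon intensity; the sketch below walks through it with illustrative intensity values (real figures vary by grid, year, and methodology).

```python
# Same energy, very different emissions, depending on the power source.
# Carbon intensities below are illustrative order-of-magnitude values in
# grams of CO2-equivalent per kWh, not authoritative figures.
ENERGY_KWH = 1000.0   # e.g., one hypothetical training or serving workload

carbon_intensity_g_per_kwh = {
    "wind":    15,
    "nuclear": 12,
    "gas":     450,
    "coal":    900,
}

for source, intensity in carbon_intensity_g_per_kwh.items():
    emissions_kg = ENERGY_KWH * intensity / 1000
    print(f"{source:>7}: {emissions_kg:8.1f} kg CO2e for {ENERGY_KWH:.0f} kWh")
# The spread spans nearly two orders of magnitude, which is why energy
# numbers alone don't pin down a model's footprint.
```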
So all these approaches are imperfect, but they try to get at a
scale of measurement. And so that makes things very difficult when it comes to
trying to understand it from the angle that you're describing. So what most
people have tended to do has been just to try and offer improvements, right?
Because at the end of the day, dollars and cents are more
likely stronger incentives to sort of move things around. So, advances in both
hardware as well as advances in the model algorithmic side, research which we
do, is just trying to put out there that hey, if you adopt this, you can reduce
energy, which means you can basically push in more throughput or save on
dollars and cents.
GPUs go brr, and basically everyone's happy, right? But like I
said at the beginning, there is this information asymmetry that does make
things difficult. And so the proxies we're using are open data, especially during our work at MIT. That's essentially what we can do at this point.
And we're basically using these open models as test beds. So for example, if you look at Llama 1, Llama 2, Llama 3, 3.1, 3.2, et cetera: using these techniques, adapting them and applying them to these models and saying, 'hey, look at how much we've reduced model FLOPs this way, look at how much latency we've saved this way,' or more specifically, benchmarking on a V100 or an A100 or whatever else.

Saying, okay, in our setting we show that energy efficiency increases this way. But only the large labs themselves, when they do it on their own hardware, on their own networking stacks, so on and so forth, may know the precise numbers. All we can do is offer solutions that we see work on open benchmarks and open models, and then it's basically taken from there.
So that's how I would think, or how I see the current state of
things today, at least.
Kevin Frazier: Yeah,
so this will come as no shock to regular listeners, which is to say I'm
generally pretty bullish on AI and notwithstanding my optimism around its use
cases and its potential to help ameliorate many social woes, I think there's a
sense economically, culturally, so on and so forth, AI's here to stay.
Obviously, it's not going anywhere. The billions, if not soon trillions, of dollars invested in this space suggest that the momentum will continue. And so I think that your point, Dan, of saying how are we going to help this process be more efficient, because we know it's going to continue, makes a heck of a lot of sense.
But if you don't mind, I want to personalize this for just a
second, because I think there are some folks you know, you can read the New
York Times or you can just observe it anytime there's an AI conversation, there
will be someone who will say something along the lines of, ‘I don't use AI or I
rarely use AI because I think of the environmental ramifications as being too
high to justify whatever Studio Ghibli meme I'm going to get out of this or
whatever silly new joke I'm going to have it generate.’
For the two of you, and Mosharaf, I'd love to start with you if
you don't mind answering, I'm guessing you use AI pretty regularly in your
day-to-day life, or are you an AI vegan who tries to limit your AI use as much as possible? What's your general approach?
Mosharaf Chowdhury: I
use AI when needed.
I was talking to my student yesterday and they were asking, like, what AI do I use? I said, I don't pay for AI. I just use it for very simple things, like when I need something to be quickly turned around or proofread, that type of work. But not too much in terms of idea generation.

Not because I am for or against it. It's just that, I don't know, I feel like ideation is one of the fun parts of being a professor, and I want to keep it to myself and take some time to think about ideas. But I mean, it sort of makes sense to use it as a tool, like any other tool, when you feel like it's going to make things faster.

At least that's the way I see it. To me, it's just a tool that is very good at doing certain things, and the things that I think it is good at, I'm going to use it for.
Kevin Frazier: Right. When you see a nail, you grab a hammer. When you need some improved editing, you grab ChatGPT. Makes total sense there. I get that.
Dan, how about yourself? Are you a frequent AI user and in what
senses?
Dan Zhao: I was going to say, technically I have to be, as a researcher. You kind of have to tinker with these models when you work on, for example, developing your agentic models or trying to make models more efficient. You kind of have to use it, although probably not in the way that most people would want to use it.
In terms of communicating solely via text, I was the largest holdout for a very long time. Simply because, one, I didn't want them to have my training data. That's number one. Perhaps a bit paranoid or futile in the end, but you tell me. Number two, the other reason was because I was always suspicious that, oh, if I use this and it does reduce friction and save time for me, I'm wondering whether or not certain sets of gears in my head will no longer click as quickly as they did in the past, right?

So I try to at least be very conservative in terms of my usage. Although I will admit, in certain things, like, for example, when I write papers, I use LaTeX, right? Overleaf has come in, that's helped. And now I don't have to spend half an hour trying to correct the table formatting because I can't figure out why the table won't render a certain way.

I'm sure Mosharaf knows exactly what I'm talking about as well. I give it to ChatGPT, it figures it out for me, I don't ask questions. Yeah, flame me for that. Feel free. But that's how I see it.
Kevin Frazier: I will
not filet you for that. And I will not detail all the images I use for my
students that try to make, you know, boring case law a little bit more
exciting.
Yeah. But it sounds like, for the two of you, the main driver behind your own personal habits isn't necessarily a concern about the energy consumption that comes with using AI. So can you steelman, and really bolster, the case for the people who say, 'I don't want to use AI because of these energy costs'?

What is the most persuasive argument for saying, yes, indeed, you should refrain or perhaps scale back your AI use because of its energy-intensive nature?
Dan, I know this is a toughie, but I'm going to start with you.
Dan Zhao: Sure. I would say that it all kind of harks back to what I was saying at the very beginning about the tragedy of the commons, right?

If you won't submit that single query, someone else will, right? And so that energy, that opportunity cost, is very small in that sort of aggregate setting, right? And so, from a purely economic point of view, if you're thinking about the productivity gain versus the very small sliver of energy cost––so, for example, one of our works on benchmarking large language model energy costs showed that a single query is not really worth a lot in terms of energy from a single person.

Over time, over demand, over usage and utilization, it will grow. Absolutely. But as an individual submitting a query once or twice in a session, it's a negligible difference, right? It's really the tragedy of the commons and the coordination issue that comes up in aggregating an externality that produces these effects in aggregate.
And that's sort of something that we worry about. And it's also something that can only really effectively be addressed in aggregate, at the higher level. So I would say, given the potential productivity benefits to you of using these, for me, for example, formatting the LaTeX tables will save me a whole bunch of time.

I'm not claiming I could use that time to save the world, but I can use that time to continue research on energy efficiency, like the work at MIT, right? To be able to do these things. That probably would be my first and foremost argument when I think about why I myself am not cutting back.

Am I just being a hypocrite? I'm sure I am. But at the same time, when I think about the actual concrete effects, that's the calculus that goes on in my head.
Kevin Frazier: And
Mosharaf, I wonder, when you pick up the paper and you see things like using AI
for this number of queries at this time of night is akin to turning on the
microwave for eight seconds, do you think, can we please stop doing this? This
is so annoying, this isn't really helpful to anything. How, if you were the AI
information czar for a day, how would you change how we're talking about AI and
energy?
What do you think are the most important things that policy
makers should be talking about that the public should be aware of? That labs
should be disclosing?
Mosharaf Chowdhury: It's a very hard question, because people want something they can relate to, so they can understand, okay, this is how much we are talking about. Instead of a microwave, some people have used how many light bulbs, some people have used how many miles in an electric vehicle, how many homes.

Sometimes I must have also used, I don't know, how many pounds of coal to burn. People just want something physical that they can relate to. And I think that's why things like the microwave and the teaspoon of water come up. They're looking for household things that everybody can think of: okay, this is how much it is.

And it's nothing. I'm opening the microwave so many times a day. It's just to get across the sense that, as Dan was saying, for individual users, one query is so small. It only is a problem because all of us are using it, and it would only be solved if most of us stopped using it. A few people adding and deleting is not going to make a dent one way or the other, because each individual query is, if not infinitesimally small, really, really, really small.
And so in terms of how to express this, one way of going about it would be, I think, changes in culture and the education system, I would say. Because it is very easy for us to understand one second versus ten seconds versus a hundred seconds, and what that means.

And in the U.S., of course, we use things like how many miles you have to drive. It doesn't have to be time; it can be the other way around, depending on who you talk to. So similarly for energy, we have to create and sort of cultivate this language that we can use and that everybody understands. I think that's why all of these different comparisons show up.

People are still searching for the right way of posing it so everybody can understand the same thing, and people keep looking for different examples.
Personally, I don't really have a suggestion. People have suggested many different ways to me, because every time I give this type of talk, and I have for years, I just say, okay, this many joules.

And they say, okay, you should use this or that, I don't know what these joules even mean, how can I relate to it? But none of them seem perfect, so I end up not using any of them. And unfortunately, as the information czar, I don't really have a solution that I can provide right away.
Kevin Frazier: Mosharaf,
I'm sorry, you're definitely fired from your job, but it's okay. You know, you
can put it on your LinkedIn still. You were information czar for all of two
minutes.
So, Dan, I wonder for the folks who are concerned about this
space and want to make sure that we are using and training AI models as
efficiently as possible, what are the primary bottle––excuse me, what are the primary bottlenecks that you and your team may be facing? Is it a matter of a lack of information from the labs? Is it a matter of resources for your own work? What's holding this research back?
Dan Zhao: Yeah, I'm probably going to give a rather unconventional answer compared to some of the answers I've given in similar places.

So, I think public education is probably going to be very important. Given the places I've been, for example, right? Like, in big research labs, in public academic labs, in industry, so on and so forth, everyone wants to do the sexy thing. So they want to be like, oh, I want to build the next agent that gets, you know, 99% performance on OSWorld or Windows Agent Arena, which are these agentic benchmarks built nowadays that determine what is state of the art in terms of, like, computer-use agent capabilities, right?
No one's really thinking, oh, how do I really save energy? And part of this is because, per Mosharaf's point, it's a bit abstract to think about what that necessarily means.

And also measuring it is so difficult as well, right? And so as a result it becomes very murky. So I think public education's important: having people understand what a GPU does, and how to act based on that understanding of how GPUs work, right?

And understanding, oh, this is why I want to fill up my memory, pick a batch size along with the model size just so that the GPU memory is filled or saturated. I think people tend to focus on the sexy things. They understand, oh, this is how LLMs work, these are the components, this is what self-attention is, so on and so forth.

But they rarely think about, oh, this is what's actually happening on a device, on the GPU, right? Or, oh, when this is getting sent to a data center, this is what's potentially happening. Or, oh, this is how a GPU architecture maps to, let's say, a loop in deep learning training or inference.
Like, a deeper understanding and appreciation of this will not only make sure that those individual effects add up in aggregate, but at the same time, people are also going to get a bigger bang for their buck when they run these GPUs, right? They're going to get more efficiency, they're going to get better performance. So it's a benefit from both sides.
It's just that it takes that fixed cost to overcome for people
to learn, because it's not easy. If it were easy, everyone would be doing it.
And there's also that initial cost of inefficiency that's required to actually
learn and get there, like I mentioned at the beginning. So, at least that's my
somewhat optimistic take at this point.
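As a tiny illustration for readers of the "saturate the memory" habit Dan mentions, here is a back-of-envelope batch-size picker; the per-sample memory figure is a made-up assumption that in practice you would measure for your own model, sequence length, and precision.

```python
import torch

# Back-of-envelope batch-size picker: fill most of the GPU's free memory,
# in the spirit of "saturate the memory" above. The per-sample cost is an
# assumed number; real activation memory depends on the model and framework.
BYTES_PER_SAMPLE = 50 * 1024**2   # assume ~50 MiB of activations/gradients per sample
SAFETY_FRACTION = 0.8             # leave headroom for fragmentation and temporaries

free_bytes, total_bytes = torch.cuda.mem_get_info()
usable = int(free_bytes * SAFETY_FRACTION)
batch_size = max(1, usable // BYTES_PER_SAMPLE)

print(f"Free GPU memory : {free_bytes / 1024**3:.1f} GiB of {total_bytes / 1024**3:.1f} GiB")
print(f"Suggested batch : {batch_size}")
```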
Kevin Frazier: Yeah
it's really interesting because for me, and again I'm just a lowly law
professor, the thing that comes to mind is, you don't need a Ford F-150 to
drive in downtown Austin, right?
The roads are stable. You're not going over any massive cliffs.
You're just driving on a paved road. You can get by with a Fiat or like I do,
you can ride around on your bike.
But in the context of using AI, there's this idea of, oh, well,
why not just use the reasoning model? Why not go to Deep Research to answer the
question of what should I eat for dinner tonight, right?
You're going to use that model in a far more energy-intensive
way. And if we're not talking about this in kind of layperson's terms, then
users may not understand the difference between how they're engaging with these
models.
But to your point, Dan, obviously there's also just an
education point on the startup side of, if you are getting into this space,
thinking how you can pick up on those more efficient training mechanisms sooner
rather than later is super fascinating.
So, Mosharaf, given that, you know, you all have the world to
save, I don't want to take up too much of your time. What is driving your
optimism in this space or your pessimism in this space? What is kind of top of
mind for you when you put your head on the pillow?
Mosharaf Chowdhury: I'm
optimistic. Let's start with that. I think there is a lot of efficiency to be
extracted. So I have this vision, what I call energy-optimal AI.
As I think it came up multiple times, if we don't do it,
someone else will. So AI is going to happen. What we want to do is to make the
AI happen within the same amount of time, getting to the same level of accuracy
or more, but figuring out what is the minimum amount of energy we need to sort
of spend to get to that point.
And to do that, we need what I call a full-stack approach, starting at the very top with all the models and algorithms that people are creating. There is a lot of innovation happening there.

In the middle, at the software layer that translates all of the models to be executed on the hardware, there's a lot of work we are doing, Dan is doing, and a lot of our other colleagues are doing.

And then at the bottom layer, at the hardware level, there's a massive amount of progress happening. There are new kinds of accelerators coming up, and existing ones are being updated and made more efficient, with more cores added and so on and so forth.
So, as AI becomes commoditized and democratized, it'll also get cheaper, because everybody at every layer is working to make it more efficient and cost-effective, even for these big boys. Because at the end of the day, everybody wants to, I don't know, make sure that they give better service for lower cost, either to make it cheaper for everybody or at least to make more, I dunno, profit for themselves.

So for the fixed amount of AI that we want, the energy cost, I think, will keep going down. The only thing that's happening here is that we are still at the beginning of AI. I think there is a lot more AI to be had. I think Dan mentioned earlier Sora 2, and then Google has Veo 3.

With all of these things, we are going toward world models, which are much bigger and more expensive than whatever text and other things we are doing. Soon we will say, back in the day we used to do text, it was so much better and didn't consume anything, whereas now we are living in this, I don't know, world-model era and whatever else comes up.

So that's what is going to drive the energy consumption. But as we go through all of these stages, I think there will always be these opportunities to optimize and make sure that we are getting as close as possible to energy optimality for that particular type of AI. So I am quite optimistic that we will continue to find different ways to keep the cost low so it doesn't sort of go out of bounds. Yeah. Sorry, go ahead.
Kevin Frazier: No,
that's great. Sounds like you sleep well at night, which is good to hear.
And Dan, I'll tee you up with one final question, which is,
we've talked quite a bit about the market being pretty effective here, where
there's a huge driver for the labs themselves to be more efficient. What is
your commentary for the state legislators who are introducing, for example, AI
energy-related bills?
Is this a moment of saying, hey, just let the market do its
thing, let's let this play out? Or do you think this is a time of saying no, we
need to mandate the sort of efficiency gains we're seeing by Lab A and make
sure that Labs B through Z are applying them as well?
Dan Zhao: Yeah, that's a very complex question there, right?

It's a very multifaceted question too, and it depends, because the differing incentives in this sort of complex whirlpool of things make having a clear-cut answer very difficult. But I'll give it a shot anyway.

So first and foremost, obviously, innovation is still going to be key. Pursuing energy efficiency gains blindly, taking those strategies and imposing them on places where they may not work, that probably won't take priority.

In reality, people are probably going to put energy efficiency considerations aside in favor of chasing pure performance improvements, be it measured from a model performance perspective, in terms of throughput, tokens per second, improved loss, or, for example, model FLOPs, so on and so forth.
Or in terms of more product metrics, when it comes to actually putting these LLMs and agents into products, be it, like, better user engagement, so on and so forth. So that's one consideration. The other consideration is that we probably don't want one strategy that fits all.

At this stage, I feel like we're still very early on when it comes to trying to understand what strategies work. Because again, everyone might be working with different model components, like the different LLMs that are out there. Architectures are still quite different, though a lot of them are still bounded in LLM land when it comes to all these sorts of autoregressive, decoder-style models, multimodal models, so on and so forth, common motifs.

But each one is most likely operating under different constraints: different power constraints, different networking infrastructure, different compute clusters, different data centers, so on and so forth. So these operations are very different. And OpEx and CapEx are likely very different too.
So when it comes to understanding, at least for state legislators, what's in their own backyard is probably the most important thing, right? If the places concentrated in your backyard have a specific profile, then you might want to target that profile first.

But that also requires understanding, I think, a lot of what's going on there. And that understanding, I think, might be lacking in state legislatures at this moment, especially because of the lack of subject matter expertise, for better or for worse. The other thing I'll also say, alluding back to an earlier point, is that in terms of further efficiency gains, I don't know how long LLMs will be the main focus.

We've seen slight differences in model architectures. A lot of the current ecosystem is indeed based on LLMs, these sorts of autoregressive, decoder-style models that are multimodal, et cetera, et cetera, as far as we know. But one thing that I do think is universal, or near universal, is probably human behavior.
Like if we can somehow induce human behavior to change in terms
of how they're able to operate or collectively work together towards something
that would be great.
I'll give you an example. So, machine learning conferences, I'm sure those are like the bread and butter that make everyone's day, right? Places I typically submit to are things like NeurIPS, ICML, ICLR, as many researchers do. Other folks prefer other venues, like IEEE venues, Supercomputing, so on and so forth. If you look, as we did at MIT, at usage, you'll see it spike around these deadlines, because everyone's panicking and everyone's delaying and everyone procrastinates. Fine.
But my point in saying this is that eventually we may fall off the LLM wagon and go toward new architectures, and then we'll have to redraw from scratch, or rebuild from basics, what efficiency techniques mean for these new model architecture types rather than LLMs. And we've already seen this happen a little bit: back in the day it was dense LLMs, now MoEs, or mixtures of experts, are a little more popular, and they've found their way into LLMs. Efficiency techniques that worked solely on dense LLMs and didn't account for these MoE additions don't really work as well. Some do.

But human behavior is always universal. So if we can get human behavior, you know, on board, then, you know, I'm probably happy with that. A long-winded way of saying it depends.
Kevin Frazier: And
all we need to do is change human nature, which, easy smeezy. Right? I'm sure.
I'm sure we're all just ready to do that right after we close this episode.
But, you all have some work to do. Thank you for doing it. I
find it fascinating. And surely I'm going to be sending you a note to please
come back soon. But for now, Mosharaf, Dan, thanks so much for joining.
Dan Zhao: Yeah, thank
you for having us.
Kevin Frazier: Scaling
Laws is a joint production of Lawfare and the University of Texas
School of Law. You can get an ad-free version of this and other Lawfare
podcasts by becoming a material subscriber at our website, lawfaremedia.org/support.
You'll also get access to special events and other content available only to
our supporters. Please rate and review us wherever you get your podcasts. Check
out our written work at lawfaremedia.org. You can also follow us on X and Bluesky.
This podcast was edited by Noam Osband of Goat Rodeo. Our music
is from ALIBI. As always, thanks for listening.
