I gave a talk to some colleagues recently about how I use language models in my day-to-day work. It focused on practical examples for jobs which don’t involve coding.
If you've been wondering whether language models might help with your work (they probably can), or you're already using them but suspect you could be getting more value (you probably could), you might find something useful in the transcript. My main advice is still to just start using them, but that’s easier said than done, so hopefully having a lot of concrete examples will provide some inspiration.
Transcript
Lightly edited with the help of a few models
Introduction
Alex: I won't officially start the session just yet, but we'll be working out of a doc I've shared on Slack. It includes a "How I Put This Together" tab. If you just want something fun to read, I'd go to that tab and click on the link to the Claude chat. I would usually use more models than this, but I stayed in one chat so you could see the process, and I've shared the entire planning interaction I had with Claude. After I'd planned everything else, I asked Gemini to draft the one-page takeaway, which you can find on a different tab.
Host: I will watch the Zoom room, but I think you can get started if you're ready. You have so many people who are excited to chat.
Alex: Okay, there are a lot of people here. First, I'll explain why I offered to run this session. Then I'll share how I use language models, mainly so you know when to try them, not just how to use them. I will also include some thoughts on using them well.
A big part of what I think will be useful here is I'm just going to talk through a bunch of things I've used language models for. There's a longer list that you can already see in the doc. I'll talk through the ones that are highlighted. I'll get people to add their own. I think it'd be nice to have a really long list of examples because the bottom line from this is going to be just use them. You will get better at using them by using them. And I think it's often just not obvious where to start. So a long menu of examples will hopefully help you see, "Wait, I do that job, why not delegate it to a model?"
I'm going to wrap up the talk, Q&A, and discussion with some time for people to sit here on their laptops and do one of two tasks I've come up with to help get over a barrier: either setting up a Claude project, or, if you already have a bunch of Claude projects and know what they are, trying to get a language model to do something that you think it can't currently do well.
How I Got Started
So how did I get here? I started trying to use language models to help me with work when OpenAI Playground was the thing. This was just before ChatGPT. At that point, I think what I was trying to do was some writing. I did something like "this is the start of a Kelsey Piper blog post" and wrote some stuff and got it to try and continue. It worked okay.
I think for several months I was trying for maybe a couple of hours a week. It was probably net negative for my productivity. I wasn't very good at prompting the models. The models weren't very good. I now think I'm getting a significant boost from using them. It's pretty hard to assess, and I haven't really tried to estimate what it is as a percentage of workflow, but I think I'm spending at least several hours a week working closely with language models on whatever I'm doing. And I think it's clearly a good use of time when I'm doing it.
I do think I've got a lot better, and the sense in which I've got better is knowing which things to try to use models for at all, knowing when to give up on a particular model or particular chat or particular prompting strategy, knowing when to keep going and give feedback rather than starting afresh, and then probably some "how to prompt, how to ask for stuff" as well. But I think that's roughly the order. So it's mostly just actually knowing what kind of tasks to try.
High-Level Philosophy for Using LLMs
In terms of how I think about using language models at a pretty high level, I have a few different ways of trying to get at this thing.
Given how much the idea's been in the news recently, the first thing I want to say is that you should have an abundance mindset. What I mean is you've just got this source of absurdly cheap labor – literally free labor. But it's not just free in terms of marginal cost of use. It's also free in that it doesn't get bored or fed up with you. If you ask it to do a meaningless thing, or if you ask it to do some very high-effort task and then decide you can't be bothered to read the output, it's fine. Don't do that with your reports!
I think you can sort of turn that handle a lot. Look for the things where you're getting complementarity. Look for the things where they are better than people. A classic example is they can read really quickly. If I want to delegate to a person a sort of 10-ish minute task, it's not reasonable for me to ask them to read 20,000 words of context that I haven't thoroughly formatted or even really curated that well in order to make that 10-minute task slightly better. But if it costs me one copy and paste to do it, then I just can.
As an example, when I got Gemini to write the one-page takeaway for this session, I gave it, as well as my notes for the session, all of the blog posts I've ever written, some of which have tips on language models. I didn't filter out the ones that don't. I just said: some of these have tips on language models, ignore the rest.
Similarly, you can ask for stuff multiple times from multiple models. You can ask for a big list of options and then ignore most of them. There's just some amount of leaning into how cheap it is to do some stuff.
A different framing on this is I often think of using language models as just being really lazy. I'm just thinking it's cheaper for me to spend 30 seconds asking for this simple thing to be done than two minutes doing it myself. So I will. And sometimes the output isn't very good. And you quickly learn if you're wasting time by spending more time asking than doing the thing. But the marginal cost is just really low.
Different framing: humans are better at some things and worse at other things, and they vary a lot between people. I think language models are in some sense very extreme cases of this. The stuff that they are good at, they are really good at, wildly better than the best people already: writing quickly, reading quickly, reformatting stuff, editing. But the stuff they're really bad at... This is just another case where the solution is to work out how to use their strengths and then really hammer on those, rather than going, "Well, if I was asking a person, they wouldn't have screwed up this really basic thing, so I guess I'm going to throw the baby out with the bathwater and say if it can't do this basic thing, it can't be useful for anything."
Prompting Tips
All right, that's the high-level how I think about framing stuff. Let's talk a bit about prompting specifically.
So I think my attitude here is there is a skill; there are things you can do better or worse. But as a first approximation, don't think of this as some crazy specialist thing that you have to learn. It's not like coding. Usually you can just ask for the thing you want and you'll get it. Here's a bunch more detail, but start from that.
First thing is you need to provide the context. It's cheap to provide it because the models can read very quickly. There are a lot of people in this room who are familiar with the EA community. I think it is a pretty cold take that that community has lots of smart generalists without much specific context or experience.
If you had a bright, enthusiastic young person who didn't have a ton of work experience but was willing to work really hard and you were going to ask them to do a task, that's roughly the kind of description you want to be giving. Tell them how to do it, tell them what you want, tell them to the extent you can what it looks like to do that well and then see how they do. And if it's not great, say it was not great in this way, can you do it better? I think that's just a skill that you can already do with people. Just do the same thing.
If you're not sure exactly what you want out of the task, if you don't know how to do that, but you were working with a bright, energetic, low-context, but enthusiastic young person, maybe the thing you would try is say, "I think I've got this idea in my head. Here's a rough outline of it. Can you ask me a bunch of clarifying questions about it and then go and try?" You can do the same thing.
A big difference is models typically won't of their own accord think to ask the clarifying questions. But you can say "Ask me a bunch of clarifying questions," choose which ones to answer, not feel embarrassed about not answering the ones that you didn't think were important and then get the model to have a go.
Some advice doesn't work as well anymore, I think, if you've been using models for a while or if you've read tips written about early model use. That advice comes from thinking about language models as mostly being next-word predictors on huge amounts of Internet text, which is a large part of how they were originally trained.
If you're thinking about them in that way — and this is, to be clear, how you get useful work out of what's called base models, models that haven't had a bunch of additional fine-tuning or what's called post-training applied to them — what you'll tend to do is try to come up with the right kind of Internet text that would be continued. So maybe you would have a question and then a really good answer and then you would write your question and write "Answer:" and then let the model fill it in.
You don't need to do that now. Huge amounts of effort by all of the frontier AI companies have been put into training the models to work well as chat models. I think anthropomorphising is just going to be way better for you than trying to work out which bit of the Internet distribution you're conditioning on.
Another example or analogy: you will see prompting tips like "if you want the model to give you a really good answer, you should start off by saying 'you are a world-class expert in blah blah. You have a million years' experience in, I don't know, law. Now give me legal advice.'" Instead of that, just go "Can I have high-quality legal advice?" It's worth asking for the good version.
If you're asking models to do research for you, lots of them have search capabilities now. Ask for attention to high-quality sources. I have seen so many takes on Twitter recently, as there's been an explosion in popularity of search-enabled models, from people saying "These language models are so dumb and useless, they're never going to get anywhere. I ask them to do research for me and they're basing all of their findings on, and citing, these terrible SEO-ed crap listicles." Did you ask them to prefer academic sources? Did you say "Go with published papers, then arXiv, then reputable newspapers, then stuff that you can verify from multiple sources, and completely ignore the Internet slop"?
And to be clear, if you want to write a list like that, you can just ask a model: "Can you write me a guide to source prioritization? Which sources are the most respectable?" Then just dump that guide into the model when you're asking it to search and it'll listen to it.
Questions and Reactions
Okay, I will pause there for a second to let people react and process. The next thing I'm going to do is talk through a bunch of specific cases with the hope of giving people some sense of what things I'm using models for and how they could work. But I'll shut up for 30 seconds to see if there's questions or reactions or thoughts at this stage.
Audience Member: This might be covered later, but I think one barrier I find is explaining stuff to the model to set it up for a task and how it does better if you give it like 100 paragraphs of explanation about what it should do. But then I get sick of it. What's the strategy there? Should I explain less or find a way to do it faster?
Alex: Couple of ideas. One is for when you're explaining the same stuff over and over. For example: I'm Alex, I work on AI at Open Phil, I'm working most closely with Emily, Emily does blah, blah. Setting up a project with that context dumped in is a great way to not have to repeat it over and over again.
The other thing which works better for me than for some people is you can use various kinds of transcription app. I use Super Whisper. There's a link to it in this doc and you can just info dump out loud for two minutes about what you want. Don't worry about structuring it. Don't worry about deleting the bit where you go, "Oh, actually no, I don't think that bit's useful. Forget I said that." Just dump the whole text in.
If you read the Claude chat linked in "how I put this together," that's how I did the prep for this session, and I deliberately leaned into this mode. Look at the second message from me: don't read it, just scroll through it. I just throw a bunch of stuff at it in no particular order. It can work out the order of the information that's there.
Audience Member: Do you find that Super Whisper works better than just voice mode? Is that an intentional thing? Because you tried both?
Alex: I think voice mode wasn't available on laptops when I set up Super Whisper. And Super Whisper is great for other things because it just goes to wherever your cursor was and it's fast and accurate. But if you're on your phone, Claude has a voice mode input.
A couple of weeks ago I wanted to understand how easy it is for people to break into unsecured WiFi networks. So I asked Claude by holding the record button for voice input, and it gave me some answers and I asked some follow-up questions. I had a chat with it for 15 minutes while I was walking home about this thing I wanted to learn.
Audience Member: Does it have voice output too or...?
Alex: Claude does not yet. Gemini has voice output and a pretty natural ability for you to interrupt, but it speaks really slowly and is kind of annoying. And then ChatGPT has advanced voice mode which is also kind of annoying on the interrupting front, but is a bit better and faster and more natural than Gemini. I've used both to some extent. Claude's voice mode is rumored to be coming quite soon. I expect I'll use it a lot once it's out.
Specific Use Cases
I'm going to now talk through a few specific things that I've done: all of the highlighted ones in the big list of use cases. The other ones I'm not going to go through, but I'm happy to answer questions about them in the Q&A.
The first thing is related to this abundance take: I think most people like getting feedback on their work, especially if it's instant, and especially if there's no social obligation to listen to all of it if you don't agree. By default, if there's a written piece of work that I'm either done with but feeling anxious about sending or just stuck and I don't know what to do next, I will copy the whole piece of work and say "Can I get some feedback on this?" and dump it into at minimum models from multiple providers.
In some cases I have specific setups where there's a bunch of relevant information already loaded up to the model and I can ask the model with that relevant information for some feedback on the thing.
An example I haven't actually set up, but was talking to someone about yesterday: if you really like the style of someone's writing (I've used Kelsey Piper here), or you want to write more like the New Yorker or Time, have you grabbed your 10 favorite pieces and put them into a single project? That means next time you're asking for feedback, as well as getting feedback from the plain models, you can ask for feedback from the model that's just read 20 of your favorite articles by some blogger or newspaper or whatever else.
And then with all of these you can ask for the way in which you want it to give the feedback. And the thing I usually prefer is a list of suggestions that are small and specific and ordered by priority. And then I would just skim the list and decide whether I want to act on any of them. And it's free for me to act on none of them. I can just move on.
Audience Member: I have a sort of noob question. How would you recommend differentiating between when to start a new project and when to create a style? I feel pretty confused about this.
Alex: I feel pretty confused about this too. I've created a few styles and not got a ton of value out of most of them. But it's not clear to me that this is anything other than a skill issue.
The main thing that projects let you do that styles don't is dump in a ton of context. I think you ideally want to use a style when you want very low friction. A style is something you can have appear in every chat, so it will always be there. If you open up Claude, it will always be possible to select the style.
And then I think a project is for if you have a specific use case that has a bunch of context built up that you want to have on your menu of options. I think to some extent if you have the same information in both, I would expect it to work very similarly.
I want to talk about the management thing. I have a specific project set up for management feedback. There's a blog post that I can link to which explains what it contains. But it's an entire book on management, Holden's advice on management, a bunch of other blog posts. I just put it all in there.
I get various things from this. If I have a problem I want to debug, I might just ask it to debug the problem with me. And I will be clear — this is with permission — I also record meetings with a lot of my reports and just dump the transcripts in and ask for feedback on the transcripts. We have running one-on-one docs, and I've dumped those in and asked for stuff too.
A specific thing I did that worked well and I was really pleased with was I gave my management coach project the full context dump of a one-on-one doc I had with one of my reports which has two-way feedback every week in it. And I asked "What impression did they have of what I think of how they're doing? Give me a whole bunch of lists of which things I think they're doing well, which things I don't."
And then I looked at that and thought about how I thought the person was doing. I was like "Okay, there's some feedback I need to give here because the picture that is being built up over time is not identical to the one in my head." And I used it to figure out what I actually need to be doing and how I can give feedback. I found that really helpful and I didn't copy the feedback things out of the one-on-one doc — I just gave it the whole nine months of running notes from meetings and one-on-one doc with no formatting and said "Some of this is feedback. What's the impression of the feedback?"
The next use case seems kind of specific, but I'm really excited about it: Gemini 2.5, the latest Google model, which we all have access to. The thing that really annoyingly pops up in your Google Docs when you don't want it is actually pretty good for some stuff.
Google has been really frustrating me recently by building very nearly good products and then completely failing at the last hurdle. They have a feature called Canvas. Where you've got a piece of writing, Canvas lets you open it up in effectively a Google Docs clone within the model chat. You can ask Gemini for comments and it will comment in-line on things, and you can then ask it to act on those comments and it will reply to them. You can comment on stuff yourself, you can select a section of the text and say "Hey, please can you rephrase this?" or "Give me 20 possible rephrases for this."
If that vibe feels a bit off, remember it's cheap to ask for more labor and then you can just pick the one you like best. I've used this a little bit. It only came out a couple of weeks ago. I'm really excited about it. I think my default editing will probably, when I remember, be in Gemini rather than Google Docs going forwards, for anything I actually care about the quality of, just because it's so much easier and lower friction than the thing I was doing before, which is: you work with Claude, you give it some feedback, but at some point you have to copy the whole thing into Google Docs if you want to live edit it yourself. And that means every time you want the model to rephrase something you've got to copy the section back, give it the whole context again, then ask for multiple versions. It's just way easier to have it all in one place. So I'm excited about that.
Audience Member: So I often have the workflow where I'm dumping a bunch of stuff down and then an LLM does something that's pretty mid and then I edit. Feels like in theory you should be able to get it to suggest edits or you should be able to vibe code the whole way. And it felt like the issue was always the interface for me — it basically needs to rewrite it again. And does Canvas actually work for this?
Alex: Canvas is certainly the closest of anything I've tried. It's pretty good for this. The problem I have with Canvas is this: Gemini does have a feature that's equivalent to projects. It's called Gems, and it allows you to load a bunch of context in, which is really helpful.
The reason I want to load a bunch of context in is that if I'm getting something to write like me, I want to give it all of the writing of mine that I actually like and tell it to match the style very closely. You can set up a Gem with a Google Doc. And Gemini has Canvas, which lets you do this nice inline editing. But Gems do not let you use Gemini's really useful tools, like Canvas and also deep research. So that's annoying. It will probably get fixed soon.
Probably also at least one of OpenAI or Anthropic will just steal the inline editing of Canvas and then we can use that instead. Right now Canvas is the best at this, but it's going to be worse at style.
So my current workflow for doing this kind of editing would be: while I'm still mostly just getting the language model to draft something, I'd probably be using my Claude projects. I think Claude is the best writer and I have my project set up. And then at the point where I'm asking multiple models for feedback, I'll probably copy the whole thing and get feedback from all of the models and then start editing and implementing the feedback in Canvas.
One thing I can do, obviously, is just paste all of the feedback from the other models into the Gemini app and get a suggestion that's based on that.
Audience Member: What is Canvas?
Alex: Canvas is this tool within Gemini that lets you basically have a Google Doc clone inside the chat window and then talk to the language model, including about specific pieces of text. It's really good.
Oh, another thing Google's screwed up: you know how this doc has four tabs, which is really useful? You'd expect Google product integration to mean I could say "Can you look at Tab 2?" or "Look at these three tabs?" No. If you upload a Google Doc to Gemini, it just gets all of the text in that doc concatenated: title of the doc, title of the first tab, text of the first tab, title of the second tab, text of the second tab, and so on.
Does anyone know how I know that? How would you work out whether that was the case?
Audience Member: You asked it how it works.
Alex: Yeah, I gave Gemini a Google Doc with four tabs and I said, "What can you see? Print out exactly what you can see." And it printed out exactly what it could see. Just ask the models for stuff. You don't have to do any coding.
Audience Member: For what it's worth, ChatGPT does now have some sort of feature where it can give you a draft as text and you can live edit it in the chat.
Alex: Great! I said this was coming soon. Looks like it's already here.
I guess one thing, going back to the earlier point or riffing off the performance review example: I think it's often better to get very high-level early drafts from models, but then at the stage where you're going back and forth on "what do I even want to do with this doc," I think iterating on essentially a list of nested bullet points to get the structure right is better than iterating on the whole thing. It's quicker for you to read and it's easier for the models to understand.
And then I've had some success recently with once I've got this list of bullet points structure, I'll just basically record a talk that is the post or memo I want to write, speaking it out loud. Again, not really worrying about the formatting and being fine going "Oh actually I don't want to say that, I'll just ignore it" or "Oh, maybe this should come earlier."
And if I give a model a transcript of me basically saying the thing I want to write, but as a talk rather than as a post with lots of ums and ers and bits where I go back, and then the bullet point outline and I say "Hey, here's some examples of my polished writing. Here's a monologue, here's the bullet point outline. Can you write me a post?" That's the closest I've got to one-shotting stuff. A couple of my recent blog posts have been over 90% of the way to done after that. And then the last few percent was mostly just me doing quick manual edits.
Audience Member: I have a sort of weird question. I feel like I have a mental block around just monologuing at an LLM.
Audience Member: I feel like a total moron when I'm doing it.
Alex: To be clear, I'll say something for 10 minutes and then be like "Actually none of that please, urgh. Actually what I really want..." And you just gotta swallow that I think.
Audience Member: And don't let anyone hear you. Delete the transcript afterwards or something.
Alex: You can see a transcript of me doing it to prep for this in the Claude chat. And I think it's super embarrassing, but it's fine. I don't read it. It's the model's job to read it. I just said stuff.
Audience Member: You could try to do voice mode where it talks back and forth and it goes like "aha" and that would actually be helpful.
Audience Member: You can totally set up a style where it does that.
Alex: If you're a faster typer than me, you can just type the questions rather than talking. For me, the talking massively lowers the friction. But if it doesn't for you, then don't do it.
You can also record a meeting with an actual human where you talk about the thing and then just give the meeting transcript to the model. And that's more expensive because you need to talk to an actual human. But if you know any actual humans, then it could still work.
Audience Member: I have kids. I think that would actually be great.
Alex: Unironically, if anyone has kids that want to listen to them talk about work stuff, you should totally record a conversation. The younger the kid, the better. It would be hilarious. Starting off with "I explained this to my baby. Help me explain it to adults."
But another thing you can do, which I don't know if this would get over the social thing, but it's helpful, is you can start off by saying "Hey, I want to get your help drafting a post on [topic]. Can you write 20 questions for me to answer?" Then you can just have the 20 questions up there, and my monologue is "Oh, the answer to question three is... The answer to question four is... I don't care about question five. The answer to question six is... I don't care about question seven." And that's much easier than trying to dump all the information out in one go.
I'm going to say one more work example, then move fully to Q&A. So I haven't really been talking about coding. People who write a lot of code in their jobs are getting huge speed-ups from language models already. Language models are very good at writing code for a whole bunch of reasons that are not that worth going into.
A couple of things you can do if you don't write a lot of code or maybe haven't really written any before: models are pretty good at hand-holding you through the setup for writing simple scripts and simple code. Google Colab is pretty close to "you write some code in your web browser and you press run." You don't have to do any of that difficult stuff like installing a particular library on your computer. So that's quite nice. And you can ask the models for help along the way.
Big wake-up moment for me: when I was trying to get earlier versions of language models to help me code, I kept running into bugs. And then my friend Joel Becker just said, "Just copy and paste the bug into the chat window. Don't even say it's a bug. Just copy and paste the error message and it'll fix it for you." And this is just so useful. And again, there's some leaning into laziness here: I'm not asking politely. It wrote some code for me, it wrote 2000 lines and 1900 of them work perfectly, and then there's a bug, and instead of saying thank you, I just copy and paste the bug into the model and it fixes it for me.
It's helpful. It works. Maybe more specifically, I think basically everyone at this point should effectively be a Google Sheets wizard, because Google Apps Script is really simple, models are really good at writing it, and they're also really good at hand-holding you through actually setting the Apps Script up in the spreadsheet.
The example that caused someone to teach me this was I was doing some personal project management in a Google Sheet and I thought it'd be great to have a column that updated with the last time I edited any cell in that row. That feels like a thing that a spreadsheet should be able to do, right? So I can just have a "last updated" column for all of my projects and it just updates anytime I change anything in the row.
ChatGPT (I think maybe version 3 or early 4) one-shotted this. I was like "I would like this feature in Google Sheets. Can you write some Apps Script for that and then explain how to set up the Apps Script so it works?" You can just do loads of stuff. If there's a complicated formula, give it the whole sheet. Say "I want this column to work in this way." It'll write the formula for you. If it looks like it's not working properly, say to the model, "It's not working properly. It looks like it's going wrong in this way. Can you fix it?" It's just a thing that they're very good at doing. Very easy to do.
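For anyone who wants to try that specific trick, here's a minimal sketch of the kind of Apps Script a model might hand you. The sheet name ("Projects") and the position of the "Last updated" column are assumptions you'd adjust for your own spreadsheet.

```javascript
// Minimal sketch: stamp a "Last updated" timestamp whenever a row changes.
// Assumptions: the data lives on a sheet named "Projects" and column E (5)
// is the "Last updated" column; change both to match your spreadsheet.
function onEdit(e) {
  const sheet = e.range.getSheet();
  if (sheet.getName() !== "Projects") return;            // only watch the Projects tab
  const LAST_UPDATED_COL = 5;                             // column E
  const row = e.range.getRow();
  if (row === 1) return;                                  // skip the header row
  if (e.range.getColumn() === LAST_UPDATED_COL) return;   // ignore our own writes
  sheet.getRange(row, LAST_UPDATED_COL).setValue(new Date());
}
```

To set it up, open Extensions → Apps Script in the sheet, paste this in, and save; onEdit is a simple trigger that Sheets runs automatically on every edit.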
Q&A Session
I am going to pause there. I think we've got about eight minutes for questions, other things people want to share. Looks like some people have got some cool ideas. Feel free to double-click on any of the other examples and then we'll have 15 minutes to actually do one of the two things that I think it would be great for people to get some practice at.
If people have good examples of projects that they've set up...
Audience Member: I'm just struggling to start somewhere, like what level of detail to put into them.
Alex: Yes, I will paste a blog post with six examples into the working doc.
The main thing that projects let you do that styles don't is dump in a ton of context. Maybe the lowest hanging fruit for anyone that writes lots of emails, which is most people, is the email project described there. It's 30 examples of emails I wrote that were moderately high stakes and then a voice note from me for each one, speaking very frankly, going "Oh, this person's kind of fancy and I don't want to piss them off so can we let them down nicely without implying that it would be a good idea for them to apply again?" I did that. Now most emails that feel ugly to me just get drafted by Claude and I make a couple of minor edits.
Audience Member: Alex, what do you think about using LLMs?
Alex: I think people should do it more.
Audience Member: What is the most unexpected use case that has charmed you?
Alex: I get pretty anxious sometimes, and I used to have a solution to this, which was a blank Google Doc where I'd just write down what I was thinking, and keep writing until I'd got to "no further actions are necessary or sensible for me now," and then I'd feel a little bit better.
I don't do that anymore. I just open up a new Claude window. No special project here, no prompting. I just tell Claude what's going on in my head and it's nice and sympathetic and lets me duck out whenever I like without any social awkwardness. I've found it really helpful, and I just don't care at all that it's a program that's being nice and sympathetic and saying "Oh yeah, that's so hard." I'm just like, "Well, it is."
Audience Member: How do you decide that you've spot-checked an LLM output enough? One problem that I have when I'm asking for deep research is that I get some answers and I look at them for a moment, but I generally feel like I'm going to feel worse if I wind up putting down something that turns out to be blatantly wrong because I believed an LLM than if I found a legit-seeming source and that source told me. So I'm trying to figure out how to not spend more time fact-checking the LLM than it would have taken me to look up or check the sources myself.
Alex: I have a couple of takes. My boring but on-brand answer is: let's say you had a new GCR intern who had produced a piece of work for you and cited it and made some claims. How much fact-checking is enough? Well, it depends on how plausible the claims are and how closely you skim the first couple of sources, but that seems like a solvable problem.
A slightly more helpful answer, or at least one with some new information: I think the best use of deep research models at the moment is probably getting something like an annotated bibliography as a starting point for research, rather than trying to one-shot the whole research project. My Deep Research Prompt Engineer project, which has a full setup guide on my blog and is linked, has an annotated bibliography mode that pushes the deep research model really hard to just give a bunch of specific claims with citations, and not an introduction and conclusion and all of the rest of it. So you can have that as a research starting point.
And then I guess the other obvious thing is, if it's citing a really long paper or a book, just get that paper, dump it into three other models and say, "Here's a claim, here's a potential source. Is this claim covered by the source? Fact-check it and give me an exact quote supporting your answer about whether it's true or not." That last bit is maybe really important: rather than just giving a model a big document and taking its answer, ask it to directly quote the bit it's most relying on to inform the answer it gives you.
Oh, and don't trust o3.
Audience Member: Yeah.
Alex: I'm not joking.
Audience Member: It lies so hard.
Audience Member: It's just so confident.
Audience Member: Do you have any other takes on different models we should or shouldn't trust with different types of work?
Alex: Models with thinking or reasoning mode enabled are a bit smarter all round, much better at handling lots of not particularly well ordered context, and more likely to deliberately lie to you. I'm saying deliberately — this is slightly anthropomorphising, but I think it is important. I think it is more like lying than hallucinating, in that sometimes the model is literally saying, "Well, I don't have search access, so I should just simulate the search." And other times it's being more blatant than that.
I think that other than that, the models have different vibes, but I don't think any is more or less trustworthy. I think Gemini is the most corporate. I think Claude is the nicest and friendliest. But then also it's very likely to, in quite a sophisticated way, read into how you're asking the question and then give you an answer that is the one that you're kind of hoping for. There are various ways around this, including telling it not to and asking it to make the strongest case against as well as the strongest case for.
Audience Member: Does it help to tell it generally to be disagreeable?
Alex: I think so. My ChatGPT custom instructions have a bit about disagreeableness in them. This has worked extremely well for o3, which now is just an asshole. I haven't done this for Claude, but I think it's fine.
I think o3 has a pretty similar vibe to the vibe you would expect from Grok based on knowledge of Elon Musk, but without having tried Grok. And then Google Gemini has basically the vibe you would expect from Google. And then Anthropic has basically the vibe you would expect from someone really trying hard to be like Amanda Askell. And then ChatGPT is maybe the highest variance between different models.
Audience Member: Sometimes people say it helps to describe your problem in a way that doesn't make it clear that it's happening to you — just "This is a problem someone had, asking for a friend" vibes. I haven't tried that with Claude. Does that work?
Alex: I don't know if that works for the sycophancy problem. There is an area I didn't mention that's related to this where it does work.
A close family member had a series of difficult medical emergencies earlier this year which was pretty tough to deal with. We spent a lot of time in hospital waiting for updates and not really knowing what was going on. I found it unbelievably helpful to just be able to ask Claude what words meant, what base rates were for different things that could possibly happen.
Models have been trained a bit to cover the ass of the developer and be like, "Oh, don't rely on me for medical advice. You should check with an actual doctor." Two things help for getting around this.
One is, rather than trying galaxy-brained jailbreaking stuff, just being a reasonable person: talking to the model as a reasonable person and saying "Look, I appreciate your concern. To be clear, what's going on here is I'm waiting for an update from an actual doctor and I'm pretty worried. I would just like your take to the best of your ability. I promise I'm not going to go and act on it." And usually it'll just be like, "Yeah, fair enough, okay."
But the other thing I found pretty helpful is that the more I framed my questions as something that looks like a medical school exam, rather than something that looks like a person trying to replace their GP with Google, the more likely the model was to answer. So just going "A patient is presenting with [symptoms], here are some vitals. What are your suggested actions?" And it'll answer it like an exam question rather than going "Oh, I can't give medical advice."
I would always check with a lawyer before doing high-stakes stuff. But for every contract I've signed since GPT-3 came out, I've had every model I had access to read it and tell me about anything that seemed funny.
Audience Member: For the medical stuff, another good way to get around it is to say, "I have talked to two experts and they told me A and B. Which expert should I follow up with?" That's a good way to get it to give you an opinion, because you're telling it, "I have a medical professional in the loop. Don't worry."
Alex: That's fun, with a slight caveat that, in general, I try not to lie to the models.
Audience Member: In this particular case, I have talked to multiple experts, but...
Audience Member: Why do you try not to lie to them?
Alex: I think that's a longer rabbit hole. I'm happy to discuss it another time.
Audience Member: Do you use these models with memory enabled so that it runs across chats and projects?
Alex: Good question. So I think only ChatGPT has memory enabled at the moment. I have found memory to be actively harmful so far, but it did recently get a very big upgrade.
My first debugging step if people are getting very weird responses from an OpenAI model, before I do anything else, is "Can you open up what's stored in memory and see whether there's some random stuff in there that's explaining all of these weird responses?"
My guess is it will at some point be really useful. Right now, I don't think it would help me. I think knowing when I can just start a new chat with a blank slate and controlling which context I want to put in and which I don't seems like a very powerful feature of using language models. And I don't want to have a bunch of stuff that's kind of randomly attended to, but at some point it's going to be helpful.
Audience Member: I sometimes want to walk and just dump all my thoughts into an MP3 recorder. And I currently don't have a great workflow for then transferring that to a language model.
Alex: I do it with fireflies.ai
Audience Member: But it's kind of frictionful.
Alex: Yeah, I don't have a wonderful workflow here. Really love the idea. Do it a lot myself. I strongly prefer this to having a voice model that's responding. My workflow is I use Google Recorder, I download the audio, I upload it to Fireflies and I download the transcript. It's kind of annoying. It's not that annoying.
Audience Member: I just do this with the voice mode on Claude and I tell it not to respond until I ask it to. If you don't want the actual MP3 output, then you just get a good transcript of what you said.
Alex: Nice. That does seem good. I'm cycling when I do it so I can't press any buttons.
Audience Member: Oh yeah, you need to speak in 10-minute increments.
Audience Member: I guess I'm just worried that I'll click something accidentally and it will be gone or something.
Audience Member: That has happened to me.
Alex: I'd put at least a 75% chance that by the end of the year it'll just be possible to get the MP3 and dump it straight in. My guess is Claude could one-shot a script that takes in the MP3, uploads it to the Fireflies API, pulls the transcript out of the API, and puts it into your clipboard or into a central doc (there's a rough sketch of what part of that could look like after this exchange).
Audience Member: The default transcription in Claude is just better than other transcription services I find. So I feel like it's less frictionful to do it this way. And then you just basically have what you want as long as you don't want the actual audio recording.
Alex: If you don't want to be doing it while you go for a walk, the lowest-friction way I've found to actually get a long transcript into Claude (something Super Whisper can't handle right now) is: I just start a Google Meet with myself. This is not going to help you if you're walking, sorry. I start a Google Meet with myself, the Gemini thing pops up, I say yes, and then I just have a meeting with myself. Gemini transcribes it and I can copy and paste the transcript straight in.
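For what it's worth, here is a very rough sketch of the "pull the transcript into a central doc" half of the script guessed at above, written as Apps Script to match the spreadsheet example earlier. The shape of the GraphQL query is an assumption to verify against Fireflies' API docs, the API key and doc ID are placeholders, and the MP3 upload step is left out.

```javascript
// Rough sketch only. Assumption: Fireflies' GraphQL API (api.fireflies.ai/graphql)
// supports a `transcripts` query returning `title` and `sentences { text }`;
// check the exact schema in the Fireflies API docs before relying on this.
function pullLatestTranscriptIntoDoc() {
  const FIREFLIES_API_KEY = "YOUR_FIREFLIES_API_KEY"; // placeholder
  const CENTRAL_DOC_ID = "YOUR_GOOGLE_DOC_ID";        // placeholder

  const query = "{ transcripts(limit: 1) { title sentences { text } } }";
  const response = UrlFetchApp.fetch("https://api.fireflies.ai/graphql", {
    method: "post",
    contentType: "application/json",
    headers: { Authorization: "Bearer " + FIREFLIES_API_KEY },
    payload: JSON.stringify({ query: query }),
  });

  // Flatten the most recent transcript into plain text.
  const latest = JSON.parse(response.getContentText()).data.transcripts[0];
  const text = latest.sentences.map(function (s) { return s.text; }).join(" ");

  // Append it to a central Google Doc, ready to paste into a model chat.
  const body = DocumentApp.openById(CENTRAL_DOC_ID).getBody();
  body.appendParagraph(latest.title).setHeading(DocumentApp.ParagraphHeading.HEADING2);
  body.appendParagraph(text);
}
```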
Workshop Activity
We have a bit over seven minutes left. I am happy to keep answering questions once everyone's got started, but I want everyone to try to do one of these two things:
If you have never created a Claude project, I think by the end of the hour you should have created one to do something. There's a list of examples from me. There's a little setup activity that Claude wrote. You can also just Google "Claude projects."
I think I linked to it in my blog post. Just think of something you're going to do more than once. If you can't think of anything else, just write a bunch of stuff about yourself and your job and what you do, and have the Claude project contain that. Then if you want to have a chat with Claude about work, at least you don't have to explain what Open Phil is and who is in your reporting line. That's a very cheap thing to do. Probably you could do more than that.
If you have a bunch of projects, I think it's a good exercise to come up with something that you don't think language models can do, or you don't think they can do well, and just spend the last five minutes trying to squeeze the best performance you can out of the model for that thing. It's just the way you get better at it. It's actually trying.
And importantly, one of the main things you should be doing here is you try with a prompt, the model does something, it sucks. You work out why it sucks. You give the model the prompt, the thing that sucks, and why it sucks and say "Write a better prompt" and then you do that again.
If you have a question, dump it in the Q&A and I'll answer it out loud if it's there. If you have a question that's like "Can you help me with this thing that I'm working on?" stick up a hand and I'll wander round.
We've got two minutes left and then we'll spend two minutes getting out of here. Slack me if you want help with this later. I'm around for a day and then online forever. I'll dump your question into Claude and then paste the response back to you and tell you that's what I did.
Audience Member: Are we also supposed to not add Google Docs to projects?
Alex: Oh yeah, don't add Google Docs to projects. Copy as markdown, paste in the text. Anthropic's Google Docs integration is kind of buggy. It will take up more space if you do it in a Google Doc and it will be slower.
There's a note on my Claude projects document which says this. If you want a full stage-by-stage walkthrough then the "Getting the most from Deep Research Models" blog post has a full setup guide for that specific project.
Audience Member: Is markdown better than PDF?
Alex: I would guess so because you can paste it in as a text file.
Audience Member: Does the same apply for Gems?
Alex: I don't think the slowness applies, but Google Docs handling is bad enough in terms of stripping formatting that I would actually prefer to put in markdown files.
Audience Member: One thing that I would like to be able to do is to link my weekly goals or one-on-one docs that update every single week. Is your best workflow for that to copy and paste the new content?
Alex: Depends how much other stuff you want. If I was doing this, I would do it by copying and pasting because I would probably be doing this in my management coach project which has like half of the context window used with useful advice on management. But if you just want to have a thing to refer to then attaching a single Google Doc seems totally fine.
Audience Member: And when the Google Doc updates it'll be able to see that?
Alex: Yes. Hopefully within a couple of months Anthropic will have just fixed the Google Docs integration so it sucks less.
It's time to finish now. Thanks for letting me ramble about language models. Go and play with them!