<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Llm on blog.iankulin.com</title><link>https://blog.iankulin.com/tags/llm/</link><description>Recent content in Llm on blog.iankulin.com</description><generator>Hugo</generator><language>en-AU</language><lastBuildDate>Mon, 07 Jul 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.iankulin.com/tags/llm/index.xml" rel="self" type="application/rss+xml"/><item><title>State of AI tooling (for me)</title><link>https://blog.iankulin.com/state-of-ai-tooling-for-me/</link><pubDate>Mon, 07 Jul 2025 00:00:00 +0000</pubDate><guid>https://blog.iankulin.com/state-of-ai-tooling-for-me/</guid><description>&lt;p&gt;I&amp;rsquo;ve been meaning to write this for a couple of weeks, so let&amp;rsquo;s get to it - things are moving too fast to reflect for too long, which is its own risk.&lt;/p&gt;
&lt;p&gt;In March, I wrote about &lt;a href="https://blog.iankulin.com/where-im-up-to-with-ai-for-coding/"&gt;how I was using AI in coding&lt;/a&gt;, which was Codeium (now Windsurf) in VS Code for completions, and ChatGPT and Claude online for architecture questions and code gen that was more than half a function.&lt;/p&gt;
&lt;h3 id="media"&gt;Media&lt;/h3&gt;
&lt;p&gt;In my usual keeping-current media consumption I hit a couple of surprises:&lt;/p&gt;
&lt;p&gt;Steve Yegge on Changelog &amp;ldquo;&lt;a href="https://changelog.com/friends/96"&gt;Adventures in babysitting coding agents&lt;/a&gt;&amp;rdquo; - Steve is the author of a book on Vibe Coding, which is not due out till later in the year (by which time it will surely be out of date), and he also works for &lt;a href="https://sourcegraph.com/amp"&gt;Sourcegraph on Amp&lt;/a&gt; which is an agentic tool aimed at enterprise. His pitch was that agentic coding (where the AI can do things - read and edit files, run command line tools etc) is ready now for most tasks, and that the returns on the minimal effort required to code something with prompts are so good that it opens up a lot of projects you wouldn&amp;rsquo;t have bothered with. So you should pick a utility and write it with one of the agentic coding tools. He mentioned a heap - Amp (obviously) but also Cursor, Cline, Claude Code and so on.&lt;/p&gt;
&lt;p&gt;I think this probably blew up the Changelog peeps&amp;rsquo; Discord, and it certainly took me aback a bit - like who&amp;rsquo;d be letting a hallucinating bot loose in their terminal??&lt;/p&gt;
&lt;p&gt;I wanted a bit more science input, and got that from another podcast - &lt;a href="https://ocdevel.com/mlg"&gt;Machine Learning Guide&lt;/a&gt; from &lt;a href="https://www.youtube.com/@ocdevel"&gt;Tyler Renelle&lt;/a&gt;, specifically &lt;a href="https://ocdevel.com/mlg/mla-22"&gt;episodes 22-24&lt;/a&gt;. I can&amp;rsquo;t recommend Tyler highly enough - a very clear thoughtful communicator. I&amp;rsquo;ll be going back to listen to his whole course in machine learning.&lt;/p&gt;
&lt;p&gt;So all that was enough for me to think there&amp;rsquo;s definitely something here, so I&amp;rsquo;d better look at it and see if I need to change.&lt;/p&gt;
&lt;h3 id="tokens"&gt;Tokens&lt;/h3&gt;
&lt;p&gt;Unfortunately this decision was a few days after I&amp;rsquo;d cancelled my monthly Claude plan on the basis that I could just buy $35 of tokens and use them in my self-hosted &lt;a href="https://www.librechat.ai/"&gt;LibreChat&lt;/a&gt;. (LibreChat is basically the Claude/ChatGPT interface that you can connect to any model and pay for with tokens.)&lt;/p&gt;
&lt;p&gt;My thought was that that way, I could swap between different AI companies&amp;rsquo; models as I fancied, and it would probably end up being cheaper. Which, maybe it would have if I&amp;rsquo;d kept using LLMs the way I had been&amp;hellip;&lt;/p&gt;
&lt;h3 id="cline"&gt;Cline&lt;/h3&gt;
&lt;p&gt;Instead what I did was install the &lt;a href="https://cline.bot/"&gt;Cline&lt;/a&gt; add-on in VS Code and give it my API keys so it could gobble up tokens. The interface is like a chat - you can say things like &amp;ldquo;Turn this node/express app into a TypeScript project&amp;rdquo; and if you&amp;rsquo;re using Claude as the backend, it will go ahead and make a plan to do that. Then it will ask you to switch from &amp;ldquo;Plan&amp;rdquo; to &amp;ldquo;Act&amp;rdquo;, jump in and edit the files, run the tests, run the linter etc, and loop on that until the job is finished. It asks permission for things as it needs them, and at first I&amp;rsquo;d carefully inspect what it was doing and why before granting any of them, but very quickly I just trusted it with everything. (No doubt there will be a big attack based on this in 2025 - why not take screenshots of my open Bitwarden and send them to Russia?)&lt;/p&gt;
&lt;p&gt;If you haven&amp;rsquo;t seen an agentic tool powered by Claude Sonnet doing this stuff, prepare to be amazed. Tool use by AIs is definitely the future, and probably not just for code. It still does sometimes get stuck in a rabbit hole - I find if it hasn&amp;rsquo;t solved its own problems after a couple of helpful interventions from me, it&amp;rsquo;s probably not going to (I guess it&amp;rsquo;s poisoned its own context too much) and it&amp;rsquo;s easier to kill it and give it a different (often more focused) prompt on a fresh start. I just used git for my rollback on those occasions, though I understand others (perhaps Cursor?) have &amp;lsquo;checkpoints&amp;rsquo; built into the tool.&lt;/p&gt;
&lt;img src="https://blog.iankulin.com/images/do-all-the-things-meme-template-full-e9a85cb2.webp" width="800" alt=""&gt;
&lt;p&gt;A feature of Cline is that it shows you the token use and dollar amount as it&amp;rsquo;s working. I burned through USD20 of Anthropic tokens in about 4 days of coding all the things. I would just open up a project I use in VS Code, and have the Forgejo issues for it on the other screen, and copy them across to Cline one at a time.&lt;/p&gt;
&lt;p&gt;Since these were serious projects I was making branches and code-checking them manually at the pull request stage, but for utilities that can fit on a single web page (something like &amp;ldquo;I want to drop a Word doc of school report comments on here, and have you switch out the names to fake ones that you can replace later with the real ones, and there&amp;rsquo;s a button for me to download the fake name version&amp;rdquo;) I wouldn&amp;rsquo;t look at the code, just the results.&lt;/p&gt;
&lt;p&gt;I topped up the Anthropic money, but also gave Cline my OpenAI, DeepSeek and Gemini API keys and quickly came to these (probably not reliable) conclusions.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Claude Sonnet 4 - the GOAT.&lt;/li&gt;
&lt;li&gt;OpenAI ChatGPT 4 - okay, but not as good as Sonnet. The cost is 2/3 of Sonnet but it&amp;rsquo;s probably 85% as good. So mathematically good value, but in practice that last little bit makes Sonnet way more useful.&lt;/li&gt;
&lt;li&gt;DeepSeek - 1/10th of the price of Sonnet. Less good than either of the other two, but I still found myself using it for low-intelligence tasks. For example I might get Sonnet to make a detailed plan for renaming a concept in a code base eg &lt;em&gt;&amp;ldquo;I&amp;rsquo;ve been referring to these little files as URLs but now I want them called download jobs everywhere&amp;rdquo;&lt;/em&gt;. Then with that written by Sonnet into a markdown file with things like &amp;ldquo;&lt;code&gt;[ ] in utilities.js on line 321 rename function validateURL() to validateJob()&lt;/code&gt;&amp;rdquo; I&amp;rsquo;d let DeepSeek do the grunt work of all the file edits before swapping back to Claude to do the linting and test error fixing.&lt;/li&gt;
&lt;/ul&gt;
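&lt;p&gt;The &amp;ldquo;mathematically good value&amp;rdquo; remark in the list above can be worked through as a rough quality-per-dollar sum. This is only a sketch: the 85% and 2/3 figures are the subjective guesses from the list, and the function name is made up for illustration.&lt;/p&gt;

```python
# Rough "quality per dollar" compared with Sonnet (Sonnet = 1.0).
# Both inputs are subjective guesses from the list above, not benchmarks.
def relative_value(quality_ratio: float, cost_ratio: float) -> float:
    """Return quality per dollar relative to the baseline model."""
    return quality_ratio / cost_ratio


# ChatGPT: roughly 85% as good at roughly 2/3 of the cost.
gpt_value = relative_value(quality_ratio=0.85, cost_ratio=2 / 3)
print(f"ChatGPT quality per dollar vs Sonnet: {gpt_value:.3f}")
```

&lt;p&gt;Anything above 1.0 is better value on paper than Sonnet, which is the sense in which ChatGPT wins mathematically even though the missing 15% matters more in practice.&lt;/p&gt;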
&lt;p&gt;You can see all these cost details right in Cline as you are switching between the models.&lt;/p&gt;
&lt;img src="https://blog.iankulin.com/images/screenshot-2025-07-07-at-15.43.54.png" width="841" alt=""&gt;
&lt;h3 id="claude-code"&gt;Claude Code&lt;/h3&gt;
&lt;p&gt;It is also possible to add your Anthropic API key into Claude Code and let it eat your tokens in exchange for an in-browser Space Invaders clone or whatever, so I tried this weird idea of just vibing from the CLI instead of my editor. It worked really, really well. I&amp;rsquo;d still have the code open in VS Code, and review it at the commit stage (I still do this now), but it was very impressive. Sadly, the model or system prompt is so tuned for action that I very quickly ran out of tokens again.&lt;/p&gt;
&lt;h3 id="dollar-dollar-bill-yall"&gt;Dollar, dollar bill y&amp;rsquo;all&lt;/h3&gt;
&lt;p&gt;About this time, I became aware that some level of Claude Code use is included in the &lt;a href="https://www.anthropic.com/pricing"&gt;Pro Plan&lt;/a&gt; - ie the USD20/month plan I&amp;rsquo;d been on before. A trick for new players is that if you&amp;rsquo;re changing from tokens to a plan you need to &lt;code&gt;/logout&lt;/code&gt; and &lt;code&gt;/login&lt;/code&gt; in Claude Code to switch it over to your plan (yes I had to top up my Anthropic credits again).&lt;/p&gt;
&lt;p&gt;With that hurdle passed I was now 100% in on Claude Code. On this plan I can generally code for a couple of hours a night without ever seeing the warning appear. On the weekend, I might get to about lunchtime, then have to wait a couple of hours to start again.&lt;/p&gt;
&lt;h3 id="gemini-drops"&gt;Gemini drops&lt;/h3&gt;
&lt;p&gt;At the end of June, Google launched &lt;a href="https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/"&gt;Gemini CLI&lt;/a&gt;. I&amp;rsquo;d somehow ended up with $50 of free tokens without entering any billing details and had been using them in Cline, so I had some feeling about Gemini 2.5 and what it is capable of. I think the Gemini CLI is free (ie no token use) for personal use at the moment. It is not as good as Claude Code + Sonnet yet. I understand it has a very large context window, so perhaps if you&amp;rsquo;re working on a very big project it will be comparatively stronger - &lt;a href="https://www.youtube.com/watch?v=nfOVgz_omlU"&gt;Armin Ronacher says he uses it from inside Claude Code to summarise large code bases&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Still - &amp;ldquo;free&amp;rdquo; is a compelling value argument. I&amp;rsquo;ll often break out Gemini CLI if I&amp;rsquo;ve run out of Claude Code time, but not on serious projects.&lt;/p&gt;
&lt;h3 id="tips-and-tricks-for-agentic-coding-with-claude-code"&gt;Tips and Tricks for Agentic Coding with Claude Code&lt;/h3&gt;
&lt;p&gt;So, from the sources above, my own experience and the vibes from the zeitgeist, here are some things that work, right now in July 2025.&lt;/p&gt;
&lt;p&gt;Plan -&amp;gt; Act&lt;/p&gt;
&lt;p&gt;I guess I learned this from Cline. Ask for a detailed plan for the change - if it&amp;rsquo;s heading on a wrong track you&amp;rsquo;ll usually pick it up here. I&amp;rsquo;ll often ask for this as a markdown file when it completely understands the job and has explained it back to me.&lt;/p&gt;
&lt;p&gt;Mistakes are Cheaper Early&lt;/p&gt;
&lt;p&gt;Once it gets going on a big task, it will often run for ten minutes or so. I don&amp;rsquo;t want to do ten minutes&amp;rsquo; worth of AI datacentre environmental damage for code I&amp;rsquo;m going to throw away. So I will read through the proposal before I let it get to work. In Claude Code, [SHIFT][TAB] switches between plan and act. I don&amp;rsquo;t let it out of plan until there is a good plan that I&amp;rsquo;m happy with.&lt;/p&gt;
&lt;p&gt;Guardrails&lt;/p&gt;
&lt;p&gt;I think this was in my previous article. Have it set up linting, tests &amp;amp; formatting, and make a rule that it runs them. It needs feedback loops, and this is one of them. I&amp;rsquo;ve also jumped fully from JavaScript into TypeScript for AI coding. The AI is like a slightly over-enthusiastic junior developer who has read books on every subject and API in the world. The guardrails force it to do things better.&lt;/p&gt;
&lt;p&gt;Good Coding Practices&lt;/p&gt;
&lt;p&gt;Small files, good architecture, clear names, good project organisation. A small amount of up-to-date documentation that describes the shape and the why of things. I like to keep the &amp;lsquo;sprints&amp;rsquo; small, so I&amp;rsquo;m going from one working app state to another. I also frequently refactor - probably more than in my own hand coding.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://blog.iankulin.com/images/screenshot-2025-07-07-at-14.17.04.jpg" alt=""&gt;&lt;/p&gt;
&lt;p&gt;CLAUDE.md&lt;/p&gt;
&lt;p&gt;These agent instructions get sent up with every chat. Keep them succinct. I take my cue about what it needs by watching it work. If it&amp;rsquo;s grepping all the time to find the main functions, I write a section about where they are and what they do - it&amp;rsquo;s a token saver mechanism. If you get Claude Code to update it, I think it puts too much in - since it&amp;rsquo;s going with every request it will chew up tokens reading it so this is a balancing act.&lt;/p&gt;
&lt;p&gt;Tell it what tools it has, and where things are. For example, if I don&amp;rsquo;t tell it that it can use the Playwright MCP to check any UI changes it makes, it usually won&amp;rsquo;t bother. I give it a &lt;code&gt;temp/&lt;/code&gt; directory to write disposable scripts in.&lt;/p&gt;
&lt;p&gt;MCP&lt;/p&gt;
&lt;p&gt;MCP was everywhere a couple of weeks ago, but it&amp;rsquo;s possible that it will turn out to be a fad - you don&amp;rsquo;t need a git or a GitHub MCP server, just tell Claude to use the command line tools. The only MCP server I install is Playwright.&lt;/p&gt;
&lt;p&gt;Start Over&lt;/p&gt;
&lt;p&gt;Since the whole chat history is going up with every request, you want the least amount of baggage. As soon as we don&amp;rsquo;t need what&amp;rsquo;s in the context for the next job, I clear it. If I do need it, but it&amp;rsquo;s gotten long or contains direction changes, I ask for a markdown file of the plan, then restart with that.&lt;/p&gt;
&lt;p&gt;Chat is still helpful&lt;/p&gt;
&lt;p&gt;I have browser tabs open with ChatGPT and Claude and use them for all the self-contained queries, for instance &lt;em&gt;&amp;ldquo;Should I hand code the interfaces to different LLMs or is there a good library that does this?&amp;rdquo;&lt;/em&gt; isn&amp;rsquo;t something to do in Claude Code - it doesn&amp;rsquo;t need access to your project. Do that somewhere else with VC money.&lt;/p&gt;
&lt;h3 id="keep-learning"&gt;Keep Learning&lt;/h3&gt;
&lt;p&gt;This is fast moving. I am getting great value out of these tools right now with these techniques, but we are in a new, changing, exciting, world with this stuff.&lt;/p&gt;</description></item><item><title>Where I'm up to with AI for coding</title><link>https://blog.iankulin.com/where-im-up-to-with-ai-for-coding/</link><pubDate>Mon, 03 Mar 2025 00:00:00 +0000</pubDate><guid>https://blog.iankulin.com/where-im-up-to-with-ai-for-coding/</guid><description>&lt;p&gt;There&amp;rsquo;s still plenty of controversy about LLMs for coding, and not without reason. But I thought I&amp;rsquo;d run through what I&amp;rsquo;ve tried, and where I&amp;rsquo;ve landed for using AI. Also what the pitfalls are, where it&amp;rsquo;s useful and how it&amp;rsquo;s changed my practice.&lt;/p&gt;
&lt;h3 id="issues"&gt;Issues&lt;/h3&gt;
&lt;h4 id="training-data"&gt;Training data&lt;/h4&gt;
&lt;p&gt;The training data for large language models generally is problematic. There&amp;rsquo;s no doubt that they have been trained on copyrighted material. With code it&amp;rsquo;s slightly less murky since there is a high availability of good quality open source data with attached licenses to train models on. No doubt this includes code written by people who don&amp;rsquo;t approve of it being used by AI, but I think the popular reading of most open source licenses is that using it for training is fine.&lt;/p&gt;
&lt;h4 id="accuracy"&gt;Accuracy&lt;/h4&gt;
&lt;p&gt;Another area where AI code is better than other AI use is in verifiability. It&amp;rsquo;s possible to write good tests to verify a lot of software behaviour. This somewhat negates the problem of hallucinations.&lt;/p&gt;
&lt;h4 id="energy-use"&gt;Energy Use&lt;/h4&gt;
&lt;p&gt;Energy use is an issue I don&amp;rsquo;t really have an answer for. When IT companies are investigating owning their own power stations, that&amp;rsquo;s a clear sign that this is a problem the experts expect to get worse rather than better. I&amp;rsquo;ve lived through so many IT bubbles now that I&amp;rsquo;m sure the hype around AI will die down somewhat, and in a few years there won&amp;rsquo;t be VC money for adding AI to products to make them worse. Hopefully, like most of the previous IT fashions, AI will be left running only in the areas where it&amp;rsquo;s genuinely helpful.&lt;/p&gt;
&lt;p&gt;I also have a growing suspicion that we might have got to the end of the performance gains of making models bigger. Surely by now all of the data that can be gobbled up has been, and the improvements seem to be coming in smaller steps. I imagine future gains won&amp;rsquo;t involve making models bigger, but integrating them into tasks more effectively or building them to be more focused.&lt;/p&gt;
&lt;p&gt;Nevertheless, for the moment, the power usage, especially for training, and especially given that the US energy mix now looks like it&amp;rsquo;s moving away from renewables, is my main concern about AI use.&lt;/p&gt;
&lt;h4 id="leaking-data"&gt;Leaking Data&lt;/h4&gt;
&lt;p&gt;Another issue is leaking data. This does not overly affect me since I open source my code anyway, but anyone using it in a real job would have to follow policy on this, which in most cases would be: don&amp;rsquo;t use it. There are a couple of problems related to the AI vacuuming up all its context from everything in your projects that do worry me. Because I&amp;rsquo;m so comfortable in VS Code and git, I keep all my work notes as markdown and manage them in VS Code, and I also use plain text accounting (Beancount). I don&amp;rsquo;t want any of that data heading out into the AI behemoths, so I&amp;rsquo;m constantly turning the plugins off and on.&lt;/p&gt;
&lt;p&gt;It is possible to use local models, especially if you&amp;rsquo;re on a Mac. I&amp;rsquo;ve used &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; with the &lt;a href="https://marketplace.visualstudio.com/items?itemName=Continue.continue"&gt;Continue&lt;/a&gt; plugin for code completion and kept my data to myself. More about this experience later.&lt;/p&gt;
&lt;h3 id="what-ive-tried"&gt;What I&amp;rsquo;ve tried&lt;/h3&gt;
&lt;p&gt;I used GitHub Copilot for the trial period and was so impressed with it I paid for the service for a couple of months. This was mainly for code completion although I did use the chat a bit - it just wasn&amp;rsquo;t as comfortable in the editor.&lt;/p&gt;
&lt;p&gt;I switched to &lt;a href="https://codeium.com/"&gt;Codeium&lt;/a&gt; after hearing Kevin Hou on a &lt;a href="https://syntax.fm/show/728/ai-superpowers-with-kevin-hou-and-codeium"&gt;Syntax episode&lt;/a&gt;. For code, this seems right on par with my (now outdated) experience of GitHub Copilot. Copilot did seem a bit better at figuring things out from the context though. For example, my plain text accounting format is probably not in the training data for either service, and when I let them, both would produce suggestions in the correct format, but Copilot&amp;rsquo;s suggestions were better: it would suggest an expense was for fuel if the payee was a petrol station that appeared elsewhere in my current file.&lt;/p&gt;
&lt;p&gt;I then discovered Ollama, and with an M1 MacBook it&amp;rsquo;s a really simple matter to just pull models down and play with them. Mostly at the command line, but I did use &lt;a href="https://github.com/open-webui/open-webui"&gt;Open WebUI&lt;/a&gt; a bit for a more ChatGPT-like experience. I played around with trying to do RAG via Open WebUI but with poor results.&lt;/p&gt;
&lt;p&gt;Using Ollama (which provides a REST-style API to your models) I switched to the Continue VS Code plugin so I could do code completion locally. This worked fine, but it was a bit slower than Copilot or Codeium. Only by a bit, but the difference was that it was thinking slower than me, so I would have to wait for it, whereas with the big online services I was constantly typing over their suggestions. So I gave up on it. If my current M1 MacBook dies I&amp;rsquo;ll buy an M4 and try this again.&lt;/p&gt;
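&lt;p&gt;For a feel of what that REST-style API looks like, here is a minimal sketch in Python using only the standard library. It assumes Ollama is running on its default local port (11434) and that a model has already been pulled; the model name &lt;code&gt;llama3&lt;/code&gt; and the helper names are illustrative, and this is not how the Continue plugin itself is implemented.&lt;/p&gt;

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(model, prompt)) as reply:
        return json.loads(reply.read())["response"]


# Example call (needs a running Ollama server, so it is left commented out):
# print(generate("llama3", "Say hello in five words."))
```

&lt;p&gt;Everything stays on the local machine, which is the point of the exercise: the prompt never leaves the laptop.&lt;/p&gt;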
&lt;p&gt;I have used, and continue to use, a combination of Claude, ChatGPT, V0, and DeepSeek Coder in the web browser chat modes. In fact, this is probably my main use. I don&amp;rsquo;t pay for any of them (thank you venture capitalists) and just move across to a different one when I run out of free queries.&lt;/p&gt;
&lt;p&gt;Most of this use is the sort of questions you might ask your mates at work - how would you tackle this? what&amp;rsquo;s a good library for this? what do you think of this approach? can you have a look over my code and suggest improvements? Working in webchat mode reduces the context available (compared to your entire project) but I&amp;rsquo;ve grown to actually prefer the tight control it gives me when I&amp;rsquo;m asking specific code questions.&lt;/p&gt;
&lt;h3 id="how-i-use-it-now"&gt;How I use it now&lt;/h3&gt;
&lt;p&gt;I use Codeium via its VS Code plugin for code completion. Sometimes this is amazing - it spits out what&amp;rsquo;s in your head, and follows your naming conventions etc. Other times it doesn&amp;rsquo;t and I just keep typing.&lt;/p&gt;
&lt;p&gt;What it&amp;rsquo;s really good at is anything repetitive. I especially love it for tests, once I&amp;rsquo;ve written a couple of tests against edge cases in my code, it gets the flavour of what I want and starts writing good ones, including some I wouldn&amp;rsquo;t have thought of which is gold. This is often a tab, tab, tab, exercise.&lt;/p&gt;
&lt;p&gt;I spend a lot of time in long form conversations in the web interfaces of the major chatbots. Usually this is quite fruitful. I often get it to generate code, or to add behaviours to code I&amp;rsquo;ve given it which I then transfer over manually. If it gets into a muddle, I usually clear its memory and start a new chat or move over to a different service. Having the wrong ideas or code in the context seems to lead to a chain of stupider and stupider attempts to fix the symptoms of a problem rather than going back and identifying it. It&amp;rsquo;s possible that my fresh explanation of what I&amp;rsquo;m trying to do, the code I&amp;rsquo;ve got and what the issue is, is also helpful in this restart.&lt;/p&gt;
&lt;h3 id="how-its-changed-my-style"&gt;How it&amp;rsquo;s changed my style&lt;/h3&gt;
&lt;p&gt;With any tool, using it well involves understanding its strengths and leaning into them. AI is no different, and here are the things I do to help it help me, or things that it&amp;rsquo;s made possible.&lt;/p&gt;
&lt;p&gt;The first change has just been to improve my craft in ways I should have anyway, but which you can let slide as a solo developer. This is stuff like clear comments, thoughtful descriptive names, and good separation of ideas. This helps the AI as much as it would help someone reviewing your code, or future you when you come back to maintain it. I like my files to be smaller than they used to be. 500 lines is a guideline for me.&lt;/p&gt;
&lt;p&gt;I already liked old and popular tech before, but now I really like it. Think of the difference between the training corpus for Node/Express and the one for the latest iteration of SvelteKit V2. You just get better answers and suggestions for things the AI knows better.&lt;/p&gt;
&lt;p&gt;The last change is that I&amp;rsquo;m much more likely to change to an appropriate library or technology. The annoying friction of not knowing the exact syntax for things disappears since the AI can generate code with correct syntax for me. It makes my programming skills much more portable. Of course you need to invest in some of the high level understandings to know what you should want to do, but once you know that, you don&amp;rsquo;t need to know what to type to achieve that in the way you did a couple of years ago.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m sure I should know better how to regex, and to remember the common ffmpeg or rsync flags, but I&amp;rsquo;m never going back to spend time on those jobs!&lt;/p&gt;</description></item><item><title>LLM coding question comparison using Ollama</title><link>https://blog.iankulin.com/llm-coding-question-comparison-using-ollama/</link><pubDate>Mon, 29 Jul 2024 00:00:00 +0000</pubDate><guid>https://blog.iankulin.com/llm-coding-question-comparison-using-ollama/</guid><description>&lt;p&gt;Now Ollama has made it simple enough for anyone who can use a terminal to run large language models locally, naturally I&amp;rsquo;ve gone overboard downloading too many to play with. I&amp;rsquo;m increasingly feeling they definitely have a place in the devops/coding arsenal of tools, but which model is best?&lt;/p&gt;
&lt;p&gt;If you go on HuggingFace to look at a new model you&amp;rsquo;re interested in, they often have great comparisons like this.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"&gt;&lt;img src="https://blog.iankulin.com/images/performance.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There has been a lot of work in crafting these and other benchmarks which are often comprehensive and well thought out. I&amp;rsquo;ve also seen people doing fun things, like &lt;a href="https://youtu.be/B0uMFWAGUzI?t=145"&gt;this guy&lt;/a&gt;, who is just pasting coding challenges off a web page into an LLM and seeing if it can solve them (spoiler - mostly it can solve &lt;a href="https://www.w3resource.com/python-exercises/basic/python-basic-1-exercise-141.php"&gt;coding problems that are probably part of its training set&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;A factor to keep in mind when looking at these charts is that they are probably running unquantised (uncompressed is a close enough analogy) models on fleets of &lt;a href="https://www.nvidia.com/en-au/data-center/h100/"&gt;$60K graphics cards&lt;/a&gt;. I can use that if I pay them $20 a month and have an internet connection, but I want to pay $0 and run it on my M1 MacBook - that&amp;rsquo;s why I downloaded &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;So what follows is my completely unscientific testing of the models I&amp;rsquo;ve downloaded. Basically, I&amp;rsquo;ll ask them the same question (that I think I know the answer to) and time their response, and subjectively judge their output. For the question I&amp;rsquo;ve chosen:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Thinking about Docker, what&amp;rsquo;s the difference between [CMD] and [EntryPoint]?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This seems like a fairly specific bit of knowledge someone might want to know about, I know the answer, and the first page of Google results is mostly good so there should be sufficient training data. I&amp;rsquo;ve put both terms in square brackets as a red herring, and same with the camelcase for ENTRYPOINT. I also didn&amp;rsquo;t specify that these are both usually defined in the Dockerfile. I&amp;rsquo;ve had a go at answering the same question myself, and &lt;a href="https://blog.iankulin.com/dockerfile-cmd-vs-entrypoint/"&gt;published it here last week&lt;/a&gt;.&lt;/p&gt;
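&lt;p&gt;The measuring side of a test like this (ask the same question, time the response, count the words) can be sketched in a few lines of Python. This is a hypothetical harness, not the one used for the results below: &lt;code&gt;time_answer&lt;/code&gt; and &lt;code&gt;fake_model&lt;/code&gt; are made-up names, and the stand-in model just returns a fixed string so the sketch runs without any model installed.&lt;/p&gt;

```python
import time
from typing import Callable


def time_answer(ask: Callable[[str], str], question: str) -> dict:
    """Ask one question; return elapsed seconds, word count and the answer."""
    started = time.perf_counter()
    answer = ask(question)
    elapsed = time.perf_counter() - started
    return {"seconds": elapsed, "words": len(answer.split()), "answer": answer}


# Stand-in for a real model call, so the sketch runs anywhere.
def fake_model(question: str) -> str:
    return "placeholder answer with five words"


QUESTION = "Thinking about Docker, what's the difference between [CMD] and [EntryPoint]?"
result = time_answer(fake_model, QUESTION)
print(result["words"], "words in", round(result["seconds"], 3), "seconds")
```

&lt;p&gt;Swapping &lt;code&gt;fake_model&lt;/code&gt; for a function that calls a local model gives the time and word count for each contestant; judging the quality of the answer stays a human job.&lt;/p&gt;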
&lt;h3 id="the-results"&gt;The results&lt;/h3&gt;
&lt;p&gt;According to me, I am the winner.&lt;/p&gt;
&lt;img src="https://blog.iankulin.com/images/2309065-t2sarahconnor2.jpg" width="640" alt=""&gt;
&lt;p&gt;&lt;img src="https://blog.iankulin.com/images/screen-shot-2024-07-03-at-2.37.43-pm.png" alt=""&gt;&lt;/p&gt;
&lt;p&gt;The word count for my answer would be a bit higher if we counted the text in my images, which we probably should. I made up my times by guessing what they&amp;rsquo;d be if you asked me this question.&lt;/p&gt;
&lt;h3 id="the-contestants"&gt;The contestants&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;codeqwen&lt;/em&gt; and &lt;a href="https://chat.deepseek.com/coder"&gt;&lt;em&gt;deepseek-coder&lt;/em&gt;&lt;/a&gt; are both optimised for chatting about code, which I&amp;rsquo;m claiming Docker skills are a legitimate part of. They both also do autocomplete, and I&amp;rsquo;m using &lt;em&gt;codeqwen&lt;/em&gt; for that in VS Code. &lt;em&gt;deepseek-coder&lt;/em&gt; is about twice as big, and you&amp;rsquo;d think better, which it was, but in my opinion, only a little. &lt;em&gt;codeqwen&lt;/em&gt; had a clear error and &lt;em&gt;deepseek-coder&lt;/em&gt; was a bit muddled in some parts but did a great job of wrapping it up with an explanation of where you&amp;rsquo;d use both.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;phi3&lt;/em&gt; is small (half the size of most of the others here) and great for chatting. For general questions it&amp;rsquo;s very impressive for its size, but was useless for this task. It&amp;rsquo;s interesting to me that the smartest and the stupidest AIs had the most to say, and that my explanation was almost the exact size of all the others.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://blog.iankulin.com/images/d11c1d71-92aa-43dd-9b44-39e7ac1b2727_1600x900.jpg" alt=""&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;dolphin-mistral&lt;/em&gt;&amp;rsquo;s claim to fame is that it&amp;rsquo;s uncensored. So if you ask it how to build an improvised explosive, overturn an election, or trick a co-worker into falling in love with you, it will happily tell you - something the other models here cannot. Basically, it laughs at the first law of robotics. Even though launching a Docker container is not illegal or unethical, it had a reasonable, usable answer for our question.&lt;/p&gt;
&lt;p&gt;I tried two versions of &lt;em&gt;llama3&lt;/em&gt;. To explain the difference, we need to go into an explanation of how large language models work, which I don&amp;rsquo;t know, so I&amp;rsquo;m just going to hallucinate it for you:&lt;/p&gt;
&lt;img src="https://blog.iankulin.com/images/input.jpg" width="800" alt=""&gt;
&lt;p&gt;If you vacuum up heaps of input (the training data), then filter out all the cruft (&amp;lsquo;the&amp;rsquo;, &amp;lsquo;a&amp;rsquo;, &amp;lsquo;not&amp;rsquo;), then put it into a special multidimensional database so that similar things are near each other (eg &amp;lsquo;rose&amp;rsquo; is near &amp;lsquo;flower&amp;rsquo;, &amp;lsquo;red&amp;rsquo; and &amp;lsquo;titanic&amp;rsquo; and a long way from &amp;lsquo;bulldozer&amp;rsquo; and &amp;lsquo;antidisestablishmentarianism&amp;rsquo;) and the database also includes how far away from each other those things are, then it is very, very, big. Too big to put on my MacBook.&lt;/p&gt;
&lt;p&gt;We can reduce the size of it by &amp;lsquo;quantising&amp;rsquo; it which is a word I&amp;rsquo;ve heard on a podcast and might mean reducing the resolution of the numbers representing the distances between concepts in the database. This is the &amp;lsquo;q8&amp;rsquo; and &amp;lsquo;q4&amp;rsquo; you can see in the tags in the table. &amp;lsquo;q4&amp;rsquo; is going to be smaller, but less accurate than a &amp;lsquo;q8&amp;rsquo; of the same data.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s the situation with these two versions of &lt;em&gt;llama3&lt;/em&gt; - one is more &amp;lsquo;compressed&amp;rsquo;. The relationship between the quantisation and the usefulness of the model is not linear for many applications, and that seems to be the case here. The bigger model did produce some more detail, but I actually preferred the output of the smaller one.&lt;/p&gt;
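&lt;p&gt;The &amp;lsquo;reducing the resolution of the numbers&amp;rsquo; idea can be shown with a toy example. This is only a sketch of the general principle: the numbers are made up, the function names are hypothetical, and the real q4 and q8 formats used for downloadable models are more elaborate than simple rounding to evenly spaced levels.&lt;/p&gt;

```python
def quantise(values: list[float], bits: int) -> list[float]:
    """Snap each value to one of 2**bits evenly spaced levels between min and max.

    Assumes the values are not all identical (otherwise the step would be zero).
    """
    levels = 2 ** bits - 1
    low, high = min(values), max(values)
    step = (high - low) / levels
    return [low + round((v - low) / step) * step for v in values]


def worst_error(original: list[float], approx: list[float]) -> float:
    """Largest absolute difference between an original value and its approximation."""
    return max(abs(a - b) for a, b in zip(original, approx))


weights = [-0.82, -0.31, -0.05, 0.0, 0.12, 0.47, 0.93]  # made-up numbers
q8 = quantise(weights, bits=8)  # 256 possible levels
q4 = quantise(weights, bits=4)  # 16 possible levels

print("q8 worst error:", worst_error(weights, q8))
print("q4 worst error:", worst_error(weights, q4))
```

&lt;p&gt;With fewer bits there are fewer levels to snap to, so each stored number takes less space but lands further from its original value, which is the size versus accuracy trade described above.&lt;/p&gt;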
&lt;h3 id="conclusion"&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;It would be foolish to put much weight on a conclusion from a single run of a dubious test analysed by a subjective carbon-based lifeform, but anyway&amp;hellip;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;All of these models produced a useful starting point except &lt;em&gt;phi3&lt;/em&gt;. You probably could have just used what the others produced and gone on working with your Dockerfile and things would have worked out fine.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;llama3&lt;/em&gt;&amp;rsquo;s performance matches my experience from other times I&amp;rsquo;ve used it. It&amp;rsquo;s just pretty great for what it is.&lt;/li&gt;
&lt;li&gt;Most of the explanations over-complicated things - this probably could have been fixed with a better prompt.&lt;/li&gt;
&lt;li&gt;It&amp;rsquo;s sort of magic when you think this is most of the world&amp;rsquo;s knowledge squashed into 4GB on my laptop in a form I can just ask questions of.&lt;/li&gt;
&lt;li&gt;There is room for improvement.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It&amp;rsquo;s easy to imagine that if these models were able to reach out to the internet and check what they&amp;rsquo;d come up with, then generate a response by combining their first guess and their new knowledge, they&amp;rsquo;d be a lot better. That&amp;rsquo;s basically what &lt;a href="https://www.perplexity.ai/search/thinking-about-docker-what-s-t-9wJXPl_iTv2BLE60.QXgGA"&gt;Perplexity&lt;/a&gt; does, and its output is better than that of any of my local models, and it includes some of the links it used which would probably clear up any further questions. That sort of functionality is not far away for local models, and something like it is running in &lt;a href="https://useanything.com/"&gt;AnythingLLM&lt;/a&gt;, so I expect these will be indispensable tools in a year.&lt;/p&gt;</description></item><item><title>Using LLMs for coding</title><link>https://blog.iankulin.com/using-llms-for-coding/</link><pubDate>Mon, 01 Jul 2024 00:00:00 +0000</pubDate><guid>https://blog.iankulin.com/using-llms-for-coding/</guid><description>&lt;p&gt;&lt;a href="https://madmuseum.org/events/ghost-shell"&gt;&lt;img src="https://blog.iankulin.com/images/ghost-in-the-shell_07.jpg" alt="Ghost in the Shell
© Manga Entertainment 1996
"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This post looks at the context for some of my thinking about AI for supporting software development, and where I&amp;rsquo;ve landed on it for the time being.&lt;/p&gt;
&lt;h3 id="the-landscape"&gt;The landscape&lt;/h3&gt;
&lt;p&gt;I &lt;a href="https://blog.iankulin.com/chatgpts-code-writing/"&gt;briefly wrote about ChatGPT&amp;rsquo;s&lt;/a&gt; coding ability at the end of 2022. The wide availability of this tool marked the beginning of what I think can fairly be described as a revolution. The controversies that have crystallised since have not dampened my amazement at this step forward in what compute can do, especially around natural language processing.&lt;/p&gt;
&lt;p&gt;The next big news in this story was Microsoft&amp;rsquo;s launch of GitHub Copilot. In business terms this was a brilliant move - owning the most popular code editor, and leveraging the world&amp;rsquo;s biggest collection of public code to create a product that &lt;a href="https://visualstudiomagazine.com/Articles/2024/02/05/copilot-numbers.aspx"&gt;millions of people&lt;/a&gt; are prepared to pay $10 a month for can only be regarded as a success.&lt;/p&gt;
&lt;p&gt;At the same time as Microsoft established a new revenue stream, LLMs have been an exciting area of open source growth, especially the excellent Python libraries and the tools in the LangChain ecosystem.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not all rainbows and unicorns though - there are a few valid points that AI skeptics have coalesced around.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Training data - although this is a bigger issue for general models (where masses of web content has been vacuumed up) than it is for code, it is still an issue. If a model is trained on some non-permissively licensed code, and the generative AI I&amp;rsquo;m using includes that code in a commit, then a license, or at least some ethics, have been breached.&lt;/li&gt;
&lt;li&gt;Quality (1) - You can see from the feature images in many of the posts in this blog during my MidJourney enthusiasm that generative AI is not perfect. Before I abandoned them I started to prefer the mangled writing and fingers of the engines, but no one wants the software equivalent of mangled fingers in their codebases. I suspect this particular aspect of the quality of the code will have a technological solution - we&amp;rsquo;re in the very early days after all.&lt;/li&gt;
&lt;li&gt;Quality (2) - A trickier quality problem is people writing code using AI where they do not fully understand the code they are committing. I imagine this is going to be a growing issue for projects, especially anything with a profit motive such as bug bounties. Projects have mechanisms like code reviews and pull requests, but if submissions can be low-effort and checking them is high-effort, that asymmetry is going to be painful.&lt;/li&gt;
&lt;li&gt;Poisoned well - As the amount of AI code in codebases increases, and AI is then trained on those codebases, this will quickly become a snake eating its tail, with AI training itself on its own code. If allowed, this would tend to slowly evolve future codebases to use techniques favoured by early coding LLMs. The current amount of machine influenced code on &lt;a href="https://decrypt.co/147191/no-human-programmers-five-years-ai-stability-ceo"&gt;GitHub is definitely not 41%&lt;/a&gt; but it must be some, and is likely to increase, so this is a factor that will need some thought.&lt;/li&gt;
&lt;li&gt;Exfiltrating code - if you use an external LLM, such as GitHub Copilot, to write commercial code, who can see your code? Since it&amp;rsquo;s being transmitted to the AI in order to make autocomplete suggestions, the answer is Microsoft, or some other company. How does that intersect with your company&amp;rsquo;s policies? I assume, based on the questions I&amp;rsquo;ve asked Copilot over the last year, that I&amp;rsquo;d never be considered for a coding job at Microsoft :-)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="i-for-one-welcome-our-new-robot-overlords"&gt;I, for one, welcome our new robot overlords&lt;/h3&gt;
&lt;img src="https://blog.iankulin.com/images/hailants-1.jpg" width="512" alt=""&gt;
&lt;p&gt;In an industry particularly known for excessive hype-cycles, it&amp;rsquo;s important to critically examine what we&amp;rsquo;re doing, but for the moment, I&amp;rsquo;ve landed on the position that these are good tools for me to use. Here&amp;rsquo;s my thinking.&lt;/p&gt;
&lt;p&gt;My situation is that I&amp;rsquo;m a very experienced developer, with solid expertise in several languages and programming paradigms, and with a degree that was strong in looking at the meta level of languages and software development processes, but I&amp;rsquo;ve got no professional experience in modern languages. Because of this, a lot of my process has been knowing what I wanted to do, using Google or Stack Overflow to figure out the mechanics of that in whatever language I&amp;rsquo;m using, then translating that into the context of the code I&amp;rsquo;m working on. Generative AI fits extremely well into that need - instead of jumping into a browser window to look something up, I&amp;rsquo;m just writing a descriptive comment of my intentions, then tabbing through the suggestions to choose an approach.&lt;/p&gt;
&lt;p&gt;My particular style is also well suited to these tools - I like clear, simple to reason about code. If I can write a pure function for something, I do. I like to break my code up into separated concerns with clear interfaces, and I don&amp;rsquo;t prematurely optimise. I use descriptive variable, function and object names. I like to work with established, well documented languages and popular libraries, and I prefer to reduce external dependencies. All of these habits make it easier for an AI assistant to access the context of what I&amp;rsquo;m doing, and therefore to make better quality suggestions.&lt;/p&gt;
&lt;h3 id="my-journey"&gt;My journey&lt;/h3&gt;
&lt;p&gt;I started out using ChatGPT 3 then 3.5 as a sort of super-google/stack-overflow eliminator.&lt;/p&gt;
&lt;p&gt;Then with the public launch of &lt;a href="https://github.com/features/copilot"&gt;GitHub Copilot&lt;/a&gt;, I trialed that in VSCode and it was a great experience. I guess they didn&amp;rsquo;t invent the idea of the greyed-out auto-complete suggestion you can tab to accept, but it feels like a natural way to work with this stuff.&lt;/p&gt;
&lt;p&gt;I paid for Copilot for a couple of months, but then heard about &lt;a href="https://codeium.com/"&gt;Codeium&lt;/a&gt;, probably on &lt;a href="https://syntax.fm/show/728/ai-superpowers-with-kevin-hou-and-codeium"&gt;Syntax&lt;/a&gt;, which is free for individual developers (for now - thank you VC funding). I haven&amp;rsquo;t done any careful comparisons, but it&amp;rsquo;s definitely of the same order. I suspect Copilot is doing something better with the local context. For example I use a plain text accounting system called &lt;a href="https://beancount.github.io/docs/beancount_language_syntax.html#transactions"&gt;Beancount&lt;/a&gt; in VSCode. Copilot is able to understand these transactions and make much more useful suggestions than Codeium. I assume this is just inferred from my local files since there would not be much training data for them, and it suggests the correct accounts based on the payees which must be from local context.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve probably done more work with Codeium, 80% of it on JavaScript, than with Copilot. It&amp;rsquo;s definitely a workable solution and a great choice if you want a Copilot-type experience without paying for it, or have questions about Microsoft&amp;rsquo;s training data.&lt;/p&gt;
&lt;p&gt;More recently I&amp;rsquo;ve started playing with local models to avoid the problem of exfiltrating my code - I strongly feel I can&amp;rsquo;t use AI-assisted coding with client code if I don&amp;rsquo;t know what&amp;rsquo;s happening to it. If I can run a local model, that problem is avoided.&lt;/p&gt;
&lt;p&gt;I code on an early M1 MacBook, so &lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; is an easy-to-use choice. I&amp;rsquo;ve tried &lt;a href="https://ai.meta.com/blog/meta-llama-3/"&gt;llama3&lt;/a&gt; and &lt;a href="https://qwenlm.github.io/blog/codeqwen1.5/"&gt;codeqwen1.5&lt;/a&gt; in the terminal for a bit, but missed the ChatGPT web experience. To get that back, I&amp;rsquo;ve been running &lt;a href="https://openwebui.com/"&gt;Open WebUI&lt;/a&gt; in a Docker container.&lt;/p&gt;
&lt;p&gt;More recently, I&amp;rsquo;ve installed the &lt;a href="https://docs.continue.dev/intro"&gt;Continue&lt;/a&gt; VSCode extension that allows those Ollama managed models to work in VSCode, including the auto-suggestions (following &lt;a href="https://www.davegray.codes/posts/bye-copilot-how-to-create-a-local-ai-coding-assistant-for-free"&gt;Dave Gray&amp;rsquo;s blog post&lt;/a&gt;). I&amp;rsquo;ve got a few long flights coming up over the next week, so it will be good to be able to work offline with that help.&lt;/p&gt;
&lt;p&gt;I haven&amp;rsquo;t really done more than play with CodeQwen in VSCode via Continue so far, but my initial impression is that it&amp;rsquo;s comparable to Copilot, although the extra second of waiting for auto-suggestions did make me look up M3 Max MacBook pricing. Logic tells you that a 4GB model on a MacBook is going to be less capable than the giant GPT-4 powered Copilot, but &lt;a href="https://qwenlm.github.io/blog/codeqwen1.5/"&gt;this comparison&lt;/a&gt; suggests the difference is not an order of magnitude (although the model size is). From limited playing around in small JavaScript codebases, they seem similar, with the local model just being a bit slower.&lt;/p&gt;
&lt;p&gt;If this is a revolution, it&amp;rsquo;s one we&amp;rsquo;re at the start of, and I certainly reserve the right to change my mind about AI assistance in coding, but I suspect it&amp;rsquo;s our future and I&amp;rsquo;m excited at the productivity boost it currently gives me working in languages I&amp;rsquo;m new to.&lt;/p&gt;</description></item></channel></rss>