2026.02.18 - Preparing to Develop
It was a busy week for Brendan and me–we were both traveling and had…plans for the weekend. Regardless, we found time in our schedules to meet and keep moving this project forward. More on this later, but we’re at the point where we are going to start building (actually!). The goal is a working prototype within the next 2-3 weeks.
Before getting to that, though, let’s recap the meeting in the order of our agenda.
First, we quickly reviewed the two new prompts that I had developed. One was created using the same approach as documented in Meeting Recap #2 for February 8, 2026–except this time, I used Claude instead of ChatGPT. We did this because, during our last meeting, we had settled on Claude as our preferred model. Thus, we thought it made sense to also develop the underlying prompt using Claude.
To briefly summarize the approach: I input five sample interactions (“artificial data”) that I wanted the LLM to replicate, and added in a few explicit instructions for things that we absolutely did not want the LLM to do (e.g., disclose specific answers or search the Internet).
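For anyone who wants to try something similar, here is a minimal Python sketch of how that kind of input could be assembled. It is an illustration only, not the exact text I gave Claude: the function name, the `SampleInteraction` type, the wording of the request, and the placeholder sample are all hypothetical. Only the two prohibitions come from this post.

```python
from dataclasses import dataclass


@dataclass
class SampleInteraction:
    """One hand-written exchange the chatbot should imitate."""
    user: str
    assistant: str


def build_meta_prompt(samples, prohibitions):
    """Combine sample interactions and explicit "never do this" rules
    into one block of text to hand to an LLM."""
    # ASSUMPTION: this opening request is invented for illustration.
    lines = [
        "Write a system prompt for a chatbot that replicates the style "
        "of the sample interactions below.",
        "",
    ]
    for number, sample in enumerate(samples, start=1):
        lines.append(f"Sample interaction {number}")
        lines.append(f"User: {sample.user}")
        lines.append(f"Assistant: {sample.assistant}")
        lines.append("")
    lines.append("The chatbot must never:")
    for rule in prohibitions:
        lines.append(f"- {rule}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder sample (hypothetical); the real input used five.
    samples = [
        SampleInteraction(
            user="I'm not sure where to start on my project.",
            assistant="What part of it feels least clear to you right now?",
        ),
    ]
    # These two prohibitions are the ones named in this post.
    prohibitions = ["Disclose specific answers", "Search the Internet"]
    print(build_meta_prompt(samples, prohibitions))
```

Keeping the samples and the prohibitions as separate lists makes it easy to swap either one out between prompt iterations.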
Although I used the same approach, it’s notable that Claude’s output was meaningfully different from ChatGPT’s. Perhaps the most noticeable difference was the output length–whereas ChatGPT’s output was 507 words, Claude’s was 1,385 words. In other words, Claude’s was more than two and a half times the length of ChatGPT’s!
This higher word count is significant because it also translates into higher token counts, which in turn translate into higher costs for us.
With last week’s conversation on tokens still on my mind, I followed up on Claude’s 1,385-word output by asking it for a shorter version. Its revised version came in at just 381 words, a little over a quarter (about 28%) of the original length.
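To make the arithmetic concrete, here is a rough Python sketch that compares the three word counts and turns them into ballpark token and cost estimates. The word counts are the ones reported above. Everything else is an assumption: the 4/3 tokens-per-word figure is only a common rule of thumb for English text (real tokenizers vary by model), and the price is a placeholder, not any provider's actual rate.

```python
# Word counts reported in this post.
WORD_COUNTS = {
    "ChatGPT": 507,
    "Claude (original)": 1385,
    "Claude (shortened)": 381,
}

# ASSUMPTION: rule-of-thumb conversion; actual tokenizers differ by model.
TOKENS_PER_WORD = 4 / 3

# ASSUMPTION: placeholder price for illustration, not a real rate.
HYPOTHETICAL_USD_PER_MILLION_INPUT_TOKENS = 1.00


def length_ratio(prompt_a: str, prompt_b: str) -> float:
    """How many times longer prompt_a is than prompt_b, by word count."""
    return WORD_COUNTS[prompt_a] / WORD_COUNTS[prompt_b]


def estimate_tokens(words: int) -> int:
    """Very rough token estimate from a word count."""
    return round(words * TOKENS_PER_WORD)


def estimate_cost_usd(words: int, requests: int) -> float:
    """Estimated cost of sending a prompt of this length `requests` times,
    assuming the full prompt is billed as input on every request."""
    total_tokens = estimate_tokens(words) * requests
    return total_tokens / 1_000_000 * HYPOTHETICAL_USD_PER_MILLION_INPUT_TOKENS


if __name__ == "__main__":
    print(f"Claude vs. ChatGPT: {length_ratio('Claude (original)', 'ChatGPT'):.2f}x")
    print(f"Shortened vs. original: {length_ratio('Claude (shortened)', 'Claude (original)'):.1%}")
    for name, words in WORD_COUNTS.items():
        tokens = estimate_tokens(words)
        cost = estimate_cost_usd(words, requests=10_000)
        print(f"{name}: ~{tokens} tokens, ~${cost:.2f} per 10,000 requests")
```

Whatever the real price turns out to be, the ratios carry over: under these assumptions, a prompt that is roughly 2.7 times longer costs roughly 2.7 times as much every time it is sent.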
The second prompt I created was to reflect a state-machine concept, which we had also discussed in our meeting last week. I documented my experience building this prompt (with ChatGPT) in a post from yesterday morning.
After we walked through both of these prompts, Brendan shared with me the results of his testing–for which he used my updated Claude prompt. Notably, he used this prompt for testing in Claude (Haiku and Sonnet), ChatGPT (4.1 mini and 5.2), and Gemini (3-Flash). This time, he kept his inputs largely consistent. That way, we could more clearly distinguish between the quality of the model outputs.
Overall, across all models, we were happy with our results. The prompt this week was working the way we had intended, just as it had last week.
One interesting thread from our conversation: when I was reading over Brendan’s interaction with Claude Haiku, I noticed that Haiku had employed the same sentence structure across all of its outputs. It started with some sort of affirmation, and then in the next line, posed a question. I raised this as a potential concern to Brendan–did he find this overly redundant?
He replied that he didn’t, in fact. After going back and forth a bit, he suggested that perhaps my reaction to the repetition was due to my perspective as a reader, rather than as a user engaged in a critical thinking session. Reading the interaction, exchange by exchange, the repetition may be obvious. But for Brendan, who took a minute or so after each model output to interpret it and craft a response, the repetition wasn’t so obvious.
Another interesting takeaway: when Brendan was outlining the context for the issue that he wanted to think through (inheritance, a computer science project), he commented, “i don’t entirely know why this might be useful though. it seems like it would be messy.”
Brendan did not use this specific language in his testing of GPT 5.2, so naturally, the interaction with that model did not follow up on it. In his testing of GPT 4.1 mini, he misspelled messy as “messay,” and GPT did not follow up on it. We considered giving it the benefit of the doubt. But shouldn’t large language models be good enough to see past a one-character typo?
Both Claude models caught it and immediately followed up by asking Brendan to clarify what he meant. Gemini’s response was the most interesting, however. Unlike Claude, it did not immediately follow up on Brendan’s intent behind the term “messy.” However, nine exchanges later, Gemini hit Brendan with, “Earlier you mentioned that inheritance seemed like it might be ‘messy’–how does the idea of having one central place to manage shared code change your perspective on that messiness?”
Both of us found this outcome a bit bizarre. But I was really…impressed. While I concede that it would have been more timely for the model to follow up right away, it felt strategic to let that comment fall by the wayside for a bit, and then return to it some time later. Circling back, as some may say, to remind the user of their initial objectives after they have already made some headway.
After reviewing his interactions, Brendan and I proceeded to review some artificial data he had created (i.e., sample interactions that we would use as a model for an “ideal” interaction). He also shared some Claude resources on prompt engineering (Prompting Best Practices and Prompt Engineering Overview).
We then talked a bit about my updated user interface, which remains very preliminary, but marks a step up from the Figma interface that I had designed a few weeks ago. This one, I made in Google Slides. It doesn’t feel quite there–a big part of it, I think, has to do with the proportions. And we’re still trying to navigate this tricky balance between creating something that is unique (different from all the other chatbots), but still…recognizable.
Brendan suggested that we should probably think of ways to incorporate more thinking motifs into the design. Currently, he observed, what we have doesn’t look too different from something like Google Docs. I agreed with him. However, I also pointed out that there is a risk of the design veering into cringe. How can we best incorporate the concept of thinking into our design in a way that is obvious, yet minimalist?
Finally, we talked through the timeline. I again brought up the prospect of vibecoding–would it help speed things up? I knew that Brendan wasn’t the biggest fan of the idea, but I realized then that I didn’t really know why–was it an issue of rigor? He replied that his concern was actually with security. It’s harder for developers to follow vibecoded code. That incomprehensibility is what creates a security risk–which may be particularly pronounced for this tool, as it captures users at a vulnerable point in their critical thinking process.
By definition, these thoughts are half-baked and not designed for public consumption. I can imagine a big issue if someone’s interactions with a chatbot got out, and they reflected perspectives that are less than societally accepted. On a conceptual level, I wouldn’t think that one should be punished or held accountable purely for thinking, especially if that thought is a means to a more fully formed thought. But on the other hand…these technological tools really change things, because when you click submit to a chatbot, you are, by definition, submitting an output. Even if, within your mind, the thought is fleeting, it becomes memorialized in the digital universe once you type it into the chat window.
So, anyway. Brendan estimated that it would take two weeks to develop something that we could host on our laptops, and an additional week to get it ready to ship out to other people’s devices. Then, another month or so to get it online, which would encompass figuring out how to create the database (chat storage), the servers, log-in, etc.
Next steps:
Research prompt caching
Update prompt to address lingering issues (occasionally too-complex questions; overenthusiastic affirmations)
Review Claude’s prompting guides
Food for thought:
Is it best practice to generate a prompt using the same platform that you will ultimately be using to deploy the prompt?
How varied should the chatbot outputs be? Does it need to masquerade as anything more than a chatbot?
What are non-cringe “thinking” motifs that we can deploy?
