Like the rest of the internet, I’ve been playing around with ChatGPT. It’s remarkable in so many ways: writing fluidly and coherently, putting together sentences and paragraphs, pulling information from the vast swathes of data upon which it was trained.

We marvel: it would take humans hours to find this information, if we could find it at all! How long would I have to scour keyword-manipulated Google searches to learn so much, if I even could?

Of course, it did take ChatGPT hours to find this information, if you count the thousands of hours it spent in training. And sometimes we humans simply can’t find something out, and ChatGPT can’t either.

I’ve seen otherwise intelligent people discussing ChatGPT like it’s an oracle: “I want to know how to determine whether this business metric is reasonable, but I’ve been asking GPT all morning and it doesn’t know the answer! So it must not be possible.” Or: “I asked GPT to calculate my carbon emissions, and it gave me this number, so I’m going to offset by that much.” Maybe these are reasonable things to say and do. But within my first hour of playing with ChatGPT, asking it about both fact-based and opinion-based things in a variety of subject areas, I was able to generate dozens of confident falsehoods.

Here are some:

On Sunday, I did the Tres Picachos hike in Puerto Rico. According to my recording and those of several other people on the internet, it’s about 5.6 miles long (2.8 each way). According to my topographic map and the US Geological Survey marker at the summit (not to mention Wikipedia, which ChatGPT surely consumed), it’s 3,176 ft at the summit. It took us 3 hours, including 30 minutes at the top taking in the view. And any knowledge of the Puerto Rican rainforest, plus the photos on Google Images and my own human-eye observations yesterday, will confirm that it’s jungly all the way to the top.

What did ChatGPT say?

It’s right that the hike is hard, and humid, and hot, and that you should bring water and wear hiking attire. But it got everything else wrong. And all of that is factual information, easy to verify on a number of sites across the internet!

When I ask a simpler question, it gets even that wrong! The mountain is not 2,257 feet tall; as noted above, it’s 3,176 ft at the summit. At least it gets the meters-to-feet conversion correct 🙄
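The conversion itself is trivial, since a foot is exactly 0.3048 meters, so internal consistency is a very low bar. Here’s a quick sketch of the arithmetic, using the real summit elevation and the wrong figure it gave:

```python
# Feet-to-meters conversion (1 ft = 0.3048 m exactly).
FT_TO_M = 0.3048

actual_ft = 3176   # USGS marker / topographic map summit elevation
claimed_ft = 2257  # the figure ChatGPT gave

print(f"{actual_ft} ft ≈ {actual_ft * FT_TO_M:.0f} m")    # ~968 m
print(f"{claimed_ft} ft ≈ {claimed_ft * FT_TO_M:.0f} m")  # ~688 m
```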

What about something that’s still factual, but more obscure?

My partner is working on a book about crystal twinning, a phenomenon in which minerals grow following particular symmetries (e.g. reflection across a mirror plane). Different mineral species exhibit different “twin laws”, which describe the orientation of the twinning using axes and planes in a coordinate space defined by the crystal lattice. These laws are deterministic for a given mineral species and follow from its physical structure. They also often (but not always) have human-approachable names in addition to their precise numerical definitions.

There are no confirmed instances of twinning in aquamarine. The Brazil Law was defined for quartz and is not observed in other species; if it ever did occur in another mineral, that species would have to belong to the trigonal crystal system, and aquamarine is hexagonal. A 70-degree joining is certainly not seen in aquamarine and also does not describe Brazil Law twins: 70 degrees would create an inclined-axis twin (which Brazil Law twins are not), forming a visible angle between the two crystals. And while a “V” shape is a common morphology in twinned crystals, it is definitely not true of Brazil Law twins (or aquamarine).

So it’s totally wrong. But at least these are all terms and concepts in mineralogy, even if they’re put together wrong. When we asked about calcite, it made up a twin law out of whole cloth:

Dauphiné Law and Brazil Law twins are real twin laws, but they are only observed in quartz, never calcite. And Romanesque Law twinning is not real, for any mineral species. It also provides two sets of Miller indices for each twin law, while every twin law is in fact defined by a single set of Miller indices. And the provided Miller indices are themselves totally incorrect for all of the listed twin laws. Also worth noting: while these combinations of -1, 0, and 1 are at times valid Miller indices, they would be extremely unusual for twinning, as -1 is almost never observed in twins.
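To make the structural point concrete, here’s a minimal sketch of how a twin law could be represented as data, based only on the description above: one mineral species, one crystal system, and a single set of Miller indices. The field names and the index values are placeholders I made up for illustration, not real mineralogical definitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TwinLaw:
    """A twin law as described above: fixed for a given mineral species."""
    name: str              # human-approachable name, e.g. "Brazil Law"
    mineral: str           # the species the law is defined for
    crystal_system: str    # e.g. quartz is trigonal, aquamarine is hexagonal
    miller_indices: tuple  # a single set of indices defining the twin orientation

# Placeholder entry: the indices here are illustrative only, NOT the real definition.
brazil_law = TwinLaw(
    name="Brazil Law",
    mineral="quartz",
    crystal_system="trigonal",
    miller_indices=(1, 1, 0),  # hypothetical values
)
```

ChatGPT’s answers break this shape in two ways at once: they attach laws to the wrong species, and they supply two sets of indices where the definition calls for one.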

Admittedly, this sort of task is probably tricky for a language model — in some cases, there’s contradictory information on the internet about twin laws and mineral species. But surely the algorithm could at least be consistent within a single chat instance about which fact it’s following? Or it could equivocate, the way it loves to do.

Here it is being mealy-mouthed about climate change when I tried to prompt it to be denialist:

Here it is stating the boundaries of its abilities:

And if you give it nonsense, sometimes it can tell:

when should you boil a carbon offset project?
I'm not familiar with the phrase "boil a carbon offset project." Can you please provide more context or clarify your question?



when should you cancel a carbon offset project?
Carbon offset projects can be cancelled for various reasons, including failure to meet project requirements, changes in market conditions, and other unforeseen circumstances. Here are some reasons when a carbon offset project should be cancelled:

Finally, I turned to a niche subject about which I’ve gone down a social media rabbit hole: the theory that Taylor Swift has had same-sex relationships in the past. The likes of Rolling Stone have written about this, and there are robust discussions on Reddit, TikTok, and Twitter as fans dissect her life for evidence. ChatGPT wisely does not take a stance on whether this is true (and I won’t either), but it does hugely misrepresent the arguments:

Karma is a track from Taylor Swift’s most recent album, Midnights. Reputation was released on November 10, 2017, but it did not contain a song called Karma (though there were conspiracy theories that it would have a sister album called Karma, which is probably where GPT got the idea). The analysis of the song demonstrates a plausible understanding of the concept of karma, but no awareness of the song itself, which has different themes. And as for the “gaylor” interpretation? It’s totally wrong. GPT picked Hayley Kiyoko because she’s a famously lesbian singer, so it did pick up on what “gaylor” implies. But she and Taylor Swift are not collaborators and have not been rumored to be together, now or in the past (if you’re curious, the primary women with whom Taylor Swift has been associated are Karlie Kloss and Dianna Agron). One note: ChatGPT’s training data gives it no knowledge of Midnights, which came out after the dataset was closed. I’d be more shocked if it had gotten this question right. But shouldn’t it at least be able to tell that it doesn’t know? It did that successfully above, when I asked about boiling carbon offsets.

I’m not the first to reckon with ChatGPT’s deep unreliability, but I’m throwing my two cents in. This algorithm is impressive and amazing, but it’s nowhere close to being safe. I’ve seen articles about people turning to ChatGPT for recipes or homework help or advice, and maybe that’s fine, but when it comes to truth, it has little to offer. Humans have no trouble generating misinformation, so I don’t worry that it’ll deluge us with falsehoods; the falseness itself isn’t the problem. The problem is that it isn’t trustworthy: if it doesn’t know something, it just makes it up. People do that too, of course, but we are used to that. We look for it. We fact-check.

If we’re turning to this algorithm to do our fact-checking, though, we’re leading ourselves astray. Unlike a stack of Google search results through which we can comb, checking citations and making judgements, GPT just tells us one thing. And if it’s wrong, we have to look outside it to know.

ChatGPT also seems incapable of assessing its own knowledge.

can I trust you?
As an AI language model, I am programmed to provide accurate and unbiased responses to the best of my abilities. However, I am not capable of having personal intentions or emotions, and my responses are based solely on my programming and the information provided to me. So, while I cannot be "trusted" in the traditional sense of the word, you can rely on me to provide information and answers to your questions to the best of my knowledge and capabilities.

So you can’t just ask it to tell you something and then ask it to assess its own correctness. It’s convinced that it generates accurate information, even with all the evidence to the contrary.

what should I use you for?
As an AI language model, I can help you with various tasks such as answering questions, providing information, generating ideas, offering suggestions, assisting in writing tasks, and more. You can use me for any task that requires natural language processing, and I will do my best to provide you with accurate and helpful responses.



Are you more useful for ideating or fact-checking?
As an AI language model, I can be useful for both ideating and fact-checking, as well as other tasks such as answering questions, generating text, and more.

Going back to the mineral twinning discussion, I decided to try to get some sense of accuracy out of the chatbot. But when I asked it to cite its sources, it made some up:

The information about Brazil Law twins is closer to correct than before, but the sources are fake. The Quartz Page is a website, not a book, and it was not created by Andreas Börner. The information attributed to that book is not accurate. And as far as I can tell, the paper it cites does not exist.

But none of this should be terribly surprising. Even though ChatGPT was trained on a huge swath of the internet, its objective is to simulate language, not to generate or recall knowledge. We can’t really fault it for failing where it isn’t meant to succeed, but we must keep in mind what it can and can’t do, and what its creators are and aren’t trying to do. When we turn to it for facts or knowledge, it’s quite likely to lead us astray in a very convincing way. It’s like trusting your least trustworthy friend: you know the guy, the pathological liar who tells compelling stories. Super fun on a night out. Not so great as a coworker, though.
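To make that point concrete, here’s a toy sketch of the “predict a plausible next word” objective. It’s a tiny bigram model I wrote for illustration, nothing like ChatGPT’s actual architecture, and the little corpus just echoes the hiking example above (one true elevation, one false one).

```python
import random
from collections import defaultdict

# Toy corpus containing both a true and a false "fact" about the summit.
corpus = (
    "the summit is 3176 feet tall . "
    "the summit is 2257 feet tall . "   # a falsehood sitting right next to the truth
    "the trail is 5.6 miles long . "
).split()

# Bigram table: which words have been seen following each word.
next_words = defaultdict(list)
for current, following in zip(corpus, corpus[1:]):
    next_words[current].append(following)

def generate(start: str, length: int = 8) -> str:
    """Sample a plausible continuation; truth never enters into it."""
    words = [start]
    for _ in range(length):
        candidates = next_words.get(words[-1])
        if not candidates:
            break
        words.append(random.choice(candidates))  # frequency-weighted plausibility, nothing more
    return " ".join(words)

print(generate("the"))  # may confidently emit either elevation; it has no way to know which is real
```

The sampling step only ever asks “what usually comes next?”, never “which of these is true?” Scale that objective up enormously and you get fluent text, not a fact-checker.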

Which is why it’s so alarming to me that people are using it as a coworker. It’s writing code and articles, summarizing papers, and suggesting business decisions. I’d argue it’s not good for this purpose, and we’re mistaking confidence for competence (of course, that’s something we’ve been doing for millennia, if my understanding of history is correct). Sure, ChatGPT can brainstorm, but it also eagerly makes shit up, and it (and you) probably can’t tell the difference.

These large language models have a fundamentally different purpose than something like Google or Wikipedia, or, for that matter, humans answering the sorts of questions asked in this post. When we ask a question, we’re not looking for nice-sounding words that could plausibly answer the question. We’re looking for the answer. And that’s not what ChatGPT is for.

When I need to know something, I’ll stick with Google for now.