
Morphology & LLMs

Questions:​

  1. What systematic patterns characterise LLM errors in the morphological analysis of English conversational data?

  2. To what extent can scaffolded prompting interventions improve analytical reliability?

Unpacking the questions

  1. What systematic patterns characterise LLM errors in the morphological analysis of English conversational data?

What's really being asked here is: what's the gap between what linguists mean by morphological analysis and what LLMs produce when asked to do it?

A. Process: what error patterns do LLMs make?

  • ablauts? ("sing" - "sang" - "sung")

  • suppletion? ("go" - "went")

  • inflectional contractions? ("I'm", "what's", "gonna")

  • these need to be compared to regular inflection patterns as a control ("walk" - "walked" - "walks")

  • overfitting

B. Methodology: how do we analyse errors?

The benchmark for error is straightforward: "run" - "runned" is simply wrong; for our purposes, no theory is needed to determine that.
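A minimal sketch of how that benchmark comparison could be scored, assuming both the hand-done and the LLM analyses are recorded as (surface form, lemma, process) triples - the field names and category labels here are illustrative assumptions, not a fixed coding scheme:

# Sketch: score LLM output against a hand-done gold analysis.
# Field names and process labels are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Analysis:
    surface: str   # form as it appears in the transcript, e.g. "ran"
    lemma: str     # citation form, e.g. "run"
    process: str   # e.g. "regular", "ablaut", "suppletion", "contraction"

def score(gold: list[Analysis], predicted: list[Analysis]) -> dict:
    """Count mismatches between gold and predicted analyses of the same items."""
    mismatches = [(g, p) for g, p in zip(gold, predicted)
                  if (g.lemma, g.process) != (p.lemma, p.process)]
    return {
        "items": len(gold),
        "errors": len(mismatches),
        "error_rate": len(mismatches) / len(gold) if gold else 0.0,
        "mismatches": mismatches,   # kept for qualitative inspection
    }

# e.g. the LLM labels "ran" as a regular past-tense form rather than ablaut
gold = [Analysis("ran", "run", "ablaut")]
pred = [Analysis("ran", "run", "regular")]
print(score(gold, pred))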


Notes: 


  • We'd expect LLMs to fail with low-frequency irregular forms.

  • Most interestingly, given how straightforward English morphology is (comparatively), if LLMs make errors with REGULAR forms, that would be diagnostic.

  • Also interesting - if LLMs struggle with high-frequency IRREGULAR forms (i.e. forms well represented in training data), is there a pattern to that failure?

  • Still interesting - even with the expected errors (low-frequency irregular forms), do they cluster around one of the theoretical approaches to morphology (see below)? This might indicate what the internal representation of morphological patterns within the LLM is - something we don't presently know much about.


So, theoretical frameworks for morphology come into play at the stage of interpreting the results:


  • item-and-arrangement (IA)

  • item-and-process (IP)

  • word-and-paradigm (WP)

  • (newer ones?)

C. Methodology: which / how many LLMs?

Against which benchmark am I evaluating the LLM's output? Results should be compared to the outputs of hand-done analysis of the same data. The LLM side could be either:


  • Multiple (e.g. the "big 3") - ChatGPT, Claude, Gemini - the problem would be workload, the payoff would be comparison. TBH, I think that's too much for an MA to do well.


  • Single - one of the big 3... gives methodological consistency (no cross-model variation to confound results), much more manageable, and still interesting in its own right, but not necessarily generalisable. Still has its own issues:​

 

  1. Which model? Claude Haiku (fastest and cheapest to run, the least capable of the three - though still very capable), Claude Sonnet (the default Claude that most people with a free account are likely to use for harder tasks), or Claude Opus (the most capable - likely to give better results, with its capabilities likely to reach all users in the next release, but the most expensive to run and currently restricted to paid tiers).

  2. Even when that decision is made, I need to decide whether to run with or without "extended thinking" - the same rationales as above apply.

  3. Which model lab from the big 3? Currently the "top" model by benchmarking changes with every release; the big 3 simply cycle through the top spot from month to month. That suggests there's no methodological reason to choose one lab over another, so stick with Claude: (a) I already pay for it, (b) I like its personality.

  4. Version: as model labs update their models every 3 months or so, the version number/date will need to be explicit to allow reproducibility.


D. Data - how much to adequately survey the morphological terrain

  • All the terrain needs to be covered:

    • Latinate / Germanic affixes​

    • various levels of morphological sophistication (e.g. one, two or three transformations per word)

    • derivations & inflections (including contractions)

    • regular, irregular (common and uncommon)

All of these suggest that a coverage matrix will be needed, with transcripts visually inspected to ensure that all of the required terrain is present in the analysed corpus.
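A sketch of what that matrix check might look like in practice - the cell labels and minimum-count threshold are assumptions for illustration, and the real matrix would carry more dimensions (number of transformations per word, regular vs irregular, and so on):

# Sketch: track coverage of the morphological terrain across sampled transcripts.
# Cell labels and the minimum count are illustrative assumptions; the real
# matrix would include further dimensions (transformations per word, etc.).
from collections import Counter
from itertools import product

AFFIX_ORIGIN = ["latinate", "germanic"]
PROCESS = ["regular_inflection", "irregular_common", "irregular_rare",
           "derivation", "contraction"]

coverage = Counter()   # keyed by (affix_origin, process)

def log_item(affix_origin: str, process: str) -> None:
    """Record one attested form while hand-inspecting a candidate transcript."""
    coverage[(affix_origin, process)] += 1

def missing_cells(minimum: int = 2) -> list:
    """Cells of the terrain not yet attested at least `minimum` times."""
    return [cell for cell in product(AFFIX_ORIGIN, PROCESS)
            if coverage[cell] < minimum]

log_item("germanic", "irregular_common")   # e.g. "went"
log_item("latinate", "derivation")         # e.g. "nationalise"
print(missing_cells())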

  • Need conversational transcripts: BNC2014 corpus (the largest corpus of spoken English) includes British Library / BBC R4 "Listening Project" transcripts:

    • publicly available​

    • relatively contemporary 

    • familiar format (cf TMA02)

  • How many transcripts / how long each?

    • Crossley's "How many words" conference paper suggested that 150 words were enough for NLP tools to analyse L2 English. It's a different task, but the analogy is reasonable: 150-200 words per conversational transcript to be analysed by an AI for morphology. TMA02 showed that morphological processes were evident within that scope - more data than that from any one transcript increases the output without adding value.

    • How many? As many as it takes to fill the matrix of the terrain - a pure guess would be 15-20 sample transcripts, each of 150-200 words. (That's roughly 2,250-4,000 words in total, which feels about right and manageable.)

  • There's no need to worry about full regional sampling of accents, since that's not the object of study, as long as all morphological terrain is covered.​

E. Ethics

BNC2014 is an established, publicly accessible, already-anonymised corpus. But informed consent was only given for research purposes - a purpose that will not have included uploading data to a commercial AI company; this is purpose creep.


The only way that this can be ethical is if it can be ensured that the uploaded transcripts will not be used for model training or other purposes.


Currently, using any model off-the-shelf is not ethically permissible, as prompts (including all uploads) are kept by the model labs for future training of those AI models.


Using the API to upload data and prompts gives contractual guarantees of confidentiality, non-use for training, and deletion after 7 days; this approach is unworkable here, however, given that it requires programming every intervention, whereas the project demands an interactive workflow with continuous refinement of prompts.
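For reference, this is roughly what "programming every intervention" would involve via the Anthropic Python SDK - the model name is a placeholder, the prompt wording is illustrative, and the exact retention and non-use terms would still need checking against the current API agreement:

# Sketch of a single scripted intervention via the Anthropic API.
# The model identifier is a placeholder; pin the exact dated version used.
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-placeholder",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": ("Identify every inflected form in the following transcript "
                    "excerpt, giving its lemma and the morphological process "
                    "involved:\n\n<transcript excerpt here>"),
    }],
)
print(response.content[0].text)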


A mitigation strategy (here using Claude as an example) would be either:


  1. Toggle model training off for each conversation, or

  2. Use Incognito chats for each conversation.

 

Both have the same outcome. To enable internal project organisation, toggling training off for chats within Claude Projects makes sense (Incognito is unavailable within Projects).

 

Procedural safeguards:

  • log the mitigation strategy in force alongside each logged prompt;

  • log BNC2014 text IDs of data sent, plus line ranges.
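A sketch of what that log could look like as a simple CSV audit trail - the field names, file name and example values are illustrative assumptions:

# Sketch: one CSV row per prompt, recording the mitigation setting in force
# and exactly which BNC2014 material was sent. Field names are assumptions.
import csv
import os
from datetime import datetime, timezone

LOG_PATH = "prompt_log.csv"
FIELDS = ["timestamp", "model_version", "mitigation",
          "bnc2014_text_id", "line_range", "prompt_summary"]

def log_prompt(model_version: str, mitigation: str, text_id: str,
               line_range: str, prompt_summary: str) -> None:
    """Append one audit row, writing the header if the file is new."""
    new_file = not os.path.exists(LOG_PATH)
    with open(LOG_PATH, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model_version": model_version,
            "mitigation": mitigation,
            "bnc2014_text_id": text_id,
            "line_range": line_range,
            "prompt_summary": prompt_summary,
        })

log_prompt("claude-sonnet-placeholder (dated release)", "training toggled off",
           "<BNC2014 text ID>", "lines 40-75", "baseline morphology prompt")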

 

(Note: if a prompt (or uploaded data) is deemed to violate the Usage Policy (for example, material relating to child abuse or terrorism), that data will be retained for 2 years. Clearly, this is not applicable to this research project.)

F. Literature

Literature directly on this topic is likely to be sparse, which presents a difficulty for situating the work but also an opportunity for original contribution. Adjacent areas that will need to be considered:

  • NLP evaluation methodology — how do we assess language model capabilities systematically?

  • Computational linguistics tooling — the history of computer-assisted linguistic analysis

  • Error analysis in L2 acquisition — a well-developed framework for systematically characterising linguistic errors that could be adapted

  • Prompt engineering — the emerging (if somewhat atheoretical) literature on LLM steering

  • Philosophy of AI — questions about what "understanding" means for statistical systems

  • Morphological theoretical frameworks - IA, IP, WP at least. I think the newer frameworks might be a step too far, and possibly a "future research" direction.


  2. To what extent can scaffolded prompting interventions improve analytical reliability?

A. Define scaffolded interventions

"Scaffolded" needs defining explicitly: the term is well understood in AI, less so in linguistics.

Possible scaffolding strategies:

  • Explicit analytical frameworks provided in the prompt

  • Worked examples (few-shot prompting)

  • Step-by-step instructions (chain-of-thought)

  • Error-checking routines ("now verify your analysis by...")

  • Iterative refinement (multi-turn correction)​
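As an illustration, a baseline prompt and one scaffolded condition (worked example plus step-by-step instructions) might look like this - the wording is purely illustrative and would be refined iteratively during the project:

# Sketch: baseline vs. one scaffolded prompt condition (few-shot + step-by-step).
# Wording is illustrative only.
BASELINE = (
    "Analyse the morphology of every word in this transcript excerpt:\n\n{excerpt}"
)

SCAFFOLDED = (
    "You are carrying out a morphological analysis of conversational English.\n"
    "For each word: (1) give the lemma, (2) name the morphological process\n"
    "(regular inflection, ablaut, suppletion, derivation, contraction),\n"
    "(3) check your answer against the lemma before moving on.\n\n"
    "Worked example:\n"
    '  "went" -> lemma "go", process: suppletion\n\n'
    "Now analyse this excerpt:\n\n{excerpt}"
)

def build_prompt(template: str, excerpt: str) -> str:
    """Fill either template with a transcript excerpt."""
    return template.format(excerpt=excerpt)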

B. Reliance on Q1 findings

  • If errors cluster around ablaut → scaffold with explicit process-based instructions

  • If errors cluster around low-frequency forms → scaffold with exemplar paradigms

  • If errors are random → scaffolding may not help (itself a finding)

C. Measuring improved reliability

  • Same test set, before/after scaffolding

  • Error rate comparison (quantitative)

  • Error type comparison (qualitative — do the kinds of errors change?)
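A sketch of that before/after comparison, assuming each test item has already been marked correct/incorrect and, where incorrect, tagged with an error type (the tags and counts below are illustrative):

# Sketch: before/after comparison - overall error rate plus breakdown by type.
# Error-type tags are illustrative assumptions.
from collections import Counter

def summarise(results: list) -> tuple:
    """results: list of (is_error, error_type or None), one per test item."""
    errors = [etype for is_error, etype in results if is_error]
    rate = len(errors) / len(results) if results else 0.0
    return rate, Counter(errors)

before = [(True, "ablaut"), (True, "ablaut"), (False, None), (True, "contraction")]
after  = [(False, None), (True, "contraction"), (False, None), (False, None)]

rate_before, types_before = summarise(before)
rate_after, types_after = summarise(after)
print(f"error rate: {rate_before:.2f} -> {rate_after:.2f}")
print("before:", types_before, "after:", types_after)   # do the kinds of errors change?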

D. Methodology 

I would need to ensure that there was no cross-contamination between different prompting strategies/sessions. Options for clean A/B testing:

  • Separate conversations - simplest approach. Test scaffold A in conversation 1, scaffold B in conversation 2. Each starts clean.

  • Incognito mode - Claude has a temporary chat mode that doesn't contribute to memory.

  • API access - rigorous and systematic: run tests via the API, where there is complete control over what goes into each context window (see the sketch after this list). Prompts would need to be scripted, but freedom from cross-contamination is guaranteed. The gold standard for reproducibility, but overkill, I think.

  • Projects — different Claude Projects have separate memory spaces.
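The isolation logic behind the API option, sketched below - run_prompt() is a hypothetical wrapper around a single self-contained API call (as in the ethics section); the point is simply that each condition-by-excerpt pair gets a brand-new context:

# Sketch: each (scaffold condition, excerpt) pair runs in a fresh, single-turn
# context, so nothing from one condition can leak into another.
def run_prompt(prompt: str) -> str:
    # Hypothetical wrapper around one self-contained API call; stubbed here.
    return "<model response>"

def run_conditions(conditions: dict, excerpts: list) -> dict:
    results = {}
    for name, template in conditions.items():
        for i, excerpt in enumerate(excerpts):
            # The only content the model ever sees is this single prompt.
            results[(name, i)] = run_prompt(template.format(excerpt=excerpt))
    return results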

E. Limitations

  • Scaffolding that works for one model may not transfer

  • Overfitting to specific error types - in other words, the danger of prompting to pass the test rather than address the underlying principle/cause. This would fix the problem for this LLM on this corpus, but not be generalisable.

  • An intervention might improve performance without telling us why, or with multiple possible explanations; that suggests interventions need to be carefully designed to limit that possibility.
