| Title: | Large Language Model (LLM) Tools for Psychological Text Analysis |
|---|---|
| Description: | A collection of large language model (LLM) text analysis methods designed with psychological data in mind. Currently, LLMing (aka "lemming") includes a text generation method and a text anomaly detection method based on the angle-based subspace approach described by Zhang, Lin, and Karim (2015) <doi:10.1016/j.ress.2015.05.025>. |
| Authors: | Lindley Slipetz [aut, cre], Teague Henry [aut], Siqi Sun [ctb] |
| Maintainer: | Lindley Slipetz <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.2.1 |
| Built: | 2026-05-25 07:08:21 UTC |
| Source: | https://github.com/sliplr19/llming |
Cleans a text column and converts it to a numeric embedding matrix or dataframe, with one row per input text and one column per embedding dimension. Supports pure R and Python-backed embedding methods.
embed(dat, method = c("E5", "Qwen3", "NV-Embed", "BERT", "GloVe"), text_col = "text", py_e5_qwen = NULL, py_nv = NULL, hf_cache_root = NULL, max_length = 256, prefer_gpu = FALSE, auto_install_sentence_transformers = FALSE)
| Argument | Description |
|---|---|
| dat | A dataframe containing one text per row. |
| method | Character scalar specifying the embedding method. One of "E5", "Qwen3", "NV-Embed", "BERT", or "GloVe". |
| text_col | Character scalar giving the name of the text column in dat. Defaults to "text". |
| py_e5_qwen | Optional path to the Python executable to use for the "E5" and "Qwen3" methods. |
| py_nv | Optional path to the Python executable to use for the "NV-Embed" method. |
| hf_cache_root | Optional path to a cache directory for Hugging Face models and related files. Required for Python-backed methods and used as a temporary cache for BERT when not otherwise supplied. |
| max_length | Integer maximum sequence length passed to the tokenizer. Defaults to 256. |
| prefer_gpu | Logical; defaults to FALSE. |
| auto_install_sentence_transformers | Logical; defaults to FALSE. |
The input text is lightly preprocessed before embedding. This preprocessing:
removes punctuation, symbols, numbers, and URLs,
converts text to lowercase,
removes English stopwords,
transliterates text to ASCII.
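The preprocessing steps listed above can be approximated in base R. This is a minimal sketch, not the package's implementation: the function name clean_text_sketch, the tiny stopword list, and the order of operations (URLs are removed first so that punctuation stripping does not break them apart) are all assumptions made for illustration.

```r
# Hypothetical helper approximating the listed preprocessing steps.
# The stopword list is a small stand-in, not a full English stopword list.
clean_text_sketch <- function(x, stopwords = c("i", "and", "the", "a", "it", "that")) {
  # Remove URLs before anything else.
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE, ignore.case = TRUE)
  # Transliterate to ASCII (results for accented characters vary by platform).
  x <- iconv(x, to = "ASCII//TRANSLIT", sub = "")
  # Lowercase, then replace punctuation, symbols, and numbers with spaces.
  x <- tolower(x)
  x <- gsub("[^a-z]+", " ", x)
  # Split into tokens and drop stopwords.
  tokens <- strsplit(trimws(x), " +")
  vapply(tokens, function(tok) paste(tok[!tok %in% stopwords], collapse = " "),
         character(1))
}

cleaned <- clean_text_sketch(c(
  "I slept well and feel great today!",
  "Visit https://example.com NOW, 3 times"
))
print(cleaned)
```

Because stopword removal and punctuation stripping discard information, texts that differ only in those respects map to the same cleaned string.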
Method-specific behavior:
"GloVe" uses a pure R workflow via text2vec and returns
document-level mean embeddings.
"BERT" uses text::textEmbed() with the "bert-base-uncased"
model on CPU.
"E5" and "Qwen3" use Python and sentence-transformers.
"NV-Embed" uses Python and transformers.
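To illustrate what "document-level mean embeddings" means for the "GloVe" method, the sketch below averages word vectors per document in base R. The word vectors are made-up toy values, the function name doc_mean_embedding is hypothetical, and returning a zero vector for documents with no known words is an assumption; the package itself relies on text2vec and may handle these cases differently.

```r
# Toy word-vector table (made-up values standing in for trained GloVe vectors).
word_vectors <- rbind(
  slept = c(0.1, 0.3),
  well  = c(0.5, 0.1),
  great = c(0.9, 0.5)
)

# Hypothetical helper: average the vectors of the known words in each text.
doc_mean_embedding <- function(texts, word_vectors) {
  dims <- ncol(word_vectors)
  rows <- lapply(strsplit(texts, " +"), function(tok) {
    known <- tok[tok %in% rownames(word_vectors)]
    # Assumption: documents with no known words get a zero vector.
    if (length(known) == 0) return(rep(0, dims))
    colMeans(word_vectors[known, , drop = FALSE])
  })
  emb <- do.call(rbind, rows)
  colnames(emb) <- paste0("dim", seq_len(dims))
  emb
}

emb <- doc_mean_embedding(c("slept well", "great unknown", "nothing"), word_vectors)
print(emb)
```

The result has one row per input text and one column per embedding dimension, matching the shape described for the return value.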
A numeric matrix or dataframe with one row per input text and one column per embedding dimension.
df <- data.frame(
  text = c(
    "I slept well and feel great today!",
    "I saw friends and it went well.",
    "I think I failed that exam. I'm such a disappointment."
  )
)
emb_dat <- embed(dat = df, method = "GloVe", text_col = "text")
Converts each column of a pCOS score matrix into binary indicators.
G_thres(pCOS_mat, theta)
| Argument | Description |
|---|---|
| pCOS_mat | Dataframe of pCOS values |
| theta | Numeric threshold |
A matrix of 0s and 1s indicating which cells meet the threshold.
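The thresholding step can be sketched in one line of base R. The function name g_thres_sketch is hypothetical, and treating "meets the threshold" as "greater than or equal to theta" is an assumption; the package may use a different comparison.

```r
# Hypothetical helper: 1 where a pCOS value is at or above theta, 0 otherwise.
g_thres_sketch <- function(pcos_mat, theta) {
  m <- as.matrix(pcos_mat)
  (m >= theta) * 1L   # logical matrix converted to 0/1 integers
}

toy_pcos <- data.frame(A = c(0.05, 0.20), B = c(0.10, 0.00))
print(g_thres_sketch(toy_pcos, theta = 0.1))
```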
z_dat <- data.frame("A" = rnorm(500, 0, 1), "B" = rnorm(500, 0, 1), "C" = rnorm(500, 0, 1))
snn <- sim_SNN(z_dat, 10, 5)
vec_snn <- vector_SNN(z_dat, snn)
pCOSdat <- pCOS(z_dat, vec_snn)
G <- G_thres(pCOSdat, theta = 0.1)
Computes a normalized Mahalanobis distance score. Only features with nonzero scores in S receive nonzero Mahalanobis scores.
normahalo(z, rs, S)
| Argument | Description |
|---|---|
| z | Dataframe of z scores |
| rs | List of reference sets |
| S | Dataframe of numeric values |
A dataframe of local outlier scores
Applies pCOS_row() to corresponding rows of two data frames, returning one pCOS value per row.
pCOS(z_dat, vec_SNN)
| Argument | Description |
|---|---|
| z_dat | Numeric dataframe, typically z-scores |
| vec_SNN | Numeric dataframe, typically the output of vector_SNN |

A dataframe with the same dimensions as z_dat.
Given two numeric vectors, computes an average cosine-based similarity.
pCOS_row(z, v_SNN)
| Argument | Description |
|---|---|
| z | Numeric vector |
| v_SNN | Numeric vector, same size as z |
A numeric vector
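As a rough illustration of an "average cosine-based similarity", the sketch below follows the pairwise cosine idea from Zhang et al. (2015): for the difference vector between a point and its reference mean, each dimension is scored by the mean absolute cosine between that vector and the dimension's axis over all two-dimensional planes containing that axis. That pCOS_row() works this way is an assumption; the function name pcos_row_sketch and the eps replacement for zero differences are also hypothetical.

```r
# Hypothetical sketch of a pairwise-cosine score per dimension.
# z: observation vector; v_snn: mean vector of its reference set.
pcos_row_sketch <- function(z, v_snn, eps = 1e-5) {
  stopifnot(length(z) == length(v_snn), length(z) >= 2)
  l <- z - v_snn
  l[l == 0] <- eps   # assumption: small constant avoids 0/0
  vapply(seq_along(l), function(j) {
    # Mean of |l_j| / sqrt(l_j^2 + l_m^2) over all other dimensions m.
    mean(abs(l[j]) / sqrt(l[j]^2 + l[-j]^2))
  }, numeric(1))
}

print(pcos_row_sketch(c(4, 6), c(1, 2)))
```

Under this formulation, a dimension with a large score is one along which the point deviates strongly from its reference set relative to the other dimensions.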
Creates a list of the vectors of the top shared nearest neighbors for each row of the z dataframe.
rep_set(z, snn)
| Argument | Description |
|---|---|
| z | Dataframe of values of reference set |
| snn | Dataframe of shared nearest neighbors indices |

A list of dataframes, one per row of z, in which each row is the vector representation of one shared nearest neighbor.
Builds a shared nearest neighbors matrix and, for each row (observation), returns the indices of the top neighbors with the largest SNN overlap counts.
sim_SNN(z_dat, k, tops)
| Argument | Description |
|---|---|
| z_dat | A dataframe with numeric columns |
| k | An integer giving the number of nearest neighbors |
| tops | An integer giving how many shared nearest neighbors to return |
A dataframe of top rows with shared nearest neighbors
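The shared nearest neighbors (SNN) idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the function name snn_top_sketch is hypothetical, Euclidean distance is assumed, a point is excluded from its own neighbor list, and ties are broken by row order.

```r
# Hypothetical sketch: top SNN neighbors by overlap of k-nearest-neighbor lists.
snn_top_sketch <- function(z_dat, k, tops) {
  x <- as.matrix(z_dat)
  n <- nrow(x)
  d <- as.matrix(dist(x))   # pairwise Euclidean distances (assumed metric)
  diag(d) <- Inf            # a point is not its own neighbor
  # n x k matrix of k-nearest-neighbor indices.
  knn <- do.call(rbind, lapply(seq_len(n), function(i) order(d[i, ])[seq_len(k)]))
  # member[i, j] = 1 if j is among the k nearest neighbors of i.
  member <- matrix(0L, n, n)
  member[cbind(rep(seq_len(n), times = k), as.vector(knn))] <- 1L
  # overlap[i, j] = number of neighbors shared by i and j.
  overlap <- tcrossprod(member)
  diag(overlap) <- -1       # never return the observation itself
  top <- do.call(rbind, lapply(seq_len(n), function(i) {
    order(overlap[i, ], decreasing = TRUE)[seq_len(tops)]
  }))
  as.data.frame(top)
}

set.seed(42)
toy <- data.frame(A = rnorm(30), B = rnorm(30), C = rnorm(30))
head(snn_top_sketch(toy, k = 10, tops = 5))
```

Ranking by shared neighbors rather than raw distance is commonly motivated by the fact that distances become less discriminative in high-dimensional spaces such as embedding spaces.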
Generates text data based on the severity score from a given questionnaire. All prompt components and example texts are supplied by the user as function arguments.
text_datagen(prompts, examples, scenario = NULL, overall_rules = NULL, percentile_scaffold = NULL, item_rules = NULL, items = NULL, structure_rules = NULL, percentile_specification = NULL, band_specification = NULL, example_instruction = NULL, what_to_write = NULL, task_desc = NULL, target_min = 90L, target_max = 100L, temperature = 0.4, top_p = 0.9, repetition_penalty = 1.1, model_name = "meta-llama/Meta-Llama-3-8B-Instruct", batch_size = 2L, python = Sys.getenv("RETICULATE_PYTHON", "python"), env = NULL, output_file = NULL)
| Argument | Description |
|---|---|
| prompts | A data.frame with one row per diary to generate. Must contain at least a column indicating severity level. |
| examples | A data.frame of example diary texts, one per row; the example uses columns text and label. |
| scenario | Character string used in the SCENARIO section. |
| overall_rules | Character string describing global writing rules. |
| percentile_scaffold | Character string describing how percentiles map onto severity. |
| item_rules | Character string describing how to internally choose symptom patterns. |
| items | Character string of the battery under study. |
| structure_rules | Character string describing structural rules. |
| percentile_specification | Character string describing what the severity percentile means. |
| band_specification | Character string describing severity bands. |
| example_instruction | Character string introducing the example texts. |
| what_to_write | Character string describing what the model should write about. |
| task_desc | Character string for the system-level role description. |
| target_min | Integer minimum number of tokens to generate. |
| target_max | Integer maximum number of tokens to generate. |
| temperature | Numeric temperature for sampling. |
| top_p | Numeric top-p nucleus sampling value. |
| repetition_penalty | Numeric repetition penalty. |
| model_name | Model identifier string to pass to transformers. |
| batch_size | Integer passed through to the Python script. |
| python | Path to the Python executable. |
| env | Optional named character vector or list of environment variables. |
| output_file | Optional path to save the output CSV. |
A data.frame with columns id, severity, and response.
prompts <- data.frame(
  id = 1:2,
  severity = c(10, 80),
  num_examples = c(1, 1)
)
examples <- data.frame(
  text = c("Example A", "Example B"),
  label = c("group1", "group2"),
  stringsAsFactors = FALSE
)
Text anomaly detection method adapted from Zhang et al. (2015).
textanomaly(dat, k, tops, theta, text_method)
| Argument | Description |
|---|---|
| dat | A dataframe with text data, one text per row |
| k | An integer giving the number of nearest neighbors |
| tops | An integer giving how many shared nearest neighbors to return |
| theta | Numeric threshold |
| text_method | Character scalar specifying the embedding method. One of "E5", "Qwen3", "NV-Embed", "BERT", or "GloVe". |

A dataframe of local outlier scores.
Zhang L, Lin J, Karim R (2015). “An angle-based subspace anomaly detection approach to high-dimensional data: With an application to industrial fault detection.” Reliability Engineering & System Safety, 142, 482–497. ISSN 0951-8320. doi:10.1016/j.ress.2015.05.025.
For each row of the SNN index matrix, this function takes the indexed rows of the reference dataframe z and computes their column means, yielding one mean vector per observation.
vector_SNN(z, snn)
| Argument | Description |
|---|---|
| z | Numeric dataframe |
| snn | Dataframe of shared nearest neighbors indices |

A dataframe with the same dimensions as z.
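The averaging described for vector_SNN() can be sketched in base R. The function name vector_snn_sketch is hypothetical, and the sketch assumes that each row of snn holds row indices into z.

```r
# Hypothetical sketch: for each observation, average the rows of z
# indexed by that observation's shared nearest neighbors.
vector_snn_sketch <- function(z, snn) {
  z <- as.matrix(z)
  snn <- as.matrix(snn)
  means <- do.call(rbind, lapply(seq_len(nrow(snn)), function(i) {
    colMeans(z[snn[i, ], , drop = FALSE])
  }))
  as.data.frame(means)
}

toy_z <- data.frame(A = c(1, 2, 3), B = c(10, 20, 30))
toy_snn <- data.frame(V1 = c(2, 1, 1), V2 = c(3, 3, 2))
print(vector_snn_sketch(toy_z, toy_snn))
```

Each output row is the centroid of that observation's neighbors, so the result has one row per observation and one column per column of z.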
Standardizes each column to z-scores.
z_score(dat)
| Argument | Description |
|---|---|
| dat | A dataframe with numeric cells |

A numeric dataframe with the same dimensions as dat.
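Column-wise z-scoring can be sketched with base R's scale(). The function name z_score_sketch is hypothetical, and the use of the sample standard deviation (denominator n - 1, as in scale()) is an assumption about how the package standardizes.

```r
# Hypothetical sketch: center each column at 0 and rescale to unit
# sample standard deviation.
z_score_sketch <- function(dat) {
  as.data.frame(scale(dat, center = TRUE, scale = TRUE))
}

toy_dat <- data.frame(A = c(1, 2, 3), B = c(10, 20, 40))
print(z_score_sketch(toy_dat))
```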