An independent research lab studying the internals of large language models: how they reason, how they adapt, and how reinforcement learning rewires them. Papers in the open. Loop, in development. Looking for collaborators.
Drafted in the open, with code and data released alongside.
Praxor Lab studies the internals of large language models: how they reason, how they adapt, and how reinforcement learning rewires them. We write it up in the open.
Two papers in draft. Loop in development. The work runs on open code, careful interventions, and claims sized to evidence.
Two papers in active development. One on the internals of reasoning models, one on parameter-efficient adaptation. The tracks below are the directions those papers grow into. Each is meant to produce a small, verifiable result, not a manifesto.
Probing the internal computations of large language models. Causal interventions, activation analysis, and circuit-level methods to understand which parts of a model are actually doing the work.
How language models reason step by step, and whether the written trace is causally responsible for the answer. We compare prompted and trained reasoners and measure step-level effects.
Adapting large pretrained models to downstream tasks without retraining from scratch. Low-rank decompositions, subspace methods, and rank allocation guided by the model's own spectrum.
Reproducible experiments, released code, and claims sized to evidence. Negative results count. Ablations are load-bearing.
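As a rough illustration of what "rank allocation guided by the model's own spectrum" can mean, here is a minimal sketch, not the lab's actual method. It assumes each layer's singular values are already available as a list of floats, and it uses a simple rule: pick the smallest rank whose squared singular values cover a target fraction of the total. The function names, the 0.9 energy threshold, and the `max_rank` cap are all assumptions made for the example.

```python
from typing import Dict, List


def rank_for_energy(singular_values: List[float], energy: float = 0.9) -> int:
    """Smallest rank whose squared singular values cover `energy` of the total.

    Illustrative rule only; the energy threshold is an assumed hyperparameter.
    """
    if not 0.0 < energy <= 1.0:
        raise ValueError("energy must be in (0, 1]")
    # Sort squared singular values from largest to smallest.
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0.0:
        return 0  # nothing to capture in an all-zero (or empty) spectrum
    running = 0.0
    for rank, value in enumerate(squared, start=1):
        running += value
        if running >= energy * total:
            return rank
    return len(squared)


def allocate_ranks(
    spectra: Dict[str, List[float]], energy: float = 0.9, max_rank: int = 64
) -> Dict[str, int]:
    """Assign each layer an adapter rank from its own spectrum, capped at max_rank."""
    return {
        name: min(max_rank, rank_for_energy(values, energy))
        for name, values in spectra.items()
    }


# Hypothetical spectra: one layer with a fast-decaying spectrum, one flat.
example = {
    "layer_a": [4.0, 2.0, 1.0, 0.5],
    "layer_b": [1.0, 1.0, 1.0, 1.0],
}
ranks = allocate_ranks(example, energy=0.9, max_rank=3)
```

Under this rule, a layer whose spectrum decays quickly gets a small rank, while a layer with a flat spectrum runs into the cap. Whether squared-energy coverage is the right criterion is itself an empirical question.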
An opinionated stack for post-training and model specialization. Training APIs, managed datasets, and research-native environments, with circuit-level traces of what your reward signal is doing to the model on every checkpoint.
Built by interpretability researchers, for teams who care which circuits their reward is actually moving.
Thirty minutes SSH'd into a fresh GPU box before you find out the driver is wrong, the CUDA version is wrong, or the trainer hangs at step zero.
Terabytes of trajectories on object storage with no dedup, no eval index, no version. Someone writes a new ETL every time you want to rerun an ablation.
Loss goes down, eval goes up, and you have no idea which circuits the reward moved. Reward hacking only shows up later, in deployment.
PPO, GRPO, DPO, and RLHF behind one SDK. Reward shaping, KL control, and checkpoint-level interpretability turned on by default. Not a library you bolt on.
loop.train(recipe='grpo', base='llama-3.1-70b')
loop.reward(fn=my_scorer, kl_target=0.05)
loop.checkpoint(every=200, with_circuits=True)
Versioned trajectory storage at terabyte scale. Dedup, filter, mix, and slice without writing a new pipeline. Eval sets and rollouts share one index.
loop.dataset.attach('trajectories.parquet')
loop.dataset.dedup(by='prompt_hash')
loop.dataset.mix(weights={...})
Pre-warmed GPU pools with research-native images. Notebooks, training, and eval share one environment. Every node is preflighted before it reaches you.
loop env launch --gpus 8xH100
loop env attach --notebook
loop env preflight # ✓ all nodes healthy
Your reward shaped a circuit. Loop tells you which one, every run, as a default artifact. Interpretability is in the CI loop, not an opt-in plugin.
Generated automatically with every checkpoint. No instrumentation in your training loop. Available as JSON, in the dashboard, or via loop diff <ckpt>.
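For readers wondering what deduplicating rollouts by prompt hash involves, here is a minimal standalone sketch. It is not Loop's implementation (Loop is still in development). The record layout, the `prompt_hash` and `dedup_by_prompt_hash` helpers, and the whitespace and case normalization are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Iterable, List


def prompt_hash(prompt: str) -> str:
    """Hash a prompt after light normalization (assumed: collapse whitespace, lowercase)."""
    normalized = " ".join(prompt.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_by_prompt_hash(rollouts: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first rollout seen for each distinct prompt hash, preserving order."""
    seen = set()
    kept = []
    for rollout in rollouts:
        key = prompt_hash(rollout["prompt"])
        if key not in seen:
            seen.add(key)
            kept.append(rollout)
    return kept


# Hypothetical rollouts; the first two differ only in case and spacing.
rollouts = [
    {"prompt": "What is 2+2?", "response": "4"},
    {"prompt": "what is   2+2?", "response": "four"},
    {"prompt": "Name a prime.", "response": "7"},
]
unique = dedup_by_prompt_hash(rollouts)
```

Keeping the first occurrence is one policy among several. Keeping the highest-reward rollout per prompt, or keeping up to k rollouts per prompt, are reasonable alternatives depending on the training recipe.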
There are good open-source RL trainers. None of them ships a managed dataset layer, a preflighted GPU environment, and circuit-level interpretability in the same box. Loop is the opinionated stack, not a piece of one.
Reflects publicly available capabilities as of Q2 2026. We're happy to be wrong: tell us at loop@praxorlab.com.
$ loop init my-rl-run --base llama-3.1-70b
→ env: 8×H100 ready (preflight passed in 11s)
→ base: llama-3.1-70b · spectrum cached
$ loop dataset attach trajectories.parquet
→ ingested 14.2M rollouts · deduped to 9.8M
$ loop train --recipe grpo --steps 4000
→ step 0400 loss 1.84 → 1.62 kl 0.04
→ step 2000 eval pass@1: 0.412 → 0.491
→ step 4000 eval pass@1: 0.491 → 0.508
✓ circuits moved: 14 attention heads in L23–L28
✓ reward-hacking probe: clean (held-out, n=2k)
! drift on L31.MLP: flagged for review
→ checkpoint: loop://runs/my-rl-run/ckpt-final
In development. Looking for design partners.
For teams running serious post-training. Tell us what you're training and where the tooling is breaking, and we'll keep you in the loop as the alpha opens up.
Most of what we do is research. For teams that need the same methods applied to a specific model or product, we take on a small number of engagements per quarter, built around the interpretability and adaptation work the lab is already doing.
Parameter-efficient fine-tuning for your domain, sized to your compute budget.
Mechanistic analysis of how a reasoning model is actually arriving at its answers.
Domain-specific evals and adversarial review, not vanity benchmarks.
Joint work on a focused interpretability or adaptation question, ending in a co-authored draft.
1-hour call plus async intake. We learn your model, your goal, and what good looks like.
Agree on the question, the eval, and the fail conditions. Scope is locked before we touch compute.
Work in the open. Weekly working session and intermediate results in a shared notebook.
Written report, a reproducible repo, and a working session with your team.
Open to a small number of engagements.
Send a short note. We'll reply with whether the shape fits, and if it does, a 30-minute discovery call to scope it.
Two papers in active draft, Loop in development, and room for a small group of collaborators. Pick the level of involvement that fits where you are.
Work alongside Praxor researchers on one of the papers in active draft, or scope a related question of your own. Small group, close collaboration, public output.
For researchers with their own ongoing work who want to contribute to a specific draft on a lighter time commitment.
An open reading group on interpretability and adaptation. For anyone who wants to follow along and discuss the work.
The lab works in small, scoped projects rather than open-ended programs. Four phases over roughly six months.
Tell us your background and what draws you to interpretability or adaptation. No prior publications required. We look for care about the question and willingness to do real empirical work.
We sit down with accepted fellows and pick something to work on. Usually a contribution to a draft, sometimes a related question of your own.
Six months of focused work. Weekly working sessions and intermediate results in the open from day one.
We draft together, share early for outside review, and submit to a workshop or conference. The artifact is the paper plus the code, data, and notebooks needed to reproduce it.
We're looking for fellows and collaborators. Students, engineers, and self-directed researchers are all welcome. What we need most is people who care about getting the empirical work right.