📄 New: Status Paper Published
Thermometer encoding · Per-channel encoding selection ·
cos-time step decay · Data-limited scaling law · 99.0% MNIST / 58.7% CIFAR
Abstract
Every previous approach treated the MAJ3 output (uint32) as a number, either comparing it as a uint32 or reducing it to a scalar. All failed because the information is in the pattern of bits, not in the numeric value.
The Otto Score treats each of the 32 MAJ3 bits as an independent feature. For each class, neuron, and bit-position, it counts "how often is this bit = 1 for this class?" → Laplace-smoothed log-odds → Bayes log-Score. One pass of counting: 86.6%. No training, no float, no AdamW.
Iterative target-tuning (20 passes): 96.4%. Each iteration corrects log-odds for misclassified samples — a Perceptron-style correction in log-odds space. Still pure &|~ + int32. Beats AdamW (95.8%) with 0.6pp margin — the first DRAM-native path to 96%.
Otto Score 1-pass — 86.6%
No training, no float, no AdamW
- Random W0 → MAJ3 → Bayes log-Score
- 1× counting: Target[10][H][32] = log-odds
- Forward: &|~ + int32 addition
- 86.6% at H=2048, 4.8s on CPU
Otto Score Iterative — 96.4%
20 passes, beats AdamW
- Same forward: &|~ + int32
- log-odds correction (Perceptron-style)
- No float, no gradient, no AdamW
- Beats AdamW (95.8%) by 0.6pp
Results — MNIST 50K/10K, seed=42
Rows marked ✅ use a pure bit-logic forward (&|~): no float, no int32 matmul. Rows marked ❌ are float references.
| Procedure | Forward | Training | H | Eval | DRAM |
|---|---|---|---|---|---|
| Otto Score 1-pass ★ | MAJ3+int32 | 1-pass counting | 2048 | 86.6% | ✅ |
| Otto Score iterative ★ | MAJ3+int32 | 20-pass correction | 2048 | 96.4% | ✅ |
| Float AdamW W1-Only (ref) | matmul+ReLU | AdamW (20ep) | 2048 | 95.8% | ❌ |
| Otto Bridge+AdamW | Bridge+matmul | AdamW (3ep) | 512 | 95.4% | ❌ |
Otto Score 1-pass — Scaling with H
| H (neurons) | Eval | Time | Bit-Mass | Note |
|---|---|---|---|---|
| 8 | 76.4% | 24ms | 207 Kbit | |
| 16 | 80.3% | 43ms | 414 Kbit | |
| 32 | 83.0% | 84ms | 829 Kbit | |
| 64 | 84.6% | 194ms | 1.6 Mbit | |
| 128 | 85.0% | 382ms | 3.2 Mbit | |
| 256 | 85.3% | 688ms | 6.4 Mbit | |
| 512 | 86.2% | 1.2s | 12.9 Mbit | |
| 1024 | 86.5% | 2.4s | 25.7 Mbit | Requires int64 class_offset |
| 2048 | 86.6% | 4.8s | 51.5 Mbit | Plateau — 86%-Wall |
86%-Wall broken: MAJ3 compresses 196 containers → 1 uint32 (non-linear, with information loss). The Bayes log-Score is optimal for conditionally independent bits, and MAJ3 bits are only weakly correlated (max |r| < 0.1), so 1-pass counting is already close to what the log-likelihood objective can deliver. Without iterative correction, the wall stays at 86%. With iterative target-tuning: 96.4%, which beats AdamW (95.8%) by 0.6pp.
How It Works — Bayes log-Score on MAJ3 Bits
Three insights that eliminated every non-bit operation from the classifier.
Insight #1
MAJ3 output is a bit-string, not a number
The uint32 from majority_tree encodes 32 independent yes/no decisions. Treating it as a number (uint32 comparison) loses the pattern. The information is in the bits, not in the value.
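A small sketch of why numeric comparison throws the pattern away. The helper names `bit_distance` and `numeric_distance` are illustrative only and are not taken from the project source.

```c
#include <stdint.h>

/* Number of bit positions in which two MAJ3 words disagree (Hamming distance). */
int bit_distance(uint32_t a, uint32_t b) {
    uint32_t diff = a ^ b;
    int count = 0;
    while (diff) {
        diff &= diff - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Absolute difference when the same two words are read as plain numbers. */
uint32_t numeric_distance(uint32_t a, uint32_t b) {
    return a > b ? a - b : b - a;
}
```

0x7FFFFFFF and 0x80000000 are neighbours as numbers (difference 1) yet disagree in all 32 bit positions. 0x00000001 and 0x80000001 are far apart as numbers yet disagree in a single bit.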
Insight #2
Per-bit log-odds extract the signal
Each MAJ3 bit is a weak class predictor (~50% random, ~0.1% signal). Log-odds amplifies the signal by ~4× vs linear probability: near P = 0.5 the slope of log(P/(1-P)) is 1/(P(1-P)) = 4. The Bayes log-Score combines all 32 × H bits into a class decision that is optimal under the independence assumption.
Insight #3
Random W0 is enough — frozen, never trained
The random projection W0 balances MAJ3 input at ~50% 1s. This is the only condition where majority_tree produces informative features. Training W0 would destroy the balance.
The Architecture
W0: random uint32[H][NC] (frozen, never trained)
H0: MAJ3(~(in ^ W0[h]), NC) → uint32[H]
Target: int32[10][H][32] (class × neuron × bit)
1× counting: Target[k][h][b]++ when H0[h] bit b = 1 AND class = k
→ logit_convert: ln((t+1)/(N_k-t+1)) × 100000
Score (Bayes log-Score):
offset_k = Σ_h Σ_b log(1-P_k) × 100000 // precomputed
Score[k] = offset_k // initialization
+ Σ_h Σ_b [ y × Target[k][h][b] ] // bit=1: add log-odds
// bit=0: no extra term (log(1-P) in offset_k)
pred = argmax(Score)
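The scoring step above can be sketched in C as follows. This is a minimal sketch, not the project's inference source: the function name `otto_predict`, the flattened `[class][neuron][bit]` layout, and the 64-bit accumulator are assumptions (the scaling table only states that the class offset needs int64 from H=1024 on).

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CLASSES 10
#define BITS 32

/* Score one sample and return the argmax class.
   h0[h]     : MAJ3 word of neuron h
   target    : fixed-point log-odds, flattened as [class][neuron][bit]
   offset[k] : precomputed sum of the log(1-P) terms for class k */
int otto_predict(const uint32_t *h0, int H,
                 const int32_t *target, const int64_t *offset) {
    assert(H > 0);
    int best = 0;
    int64_t best_score = 0;
    for (int k = 0; k < NUM_CLASSES; k++) {
        int64_t score = offset[k];
        for (int h = 0; h < H; h++) {
            const int32_t *row = target + ((int64_t)k * H + h) * BITS;
            uint32_t word = h0[h];
            for (int b = 0; b < BITS; b++) {
                /* mask is all ones when bit b is set, else zero, so the
                   log-odds term is selected with & and never multiplied. */
                int32_t mask = -(int32_t)((word >> b) & 1u);
                score += row[b] & mask;
            }
        }
        if (k == 0 || score > best_score) {
            best = k;
            best_score = score;
        }
    }
    return best;
}
```

Ties go to the lower class index here; the source does not say how the original breaks ties.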
Why offset + logit instead of raw counts?
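P(k | y) ∝ P(k) × Π_h Π_b P(y[h][b] | k) // Bayes rule
log P(k|y) = log P(k) + Σ [ y·log(P) + (1-y)·log(1-P) ] + const // split y=1, y=0; P = P(bit=1 | k)
= log P(k) + Σ log(1-P) + Σ y·log(P/(1-P)) + const // offset + logit
Score[k] = offset_k + Σ y × logit[k][h][b] // compact form; const is class-independent, log P(k) is dropped (uniform class prior)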
offset = "base cost for this class when ALL bits are 0"
logit = "how strongly does bit=1 vote for this class?"
Raw counts have a systematic bias: a class with more active bits has a smaller bit=0 count (N_k - t), and a raw-count score never compensates for that, because the extra 1-bits don't fix the offset. With log-odds, log(P/(1-P)) and log(1-P) are coupled: when t increases, the logit rises and the offset falls, so the two compensate automatically.
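The offset + logit split is an exact identity, which a few lines of integer C can confirm. This is a sketch with hypothetical helper names; `log_p1[b]` and `log_p0[b]` stand for the fixed-point values of log(P) and log(1-P) of one neuron and one class.

```c
#include <stdint.h>

#define BITS 32

/* Direct form: every bit contributes, log(P) if set and log(1-P) if clear. */
int64_t score_direct(uint32_t word, const int32_t *log_p1, const int32_t *log_p0) {
    int64_t s = 0;
    for (int b = 0; b < BITS; b++)
        s += ((word >> b) & 1u) ? log_p1[b] : log_p0[b];
    return s;
}

/* The log(1-P) terms are summed once, ahead of time. */
int64_t precompute_offset(const int32_t *log_p0) {
    int64_t off = 0;
    for (int b = 0; b < BITS; b++)
        off += log_p0[b];
    return off;
}

/* Offset + logit form: only set bits add a term at inference time. */
int64_t score_offset_logit(uint32_t word, int64_t offset,
                           const int32_t *log_p1, const int32_t *log_p0) {
    int64_t s = offset;
    for (int b = 0; b < BITS; b++)
        if ((word >> b) & 1u)
            s += log_p1[b] - log_p0[b];   /* logit = log(P / (1-P)) */
    return s;
}
```

Both functions return the same value for every word, so a bit that is 0 costs nothing at inference time: its contribution already sits in the offset.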
Iterative Target-Tuning — DRAM-Native 96%
The Bayes log-Score is optimal for log-likelihood. But we want to minimize 0-1 classification error. A Perceptron-style correction in log-odds space closes the gap to AdamW and then surpasses it.
The Algorithm
--epochsN N : iterations (default 1 = 1-pass)
--lr STEP : log-odds step (default 0.05 = 5000/100000)
For each epoch:
1. Classify ALL training samples (same forward)
2. For each MISCLASSIFIED sample:
true class: logit + step (strengthen correct class)
predicted class: logit - step (weaken wrong class)
3. Early stopping: save best eval
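One epoch of the loop above might look like this in C. It is a sketch under stated assumptions, not the project's trainer: the names `predict` and `correction_epoch` are hypothetical, the update is applied only to bits that are set in the sample (the text does not say which entries are touched), and the offsets are left unchanged.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CLASSES 10
#define BITS 32

/* Same forward as inference: offset plus the log-odds of every set bit. */
int predict(const uint32_t *h0, int H, const int32_t *target, const int64_t *offset) {
    int best = 0;
    int64_t best_score = 0;
    for (int k = 0; k < NUM_CLASSES; k++) {
        int64_t s = offset[k];
        for (int h = 0; h < H; h++)
            for (int b = 0; b < BITS; b++)
                if ((h0[h] >> b) & 1u)
                    s += target[((int64_t)k * H + h) * BITS + b];
        if (k == 0 || s > best_score) { best = k; best_score = s; }
    }
    return best;
}

/* One correction epoch over n samples (features: n rows of H words).
   pred_buf is caller-provided scratch of length n.
   Returns the number of misclassified samples found in step 1. */
int correction_epoch(const uint32_t *features, const uint8_t *labels, int n,
                     int H, int32_t *target, const int64_t *offset,
                     int32_t step, int *pred_buf) {
    assert(step > 0);
    /* Step 1: classify all training samples with the current table. */
    for (int i = 0; i < n; i++)
        pred_buf[i] = predict(features + (int64_t)i * H, H, target, offset);

    /* Step 2: correct the log-odds for every misclassified sample. */
    int errors = 0;
    for (int i = 0; i < n; i++) {
        int truth = labels[i], pred = pred_buf[i];
        if (pred == truth) continue;
        errors++;
        const uint32_t *h0 = features + (int64_t)i * H;
        for (int h = 0; h < H; h++)
            for (int b = 0; b < BITS; b++)
                if ((h0[h] >> b) & 1u) {   /* assumption: set bits only */
                    target[((int64_t)truth * H + h) * BITS + b] += step;
                    target[((int64_t)pred  * H + h) * BITS + b] -= step;
                }
    }
    return errors;
}
```

With the default step of 0.05 in log-odds units, `step` would be 5000 in the ×100000 fixed-point scale. Early stopping (step 3) would wrap this function: evaluate after each epoch and keep a copy of the best table.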
Why it works
The Bayes log-Score maximizes log-likelihood under the (approximately correct) independence assumption. The 0-1 loss is a different objective. Iterative correction in log-odds space targets the 0-1 loss directly, without leaving int32 arithmetic. In these runs, 20 epochs are enough to surpass AdamW, because every correction is aimed at a sample on the wrong side of the classification boundary.
Convergence (H=512, XOR)
| Epochs | Train | Eval |
|---|---|---|
| 1 (1-pass) | 84.4% | 86.2% |
| 5 | — | ~94.7% |
| 10 | 99.7% | 95.0% |
| 20 | 99.9% | 95.5% |
Scaling with H (20 epochs)
| H | Eval | Time |
|---|---|---|
| 512 (XOR) | 95.5% | 43s |
| 1024 (XOR) | 96.0% | 86s |
| 2048 (XOR) | 96.2% | 170s |
| 2048 (XNOR) ★ | 96.4% | 170s |
AdamW reference: 95.8% (W1-Only, H=2048, 20ep) — iterative beats AdamW by 0.6pp!
Key insight: The 86%-Wall is not a feature-quality problem — AdamW proves 95.8% is reachable with float training. The gap is purely optimization: log-likelihood vs 0-1 loss. Iterative target-tuning closes the gap and surpasses AdamW (96.4% vs 95.8%) while keeping DRAM-native &|~ + int32.
DRAM-Native — The Energy Argument
The entire forward pass uses only &|~ + int32 addition. No multiply-accumulate, no floating point. This is not a compromise — it's the native language of DRAM.
Conventional Neural Net
CPU/GPU DRAM
┌──────────────────────┐ ┌──────────────┐
│ Read W0 ◄─────┼──50pJ── │ W0[H×NC×32] │
│ Read x ◄────┼──50pJ── │ x[784] │
│ FMA: 10k gates │ │ │
│ Write h ──────┼──50pJ──►│ h[H] │
└──────────────────────┘ └──────────────┘
Energy: ~90% data movement, ~10% computation
Otto Score in DRAM
DRAM Row (784 cells)
┌──────────────────────────────────────┐
│ W0[0] W0[1] ... W0[783] │
│ XNOR XNOR ... XNOR │← 4T gate per cell
│ │ │ │ │
│ └───────┴──────┬───────┘ │
│ Analog Sum (Kirchhoff) │← 0 energy
│ │ │
│ Comparator (>50%) │← 1 pJ per bit position
│ │ │
│ 1-bit result │← 32 bits leave the row
└──────────────────────────────────────┘
Plus: int32 log-odds addition in peripheral logic
What Moves — and What Doesn't
In a conventional forward pass, every single weight bit must move from DRAM to the processor. For H=512 that is 512 × 784 × 32 ≈ 12.8 million bits. In Otto Score, W0 bits never leave their cells: the XNOR happens at the cell, and the analog majority tree sums currents by Kirchhoff's current law, counted here as zero energy. Only the 32-bit MAJ3 results leave the row: 512 × 32 = 16,384 bits total per forward pass. That's about 99.9% fewer bit-transfers (16,384 / 12.8 million ≈ 0.13%).
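The bit-transfer comparison is plain arithmetic, sketched here with hypothetical helper names. The 784 × 32 bits per neuron is read off the row diagram (W0[0] to W0[783]) and matches the 12.8 million figure for H=512.

```c
#include <stdint.h>

/* Weight bits a conventional forward pass reads: neurons x inputs x bits each. */
uint64_t weight_bits_moved(uint64_t H, uint64_t inputs, uint64_t bits_per_weight) {
    return H * inputs * bits_per_weight;
}

/* Bits leaving the DRAM rows in the Otto Score forward: one MAJ3 word per neuron. */
uint64_t result_bits_moved(uint64_t H, uint64_t word_bits) {
    return H * word_bits;
}

/* Reduction in hundredths of a percent, rounded down, integer arithmetic only. */
uint64_t reduction_basis_points(uint64_t before, uint64_t after) {
    return (before - after) * 10000u / before;
}
```

For H=512 this gives 12,845,056 weight bits against 16,384 result bits, a reduction of 9987 hundredths of a percent, or about 99.87%.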
Training in DRAM
- 1-pass: same forward as inference + counting
- Iterative: same forward + log-odds correction
- No gradient, no backprop, no FPU
- Only int32 counters in peripheral logic
- Training and inference share the exact same hardware
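The 1-pass counting step needs nothing beyond int32 increments. A minimal sketch, assuming a flattened `[class][neuron][bit]` counter array and a per-class sample counter (`count_sample` and `class_n` are illustrative names, not the trainer's):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CLASSES 10
#define BITS 32

/* Accumulate per-class bit counts for one training sample.
   counts     : flattened [class][neuron][bit], incremented where the bit is 1
   class_n[k] : number of samples of class k seen so far (N_k) */
void count_sample(const uint32_t *h0, int H, int label,
                  int32_t *counts, int32_t *class_n) {
    assert(label >= 0 && label < NUM_CLASSES);
    class_n[label]++;
    for (int h = 0; h < H; h++) {
        int32_t *row = counts + ((int64_t)label * H + h) * BITS;
        for (int b = 0; b < BITS; b++)
            row[b] += (int32_t)((h0[h] >> b) & 1u);
    }
}
```

After one pass over the training set, each count t and its class total N_k feed the logit conversion ln((t+1)/(N_k-t+1)) × 100000. That conversion is not shown here, since the source does not say how the logarithm is evaluated.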
Why Not AdamW in DRAM?
- AdamW needs float matmul, which needs an FPU
- Gradient descent needs multiply-accumulate
- Momentum + adaptive LR = complex state
- But we don't need it: iterative target-tuning reaches 96.4%
- AdamW spends ~2,500× the transistor budget and ends 0.6pp lower (95.8%)
The Learning Journey
Every approach taught us something. The progression: from float → int → bit-logic → Bayes log-Score.
| Phase | Date | Approach | Eval | Floats? | Lesson |
|---|---|---|---|---|---|
| 1 | May | Float32 AdamW (H=8) | 81% | Everywhere | Baseline: how float works |
| 2 | May | Float32 SGD (H=8) | 76% | Gradient | Momentum ≠ accuracy. Same Bit-Mass as BV32 |
| 3 | May | BV32 (H=8) | 76% | Training only | Compute format doesn't matter. Bit-Mass matters. |
| 4 | Jun 6 | W1-only AdamW (H=1024) | 93% | W1 only | Random W0 contains enough info! |
| 5 | Jun 13 | Majority + Delta-9 | 56% | None | Majority keeps position → gain! But Delta-9 overfits. |
| 6 | Jun 14 | Majority + co-occurrence | 83% | None | Co-occurrence counting beats MSE. Global statistics win. |
| 7 | Jun 15 | W0 training (W1-only ref) | 73% | None | W0 training destroys MAJ3 balance |
| 8 | Jun 16 | MAJ3 + per-bit counts (±score) | 49% | None | Raw counts have systematic class bias |
| 9 ★ | Jun 17 | Bayes log-Score (offset + logit) | 86.6% | None | offset_k eliminates class bias → +37pp! |
| 10 ★ | Jun 18 | Iterative Target-Tuning | 96.4% | None | 20× log-odds correction → beats AdamW by 0.6pp! |
What Didn't Work (and Why)
- Jaccard H0 (36%) — comparison-based, loses position info
- Delta-9 MSE (56%) — local optimization, overfits to last batch
- ±Score log-odds (49%) — missing offset_k → class bias D_k
- Float threshold W1 (17%) — binary SGD with continuous shadow, unstable
- 2-Layer random MAJ3 (84.8%) — second random projection adds nothing
- W0 training — destroys the ~50% MAJ3 balance
What Works
- Bayes log-Score (86.6%) — offset_k + per-bit log-odds, 1-pass
- Iterative target-tuning (96.4%) — 20× correction, beats AdamW
- Random frozen W0 — MAJ3 needs ~50% balanced input
- MAJ3 as feature extractor — &|~ cascade, no multiply-accumulate
- Multi-mode — XNOR and XOR both work, XOR saves a NOT on DRAM
The Information Ladder
Each step preserves more information from the input:
uint32 comparison ──→ 1 value (pattern lost, position lost) → 73%
MAJ3 + uint32 ──→ 0/1 (pattern kept, per-bit lost) → 83%
MAJ3 + per-bit ──→ logit (pattern + per-bit strength) → 87%
Bayes log-Score
MAJ3 + iterative ──→ logit (pattern + per-bit + 0-1 loss) → 96%
target-tuning
+ AdamW bridge ──→ float (all of the above + AdamW) → 95%
(reference only)
Why This Matters — The Scaling Argument
Floating-point has a fundamental scaling problem. Bit-logic does not. The Otto Score is the first DRAM-native path to 96%.
BIN — Scales UP: 32-bit → 64-bit → 256-bit → ∞
32-bit container → 86.6% ✓
64-bit container → same ✓ (wider = more precision)
128-bit container → same ✓ (no limit!)
256-bit container → same ✓ (same majority_tree)
↑↑↑ MORE bits = MORE information. No upper bound.
FLT — Scales DOWN: fp32 → fp16 → fp8 → fp4 → 0
fp32 → 32-bit, ~10,000 transistors/FMA
fp16 → half precision, noise increases
fp8 → 8-bit, massive accuracy loss
fp4 → 4-bit, near random
↓↓↓ FEWER bits = LESS information. Hits zero.
| Property | Otto Score (Bit-Logic) | AdamW (Float32) |
|---|---|---|
| Scaling direction | UP → more bits = more precision | DOWN → fewer bits = less precision |
| DRAM-native? | Yes: &\|~ + int32 | No: needs 10K-transistor FMA |
| Training in DRAM? | Yes: 1-pass counting or 20-pass log-odds correction | No: needs FPU for backprop |
| Energy/op (estimate) | ~1 pJ (4 transistors, analog sum) | ~10,000 pJ (FMA + data movement) |
| Accuracy (MNIST) | 86.6% 1-pass / 96.4% iterative | 92.6% (ki-w1) / 95.8% (W1-Only H=2048) |
The choice is clear: AdamW gives 95.8% but needs an FPU, gradient descent, momentum, and adaptive LR, with a 10,000-transistor FMA per multiply. Otto Score iterative gives 96.4% with only &|~ + int32 and a 4-transistor XNOR gate per cell. 0.6pp better. 2,500× less transistor budget. And it runs inside DRAM.
Technical Details
XNOR / XOR Compile Switch
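Both modes build from the same code via -DH0_XOR. 1-pass accuracy is identical at H=256 (85.3%); with 20-pass tuning at H=2048 the two differ slightly (96.2% XOR vs 96.4% XNOR). XOR saves a NOT per bit on DRAM: one fewer transistor per cell.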
| Mode | Instruction | Acc H=256 |
|---|---|---|
| XNOR | vpternlogd $0xa5 | 85.3% |
| XOR | vpxord | 85.3% |
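A compile switch of this kind can be as small as the sketch below. Only the macro name `H0_XOR` comes from the text; the function name `h0_combine` is hypothetical, and the real code presumably works on wider vectors than one uint32.

```c
#include <stdint.h>

/* Combine one input container with one frozen weight container.
   Build with -DH0_XOR for the XOR variant; the default is XNOR. */
uint32_t h0_combine(uint32_t in, uint32_t w) {
#ifdef H0_XOR
    return in ^ w;        /* XOR: bit set where input and weight differ */
#else
    return ~(in ^ w);     /* XNOR: bit set where input and weight agree */
#endif
}
```

The two variants produce bitwise complements of each other, and a plain three-input majority of complemented inputs is the complement of the majority, so both modes carry the same per-bit information into the counting step.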
RNG: splitmix64
Replaced glibc rand() (31-bit output, RAND_MAX = 2³¹−1) with splitmix64 (passes BigCrush, 64-bit state, public domain). A 1-bit bug (31-bit rand instead of 32) cost ~3% accuracy; now fixed.
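For reference, splitmix64 fits in a few lines. The generator below follows the widely published public-domain algorithm; the wrapper `random_container` is a hypothetical helper that shows one way to fill all 32 bits of a weight container, and may differ from how the project draws W0.

```c
#include <stdint.h>

/* splitmix64: advance the 64-bit state by a fixed odd constant,
   then mix it with two multiply-xorshift rounds. */
uint64_t splitmix64_next(uint64_t *state) {
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* A full 32-bit container: take the upper half of one 64-bit draw,
   so bit 31 is random too (unlike a single 31-bit rand() value). */
uint32_t random_container(uint64_t *state) {
    return (uint32_t)(splitmix64_next(state) >> 32);
}
```

A container built from one rand() call always has bit 31 clear, which skews that bit position of W0 away from the ~50% balance MAJ3 relies on.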
MAJ3 Library (lib/maj3.h)
The majority_tree algorithm has a fix for edge cases:
- n = 3^k+1 (e.g., 244, 730): pass-through cascade → result ≈ 0
- n % 3 == 2 (e.g., 32): AND cascade → too restrictive
- Fix: 4-group for n%3==1, 5-group for n%3==2
- H=244: from 54.9% to 84.2%
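The building blocks can be sketched as below. This is not the code in lib/maj3.h: the source does not say how its 4-groups and 5-groups are formed, so `maj5` here is simply an exact bit-parallel majority of five, and `majority_tree_pow3` covers only the clean case where n is a power of 3. XOR is used for brevity; it can be expressed with &|~.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise majority of three words: each of the 32 lanes votes independently. */
uint32_t maj3(uint32_t a, uint32_t b, uint32_t c) {
    return (a & b) | (a & c) | (b & c);
}

/* Bitwise majority of five words via two carry-save steps.
   a+b+c = s1 + 2*c1 and s1+d+e = s2 + 2*c2, so each lane total is
   s2 + 2*(c1 + c2). It reaches 3 when both carries are set, or when
   exactly one carry and the sum bit are set. */
uint32_t maj5(uint32_t a, uint32_t b, uint32_t c, uint32_t d, uint32_t e) {
    uint32_t s1 = a ^ b ^ c, c1 = maj3(a, b, c);
    uint32_t s2 = s1 ^ d ^ e, c2 = maj3(s1, d, e);
    return (c1 & c2) | ((c1 ^ c2) & s2);
}

/* Reduce n words in place by repeated groups of three (n must be a power of 3). */
uint32_t majority_tree_pow3(uint32_t *v, int n) {
    assert(n >= 1);
    while (n > 1) {
        assert(n % 3 == 0);
        for (int i = 0; i < n / 3; i++)
            v[i] = maj3(v[3 * i], v[3 * i + 1], v[3 * i + 2]);
        n /= 3;
    }
    return v[0];
}
```

The edge cases in the list arise when n is not a power of 3 and a level has one or two words left over: passing a single leftover through gives it the weight of a whole subtree, and AND-ing two leftovers outputs 1 only when both are 1.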
2-Layer Experiment
A second random MAJ3 layer (W1 + majority_tree) adds no value: H0=H1=512 reaches 84.8% vs 86.2% for one layer, so the second layer costs 1.4pp. Random projection is not hierarchical; one layer extracts at least as much as two.
Archive — Predecessor Work
Earlier experiments that led to the Otto Score. Preserved for reference.
Float32 References
AdamW and SGD baselines. Established that float at identical Bit-Mass matches binary (both 76% at H=8/3ep). The 5pp gap between SGD and AdamW is purely momentum + adaptive LR.
int32 Container MLP
Swapped binary containers for int32 multiply-accumulate. Cleaner, but needs integer multipliers — not DRAM-native. Showed 76.4% with identical float precision.
Iterative Target-Tuning
Perceptron-style log-odds correction. Beats AdamW (96.4% vs 95.8%) using only &|~ + int32. The first DRAM-native algorithm to surpass gradient descent on MNIST. 20 epochs at H=2048, 170s on CPU.