WERscore

WERscore — from screening prompt to product-ready demo

Overview

This project started as a job interview screening task: compute Word Error Rate (WER) between a reference sentence and a hypothesis sentence, then surface substitutions, deletions, insertions, and a word-level alignment.

Instead of shipping a “correct answer” and stopping there, I treated the prompt like a tiny product spec: make the metric understandable at a glance. The result is an interactive WER visualizer that animates the alignment in real time, color-codes every edit operation, and keeps the WER counters synced with what the user sees on screen.

The app is portfolio-ready today, and it’s also a foundation for something bigger: a benchmarking tool that can compare multiple speech-to-text engines side-by-side.


Role

Data Visualization
Full Stack Development

Client

WERscore Labs

Date

Aug 2022 - Feb 2023

Link

wer-score.vercel.app

What the user experiences

  1. Paste two sentences (reference + hypothesis).
  2. Click Compute to tokenize and initialize the engine.
  3. Hit Start to watch the alignment build word-by-word:
    • Matched words drop in cleanly.
    • Insertions appear with a strike-through.
    • Deletions show what the model missed.
    • Substitutions highlight the reference word and reveal the hypothesis token on hover.
As the animation runs, the counters (S / D / I / N) and WER update live — so the numeric score never feels detached from the text.

Why I went beyond the prompt

WER is usually presented as a single number. That’s useful for evaluation, but it hides the story: where did the model go wrong, and how did the alignment decide what counts as an insertion, a deletion, or a substitution?

So I focused on two things:

  • Make the alignment explicit: show the word-level operations clearly and consistently.
  • Make it feel “alive”: the user can see the engine thinking through step-by-step playback.

This turns an abstract metric into something you can demo to a hiring manager, explain to a teammate, or eventually use to compare ASR engines.


UX and animation decisions that make it legible

1) Color-coded chips that map directly to the metrics

Each token renders as a small “chip” whose color corresponds to its operation type. That same mapping is used by the counters above, so users build intuition quickly: if I see a lot of red chips, I know why WER is climbing.
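One way to keep the chips and counters in lockstep is a single lookup shared by both. A minimal sketch: the `colorHighlightClass` name appears in the render snippet later in this writeup, but the specific Tailwind classes here are assumptions.

```typescript
type Op = 'MATCHED' | 'DELETED' | 'INSERTED' | 'SUBSTITUTED'

// One source of truth: chips and counters both read from this map,
// so a given color always means the same operation everywhere in the UI.
const colorHighlightClass: Record<Op, string> = {
  MATCHED: 'bg-green-100 text-green-800', // assumed palette
  DELETED: 'bg-red-100 text-red-800',
  INSERTED: 'bg-yellow-100 text-yellow-800',
  SUBSTITUTED: 'bg-orange-100 text-orange-800',
}
```

Because the map is keyed by the operation type, adding a new operation forces a compile error until it gets a color, which keeps the legend exhaustive.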

2) Substitution tooltips (reference vs hypothesis) without visual clutter

Substitutions are the most informative error type, but showing both words inline can make the line unreadable.

Instead, the UI shows the reference token as the chip, and reveals the hypothesis token on hover using a tooltip. It’s a clean interaction pattern: obvious when you want details, invisible when you don’t.
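The hover content can be derived from the alignment token itself, with substitutions as the only type that carries extra detail. A hypothetical helper sketching the pattern (`tooltipFor` is an illustration, not a name from the codebase):

```typescript
interface AlignedToken {
  word: string // the reference token shown on the chip
  type: 'MATCHED' | 'DELETED' | 'INSERTED' | 'SUBSTITUTED'
  substitution?: string // hypothesis token, set only for substitutions
}

// Only substitutions get hover detail; every other chip stays clutter-free.
function tooltipFor(t: AlignedToken): string | undefined {
  return t.type === 'SUBSTITUTED' && t.substitution
    ? `heard "${t.substitution}" instead of "${t.word}"`
    : undefined
}
```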

3) Motion that reinforces “step-by-step alignment”

I used Framer Motion to animate each chip as it’s emitted by the engine. This isn’t decoration — it’s feedback. The user sees each decision appear at the moment it is made.

Key implementation ideas (with minimal, supportive code)

Tokenization and engine bootstrap

WER evaluation should be robust to inconsistent spacing and casing, so tokenization is intentionally simple and predictable: lowercase + whitespace split.

const tokenize = (s: string) => s.toLocaleLowerCase().trim().split(/\s+/).filter(Boolean)

const compute = useCallback((hypothesis: string, reference: string) => {
  const hTokens = tokenize(hypothesis)
  const rTokens = tokenize(reference)

  setAlignment([])
  setIsRunning(false)

  setReferenceWordCount(rTokens.length)

  const maxSteps = Math.max(hTokens.length, rTokens.length) + 10

  engineRef.current = {
    hTokens,
    rTokens,
    i: 0,
    j: 0,
    step: 0,
    maxSteps,
    done: false,
  }
}, [])

The goal here: reset UI state, store the token arrays, and set up a small “engine cursor” (i, j) that we can animate forward.
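That engine state can be captured in a small type; a sketch with the shape inferred from the compute snippet above (the `EngineState` and `createEngine` names are mine):

```typescript
interface EngineState {
  hTokens: string[] // hypothesis tokens
  rTokens: string[] // reference tokens
  i: number // cursor into hTokens
  j: number // cursor into rTokens
  step: number // steps taken so far
  maxSteps: number // safety cap so a buggy step can't loop forever
  done: boolean
}

// Mirrors what compute() stores in engineRef.current.
function createEngine(hTokens: string[], rTokens: string[]): EngineState {
  return {
    hTokens,
    rTokens,
    i: 0,
    j: 0,
    step: 0,
    maxSteps: Math.max(hTokens.length, rTokens.length) + 10,
    done: false,
  }
}
```

The +10 headroom on maxSteps is a guard: even if the cursors stall, playback terminates.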

Real-time totals + WER that update as the alignment streams in

Instead of calculating everything at the end, the UI derives totals from the emitted alignment tokens — meaning the counters and WER stay correct during playback and manual stepping.

const totals = useMemo(() => {
  return alignment.reduce(
    (acc, { type }) => {
      acc[type] += 1
      return acc
    },
    { ...emptyTotals },
  )
}, [alignment])

const wer = useMemo(() => {
  if (referenceWordCount === 0) return undefined
  const errors = totals.DELETED + totals.INSERTED + totals.SUBSTITUTED
  return errors / referenceWordCount
}, [totals, referenceWordCount])

This is what makes the project feel like a live instrument panel rather than a static result page.
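The same derivation also works as a pure function outside React, which is handy for unit tests. A sketch, assuming alignment tokens carry a `type` field as in the snippets above (`computeWer` is an illustrative name):

```typescript
type Op = 'MATCHED' | 'DELETED' | 'INSERTED' | 'SUBSTITUTED'

// WER = (S + D + I) / N, where N is the reference word count.
// Works on a partial alignment too, which is what keeps the live
// counters honest mid-playback.
function computeWer(alignment: { type: Op }[], referenceWordCount: number): number | undefined {
  if (referenceWordCount === 0) return undefined
  const errors = alignment.filter((t) => t.type !== 'MATCHED').length
  return errors / referenceWordCount
}
```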


The alignment step that drives the animation

The heart of the demo is the single-step function. Each call emits exactly one alignment token — which then appears as one animated chip.

A key detail: when there’s a mismatch, the engine checks one token ahead to decide whether the best explanation is:

  • a deletion (reference has an extra token),
  • an insertion (hypothesis has an extra token),
  • or a substitution (true replacement).

if (h !== r) {
  if (rTokens[e.j + 1] === h) {
    setAlignment((prev) => [...prev, { word: r!, type: 'DELETED', substitution: undefined }])
    e.j += 1
    e.step += 1
    return
  }

  if (hTokens[e.i + 1] === r) {
    setAlignment((prev) => [...prev, { word: h!, type: 'INSERTED', substitution: undefined }])
    e.i += 1
    e.step += 1
    return
  }

  setAlignment((prev) => [...prev, { word: r!, type: 'SUBSTITUTED', substitution: h }])
  e.i += 1
  e.j += 1
  e.step += 1
  return
}
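For reference, the same lookahead logic can run to completion as a pure batch function. A sketch: the mismatch branches mirror the excerpt above, while the matched-word and end-of-sequence branches are filled in as plausible assumptions, since the excerpt shows only the mismatch case.

```typescript
type Op = 'MATCHED' | 'DELETED' | 'INSERTED' | 'SUBSTITUTED'
interface Token { word: string; type: Op; substitution?: string }

// Run the one-token-ahead alignment to completion (no animation).
function align(hTokens: string[], rTokens: string[]): Token[] {
  const out: Token[] = []
  let i = 0
  let j = 0
  while (i < hTokens.length || j < rTokens.length) {
    const h = hTokens[i]
    const r = rTokens[j]
    if (h === undefined) { // reference has trailing tokens → deletions
      out.push({ word: r, type: 'DELETED' }); j += 1; continue
    }
    if (r === undefined) { // hypothesis has trailing tokens → insertions
      out.push({ word: h, type: 'INSERTED' }); i += 1; continue
    }
    if (h === r) {
      out.push({ word: r, type: 'MATCHED' }); i += 1; j += 1; continue
    }
    if (rTokens[j + 1] === h) { // next reference token matches → r was dropped
      out.push({ word: r, type: 'DELETED' }); j += 1; continue
    }
    if (hTokens[i + 1] === r) { // next hypothesis token matches → h was inserted
      out.push({ word: h, type: 'INSERTED' }); i += 1; continue
    }
    out.push({ word: r, type: 'SUBSTITUTED', substitution: h }); i += 1; j += 1
  }
  return out
}
```

A batch version like this is also what makes the engine unit-testable independently of the animation loop.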

This structure is intentionally simple because the project is meant to be:

  • explainable in an interview,
  • debuggable while animating,
  • and easy to extend later.

The animated readout: where the metric becomes “visual”

The alignment tokens render as chips, animated into view as they stream in. Substitutions get tooltips to reveal the hypothesis token on hover.

<motion.span
  className={`rounded px-1 py-0.5 ${colorHighlightClass[t.type]} ${t.type === 'INSERTED' ? 'line-through' : ''}`}
  initial={{ opacity: 0, y: -10 }}
  animate={{ opacity: 1, y: 0 }}
  transition={transition}
>
  {t.word}
</motion.span>

This is the “aha” moment of the project: the user doesn’t just read a score — they watch the operations that create it.

What I’d build next

This app already works as a clean portfolio demo, but it’s also a strong base for a more practical tool.

Planned upgrades:

  • ASR benchmarking mode: plug in speech-to-text APIs (e.g., Whisper, AssemblyAI, Picovoice) and compare results side-by-side using the same alignment visualization.
  • Teleprompter mode: generate a read-out script, record audio, send to multiple engines, then visualize and compare their WER outputs instantly.
  • Richer token handling: optional punctuation-aware tokenization and configurable error weighting.
  • Better “demo controls”: speed slider, keyboard shortcuts, persistence of last inputs, shareable alignment snapshots.

How to run locally

  1. Install dependencies:
    npm install
    
  2. Start the dev server:
    npm run dev
    
  3. Open the localhost URL printed by Vite.

Summary

This started as a screening prompt, but I treated it like a product: make WER understandable, interactive, and explainable. The end result is a visual alignment engine with live metrics, animated word-level feedback, and a clear path to evolving into a full benchmarking suite for speech-to-text engines.