A Reusable AI-Tool Evaluation Rubric (Copy This Scorecard)

Most AI-tool decisions are made on a demo and a vibe. A demo shows the tool at its best on someone else's task; it tells you almost nothing about how the tool behaves on *your* task, with *your* data, under *your* constraints. A reusable rubric fixes that by forcing the same questions every time, so two tools are compared on the same axes instead of on enthusiasm.

The scorecard

Score each dimension 0–2 (0 = fails, 1 = adequate, 2 = strong) for the specific job you have in mind.

DimensionWhat you are actually checking0 / 2 anchors
Task fitDoes it do *your* task, not the demo's?0: needs heavy workarounds · 2: handles your real case
Permission boundaryWhat can it read, write, execute?0: broad, unscoped access · 2: least-privilege, auditable
Data handlingWhere does your data go; is it trained on?0: unclear or retained · 2: documented, no-train, in-region
ReliabilityHow does it fail, and how visibly?0: silent wrong answers · 2: fails loudly, recoverable
Lock-inCost to leave? Export, standards, portability0: proprietary trap · 2: open formats, easy exit
Total costPrice *plus* setup, maintenance, token burn0: hidden or variable · 2: predictable, justified
SupportDocs, issue response, community0: stale docs, no replies · 2: active and responsive

A score is not a verdict by itself — a tool can score low on lock-in and still be the right call for a throwaway task. The value is that the *pattern* of scores tells a story: a high-task-fit, low-permission-boundary tool is one to sandbox carefully; a high-everything-but-cost tool is a budget conversation, not a safety one.

Weight the dimensions for the job in front of you

The seven dimensions are not equal for every decision, and pretending they are is how teams pick a tool that scores fine and still hurts them. For anything touching customer or regulated data, *data handling* and *permission boundary* are gatekeepers — a zero there ends the evaluation no matter how high the rest score. For a quick internal experiment, *lock-in* and *total cost* barely matter and *task fit* dominates. Decide the weights before you score, so the scoring can't be rationalized after the fact.

Three red flags override a strong overall score: the tool can't tell you where your data goes or whether it trains on it; it requests broad, unscoped access "to make setup easier"; or it fails silently, returning confident wrong answers with no signal. Any one of those is a stop, even at a perfect task-fit score, because each converts a productivity tool into a liability you won't notice until it costs you.

How to run a one-hour bake-off

Pick the two or three finalists and, for each, do the same four things with a timer:

  1. Define one real task in a sentence and prepare sample inputs that look like your actual data (never production secrets).
  2. Read the docs first — install path, required permissions, data terms — and write down every account, key, or access scope it asks for. The tool that needs the least to do the job usually wins the safety axis.
  3. Run the task once with sample data only and record the first useful output, the failures, and anything confusing.
  4. Try to remove it and confirm your workspace still works. A tool you cannot cleanly uninstall is a tool you have not really evaluated.

Fill the scorecard immediately after, while the friction is fresh. Keep the completed cards; six months later, when someone asks "why did we pick this," you will have an answer better than "the demo looked good." Re-score on any major version change, because permissions, pricing, and data terms move more often than features.

How to compare scores without fooling yourself

Do not average the scorecard too early. A tool with a high average and one serious data-handling problem may be worse than a boring tool with a lower total and safer boundaries. Read the lowest scores first, then ask whether any are disqualifying for this task. For example, a 0 on lock-in may be acceptable for a one-week internal experiment, while a 0 on permission boundary should stop a tool that touches customer files. Keep the raw notes beside the numbers: what failed, what surprised you, what required admin access, and what you would need to verify with the vendor before using real data.