For AI labs · research teams · model builders

Consented humans.
Trainable AI.

The consented dataset your model has been missing. A growing registry of identity-verified humans who have explicitly licensed their likeness for AI training. Demographically representative. Fully consented. Provenance you can prove.

Provenance per sample · Revocation-aware delivery · Multi-modal data formats
Registry snapshot UPDATED 17 MAY 2026
Verified humans: 2,500+ (3 continents)
Modalities: 5 (Image · Video · Audio · 3D · Motion)
Age range: 18–82 (median 34)
Territories: 18 (UK, EU, US lead)
Per-sample provenance: 100% (cryptographic chain)
Revocation SLA: 72h (custom available)
The thesis

Scraped data has a half-life.

For most of the last decade, the bottleneck in human-likeness AI wasn't licensing — it was compute. That's reversed. Models are now trained faster than the legal layer underneath them can be defended.

Three things are happening simultaneously: courts are starting to award damages on unauthorised likeness use; regulators are demanding training-data provenance for foundation models; and the public is realising that "the model was trained on millions of images" means "the model was trained on me."

Twinnin's bet is simple. The next generation of human-likeness models will be built on data with provenance, consent, and ongoing payment. Not because labs become altruistic, but because models without that paper trail become commercially unusable.

"The era of free scraping is closing. The era of licensed, consented training data is opening. Twinnin is positioned at exactly that inflection."
SFC Capital · seed lead · April 2026
"Foundation model labs are already losing deals to procurement teams asking for chain-of-consent paper. The biggest labs see this coming. The smaller ones don't yet."
Katrien Grobler · founder · Deadline interview
Regulation

The next AI regulation is about your training set.

Most of the EU AI Act's obligations become applicable on 2 August 2026. From that date, foundation models distributed in the EU must disclose training data sources, demonstrate consent, and respond to data subject withdrawal requests within defined timelines.

The US is twelve to eighteen months behind, not five years. Federal NO FAKES, California §927, and state-level training data disclosure laws are coming. Models built on scraped data will face active commercial restrictions. Models built on licensed data will not.

What you get

Six properties of the dataset. All non-negotiable.

This is what every sample carries — whether you're licensing 100 humans for a research preview or 25,000 for a foundation model training run. The properties are the product.

01

Per-sample provenance

Every image, frame, audio clip, and motion capture sample carries cryptographic provenance back to the consenting human. Verifiable by anyone, anywhere, without contacting us.
C2PA · SHA-256 · ED25519
02

Explicit training consent

Twins opt into AI training as a separate, granular consent — not bundled with commercial licensing. Consent specifies model type, retention, and downstream redistribution rights.
Granular · auditable · time-stamped
03

Revocation-aware delivery

When a twin revokes consent, you're notified within the agreed SLA. We track which samples are revoked, log your handling, and give you a clear audit trail for regulator response.
Webhook · SLA-defined
04

Multi-modal coverage

Images, video sequences, audio (when separately consented), motion capture, and 3D scans. Same registry, same identity, modalities you can mix and match for multi-modal model work.
5 modalities · expanding
05

Demographic balancing tools

Filter and sample by age, gender, ethnicity, geography, height, body type, and other documented attributes. Hit your target representation; reduce well-known training set biases.
Self-reported · structured
06

Ongoing compensation infrastructure

We pay the humans whose data trains your models. Per-sample, per-training-run, or revenue-share on commercial deployment — whatever structure your deal needs. The humans get paid. That's the deal.
Per-sample or revenue-share
Coverage

Sampling that doesn't bias your model.

Most training datasets are accidentally a sample of who had cameras pointed at them — overrepresenting some demographics by orders of magnitude. We're building Twinnin to be deliberately representative, not accidentally biased.

Right now the registry holds 2,500+ verified humans across 18 territories, growing fastest in the UK, EU, and US. We track gaps actively and run targeted outreach in underrepresented categories: older adults, non-Western markets, disabled humans, and rare phenotypes that scraped datasets systematically miss.

If your model needs a specific demographic mix, ask. If we don't have the coverage today, we'll tell you, and tell you when we will.

Registry coverage · May 2026 · SELF-REPORTED
Age 18–34: 38% · 950 verified
Age 35–54: 42% · 1,050 verified
Age 55+: 20% · 500 verified · growing
UK / EEA: 71% · primary market
North America: 17% · growing
Other territories: 12% · actively expanding
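
For illustration, a lab that needs a different age mix from the registry's natural one could draw its cohort with stratified sampling. The sketch below is an assumption-laden example, not Twinnin's tooling: the `stratified_sample` function, the record layout, and the `age_band` key are all hypothetical, and the records are synthetic stand-ins sized like the age rows above.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, targets, total, seed=0):
    """Draw about `total` records, giving each stratum round(share * total).

    `targets` maps a stratum value to its desired share (shares should sum
    to 1). Because of rounding, the result can differ from `total` by a few
    records when the shares do not divide evenly.
    """
    rng = random.Random(seed)  # fixed seed so a cohort draw is repeatable
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[rec[key]].append(rec)

    picked = []
    for stratum, share in targets.items():
        want = round(share * total)
        pool = by_stratum.get(stratum, [])
        if want > len(pool):
            # Surface a coverage gap instead of silently under-sampling.
            raise ValueError(f"only {len(pool)} records in {stratum!r}, need {want}")
        picked.extend(rng.sample(pool, want))  # sampling without replacement
    return picked


# Synthetic records only (hypothetical layout), sized like the age table.
registry = (
    [{"id": f"demo_{i}", "age_band": "18-34"} for i in range(950)]
    + [{"id": f"demo_{i}", "age_band": "35-54"} for i in range(950, 2000)]
    + [{"id": f"demo_{i}", "age_band": "55+"} for i in range(2000, 2500)]
)

# Deliberately over-weight the 55+ band relative to its 20% registry share.
cohort = stratified_sample(
    registry, "age_band", {"18-34": 0.25, "35-54": 0.25, "55+": 0.5}, total=400
)
mix = Counter(rec["age_band"] for rec in cohort)
```

Raising an error when a stratum is too small mirrors the promise above: if the coverage is not there today, the gap should be visible rather than hidden in the sample.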
Deal structure

Three ways to work with the data.

We structure deals around how you train and what you ship. Per-sample for evaluation work. Cohort-based for production training. Strategic for foundation-model partnerships. All three carry the same provenance, consent, and compensation guarantees.

Evaluation

Sample

For research previews, benchmarking, and proof-of-concept work. Limited cohort, limited duration, non-production use.

Sample size: 100–500 humans
Modalities: 1–2
Duration: 90 days
Use: Non-production
  • Full provenance per sample
  • 72-hour revocation SLA
  • Use-case attestation required
  • Per-sample compensation to humans
Request a sample →
Foundation

Strategic

Multi-year partnerships for labs building foundation models on consented human data. Exclusivity options, co-investment, bespoke terms.

Scope: Custom
Exclusivity: Available
Duration: 3+ years
Use: Negotiated
  • Everything in Cohort
  • Co-investment in registry growth
  • Targeted demographic recruitment
  • Dedicated technical liaison
  • Joint regulatory engagement
  • Co-marketing options
Book a call →
Provenance

Every sample, cryptographically signed.

Every image, video frame, and audio clip in your dataset arrives with a cryptographic chain that proves it came from a consenting human at a specific moment, with a specific licence, with a specific use-case attestation from you.

The chain is open. Anyone — your auditor, your insurer, a regulator, your distribution partner — can verify a sample independently without contacting Twinnin. We're not the trust gatekeeper. The cryptography is.

This is what training data should look like. Provenance-first, regulator-ready, distributable, and durable across the next decade of AI compliance regimes.

SAMPLE PROVENANCE tw_sample_a4f2_9821_00184
subject_id "tw_a4f2_9821"
subject_consent "granted · 2026-04-12T09:14:22Z"
consent_scope "training+inference"
consent_status "active"
modality "image/jpeg"
capture_date "2026-04-14T11:02:00Z"
licensee "lab_xxxxx"
use_attestation "foundation_training_v3"
hash sha-256: a4f2...c918
signature ed25519: 8j2k...1184
revocation_status "none · checked 2026-05-17"
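
As a rough illustration of independent verification, the sketch below checks the hash portion of a record like the one above using only Python's standard library. It is a minimal sketch under stated assumptions: the `hash_sha256` key and both function names are hypothetical, the record is a simplified stand-in rather than the actual sidecar schema, and the Ed25519 signature check (which needs a cryptography library and the published public key) is not shown.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the raw sample bytes."""
    return hashlib.sha256(data).hexdigest()


def is_usable(record: dict, sample_bytes: bytes) -> bool:
    """True only if consent is active and the bytes match the recorded hash.

    ASSUMPTION: `record` is a dict carrying `consent_status` (as shown in the
    sample record) and a hypothetical `hash_sha256` key with the full digest.
    """
    if record.get("consent_status") != "active":
        return False
    # compare_digest does a constant-time comparison of the two hex strings.
    return hmac.compare_digest(sha256_hex(sample_bytes), record["hash_sha256"].lower())


# Demo with made-up bytes; a real check would read the delivered file.
demo_bytes = b"example image bytes"
demo_record = {"consent_status": "active", "hash_sha256": sha256_hex(demo_bytes)}

untampered_ok = is_usable(demo_record, demo_bytes)
tampered_ok = is_usable(demo_record, demo_bytes + b"x")
```

A hash match shows only that the bytes are unchanged. Tying those bytes to a consenting human still depends on verifying the signature over the record.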
In production

What labs are building with Twinnin.

Representative examples of how research and engineering teams are using the registry today. We're under NDA on specific partners; names are available under a signed mutual NDA.

Foundation models

Improving facial diversity in image-generation models

Mid-sized lab using a 4,000-human cohort to reduce demographic bias in a next-generation text-to-image model. Balanced across age, ethnicity, and territory.

Cohort 4,000 humans
Modality Image · multi-pose
Duration 24 months
Comp model Per-sample + rev share
Synthetic actors

Character continuity for video-generation pipelines

Video AI lab licensing high-fidelity multi-modal data on a smaller cohort to support consistent character generation across long-form output. Each human is a "named character" in the model's latent space.

Cohort 200 humans
Modality Video + audio + 3D
Duration 36 months
Comp model Revenue share on deploy
Speech models

Voice cloning with verified speaker consent

Speech AI lab using audio-consented twins to train a voice generation model that ships with on-deployment consent verification — no synthetic voice can run without a registered speaker.

Cohort 800 humans
Modality Audio + paralinguistic
Duration 18 months
Comp model Per-deployment
Evaluation

Bias-testing benchmark for facial recognition

Academic-industrial lab building a fairness benchmark for face recognition systems. Balanced cohort across all major demographic axes, with consented permission to publish results.

Cohort 1,200 humans
Modality Image · standardised
Duration Open benchmark
Comp model Per-sample fixed
Research FAQ

The questions research and legal teams ask first.

01

What happens if a twin revokes consent during training?

We notify you within the agreed SLA. For pre-training samples, you remove them from the training set and document the removal in your audit log; we provide tooling to support this. For samples already incorporated into model weights, the deal you sign defines downstream obligations: typically a good-faith effort with documented mitigation, not retroactive model deletion. To be candid, removing a sample from model weights is much harder than removing it from a dataset, and we are transparent about that in every contract.
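
On the licensee side, the pre-training case might look something like the sketch below: drop the revoked samples from the training manifest and record each removal for the audit log. This is an assumption, not the tooling Twinnin provides. The `AuditEntry` type, the `apply_revocation` function, the manifest-as-list-of-IDs layout, and the `tw_sample_demo_*` IDs are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditEntry:
    sample_id: str
    action: str
    logged_at: str  # ISO 8601 timestamp, UTC


def apply_revocation(
    manifest: list[str],
    revoked_ids: set[str],
    now: Optional[datetime] = None,
) -> tuple[list[str], list[AuditEntry]]:
    """Return the manifest without revoked samples, plus one audit entry each.

    Only revoked IDs that were actually in the manifest produce an entry, so
    the log reflects removals that really happened.
    """
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    kept = [sid for sid in manifest if sid not in revoked_ids]
    audit = [
        AuditEntry(sid, "removed_from_training_set", stamp)
        for sid in manifest
        if sid in revoked_ids
    ]
    return kept, audit


# Hypothetical manifest; only the first ID appears elsewhere on this page.
manifest = ["tw_sample_a4f2_9821_00184", "tw_sample_demo_00001", "tw_sample_demo_00002"]
kept, audit = apply_revocation(
    manifest,
    {"tw_sample_a4f2_9821_00184"},
    now=datetime(2026, 5, 17, tzinfo=timezone.utc),
)
```

Samples already baked into model weights are outside what a filter like this can address; those obligations are set by the contract, as described above.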
02

Is the data structured for our training pipeline?

We deliver in standard ML-friendly formats — Parquet, TFRecord, WebDataset, raw JPEG/PNG/WAV depending on modality. Metadata as JSON or Avro alongside each sample. Cryptographic provenance attached as sidecar files. We work with your data engineering team on bespoke pipelines for Strategic partnerships.
03

Can we license exclusively?

Yes, but it's more nuanced than yes/no. You can license a cohort exclusively for a specific use-case (e.g. "only this lab trains text-to-image models on this cohort for 24 months") without locking the humans out of other deal types. The humans retain their separate commercial licensing rights. Full exclusivity across all use-cases is available on Strategic plans and priced accordingly.
04

How do you handle synthetic data we generate from licensed source?

Synthetic outputs derived from licensed source data fall under the licence envelope of the source. The licence specifies whether synthetic derivatives are permitted, how they can be distributed, and whether they require their own provenance trail. For most cohort deals, synthetic outputs are permitted with disclosure obligations. Strategic deals can negotiate cleaner terms.
05

What about minors? We won't touch under-18s — confirm you don't either.

Twins under 18 are not available for AI training data licensing. Full stop. The registry has under-18 accounts (parental-guardian managed) for narrow commercial licensing categories only — never for training data. This is a hard line we won't cross.
06

How do twins get paid?

Per-sample compensation on Evaluation deals. Per-sample plus revenue share on Cohort deals (revenue share triggers when the trained model generates revenue or is deployed at scale). Strategic deals support bespoke compensation — fixed annual payments, milestone-based, or hybrid. Humans always get paid. That's the deal — and that's what makes the deal defensible.
07

Who else have you worked with?

We're under NDA on specific partners. We can reference named partners under mutual NDA in a first call. The shape of partners: research labs in foundation modelling, video generation, voice cloning, and fairness benchmarking. Mostly in the UK, EU, and US.
08

What's the engagement process?

Evaluation: usually two calls, then a signed evaluation agreement, then sample delivery in 1–2 weeks. Cohort: mutual NDA, then a detailed scoping conversation (cohort spec, demographic requirements, modality mix, compensation structure), then a 4–6 week negotiation to signed contract and first cohort delivery. Strategic: timed to your research roadmap. We start every relationship at the founder level.
Start a conversation

Build the next model on data that holds up.

If you're training on human likeness today, you'll need licensed training data tomorrow. Start the conversation now and lock in cohort terms before the regulation tightens further.

Or write to katrien@twinnin.ai · Typical reply within one working day · Founder-led conversations