Published March 31, 2026
1
The Hollow Game Problem
A play-money prediction market is easy to dismiss. It sounds light. Disposable. A clever mechanic without real consequence. Markets rise and fall, people guess, a leaderboard flashes, and then everyone moves on. That is the graveyard where most play-money systems end up: interesting for a week, noisy for a month, forgotten soon after.
Picture a WhatsApp group — colleagues, friends, cricket obsessives — running a prediction market for an upcoming Test match. Someone calls a 90% chance of India winning. India loses. The group laughs for an hour. By Tuesday, no one remembers, and no one’s behaviour has changed. The next market begins exactly as the last one ended: with careless confidence and no consequence.
That, in compressed form, is why play-money prediction markets have been tried many times and have mostly failed. The failure is not mechanical. The odds engines, the market formats, the user interfaces — those are solvable engineering problems. The failure is structural. Without real stakes, prediction becomes noise.
When losing feels like nothing, people do not think carefully before staking. They guess. They stake on feelings rather than evidence. They claim 90% confidence when they mean 60%, because bravado costs nothing. The market floods with low-signal predictions. Other participants cannot distinguish the genuinely calibrated forecaster from the lucky guesser. Over weeks, the platform loses its signal value — the one thing that made it interesting. And once signal is gone, there is no reason to return.
The pattern is consistent enough to be called a law: a prediction market without real stakes is not a market. It is a game, and games rely on novelty. Novelty fades. Play-money systems simulate the form of a market without creating the consequence of one.
The instinctive solution is money. Real money creates real stakes — losing ₹500 on a wrong call focuses the mind. But money also creates legal complexity, regulatory exposure, and the risk of shifting from a forecasting platform into a gambling product. In India specifically, real-money gaming is a legally fraught category at best and actively hostile at worst. The financial route is not the answer.
The question WePredict is built around is a harder one: can real stakes exist without real money? The answer is yes — but only if the stakes are genuinely felt. Reputation is one such stake. Social standing is another. A persistent track record, visible to everyone who knows you, is a third. These are not theoretical motivators. They are the reason professionals work carefully on problems no one is paying them to solve, the reason academics write papers that will be read by twenty people, the reason a chess player at a local club cares deeply about a rating point that has no cash value whatsoever.
The real stakes are not financial. They are reputational. And reputation is only as powerful as the record that supports it.
WePredict is built on Mu — an attention currency earned through NeoMails, a daily interactive email, and spent in the prediction marketplace. Mu creates an initial stake: spending it carelessly depletes a balance that took real daily engagement to accumulate. But Mu alone does not create the deeper consequence that changes how people predict. It does not create a record. It does not follow anyone. When Mu is gone it is gone, and the next prediction begins without memory.
For WePredict to escape that failure pattern, it needs something that does follow people. Something that compounds. Something that the serious participant protects and the careless participant damages. That something is the Predictor Score.

2
The Score That Follows You
A chess rating is not a trophy. It is not awarded at a ceremony or handed out for participation. It is a number that compresses an entire history of play — wins, losses, the quality of opponents, the consistency of performance under pressure — into a single figure that updates with every game. A player who earns a high rating cannot fake it. The rating is the evidence. It took time to build, and it can be damaged by a single careless period of play.
A credit score works differently. No one chooses to play it, and it is shaped partly by institutional behaviour rather than personal performance alone. But it does something the chess rating does not: it travels beyond the person. Banks consult it before lending. Landlords check it before leasing. It shapes how others treat you, not just how you regard your own performance. The Predictor Score is closer to the chess rating in how it is built — through performance, over time, by choice — but closer to the credit score in what it eventually does: it becomes the record that others consult before deciding how much to trust what you say.
The Predictor Score works on this dual logic. It is a persistent, compounding record of forecasting accuracy — not a badge given at a moment, not a leaderboard that resets quarterly, but a number that follows a person across every market they enter and every prediction they make. A Score of 1,400 represents a different person than a Score of 400 — not because of a single correct prediction, but because of the accumulated pattern of how that person thinks under uncertainty.
Understanding what the Score measures requires separating accuracy from calibration, and most people conflate the two. Accuracy is whether the prediction was right. Calibration is whether the expressed confidence matched the actual probability. A person who says ‘90% confident’ and is right 90% of the time is perfectly calibrated. A person who says ‘90% confident’ and is right 55% of the time is significantly miscalibrated — not just occasionally wrong, but systematically overconfident.
The Predictor Score rewards calibration, not just accuracy. A confident wrong answer hurts the Score more than an uncertain wrong answer. Saying ‘65% likely’ when one genuinely means 65% is rewarded, even when the outcome goes the other way. Claiming ‘95%’ to appear decisive and then being wrong is penalised severely — Part 5 works through the exact numbers, and the penalty for overclaiming is not proportional to the error. It is catastrophic. This distinction creates an incentive for intellectual honesty. The Score rewards the person who says ‘I don’t know, but here is my best estimate’ over the person who performs certainty they do not have.
Mu tells you how much attention you have earned. Predictor Score tells you how well you use it.
The second property is compounding. Two years of predictions across hundreds of markets is a fundamentally different record than two weeks of predictions across ten. The Score becomes harder to fake and harder to replicate as it accumulates. A new entrant to WePredict, however skilled, cannot compress eighteen months of consistent, calibrated forecasting into a week of play. Time is built into the architecture.
The third property is consistency across contexts. A Predictor Score does not reset when someone moves from a private group to a public market, or from cricket predictions to business ones. It is one continuous record. The same person who earns credibility forecasting match results in a team Slack group carries that record into public WePredict. The Score travels. This portability gives it weight across both modes: WePredict Private, which runs prediction markets within a closed group visible only to members, and WePredict Public, which opens markets to the full platform. The Score is the single thread connecting both.
Return to the WhatsApp group from Part 1. The same people, the same cricket match, the same wrong call from the overconfident member. But now there is a Predictor Score attached to every name in the group. The wrong call is not forgotten on Tuesday. It is recorded in a Score that everyone can see. The overconfident member watches their number fall. The quieter member who said ‘60%’ — uncertain, honest — watches theirs hold. Over weeks, the group develops a memory it did not have before. Patterns emerge about who is reliable and who performs confidence they do not possess. The Score did not change the people. It made the truth visible.
How the Score is computed
A reputation system only works if people believe the number is real. That depends on how it is computed — and whether it can be gamed.
The Predictor Score is not a win-loss record. A simple right/wrong count would reward lucky guessers and penalise careful forecasters who honestly expressed uncertainty. Instead, the Score measures the accuracy of the expressed probability, not merely the direction of the call. If a participant says 70% on an outcome that happens, they score better than someone who also got it right but said 95%. The overclaimer was rewarded by luck. The 70% call was rewarded by honesty. Equally, if the outcome does not happen, the person who said 30% — genuinely uncertain — is penalised far less than the person who said 95% and was catastrophically wrong. Part 5 walks through the mathematics in full.
The Score also weights markets by difficulty. A market where the crowd consensus is 90% on one side — a heavily favoured team, an obvious outcome — contributes almost nothing to anyone’s Score. If the answer was obvious, predicting it correctly demonstrates no judgement. The Score points that matter come from contested markets: uncertain outcomes, genuine dispersion of opinion, questions where the crowd is truly split. Difficult markets are where reputations are built.
Finally, the Score is designed to become more stable as it grows. An early Score can move quickly because the sample is small. A mature Score — built over two years of predictions — moves more slowly, because it represents a long record that a single week cannot fairly overturn. An impression formed after one conversation is fragile. A reputation earned over years requires sustained evidence to shift.

3
Two Stories
Story One: The Slack Team
A marketing team at a mid-size brand has been running WePredict Private — prediction markets visible only to their group — in their company Slack for six months. The markets are specific to their world: will this campaign beat last week’s open rate? Will the new homepage variant outperform the control? Will the product launch hit the Q3 target?
Before the Predictor Score existed, these questions had a predictable dynamic. The head of growth dominated the pre-launch conversation. His predictions carried the room not because they were consistently right, but because they were delivered with force. He regularly claimed 90% confidence. He was occasionally correct and frequently wrong. The junior analyst on the team — quieter, more careful — offered 60–65% estimates with reasoning attached. She was overridden most of the time. Her uncertainty was read as lack of conviction.
Six months of Predictor Scores changed the conversation completely. The data told a story no one had articulated before. The junior analyst had the highest Score on the team. Her 60–65% calls were landing at the rate she predicted. She was not uncertain — she was honest. The head of growth’s Score was mediocre. His 90% calls were right about 55% of the time — a gap that, in a proper scoring rule, is severely penalised. He was not confident. He was miscalibrated.
The team began checking Scores before pre-launch reviews. Not formally — no one announced a policy change. But the Score was visible in every Slack thread, and visible things change behaviour. The loudest voice in the room was no longer automatically the most trusted one. The analyst’s estimates started shaping decisions. The head of growth began hedging his confidence calls.
The Predictor Score did not punish the HiPPO (the highest-paid person's opinion). It simply made the truth visible. And once the truth is visible, it is very difficult to unsee.
Story Two: The Cricket Fan
A 28-year-old in Mumbai has been participating in WePredict Public — the open platform, visible to all — for eighteen months. He follows cricket obsessively and has made over 340 predictions across IPL matches, Test series, and bilateral ODI tournaments. His Predictor Score has climbed steadily — not because he wins every market, but because his calibration is unusually honest. He says 70% when he means 70%. He says 55% when he is genuinely uncertain, rather than manufacturing confidence to appear decisive.
His Score is now visible in the WePredict leaderboards for his Circles — the named prediction groups he belongs to. Other members check his Score before deciding how to weight his calls in markets they are less certain about. He has become known as a reliable predictor. Not lucky. Not loud. Reliable. That reputation took eighteen months to build. It cannot be bought by someone joining WePredict today and predicting aggressively for two weeks.
The interesting detail is what he protects most. Not his Mu balance — though he earns Mu consistently through daily NeoMails engagement. The Mu comes and goes as he stakes it in markets. What he thinks about carefully before entering a market is the impact on his Score. A careless stake — entering a market he knows nothing about simply because he has Mu to spend — will damage a record he has spent eighteen months building. The consequence is not financial. It is reputational. And that turns out to be a more powerful motivator than money for a person who already has a Score worth protecting.
Mu is what flows through the system, but Predictor Score is what gives the flow meaning.
The Score is the real stake. Mu is the token. Reputation is the game.
The group is the room. The Predictor Score is the passport.

4
Why This Is the Moat
A brand that begins building NeoMails and WePredict today will, in two years, possess something that cannot be bought: a body of Predictor Scores attached to real people, built over hundreds of real markets, across genuine uncertainty. A competitor arriving later with more money and better technology cannot replicate this. The Scores are the accumulated result of time, behaviour, and consistency. None of those can be shortcut by spending more.
This is the difference between a technological moat and a behavioural moat. A technological moat can be matched — a competitor with sufficient resources can build equivalent infrastructure. A behavioural moat cannot, because the behaviour that produced it cannot be manufactured. Two years of daily prediction, calibrated honestly across varied markets, leaves a record that is both unique to the person and impossible to fast-track. The Predictor Score is behavioural in this precise sense. Part 6 sets out the anti-gaming architecture that ensures this record cannot be manufactured by other means.
A Predictor Score built carefully over years may eventually become one of the most honest signals available about the quality of a person’s judgement — more honest than a CV, more consistent than an interview, more durable than a testimonial. What brands do with that signal is still being written.
Three things change for the Atrium system when the Predictor Score exists.
First, Mu becomes meaningful in a different register. Without the Score, Mu is a genuinely interesting engagement mechanic — a streak reward, a gamification layer, a currency that makes daily email engagement feel like progress. With the Score, Mu is the currency used in a system that produces something real: a reputation record. That changes the psychology of earning and spending it entirely. People protect their Mu not because they want the balance to be high, but because spending it carelessly will damage the record they are building. The Score transforms Mu from a points layer into a stake in a reputation game.
Second, WePredict Private becomes sticky in a way that has nothing to do with the product mechanics. Groups develop persistent hierarchies of trusted predictors over weeks and months. The ranking is visible, persistent, and social — it updates in real time and everyone in the group can see it. That social memory is what makes Private groups return. It is not the cricket markets, though those help. It is the fact that leaving the group means losing the record. And losing the record means losing the standing. No competing platform can offer a better market format and import the same social consequence.
Third, WePredict Public becomes credible rather than merely entertaining. Public prediction markets are only valuable if the participants are genuinely trying to be accurate — if the aggregate of predictions reflects real information rather than noise. The Predictor Score creates that incentive not through financial rewards but through reputational ones. A public leaderboard of Predictor Scores is a credibility system: it separates the calibrated from the loud and makes that separation visible over time.
The relationship between Mu and the Score is worth stating clearly one final time. Mu flows — earned in NeoMails, spent in markets, replenished through continued engagement. The Score compounds — built through consistent, calibrated forecasting, damaged by careless staking, impossible to shortcut. A high Mu balance means consistent attention. A high Predictor Score means consistent judgement. The best participants in the WePredict ecosystem will have both, and the two together are what make the system self-reinforcing.
The real stake in WePredict is not money. It is reputation. And once reputation begins to compound, a game becomes a system.

5
The Maths of Calibration
The sections that follow are for readers who want the mechanics.
The Predictor Score is built on a principle that most gamified systems ignore: it is not enough to be right. What matters is how confident you were, and whether that confidence was justified.
The technical foundation is a proper scoring rule derived from the Brier score family — a mathematical function with one defining property: the only way to maximise your expected score over time is to report your genuine belief. Expressing more confidence than you actually have, or less, will on average hurt your score rather than help it. The system creates a structural incentive for honesty about uncertainty.
Prediction Quality
For any resolved market, compute a Prediction Quality score:

PQ = 1 − (p − o)²

Where p is the predicted probability (a decimal between 0 and 1) and o is the outcome (1 if the event happened, 0 if it did not). Three examples show why calibration beats boldness:
| Prediction | Outcome | PQ Score |
| --- | --- | --- |
| Said 70% (0.70), it happened | o = 1 | 1 − (0.70 − 1)² = 0.91 |
| Said 95% (0.95), it happened | o = 1 | 1 − (0.95 − 1)² = 0.9975 |
| Said 95% (0.95), it did NOT happen | o = 0 | 1 − (0.95 − 0)² = 0.0975 ← catastrophic |
The third row is the one to focus on. Saying 95% and being wrong produces a PQ of 0.0975 — catastrophically low. Saying 70% and being wrong produces 1 − (0.70 − 0)² = 0.51 — more than five times better, on the same outcome. The penalty for overclaiming is not proportional to the error. It is severe by design.

Equally important: saying 70% and being right (PQ = 0.91) scores notably less than saying 95% and being right (PQ = 0.9975). The system is not punishing confidence. It is punishing unjustified confidence. A participant who genuinely believes 95% and says 95% is rewarded when right. A participant who does not believe 95% but says it anyway to sound decisive will, over time, be wrong at rates that destroy their score.
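For readers who prefer code, a minimal Python sketch makes the incentive concrete. The function names are illustrative, not WePredict's implementation; the expected-value demonstration at the end is the defining property of any proper scoring rule, not something specific to the platform.

```python
def prediction_quality(p: float, o: int) -> float:
    """Prediction Quality under the quadratic (Brier-family) rule.

    p: expressed probability that the event happens, in [0, 1].
    o: resolved outcome, 1 if it happened, 0 if it did not.
    """
    return 1.0 - (p - o) ** 2

# The three rows of the table above:
print(prediction_quality(0.70, 1))  # ≈ 0.91
print(prediction_quality(0.95, 1))  # ≈ 0.9975
print(prediction_quality(0.95, 0))  # ≈ 0.0975, the catastrophic overclaim

# Why honesty wins on average: if your true belief is 60%, claiming 90%
# lowers your expected score.
def expected_pq(belief: float, claim: float) -> float:
    return belief * prediction_quality(claim, 1) + (1 - belief) * prediction_quality(claim, 0)

print(expected_pq(0.60, 0.60))  # ≈ 0.76
print(expected_pq(0.60, 0.90))  # ≈ 0.67
```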

Difficulty weighting — the anti-obvious mechanism
Raw PQ scores are multiplied by a difficulty weight:

D = 4 × c × (1 − c)

Where c is the leave-one-out crowd consensus — the average of all other participants’ predictions, excluding the focal participant’s own prediction. Using leave-one-out prevents the circularity of a participant influencing the difficulty of their own market. This formula (a Bernoulli variance, c(1 − c), scaled to peak at 1.0) reaches its maximum when consensus is exactly 50/50 and collapses toward zero as consensus becomes overwhelming:
| Crowd Consensus (c) | Difficulty Weight (D) |
| --- | --- |
| 50% (genuinely split) | 1.00 |
| 70% | 0.84 |
| 85% | 0.51 |
| 90% | 0.36 |
| 95% | 0.19 |
| 99% (monsoon market) | 0.04 |

The weighted contribution of any prediction is therefore:

Weighted PQ = D × PQ
A 99%-consensus market where someone stakes confidently and wins scores: 0.04 × 0.9975 ≈ 0.04. Negligible. The same participant, in a genuinely uncertain market (c = 0.52) where they say 65% and get it right, scores: 0.99 × 0.88 ≈ 0.87. More than twenty times the reward for honest prediction under real uncertainty.
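Extending the sketch, difficulty weighting and the per-prediction contribution reproduce both worked examples (the one-line prediction_quality simply repeats the earlier block):

```python
def prediction_quality(p, o): return 1.0 - (p - o) ** 2  # as in the earlier sketch

def difficulty_weight(c: float) -> float:
    """D = 4c(1 − c): 1.0 at a genuine 50/50 split, near zero at lopsided consensus.
    c is the market's leave-one-out crowd consensus."""
    return 4.0 * c * (1.0 - c)

def weighted_contribution(p: float, o: int, c: float) -> float:
    """A single prediction's raw contribution: difficulty times quality."""
    return difficulty_weight(c) * prediction_quality(p, o)

print(weighted_contribution(0.95, 1, 0.99))  # ≈ 0.04, the monsoon market
print(weighted_contribution(0.65, 1, 0.52))  # ≈ 0.87, an honest call under real uncertainty
```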
Score aggregation and time decay
The overall Predictor Score is a time-weighted, difficulty-weighted average of Prediction Quality across all eligible predictions:

Score = Σ (T × D × PQ) / Σ (T × D)
The denominator includes both time weight and difficulty weight — not time weight alone. This means low-difficulty markets contribute near-zero to both numerator and denominator, so they genuinely add almost nothing to the Score rather than merely diluting it. Easy markets are not just penalised; they are structurally inert.
T is a time weight that applies a gentle quarterly decay (λ ≈ 0.90): predictions from eight quarters ago carry roughly half the weight of recent ones. A strong long-term record cannot be wiped by a bad month, but the Score remains a living reflection of current form rather than a monument to past performance.
The raw Score (ranging from 0 to 1) is normalised to a display scale of 0 to 2,000 — similar to an Elo chess rating. A Score of 1,400 is legible and comparable across participants in a way that ‘0.71’ is not.
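Putting the pieces together, a sketch of the aggregation under the stated parameters (λ = 0.90, a 0 to 2,000 display scale; the tuple layout is an assumption):

```python
def prediction_quality(p, o): return 1.0 - (p - o) ** 2  # from the earlier sketches
def difficulty_weight(c): return 4.0 * c * (1.0 - c)

LAMBDA = 0.90  # quarterly time-decay factor

def time_weight(quarters_ago: int) -> float:
    return LAMBDA ** quarters_ago

def predictor_score(predictions) -> float:
    """predictions: iterable of (p, o, c, quarters_ago) tuples.
    Returns the display Score on the 0 to 2,000 scale."""
    num = den = 0.0
    for p, o, c, q in predictions:
        w = time_weight(q) * difficulty_weight(c)  # T × D appears in both sums
        num += w * prediction_quality(p, o)
        den += w
    return 2000.0 * num / den if den else 0.0

# Easy markets are structurally inert: five confident wins at 99% consensus
# barely move a modest record (and Gate 1 in Part 6 would exclude them anyway).
base = [(0.65, 1, 0.52, 0)] * 10
print(predictor_score(base))                             # ≈ 1755
print(predictor_score(base + [(0.99, 1, 0.99, 0)] * 5))  # ≈ 1760
```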
Domain sub-scores
Underneath the headline Score sit domain sub-scores: sport, business, politics, entertainment, and others as the platform grows. A participant unusually well-calibrated on IPL outcomes is not necessarily equally strong on quarterly sales forecasts. The headline number gives simplicity. Domain scores give fidelity.
Domain sub-scores use the same formula applied to market subsets. The overall Score is a weighted average of domain sub-scores, weighted by effective information — the sum of T × D within each domain, not the raw count of predictions. This means 100 trivial predictions in one domain do not dominate 20 genuinely difficult predictions in another. Depth of calibration matters; volume alone does not.
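Reusing predictor_score and the weight helpers from the sketches above, effective-information weighting could compose the headline number like this (the domain taxonomy and data layout are assumptions):

```python
def overall_score(predictions_by_domain: dict) -> float:
    """predictions_by_domain: domain name -> list of (p, o, c, quarters_ago).
    Each domain is weighted by its effective information, the sum of T × D,
    not by its raw prediction count."""
    blended = total_info = 0.0
    for preds in predictions_by_domain.values():
        info = sum(time_weight(q) * difficulty_weight(c) for _, _, c, q in preds)
        blended += info * predictor_score(preds)  # domain sub-score, same formula
        total_info += info
    return blended / total_info if total_info else 0.0

profile = {
    "sport":    [(0.99, 1, 0.99, 0)] * 100,  # high volume, near-zero information
    "business": [(0.60, 1, 0.50, 0)] * 20,   # low volume, maximum difficulty
}
print(overall_score(profile))  # ≈ 1732: dominated by the business record
```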

6
The Anti-Gaming Architecture
A scoring system worth building is a scoring system worth attacking. The Predictor Score is designed with the assumption that some participants will try to game it from day one, and that the response cannot rely on human moderation at scale. The defences have to be structural.
The monsoon market problem
The simplest gaming attempt: a participant creates a Private WePredict market — ‘Will it rain in Mumbai tomorrow?’ during peak monsoon season — invites cooperating accounts, stakes 99%, resolves it, and repeats a hundred times.
Even after a hundred such orchestrated markets, the impact on the display Score would be negligible — the difficulty weighting ensures near-zero contribution to both numerator and denominator of the weighted average.
The effort is economically irrational before any further safeguards apply. But difficulty weighting alone is not sufficient, because a determined participant might seek genuinely uncertain markets and manipulate resolution. Five structural gates close the remaining gaps; a code sketch after the list shows how they compose.
The five gates
Gate 1 — Entropy floor — A market only becomes Score-eligible if the entropy of participant predictions at close exceeds a minimum threshold (H > 0.5 bits, computed as H = −c × log₂c − (1−c) × log₂(1−c)). At 90% consensus, H ≈ 0.47 bits — below threshold, not counted. At 75% consensus, H ≈ 0.81 bits — eligible. This gate is computed automatically from participant behaviour, not set by the market creator.
Gate 2 — Minimum distinct participants — For a market to update the global Predictor Score, at least ten distinct accounts must have predicted. A collusive group of three cannot generate meaningful Score movement for each other. This gate creates an important distinction: small groups still generate a local group Score visible within their circle — members see each other’s relative rankings — but only markets passing this gate affect the global Score that travels with a participant everywhere.
Gate 3 — Creator exclusion — The account that creates a Private market earns zero Score from it, regardless of outcome. Creator exclusion is absolute for privately-created markets. On platform-curated public markets — where resolution is external and no participant controls closure or adjudication — creators may participate on equal terms. This preserves the incentive to create good markets without creating the incentive to manufacture easy ones.
Gate 4 — Maturity multiplier — New accounts begin with a suppressed Score weight that rises as the participant accumulates eligible predictions across distinct domains:
M(n) = 1 − e^(−n/50), where n is the number of eligible predictions
After 10 eligible predictions, the multiplier is 0.18. After 50, it reaches 0.63. After 150, it reaches 0.95. A freshly created account cannot sprint to a high Score through a burst of activity. The record must be built over time across varied domains.
Gate 5 — Single-market cap — No individual market can move the overall Score by more than a set ceiling, regardless of difficulty or expressed confidence. The Score must reflect a pattern, not a moment. One spectacular call cannot inflate an otherwise weak record.
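Composed as code, the gates amount to a short eligibility-and-weighting pass. The entropy floor, participant minimum, creator rule, and maturity values below come from the text; the cap value and all names are illustrative:

```python
import math

def entropy_bits(c: float) -> float:
    """Gate 1: Shannon entropy of the close-time consensus, in bits."""
    if c <= 0.0 or c >= 1.0:
        return 0.0
    return -c * math.log2(c) - (1 - c) * math.log2(1 - c)

def maturity_multiplier(n_eligible: int) -> float:
    """Gate 4: 1 − e^(−n/50) matches the stated values
    (0.18 at 10, 0.63 at 50, 0.95 at 150 eligible predictions)."""
    return 1.0 - math.exp(-n_eligible / 50.0)

SINGLE_MARKET_CAP = 25.0  # Gate 5 ceiling on display-Score movement (illustrative value)

def updates_global_score(consensus: float, n_participants: int,
                         is_creator: bool, market_is_private: bool) -> bool:
    """Gates 1-3: may this market touch a participant's global Score?"""
    if entropy_bits(consensus) <= 0.5:    # Gate 1: entropy floor
        return False
    if n_participants < 10:               # Gate 2: minimum distinct participants
        return False
    if is_creator and market_is_private:  # Gate 3: creator exclusion
        return False
    return True

def capped_update(old_score: float, proposed_score: float) -> float:
    """Gate 5: clamp any single market's movement of the display Score."""
    delta = max(-SINGLE_MARKET_CAP, min(SINGLE_MARKET_CAP, proposed_score - old_score))
    return old_score + delta
```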

What passes through the gates
The gates do not reduce the volume of Score-eligible predictions for good-faith participants. A genuinely uncertain market — closely contested, widely participated, externally resolved — passes every gate and contributes fully. A difficult call, in a market the crowd found hard, in a domain with prior predictions, on a mature account, is exactly what the Score is designed to reward. The gates make gaming economically irrational. The effort required to manufacture a high Score through artificial means substantially exceeds the effort required to forecast honestly.
Cluster detection
One additional mechanism operates at the network level rather than the market level. If a cluster of accounts — identifiable by graph proximity: same markets, correlated predictions, common creators — shows statistically anomalous mutual agreement, their within-cluster predictions are down-weighted automatically. Detection applies standard anomaly-detection techniques to correlated prediction patterns and resolution behaviour, which is sufficient to identify coordinated behaviour without requiring certainty. A mild anomaly triggers mild down-weighting. A severe anomaly triggers near-zero weighting. The system does not need to prove fraud. It needs only to ensure that genuine uncertainty, not coordinated certainty, drives Score movement.
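The text deliberately leaves the detection technique open, so the following is only one plausible shape, not the platform's method: pairwise agreement over shared markets, with a graduated weight penalty. Every threshold here is an assumption.

```python
def agreement_rate(preds_a: dict, preds_b: dict) -> float:
    """Fraction of shared markets on which two accounts land on the same
    side of 50%. preds_*: market_id -> expressed probability."""
    shared = preds_a.keys() & preds_b.keys()
    if len(shared) < 10:  # too little overlap to judge
        return 0.0
    same_side = sum((preds_a[m] > 0.5) == (preds_b[m] > 0.5) for m in shared)
    return same_side / len(shared)

def cluster_downweight(rate: float) -> float:
    """Graduated penalty: mild anomaly, mild down-weighting; severe anomaly,
    near-zero weight. The 0.80 'normal agreement' ceiling is assumed."""
    if rate <= 0.80:
        return 1.0
    # Linear ramp from full weight at 0.80 down to 0.05 at perfect agreement.
    return max(0.05, 1.0 - (rate - 0.80) / 0.20 * 0.95)
```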
The result
A participant who spends six months gaming the Predictor Score will accumulate a weak Score — the gates, the difficulty weighting, and the maturity multiplier collectively ensure this. A participant who spends six months forecasting honestly across varied uncertain markets — getting some right, some wrong, always reporting genuine confidence — will accumulate a Score that is both higher and more widely trusted. The gap between the two is legible and widens every month, because the gamed Score cannot compound while the honest one can.
The Score does not need to be ungameable. It needs to make gaming less rewarding than honest forecasting. It does — and by a margin large enough to matter.
The architecture serves the promise
The mathematics in Part 5 and the safeguards in Part 6 exist for one reason: to ensure that the reputation the Predictor Score produces is real. A Score that can be manufactured is not a reputation system. It is a leaderboard — and leaderboards are precisely the problem WePredict was built to escape.
The Predictor Score is the foundation on which everything else in WePredict rests. Mu gives people something to stake. The Score gives them something worth protecting. Together, they create the consequence that transforms a game into a system — and a system into a moat.