Twitter (X) has had a tumultuous past two years. Last year, Elon Musk purchased the platform for $44 billion and then overhauled its staffing, content moderation, business model, and site culture, changes that may owe as much to Musk's soft power as to specific policy decisions. Amid these controversial actions, however, one new feature on Twitter has quickly become important, and appears to be beloved by people across the political spectrum: Community Notes.

Community Notes is a fact-checking and anti-misinformation tool that sometimes attaches context notes to tweets, like the one on Elon Musk's tweet above. It was originally called Birdwatch, and first launched as a pilot project in January 2021. It has expanded gradually since then, with its most rapid growth coinciding with Elon Musk's takeover of Twitter last year. Today, Community Notes appear regularly on tweets that get very high engagement, including those on controversial political topics. Both in my opinion, and in the judgment of many people across the political spectrum I have talked to, the notes, when they appear, are informative and valuable.
What interests me most about Community Notes, however, is that despite not being a "crypto project", it may be the closest thing to an instantiation of "crypto values" that we have seen in the mainstream world. Community Notes are not written or curated by some centrally selected set of experts; rather, anyone can write and rate them, and which notes are shown or not shown is decided entirely by an open-source algorithm. The Twitter site has a detailed and extensive guide describing how the algorithm works, and you can download the data containing the published notes and ratings, run the algorithm locally, and verify that the output matches what is visible on the site. It is not perfect, but it comes surprisingly close to the ideal of credible neutrality in quite contentious situations, and it is remarkably useful at the same time.
How does the Community Notes algorithm work?
Anyone with a Twitter account that meets certain criteria (basically: active for more than six months, no policy violations, verified phone number) can sign up to participate in Community Notes. Currently, participants are accepted slowly and at random, but eventually the plan is to let in anyone who qualifies. Once accepted, you can at first only rate existing notes; once you have made enough ratings of sufficient quality (judged by seeing which of your ratings match the final outcome for each note), you can also write notes of your own.

When you write a note, it receives a score based on the ratings of other Community Notes members. These ratings can be thought of as votes on a three-point scale of "helpful", "somewhat helpful", and "not helpful", though a rating can also contain other tags that play a role in the algorithm. Based on these ratings, the note gets a helpfulness score. If the score exceeds 0.40, the note is shown; otherwise, it is not.
What makes the algorithm unique is how the score is calculated. Unlike simpler algorithms, which just compute some kind of sum or average of user ratings and use that as the final result, the Community Notes rating algorithm explicitly tries to prioritize notes that receive positive ratings from people with differing perspectives. That is, if people who usually disagree in their ratings end up agreeing on a particular note, that note is scored especially highly.
Let's take a closer look at how this works. We have a set of users and a set of notes; we can create a matrix M, where the cell M_{u,n} represents how the u-th user rated the n-th note.
For any given note, most users have not rated it, so most entries in the matrix will be zero, but that's fine. The goal of the algorithm is to create a four-column model of users and notes, assigning each user two stats that we can call "friendliness" and "polarity", and each note two stats that we can call "helpfulness" and "polarity". The model tries to predict the matrix as a function of these values, using the formula:

$$M_{u,n} = \mu + i_u + i_n + f_u \cdot f_n$$
Here I give both the terminology used in the Birdwatch paper and my own terms, which offer a more intuitive sense of what the variables mean without invoking the math:
μ is a "general public mood" parameter that measures how high the ratings given by users tend to be overall.
i_u is the user's "friendliness": how likely that particular user is to give high ratings.
i_n is the note's "helpfulness": how likely that particular note is to be rated highly. This is the variable we ultimately care about.
f_u and f_n are the "polarity" of the user and the note, respectively: their position along the dominant axis of political polarization. In practice, negative polarity roughly means "left-leaning" and positive polarity "right-leaning", but note that this axis of polarization is derived from analyzing the user and note data; the concepts of left and right are not hard-coded in.
The algorithm uses a fairly basic machine learning technique (standard gradient descent) to find the values of these variables that best predict the matrix. The helpfulness assigned to a particular note is that note's final score. A note is shown if its helpfulness is at least +0.40.
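To make this concrete, here is a minimal sketch of this kind of model fit in Python. This is my own illustration, with made-up toy data and hyperparameters, not the production scorer:

```python
import numpy as np

# Minimal sketch (not the production scorer): learn mu, i_u, i_n, f_u, f_n by
# stochastic gradient descent so that mu + i_u + i_n + f_u * f_n approximates
# the observed ratings. Toy data and hyperparameters are made up.

rng = np.random.default_rng(0)
num_users, num_notes = 50, 20

# Sparse observed ratings: (user, note, rating) triples, ratings in [0, 1].
observed = [(int(rng.integers(num_users)), int(rng.integers(num_notes)), float(rng.random()))
            for _ in range(400)]

mu = 0.0
i_u = np.zeros(num_users)            # user "friendliness"
i_n = np.zeros(num_notes)            # note "helpfulness" (the score we care about)
f_u = rng.normal(0, 0.1, num_users)  # user polarity
f_n = rng.normal(0, 0.1, num_notes)  # note polarity

lr, reg = 0.02, 0.03                 # learning rate and L2 penalty (arbitrary)
for _ in range(200):                 # epochs over the observed ratings
    for u, n, r in observed:
        err = mu + i_u[u] + i_n[n] + f_u[u] * f_n[n] - r
        fu, fn = f_u[u], f_n[n]      # snapshot before updating
        mu     -= lr * err
        i_u[u] -= lr * (err + reg * i_u[u])
        i_n[n] -= lr * (err + reg * i_n[n])
        f_u[u] -= lr * (err * fn + reg * fu)
        f_n[n] -= lr * (err * fu + reg * fn)

print("fitted note helpfulness:", np.round(i_n, 2))  # a note is shown if >= 0.40
```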
The core cleverness here is that the "polarity" terms absorb the properties of a note that cause it to be liked by some users and disliked by others, while the "helpfulness" term measures only the properties that cause a note to be liked by all users. Selecting on helpfulness thus identifies notes that win approval across tribes, and screens out notes that are cheered by one tribe at the cost of being resented by another.
The above describes only the core of the algorithm. In reality, there are many extra mechanisms bolted on top; fortunately, they are described in the public documentation. These mechanisms include the following (a toy sketch of how a few of them combine appears after the list):
The algorithm gets run many times, each time adding some randomly generated extreme "pseudo-votes" to the ratings. This means that the algorithm's true output for each note is a range of values, and the final result depends on a "lower confidence bound" taken from this range, which is checked against a threshold of 0.32.
If many users (especially users with a polarity similar to the note's) rate a note "Not Helpful", and furthermore they specify the same "tag" (e.g. "Argumentative or biased language", "Sources do not support note") as the reason for their rating, the helpfulness threshold the note needs to reach in order to be published rises from 0.40 to 0.50 (this may look small, but it is very significant in practice).
Once a note has been accepted, the threshold that its helpfulness must drop below for it to be de-accepted is 0.01 points lower than the threshold it originally needed to reach to be accepted.
The algorithm gets run even more times with multiple models, which can sometimes promote notes whose original helpfulness score is between 0.30 and 0.40.
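Here is a toy sketch of how a few of these checks might combine into a final show-or-hide decision. The thresholds (0.32, 0.40, 0.50, and the 0.01 hysteresis) come from the description above; the structure around them is my own simplification of the published rules:

```python
# Toy decision rule (my own simplification, not the published pipeline).

def note_is_shown(scores, tag_penalty, currently_shown):
    """scores: helpfulness scores from re-runs with random extreme pseudo-votes."""
    lower_bound = min(scores)              # stand-in for the lower confidence bound
    point_estimate = sum(scores) / len(scores)

    threshold = 0.50 if tag_penalty else 0.40
    if currently_shown:
        threshold -= 0.01                  # hysteresis: must drop a bit further to be removed

    return lower_bound >= 0.32 and point_estimate >= threshold

# A note scoring ~0.45 survives normally, but not once the tag penalty applies.
print(note_is_shown([0.41, 0.45, 0.48], tag_penalty=False, currently_shown=False))  # True
print(note_is_shown([0.41, 0.45, 0.48], tag_penalty=True, currently_shown=True))    # False
```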
All in all, you get some pretty complicated Python code totaling 6,282 lines spread across 22 files. But it is all open: you can download the notes and ratings data, run the code yourself, and check whether the output matches what is actually happening on Twitter.
So what does this look like in practice?
Probably the biggest difference between this algorithm and simply taking an average of people's votes is the notion of what I call "polarity" values. The algorithm's documentation calls them f_u and f_n, using f for factor, because the two terms get multiplied together; the more generic terminology is partly due to an eventual ambition to make f_u and f_n multi-dimensional.
Polarity values are assigned to both users and notes. The link between user IDs and the underlying Twitter accounts is intentionally kept secret, but notes are public. In practice, at least for the English-language data set, the polarities generated by the algorithm correlate very closely with left vs right.
Here are a few examples of notes with polarity around -0.8:

Note that I am not cherry-picking here; these are literally the first three rows in the scored_notes.tsv spreadsheet that I generated when running the algorithm locally whose polarity score (called coreNoteFactor1 in the spreadsheet) is less than -0.8.
Now, here are a few notes with polarity around +0.8. It turns out that many of these are either people talking about Brazilian politics in Portuguese or Tesla fans angrily rebutting criticism of Tesla, so let me cherry-pick a little to find a few that don't fall into either category:

Once again, a reminder: the "left vs right divide" is not hard-coded into the algorithm in any way; it is discovered computationally. This suggests that if you applied this algorithm in other cultural contexts, it could automatically detect their primary political divides and bridge across them, too.
Meanwhile, here is what notes with the highest helpfulness scores look like. This time, because these notes are actually being displayed on Twitter, I can simply take screenshots:

There's another one:

The second note touches a highly partisan political topic more directly, but it is a clear, high-quality, and informative note, so it gets rated highly. Overall, the algorithm seems to work, and verifying its output by running the code seems feasible.
What do I think about this algorithm?
What struck me most when analyzing this algorithm is just how complex it is. There is the "academic paper version", which uses gradient descent to find the best fit to a five-term vector-and-matrix equation, and then there is the real version, a complicated series of many different executions of the algorithm, with lots of arbitrary coefficients along the way.
Even the academic paper version hides complexity under the hood. The equation it optimizes is degree-4 (there is a quadratic f_u*f_n term inside the prediction formula, and the cost function measures the square of the error). While optimizing a quadratic equation over any number of variables almost always has a unique solution, which you can calculate with fairly basic linear algebra, optimizing a degree-4 equation over many variables often has many solutions, so different rounds of the gradient descent algorithm may arrive at different answers. Small changes to the input can cause the descent to flip from one local minimum to another, significantly changing the output.
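A toy example makes the point; this has nothing to do with the Community Notes code itself, it just shows gradient descent on the simplest possible degree-4 objective settling into different minima depending on where it starts:

```python
# Minimizing the degree-4 objective (f_u * f_n - 1)^2 by gradient descent:
# different starting points land in different, equally good minima.

def descend(f_u, f_n, lr=0.05, steps=1000):
    for _ in range(steps):
        err = f_u * f_n - 1.0
        f_u, f_n = f_u - lr * err * f_n, f_n - lr * err * f_u  # simultaneous update
    return round(f_u, 3), round(f_n, 3)

print(descend(0.5, 0.5))    # converges near (1, 1)
print(descend(-0.5, -0.5))  # converges near (-1, -1): equal cost, opposite "polarity"
```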
The difference between this and the algorithms I have helped develop, such as quadratic funding, feels to me like the difference between an economist's algorithm and an engineer's algorithm. An economist's algorithm, at its best, values simplicity, is relatively easy to analyze, and has clean mathematical properties showing why it is optimal (or least-bad) for the task at hand, ideally along with proofs bounding how much damage someone could do by trying to exploit it. An engineer's algorithm, on the other hand, is arrived at through an iterative process of trial and error, seeing what works and what doesn't in the engineer's operating environment. Engineers' algorithms are pragmatic and get the job done; economists' algorithms don't go completely crazy when confronted with the unexpected.
Or, as respected internet philosopher roon (aka tszzl) puts it in a related thread:

Of course, I would say that the "theoretical aesthetics" side of cryptocurrency is necessary precisely to be able to tell apart protocols that are truly trustless from those that merely look good and work well on the surface but actually require trusting some centralized actor, or, even worse, are outright scams.
Deep learning works under normal circumstances, but it has unavoidable weaknesses against all kinds of adversarial machine-learning attacks. Technical traps and high-level abstraction ladders, done well, can resist such attacks. So this leaves me with a question: could we turn Community Notes itself into something that looks more like an economist's algorithm?
To get a sense of what this would mean in practice, let's explore an algorithm I designed a few years ago for a similar purpose: pairwise-bounded quadratic funding.

The goal of pairwise-bounded quadratic funding is to plug a hole in "regular" quadratic funding, whereby even just two colluding participants can contribute very large amounts to a fake project that sends the money back to them, and thereby receive a large subsidy that drains the entire funding pool. In pairwise-bounded quadratic funding, we assign each pair of participants a limited budget M. The algorithm walks through all possible pairs of participants, and if it decides to add a subsidy to some project P because both participant A and participant B supported it, that subsidy is deducted from the budget assigned to the pair (A, B). Hence, even if k participants collude, the most they can steal from the mechanism is k * (k-1) * M.
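In code, the bounding idea looks roughly like the sketch below. This is my own simplified rendering of the mechanism as just described, with a hypothetical pairwise_bounded_subsidies helper, not the exact implementation used anywhere:

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Simplified sketch of pairwise bounding: the subsidy generated by the
# cross-terms of any one pair of contributors, summed across all projects,
# is capped at M.

def pairwise_bounded_subsidies(contributions, M):
    """contributions: {project: {contributor: amount}} -> {project: subsidy}."""
    pair_used = defaultdict(float)   # subsidy already attributed to each pair
    subsidies = defaultdict(float)
    for project, contribs in contributions.items():
        for (a, c_a), (b, c_b) in combinations(sorted(contribs.items()), 2):
            term = sqrt(c_a * c_b)                    # pairwise cross-term from the QF formula
            term = min(term, M - pair_used[(a, b)])   # cap by the pair's remaining budget
            pair_used[(a, b)] += term
            subsidies[project] += term
    return dict(subsidies)

# Two colluders pumping a fake project extract at most M, however much they put in.
print(pairwise_bounded_subsidies({"fake": {"A": 10_000, "B": 10_000}}, M=100))  # {'fake': 100.0}
```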
This form of the algorithm does not translate well to the Community Notes context, because each user casts only a few votes: on average, the number of votes that any two users have in common is close to zero, so the algorithm cannot learn a user's polarity simply by looking at each pair of users separately. The whole point of the machine-learning model is to "fill in" the matrix from very sparse source data that cannot be analyzed directly in this way. But the challenge of this approach is that it takes extra effort to avoid results that are highly volatile in the face of a few bad votes.
Does Community Notes actually resist polarization?
We can analyze whether the Community Notes algorithm actually manages to resist polarization, that is, whether it does any better than a naive voting algorithm. A naive voting algorithm already resists polarization to some degree: a post with 200 likes and 100 dislikes does worse than a post that gets just the 200 likes. But does Community Notes do better than that?
Looking at the algorithm abstractly, it is hard to say. Why couldn't a post with a high average rating but strong polarization end up with both a strong polarity and a high helpfulness? The idea is that the polarity terms are supposed to "absorb" the properties of a note that cause its votes to conflict, but do they actually do so?
To check this, I ran my own simplified implementation for 100 rounds. The averaged results were:

In this test, "good" Notes received a +2 rating among users of the same political affiliation and a +0 rating among users of the opposite political affiliation, and "good but more extreme" Notes received a +0 rating among users of the same political affiliation. It received a +4 rating among users of the opposite faction and a -2 rating among users of the opposite faction. Although the average scores are the same, the polarity is different. And in fact, the average usefulness of "good" Notes seems to be higher than that of "good but more extreme-leaning" Notes.
An algorithm closer to the "economist's algorithm" would have a clearer story for how it penalizes polarization.
How useful is all this in high-stakes situations?
We can get some sense of this by looking at one specific situation. About a month ago, Ian Bremmer complained that a highly critical community note that had appeared on a tweet by a Chinese government official had been removed.

This is hard stuff. It is one thing to do mechanism design in an Ethereum community environment where the biggest complaint might be $20,000 going to a polarizing Twitter influencer. It is quite another when the questions are political and geopolitical, affect many millions of people, and everyone tends, often quite reasonably, to assume the worst motives. But if mechanism designers want to have a significant impact on the world, engaging with these high-stakes environments is essential.
In the Twitter case, there is an obvious reason to suspect centralized manipulation behind the note's removal: Elon Musk has plenty of business interests in China, so it is conceivable that Musk forced the Community Notes team to interfere with the algorithm's output and delete this particular note.
Fortunately, the algorithm is open source and verifiable, so we can actually dig in and check! Let's do that. The URL of the original tweet is https://twitter.com/MFA_China/status/1676157337109946369. The number at the end, 1676157337109946369, is the tweet's ID. We can search for that ID in the downloadable data, and identify the specific row in the spreadsheet that contains the note in question:

Here we get the ID of the note itself, 1676391378815709184. We then search for that ID in the scored_notes.tsv and note_status_history.tsv files generated by running the algorithm, and get the following:

The second column in the first output is the note's current score. The second output shows the note's history: its current status is in the seventh column (NEEDS_MORE_RATINGS), and the first status it received that was not NEEDS_MORE_RATINGS is in the fifth column (CURRENTLY_RATED_HELPFUL). So we see that the algorithm itself first showed the note, and then removed it once its score dropped somewhat; no centralized intervention seems to have been involved.
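For anyone who wants to reproduce this lookup, a Python sketch follows. The tweet and note IDs are the real ones from this case, but the filenames and column names reflect my reading of the public data downloads and may need adjusting if the format differs:

```python
import csv

TWEET_ID = "1676157337109946369"

# Find the note attached to the tweet in the published notes data.
note_id = None
with open("notes-00000.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        if row["tweetId"] == TWEET_ID:
            note_id = row["noteId"]      # -> 1676391378815709184
            break

# Pull the same note's score and status history from the scorer's outputs.
for path in ("scored_notes.tsv", "note_status_history.tsv"):
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row.get("noteId", row.get("note_id")) == note_id:
                print(path, dict(row))
                break
```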
We can also look at the question from the other direction, by examining the votes themselves. Scanning the ratings-00000.tsv file, we can isolate all the ratings for this note and count how many were HELPFUL versus NOT_HELPFUL:

But if you sort the ratings by timestamp and look at the first 50 votes, you see 40 HELPFUL votes and 9 NOT_HELPFUL votes. So we reach the same conclusion: the note's initial audience viewed it more favorably than its later audience, so its score started out higher and dropped over time.
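The tally itself is a few lines of Python, with the same caveats as above about exact file and column names:

```python
import csv

NOTE_ID = "1676391378815709184"

# Collect (timestamp, rating) pairs for this note and sort oldest-first.
votes = []
with open("ratings-00000.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        if row["noteId"] == NOTE_ID:
            votes.append((int(row["createdAtMillis"]), row["helpfulnessLevel"]))
votes.sort()

def tally(subset):
    counts = {}
    for _, level in subset:
        counts[level] = counts.get(level, 0) + 1
    return counts

print("first 50 votes:", tally(votes[:50]))
print("all votes:     ", tally(votes))
```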
Unfortunately, the exact way in which the note changed status is complicated to explain: it is not a simple matter of "before, its score was above 0.40, and now it's below 0.40, so it got removed". Rather, the high volume of NOT_HELPFUL replies triggered one of the outlier conditions, raising the helpfulness score the note needed in order to stay above the threshold.
This is another good learning opportunity, with a lesson: making a credibly neutral algorithm truly trustworthy requires keeping it simple. If a note moves from being accepted to not being accepted, there should be a simple and legible story for why.
Of course, there is an entirely different way in which this vote could have been manipulated: brigading. Someone who sees a note they disapprove of could call on a highly engaged community (or, worse, an army of fake accounts) to rate it NOT_HELPFUL, and it may not take that many votes to push the note from "helpful" to "polarized". Properly reducing the algorithm's vulnerability to such coordinated attacks will require more analysis and work. One possible improvement would be to not let just any user vote on just any note, and instead use the "For You" recommendation algorithm to randomly allocate notes to raters, only letting raters rate the notes they have been allocated.
Are Community Notes not “brave” enough?
The main criticism I see of Community Notes is basically that it doesn't do enough. I saw two recent articles mentioning this. To quote one of the articles:
The program is hampered by a serious limitation: for a community note to be shown publicly, it has to be accepted by a general consensus of people across the political spectrum.
"It has to have ideological consensus," he said. "That means people on the left and people on the right have to agree that the note has to be attached to the tweet."
Essentially, he said, it requires "cross-ideological agreement on the truth, which is nearly impossible to achieve in an increasingly partisan environment."
This is a difficult question, but ultimately I come down on the side that it is better to let ten misinformative tweets roam free than to have one tweet unfairly annotated. We have seen years of fact-checking that is brave, that operates from the perspective of "actually, we know the truth, and we know that one side lies more often than the other". And what happened as a result?

Honestly, there is fairly widespread distrust of the very concept of fact-checking. One strategy here is to say: ignore the haters, remember that the fact-checking experts really do know the facts better than any voting system, and stay the course. But going all-in on this approach seems risky. There is value in building cross-tribal institutions that are at least somewhat respected by everyone. As with William Blackstone's dictum and the courts, it feels to me that maintaining such respect requires a system that commits its errors by omission rather than by commission. Hence, it seems valuable to me that at least one major organization takes this different path, and treats its rare cross-tribal respect as a resource to be cherished.
Another reason it seems fine for Community Notes to be conservative is that I do not think every misinformative tweet, or even most misinformative tweets, should receive a corrective note. Even if fewer than one percent of misinformative tweets get a note providing context or a correction, Community Notes still provides an extremely valuable service as an educational tool. The goal is not to correct everything; rather, the goal is to remind people that multiple perspectives exist, that some posts which look convincing and engaging in isolation are actually quite wrong, and that you, yes you, can often do a basic internet search to verify that they are wrong.
Community Notes cannot be, and is not intended to be, a miracle cure for all problems in public epistemology. Whatever problems it leaves unsolved, there is plenty of room for other mechanisms to fill the gap, whether newfangled gadgets like prediction markets or established organizations employing full-time staff with domain expertise.
Conclusions
Community Notes is not just a fascinating social media experiment; it is also an example of a fascinating new genre of mechanism design: mechanisms that consciously try to identify polarization, and that favor things which bridge divides rather than perpetuate them.
The two other examples in this category that I am aware of are: (i) the pairwise-bounded quadratic funding mechanism used in Gitcoin Grants, and (ii) Polis, a discussion tool that uses clustering algorithms to help communities identify statements that are commonly well-received across people who normally hold different viewpoints. This area of mechanism design is valuable, and I hope we see much more academic work in the field.
The algorithmic transparency that Community Notes offers is not quite full-on decentralized social media; if you disagree with how Community Notes works, there is no way to view the same content through a different algorithm. But it is the closest that very large-scale applications are going to get within the next couple of years, and we can see that it already provides a lot of value, both by preventing centralized manipulation and by ensuring that platforms which do not engage in such manipulation get the credit they deserve.
I look forward to seeing both Community Notes, and hopefully many more algorithms of a similar spirit, develop and grow over the next decade.

