Associate Professor, ETH Zurich

Law, Economics, and Data Science Group (team) (email norms)

Topics: Law and Economics, Political Economy
Methods: Econometrics, Natural Language Processing, Machine Learning

Curriculum Vitae
Google Scholar Page
SDG Monitor

Welcome to my web page, where you will find information on my research and teaching as a professor at ETH Zurich. I am also a CEPR Research Affiliate (Political Economy) and an Associate Editor at the Economic Journal. I previously held research appointments at the University of Warwick (Assistant Professor) and Princeton University (Postdoc). I earned a Ph.D. in economics and a J.D. from Columbia University, a B.A. (Plan II Honors) from the University of Texas at Austin, and an LL.M. in international criminal law from the University of Amsterdam.

Seminars and Conferences

Zurich Workshop in AI+Economics (Dec 2024, Sept 2023, Oct 2022)
Online Seminar in Economics + Data Science (Thursdays on Zoom)
Monash-Warwick-Zurich Text-as-Data Workshop (April 2024, Sept 2023, April 2023, Sept 2022, Feb 2022, Sept 2021, Feb 2021)

Teaching Materials

Text Data in Economics (2022) (GitHub)
Building a Robot Judge: Data Science for Decision-Making 
Natural Language Processing for Law and Social Science
Review paper: “Text algorithms in economics” (with Stephen Hansen), Annual Review of Economics (2023). Abstract

This paper provides an overview of the methods used for algorithmic text analysis in economics, with a focus on three key contributions. First, the paper introduces methods for representing documents as high-dimensional count vectors over vocabulary terms, for representing words as vectors, and for representing word sequences as embedding vectors. Second, the paper defines four core empirical tasks that encompass most text-as-data research in economics, and enumerates the various approaches that have been taken so far for these tasks. Finally, the paper flags limitations in the current literature, with a focus on the challenge of validating algorithmic output.
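
As a minimal sketch of the first representation mentioned above, documents as count vectors over vocabulary terms, the following uses only the Python standard library. The function names and the two toy sentences are illustrative assumptions; they are not taken from the paper or its companion repository.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?\"'()") for word in text.lower().split())
    return [t for t in tokens if t]


def count_vectors(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Toy corpus (hypothetical sentences, for illustration only).
docs = [
    "The court affirmed the ruling.",
    "The ruling raised wages.",
]
vocab, X = count_vectors(docs)
```

Each row of `X` is one document and each column one vocabulary term, so the matrix grows wide quickly on real corpora; that is the high-dimensional setting the review refers to.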

Companion GitHub Repo with Example Python Notebooks


Recent Working Papers

“Do Words Matter? The Value of Collective Bargaining Agreements” (with Benjamin Arold, W. Bentley MacLeod, and Suresh Naidu). Abstract

This paper proposes novel natural language methods to measure worker rights from collective bargaining agreements (CBAs) for use in empirical economic analysis. Applying unsupervised text-as-data algorithms to a new collection of 30,000 CBAs from Canada in the period 1986-2015, we parse legal obligations (e.g. “the employer shall provide…”) and legal rights (e.g. “workers shall receive…”) from the contract text. We validate that contract clauses provide worker rights, which include both amenities and control over the work environment. Companies that provide more worker rights score highly on a survey indicating pro-worker management practices. Using time-varying province-level variation in labor income tax rates, we find that higher taxes increase the share of worker-rights clauses while reducing pre-tax wages in unionized firms, consistent with a substitution effect away from taxed compensation (wages) toward untaxed amenities (worker rights). Further, an exogenous increase in the value of outside options (from a Bartik instrument for labor demand) increases the share of worker rights clauses in CBAs. Combining the regression estimates, we infer that a one-standard-deviation increase in worker rights is valued at about 5.4% of wages.


Selected Publications — Economics

“More Laws, More Growth? Evidence from U.S. States” (with Massimo Morelli and Matia Vannoni), Journal of Political Economy (forthcoming). Abstract

This paper analyzes the conditions under which more legislation contributes to economic growth. In the context of U.S. states, we apply natural language processing tools to measure legislative flows for the years 1965-2012. We implement a novel shift-share design for text data, where the instrument for legislation is leave-one-out legal-topic flows interacted with pre-treatment legal-topic shares. We find that at the margin, higher legislative output causes more economic growth. Consistent with more complete laws reducing ex-post hold-up, we find that the effect is driven by the use of contingent clauses, is largest in sectors with high relationship-specific investments, and is increasing with local economic uncertainty.
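
The shift-share construction described above can be sketched for a single period as follows. The functional form here, pre-treatment topic shares multiplied by the mean flow of each topic in all other states, is an assumption made for illustration and may differ from the paper's exact specification; the function name and the toy numbers are hypothetical.

```python
def shift_share_instrument(flows, base_shares):
    """
    flows[state][topic]: legislative flow on a topic in one period.
    base_shares[state][topic]: pre-treatment share of that topic in the state.

    Returns instrument[state] = sum over topics of
        base_shares[state][topic] * (mean flow of the topic in all OTHER states).
    """
    states = sorted(flows)
    if len(states) < 2:
        raise ValueError("leave-one-out flows need at least two states")
    topics = sorted({k for s in states for k in flows[s]})
    instrument = {}
    for s in states:
        others = [o for o in states if o != s]
        total = 0.0
        for k in topics:
            # Leave-one-out shift: exclude the state's own legislation.
            loo_flow = sum(flows[o].get(k, 0.0) for o in others) / len(others)
            total += base_shares[s].get(k, 0.0) * loo_flow
        instrument[s] = total
    return instrument


# Toy inputs (synthetic, not data from the paper).
flows = {
    "A": {"tax": 10.0, "health": 2.0},
    "B": {"tax": 4.0, "health": 6.0},
    "C": {"tax": 6.0, "health": 4.0},
}
base_shares = {
    "A": {"tax": 0.5, "health": 0.5},
    "B": {"tax": 0.25, "health": 0.75},
    "C": {"tax": 1.0, "health": 0.0},
}
z = shift_share_instrument(flows, base_shares)
```

Excluding a state's own flow is what keeps its instrument from mechanically reflecting its own legislative activity.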

Press: VoxEU.


“A Machine Learning Approach to Analyze and Support Anti-Corruption Policy” (with Sergio Galletta and Tommaso Giommoni), American Economic Journal: Economic Policy (conditionally accepted). Abstract

Can machine learning support better governance? In the context of Brazilian municipalities, 2001-2012, we have access to detailed accounts of local budgets and audit data on the associated fiscal corruption. Using the budget variables as predictors, we train a tree-based gradient-boosted classifier to predict the presence of corruption in held-out test data. The trained model, when applied to new data, provides a prediction-based measure of corruption which can be used for new empirical analysis or to support policy responses. We validate the empirical usefulness of this measure by replicating, and extending, some previous empirical evidence on corruption issues in Brazil. We then explore how the predictions can be used to support policies toward corruption. Our policy simulations show that, relative to the status quo policy of random audits, a targeted policy guided by the machine predictions could detect more than twice as many corrupt municipalities for the same audit rate.
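
The comparison between random and prediction-guided audits can be illustrated with a small sketch. It assumes a trained classifier has already produced a predicted corruption probability for each municipality; the scores and labels below are synthetic placeholders, and the function names are hypothetical, so the output says nothing about the paper's actual results.

```python
def detected_by_targeting(scores, corrupt, audit_rate):
    """Audit the highest-scored municipalities; count the corrupt ones found."""
    n_audits = round(audit_rate * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(corrupt[i] for i in ranked[:n_audits])


def expected_detected_by_random(corrupt, audit_rate):
    """Expected number of corrupt municipalities found by uniform random audits."""
    n_audits = round(audit_rate * len(corrupt))
    return n_audits * sum(corrupt) / len(corrupt)


# Synthetic example: 10 municipalities, 3 of them corrupt (1 = corrupt).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05, 0.6, 0.5]
corrupt = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

targeted = detected_by_targeting(scores, corrupt, audit_rate=0.3)
random_baseline = expected_detected_by_random(corrupt, audit_rate=0.3)
```

Holding the audit rate fixed in both functions is the point of the exercise: any gain comes from which municipalities are audited, not from auditing more of them.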


“Mandatory Retirement Reforms for Judges Improved Performance on U.S. State Supreme Courts” (with W. Bentley MacLeod), American Economic Journal: Economic Policy (2024). Abstract

Anecdotal evidence often points to aging as a cause for reduced work performance. This paper provides empirical evidence on this issue in a context where performance is measurable and there is variation in mandatory retirement policies: U.S. state supreme courts. We find that introducing mandatory retirement reduces the average age of working judges and improves court performance, as measured by output (number of published opinions) and legal impact (number of forward citations to those opinions). Consistent with aging effects as a contributing factor, we find that older judges do about the same amount of work as younger judges, but that work is lower-quality as measured by citations. However, the effect of mandatory retirement on performance is much larger than what would be expected from the change in the age distribution, suggesting that the presence of older judges reduces the performance of younger judges.

Press: VoxEU, NPR On Point


“Gender Attitudes in the Judiciary: Evidence from U.S. Circuit Courts” (with Arianna Ornaghi and Daniel L. Chen), American Economic Journal: Applied Economics (2024). Abstract

Do gender attitudes influence interactions with female judges in U.S. Circuit Courts? In this paper, we propose a judge-specific measure of gender attitudes based on use of gender-stereotyped language in the judge’s authored opinions. Exploiting quasi-random assignment of judges to cases and conditioning on judges’ characteristics, we validate the measure showing that higher-slant judges vote more conservatively in gender-related cases. Higher-slant judges interact differently with female colleagues: they are more likely to reverse lower-court decisions if the lower-court judge is a woman than a man, are less likely to assign opinions to female judges, and cite fewer female-authored opinions.


“Conservative News Media and Criminal Justice: Evidence from Exposure to Fox News Channel” (with Michael Poyker), The Economic Journal (2024). Abstract

Local exposure to conservative news causes judges to impose harsher criminal sentences. Our evidence comes from an instrumental variables analysis, where randomness in television channel positioning across localities induces exogenous variation in exposure to Fox News Channel. These treatment data on news viewership are taken to outcomes data on almost 7 million criminal sentencing decisions in the United States for the years 2005-2017. Higher Fox News viewership increases incarceration length, and the effect is stronger for black defendants and for drug-related crimes. We can rule out changes in the behavior of police, prosecutors, or potential offenders as significant drivers. Consistent with changes in voter attitudes as the key mechanism, the effect on sentencing harshness is observed for elected (but not appointed) judges. Fox News viewership also increases self-reported beliefs about the importance of drug crime as a social problem.

Media Coverage: New Statesman, The Journalist’s Resource.


“How Cable News Reshaped Local Government” (with Sergio Galletta), American Economic Journal: Applied Economics (2023). Abstract

This paper shows that partisan cable news broadcasts have a causal effect on the size and composition of budgets in U.S. localities. Using exogenous channel positioning as an instrument for viewership, we show that exposure to the conservative Fox News Channel reduces both revenues and expenditures. Multiple mechanisms drive these results: Fox News improves election chances for local Republicans, alters politician campaign agendas, and directly shifts voter policy preferences on fiscal issues. Consistent with the priorities of small-government conservatism, we find evidence that the reduction in public services is compensated by increased private provision, in particular through higher student attendance at private schools. The “Fox News Effect” is not just limited to vote shares; it also moves policy to the right. 


“Emotion and Reason in Political Language” (with Gloria Gennaro), The Economic Journal (2022). Abstract

We use computational linguistics techniques to study the use of emotion and reason in political discourse. Our new measure of emotionality in language combines lists of emotive and cognitive words, as well as word embeddings, to construct a text-based scale between emotion and reason. After validating the method against human annotations, we apply it to scale 6 million speeches in the U.S. Congressional Record for the years 1858 through 2014. Intuitively, emotionality spikes during times of war and is highest for patriotism-related topics. In the time series, emotionality was relatively low and stable in the earlier years but increased significantly starting in the late 1970s. Comparing Members of Congress to their colleagues, we find that emotionality is higher for Democrats, for women, for ethnic/religious minorities, for members of the opposition party, and for those with relatively extreme policy preferences (either left-wing or right-wing) as measured by roll call votes.
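
One simple way to place a text on a scale between emotion and reason with word embeddings is sketched below. The scoring rule, cosine similarity to the centroid of emotive words minus cosine similarity to the centroid of cognitive words, is an assumption chosen for illustration and is not necessarily the formula used in the paper. The two-dimensional embeddings and word lists are toy placeholders.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def emotionality(tokens, embeddings, emotion_words, cognition_words):
    """Positive values lean toward emotion, negative values toward reason."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no token has an embedding")
    doc_vec = mean_vector(known)
    emotion_centroid = mean_vector([embeddings[w] for w in emotion_words])
    cognition_centroid = mean_vector([embeddings[w] for w in cognition_words])
    return cosine(doc_vec, emotion_centroid) - cosine(doc_vec, cognition_centroid)


# Toy 2-d embeddings (hypothetical values, not trained vectors).
embeddings = {
    "fear": [1.0, 0.0], "love": [0.9, 0.1],
    "analyze": [0.0, 1.0], "evidence": [0.1, 0.9],
    "war": [0.8, 0.2], "budget": [0.2, 0.8],
}
emotion_words = ["fear", "love"]
cognition_words = ["analyze", "evidence"]

score_war = emotionality(["war"], embeddings, emotion_words, cognition_words)
score_budget = emotionality(["budget"], embeddings, emotion_words, cognition_words)
```

With real embeddings, the same function would be applied to the tokens of each speech after the word lists have been validated against human annotations.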

Press: Sage Ocean Blog (2021).


Selected Publications — Political Science

“Better to be jeered than ignored? Gender and reactions during parliamentary debates” (with Johann Kruemmel and Jonathan B. Slapin), American Journal of Political Science (2024). Abstract

Are non-verbal reactions during parliamentary debate gendered? Do male and female Members of Parliament (MPs) experience applause or jeering differently? In short, yes, and the gendered nature of a speech matters. Using an original corpus of over 544,000 speeches given in German state parliaments, we first estimate the gendered nature of parliamentary speeches, then examine how reactions to speeches given by male and female MPs differ. Female and male MPs receive similarly positive and negative reactions to their speeches on average, but they receive different reactions depending on the gendered nature of the speeches. Speeches using language associated with women’s topics receive fewer reactions overall, and even fewer when delivered by men. The gendered nature of parliamentary interjections could affect how women MPs view their position and how women voters view parliament.


“Relatio: Text semantics capture political and economic narratives” (with Germain Gauthier and Philine Widmer), Political Analysis (2023). Abstract

Social scientists have become increasingly interested in how narratives — the stories in fiction, politics, and life — shape beliefs, behavior, and government policies. This paper provides an unsupervised method to quantify latent narrative structures in text documents. Our pipeline identifies coherent entity groups and maps explicit relations between them in the text. We provide an application to the United States Congressional Record to analyze political and economic narratives in recent decades. Our analysis highlights the dynamics, sentiment, polarization, and interconnectedness of narratives in political discourse.

relatio open-source narrative mining package.


“The Effect of Fox News on Health Behavior during COVID-19” (with Sergio Galletta, Dominik Hangartner, Yotam Margalit, and Matteo Pinna), Political Analysis (2023). Abstract

In the early weeks of the 2020 coronavirus (COVID-19) pandemic, Fox News Channel advanced a skeptical narrative that downplayed the risks posed by the virus. We find that this narrative had significant consequences: in localities with higher Fox News viewership (exogenous due to random variation in channel positioning), people were less likely to adopt behaviors geared toward social distancing (e.g., staying at home) and consumed fewer goods in preparation (e.g., cleaning products, hand sanitizers, masks). Using original survey data, we find that the effect of Fox News came not merely from its long-standing distrustful stance toward science, but also from program-specific content that minimized the COVID-19 threat.

Press: Hollywood Reporter.


“Cross-Domain Topic Classification for Political Texts” (with Massimo Morelli and Moritz Osnabruegge), Political Analysis (2021). Abstract

We introduce and assess cross-domain topic classification. In this approach, an algorithm learns to classify topics in a labeled source corpus and then extrapolates topics in an unlabeled target corpus from another domain. The advantage over within-domain supervised learning is significant efficiency gains because one can use existing training data. The advantage over unsupervised topic models is that our approach can be more specifically targeted to a research question and that the resulting topics are easier to validate and interpret. We demonstrate the method in the case of labeled party platforms (source corpus) and unlabeled parliamentary speeches (target corpus). Besides the standard within-domain error metrics, we further validate the cross-domain performance by labeling a subset of target-corpus documents. We find that the classifier assigns topics accurately in the parliamentary speeches, although accuracy varies substantially by topic. We also propose a tool for interpreting the topics and diagnosing cross-domain classification. To assess empirical validity, we present two case studies on how electoral rules and parliamentarian gender influence the choice of speech topics.


“Measuring Discretion and Delegation in Legislative Texts: Methods and Application to U.S. States” (with Massimo Morelli and Matia Vannoni), Political Analysis (2020). Abstract

Bureaucratic discretion and executive delegation are central topics in political economy and political science. The previous empirical literature has measured discretion and delegation by manually coding large bodies of legislation. Drawing from computational linguistics, we provide an automated procedure for measuring discretion and delegation in legal texts to facilitate large-scale empirical analysis. The method uses information in syntactic parse trees to identify legally relevant provisions, as well as agents and delegated actions. We undertake two applications. First, we produce a measure of bureaucratic discretion by looking at the level of legislative detail for U.S. states and find that this measure increases after reforms giving agencies more independence. This effect is consistent with an agency cost model where a more independent bureaucracy requires more specific instructions (less discretion) to avoid bureaucratic drift. Second, we construct measures of delegation to governors in state legislation. Consistent with previous estimates using non-text metrics, we find that executive delegation increases under unified government.

Press: Bocconi Knowledge (2021).


“Elections and divisiveness: Theory and evidence” (with Massimo Morelli and Richard Van Weelden), Journal of Politics (2017). Abstract

This paper provides a theoretical and empirical analysis of how politicians allocate their time across issues. When voters are uncertain about an incumbent’s preferences,  there is a pervasive incentive to “posture” by spending too much time on divisive issues (which are more informative about a politician’s preferences) at the expense of time spent on common-values issues (which provide greater benefit to voters).  Higher transparency over the politicians’ choices can exacerbate the distortions. These theoretical results motivate an empirical study of how Members of the U.S. Congress allocate time across issues in their floor speeches.  We find that U.S. Senators spend more time on divisive issues when they are up for election, consistent with electorally induced posturing. In addition, we find that U.S. House Members spend more time on divisive issues in response to higher news transparency.


Selected Publications — ML/NLP

“Where do People Tell Stories Online? Story Detection Across Online Communities” (with Maria Antoniak, Joel Mire, Andrew Piper, and Maarten Sap), ACL (2024). Abstract

People share stories online for a myriad of purposes, whether as a means of self-disclosure, processing difficult personal experiences, providing needed information or entertainment, or persuading others to share their beliefs. Better understanding of online storytelling can illuminate the dynamics of social movements, sensemaking practices, persuasion strategies, and more. However, unlike other media such as books and visual content where the narrative nature of the content is often overtly signaled at the document level, studying storytelling in online communities is challenging due to the mixture of storytelling and non-storytelling behavior, which can be interspersed within documents and across diverse topics and settings. We introduce a codebook and create the Storytelling in Online Communities Corpus, an expert-annotated dataset of 502 English-language posts and comments with labeled story and event spans. Using our corpus, we train and evaluate an online story detection model, which we use to investigate the role of storytelling in different social contexts. We identify distinctive features of online storytelling, the prevalence of storytelling among different communities, and the conversational patterns of storytelling.



“Revisiting Automated Topic Model Evaluation with Large Language Models” (with Dominik Stammbach, Vilém Zouhar, Alexander Hoyle, and Mrinmaya Sachan), EMNLP (2023). Abstract

Topic models are used to make sense of large text collections. However, automatically evaluating topic model output and determining the optimal number of topics have both been longstanding challenges, with no effective automated solutions to date. This paper proposes using large language models to evaluate such output. We find that large language models appropriately assess the resulting topics, correlating more strongly with human judgments than existing automated metrics. We then investigate whether we can use large language models to automatically determine the optimal number of topics. We automatically assign labels to documents, and choosing the configurations with the purest labels returns reasonable values for the optimal number of topics.
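
The idea of choosing the configuration whose topics have the purest labels can be sketched with the standard cluster-purity measure. This is an assumption about what "pure" means here and may not match the paper's exact metric. In the paper the document labels come from a large language model; in this sketch they are supplied by hand as toy inputs, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def purity(topic_assignments, labels):
    """Share of documents whose label matches the majority label of their topic."""
    by_topic = defaultdict(list)
    for topic, label in zip(topic_assignments, labels):
        by_topic[topic].append(label)
    majority_total = sum(
        Counter(group).most_common(1)[0][1] for group in by_topic.values()
    )
    return majority_total / len(labels)


# Toy example: six documents, two candidate topic configurations.
labels = ["tax", "tax", "health", "health", "health", "health"]
config_a = [0, 0, 0, 1, 1, 1]  # one "health" document lands in the "tax" topic
config_b = [0, 0, 1, 1, 1, 1]  # topics line up with the labels exactly

purity_a = purity(config_a, labels)
purity_b = purity(config_b, labels)
best = max([("a", purity_a), ("b", purity_b)], key=lambda pair: pair[1])[0]
```

Purity rises mechanically as the number of topics grows, so in practice a comparison like this would need to be read alongside the number of topics in each configuration.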

GitHub Repository



“Human-Guided Fair Classification for Natural Language Processing” (with Florian Dorner, Momchil Peychev, Nikola Konstantinov, Naman Goel, and Martin Vechev), ICLR (2023). Abstract

Text classifiers have promising applications in high-stakes tasks such as resume screening and content moderation. These classifiers must be fair and avoid discriminatory decisions by being invariant to perturbations of sensitive attributes such as gender or ethnicity. However, there is a gap between human intuition about these perturbations and the formal similarity specifications capturing them. While existing research has started to address this gap, current methods are based on hardcoded word replacements, resulting in specifications with limited expressivity or ones that fail to fully align with human intuition (e.g., in cases of asymmetric counterfactuals). This work proposes novel methods for bridging this gap by discovering expressive and intuitive individual fairness specifications. We show how to leverage unsupervised style transfer and GPT-3’s zero-shot capabilities to automatically generate expressive candidate pairs of semantically similar sentences that differ along sensitive attributes. We then validate the generated pairs via an extensive crowdsourcing study, which confirms that a lot of these pairs align with human intuition about fairness in the context of toxicity classification. Finally, we show how limited amounts of human feedback can be leveraged to learn a similarity specification that can be used to train downstream fairness-aware models.



“MemSum: Extractive Summarization of Long Documents using Multi-step Episodic Markov Decision Processes” (with Nianlong Gu and Richard Hahnloser), ACL (2022). Abstract

We introduce MemSum (Multi-step Episodic Markov decision process extractive SUMmarizer), a reinforcement-learning-based extractive summarizer enriched at any given time step with information on the current extraction history. Similar to previous models in this vein, MemSum iteratively selects sentences into the summary. Our innovation is in considering a broader information set when summarizing that would intuitively also be used by humans in this task: 1) the text content of the sentence, 2) the global text context of the rest of the document, and 3) the extraction history consisting of the set of sentences that have already been extracted. With a lightweight architecture, MemSum nonetheless obtains state-of-the-art test-set performance (ROUGE score) on long document datasets (PubMed, arXiv, and GovReport). Supporting analysis demonstrates that the added awareness of extraction history gives MemSum robustness against redundancy in the source document.

MemSum Open-Source Package and Documentation.


Selected Publications — Law

“Translating Legalese: Enhancing Public Understanding of Court Opinions with Legal Summarizers” (with Aniket Kesari, Suresh Naidu, Lena Song, and Dominik Stammbach), ACM Symposium on Computer Science and Law (2024). Abstract

Judicial opinions are written to be persuasive and could build public trust in court decisions, yet they can be difficult for non-experts to understand. We present a pipeline for using an AI assistant to generate simplified summaries of judicial opinions. Compared to existing expert-written summaries, these AI-generated simple summaries are more accessible to the public and more easily understood by non-experts. We show in a survey experiment that the AI summaries help respondents understand the key features of a ruling, and have higher perceived quality, especially for respondents with less formal education.


“If We Build It, Will They Legislate? Empirically Testing the Potential of the Nondelegation Doctrine to Curb Congressional ‘Abdication’” (with Daniel Walters), Cornell Law Review (2023). Abstract

A widely held view for why the Supreme Court would be right to revive the nondelegation doctrine is that Congress has perverse incentives to abdicate its legislative role and evade accountability through the use of delegations, either expressly delineated or implied through statutory imprecision, and that enforcement of the nondelegation doctrine would correct for those incentives. We call this the Field of Dreams Theory—if we build the nondelegation doctrine, Congress will legislate. Unlike originalist arguments for the revival of the nondelegation doctrine, this theory has widespread appeal and is instrumental to the Court’s project of gaining popular acceptance of a greater judicial role in policing congressional decisions regarding delegation. But is it true?

In this article, we comprehensively test the theory at the state level, using two original datasets: one comprising all laws passed by state legislatures and the other comprising all nondelegation decisions in the state Supreme Courts. Using a variety of measures and methods, and in contrast with the one existing study on the subject, we do observe at least some statistically measurable decrease in delegation, if only by certain measures. However, when put in context, these findings are underwhelming compared to the predictions of the Field of Dreams Theory. For instance, we observe that, even where it exists, this effect is substantively small and on par with a number of other factors that influence delegation—our best estimate is that nondelegation cases explain about 1.5 percent of the variation in delegation. Moreover, we also find some evidence that is directly contrary to the Field of Dreams Theory—that is, we find evidence that enforcement of the nondelegation doctrine actually leads to more implied delegation in the form of vague and precatory statutory language.

These findings have direct relevance to contemporary debate and cases entertaining a revitalization of the nondelegation doctrine in the federal courts. First, the findings that enforcement of the doctrine can prospectively decrease legislative delegation suggest that there may be something to the Field of Dreams Theory, although that in turn raises the stakes of debates over whether less delegation would actually be good for public welfare. Second, even though there is an effect, the weakness of that effect, both in an absolute sense and relative to other factors, undermines the overblown claims that the nondelegation doctrine could fundamentally transform how government works. And finally, our finding that judicial decisions enforcing the nondelegation doctrine can sometimes lead to more implied delegation through imprecise statutory language suggests that there may be unintended consequences from giving the nondelegation doctrine a new lease on life.


“New Policing, New Segregation: From Ferguson to New York” (with Jeffrey A. Fagan), Georgetown Law Journal Online (2017). Abstract

Modern policing emphasizes advanced statistical metrics, new forms of organizational accountability, and aggressive tactical enforcement of minor crimes as the core of its institutional design. Recent policing research has shown how this policing regime has been woven into the social, political and legal systems in urban areas, but there has been little attention to these policing regimes in smaller areas. In these places, where relationships between citizens, courts and police are more intimate and granular, and local boundaries are closely spaced with considerable flow of persons through spaces, the “new policing” has reached deeply into the everyday lives of predominantly non-white citizens through multiple contacts that lead to an array of legal financial obligations including a wide array of fines and fees. Failure to pay these fees often leads to criminal liability. We examine two faces of modern policing, comparing Ferguson, Missouri, and New York City. We analyze rich and detailed panel data from both places on police stops, citations, warrants, arrests, court dispositions, and penalties, to show the web of social control and legal burdens that these practices create. The data paint a detailed picture of racially discriminatory outcomes at all stages of the process that are common to these two very different social contexts. We link the evidence on the spatial concentration of the racial skew in these policing regimes to patterns of social and spatial segregation, and in turn, to the social, economic and health implications for mobility. We conclude with a discussion of the implications of the “new policing” for constitutional regulation and political reform.


Revisions Requested

“Ideas Have Consequences: The Impact of Law and Economics on American Justice” (with Daniel L. Chen and Suresh Naidu), R&R at Quarterly Journal of Economics. Abstract

This paper provides a quantitative analysis of the effects of the early law-and-economics movement on the U.S. judiciary. We focus on the Manne Economics Institute for Federal Judges, an intensive economics course that trained almost half of federal judges between 1976 and 1999. Using the universe of published opinions in U.S. Circuit Courts and 1 million District Court criminal sentencing decisions, we estimate the differences-in-differences effect of Manne program attendance using judge fixed effects. Selection into attendance was limited – the program was popular across judges from all backgrounds, was regularly oversubscribed, and admitted judges on a first-come first-served basis – and we further adjust for machine-learning-selected covariates predicting the timing of attendance. We find that after attending economics training, participating judges use more economics language in their opinions, issue more conservative decisions in economics-related cases, rule against regulatory agencies more often, favor more lax enforcement in antitrust cases, and impose more/longer criminal sentences. The law-and-economics movement had policy consequences via its influence on U.S. federal judges.

NPR Planet Money Podcast, Economist article


“In-group bias in the Indian judiciary: Evidence from 5 million criminal cases” (with Sam Asher, Aditi Bhowmick, Sandeep Bhupatiraju, Daniel L. Chen, Tanaya Devi, Christoph Goessmann, Paul Novosad, and Bilal Siddiqi), R&R at Review of Economics and Statistics. Abstract

We study judicial in-group bias in Indian criminal courts using a newly collected dataset on over 5 million criminal case records from 2010-2018. After detecting gender and religious identity using a neural-net classifier applied to judge and defendant names, we exploit quasi-random assignment of cases to judges to examine whether defendant outcomes are affected by assignment to a judge with a similar identity. In the aggregate, we estimate tight zero effects of in-group bias based on shared gender, religion, and last name (a proxy for caste). We do find limited in-group bias in some (but not all) settings where identity is salient — in particular, we find a small religious in-group bias during Ramadan, and we find shared-name in-group bias when judge and defendant match on a rare last name.

Data and source code.

Gender classifier web app.


“Media Slant is Contagious” (with Philine Widmer and Sergio Galletta), R&R at Economic Journal. Abstract

This paper examines the diffusion of media slant, specifically how partisan content from national cable news affects local newspapers in the U.S., 2005-2008. We use a text-based measure of cable news slant trained on content from Fox News Channel (FNC), CNN, and MSNBC to analyze how local newspapers adopt FNC’s slant over CNN/MSNBC’s. Our findings show that local news becomes more similar to FNC content in response to an exogenous increase in local FNC viewership. This shift is not limited to borrowing from cable news, but rather, local newspapers’ own content changes. Further, cable TV slant polarizes local news content.


“Visual representation and stereotypes in news media” (with Ruben Durante, Maria Grebenshchikova, and Carlo Schwarz), Reject & Resubmit at Economic Journal. Abstract

We propose a new method for measuring gender and ethnic stereotypes in news reports. By combining computer vision and natural language processing tools, the method allows us to analyze both images and text and, crucially, the interaction between the two. We apply this approach to over 2 million web articles published in the New York Times and Fox News between 2000 and 2020. We find that in both outlets, men and whites are generally over-represented relative to their population share, while women and Hispanics are under-represented. News content perpetuates common stereotypes such as associating women with narratives about caring roles and family; Blacks and Hispanics with low-skill jobs, crime, and poverty; and Asians with high-skill jobs and science. There are some significant differences across outlets, with Fox’s content displaying a stronger association of Hispanics with immigration than the New York Times. Finally, we find that group representation in the news is influenced by the gender and ethnic identity of authors and editors. This suggests that it results, at least in part, from the choices of news makers, and could change in response to increased diversity in newsroom staff.


“The Effect of Fox News Channel on U.S. Elections: 2000-2020” (with Sergio Galletta, Matteo Pinna, and Christopher Warshaw), R&R at Journal of Public Economics Abstract

This paper provides a comprehensive assessment of the effect of Fox News Channel (FNC) on elections in the United States. FNC is the highest-rated channel on cable television and has a documented conservative slant. We show that FNC has helped Republican candidates in elections across levels of U.S. government over the past decade. A one-standard-deviation decrease in FNC’s channel position boosted Republican vote shares by at least 0.5 percentage points in recent presidential, Senate, House, and gubernatorial elections. The effects of FNC increased steadily between 2004 and 2016 and then plateaued. Survey-based evidence suggests that FNC affects elections by shifting the political preferences of Americans to the right. Overall, the findings suggest that FNC has contributed to the nationalization of United States elections.


More Working Papers

“Race-related Research in Economics: Volume, Content, and Publication Incentives” (with Arun Advani, Anton Boltachka, David Cai, and Imran Rasul) Abstract

Issues of racial justice and economic inequalities across racial and ethnic groups have risen to the top of public debate. Economists’ ability to contribute to these debates is based on the body of race-related research. We study the volume and content of race-related research in economics and examine the implicit incentives to produce such work. We do so for a corpus of 225,000 economics publications from 1960 to 2020 to which we apply an algorithmic approach to classify race-related work, and construct paths to publication for 22,000 NBER and 10,000 CEPR working papers posted over the last few decades. We present three new facts. First, since 1960 less than 2% of economics publications have been race-related, with such work being balkanized into a few fields and largely absent from many others. There is an uptick in such work in the mid 1990s. Among the top-5 journals this is driven by the AER, QJE and the JPE. Econometrica and the REStud have each cumulatively published fewer than 15 race-related articles since 1960. Second, on content, while over 50% of race-related publications in the 1970s focused on Black individuals, by the 2010s this had fallen to 20%. There has been a steady decline in the share of race-related research on discrimination since the 1980s, with a rise in the share of studies on identity. Finally, irrespective of field, race-related working papers do not have worse publication outcomes compared to non race-related working papers, in terms of publication likelihood, quality of publication, publication lags and citations. Hence conditional on working papers being produced, the publications process provides little disincentive to work on race-related issues. We discuss policy implications stemming from our findings on economists’ ability to contribute to debates on race and ethnicity in the economy.


“Seeing and Hearing is Believing: The Role of Audiovisual Communication in Shaping Inflation Expectations” (with Heiner Mikosch, Alexis Perakis, and Samad Sarferaz) Abstract

This paper presents novel causal evidence on the relationship between various communication channels employed by central banks and households’ expectations about future inflation. In a pre-registered randomized survey experiment administered in 2022, we examine adjustment of inflation expectations when confronted with a press conference statement by the president of the European Central Bank (ECB) articulating the bank’s commitment to a 2% inflation target. First, we replicate previous literature showing that respondents update toward the inflation target. Second, we show that the medium of communication matters, holding the target message constant: Relative to a text transcript, audiovisual mediums (audio, photograph, or video) strengthen updating toward the target. In particular, dynamic mediums that communicate the target through multiple sensory channels (audio and video) are most effective in moving inflation expectations. In an analysis of mechanisms, we can rule out increased attentiveness to the survey and increased trust in the ECB as pivotal drivers of these effects. In a heterogeneity analysis, we find that economically less-informed respondents (those consuming less economic news) are more responsive in updating to audiovisual mediums. Overall, these results suggest that the use of audiovisual communications technology improves the information quality of central bank messaging and makes that messaging easier to process for less-informed households.


“Televised Debates and Emotional Appeals in Politics: Evidence from C-SPAN” (with Gloria Gennaro) Abstract

We study the effect of televised broadcasts of floor debates on the rhetoric and behavior of U.S. Congress Members. First, we show in a differences-in-differences analysis that the introduction of C-SPAN broadcasts in 1979 increased the use of emotional appeals in the House relative to the Senate, where televised floor debates were not introduced until later. Second, we use exogenous variation in C-SPAN channel positioning as an instrument for C-SPAN viewership by Congressional district and show that House Members from districts with exogenously higher C-SPAN viewership are more emotive in floor debates. Contra accountability models of transparency, C-SPAN has no effect on measures of legislative effort on behalf of constituents, and if anything it reduces a politician’s constituency orientation. We find that local news coverage — that is, mediated rather than direct transparency — has the opposite effect of C-SPAN, increasing legislative effort but with no effect on emotional rhetoric. Looking to electoral pressures as a mechanism, we find the emotionality effect of C-SPAN is strongest in competitive districts. Finally, C-SPAN exposure increases the vote share for incumbent Congress Members, and more so among those who speak more emotionally. These results highlight the importance of audience and mediation in the political impacts of higher transparency.



“Bootstrapping Science? The Impact of a ‘Return Human Capital’ Programme on Chinese Research Productivity” (with David Cai, Mirko Draca, and Shaoyu Liu) Abstract

We study the impact of a large-scale scientist recruitment program — China’s Junior Thousand Talents Plan (青年千人计划) — on the productivity of recruited scholars and their local peers in Chinese host universities. Using a comprehensive dataset of published scientific articles, we estimate effects on quantity and quality in a matched difference-in-differences framework. We observe neutral direct productivity effects for participants over a 6-year post-period: an initial drop is followed by a fully offsetting recovery. However, the program participants collaborate at higher rates with more junior China-based co-authors at their host institutions. Looking to peers in the hosting department, we observe positive and rising productivity impacts for peer scholars, equivalent to approximately 0.6 of a publication per peer scholar in the long run. Heterogeneity analysis and the absence of correlated resource effects point to the peer effect being rooted in a knowledge spillover mechanism.


“Economic Interests, Worldviews, and Identities: Theory and Evidence on Ideational Politics” (with Sharun Mukand and Dani Rodrik) Abstract

We distinguish between ideational and interest-based appeals to voters on the supply side of politics, integrating the Keynes-Hayek perspective on the importance of ideas with the Stigler-Becker approach emphasizing vested interests. In our model, political entrepreneurs discover identity and worldview “memes” (narratives, cues, frames) that invoke voters’ identity concerns or shift their views of how the world works. We identify a complementarity between worldview politics and identity politics and illustrate how they may reinforce each other. Furthermore, we show how adverse economic shocks (increasing inequality) lead to a greater incidence of ideational politics. We use these results to analyze data on 60,000 televised political ads in U.S. localities over the years 2000 through 2018. Our empirical work quantifies ideational politics and provides support for key model implications, including the impact of higher inequality on the supply of both identity and worldview politics.



“Fight or Flight on Fox? Partisan Fear Responses on U.S. Cable News Shows” (with Cantay Caliskan) Abstract

Empirical work on political communication has so far left out a potentially pivotal dimension – the unspoken emotional responses indicated by facial expressions. This paper shows how to measure these responses using deep-learning-based computer-vision algorithms in the context of U.S. cable news video. Using machine-generated metrics for expressed emotion, combined with mentions of politically divisive entities, we estimate the difference in emotion when cable-channel speakers hear mentions of entities that are from the opposing side of the political divide. We find that the most responsive emotion is fear: When Fox News personalities hear about Democratic entities, or when MSNBC personalities hear about Republican entities, the dominant expressed emotion in their faces is fear, which increases by at least 30% relative to mentions of neutral political entities or those from one’s own partisan team. We find some evidence that this partisan fear response is stronger on Fox News than on MSNBC.


“What Drives Partisan Tax Policy? The Effective Tax Code” Abstract

This paper contributes to recent work in political economy and public finance that focuses on how details of the tax code, rather than tax rates, are used to implement redistributive fiscal policies. I use tools from natural language processing to construct a high-dimensional representation of tax code changes from the text of 1.6 million statutes enacted by state legislatures for the years 1963 through 2010. A data-driven approach is taken to recover the effective tax code – the language in tax law that has the largest impact on revenues, holding major tax rates constant. I then show that the effective tax code drives partisan tax policy: relative to Republicans, Democrats use revenue-increasing language for income taxes but use revenue-decreasing language for sales taxes (consistent with a more redistributive fiscal policy) despite making no changes on average to statutory tax rates. These results are consistent with the view that due to their relative salience, changing tax rates is politically more difficult than changing the tax code.


Peer-Reviewed Journal Articles

“What is (and was) a Person? Evidence on Historical Mind Perceptions from Natural Language” (with Dominik Stammbach and Kevin Tobia), Cognition (2023). Abstract

An important philosophical tradition identifies persons as those entities that have minds, such that mind perception is a window into person perception. Psychological research has found that human perceptions of mind consist of at least two distinct dimensions: agency (e.g. planning, deciding) and experience (e.g. feeling, hungering). Taking this insight into the semantic space of natural language, we develop a generalizable, scalable computational-linguistics method for measuring variation in perceived agency and experience in large archives of plain-text documents. The resulting text-based rankings of entities along these dimensions correspond to human judgments of perceived agency and experience assessed in blind surveys. We then map both dimensions of mind in historical English-language corpora over the last 200 years and identify two salient trends. First, we find that while women are now described as having similar levels of agency as men, they are still described as more experience-oriented. Second, we find that domesticated animals have gained higher attributions of experience (but not agency) relative to wild animals, especially since the rise of the global animal rights movement in the 1980s.



“Detecting the Influence of Chinese Guiding Cases: A Text Reuse Approach” (with Benjamin Minhao Chen, Zhiyu Li, and David Cai), Artificial Intelligence and Law (2023). Abstract

Socialist courts are supposed to apply the law, not make it, and socialist legality denies judicial decisions any precedential status. In 2011, however, the Chinese Supreme People’s Court designated selected decisions as Guiding Cases to be referred to by all judges when adjudicating similar disputes. One decade on, the paucity of citations to Guiding Cases has been taken as demonstrating the incongruity of case-based adjudication and socialist legality.

Citations are, however, an imperfect measure of influence. Reproduction of language uniquely traceable to Guiding Cases can also be evidence of their impact on judicial decision-making. We employ a local alignment tool to detect unattributed text reuse of Guiding Cases in local court decisions. Our findings suggest that the Guiding Cases are more consequential than commonly assumed, thereby challenging prevailing narratives about the antagonism of socialist legality to case law.


“The Choice of Knowledge Base in Automated Claim Checking” (with Dominik Stammbach and Boya Zhang), Journal of Data and Information Quality (2023). Abstract

Automated claim checking is the task of determining the veracity of a claim given evidence found in a knowledge base of trustworthy facts. While previous work has taken the knowledge base as given and optimized the claim-checking pipeline, we take the opposite approach – taking the pipeline as given, we explore the choice of knowledge base. Our first insight is that a claim-checking pipeline can be transferred to a new domain of claims with access to a knowledge base from the new domain. Second, we do not find a “universally best” knowledge base – higher domain overlap of a task dataset and a knowledge base tends to produce better label accuracy. Third, combining multiple knowledge bases does not tend to improve performance beyond using the closest-domain knowledge base. Finally, we show that the claim-checking pipeline’s confidence score for selecting evidence can be used to assess whether a knowledge base will perform well for a new set of claims, even in the absence of ground-truth labels.


“Mindfulness reduces information avoidance” (with Daniel Sgroi, Anthony Tuckwell, and Shi Zhuo), Economics Letters (2023). Abstract

Mindfulness meditation has been found to influence various important outcomes such as health, stress, depression, productivity, and altruism. We report evidence from a randomised-controlled trial on a previously untested effect of mindfulness: information avoidance. We find that a relatively short mindfulness treatment (two weeks, 15 minutes a day) is able to induce a statistically significant reduction in information avoidance — that is, avoiding information that may cause worry or regret. Supplementary evidence supports mindfulness’s effects on emotion regulation as a possible mechanism for the effect.


“Polarization and Political Selection” (with Tinghua Yu), Journal of Political Institutions and Political Economy (2022). Abstract

Does political polarization among voters affect the quality of elected officials? We examine the question both theoretically and empirically. In our model, high quality candidates prefer to spend time on their current careers over electoral campaigning. In a polarized electorate, however, voters cast their votes mainly based on candidates’ party affiliations, reducing electoral campaign effort in equilibrium. Hence under higher polarization among voters, higher quality candidates are more likely to run for high office and to get elected. Our testable prediction is that electorates with higher polarization select candidates who perform better. We take the predictions to data on judges’ performance constructed from the opinions of all state supreme court judges working between 1965 and 1994. We find that judges who joined the court when polarization was high write higher-quality decisions (receiving more citations from other judges) than judges who joined when polarization was low.


“Measuring Judicial Sentiment: Methods and Application to U.S. Circuit Courts” (with Sergio Galletta and Daniel L. Chen), Economica (2021). Abstract

This paper provides a general method for analyzing the sentiments expressed in the language of judicial rulings. We apply natural language processing tools to the text of U.S. appellate court opinions to extrapolate judges’ sentiments (positive/good vs. negative/bad) toward a number of target social groups. We explore descriptively how these sentiments vary over time and across types of judges. In addition, we provide a method for using random assignment of judges in an instrumental variables framework to estimate causal effects of judges’ sentiments. In an empirical application, we show that more positive sentiment influences future judges by increasing the likelihood of reversal but also increasing the number of forward citations.


“Reducing Partisanship in Judicial Elections Can Improve Judge Quality: Evidence from U.S. State Supreme Courts” (with W. Bentley MacLeod), Journal of Public Economics (2021). Abstract

Should technocratic public officials be selected through politics or by merit? This paper explores how selection procedures influence the quality of selected officials in the context of U.S. state supreme courts for the years 1947-1994. In a unique set of natural experiments, state governments enacted a variety of reforms making judicial elections less partisan and establishing merit-based procedures that delegate selection to experts. We compare post-reform judges to pre-reform judges in their work quality, measured by forward citations to their opinions. In this setting we can hold constant contemporaneous incentives and the portfolio of cases, allowing us to produce causal estimates under an identification assumption of parallel trends in quality by judge starting year. We find that judges selected by nonpartisan processes (nonpartisan elections or technocratic merit commissions) produce higher-quality work than judges selected by partisan elections. These results are consistent with a representative voter model in which better technocrats are selected when the process has less partisan bias or better information regarding candidate ability.

Replication Notebook.


“Fiscal pressures and discriminatory policing: Evidence from traffic stops in Missouri” (with Allison Harris and Jeffrey Fagan), Journal of Race, Ethnicity, and Politics (2020). Abstract

This paper provides evidence of racial variation in traffic enforcement responses to local government budget stress using data from policing agencies in the state of Missouri from 2001 through 2012. Like previous studies, we find that local budget stress is associated with higher citation rates; we also find an increase in traffic-stop arrest rates. However, we find that these effects are concentrated among white (rather than black or Latino) drivers. The results are robust to the inclusion of a range of covariates and a variety of model specifications, including a regression-discontinuity examining bare budget shortfalls. Considering potential mechanisms, we find that targeting of white drivers is higher where the white-to-black income ratio is higher, consistent with the targeting of drivers who are better able to pay fines. Further, the relative effect on white drivers is higher in areas with statistical over-policing of black drivers: when black drivers are already getting too many fines, police cite white drivers from whom they are presumably more likely to be able to raise the needed extra revenue. These results highlight the relationship between policing-as-taxation and racial inequality in policing outcomes.


“Divided Government, Delegation, and Civil Service Reform” (with Massimo Morelli and Matia Vannoni), Political Science Research and Methods (2020). Abstract

This paper sheds new light on the drivers of civil service reform in U.S. states. We first demonstrate theoretically that divided government is a key trigger of civil service reform, providing nuanced predictions for specific configurations of divided government. We then show empirical evidence for these predictions using data from the second half of the 20th century: states tended to introduce these reforms under divided government, and in particular when legislative chambers (rather than legislature and governor) were divided.


“A research-based ranking of public policy schools” (with Miguel Urquiola), Scientometrics (2020). Abstract

This paper presents rankings of U.S. public policy schools based on their research publication output. In 2016 we collected the names of about 5,000 faculty members at 44 such schools. We use bibliographic databases to gather measures of the quality and quantity of these individuals’ academic publications. These measures include the number of articles and books written, the quality of the journals the articles have appeared in, and the number of citations all have garnered. We aggregate these data to the school level to produce a set of rankings. The results differ significantly from existing rankings, and in addition display substantial across-field variation.


“Text classification of ideological direction in judicial opinions” (with Carina Hausladen and Marcel Schubert), International Review of Law and Economics (2020). Abstract

This paper draws on machine learning methods for text classification to predict the ideological direction of decisions from the associated text. Using a 5% hand-coded sample of cases from U.S. Circuit Courts, we explore and evaluate a variety of machine classifiers to predict “conservative decision” or “liberal decision” in held-out data. Our best classifier is highly predictive (F1=.65) and allows us to extrapolate ideological direction to the full sample. We then use these predictions to replicate and extend Landes and Posner’s (2009) analysis of how the party of the nominating president influences circuit judges’ votes.


“Automated Fact-Value Distinction in Court Opinions” (with Yu Cao and Daniel L. Chen), European Journal of Law and Economics (2020). Abstract

This paper studies the problem of automated classification of fact statements and value statements in written judicial decisions. We compare a range of methods and demonstrate that the linguistic features of sentences and paragraphs can be used to successfully classify them along this dimension. The Wordscores method by Laver et al. (2003) performs best in held-out data. In an application, we show that the value segments of opinions are more informative than fact segments about the ideological direction of U.S. Circuit Court opinions.


“Sequential decision-making with group identity” (with Jessica Van Parys), Journal of Economic Psychology (2018). Abstract

In sequential decision-making experiments, participants often conform to the decisions of others rather than reveal private information — resulting in less information produced and potentially lower payoffs for the group. This paper asks whether experimentally induced group identity affects players’ decisions to conform, even when payoffs are only a function of individual actions. As motivation for the experiment, we show that U.S. Supreme Court Justices in preliminary hearings are more likely to conform to their same-party predecessors when the share of predecessors from their party is high. Lab players, in turn, are more likely to conform to the decisions of in-group members when their share of in-group predecessors is high. We find that exposure to information from in-group members increases the probability of reverse information cascades (herding on the wrong choice), reducing average payoffs. Therefore, alternating decision-making across members of different groups may improve welfare in sequential decision-making contexts.


“Intrinsic motivation in public service: Theory and evidence from state supreme courts” (with W. Bentley MacLeod), Journal of Law and Economics (2015). Abstract

This paper provides a theoretical and empirical analysis of the intrinsic preferences of state appellate court judges. We construct a panel data set using published decisions from state supreme court cases merged with institutional and biographical information on all (1,636) state supreme court judges for the 50 states of the United States from 1947 to 1994. We estimate the effects of changes in judge employment conditions on a number of measures of judicial performance. The results are consistent with the hypothesis that judges are intrinsically motivated to provide high-quality decisions, and that at the margin they prefer quality over quantity. When judges face less time pressure, they write more well-researched opinions that are cited more often by later judges. When judges are up for election then performance falls, suggesting that election politics take time away from judging work – rather than providing an incentive for good performance. These effects are strongest when judges have more discretion to select their case portfolio, consistent with psychological theories that posit a negative effect of contingency on motivation.


“On the Behavioral Economics of Crime” (with Frans van Winden), Review of Law & Economics (2012). Abstract

This paper examines the implications of the brain sciences’ mechanistic model of human behavior for our understanding of crime. The standard rational-choice crime model is refined by a behavioral approach, which proposes a decision model comprising cognitive and emotional decision systems. According to the behavioral approach, a criminal is not irrational but rather ‘ecologically rational,’ outfitted with evolutionarily conserved decision modules adapted for survival in the human ancestral environment. Several important cognitive as well as emotional factors for criminal behavior are discussed and formalized, using tax evasion as a running example. The behavioral crime model leads to new perspectives on criminal policy-making.


Peer-Reviewed Conference Proceedings

“Legal extractive summarization of U.S. court opinions” (with Emmanuel Bauer, Dominik Stammbach, and Nianlong Gu), LIRAI (2023). Abstract

This paper tackles the task of legal extractive summarization using a dataset of 430K U.S. court opinions with key passages annotated. According to automated summary quality metrics, the reinforcement-learning-based MemSum model is best and even out-performs transformer-based models. In turn, expert human evaluation shows that MemSum summaries effectively capture the key points of lengthy court opinions. Motivated by these results, we open-source our models to the general public. This represents progress towards democratizing law and making U.S. court opinions more accessible to the general public.

Legal MemSum: Code and Trained Models



“Uncovering and Categorizing Social Biases in Text-to-SQL” (with Yan Liu, Yan Gao, Zhe Su, Xiaokang Chen, and Jian-Guang Lou), ACL (2023). Abstract

Large pre-trained language models are acknowledged to carry social biases towards different demographics, which can further amplify existing stereotypes in our society and cause even more harm. Text-to-SQL is an important task, models of which are mainly adopted by administrative industries, where unfair decisions may lead to catastrophic consequences. However, existing Text-to-SQL models are trained on clean, neutral datasets, such as Spider and WikiSQL. This, to some extent, covers up social bias in models under ideal conditions, which nevertheless may emerge in real application scenarios. In this work, we aim to uncover and categorize social biases in Text-to-SQL models. We summarize the categories of social biases that may occur in structured data for Text-to-SQL models. We build test benchmarks and reveal that models with similar task accuracy can contain social biases at very different rates. We show how to take advantage of our methodology to uncover and assess social biases in the downstream Text-to-SQL task. We will release our code and data.


“Data-Centric Factors in Algorithmic Fairness” (with Nianyun Li and Naman Goel), AAAI/ACM Conference on AI, Ethics, and Society (2022). Abstract

Notwithstanding the widely held view that data generation and data curation processes are prominent sources of bias in machine learning algorithms, there is little empirical research seeking to document and understand the specific data dimensions affecting algorithmic unfairness. Contra the previous work, which has focused on modeling using simple, small-scale benchmark datasets, we hold the model constant and methodically intervene on relevant dimensions of a much larger, more diverse dataset. For this purpose, we introduce a new dataset on recidivism in 1.5 million criminal cases from courts in the U.S. state of Wisconsin, 2000-2018. From this main dataset, we generate multiple auxiliary datasets to simulate different kinds of biases in the data. Focusing on algorithmic bias toward different race/ethnicity groups, we assess the relevance of training data size, base rate difference between groups, representation of groups in the training data, temporal aspects of data curation, including race/ethnicity or neighborhood characteristics as features, and training separate classifiers by race/ethnicity or crime type. We find that these factors often do influence fairness metrics holding the classifier specification constant, without having a corresponding effect on accuracy metrics. The methodology and the results in the paper provide a useful reference point for a data-centric approach to studying algorithmic fairness in recidivism prediction and beyond. 



“Heroes, Villains, and Victims, and GPT-3: Automated Extraction of Character Roles Without Training Data” (with Dominik Stammbach and Maria Antoniak), Workshop on Narrative Understanding (2022). Abstract

This paper shows how to use large-scale pre-trained language models to extract character roles from narrative texts without training data. Queried with a zero-shot question-answering prompt, GPT-3 can identify the hero, villain, and victim in diverse domains: newspaper articles, movie plot summaries, and political speeches.


“DocSCAN: Unsupervised Text Classification via Learning from Neighbors” (with Dominik Stammbach), KONVENS (2022). Abstract

We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN). For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels. Our learnable clustering approach uses pairs of neighboring data points as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without provided ground-truth labels. On five topic classification benchmarks, we improve on various unsupervised baselines by a large margin. In datasets with relatively few and balanced outcome classes, DocSCAN approaches the performance of supervised classification. The method fails for other types of classification, such as sentiment analysis, pointing to important conceptual and practical differences between classifying images and texts.


“Machine Extraction of Tax Laws from Legislative Texts” (with Malka Guillot and Luyang Han), NLLP (2021). Abstract

Using a corpus of compiled codes from U.S. states containing labeled tax law sections, we train text classifiers to automatically tag tax-law documents and, further, to identify the associated revenue source (e.g. income, property, or sales). After evaluating classifier performance in held-out test data, we apply them to an historical corpus of U.S. state legislation to extract the flow of relevant laws over the years 1910 through 2010. We document that the classifiers are effective in the historical corpus, for example by automatically detecting establishments of state personal income taxes. The trained models with replication code are published at


“Evaluating Document Representations for Content-based Legal Literature Recommendations” (with Malte Ostendorff, Terry Ruas, Bela Gipp, Julian Moreno-Schneider, and Georg Rehm), ICAIL (2021). Abstract

Recommender systems assist legal professionals in finding relevant literature for supporting their case. Despite its importance for the profession, legal applications do not reflect the latest advances in recommender systems and representation learning research. Simultaneously, legal recommender systems are typically evaluated in small-scale user studies without any publicly available benchmark datasets. Thus, these studies have limited reproducibility. To address the gap between research and practice, we explore a set of state-of-the-art document representation methods for the task of retrieving semantically related US case law. We evaluate text-based (e.g., fastText, Transformers), citation-based (e.g., DeepWalk, Poincaré), and hybrid methods. We compare in total 27 methods using two silver standards with annotations for 2,964 documents. The silver standards are newly created from Open Case Book and Wikisource and can be reused under an open license facilitating reproducibility. Our experiments show that document representations from averaged fastText word vectors (trained on legal corpora) yield the best results, closely followed by Poincaré citation embeddings. Combining fastText and Poincaré in a hybrid manner further improves the overall result. Besides the overall performance, we analyze the methods depending on document length, citation count, and the coverage of their recommendations. We make our source code, models, and datasets publicly available at this https URL.


“Legal language modeling with transformers” (with Lazar Peric, Stefan Mijic, and Dominik Stammbach), ASAIL (2020). Abstract

We explore the use of deep learning algorithms to generate text in a professional, technical domain: the judiciary. Building on previous work that has focused on non-legal texts, we train auto-regressive transformer models to read and write judicial opinions. We show that survey respondents with legal expertise cannot distinguish genuine opinions from fake opinions generated by our models. However, a transformer-based classifier can distinguish machine- from human-generated legal text with high accuracy. These findings suggest how transformer models can support legal practice.


“Unsupervised Extraction of Workplace Rights and Duties from Collective Bargaining Agreements” (with Jeff Jacobs, W. Bentley MacLeod, Suresh Naidu, and Dominik Stammbach), MLLD (2020). Abstract

This paper describes an unsupervised legal document parser which performs a decomposition of labor union contracts into discrete assignments of rights and duties among agents of interest. We use insights from deontic logic applied to modal categories and other linguistic patterns to generate topic-specific measures of relative legal authority. We illustrate the consistency and efficiency of the pipeline by applying it to a large corpus of 35K contracts and validating the resulting outputs.

Source code and replication package.


“e-FEVER: Explanations and Summaries for Automated Fact Checking” (with Dominik Stammbach), TTO (2020). Abstract

This paper demonstrates the capability of a large pre-trained language model (GPT-3) to automatically generate explanations for fact checks. Given a claim and the retrieved potential evidence, our system summarizes the evidence and how it supports the fact-check determination. The system does not require any additional parameter training; instead, we use GPT-3’s analogical “few-shot-learning” capability, where we provide a task description and some examples of solved tasks. We then ask the model to explain new fact checks. Besides providing an intuitive and compressed summary for downstream users, we show that the machine-generated explanations can themselves serve as evidence for automatically making true/false determinations. Along the way, we report new state-of-the-art fact-checking results for the FEVER dataset. Finally, we make the explanations corpus publicly accessible, providing the first large-scale resource for explainable automated fact checking.


“Entropy in Legal Language” (with Roland Friedrich and Mauro Luzzatto), NLLP (2020). Abstract

We introduce a novel method to measure word ambiguity, i.e. local entropy, based on a neural language model. We use the measure to investigate entropy in the written text of opinions published by the U.S. Supreme Court (SCOTUS) and the German Bundesgerichtshof (BGH), representative courts of the common-law and civil-law court systems respectively. We compare the local (word) entropy measure with a global (document) entropy measure constructed with a compression algorithm. Our method uses an auxiliary corpus of parallel English and German to adjust for persistent differences in entropy due to the languages. Our results suggest that the BGH’s texts are of lower entropy than the SCOTUS’s. Investigation of low- and high-entropy features suggests that the entropy differential is driven by more frequent use of technical language in the German court.


Other Publications (Not Peer-Reviewed)

“Race-Related Research in Economics and other Social Sciences” (with Arun Advani, David Cai, and Imran Rasul), Econometric Society Monograph Series (forthcoming). Abstract

How does economics compare to other social sciences in its study of issues related to race and ethnicity? We assess this using a corpus of 500,000 academic publications in economics, political science, and sociology. Using an algorithmic approach to classify race-related publications, we document that economics lags far behind the other disciplines in the volume and share of race-related research, despite having higher absolute volumes of research output. Since 1960, there have been 13,000 race-related publications in sociology, 4,000 in political science, and 3,000 in economics. Since around 1970, the share of economics publications that are race-related has hovered just below 2% (although the share is higher in top-5 journals); in political science, the share has been around 4% since the mid-1990s, while in sociology, it has been above 6% since the 1960s and risen to over 12% in the last decade. Finally, using survey data collected from the Social Science Prediction Platform, we find that economists tend to overestimate the amount of race-related research in all disciplines, but especially so in economics.

Press: Financial Times, Warwick CAGE, VoxEU 


“The Making of International Tax Law: Evidence from Tax Treaties Text” (with Omri Marian), Florida Tax Review (2020). Abstract

We offer the first attempt at empirically testing the level of transnational consensus on the legal language controlling international tax matters. We also investigate the institutional framework of such consensus-building. We build a dataset of 4,052 bilateral income tax treaties, as well as 16 model tax treaties published by the United Nations (UN), Organisation for Economic Co-operation and Development (OECD) and the United States. We use natural language processing to perform pair-wise comparison of all treaties in effect at any given year. We identify clear trends of convergence of legal language in bilateral tax treaties since the 1960s, particularly on the taxation of cross-border business income. To explore the institutional source of such consensus, we compare all treaties in effect at any given year to the model treaties in effect during that year. We also explore whether newly concluded treaties converge towards legal language in newly introduced models. We find the OECD Model Tax Convention (OECD Model) to have a significant influence. In the years following the adoption of a new OECD Model there is a clear trend of convergence in newly adopted bilateral tax treaties towards the language of the new OECD Model. We also find that model treaties published by the UN (UN Model) have little immediate observable effect, though UN treaty policies seem to have a delayed, yet lasting effect. We conclude that such findings support the argument that a trend towards international legal consensus on certain tax matters exists, and that the OECD is the institutional source of the consensus building process.


“Automated Classification of Modes of Moral Reasoning in Judicial Decisions” (with Nischal Mainali, Liam Meier, and Daniel L. Chen), in: Computational legal studies: The promise and challenge of data-driven research, Edward Elgar (2020).



“Case vectors: Spatial representations of the law using document embeddings” (with Daniel L. Chen), in: Law as Data, Santa Fe Institute Press (2019). Abstract

Recent work in natural language processing represents language objects (words and documents) as dense vectors that encode the relations between those objects. This paper explores the application of these methods to legal language, with the goal of understanding judicial reasoning and the relations between judges. In an application to federal appellate courts, we show that these vectors encode information that distinguishes courts, time, and legal topics. The vectors do not reveal spatial distinctions in terms of political party or law school attended, but they do highlight generational differences across judges. We conclude the paper by outlining a range of promising future applications of these methods.


“Judge, Jury, and EXEcute File: The brave new world of legal automation,” Social Market Foundation (2018). Abstract

This paper discusses the prospects for automating decisions in the legal system. I will discuss active research on decision prediction models for judges and prosecutors and how these algorithms might be used to detect and reduce bias in legal decision-making. I will also discuss the substantial risks for these algorithms to replicate existing biases in the system or create new ones. Along the way, I will discuss the role that incentives theory and econometrics can play in understanding and mitigating these risks.


Grad Students, Postdocs, and Mentees

Note: Artworks generated from paper titles by and DALL-E.