Monday, October 24, 2011

Benford's law: a revised analysis

After digging into the Benford's law results from my previous post a bit more, I discovered that a different effect is driving the time-series pattern than I first thought.  The reason is that Benford's law only applies to nonzero digits, while accounting data contain many zero values.  In my original results, I also failed to account for negative numbers.  Zeros and negatives added to the total number of observations but not to the counts for any digit from 1-9, leading to a mechanical "deviation" from Benford's law. Incidentally, the firstdigit package for Stata (the most common statistical software used by economists, and the one I use) suggested by one of the commenters also runs into this problem.
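To make the mechanical effect concrete, here is a minimal Python sketch (not the Stata code I actually ran). The function names and the sample values are made up for illustration, and it assumes every value is a finite number. The "naive" mode mimics the mistake described above: zeros and negatives stay in the denominator but never add to any digit count.

```python
import math
from collections import Counter

# Expected first-digit shares under Benford's law: P(d) = log10(1 + 1/d)
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Leading nonzero digit of x, ignoring sign; None when x is zero."""
    if x == 0:
        return None
    # Scientific notation puts the leading digit first, e.g. '1.205...e+02'
    return int(f"{abs(x):.15e}"[0])

def digit_shares(values, naive=False):
    """Share of observations with each first digit 1-9.

    naive=True reproduces the flawed approach: only positive values are
    counted, but the denominator is the total number of observations.
    """
    if naive:
        counted = [v for v in values if v > 0]
        denom = len(values)
    else:
        counted = [v for v in values if v != 0]
        denom = len(counted)
    if denom == 0:
        raise ValueError("no observations to count")
    counts = Counter(first_digit(v) for v in counted)
    return {d: counts.get(d, 0) / denom for d in range(1, 10)}

def mean_abs_deviation(shares):
    """Average absolute gap between observed and Benford shares."""
    return sum(abs(shares[d] - BENFORD[d]) for d in range(1, 10)) / 9

# Made-up sample: three zeros, two negatives, five positives
sample = [0, 0, 0, -120.5, 13, 1.9, 250, -31, 47, 1100]
naive = digit_shares(sample, naive=True)
correct = digit_shares(sample)
print(sum(naive.values()), sum(correct.values()))
```

In the naive mode the nine shares sum to less than one, so the more zeros and negatives there are, the further every share is pushed below its Benford value, regardless of how the nonzero digits are distributed.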

So what the data really show is that the fraction of accounting data composed of zeros has increased over time.  Most of the discrepancy is due to zeros, although negatives also increase over time.  As you can see below, plotting the fraction of zeros accounts for the pattern across time described in my last post.
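For readers who want to reproduce the zero-fraction series on their own data, the calculation is simple. The sketch below is in Python and assumes the data are available as (year, value) pairs; that layout and the example numbers are hypothetical and not taken from Compustat.

```python
from collections import defaultdict

def zero_fraction_by_year(records):
    """Fraction of observations exactly equal to zero, by year.

    records: iterable of (year, value) pairs (a hypothetical layout).
    """
    totals = defaultdict(int)
    zeros = defaultdict(int)
    for year, value in records:
        totals[year] += 1
        if value == 0:
            zeros[year] += 1
    return {year: zeros[year] / totals[year] for year in sorted(totals)}

# Made-up records for two years
records = [
    (1980, 5.0), (1980, 0.0), (1980, 12.0), (1980, 3.0),
    (2005, 0.0), (2005, 0.0), (2005, 7.5), (2005, 1.0),
]
fractions = zero_fraction_by_year(records)
print(fractions)
```

The same function can be applied to one variable at a time (total assets, total revenues) or to a single firm's records to check the within-firm pattern.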




When ignoring zeros and correctly accounting for the first digits of negative numbers, the data show no clear trend in the deviations from Benford's law from 1970 on.  There was a decrease in the deviation during the 1960s and 70s, but that's almost surely because there were many fewer data points early on.
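The small-sample explanation can be checked with a quick simulation. The Python sketch below is only an illustration under assumptions of my choosing: it draws values whose first digits follow Benford's law exactly and measures the mean absolute deviation from the expected shares, which is not necessarily the deviation measure behind the graphs. The sample sizes, trial count, and seed are arbitrary.

```python
import math
import random

# Expected first-digit shares for digits 1..9
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def simulated_deviation(n, rng):
    """Mean absolute deviation from Benford for n Benford-distributed draws."""
    counts = [0] * 9
    for _ in range(n):
        # 10**U with U uniform on [0, 1) has Benford-distributed first digits
        x = 10 ** rng.random()
        digit = min(int(x), 9)  # guard against rounding at the upper edge
        counts[digit - 1] += 1
    return sum(abs(c / n - p) for c, p in zip(counts, BENFORD)) / 9

def average_deviation(n, trials=20, seed=0):
    """Average the simulated deviation over several independent samples."""
    rng = random.Random(seed)
    return sum(simulated_deviation(n, rng) for _ in range(trials)) / trials

small = average_deviation(200)
large = average_deviation(20000)
print(small, large)
```

Even though every draw obeys Benford's law, the measured deviation is noticeably larger for the small samples, so a falling deviation in years when the number of observations is growing quickly does not by itself indicate any change in the underlying data.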



Why have zeros increased over time?

The question still remains as to why zeros have increased over time.  The fact that they present a problem already points to the artificiality of accounting data.  In a continuous distribution found in nature, the probability that a value was precisely equal to zero would be vanishingly small (and in many cases such as the lengths of rivers and populations of cities, it wouldn't be possible), so the discrepancy wouldn't have manifested at all. 

It's perfectly natural for a firm to have zeros in some of its accounting items in a given quarter, such as deferred taxes or intangible assets.  But zeros for items like total revenues and assets should be rare, and there's no reason to expect them to increase over time.  Yet the increase in zeros over time is present in nearly all of the variables.  The graphs below show the fraction of zero observations over time for total assets and total revenues.




Why is this occurring?  I don't know!  There could be a very simple explanation that I'm not thinking of.  The trend could be due to benign shifts in accounting conventions, increases in off-balance-sheet transactions, or changes in firm characteristics over time.  Or, as I posited originally, it's still possible that zeros indicate a growing gap between accounting numbers and real firm activity over time.  Solving this puzzle would likely require more in-depth analysis comparing corporate accounting statements over time and reconciling them with Compustat.

What I do know is that the increasing fraction of zeros in accounting data 1) is present in nearly all of the individual components of corporate balance sheets and cash flow statements, and 2) is present even within individual firms over time.  I also looked at whether deviations from Benford's law and the percentage of zero observations correlate with known cases of fraud, such as those in Accounting and Auditing Enforcement Releases by the SEC, and I haven't found any clear relationship.

Reflections on research bias

After taking a closer look at the data, I'm more cautious about interpreting Benford's law as an aggregate measure of fraud.  I think it could still be useful in forensic cases when combined with other detailed evidence about an individual firm.  But given the complexity and heterogeneity within a dataset like Compustat, uncovering a broader truth will require more than casual analysis. 

Yet the initial graphs were incredibly compelling, and I may have been over-enthusiastic about posting them right away and insufficiently cautious in interpreting them.  My brief investigation of Benford's law strikes me as a microcosm of the decline effect, a phenomenon that's been recently brought to light by Jonah Lehrer and Jonathan Schooler.  What they report is that major results in a variety of fields including medicine, biology, and social science tend to decline in magnitude over time, and many cannot be replicated at all.

How might intrinsic biases in our research methods contribute to the decline effect?  I can only speak for myself.  I was absolutely more keen to post my results because of the striking time-series pattern I found.   I chose the variables ex ante based on the most basic components of corporate accounting statements, and I posted the very first set of results that I found.  Thus, I can at least confidently say that the results were free of data snooping, another common source of bias in which a researcher tries many sets of results and presents only the ones that support a desired hypothesis.

The tendency to publish new results sooner may introduce scrutiny bias, where novel results receive less testing for robustness than ones based on established methods.  This can help explain the decline effect.  But the decline effect may be only one example of how the ideal of unbiased truth dies a death by a thousand cuts through the accumulation of faint prejudices in the research and publication process, just as the accretion of conscious and unconscious gender bias results in tens of millions of "missing" girls in developing countries.  While biology sets a clear benchmark for gender parity, we can only indirectly measure research bias by examining how the scientific consensus changes over time.

One way to fight against research bias?  Keep digging.  Dig until we reach the edges of failure.  Maybe journals won't publish the dirt (or maybe academia needs its equivalent of gossip mags), but we at least need to be ready to confront it ourselves.  For me at least, this online paper trail will serve as a reminder to stay honest.

Summary of the evidence
  • With a high degree of precision, Benford's law does hold for accounting variables overall when considering only digits 1-9 (for both positive and negative values) and dropping zeros.
  • Looking only at digits 1-9, there is no clear change in the deviation from Benford's law for all firms over time.
  • The percentages of zeros and negative values in the data have increased over time, for the majority of variables and both across and within firms.

8 comments:

Anonymous said...

Maybe it's b/c companies more readily go public before they've earned anything? Also, cash flow statement data wasn't mandatory until 1994, but I'm guessing that's not an issue b/c it would presumably cause more zeroes pre-1994 if the data weren't imputed from the other sheets. Kudos for following up on your analysis!

Aaron Brown said...

Benford's law can catch someone making up numbers, but only the crudest accounting fraud involves making up a number like revenue or assets. More commonly, someone changes a category, perhaps calling an expense a capital investment or booking a Q2 sale in Q1. Even if someone does invent a number, say claiming a non-existent sale, that will be added to lots of real numbers so the total tested by this analysis should still conform to the Benford pattern.

My first suspicion is that the change in zeros and negative numbers is due to changes in the population covered by the data, or possibly the way the data are recorded (if you switch from reporting in thousands of dollars to millions of dollars, you get more zeros). If it is a result of real changes, it’s probably the tendency to move things off the balance sheet, canceling out assets and liabilities, and revenues and expenses, so what are now reported as gross numbers look like net numbers from the 1970s.

The one deviation from random first digits that would make me suspect cooking is too many 1’s and not enough 9’s. If you have $9,837,112 of earnings, it could be tempting to report something over $10 million. This effect is observed in a lot of data (sometimes in reverse, as in lots of prices ending in $0.99 or $0.95).

James said...

Forensic accountants might say the most common accounting tricks are on the revenue line, but Aaron, the phenomenon you cite in your last paragraph has seen a slew of research, as you're likely aware... it's even seeing attempts at being debunked, which is a sure sign of popularity :)

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1430925

I was too lazy to download that, but I saw this snippet where they mention using a "Legacy" Compustat file that excludes small loss observations, so I, too, would wonder about ruling out something logistical in the data here... I have no idea, though, really :) Either way, interesting stuff.

http://dc105.4shared.com/doc/JOYoMtB3/preview.html

danieldwilliam said...

The increase in the number of zeros might be to do with an increasing complexity of financial reporting.

If the data you are using includes all the possible categories in which a number might appear in a financial report and these have been increasing over the years but the number of organisations who actually have a number to report in each new and obscure category remain very low you would see many more zeros proportionately.

Anonymous said...

My understanding is that the law only applies to the occurrence of the numbers 1-9 in the "first digit place". A zero would not in fact have any place in the data. Net Profit of $0123 would not make accounting sense. In fact it would most likely be interpreted as $123. Accounting for placement of any other numbers in the sequence of samples falls outside the claims of the law. I would think that since a negative number also has a "first digit" regardless of its negativity, the law would still apply.
I may be mistaken in my assumption that the law only applies to the occurrence of the numbers 1,2,3,4,5,6,7,8,9 in the "first digit" place, but I couldn't find a single reference suggesting the law applies to the placement of digits in second, third, fourth.... place.

Jeffrey Froehle said...

Just a thought - could the increase in zeros be a result of greater use of computer accounting software wherein zero balances require no maintenance and are either not eliminated or corrected? Further, there is increasing tendency to truncate accounting data for the sake of space and presentation. Great topic and really find other comments valuable. Don't give up on the project.
