Benford's Law & Difficulty of Faking Data | SimoleonSense

## Benford's Law & Difficulty of Faking Data

So what is Benford’s law? (via Wikipedia)

Benford’s law, also called the first-digit law, states that in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 about 30% of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than 5% of the time. This distribution of first digits is the same as the widths of gridlines on the logarithmic scale.
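The percentages quoted above follow from the Benford probability for leading digit d, which is log10(1 + 1/d). A minimal sketch tabulating the full distribution (the function name is ours):

```python
import math

def benford_probability(d: int) -> float:
    """Probability that d (1-9) is the leading digit under Benford's law."""
    return math.log10(1 + 1 / d)

# Tabulate the distribution: 1 leads about 30.1% of the time,
# 9 only about 4.6% of the time.
for d in range(1, 10):
    print(f"{d}: {benford_probability(d):.1%}")
```

Note that the nine probabilities sum to 1, since log10(2/1) + log10(3/2) + … + log10(10/9) telescopes to log10(10).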

This counter-intuitive result has been found to apply to a wide variety of data sets, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature). It tends to be most accurate when values are distributed across multiple orders of magnitude.

How can we apply Benford’s law to fraud detection? (via Wikipedia)

In 1972, Hal Varian suggested that the law could be used to detect possible fraud in lists of socio-economic data submitted in support of public planning decisions. Based on the plausible assumption that people who make up figures tend to distribute their digits fairly uniformly, a simple comparison of first-digit frequency distribution from the data with the expected distribution according to Benford’s law ought to show up any anomalous results. Following this idea, Mark Nigrini showed that Benford’s law could be used as an indicator of accounting and expenses fraud. In the United States, evidence based on Benford’s law is legally admissible in criminal cases at the federal, state, and local levels.
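Varian's idea can be sketched in a few lines: tally the leading digits of the submitted figures and compare the observed counts with the counts Benford's law predicts. This is a minimal illustration, not a description of Nigrini's actual methodology; the function names and the choice of a chi-square statistic are ours.

```python
import math
from collections import Counter

def first_digit(x: float) -> int:
    """Leading nonzero digit of a number (sketch; assumes plain decimal repr)."""
    s = str(abs(x)).lstrip("0.")
    return int(s[0])

def benford_chi_square(values) -> float:
    """Chi-square statistic comparing observed first-digit counts with the
    counts expected under Benford's law.  As a rule of thumb, values above
    15.51 (the 5% critical value for 8 degrees of freedom) suggest the data
    deviate from the Benford pattern and may deserve a closer look."""
    counts = Counter(first_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat
```

Fabricated figures with roughly uniform leading digits produce a large statistic, while genuine multi-order-of-magnitude data tend to score low. A high score is only an indicator, of course, not proof of fraud.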

Benford’s law has been invoked as evidence of fraud in the 2009 Iranian elections.

In June 2010, consultants working for political website Daily Kos used Benford’s law, among other tools, to find serious flaws in the data collected by polling company Research 2000 (R2K). This led to the termination of R2K’s contract with Daily Kos, and possible litigation.

Interesting Abstract (via Theodore P. Hill)

Most people have preconceived notions of randomness that often differ substantially from true randomness. A classroom favorite is the counter-intuitive fact that in a randomly selected group of 23 people, the probability is greater than 50% that at least two share the same birthday. A more serious example, concerning “false positives” in medical testing, is this: suppose that a person is selected at random from a large population of which 1% are drug users, and that a drug test is administered which is 98% reliable (i.e., drug users test positive with probability .98 and nonusers test negative with probability .98). The somewhat surprising fact is that, if the test result is positive, the person tested is nevertheless more than twice as likely to be a nonuser as a user. Similar surprises concerning unexpected properties of truly random datasets make it difficult to fabricate numerical data successfully.
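Both classroom surprises can be checked with a few lines of arithmetic (a sketch; the variable names are ours). The birthday result multiplies out the probability that all 23 birthdays are distinct; the drug-test result is a direct application of Bayes' theorem.

```python
# Birthday problem: probability that at least two of 23 people
# share a birthday (ignoring leap years).
p_all_distinct = 1.0
for i in range(23):
    p_all_distinct *= (365 - i) / 365
p_shared = 1 - p_all_distinct  # ≈ 0.507, just over 50%

# Drug test: Bayes' theorem with a 1% base rate and 98% reliability.
p_user = 0.01
sensitivity = 0.98   # P(positive | user)
specificity = 0.98   # P(negative | nonuser)

p_positive = p_user * sensitivity + (1 - p_user) * (1 - specificity)
p_user_given_positive = p_user * sensitivity / p_positive  # ≈ 0.331

# A positive result still leaves the person about twice as likely
# to be a nonuser (≈ 0.669) as a user (≈ 0.331).
odds_nonuser = (1 - p_user_given_positive) / p_user_given_positive
```

The low 1% base rate is what drives the result: the 2% of false positives among the 99% of nonusers outnumber the true positives among the 1% of users.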