---
-## What is Benford’s Law? (copied from the ISACA journal)
+## What is Benford’s Law? (adapted from the ISACA journal [1])
Benford’s Law, named for physicist Frank Benford, who worked on the theory in 1938, is the mathematical theory of leading digits. Specifically, in many naturally occurring data sets, the leading digits are distributed in a specific, nonuniform way. While one might think that the number 1 would appear as the first digit about 11 percent of the time (i.e., one of nine possible digits), it actually appears about 30 percent of the time (see Figure 1). The number 9, on the other hand, is the first digit less than 5 percent of the time. The theory covers the first digit, second digit, first two digits, last digit and other combinations of digits because it is based on the logarithm of the probability of occurrence of digits.
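As a worked check of the percentages quoted above, Benford’s Law is commonly stated as the following formula for the probability that the leading digit equals $$d$$. The symbols $$P$$ and $$d$$ are notation introduced here for illustration and do not come from the journal text.

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \qquad d = 1, 2, \ldots, 9$$

For example, $$P(1) = \log_{10} 2 \approx 0.301$$ and $$P(9) = \log_{10}(10/9) \approx 0.046$$, which agrees with the figures of about 30 percent and less than 5 percent.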
[lines 52-99 unchanged, not shown]
---
+### Frequency of occurrence
+
The **frequency of occurrence** is defined as the number of times that a digit appears divided by the total number of data points. For example, the frequency of leading digit `1` in the example would be computed as $$9 / 20 = 0.45$$. **Histograms** are the preferred visualization of frequency distributions in a data set. In essence, a histogram is a bar chart where the $$y$$-axis is the frequency and a vertical bar is drawn for each of the counted classifications (in our case, for each digit).
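The definition above can be sketched in C++ as follows. This is a minimal illustration, not the code provided with the lab: the function names `leadingDigit` and `leadingDigitFrequencies`, and the choice of `std::vector` and `std::array` as containers, are assumptions made for this example.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Returns the leading (most significant) decimal digit of a positive integer.
int leadingDigit(unsigned long n) {
    assert(n > 0);  // zero has no leading digit in the range 1..9
    while (n >= 10) {
        n /= 10;    // drop the last digit until only one remains
    }
    return static_cast<int>(n);
}

// Returns freq, where freq[d] is the number of values whose leading digit
// is d divided by the total number of values (d = 1..9; freq[0] is unused).
// All values are assumed to be positive.
std::array<double, 10> leadingDigitFrequencies(const std::vector<unsigned long>& data) {
    std::array<double, 10> freq{};   // value-initialized to 0.0
    if (data.empty()) {
        return freq;                 // avoid dividing by zero
    }
    std::array<int, 10> count{};     // count[d] = occurrences of leading digit d
    for (unsigned long value : data) {
        ++count[leadingDigit(value)];
    }
    for (int d = 1; d <= 9; ++d) {
        freq[d] = static_cast<double>(count[d]) / static_cast<double>(data.size());
    }
    return freq;
}
```

For the hypothetical data `{12, 19, 23, 7}`, the result holds 0.5 for digit `1` and 0.25 for each of the digits `2` and `7`. These are the bar heights a histogram of the leading digits would show.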
---
[old lines 106-111, new lines 108-113: unchanged, not shown]
---
-!INCLUDE "../../eip-diagnostic/benfords-law/en/diag-benford-law-01.html"
+!INCLUDE "../../eip-diagnostic/benfords-law/en/diag-benford-law-01.html"
<br>
-!INCLUDE "../../eip-diagnostic/benfords-law/en/diag-benford-law-02.html"
+!INCLUDE "../../eip-diagnostic/benfords-law/en/diag-benford-law-02.html"
<br>
-!INCLUDE "../../eip-diagnostic/benfords-law/en/diag-benford-law-03.html"
+!INCLUDE "../../eip-diagnostic/benfords-law/en/diag-benford-law-03.html"
<br>
-!INCLUDE "../../eip-diagnostic/benfords-law/en/diag-benford-law-04.html"
+!INCLUDE "../../eip-diagnostic/benfords-law/en/diag-benford-law-04.html"
<br>
---
[old lines 128-129, new lines 130-131: unchanged, not shown]
##Laboratory session
-###Exercise 1: Familiarizing yourself with the data files and the provided code
+###Exercise 1: Understand the data files and the provided code
####Instructions
-1. Load the project `BenfordsLaw` onto QtCreator by double clicking the file `BenfordsLaw.pro` in the folder `Documents/eip/Arrays-BenfordsLaw` on your computer. You can also go to `http://bitbucket.org/eip-uprrp/arrays-benfordslaw` to download the `Arrays-BenfordsLaw` folder to your computer.
+1. Load the project `BenfordsLaw` into `QtCreator`. There are two ways of doing this:
+
+ a. Using the virtual machine: Double-click the file `BenfordsLaw.pro` located in the folder `/home/eip/labs/arrays-benfordslaw` of your virtual machine.
+
+ b. Downloading the project’s folder from `Bitbucket`: Open a terminal and run the command `git clone http://bitbucket.org/eip-uprrp/arrays-benfordslaw` to download the folder `arrays-benfordslaw` from `Bitbucket`. Double-click the file `BenfordsLaw.pro` located in the folder that you downloaded to your computer.
2. The text files `cta-a.txt`, `cta-b.txt`, `cta-c.txt`, `cta-d.txt`, and `cta-e.txt` in the `data` directory contain either real or bogus data. Each line of a file specifies the bus route code and the number of users for that route on a certain day. Open the file `cta-a.txt` to understand the data format. This will be important when reading the file sequentially using C++. Notice that some of the route codes contain non-numeric characters.
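Reading such a file sequentially could look like the sketch below. It assumes that each line holds the route code and the number of users separated by whitespace, which should be checked against `cta-a.txt`. The names `parseRouteLine` and `readRiderCounts` are hypothetical and are not part of the provided code. The route code is read into a `std::string` because some codes are not purely numeric.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Splits one line of the assumed form "<route code> <number of users>".
// Returns false when the line does not match that form (e.g. a blank line).
bool parseRouteLine(const std::string& line, std::string& route, unsigned long& riders) {
    std::istringstream fields(line);
    if (!(fields >> route >> riders)) {
        return false;
    }
    assert(!route.empty());  // a successful extraction always yields a token
    return true;
}

// Reads the stream line by line and collects the number of users from
// every well-formed line, in file order. Malformed lines are skipped.
std::vector<unsigned long> readRiderCounts(std::istream& in) {
    std::vector<unsigned long> counts;
    std::string line;
    std::string route;
    unsigned long riders = 0;
    while (std::getline(in, line)) {
        if (parseRouteLine(line, route, riders)) {
            counts.push_back(riders);
        }
    }
    return counts;
}
```

With a `std::ifstream` opened on one of the files in the `data` directory, `readRiderCounts(file)` would return the counts whose leading digits are to be analyzed.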
[old lines 141-163, new lines 147-169: unchanged, not shown]
##Deliverables
-1. Use "Deliverables 1" in Moodle to upload the `main.cpp` file with the modifications you made in **Exercise 2**. Remember to use good programming techniques, include the names of the programmers involved, and to document your program.
+1. Use "Deliverable 1" in Moodle to upload the `main.cpp` file with the modifications you made in **Exercise 2**. Remember to use good programming techniques, include the names of the programmers involved, and document your program.
-2. Use "Deliverables 2" in Moodle to upload a **pdf** file that contains screen shots of the histograms produced after analyzing each text file. Please caption each figure with the name of the text file and provide your decision as to whether the file contained real or bogus data.
+2. Use "Deliverable 2" in Moodle to upload a **PDF** file that contains screenshots of the histograms produced after analyzing each text file. Please caption each figure with the name of the text file and state your decision as to whether the file contains real or bogus data.
---