All of our data came from the United States Environmental Protection Agency. The raw data was far too much for any server or our local machines to handle. Our CSV files fell into three sets (yearly, daily, and hourly), and we ran each set through the same three basic steps.
First, we ran a Python script on the data to eliminate unnecessary columns, keeping only enough information for R to get through it in a reasonable time. We found a resource for this on GitHub: https://github.com/ashkan25/Python-CSV-Column-Remover
We ran this on all of our files, which reduced the amount of information we had to parse through in R.
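As a rough illustration of this column-trimming step, here is a minimal sketch using only Python's standard csv module. It is not the linked script; the function name keep_columns, the file names, and the column names in the usage note are hypothetical.

```python
import csv


def keep_columns(src_path, dst_path, columns):
    """Copy src_path to dst_path, keeping only the named columns."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" silently drops every field not listed in `columns`.
        writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        # Stream row by row so the whole file never has to fit in memory.
        for row in reader:
            writer.writerow(row)
```

A call might look like keep_columns("daily.csv", "daily_small.csv", ["State Name", "County Name", "AQI"]), where both file names and all three column names are placeholders. Streaming one row at a time matters here because the input files are too large to load whole.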
Next, we wrote an R script that converted all of this information into a data frame and saved it as an R binary file. Once that file is saved, R no longer has to convert anything; it just reads the file in. This let us load the data much faster.
Finally, we wrote a set of functions, all located in our code under the ———Functions——— comment tag, which parse this data and extract information based on user input. This data was then visualized in various forms according to our requirements. We had to manipulate the data to visualize it well without putting the speed of our application at risk. We tried to be as specific as possible when creating our subsets, both to shrink the data as much as we could and to keep the results precise.
We also used data from the mapsdata package to do our mapping. We initially used the Google Maps API but found that the look we wanted was more easily achievable through the mapsdata package. This package contains county and state coordinate and border information, which allows us to color in whatever counties and states we want based on their coordinates. By merging this data with the subset that needed to be mapped, we were able to efficiently find the correct locations to color.
Note: merge in R joins two data frames. In this case we join all county coordinates with our subset information, which leaves us with the coordinates of only the counties that show up in our subset.
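To make the join idea concrete, here is a small sketch in plain Python rather than R. It is not our application code: the function inner_join and the field names state, county, lat, and aqi are hypothetical. It keeps only the rows whose key fields appear in both inputs, which is the effect described in the note.

```python
def inner_join(left, right, keys):
    """Combine rows from left and right whose key fields all match."""
    # Index the right-hand rows by their key values for fast lookup.
    index = {}
    for r in right:
        index.setdefault(tuple(r[k] for k in keys), []).append(r)
    joined = []
    for l in left:
        # Rows in left with no matching key in right are dropped.
        for r in index.get(tuple(l[k] for k in keys), []):
            joined.append({**l, **r})
    return joined


# Hypothetical data: coordinates for every county, readings for a subset.
coords = [
    {"state": "illinois", "county": "cook", "lat": 41.8},
    {"state": "illinois", "county": "lake", "lat": 42.3},
]
subset = [{"state": "illinois", "county": "cook", "aqi": 57}]

matched = inner_join(coords, subset, ["state", "county"])
```

Here matched holds a single row for the one county present in both lists, carrying both its coordinate and its reading, so only that county would be colored on the map.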