
Real Estate Purchases Data Analysis of City of New York | Sample Paper | Realcode4you



Overview & Instructions:

Your data come as five CSV files, which are data on real estate purchases within the borough of Brooklyn from 2016-2020, provided by the City of New York. For the purposes of this analysis, we will only consider purchases of single-family residences and single-unit apartments or condos.


Your goal will be to use linear regression to explain Brooklyn housing prices over this period. In Part I, your model will be judged on both the proportion of the data which it explains and its predictive accuracy within that proportion. Part II of this final assignment will ask you to specifically estimate how prices changed from Q3 2020 to Q4 2020, so keep that in mind while designing the model in Part I. When you have finalized your model, you will submit your data and model to me for review, and then discuss your model performance in a memo (Part II of this exam).


Your grade for Part I of the Final Assignment will be determined based on:

  • the number of observations used in the regression (try to include at least 13,000),

  • the number of model degrees of freedom (try for 40 or less),

  • the adjusted R-squared value (try to reach 0.6),

  • and the RMSE of the untransformed predictions (try to stay under $450,000).


You may not find a way to meet all of these requirements at once, but missing these goals narrowly will only create a small grade penalty.



Step 1: Import and prepare the data for analysis

Your data come as five CSV files, which are data on real estate purchases within the borough of Brooklyn from 2016-2020, provided by the City of New York. Please use the datasets found in this folder and do not try to recreate the data by downloading from the city website.


  • A glossary with information about each variable can be found here: Glossary of Terms for Property Sales Files

  • A “codebook” explaining the meaning behind the city’s building class codes can be found here: Building Classification Codes


1.1 Bring the data into R

Using R, bring all five datasets into your workspace. Notice that all five datasets have 21 columns, with similar (but not identical) column names. Please use the following vector of column names to standardize the data.


c('borough','neighborhood','bldclasscat','taxclasscurr','block','lot','easement','bldclasscurr','address','aptnum','zip','resunits','comunits','totunits','landsqft','grosssqft','yrbuilt','taxclasssale','bldclasssale','price','date')
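
One possible way to apply these names is sketched below. The helper `read_sales`, the `skip` argument, and the file names in the commented usage are all assumptions for illustration; inspect each CSV to see how many rows sit above the real data. The names are passed through `trimws()` as a guard against stray spaces, and every column is read as character so that cleaning can happen explicitly later.

```r
# Standard column names from the assignment, trimmed of stray whitespace
std_names <- trimws(c('borough','neighborhood','bldclasscat','taxclasscurr',
                      'block','lot','easement','bldclasscurr','address',
                      'aptnum','zip','resunits','comunits','totunits',
                      'landsqft','grosssqft','yrbuilt','taxclasssale',
                      'bldclasssale','price','date'))

# Read one yearly file as text and apply the standard names.
# `skip` (rows to drop above the header) is an assumption; check each file.
read_sales <- function(path, skip = 0) {
  df <- read.csv(path, skip = skip, header = TRUE,
                 colClasses = "character", check.names = FALSE)
  stopifnot(ncol(df) == length(std_names))  # expect exactly 21 columns
  names(df) <- std_names
  df
}

# Tiny self-contained demo: write a 21-column file, then read it back
demo_path <- tempfile(fileext = ".csv")
demo <- as.data.frame(matrix("x", nrow = 2, ncol = 21))
write.csv(demo, demo_path, row.names = FALSE)
demo_in <- read_sales(demo_path)

# Hypothetical usage on the real files (file names are placeholders):
# files <- sprintf("%d_brooklyn.csv", 2016:2020)
# sales_list <- lapply(files, read_sales)
```

Keeping the five files in a list makes it easy to apply the same cleaning function to each year before joining them.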

1.2 Join the data and make it usable for analysis

There are some data cleaning steps and transformations that would be necessary or helpful to almost any analysis. Consider your data carefully, column by column. Reformat, change data types, pay attention to white space and special characters. Datasets kept over multiple years are not always created in exactly the same way, so take care that your data is standardized across years. When you are done, create a new dataset that joins all five yearly datasets. This step will likely take a large amount of time. Do not assume you can complete it quickly. The resulting dataset should have roughly 119,000 rows.
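
The kind of cleaning described here might look like the sketch below. It uses two toy data frames with invented values, not the real files, and only four of the 21 columns. The helper names (`to_number`, `to_date`, `clean_year`) and the two date formats are assumptions; the real files may use other layouts, so check them before relying on any format string.

```r
# Toy stand-ins for two yearly files, with the sort of inconsistencies
# that can appear across years (illustrative values, not real data)
y2016 <- data.frame(neighborhood = c("BATH BEACH   ", "BAY RIDGE"),
                    price = c("$1,250,000", " - "),
                    grosssqft = c("1,800", "0"),
                    date = c("3/15/16", "11/2/16"),
                    stringsAsFactors = FALSE)
y2020 <- data.frame(neighborhood = c("BAY RIDGE", " BUSHWICK"),
                    price = c("980000", "0"),
                    grosssqft = c("1500", ""),
                    date = c("2020-07-01", "2020-12-30"),
                    stringsAsFactors = FALSE)

# Keep only digits and the decimal point, then convert to numeric;
# strings with nothing left become NA
to_number <- function(x) {
  cleaned <- gsub("[^0-9.]", "", x)
  cleaned[cleaned == ""] <- NA
  as.numeric(cleaned)
}

# Try ISO dates first, then month/day/two-digit-year (formats are assumptions)
to_date <- function(x) {
  d <- as.Date(x, format = "%Y-%m-%d")
  miss <- is.na(d)
  d[miss] <- as.Date(x[miss], format = "%m/%d/%y")
  d
}

# Apply the same cleaning to every year so the columns line up
clean_year <- function(df) {
  df$neighborhood <- trimws(df$neighborhood)
  df$price <- to_number(df$price)
  df$grosssqft <- to_number(df$grosssqft)
  df$date <- to_date(df$date)
  df
}

# Stack the cleaned years into one dataset
sales <- do.call(rbind, lapply(list(y2016, y2020), clean_year))
```

Cleaning each year with one shared function before calling `rbind()` keeps the column types identical, which is what lets the join succeed without silent type coercion.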


1.3 Filter the data and make transformations specific to this analysis

For the purposes of this analysis, we will only consider purchases of single-family residences and single-unit apartments or condos. Restrict the data to purchases where the building class at the time of sale starts with ‘A’ or ‘R’ and where the number of total units and the number of residential units are both 1. Additionally restrict the data to observations where gross square footage is more than 0 and sale price is non-missing. The resulting dataset should have roughly 19,000 rows.
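
A minimal sketch of these restrictions follows, run on a toy data frame with invented rows. It assumes the columns have already been renamed and converted to numeric; the object names `sales`, `keep`, and `homes` are placeholders.

```r
# Toy data: one row per sale (illustrative values only)
sales <- data.frame(
  bldclasssale = c("A1", "R4", "B2", "A5", "R3", "A9"),
  totunits     = c(1, 1, 2, 1, 1, 1),
  resunits     = c(1, 1, 2, 1, 0, 1),
  grosssqft    = c(1800, 750, 2400, 0, 900, 1200),
  price        = c(950000, 620000, 1500000, 800000, 700000, NA),
  stringsAsFactors = FALSE)

# Building class at sale starts with A or R, exactly one unit,
# positive gross square footage, and a non-missing price
keep <- grepl("^[AR]", sales$bldclasssale) &
  sales$totunits == 1 & sales$resunits == 1 &
  !is.na(sales$grosssqft) & sales$grosssqft > 0 &
  !is.na(sales$price)
keep[is.na(keep)] <- FALSE  # rows with missing unit counts are dropped

homes <- sales[keep, ]
```

Comparing `nrow(homes)` with the expected row count after each condition is a quick way to spot a cleaning step that went wrong upstream.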


Step 2: EDA and feature engineering

Your goal will be to use linear regression to explain Brooklyn housing prices within the 2016-2020 window. You will be asked to make predictions for the sale prices within the dataset. You are encouraged to think of ways to get the most explanatory power out of your current variables.


2.1 Exploratory data analysis

Consider price as a potential response variable. Examine how it is distributed, and how it associates with the other variables in your data. Think about how you would use these other variables to explain price. Consider whether each variable should enter as a continuous numeric predictor, or as a factor. Consider transformations of your response variable, transformations of your predictors, or both. Use this exploratory data analysis to revisit your initial data cleaning steps, which might need revision.
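
As a rough illustration of these checks, the sketch below uses simulated data (not the Brooklyn files) in which prices are right-skewed by construction. The variable names and the simulation parameters are assumptions chosen only to make the example run.

```r
set.seed(1)
# Simulated stand-in data: prices are right-skewed by design
n <- 500
grosssqft <- round(runif(n, 500, 4000))
price <- exp(12 + 0.0004 * grosssqft + rnorm(n, sd = 0.4))
zip <- sample(c("11201", "11215", "11229"), n, replace = TRUE)

# Right skew shows up as a mean well above the median on the raw scale;
# on the log scale the two are usually much closer
c(mean = mean(price), median = median(price))
c(mean = mean(log(price)), median = median(log(price)))

# Association with a numeric predictor, raw versus log response
cor_raw <- cor(price, grosssqft)
cor_log <- cor(log(price), grosssqft)

# A zip code is a label, not a quantity, so summarise it as a factor
med_by_zip <- tapply(price, factor(zip), median)
```

Plots such as `hist(log(price))` or `boxplot(log(price) ~ zip)` would complement these summaries.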


2.2 Pre-modeling and feature engineering

Begin to construct linear models explaining price (or a transformation of price). Consider your total model degrees of freedom, your adjusted R^2, and your RMSE (root mean square error). Also consider whether your models show severe violations of the OLS model assumptions, or merely slight violations of the OLS model assumptions.
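
The three quantities named above could be pulled from a fitted model roughly as follows. The data are simulated and the model is a placeholder. Counting model degrees of freedom as the number of estimated coefficients excluding the intercept is an assumption about the grading convention, and back-transforming with plain `exp()` is a simplification that ignores retransformation bias.

```r
set.seed(2)
# Simulated stand-in data (not the Brooklyn files)
n <- 400
d <- data.frame(
  grosssqft = round(runif(n, 500, 4000)),
  zip = factor(sample(c("11201", "11215", "11229"), n, replace = TRUE)))
d$price <- exp(12 + 0.0004 * d$grosssqft + 0.2 * (d$zip == "11201") +
               rnorm(n, sd = 0.3))

# Model a transformation of price
fit <- lm(log(price) ~ grosssqft + zip, data = d)

# Model degrees of freedom: estimable coefficients minus the intercept
# (this counting convention is an assumption)
model_df <- fit$rank - 1

adj_r2 <- summary(fit)$adj.r.squared

# RMSE on the untransformed dollar scale: back-transform fitted values first
pred_dollars <- exp(fitted(fit))
rmse <- sqrt(mean((d$price - pred_dollars)^2))

# Residual plots for checking OLS assumptions (uncomment to view)
# plot(fit, which = 1:2)
```

Computing RMSE in dollars rather than on the log scale matters because the grading target is stated for untransformed predictions.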


Some home purchases are not competitive. They are exchanged for an amount of money that “doesn’t make sense”, for instance houses sold between family members, or houses inherited through a will. Consider how to best identify and remove these uncompetitive sales from your sample. Your model will be judged on both the proportion of the data which it explains as well as its predictive accuracy within that proportion.
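
One simple, admittedly crude, way to flag such sales is with price floors, sketched below on toy values. Both cutoffs (`min_price` and `min_ppsf`) are arbitrary assumptions for illustration, not values given in the assignment; they should be chosen from your own exploratory analysis.

```r
# Toy sales (illustrative values only)
sales <- data.frame(
  price     = c(10, 0, 850000, 1200000, 1, 640000, 50000),
  grosssqft = c(1500, 1200, 1400, 2100, 1800, 900, 2500))

# Illustrative cutoffs; both are assumptions to tune during EDA
min_price <- 10000  # drops nominal transfers such as $0, $1, $10
min_ppsf  <- 50     # drops implausibly low dollars per square foot

ppsf <- sales$price / sales$grosssqft
competitive <- sales$price >= min_price & ppsf >= min_ppsf
kept <- sales[competitive, ]

# Track how much data survives, since the observation count is graded
share_kept <- nrow(kept) / nrow(sales)
```

Stricter cutoffs tend to improve fit but shrink the sample, so it helps to watch `share_kept` alongside RMSE when adjusting them.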


As you model, consider new predictors you can make from the original predictors (this process is called feature engineering), which might help increase your explanatory power. These include interaction terms, polynomial terms, stair step functions, etc.
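
The sketch below shows one example of each kind of term on simulated data. The break points for the year-built bins, the variable names `era` and `quarter`, and the model formula are all assumptions for illustration. The quarter-of-sale factor is built but left out of the model here, since each extra factor level costs a degree of freedom.

```r
set.seed(3)
# Simulated stand-in data (not the Brooklyn files)
n <- 300
d <- data.frame(
  grosssqft = round(runif(n, 500, 4000)),
  yrbuilt   = sample(1900:2019, n, replace = TRUE),
  date      = as.Date("2016-01-01") + sample(0:1825, n, replace = TRUE))
d$price <- exp(12 + 0.0004 * d$grosssqft + rnorm(n, sd = 0.3))

# Stair-step function: bin construction year into eras (breaks are arbitrary)
d$era <- cut(d$yrbuilt, breaks = c(1899, 1945, 1990, 2019),
             labels = c("prewar", "midcentury", "recent"))

# Quarter-of-sale factor such as "2020 Q3", useful for period comparisons
d$quarter <- factor(paste(format(d$date, "%Y"), quarters(d$date)))

# Polynomial term via I(), interaction via `*`
# (grosssqft * era expands to grosssqft + era + grosssqft:era)
fit <- lm(log(price) ~ grosssqft * era + I(grosssqft^2), data = d)

# Each engineered term adds coefficients; count them against the budget
n_coef <- length(coef(fit))
```

Here the model uses six degrees of freedom beyond the intercept: one linear term, one squared term, two era indicators, and two interaction slopes.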


2.3 Reach a stopping point

Your goal is to come up with a model which minimizes RMSE, maximizes the proportion of the data used in your analysis, and uses no more than 40 model degrees of freedom. When you have finalized your model, you will be ready to submit your answers and discuss your model performance.


Step 3: Submit your model and your work.

Create an RDS file which contains your regression object as well as the final dataset used in your regression. Name the file with your own name and submit through Canvas. Sample code below:


saveRDS(list(model=my.lm, data=my.data), file='jonathanwilliams.RDS')
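
Before submitting, it may be worth reading the file back to confirm that both objects were saved. The sketch below uses a toy model and a temporary file path; `my.lm` and `my.data` stand in for your own objects, and `yourname.RDS` is a placeholder file name.

```r
# Toy model and data standing in for the real submission objects
my.data <- data.frame(x = 1:10,
                      y = c(2.1, 3.9, 6.2, 8.1, 9.8,
                            12.2, 13.9, 16.1, 18.0, 20.2))
my.lm <- lm(y ~ x, data = my.data)

# Save to a temporary location (placeholder file name)
path <- file.path(tempdir(), "yourname.RDS")
saveRDS(list(model = my.lm, data = my.data), file = path)

# Read the file back and confirm both pieces survived
check <- readRDS(path)
saved_names <- names(check)
rows_match <- nrow(check$data) == nobs(check$model)
```

If `rows_match` is `FALSE`, the saved dataset is not the one the model was actually fitted on, for example because rows with missing values were dropped during fitting.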

Please also submit the script you used to prepare the data and run the model, either as a .R or .RMD file. Name it after yourself as well, for example: jonathanwilliams.R



