Paper details  

This report is a follow-up to the second topic in Report #1 (union status). In this report you are to estimate and plot a classification tree with union coverage as the dependent variable. You will therefore have to create a binary variable, with one category being employed individuals who are union members or covered by a collective agreement, and the other category being non-unionized employees.

The usual explanatory variables for union coverage are sector (public vs. private), industry (goods producing vs. service producing), occupation, sex, firm size (the number of employees at all locations of the employer), and establishment size (the number of employees at the location of employment). It's helpful to know that the goods-producing sector comprises codes 1-8 in NAICS21 and the service-producing sector comprises codes 9-21 in NAICS21.

Start with a descriptive analysis of union coverage by the relevant variables. Some of the descriptive analysis in Report #1 is relevant here, and you can reuse tables from Report #1 in this section if they are relevant to this report as well.

The second section should be a classification tree analysis using rpart. Start with a very low complexity parameter and prune using the usual xerror criterion. The only accuracy measure I want you to produce is a confusion matrix.

The third section involves splitting the data into two: public sector only and private sector only. Perform the same analysis you performed in the second section, but separately for each of the two sectors.

The report should be no longer than 1,500 words, be in either a Word document or a PDF document, be paginated, and have a title page. You must also submit, along with your report, an R Markdown (rmd) file that contains the syntax and output you used in the report.

Hints:
• The UNION variable only applies to employees, so first you will have to eliminate all observations where LFSSTAT equals 3 or 4.
• There are no missing values in the LFS data. NAs are actually "Not Applicable". No imputing is appropriate. Once you get the data down to the variables you are going to use, any records that contain NAs will have to be case deleted. If you do not case delete them, rpart will impute them.
• The variables are all categorical, but they look like integers. You should convert them to factors (a single df <- lapply(df, factor) will convert all variables in the data frame df into factors).
• Although this is obvious, just to be clear: to look at accuracy measures you are going to have to split your data into training and testing sets.
• Check the str frequently. Sometimes commands turn data frames into lists without you realizing it.
• This is a big data set and you may end up, even after pruning, with quite a big tree. If a diagram is too wide to read in portrait orientation, it is acceptable to place it on its own page in landscape orientation.
• If you do end up with a large tree, the information may be difficult to read. You may have to experiment with different plot commands to improve the readability of your figures.
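The hints above can be sketched in R roughly as follows. This is a minimal illustration on synthetic data, not the LFS file itself. UNION, LFSSTAT and NAICS21 are the names used in this brief; the column names SECTOR and SEX, the UNION coding (1 = member, 2 = covered, 3 = non-union), the 70/30 split, and the reading of "the usual xerror criterion" as "the cp with the lowest xerror" are all assumptions made for the example (the one-standard-error rule is a common alternative). The sketch uses emp[] <- lapply(emp, factor) so that the result stays a data frame, since a plain assignment from lapply returns a list.

```r
library(rpart)

set.seed(42)  # reproducible synthetic data, split, and cross-validation

# Hypothetical stand-in for the LFS extract: integer-coded categorical columns.
n <- 2000
lfs <- data.frame(
  LFSSTAT = sample(1:4, n, replace = TRUE, prob = c(0.6, 0.1, 0.1, 0.2)),
  SECTOR  = sample(1:2, n, replace = TRUE),   # assumed coding: 1 = public, 2 = private
  NAICS21 = sample(1:21, n, replace = TRUE),
  SEX     = sample(1:2, n, replace = TRUE)
)
# Assumed UNION coding: 1 = member, 2 = covered, 3 = non-union.
p_union   <- ifelse(lfs$SECTOR == 1, 0.75, 0.15)
lfs$UNION <- ifelse(runif(n) < p_union, sample(1:2, n, replace = TRUE), 3)
lfs$UNION[lfs$LFSSTAT %in% c(3, 4)] <- NA     # mimic "Not Applicable"

# 1. Keep employed respondents only (drop LFSSTAT 3 and 4).
emp <- lfs[!(lfs$LFSSTAT %in% c(3, 4)), ]

# 2. Reduce to the analysis variables, then case delete any remaining NAs.
emp <- na.omit(emp[, c("UNION", "SECTOR", "NAICS21", "SEX")])

# 3. Binary dependent variable and goods/services indicator (codes 1-8 = goods).
emp$COVERED <- ifelse(emp$UNION %in% c(1, 2), "union", "nonunion")
emp$GOODS   <- ifelse(emp$NAICS21 <= 8, "goods", "services")
emp$UNION   <- NULL
emp$NAICS21 <- NULL

# 4. Convert every column to a factor while keeping the data frame structure.
emp[] <- lapply(emp, factor)
str(emp)

# 5. Training/testing split (70/30 is an arbitrary choice for this sketch).
idx   <- sample(nrow(emp), size = floor(0.7 * nrow(emp)))
train <- emp[idx, ]
test  <- emp[-idx, ]

# 6. Grow a large tree with a very low cp, then prune at the lowest xerror.
fit     <- rpart(COVERED ~ ., data = train, method = "class",
                 control = rpart.control(cp = 0.0001))
cp_tab  <- fit$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# 7. Confusion matrix on the held-out testing set.
pred     <- predict(pruned, newdata = test, type = "class")
conf_mat <- table(Predicted = pred, Actual = test$COVERED)
print(conf_mat)
```

The pruned tree can then be drawn with plot(pruned) followed by text(pruned). For the third section, one option is split(emp, emp$SECTOR) and repeating steps 5 to 7 on each piece, dropping the SECTOR column first because it is constant within each subset.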

Economics