E20-007 Data Science Associate Exam (EMC certification): practice questions.

Q16. Assume that you have a data frame in R. Which function would you use to display descriptive statistics about this variable? 

A. summary 

B. str 

C. attributes 

D. levels 
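
For readers working outside R, the idea of a descriptive-statistics summary can be sketched in Python. This only approximates the kind of output R's summary() gives for a numeric column (minimum, quartiles, median, mean, maximum). The `describe` helper and the sample values are hypothetical and are not part of the exam material.

```python
import statistics

def describe(values):
    """Five-number summary plus the mean, similar in spirit to R's summary()."""
    ordered = sorted(values)
    # method="inclusive" treats the data as the full population range,
    # which matches the default quantile interpolation used by R.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(ordered),
        "q3": q3,
        "max": ordered[-1],
    }

sales = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up values for illustration
stats = describe(sales)
print(stats)
```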


Q17. What is holdout data? 

A. a subset of the provided data set selected at random and used to validate the model 

B. a subset of the provided data set selected at random and used to initially construct the model 

C. a subset of the provided data set that is removed by the data scientist because it contains data errors 

D. a subset of the provided data set that is removed by the data scientist because it contains outliers 
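
As a rough illustration of how a random holdout set can be carved out of a data set, here is a minimal Python sketch. The function name `holdout_split`, the 20% fraction, and the fixed seed are assumptions chosen for the example, not something the exam specifies.

```python
import random

def holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Randomly split rows into (training, holdout) lists."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)          # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

records = list(range(100))         # stand-in for 100 data rows
train, holdout = holdout_split(records)
print(len(train), len(holdout))
```

The model would be fit on `train` only, and `holdout` would be kept aside to check how well it generalizes.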


Q18. Refer to the Exhibit. 

In the Exhibit, for effective visualization, what is the chart's primary flaw? 

A. The use of 3 dimensions. 

B. The slanting of axis labels. 

C. The location of the legend. 

D. The order of the columns. 


Q19. In R, functions like plot() and hist() are known as what? 

A. generic functions 

B. virtual methods 

C. virtual functions 

D. generic methods 


Q20. Refer to the exhibit. 

You are asked to write a report on how specific variables impact your client’s sales using a data set provided to you by the client. The data includes 15 variables that the client views as directly related to sales, and you are restricted to these variables only. 

After a preliminary analysis of the data, the following findings were made: 

- Multicollinearity is not an issue among the variables 

- Only three variables (A, B, and C) have significant correlation with sales 

You build a linear regression model on the dependent variable of sales with the independent variables of A, B, and C. The results of the regression are seen in the exhibit. 

You cannot request additional data. What is a way that you could try to increase the R² of the model without artificially inflating it? 

A. Create clusters based on the data and use them as model inputs 

B. Force all 15 variables into the model as independent variables 

C. Create interaction variables based only on variables A, B, and C 

D. Break variables A, B, and C into their own univariate models 
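
For readers unfamiliar with the term, an interaction variable is usually the product of two existing predictors. The sketch below shows how pairwise interaction terms could be derived from A, B, and C. The `add_interactions` helper, the `AxB`-style column names, and the numbers are hypothetical.

```python
from itertools import combinations

def add_interactions(row, names):
    """Return a copy of row with pairwise product terms added."""
    expanded = dict(row)
    for left, right in combinations(names, 2):
        # e.g. the "AxB" term is simply A multiplied by B
        expanded[f"{left}x{right}"] = row[left] * row[right]
    return expanded

row = {"A": 2.0, "B": 3.0, "C": 5.0}   # one made-up observation
expanded = add_interactions(row, ["A", "B", "C"])
print(expanded)
```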


Q21. Refer to the exhibit. 

You are using K-means clustering to classify customer behavior for a large retailer. You need to determine the optimum number of customer groups. You plot the within-sum-of-squares (wss) data as shown in the exhibit. How many customer groups should you specify? 

A. 2 

B. 3 

C. 4 

D. 8 
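
The exhibit is not reproduced here, but the general idea behind a wss plot can be sketched in Python. The example below computes the within-cluster sum of squares for k = 1 to 5 on made-up one-dimensional data. As a simplification it finds the best clustering by brute force over contiguous splits of the sorted values (valid in one dimension) rather than running the iterative K-means algorithm. All names and data are assumptions.

```python
from itertools import combinations

def sum_sq(points):
    """Sum of squared distances from the points to their mean."""
    mean = sum(points) / len(points)
    return sum((p - mean) ** 2 for p in points)

def best_wss(values, k):
    """Smallest within-cluster sum of squares for k clusters of 1-D data."""
    ordered = sorted(values)
    n = len(ordered)
    best = float("inf")
    # Try every way of cutting the sorted values into k contiguous groups.
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        total = sum(sum_sq(ordered[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best

spend = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]  # three obvious groups
wss = {k: best_wss(spend, k) for k in range(1, 6)}
for k, value in wss.items():
    print(k, round(value, 2))
```

On this toy data the wss falls sharply up to k = 3 and only slightly afterwards; that bend (the "elbow") is what one typically looks for when reading such a plot.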


Q22. Which characteristic applies only to Business Intelligence as opposed to Data Science? 

A. Uses only structured data 

B. Supports solving “what if” scenarios 

C. Uses large data sets 

D. Uses predictive modeling techniques 


Q23. The web analytics team uses Hadoop to process access logs. They now want to correlate this data with structured user data residing in their massively parallel database. Which tool should they use to export the structured data from Hadoop? 

A. Sqoop 

B. Pig 

C. Chukwa 

D. Scribe 


Q24. You are building a logistic regression model to predict whether a tax filer will be audited within the next two years. Your training set population is 1000 filers. The audit rate in your training data is 4.2%. What is the sum of the probabilities that the model assigns to all the filers in your training set that have been audited? 

A. 42.0 

B. 4.2 

C. 0.42 

D. 0.042 
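
A property that may help here: when a logistic regression with an intercept is fit by maximum likelihood, the fitted probabilities over the whole training set sum to the number of positive cases (for 1000 filers at a 4.2% rate, 1000 × 0.042 = 42). The sketch below checks this on a tiny made-up data set using a hand-written Newton's method. The data and variable names are assumptions for illustration only.

```python
import math

x = [0, 1, 2, 3, 4, 5, 6, 7]   # made-up predictor
y = [0, 0, 1, 0, 1, 0, 1, 1]   # made-up outcomes (1 = audited)

def predict(b0, b1, xi):
    """Logistic model probability for a single observation."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))

b0, b1 = 0.0, 0.0
for _ in range(25):            # Newton's method on the log-likelihood
    p = [predict(b0, b1, xi) for xi in x]
    g0 = sum(yi - pi for yi, pi in zip(y, p))                    # gradient, intercept
    g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))      # gradient, slope
    w = [pi * (1.0 - pi) for pi in p]
    h00 = sum(w)
    h01 = sum(wi * xi for wi, xi in zip(w, x))
    h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det                            # 2x2 solve by hand
    b1 += (h00 * g1 - h01 * g0) / det

fitted = [predict(b0, b1, xi) for xi in x]
print(round(sum(fitted), 6), sum(y))
```

The two printed values should agree: the sum of the fitted probabilities matches the count of positive outcomes in the training data.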


Q25. A Data Scientist is assigned to build a model from a reporting data warehouse. The warehouse contains data collected from many sources and transformed through a complex, multi-stage ETL process. What is a concern the data scientist should have about the data? 

A. It is too processed 

B. It is not structured 

C. It is not normalized 

D. It is too centralized 


Q26. Which activity might be performed in the Operationalize phase of the Data Analytics Lifecycle? 

A. Run a pilot 

B. Try different analytical techniques 

C. Try different variables 

D. Transform existing variables 


Q27. You are provided four different datasets. Initial analysis on these datasets shows that they have identical mean, variance and correlation values. What should your next step in the analysis be? 

A. Visualize the data to further explore the characteristics of each data set 

B. Select one of the four datasets and begin planning and building a model 

C. Combine the data from all four of the datasets and begin planning and building a model 

D. Recalculate the descriptive statistics since they are unlikely to be identical for each dataset 
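
The situation described resembles Anscombe's quartet, a well-known set of four small datasets with nearly identical summary statistics but very different shapes. The sketch below uses the first two of those datasets (values quoted from the commonly published version) to show how close the statistics are. The `summarize` helper is hypothetical.

```python
import statistics

x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def summarize(xs, ys):
    """Return (mean of y, sample variance of y, Pearson correlation of x and y)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    var_x, var_y = statistics.variance(xs), statistics.variance(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (len(xs) - 1)
    return mean_y, var_y, cov / (var_x * var_y) ** 0.5

s1 = summarize(x, y1)
s2 = summarize(x, y2)
print([round(v, 2) for v in s1])
print([round(v, 2) for v in s2])
```

The summaries agree to two decimal places even though the raw y values differ point by point, which is why summary statistics alone can hide the structure of a dataset.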


Q28. Refer to the Exhibit. 

In the Exhibit, the table shows the values for the input Boolean attributes "A", "B", and "C". It also shows the values for the output attribute "class". Which decision tree is valid for the data? 

A. Tree B 

B. Tree A 

C. Tree C 

D. Tree D 


Q29. You are performing a market basket analysis using the Apriori algorithm. Which measure is a ratio describing how many more times two items are present together than would be expected if those two items are statistically independent? 

A. Lift 

B. Leverage 

C. Support 

D. Confidence 
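
To make the ratio in the question concrete, the sketch below computes the observed joint support of two items divided by the support expected under independence, which is the usual definition of lift. The transactions and helper names are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def lift(transactions, x, y):
    """Observed joint support divided by the support expected under independence."""
    joint = support(transactions, {x, y})
    expected = support(transactions, {x}) * support(transactions, {y})
    return joint / expected

baskets = [                       # five hypothetical shopping baskets
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
    {"butter"},
]
value = lift(baskets, "bread", "milk")
print(round(value, 3))
```

A result above 1 suggests the two items appear together more often than independence would predict; a result near 1 suggests no association.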


Q92. In which lifecycle stage are appropriate analytical techniques determined? 

A. Model planning 

B. Model building 

C. Data preparation 

D. Discovery 


Q30. Which word or phrase completes the statement? Data-ink ratio is to data visualization as ________. 

A. Confusion matrix is to classifier 

B. Data scientist is to big data 

C. Seasonality is to ARIMA 

D. K-means is to Naive Bayes