1. What are the key differences between Data Analysis and Data Mining?
| Data Analysis | Data Mining |
|---|---|
| Involves cleaning, organizing, and using data to produce meaningful results | Involves searching for hidden patterns in the data |
| Results are more comprehensible to a wide variety of audiences | Results can seem complex to beginners |
2. What is Data Validation?
Data validation, as the name suggests, is the process that involves determining the accuracy of data and the quality of the source as well. There are many processes in data validation but the main ones are data screening and data verification.
- Data screening: Making use of a variety of models to ensure that the data is accurate and no redundancies are present.
- Data verification: If a redundancy is found, it is evaluated through multiple steps, and a decision is then made on whether the data item should be kept.
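The two steps above can be sketched in plain Python. This is a minimal, illustrative example; the record fields (`id`, `age`) and the screening rules are hypothetical, not taken from any particular validation framework.

```python
# Minimal data-screening sketch: flag duplicate IDs and implausible
# values so a verification step can decide what to do with them.
def screen(records):
    """Return (clean, flagged) after basic duplicate and range checks."""
    seen_ids = set()
    clean, flagged = [], []
    for rec in records:
        if rec["id"] in seen_ids:            # redundancy: same id seen before
            flagged.append((rec, "duplicate id"))
        elif not 0 <= rec["age"] <= 120:     # implausible value
            flagged.append((rec, "age out of range"))
        else:
            seen_ids.add(rec["id"])
            clean.append(rec)
    return clean, flagged

records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 151},   # fails the range check
    {"id": 1, "age": 34},    # duplicate of the first record
]
clean, flagged = screen(records)
print(len(clean), len(flagged))  # 1 clean record, 2 flagged
```

In practice, the flagged records would then go through verification, where each one is examined and a keep/discard decision is made.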
3. What is Data Analysis, in brief?
Data analysis is a structured procedure that involves working with data by performing activities such as ingestion, cleaning, transforming, and assessing it to provide insights, which can be used to drive revenue.
Data is collected, to begin with, from varied sources. Since the data is a raw entity, it has to be cleaned and processed to fill out missing values and to remove any entity that is out of the scope of usage.
After preprocessing the data, it can be analyzed with the help of models, which use the data to perform some analysis on it.
The last step involves reporting and ensuring that the data output is converted to a format that can also cater to a non-technical audience, alongside the analysts.
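The four stages above can be condensed into a toy end-to-end sketch. The data here is invented (a list of sales amounts where `None` marks a missing value), and the step names are only meant to mirror the description above.

```python
# Toy data-analysis pipeline: ingest -> clean -> analyze -> report.
from statistics import mean

raw = [120.0, None, 95.5, 80.0, None]             # 1. ingestion (raw data)

known = [x for x in raw if x is not None]
cleaned = [x if x is not None else mean(known)    # 2. cleaning: fill missing
           for x in raw]                          #    values with the mean

total = sum(cleaned)                              # 3. analysis (a simple sum)

report = f"{len(raw)} records, total sales {total:.2f}"  # 4. reporting
print(report)
```

The final string is the kind of plain-language output that a non-technical audience can consume directly.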
4. How to know if a data model is performing well or not?
This question is subjective, but there are certain simple assessment points that can be used to assess the accuracy of a data model. They are as follows:
- A well-designed model should offer good predictability, i.e., the ability to easily predict future insights when needed.
- A rounded model adapts easily to any change made to the data or to the pipeline, if need be.
- The model should be able to cope in case there is an immediate requirement to scale the data up significantly.
- The model should be easy to work with and easily understood by clients so that they can derive the required results.
5. Explain Data Cleaning in brief?
Data cleaning is also called data wrangling. As the name suggests, it is a structured way of finding erroneous content in data and safely removing it to ensure that the data is of the utmost quality. Here are some of the ways of performing data cleaning:
- Removing a data block entirely
- Finding ways to fill in blank data, without causing redundancies
- Replacing data with its mean or median values
- Making use of placeholders for empty spaces
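Three of the strategies above can be sketched on a small, made-up column where `None` represents a missing value:

```python
# Sketch of common cleaning strategies for missing values.
from statistics import mean, median

col = [10, None, 30, None, 50]
known = [x for x in col if x is not None]

dropped       = known                                        # remove rows entirely
mean_filled   = [x if x is not None else mean(known)   for x in col]
median_filled = [x if x is not None else median(known) for x in col]
placeholder   = [x if x is not None else "N/A"         for x in col]

print(mean_filled)
```

Which strategy is appropriate depends on how much data is missing and whether the missing entries are random; dropping rows is safest when very few are affected.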
6. What are some of the problems that a working Data Analyst might encounter?
There can be many issues that a Data Analyst might face when working with data. Here are some of them:
- The accuracy of the model in development will be low if there are duplicate entries of the same entity, spelling errors, or incorrect data.
- If the data is being ingested from an unverified source, it might require a lot of cleaning and preprocessing before the analysis can begin.
- The same goes for when extracting data from multiple sources and merging them for use.
- Analysis will be set back if the data obtained is incomplete or inaccurate.
7. What is Data Profiling?
Data profiling is a methodology that involves analyzing all entities present in data to a greater depth. The goal here is to provide highly accurate information based on the data and its attributes such as the datatype, frequency of occurrence, and more.
8. What are the scenarios that could cause a model to be retrained?
Data is never a stagnant entity. If there is an expansion in business, it could open up sudden opportunities that call for a change in the data. Further, regularly assessing the model can help the analyst determine whether it needs to be retrained.
However, the general rule of thumb is to ensure that the models are retrained when there is a change in business protocols and offerings.
9. What are the prerequisites to become a Data Analyst?
There are many skills that a budding Data Analyst needs. Here are some of them:
- Proficient in databases such as SQL, MongoDB, and more
- Ability to effectively collect and analyze data
- Knowledge of database designing and data mining
- Ability and experience in working with large datasets
10. What are the top tools used to perform Data Analysis?
There is a wide spectrum of tools that can be used in the field of data analysis. Here are some of the popular ones:
- Google Search Operators
11. What is an outlier?
An outlier is a value in a dataset that is considered to be away from the mean of the characteristic feature of the dataset. There are two types of outliers: univariate and multivariate.
12. How can we deal with problems that arise when the data flows in from a variety of sources?
There are many ways to go about dealing with multi-source problems. However, these are done primarily to solve the problems of:
- Identifying the presence of similar/same records and merging them into a single record
- Restructuring the schema to ensure there is good schema integration
13. What are some of the popular tools used in Big Data?
There are multiple tools that are used to handle Big Data. Some of the most popular ones are as follows:
14. What is the use of a Pivot table?
Pivot tables are one of the key features of Excel. They allow a user to view and summarize the entirety of large datasets in a simple manner. Most of the operations with Pivot tables involve drag-and-drop operations that aid in the quick creation of reports.
15. Explain the KNN imputation method, in brief?
KNN imputation is a method that requires selecting both the number of nearest neighbors and a distance metric. It can predict both discrete and continuous attributes of a dataset.
A distance function is used here to find the similarity of two or more attributes, which will help in further analysis.
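A hand-rolled sketch of the idea, assuming a one-dimensional distance on a known feature `x` and k = 2 neighbors (the field names and data are invented; libraries such as scikit-learn provide a `KNNImputer` that does this at scale):

```python
# KNN imputation sketch: fill a missing 'y' with the mean 'y' of the
# k rows whose 'x' values are closest (absolute distance) to this row.
def knn_impute(rows, k=2):
    complete = [r for r in rows if r["y"] is not None]
    for r in rows:
        if r["y"] is None:
            nearest = sorted(complete, key=lambda c: abs(c["x"] - r["x"]))[:k]
            r["y"] = sum(c["y"] for c in nearest) / k
    return rows

rows = [{"x": 1, "y": 10}, {"x": 2, "y": 20},
        {"x": 3, "y": None}, {"x": 9, "y": 90}]
knn_impute(rows)
print(rows[2]["y"])  # mean of the two nearest neighbors' y values
```

Here the row with `x = 3` borrows from its two nearest neighbors (`x = 2` and `x = 1`), so the imputed value is the mean of 20 and 10.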
16. What are the top Apache frameworks used in a distributed computing environment?
MapReduce and Hadoop are considered to be the top Apache frameworks when the situation calls for working with a huge dataset in a distributed working environment.
17. What is Hierarchical Clustering?
Hierarchical clustering, or hierarchical cluster analysis, is an algorithm that groups similar objects into common groups called clusters. The goal is to create a set of clusters, where each cluster is different from the other and, individually, they contain similar entities.
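A bare-bones agglomerative (bottom-up) sketch on made-up 1-D points, using single linkage: start with every point as its own cluster, then repeatedly merge the two closest clusters until the target count is reached. Real workloads would use a library implementation (e.g. SciPy's hierarchical clustering) rather than this quadratic loop.

```python
# Agglomerative hierarchical clustering with single linkage on 1-D points.
def hierarchical(points, n_clusters):
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # single linkage: cluster distance = closest pair of points
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the two closest clusters
    return [sorted(c) for c in clusters]

print(hierarchical([1, 2, 9, 10, 25], 3))  # [[1, 2], [9, 10], [25]]
```

The record of which clusters merged, and at what distance, is what a dendrogram visualizes.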
18. What are the steps involved when working with a Data Analysis project?
Many steps are involved when working end-to-end on a data analysis project. Some of the important steps are mentioned below:
- Problem statement
- Data cleaning/preprocessing
- Data exploration
- Data validation
19. Can you name some of the statistical methodologies used by Data Analysts?
There are many statistical techniques that are very useful when performing data analysis. Here are some of the important ones:
- Markov process
- Cluster analysis
- Imputation techniques
- Bayesian methodologies
- Rank statistics
20. What is Time Series Analysis?
Time series analysis, or TSA for short, is a widely used statistical technique when working with trend analysis and time-series data in particular. Time-series data consists of observations recorded at particular intervals of time or set periods.
21. Where is Time Series Analysis used?
Since time series analysis (TSA) has a wide scope of usage, it can be used in multiple domains. Here are some of the places where TSA plays an important role:
- Signal processing
- Weather forecasting
- Earthquake prediction
- Applied science
22. What are some of the properties of clustering algorithms?
Any clustering algorithm, when implemented, will have the following properties:
- Flat or hierarchical
23. What is Collaborative Filtering?
Collaborative filtering is an algorithm used to create recommendation systems mainly considering the behavioral data of a customer or a user.
For example, when browsing through e-commerce sites, a section called ‘Recommended for you’ is present. This is done using the browsing history, alongside analyzing the previous purchases and collaborative filtering.
24. What are the types of Hypothesis Testing used today?
There are many types of hypothesis testing. Some of them are as follows:
- Analysis of variance (ANOVA): Here, the analysis is conducted between the mean values of multiple groups.
- T-test: This form of testing is used when the standard deviation is not known and the sample size is relatively less.
- Chi-square test: This kind of hypothesis testing is used when there is a requirement to find out the level of association between the categorical variables in a sample.
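As a worked example of the t-test's core statistic, here is a one-sample t computed by hand on made-up data, with a hypothesized mean of 100 (the data and null value are purely illustrative):

```python
# One-sample t-statistic: t = (sample mean - mu0) / (s / sqrt(n)),
# where s is the sample standard deviation and n the sample size.
from math import sqrt
from statistics import mean, stdev

sample = [102, 98, 110, 105, 95, 104]
mu0 = 100                                 # null-hypothesis mean
n = len(sample)
t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
print(round(t, 3))
```

The resulting t value would then be compared against the t-distribution with n - 1 degrees of freedom to obtain a p-value.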
25. What are some of the data validation methodologies used in Data Analysis?
Many types of data validation techniques are used today. Some of them are:
- Field-level validation: Validation is done across each of the fields to ensure that there are no errors in the data entered by the user.
- Form-level validation: Here, validation is done when the user completes working with the form but before the information is saved.
- Data saving validation: This form of validation takes place when the file or the database record is being saved.
- Search criteria validation: This kind of validation is used to check whether valid results are returned when the user is looking for something.
26. What is K-means algorithm?
K-means algorithm clusters data into different sets based on how close the data points are to each other. The number of clusters is indicated by ‘k’ in the k-means algorithm. It tries to maintain a good amount of separation between each of the clusters.
However, since it works in an unsupervised nature, the clusters will not have any sort of labels to work with.
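A bare-bones 1-D sketch of the loop (toy data and hand-picked starting centroids; real uses favor a library such as scikit-learn's `KMeans`): assign each point to its nearest centroid, recompute centroids as cluster means, and repeat until the centroids stop moving.

```python
# Minimal k-means on 1-D points with k = 2.
def kmeans(points, centroids):
    while True:
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)          # assignment step
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]  # update step
        if new == centroids:                     # converged
            return clusters, centroids
        centroids = new

clusters, centroids = kmeans([1, 2, 3, 10, 11, 12], centroids=[1.0, 12.0])
print(clusters)  # the points split into a low group and a high group
```

Note that, as the answer above says, the resulting groups carry no labels; interpreting what each cluster means is left to the analyst.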
27. What is the difference between the concepts of recall and the true positive rate?
Recall and the true positive rate are identical. Here's the formula:
Recall = (True positive)/(True positive + False negative)
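The formula can be computed directly from label/prediction pairs (the vectors below are hypothetical):

```python
# Recall (= true positive rate) from binary labels and predictions.
def recall(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp / (tp + fn)

actual    = [1, 1, 0, 1, 0, 1]
predicted = [1, 0, 0, 1, 1, 1]
print(recall(actual, predicted))  # 3 of the 4 actual positives were found
```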
28. What are the ideal situations in which t-test or z-test can be used?
As standard practice, a t-test is used when the sample size is less than 30, while a z-test is considered when the sample size exceeds 30, in most cases.
29. Why is Naive Bayes called ‘naive’?
It is called 'naive' because it makes the general assumption that all the features present are equally important and independent of each other. This is rarely true and won't hold in a real-world scenario.
30. What is the simple difference between standardized and unstandardized coefficients?
Standardized coefficients are interpreted in terms of their standard deviation values, while unstandardized coefficients are measured in the actual units of the values present in the dataset.
31. How are outliers detected?
Multiple methodologies can be used for detecting outliers, but the two most commonly used methods are as follows:
- Standard deviation method: Here, a value is considered an outlier if it is lower or higher than three standard deviations from the mean value.
- Box plot method: Here, a value is considered an outlier if it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.
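Both rules can be sketched in plain Python on an invented sample in which one value is clearly anomalous:

```python
# The two common outlier-detection rules on a toy dataset.
from statistics import mean, stdev, quantiles

data = [10, 11, 12, 13] * 5 + [95]   # 95 is the obvious outlier

# 1. Standard deviation method: outside mean +/- 3 standard deviations.
m, s = mean(data), stdev(data)
sd_outliers = [x for x in data if abs(x - m) > 3 * s]

# 2. Box plot (IQR) method: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
iqr_outliers = [x for x in data
                if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(sd_outliers, iqr_outliers)
```

One caveat worth mentioning in an interview: in small samples, an extreme value inflates the standard deviation itself, so the 3-sigma rule can miss outliers that the IQR rule still catches.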
32. Why is KNN preferred when determining missing numbers in data?
K-Nearest Neighbour (KNN) is preferred here because KNN can easily approximate the value to be determined based on the values closest to it.
33. How can one handle suspicious or missing data in a dataset while performing analysis?
If there are any discrepancies in data, a user can go on to use any of the following methods:
- Creation of a validation report with details about the data in discussion
- Escalating the same to an experienced Data Analyst to look at it and take a call
- Replacing the invalid data with corresponding valid and up-to-date data
- Using many strategies together to find missing values and using approximation if needed
34. What is the simple difference between Principal Component Analysis (PCA) and Factor Analysis (FA)?
Among many differences, the major difference between PCA and FA lies in the fact that factor analysis is used to specify and work with the covariance (the variance shared) between variables, while the aim of PCA is to explain the total variance present in the existing components or variables.
35. How is it beneficial to make use of version control?
There are numerous benefits of using version control as shown below:
- Establishes an easy way to compare files, identify differences, and merge if any changes are done
- Creates an easy way to track the life cycle of an application build, including every stage in it such as development, production, testing, etc.
- Brings about a good way to establish a collaborative work culture
- Ensures that every version and variant of code is kept safe and secure
36. What are the future trends in Data Analysis?
With this question, the interviewer is trying to assess your grip on the subject and your research in the field. Make sure to state valid facts and respective validation for sources to add positivity to your candidature. Also, try to explain how Artificial Intelligence is making a huge impact on data analysis and its potential in the same.
37. Why are you applying for the Data Analyst role in our company?
Here, the interviewer is trying to see how well you can convince them regarding your proficiency in the subject, alongside the need for data analysis at the firm you’ve applied for. It is always an added advantage to know the job description in detail, along with the compensation and the details of the company.
38. Can you rate yourself on a scale of 1–10 depending on your proficiency in Data Analysis?
With this question, the interviewer is trying to grasp your understanding of the subject, your confidence, and your spontaneity. The most important thing to note here is that you answer honestly based on your capacity.
39. Has your college degree helped you with Data Analysis in any way?
This is a question that relates to the latest program you completed in college. Do talk about the degree you have obtained, how it was useful, and how you plan on putting it to full use in the coming days after being recruited in the company.
40. What is your plan after joining this Data Analyst role?
While answering this question, make sure to keep your explanation concise on how you would bring about a plan that works with the company setup and how you would implement the plan, ensuring that it works by performing validation testing on the same. Do highlight how it can be made better in the coming days with further iterations.
41. What are the disadvantages of Data Analytics?
Compared to the plethora of advantages, Data Analytics has relatively few disadvantages. Some of them are listed below:
- Data Analytics can cause a breach in customer privacy and their information such as transactions, purchases, and subscriptions.
- Some of the tools are complex and require prior training.
- It takes a lot of skills and expertise to select the right analytics tool every time.
42. What skills should a successful Data Analyst possess?
This is a descriptive question that is highly dependent on how analytical your thinking skills are. There are a variety of tools that a Data Analyst must have expertise in. Programming languages such as Python, R, and SAS, probability, statistics, regression, correlation, and more are the primary skills that a Data Analyst should possess.
43. Why do you think you are the right fit for this Data Analyst role?
With this question, the interviewer is trying to gauge your understanding of the job description and where you’re coming from, with respect to your knowledge in Data Analysis. Be sure to answer this in a concise yet detailed manner by explaining your interests, goals, and visions and how these match with the company substructure.
44. Can you please talk about your past Data Analysis work?
This is a very commonly asked question in a data analysis interview. The interviewer will be assessing you for your clarity in communication, actionable insights from your work experience, your debating skills if questioned on the topics, and how thoughtful you are in your analytical skills.
45. Can you please explain how you would estimate the number of visitors to the Taj Mahal in November 2019?
This is a classic behavioral question. This is to check your thought process without making use of computers or any sort of datasets. You can begin your answer using the below template:
‘First, I would gather some data. To start with, I’d like to find out the population of Agra, where the Taj Mahal is located. The next thing I would take a look at is the number of tourists that came to visit the site during that time. This is followed by the average length of their stay that can be further analyzed by considering factors such as age, gender, and income, and the number of vacation days and bank holidays there are in India. I would also go about analyzing any sort of data available from the local tourist offices.’
46. Do you have any experience working in the same industry as ours before?
This is a very straightforward question. This aims to assess if you have the industry-specific skills that are needed for the current role. Even if you do not possess all of the skills, make sure to thoroughly explain how you can still make use of the skills you’ve obtained in the past to benefit the company.
47. Have you earned any sort of certifications to boost your opportunities as a Data Analyst aspirant?
As always, interviewers look for candidates who are serious about advancing their career options by making use of additional tools like certifications. Certificates are strong proof that you have put in all efforts to learn new skills, master them, and put them into use at the best of your capacity. List the certifications, if you have any, and do talk about them in brief, explaining what all you learned from the program and how it’s been helpful to you so far.
48. What tools do you prefer to use in the various phases of Data Analysis?
This is again a question to check what tools you think are useful for their respective tasks. Do talk about how comfortable you are with the tools you mention and about their popularity in the market today.
49. Which step of a Data Analysis project do you like the most?
Do know that it is completely normal to have a predilection toward certain tools and tasks over the others. However, while performing data analysis, you will always be expected to deal with the entirety of the analytics life cycle, so make sure not to speak negatively about any of the tools or of the steps in the process of data analysis.
50. How good are you in terms of explaining technical content to a non-technical audience with respect to Data Analysis?
This is another classic question asked in most of the Data Analytics interviews. Here, it is extremely vital that you talk about your communication skills in terms of delivering the technical content, your level of patience, and your ability to break content into smaller chunks to help the audience understand better.
It is always advantageous to show the interviewer that you are very well capable of working effectively with people from a variety of backgrounds who may or may not be technical.