JNTUK B.Tech CSE 3-1 (R23) DWDM Unit Wise 2 Marks Important Questions and Answers

The following are the top unit-wise important 2-mark questions and answers for JNTUK B.Tech CSE 3-1 (R23) DWDM. Preparing these questions can help you score good marks in your external semester exams.

UNIT – I: Data Warehousing and Online Analytical Processing

1. What is a Data Warehouse?
A Data Warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data that helps support decision-making processes in an organization.

2. Define OLAP.
OLAP (Online Analytical Processing) refers to techniques that allow users to interactively analyze multidimensional data from multiple perspectives for decision support.

3. What is a Data Cube?
A Data Cube allows data to be modeled and viewed in multiple dimensions. Each dimension corresponds to an attribute, and each cell contains aggregated measures such as count or sum.
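
To make the idea concrete, here is a minimal Python sketch (not from the syllabus material) of a tiny two-dimensional cube over hypothetical sales facts, with sum as the aggregated measure:

```python
from collections import defaultdict

# Hypothetical fact rows: (item, region, sales).
facts = [
    ("pen", "east", 10), ("pen", "west", 5),
    ("book", "east", 7), ("book", "west", 12),
]

# Fill every cell of the 2-D cube, plus the roll-ups
# over each dimension ("*" stands for "all values").
cube = defaultdict(int)
for item, region, sales in facts:
    cube[(item, region)] += sales   # base cell
    cube[(item, "*")] += sales      # roll up over region
    cube[("*", region)] += sales    # roll up over item
    cube[("*", "*")] += sales       # grand total

print(cube[("pen", "east")], cube[("pen", "*")], cube[("*", "*")])  # 10 15 34
```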

4. Differentiate between Fact Table and Dimension Table.
A Fact Table contains measurable, quantitative data (facts), whereas Dimension Tables store descriptive attributes (dimensions) related to the facts, used for filtering and grouping.

5. What are the types of data attributes?
Data attributes can be of four types:

  • Nominal – categories without order
  • Ordinal – categories with order
  • Interval – numerical, no true zero
  • Ratio – numerical with a true zero

6. What are proximity measures?
Proximity measures determine how similar or dissimilar two data objects are. Examples include Euclidean distance, Manhattan distance, and cosine similarity.
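
For illustration, a minimal Python sketch of the three measures named above (the vectors x and y are made-up examples):

```python
import math

def euclidean(p, q):
    # Straight-line distance: square root of the sum of squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # City-block distance: sum of absolute differences per dimension.
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_similarity(p, q):
    # Cosine of the angle between the vectors (1.0 = same direction).
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

x, y = (1, 2, 3), (4, 5, 6)
print(euclidean(x, y), manhattan(x, y), cosine_similarity(x, y))
```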

7. What are summary statistics?
Summary statistics are numerical measures such as mean, median, variance, and standard deviation that describe and summarize the main characteristics of a dataset.
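
A small worked example using Python's standard statistics module (the data values are hypothetical):

```python
import statistics

data = [4, 8, 15, 16, 23, 42]  # hypothetical sample

print("mean:", statistics.mean(data))
print("median:", statistics.median(data))
print("variance:", statistics.variance(data))  # sample variance (divides by n - 1)
print("std dev:", statistics.stdev(data))
```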

8. What are the steps in data warehouse implementation?
The major steps are:

  1. Data extraction
  2. Data cleaning
  3. Data transformation
  4. Data loading
  5. Indexing and OLAP cube creation

UNIT – II: Data Preprocessing

1. What is Data Preprocessing?
Data preprocessing is the process of cleaning, transforming, and organizing raw data into a suitable format for mining. It improves data quality and mining accuracy.

2. What is Data Cleaning?
Data cleaning handles noisy, missing, and inconsistent data by techniques such as filling in missing values, smoothing noisy data, identifying or removing outliers, and correcting inconsistencies.

3. What are common methods for handling missing data?

  • Ignore the tuple (if class label is missing)
  • Fill manually
  • Use attribute mean or mode
  • Use global constant
  • Predict the missing value using machine learning (a simple imputation sketch follows this list)
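
A minimal sketch of the attribute-mean strategy, assuming missing values are represented as None:

```python
import statistics

ages = [25, 30, None, 22, None, 28]  # hypothetical attribute with missing values

# Attribute-mean imputation: replace each missing value with
# the mean of the values that are present.
observed = [a for a in ages if a is not None]
mean_age = statistics.mean(observed)
filled = [a if a is not None else mean_age for a in ages]
print(filled)  # [25, 30, 26.25, 22, 26.25, 28]
```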

4. What is Data Integration?
Data integration combines data from multiple sources (databases, files, web) into a coherent data store such as a data warehouse.

5. What is Data Transformation?
Data transformation involves converting data into appropriate forms for mining, such as normalization, aggregation, generalization, and attribute construction.

6. What is Data Reduction?
Data reduction reduces the volume of data but produces the same or similar analytical results. Techniques include dimensionality reduction, numerosity reduction, and data compression.

7. What is Data Normalization?
Normalization scales numerical data into a specific range, typically [0, 1], to make attributes comparable. Common methods (a sketch follows the list):

  • Min–Max normalization
  • Z-score normalization
  • Decimal scaling
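
A short Python sketch of all three methods over hypothetical data (illustrative only, not a definitive implementation):

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    # Min-max: linearly rescale values into [new_min, new_max].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    # Z-score: centre on the mean and scale by the standard deviation.
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    # Decimal scaling: divide by 10^j, the smallest power
    # of 10 that brings every value into (-1, 1).
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

data = [200, 300, 400, 600, 1000]
print(min_max(data))
print(z_score(data))
print(decimal_scaling(data))
```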

8. What is Data Discretization?
Data discretization converts continuous attributes into discrete intervals or bins to simplify data and improve model efficiency. Methods include binning, histogram analysis, and decision tree approaches.
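
As an example, a minimal equal-width binning sketch in Python (the prices list is hypothetical):

```python
def equal_width_bins(values, k):
    # Split the value range into k equal-width intervals and
    # return the bin index (0..k-1) for each value.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # min(..., k - 1) clamps the maximum value into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(prices, 3))  # bins [4,14), [14,24), [24,34]
```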

UNIT – III: Data Mining – Classification and Prediction

1. What is Classification?
Classification is a supervised learning technique that assigns data items to predefined classes using a model built from labeled training data.

2. What is Prediction?
Prediction is a data mining task that models continuous-valued functions and predicts future or unknown values based on available data.

3. What are the major steps in a classification process?

  1. Data collection and preprocessing
  2. Model construction using training data
  3. Model evaluation using test data
  4. Model usage for classification of new data

4. What is a Decision Tree Classifier?
A decision tree classifier uses a tree-like model of decisions and their outcomes to classify data. Each internal node tests an attribute, each branch represents a decision, and each leaf represents a class.

5. What is Information Gain?
Information Gain measures the expected reduction in entropy after splitting a dataset based on an attribute. Attributes with high information gain are chosen for decision tree splits.
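
A minimal Python sketch of entropy and information gain for a categorical attribute, over a made-up toy dataset:

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum(p_i * log2(p_i)) over the class proportions p_i.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    # Gain(attr) = H(target) - weighted average of H(target) within each attr value.
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Made-up toy data: does "outlook" predict "play"?
data = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "overcast", "play": "yes"},
    {"outlook": "rain", "play": "yes"},
]
print(information_gain(data, "outlook", "play"))  # 1.0 (a perfect split)
```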

6. What is a Confusion Matrix?
A confusion matrix is a table that summarizes classification performance by showing counts of TP (True Positive), FP, TN, and FN predictions.

7. What is Overfitting in classification?
Overfitting occurs when a model fits the training data too closely, capturing noise instead of general patterns, resulting in poor performance on unseen data.

8. What are common methods of model evaluation?

  • Hold-out method
  • Cross-validation
  • Bootstrap sampling
  • Evaluation metrics: accuracy, precision, recall, F1-score, and ROC curves (computed from confusion-matrix counts, as sketched below)
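
A small sketch showing how the common metrics follow from the four confusion-matrix counts (the counts below are illustrative):

```python
def metrics(tp, fp, tn, fn):
    # Standard metrics derived from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(metrics(tp=40, fp=10, tn=45, fn=5))  # (0.85, 0.8, ~0.89, ~0.84)
```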

UNIT – IV: Association Analysis

1. What is Association Rule Mining?
Association rule mining is a data mining technique used to find interesting associations, correlations, or relationships among large sets of data items.

2. Define Support and Confidence.

  • Support measures how frequently an itemset occurs in the dataset.
  • Confidence measures how often items in Y appear in transactions that contain X, i.e., the conditional probability of Y given X for a rule X → Y (see the sketch below).
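
A minimal sketch of both measures over a hypothetical transaction set:

```python
transactions = [  # hypothetical market-basket data
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    # conf(X -> Y) = support(X union Y) / support(X).
    return support(X | Y) / support(X)

print(support({"bread", "milk"}))       # 2/4 = 0.5
print(confidence({"bread"}, {"milk"}))  # 2/3, about 0.67
```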

3. What is the Apriori Principle?
The Apriori Principle states that if an itemset is frequent, then all of its non-empty subsets must also be frequent. This property is used to reduce the search space.

4. What is Candidate Generation in Apriori?
Candidate generation is the process of producing (k+1)-itemset candidates from frequent k-itemsets and pruning those whose subsets are not frequent.
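
A simplified sketch in Python, assuming frequent itemsets are stored as sorted tuples. The join here merges any two k-itemsets sharing k-1 items; the textbook join on the first k-1 items is an optimized form of the same idea:

```python
from itertools import combinations

def generate_candidates(frequent_k):
    # Join: merge two frequent k-itemsets that share k-1 items,
    # then prune candidates having any infrequent k-subset.
    frequent = set(frequent_k)
    k = len(frequent_k[0])
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = tuple(sorted(set(a) | set(b)))
            if len(union) == k + 1:
                # Prune step: every k-subset must itself be frequent.
                if all(sub in frequent for sub in combinations(union, k)):
                    candidates.add(union)
    return candidates

L2 = [("bread", "butter"), ("bread", "milk"), ("butter", "milk")]
print(generate_candidates(L2))  # {('bread', 'butter', 'milk')}
```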

5. What is Confidence-Based Pruning?
In confidence-based pruning, rules that do not meet the minimum confidence threshold are eliminated, keeping only strong and interesting rules.

6. What is Compact Representation of frequent itemsets?
Compact representation stores fewer itemsets by using closed or maximal frequent itemsets, thereby reducing redundancy in the rule set.

7. Mention two advantages of FP-Growth over Apriori.

  • FP-Growth avoids candidate generation.
  • It requires fewer database scans by building an FP-tree structure, making it more efficient.

8. Define Frequent Itemset.
A frequent itemset is a set of items that appears together in transactions with a frequency greater than or equal to the minimum support threshold.

UNIT – V: Cluster Analysis

1. What is Clustering?
Clustering is the process of grouping a set of data objects into clusters so that objects within the same cluster are more similar to each other than to those in other clusters.

2. List the types of clustering methods.
The main types are:

  • Partitioning methods
  • Hierarchical methods
  • Density-based methods
  • Grid-based methods
  • Model-based methods

3. What are the different types of clusters?
Common types of clusters include:

  • Well-separated clusters
  • Prototype-based clusters
  • Density-based clusters
  • Graph-based clusters
  • Shared-property clusters

4. What is K-means algorithm?
K-means is a partitioning algorithm that divides a dataset into k clusters by iteratively assigning points to the nearest centroid and updating centroids to minimize intra-cluster variance.
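
A compact pure-Python K-means sketch (random initialization and a fixed iteration count; the points are hypothetical):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)  # initialize with k random points
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:  # guard against an empty cluster (see Q5 below)
                centroids[i] = tuple(sum(d) / len(cl) for d in zip(*cl))
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (0.5, 1.5), (8, 8), (9, 9)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)
```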

5. List additional issues in K-means.

  • Choosing the number of clusters (k)
  • Sensitivity to outliers
  • Empty clusters during iterations
  • Proper initialization of centroids

6. What is Bisecting K-means?
Bisecting K-means is a variant of K-means that repeatedly splits a cluster into two using K-means, building a hierarchical structure and often improving clustering quality.

7. What is Agglomerative Hierarchical Clustering?
It is a bottom-up approach where each object starts as its own cluster, and clusters are merged step by step based on similarity until one cluster remains.
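
A minimal single-linkage sketch (stopping at a target number of clusters rather than one, purely for illustration):

```python
import math

def agglomerative(points, target=1):
    # Bottom-up: every point starts as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > target:
        # Single linkage: find the pair of clusters whose
        # closest members are nearest, and merge them.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (6, 5)]
print(agglomerative(pts, target=2))  # [[(0, 0), (0, 1)], [(5, 5), (6, 5)]]
```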

8. What is DBSCAN?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based algorithm that groups together closely packed points and identifies outliers as noise.
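
A simplified pure-Python DBSCAN sketch, where eps is the neighbourhood radius and min_pts the density threshold (points are tuples; the data below is made up):

```python
import math

def dbscan(points, eps, min_pts):
    labels = {}  # point -> cluster id; -1 marks noise
    cluster = 0
    def neighbors(p):
        return [q for q in points if math.dist(p, q) <= eps]
    for p in points:
        if p in labels:
            continue
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            labels[p] = -1  # noise (may still become a border point later)
            continue
        cluster += 1  # p is a core point: start a new cluster
        labels[p] = cluster
        seeds = list(nbrs)
        while seeds:
            q = seeds.pop()
            if labels.get(q) == -1:
                labels[q] = cluster  # reachable noise point becomes a border point
            if q in labels:
                continue
            labels[q] = cluster
            q_nbrs = neighbors(q)
            if len(q_nbrs) >= min_pts:  # q is also a core point: keep expanding
                seeds.extend(q_nbrs)
    return labels

pts = [(1, 1), (1.2, 1.1), (1.1, 0.9), (5, 5), (5.1, 5.2), (4.9, 5.1), (9, 9)]
print(dbscan(pts, eps=0.5, min_pts=3))  # two clusters; (9, 9) labelled -1 (noise)
```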
