# CMPF104: Data Cleaning and Preprocessing: Data Science and Data Analytics: Programming For Foundation In Engineering, Assignment, UNITEN, Malaysia

University: Universiti Tenaga Nasional (UNITEN)

Subject: CMPF104: Programming For Foundation In Engineering

## Data Science and Data Analytics

Download the dataset from BRIGHTEN. If your student ID ends with an odd number, select the Concrete_Data_A dataset; if it ends with an even number, select the Concrete_Data_B dataset. Use Python attributes, functions and libraries to solve the following problems.

### a) Data Cleaning and Preprocessing:

• Use Pandas to load the dataset. Name the dataframe concrete_df_XXX.
• Remove the 'Number' column using the .drop() function and display the first ten (10) rows of the data.
• Check for missing values using functions like .info() or .isnull().sum(), then handle them by dropping or replacing the empty cells.
• Convert the dataframe to a NumPy array using the .to_numpy() function.
• Split the data into training and test sets of 80% and 20%, respectively. Name the sets train_data_XXX and test_data_XXX.
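
The cleaning steps above can be sketched as follows. This is a minimal illustration, not a model answer: the inline CSV text, its column names (`Cement`, `Water`, `Age`, `Strength`) and the file name in the comment are assumptions standing in for the real BRIGHTEN file, and filling gaps with the column median is only one of the two handling options the task allows.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the downloaded file; real column names may differ.
csv_text = """Number,Cement,Water,Age,Strength
1,540.0,162.0,28,79.99
2,540.0,162.0,28,61.89
3,332.5,,270,40.27
4,332.5,228.0,365,41.05
5,198.6,192.0,360,
6,266.0,228.0,90,47.03
7,380.0,228.0,365,43.70
8,380.0,228.0,28,36.45
9,266.0,228.0,28,45.85
10,475.0,228.0,28,39.29
"""

# With the real file this would be something like
# pd.read_csv("Concrete_Data_A.csv") (file name assumed).
concrete_df_XXX = pd.read_csv(io.StringIO(csv_text))

# Remove the 'Number' column and show the first ten rows.
concrete_df_XXX = concrete_df_XXX.drop(columns=["Number"])
print(concrete_df_XXX.head(10))

# Count missing values per column, then replace them with the column median.
# (concrete_df_XXX.dropna() would drop the incomplete rows instead.)
print(concrete_df_XXX.isnull().sum())
concrete_df_XXX = concrete_df_XXX.fillna(concrete_df_XXX.median())

# Convert the dataframe to a NumPy array.
data_array = concrete_df_XXX.to_numpy()

# Shuffle the row indices with a fixed seed, then split 80% / 20%.
rng = np.random.default_rng(42)
shuffled = rng.permutation(len(data_array))
split_at = int(0.8 * len(data_array))
train_data_XXX = data_array[shuffled[:split_at]]
test_data_XXX = data_array[shuffled[split_at:]]
print(train_data_XXX.shape, test_data_XXX.shape)
```

Shuffling before the split is a design choice rather than a requirement of the task; a plain slice of the first 80% of rows would also satisfy the wording, but it can bias the split if the file is sorted.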

### b) Data Analysis:

• Calculate the correlation between the variables in the dataframe.
• Utilize NumPy and Pandas to calculate summary statistics of the data, such as the maximum, minimum, standard deviation, average, median and mode of each category.
• Use Pandas functions like .describe() for an overview of summary statistics and apply NumPy functions for specific calculations.
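
One possible way to approach the analysis tasks is sketched below. The small dataframe and its column names are hypothetical placeholders for the cleaned dataset. NumPy itself has no mode function, so the sketch borrows the mode from Pandas; that substitution is an assumption about what the task accepts.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned data; the real dataset's columns may differ.
concrete_df_XXX = pd.DataFrame({
    "Cement": [540.0, 540.0, 332.5, 332.5, 198.6, 266.0],
    "Water": [162.0, 162.0, 228.0, 228.0, 192.0, 228.0],
    "Strength": [79.99, 61.89, 40.27, 41.05, 44.30, 47.03],
})

# Pairwise Pearson correlation between the columns.
correlation = concrete_df_XXX.corr()
print(correlation)

# Overview: count, mean, std, min, quartiles and max for each column.
print(concrete_df_XXX.describe())

# The specific statistics, computed column by column with NumPy.
values = concrete_df_XXX.to_numpy()
summary = pd.DataFrame(
    {
        "max": np.max(values, axis=0),
        "min": np.min(values, axis=0),
        "std": np.std(values, axis=0, ddof=1),  # ddof=1 matches pandas' sample std
        "mean": np.mean(values, axis=0),
        "median": np.median(values, axis=0),
    },
    index=concrete_df_XXX.columns,
)

# A column can have several modes; keep the first (smallest) one pandas reports.
summary["mode"] = concrete_df_XXX.mode().iloc[0]
print(summary)
```

Note that `np.std` defaults to the population standard deviation (`ddof=0`), while `.describe()` reports the sample version, so the two only agree when `ddof=1` is passed as above.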

### c) Visualization:

• Use Matplotlib to create visualizations such as line plots for train and test data across all categories.
• Generate histogram plots and box plots for all variables.
• Ensure that the visualizations are clear, informative, and aesthetically pleasing.
• Customize your plots by adding titles, labels and legends.
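
The plotting tasks might be laid out along these lines. The arrays here are synthetic random data and the column names are hypothetical, used only so the sketch runs without the real dataset; the figure sizes, bin count and titles are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical column names and synthetic arrays standing in for the real split.
columns = ["Cement", "Water", "Strength"]
rng = np.random.default_rng(0)
means = [300.0, 180.0, 35.0]
spreads = [100.0, 20.0, 15.0]
train_data_XXX = rng.normal(loc=means, scale=spreads, size=(80, 3))
test_data_XXX = rng.normal(loc=means, scale=spreads, size=(20, 3))

# Line plots: one panel per column, train and test drawn on the same axes.
fig_lines, line_axes = plt.subplots(len(columns), 1, figsize=(8, 9))
for i, (ax, name) in enumerate(zip(line_axes, columns)):
    ax.plot(train_data_XXX[:, i], label="Train")
    ax.plot(test_data_XXX[:, i], label="Test")
    ax.set_title(f"{name}: train vs. test")
    ax.set_xlabel("Sample index")
    ax.set_ylabel(name)
    ax.legend()
fig_lines.tight_layout()

# Histograms: one panel per column, using train and test rows together.
all_data = np.vstack([train_data_XXX, test_data_XXX])
fig_hist, hist_axes = plt.subplots(1, len(columns), figsize=(12, 4))
for i, (ax, name) in enumerate(zip(hist_axes, columns)):
    ax.hist(all_data[:, i], bins=15, edgecolor="black")
    ax.set_title(f"Histogram of {name}")
    ax.set_xlabel(name)
    ax.set_ylabel("Frequency")
fig_hist.tight_layout()

# Box plots: one box per column on a shared axis.
fig_box, ax_box = plt.subplots(figsize=(8, 4))
ax_box.boxplot(all_data)
ax_box.set_xticks(range(1, len(columns) + 1))
ax_box.set_xticklabels(columns)
ax_box.set_title("Box plots of all variables")
ax_box.set_ylabel("Value")
fig_box.tight_layout()
```

Each figure can be written to disk with `savefig` (for example `fig_box.savefig("box_plots.png")`, file name assumed). When the columns have very different scales, as in this synthetic data, separate box plot panels per column may read better than a shared axis.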

