Solutions for Informatics Practices, Class 12, CBSE
Assertion (A): Pandas is a Python library.
Reasoning (R): Pandas is a powerful, flexible and easy to use open source data analysis library.
Both A and R are true and R is the correct explanation of A.
Explanation
Pandas is a Python library that makes data analysis easy and effective. It is the most famous Python library for data science, offering powerful and flexible data structures that facilitate data analysis and manipulation. As an open-source library, Pandas provides high-performance, easy-to-use data structures and data analysis tools.
Assertion (A): A series stores data row-wise.
Reasoning (R): A series is a one-dimensional labelled data structure.
A is false but R is true.
Explanation
A Series in pandas does not store data row-wise. It is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.
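As a small sketch (with hypothetical values), a Series pairs a one-dimensional array of data with an index of labels:

```python
import pandas as pd

# A Series pairs one-dimensional data with an index of labels.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])              # access a value by its label
print(s.index.tolist())    # the labels form the index
```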
Assertion (A): Dataframe has both a row and column index.
Reasoning (R): Dataframe is a two-dimensional labelled data structure like a table of MySQL.
Both A and R are true and R is the correct explanation of A.
Explanation
A DataFrame in Pandas has both a row index and a column index. It is a two-dimensional labelled data structure, similar to a table in MySQL, where each value is identifiable by the combination of its row and column indices.
Assertion (A): While creating a series using scalar values, index must be provided.
Reasoning (R): The scalar value is repeated to match the length of index.
Both A and R are true and R is the correct explanation of A.
Explanation
When creating a Series using scalar values, we must provide an index, because a Series requires an index to associate with each value. The scalar value is repeated to match the length of the index.
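A minimal sketch of this behaviour (the scalar 7 and the index labels are illustrative):

```python
import pandas as pd

# The scalar value 7 is repeated once for each index label.
s = pd.Series(7, index=['x', 'y', 'z'])
print(s)
```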
Assertion (A): The shape attribute returns the number of rows and number of columns available in dataframe.
Reasoning (R): The shape attribute returns the values in the form of list.
A is true but R is false.
Explanation
The shape attribute of a Pandas DataFrame returns the number of rows and the number of columns in the DataFrame. The shape attribute returns the values in the form of a tuple, not a list.
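A quick sketch with an illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# shape is a tuple: (number of rows, number of columns)
rows, cols = df.shape
print(df.shape)
```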
Assertion (A): After running the following code:
df = pd.DataFrame([11,46], index = ['True', 'False'])
print(df[True])
A key error will be produced.
Reasoning (R): Dataframe does not support Boolean Indexing.
A is true but R is false.
Explanation
DataFrames do support Boolean indexing, which allows selecting rows based on a Boolean condition. The code df[True] tries to access a column named True, which does not exist in the DataFrame. The index of the DataFrame is ['True', 'False']. To access the row where the index is 'True', we would use df.loc['True']. This is an example of label-based indexing, where we select a row based on its index label.
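The situation can be reproduced directly from the question's code:

```python
import pandas as pd

df = pd.DataFrame([11, 46], index=['True', 'False'])

# df[True] looks for a COLUMN named True, which does not exist.
try:
    df[True]
    raised = False
except KeyError:
    raised = True

# Label-based access works, since 'True' is a row index label.
row = df.loc['True']
```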
Assertion (A): We can add a new column in an existing dataframe using .at or .loc methods.
Reasoning (R): When we reassign new values to an existing column in a dataframe, the previous values are overwritten.
Both A and R are true but R is not the correct explanation of A.
Explanation
We can add a new column in an existing dataframe using the .at or .loc methods. When we reassign new values to an existing column in a dataframe, the previous values are overwritten.
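Both points can be sketched briefly (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2]})

# Adding a new column with .loc: all rows, a new column label.
df.loc[:, 'B'] = [10, 20]

# Reassigning an existing column overwrites its previous values.
df['A'] = [100, 200]
print(df)
```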
Assertion (A): When a dataframe is created using dictionary, the keys will be the columns and corresponding values will be the rows of the dataframe.
Reasoning (R): NaN values are automatically filled for values of non-matching keys while creating a dataframe using dictionary.
Both A and R are true and R is the correct explanation of A.
Explanation
When a dataframe is created using a dictionary, the keys of the dictionary become the columns of the dataframe, and the values of the dictionary become the rows of the dataframe. If the dictionaries used to create the dataframe have non-matching keys, NaN values will be automatically filled in for the missing values.
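A short sketch using a dictionary of dictionaries (the keys c1, c2, r1, r2 are hypothetical) shows the NaN filling for non-matching keys:

```python
import pandas as pd

# Outer keys become columns; inner keys become row labels.
# 'c2' has no value for 'r2', so NaN is filled in automatically.
data = {'c1': {'r1': 1, 'r2': 2}, 'c2': {'r1': 3}}
df = pd.DataFrame(data)
print(df)
```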
Inayra is writing a program using Pandas library for creating a dataframe from two dictionaries. Given below is the snippet she has developed. Help her to complete it by selecting the correct option given as under:
import ............... as pd # Statement 1
dict = {'x': [10,25], 'y' : [32,93] }
dict1 = {'x': [14,56], 'y': [36,72] }
df = pd.DataFrame (dict)
df1 = pd. ............... (dict1) # Statement 2
df2 = pd. ............... ([df, df1]) # Statement 3
1. Choose the correct option from the following for Statement 1.
(a) pd
(b) data
(c) df
(d) pandas
2. Which of the following option should be taken for Statement 2?
(a) Series
(b) Dataframe
(c) DataFrame
(d) Dictionary
3. Select the correct method from the following for Statement 3.
(a) concat()
(b) shape()
(c) index()
(d) append()
1. pandas
Reason — The pandas library is imported as pd using the statement import pandas as pd.
2. DataFrame
Reason — The pd.DataFrame() function is used to create a DataFrame from a dictionary.
3. concat()
Reason — The pd.concat() function is used to concatenate two or more DataFrames into a single DataFrame. In this case, it is used to concatenate df and df1 into a single DataFrame df2.
Dataframe
Reason — A DataFrame object is a 2D labelled array-like Pandas data structure that stores an ordered collection of columns, which can hold heterogeneous data types.
Dataframe is immutable.
Reason — DataFrame is value-mutable, size-mutable and stores heterogeneous data types.
All of these
Reason — We can create a DataFrame object by passing data in many different ways, such as two-dimensional dictionaries (i.e., dictionaries having lists or dictionaries or ndarrays or series objects etc), two-dimensional ndarrays, series type object and another DataFrame object.
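A brief sketch of a few of these creation routes (the data values are illustrative):

```python
import pandas as pd
import numpy as np

# From a two-dimensional ndarray
df1 = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['A', 'B'])

# From a dictionary of Series objects
df2 = pd.DataFrame({'A': pd.Series([1, 3]), 'B': pd.Series([2, 4])})

# From another DataFrame object
df3 = pd.DataFrame(df1)
```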
print(SHOP[SHOP.City == 'Delhi'])
Reason — The correct code print(SHOP[SHOP.City == 'Delhi']) filters the SHOP DataFrame to show only the rows where the City column is equal to 'Delhi'. It creates a boolean mask SHOP.City == 'Delhi' that returns True for rows where the city is 'Delhi' and False otherwise, and then uses this mask to select the corresponding rows from the original DataFrame with SHOP[]. The resulting DataFrame, containing only the rows that match the condition, is then printed to the console.
The following statement will ............... .
df = df.drop(['Name', 'Class', 'Rollno'], axis = 1) #df is a DataFrame object
delete three columns having labels 'Name', 'Class' and 'Rollno'
Reason — The drop() function is used to remove rows or columns from a DataFrame. Here, the axis=1 parameter specifies that columns are to be dropped. The list ['Name', 'Class', 'Rollno'] contains the labels of the columns to be dropped. Therefore, the statement deletes the three columns labelled 'Name', 'Class' and 'Rollno' from the DataFrame df.
The significance of Python Pandas library is as follows:
CSV (Comma Separated Values) files are delimited files that store tabular data (data stored in rows and columns as we see in spreadsheets or databases) where comma delimits every value. Each line in a CSV file is a data record. Each record consists of one or more fields, separated by commas (or the chosen delimiter).
The steps to create CSV files are as follows:
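One common way to create a CSV file from pandas is the to_csv() method; a minimal sketch (the file name marks.csv and its contents are assumptions):

```python
import pandas as pd

# Build a small DataFrame and write it out as a CSV file.
df = pd.DataFrame({'Name': ['A', 'B'], 'Marks': [90, 80]})
df.to_csv('marks.csv', index=False)   # index=False omits the row index

# Reading it back recovers the same tabular data.
back = pd.read_csv('marks.csv')
```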
Iterating over rows in the DataFrame:
The iterrows() method is used to iterate over each row in the DataFrame. In this method, each horizontal subset is in the form of (row-index, series), where the series contains all column values for that row-index.
For example :
import pandas as pd
total_sales = {2015 : {'Qtr1' : 34500, 'Qtr2' : 45000},
2016 : {'Qtr1' : 44500, 'Qtr2' : 65000}}
df = pd.DataFrame(total_sales)
for (row, rowseries) in df.iterrows():
print("RowIndex :", row)
print('Containing :')
print(rowseries)
RowIndex : Qtr1
Containing :
2015 34500
2016 44500
Name: Qtr1, dtype: int64
RowIndex : Qtr2
Containing :
2015 45000
2016 65000
Name: Qtr2, dtype: int64
Iterating over columns in the DataFrame:
The iteritems() method is used to iterate over each column in the DataFrame. In this method, each vertical subset is in the form of (column-index, series), where the series contains all row values for that column-index. (In recent pandas versions, iteritems() has been renamed to items().)
For example :
import pandas as pd
total_sales = {2015 : {'Qtr1' : 34500, 'Qtr2' : 45000},
2016 : {'Qtr1' : 44500, 'Qtr2' : 65000}}
df = pd.DataFrame(total_sales)
for (col, colseries) in df.iteritems():
print("Column Index :", col)
print('Containing :')
print(colseries)
Column Index : 2015
Containing :
Qtr1 34500
Qtr2 45000
Name: 2015, dtype: int64
Column Index : 2016
Containing :
Qtr1 44500
Qtr2 65000
Name: 2016, dtype: int64
To fill all missing values in a pandas DataFrame with the previous non-missing values, we can use the fillna method with its method argument set to 'ffill'. The syntax is df.fillna(method = 'ffill', inplace = True).
Consider the following tables Item and Customer and answer the questions that follow:
Table: Item
Item_ID | ItemName | Manufacturer | Price |
---|---|---|---|
PC01 | Personal Computer | HCL India | 42000 |
LCO5 | Laptop | HP USA | 55000 |
PCO3 | Personal Computer | Dell USA | 32000 |
PC06 | Personal Computer | Zenith USA | 37000 |
LCO3 | Laptop | Dell USA | 57000 |
Table: Customer
Item_ID | CustomerName | City |
---|---|---|
LCO3 | N Roy | Delhi |
PCO3 | H Singh | Mumbai |
PC06 | R Pandey | Delhi |
LCO3 | C Sharma | Chennai |
PC01 | K Agarwal | Bengaluru |
Assume that Pandas has been imported as pd.
(a) Create a dataframe called dfI for table Item.
(b) Create a dataframe called dfC for table Customer.
(c) Perform the default join operation on item_ID using two dataframes: dfI and dfC.
(d) Perform the left join operation on item_ID using two dataframes: dfI and dfC.
(e) Perform the right join operation on Item_ID using two dataframes: dfI and dfC.
(f) Perform the default operation on Item_ID using two dataframes: dfI and dfC with the left index as true.
(g) Perform the outer join operation on item_ID using two dataframes: dfI and dfC.
(h) Create a new dataframe dfN by merging dfI and dfC with both left_index and right_index set to True.
(i) Arrange the dataframe dfN in descending order of Price.
(j) Arrange the dataframe dfN in descending order of City and Price.
(a)
dfI = pd.DataFrame({
'Item_ID': ['PC01', 'LCO5', 'PCO3', 'PC06', 'LCO3'],
'ItemName': ['Personal Computer', 'Laptop', 'Personal Computer', 'Personal Computer', 'Laptop'],
'Manufacturer': ['HCL India', 'HP USA', 'Dell USA', 'Zenith USA', 'Dell USA'],
'Price': [42000, 55000, 32000, 37000, 57000]
})
Item_ID ItemName Manufacturer Price
0 PC01 Personal Computer HCL India 42000
1 LCO5 Laptop HP USA 55000
2 PCO3 Personal Computer Dell USA 32000
3 PC06 Personal Computer Zenith USA 37000
4 LCO3 Laptop Dell USA 57000
(b)
dfC = pd.DataFrame({
'Item_ID': ['LCO3', 'PCO3', 'PC06', 'LCO3', 'PC01'],
'CustomerName': ['N Roy', 'H Singh', 'R Pandey', 'C Sharma', 'K Agarwal'],
'City': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Bengaluru']
})
Item_ID CustomerName City
0 LCO3 N Roy Delhi
1 PCO3 H Singh Mumbai
2 PC06 R Pandey Delhi
3 LCO3 C Sharma Chennai
4 PC01 K Agarwal Bengaluru
(c)
df_default_join = pd.merge(dfI, dfC, on = 'Item_ID')
Item_ID ItemName Manufacturer Price CustomerName City
0 PC01 Personal Computer HCL India 42000 K Agarwal Bengaluru
1 PCO3 Personal Computer Dell USA 32000 H Singh Mumbai
2 PC06 Personal Computer Zenith USA 37000 R Pandey Delhi
3 LCO3 Laptop Dell USA 57000 N Roy Delhi
4 LCO3 Laptop Dell USA 57000 C Sharma Chennai
(d)
df_left_join = pd.merge(dfI, dfC, left_on='Item_ID', right_on='Item_ID', how='left')
Item_ID ItemName Manufacturer Price CustomerName City
0 PC01 Personal Computer HCL India 42000 K Agarwal Bengaluru
1 LCO5 Laptop HP USA 55000 NaN NaN
2 PCO3 Personal Computer Dell USA 32000 H Singh Mumbai
3 PC06 Personal Computer Zenith USA 37000 R Pandey Delhi
4 LCO3 Laptop Dell USA 57000 N Roy Delhi
5 LCO3 Laptop Dell USA 57000 C Sharma Chennai
(e)
df_right_join = pd.merge(dfI, dfC, left_on='Item_ID', right_on='Item_ID', how='right')
Item_ID ItemName Manufacturer Price CustomerName City
0 LCO3 Laptop Dell USA 57000 N Roy Delhi
1 PCO3 Personal Computer Dell USA 32000 H Singh Mumbai
2 PC06 Personal Computer Zenith USA 37000 R Pandey Delhi
3 LCO3 Laptop Dell USA 57000 C Sharma Chennai
4 PC01 Personal Computer HCL India 42000 K Agarwal Bengaluru
(f) This operation is not possible because the left_index parameter is used for merging on the index of the dataframes, not on a specific column. The default merge operation is an inner join, and it requires a common column to merge on.
(g)
df_outer_join = pd.merge(dfI, dfC, on='Item_ID', how='outer')
Item_ID ItemName Manufacturer Price CustomerName City
0 LCO3 Laptop Dell USA 57000 N Roy Delhi
1 LCO3 Laptop Dell USA 57000 C Sharma Chennai
2 LCO5 Laptop HP USA 55000 NaN NaN
3 PC01 Personal Computer HCL India 42000 K Agarwal Bengaluru
4 PC06 Personal Computer Zenith USA 37000 R Pandey Delhi
5 PCO3 Personal Computer Dell USA 32000 H Singh Mumbai
(h)
dfN = pd.merge(dfI, dfC, left_index=True, right_index=True)
Item_ID_x ItemName Manufacturer Price Item_ID_y CustomerName City
0 PC01 Personal Computer HCL India 42000 LCO3 N Roy Delhi
1 LCO5 Laptop HP USA 55000 PCO3 H Singh Mumbai
2 PCO3 Personal Computer Dell USA 32000 PC06 R Pandey Delhi
3 PC06 Personal Computer Zenith USA 37000 LCO3 C Sharma Chennai
4 LCO3 Laptop Dell USA 57000 PC01 K Agarwal Bengaluru
(i)
dfN_sorted_price = dfN.sort_values(by='Price', ascending=False)
Item_ID_x ItemName Manufacturer Price Item_ID_y CustomerName City
4 LCO3 Laptop Dell USA 57000 PC01 K Agarwal Bengaluru
1 LCO5 Laptop HP USA 55000 PCO3 H Singh Mumbai
0 PC01 Personal Computer HCL India 42000 LCO3 N Roy Delhi
3 PC06 Personal Computer Zenith USA 37000 LCO3 C Sharma Chennai
2 PCO3 Personal Computer Dell USA 32000 PC06 R Pandey Delhi
(j)
dfN_sorted_city_price = dfN.sort_values(by=['City', 'Price'], ascending=[False, False])
Item_ID_x ItemName Manufacturer Price Item_ID_y CustomerName City
1 LCO5 Laptop HP USA 55000 PCO3 H Singh Mumbai
0 PC01 Personal Computer HCL India 42000 LCO3 N Roy Delhi
2 PCO3 Personal Computer Dell USA 32000 PC06 R Pandey Delhi
3 PC06 Personal Computer Zenith USA 37000 LCO3 C Sharma Chennai
4 LCO3 Laptop Dell USA 57000 PC01 K Agarwal Bengaluru
(a) S * 100
0 43.0271
1 61.7328
2 26.5421
3 83.6113
dtype: float64
(b) S > 0
0 True
1 True
2 True
3 True
dtype: bool
(c) S1 = pd.Series(S)
0 0.430271
1 0.617328
2 0.265421
3 0.836113
dtype: float64
(d) S3 = pd.Series(S1) + 3
0 3.430271
1 3.617328
2 3.265421
3 3.836113
dtype: float64
pencils False
notebooks False
scales False
erasers False
dtype: bool
pencils 37
notebooks 46
scales 83
erasers 42
dtype: int64
The code creates two pandas Series, S and S2, with the index Stationery, which is a list. The code then compares S and S2 element-wise using the == operator, which returns a boolean Series indicating whether each pair of values is equal. Finally, the code adds S and S2 element-wise using the + operator, which returns a new Series with the summed values, and assigns the result back to S. The resulting S Series has the same index as before, i.e., Stationery.
What will be the output produced by the following codes, considering the Series object S given in Q.13?
(a) print(S[1:4])
(b) print(S[:1])
(c) print(S[0:2])
(d) S[0:2] = 12
print(S)
(e) print(S.index)
(f) print(S.values)
The Series object 'S' is as follows:
pencils 20
notebooks 33
scales 52
erasers 10
dtype: int64
(a) print(S[1:4])
notebooks 33
scales 52
erasers 10
dtype: int64
The slice S[1:4] starts at index 1 and ends at index 3 (the end index 4 is exclusive), so it includes the three elements at indices 1 to 3.
(b) print(S[:1])
pencils 20
dtype: int64
The slice S[:1] starts at index 0, and because the end index 1 is exclusive, it includes only the element at index 0.
(c) print(S[0:2])
pencils 20
notebooks 33
dtype: int64
The slice S[0:2] starts at index 0 and, since the end index 2 is exclusive, includes the two elements at indices 0 and 1.
(d) S[0:2] = 12
print(S)
pencils 12
notebooks 12
scales 52
erasers 10
dtype: int64
The statement S[0:2] = 12 assigns the value 12 to the elements at indices 0 and 1 of Series S, directly modifying them. The updated Series is then printed.
(e) print(S.index)
Index(['pencils', 'notebooks', 'scales', 'erasers'], dtype='object')
The code print(S.index) displays the index of Series S.
(f) print(S.values)
[12 12 52 10]
The code print(S.values) displays the values of Series S.
Write a program to iterate and print a dataframe column-wise and print only first three columns.
import pandas as pd
data = {
'Name': ['Aliya', 'Hemanth', 'Charlie'],
'Age': [25, 30, 35],
'City': ['Bangalore', 'Chennai', 'Mumbai'],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
first_three_columns = df.iloc[:, :3]
print("Each column:")
for column_name in first_three_columns:
column_data = first_three_columns[column_name]
print(column_name)
print(column_data)
Each column:
Name
0 Aliya
1 Hemanth
2 Charlie
Name: Name, dtype: object
Age
0 25
1 30
2 35
Name: Age, dtype: int64
City
0 Bangalore
1 Chennai
2 Mumbai
Name: City, dtype: object
Write a program to iterate and print a dataframe row-wise at a time and print only first five rows.
import pandas as pd
data = {
'Name': ['Amruta', 'Harsh', 'Yogesh', 'Shreya', 'Zoya', 'Nyra'],
'Age': [25, 30, 35, 40, 45, 28],
'City': ['Chandigarh', 'Jaipur', 'Dehradun', 'Delhi', 'Vadodara', 'Guwahati']
}
df = pd.DataFrame(data)
first_five_rows = df.head(5)
print("Each row:")
for index, row in first_five_rows.iterrows():
print("Index:", index)
print(row)
Each row:
Index: 0
Name Amruta
Age 25
City Chandigarh
Name: 0, dtype: object
Index: 1
Name Harsh
Age 30
City Jaipur
Name: 1, dtype: object
Index: 2
Name Yogesh
Age 35
City Dehradun
Name: 2, dtype: object
Index: 3
Name Shreya
Age 40
City Delhi
Name: 3, dtype: object
Index: 4
Name Zoya
Age 45
City Vadodara
Name: 4, dtype: object
S2 = pd.Series([101, 102, 1-2, 104])
print(S2.index)
S2.index = [0.1.2.3, 4, 5] #Error 1
S2[5] = 220
print(S2)
Error 1 — In the line S2.index = [0.1.2.3, 4, 5], the index values are not separated by commas. It should be S2.index = [0, 1, 2, 3, 4, 5]. Additionally, the Series S2 initially has four elements, so assigning a new index list of six elements to S2.index will raise a ValueError because the length of the new index does not match the length of the Series.
The corrected code is:
S2 = pd.Series([101, 102, 1-2, 104])
print(S2.index)
S2.index = [0, 1, 2, 3]
S2[5] = 220
print(S2)
In the above code fragment, the data values should be enclosed in square brackets [] to form a list.
The corrected code is:
S = pd.Series([2, 3, 4, 55], index = range(4))
In the above code fragment, the data values should be enclosed in square brackets to form a list, and the specified index range(7) has seven values while the data [1, 2, 3, 4] has only four. The index must have the same length as the number of data values.
The corrected code is:
S1 = pd.Series([1, 2, 3, 4], index = range(4))
The error in the code fragment is that the length of the data list [1, 2, 3, 4, 5] does not match the length of the index range(4). Since there are five data values, the index must also have five values.
The corrected code is:
S1 = pd.Series([1, 2, 3, 4, 5], index = range(5))
The error in the above code is in the line print(s[102, 103, 104]). When accessing multiple elements of a pandas Series with square brackets, we should pass a list of index values, not several comma-separated index values.
The corrected code is:
data = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
s = pd.Series(data, index = [100, 101, 102, 103, 104, 105])
print(s[[102, 103, 104]])
The statement s1['ab'] causes an error because 'ab' is not a single key in the index. The index has the individual keys 'a' and 'b', but not 'ab'.
Consider the following Class12.csv file containing the data as given below:
RollNo | Name | Accounts | Maths | BSt | IP | Eco |
---|---|---|---|---|---|---|
10 | Ritu Jain | 88 | 67 | 87 | 97 | 56 |
11 | Mridul Mehta | 67 | 78 | 77 | 87 | 90 |
12 | Divij | 87 | 89 | 78 | 82 | 92 |
13 | Yashvi Verma | 67 | 82.3 | 76.5 | 98.2 | 78.6 |
14 | Deepak Virmani | 56.7 | 76.5 | 88 | 78 | 67 |
15 | Jatin Malik | 76 | 66 | 77 | 87.5 | 67.5 |
(a) Read the csv file into a dataframe df which is stored with tab ('\t') separator.
(b) Write the code to find the total marks (Total_marks) for each student and add it to the newly-created dataframe.
(c) Also calculate the percentage obtained by each student under a new column “Average” in the dataframe.
(a)
df = pd.read_csv('Class12.csv', sep='\t')
RollNo Name Accounts Maths BSt IP Eco
0 10 Ritu Jain 88.0 67.0 87.0 97.0 56.0
1 11 Mridul Mehta 67.0 78.0 77.0 87.0 90.0
2 12 Divij 87.0 89.0 78.0 82.0 92.0
3 13 Yashvi Verma 67.0 82.3 76.5 98.2 78.6
4 14 Deepak Virmani 56.7 76.5 88.0 78.0 67.0
5 15 Jatin Malik 76.0 66.0 77.0 87.5 67.5
(b)
df['Total_marks'] = df[['Accounts', 'Maths', 'BSt', 'IP', 'Eco']].sum(axis=1)
RollNo Name Accounts Maths BSt IP Eco Total_marks
0 10 Ritu Jain 88.0 67.0 87.0 97.0 56.0 395.0
1 11 Mridul Mehta 67.0 78.0 77.0 87.0 90.0 399.0
2 12 Divij 87.0 89.0 78.0 82.0 92.0 428.0
3 13 Yashvi Verma 67.0 82.3 76.5 98.2 78.6 402.6
4 14 Deepak Virmani 56.7 76.5 88.0 78.0 67.0 366.2
5 15 Jatin Malik 76.0 66.0 77.0 87.5 67.5 374.0
(c)
df['Average'] = (df['Total_marks'] / 500) * 100
RollNo Name Accounts Maths BSt IP Eco Total_marks Average
0 10 Ritu Jain 88.0 67.0 87.0 97.0 56.0 395.0 79.00
1 11 Mridul Mehta 67.0 78.0 77.0 87.0 90.0 399.0 79.80
2 12 Divij 87.0 89.0 78.0 82.0 92.0 428.0 85.60
3 13 Yashvi Verma 67.0 82.3 76.5 98.2 78.6 402.6 80.52
4 14 Deepak Virmani 56.7 76.5 88.0 78.0 67.0 366.2 73.24
5 15 Jatin Malik 76.0 66.0 77.0 87.5 67.5 374.0 74.80
Write a program that reads students' marks from a 'Result.csv' file and displays the percentage of each student.
import pandas as pd
df = pd.read_csv('Result.csv')
df['Total_marks'] = df[['Maths', 'Science', 'English']].sum(axis=1)
df['Percentage'] = (df['Total_marks'] / 300) * 100
print("Percentage of each student: ")
print(df[['Name', 'Percentage']])
Percentage of each student:
Name Percentage
0 Rahul 85.000000
1 Rohan 81.666667
2 Riya 88.333333
3 Raj 86.666667
To import specific columns from a CSV file, we can use the usecols parameter of the read_csv function from the pandas library. The usecols parameter specifies the list of columns to be read from the CSV file. The syntax is pd.read_csv('filename.csv', usecols = ['column1', 'column2', ...]).
For example, the following command will access Name and Age columns of Employee file.
df = pd.read_csv("Employee.csv", usecols = ['Name', 'Age'])
print(df)
The advantages of CSV file formats are as follows:
It is a simple, compact and ubiquitous format for data storage.
It is a common format for data interchange.
It can be opened in popular spreadsheet packages like MS-Excel, Calc etc.
Nearly all spreadsheets and databases support import/export to csv format.
To read data from a CSV file into a DataFrame while providing our own column names, we can use the names argument of the read_csv() function. The syntax is: <DF> = pandas.read_csv(<filepath>, names = <sequence containing column names>).
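A sketch of this (the file name data.csv, its contents and the column names are assumptions):

```python
import pandas as pd

# Create a headerless CSV file for the demonstration.
with open('data.csv', 'w') as f:
    f.write('1,Asha\n2,Ravi\n')

# names= supplies our own column names for the headerless file.
df = pd.read_csv('data.csv', names=['ID', 'Name'])
print(df)
```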
To ensure that the top/first row's data is used as data and not as column headers when reading with the read_csv() function, we need to set the header argument to None. The syntax is: <DF> = pandas.read_csv(<filepath>, header = None).
The nrows argument can be used to read only the top 10 rows of data from a CSV file using the read_csv() function in pandas. The nrows argument specifies the number of rows of the file to read. The syntax is: df = pandas.read_csv(<filepath>, nrows = 10).
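A sketch combining header=None and nrows (the file name top.csv and its contents are assumptions):

```python
import pandas as pd

# Create a three-row CSV file for the demonstration.
with open('top.csv', 'w') as f:
    f.write('10,20\n30,40\n50,60\n')

# header=None treats the first row as data (columns become 0, 1, ...);
# nrows=2 reads only the top two rows of the file.
df = pd.read_csv('top.csv', header=None, nrows=2)
print(df)
```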
Create the following dataframe by the name Project regarding a competition and answer the questions given below:
Enrolment No. | Name | Class | Section | Project Name |
---|---|---|---|---|
101 | Rekha | XII | B | Data Analysis |
102 | Divya | XII | C | Graphical Analysis |
103 | Geet | XII | H | Machine Learning |
104 | Jeet | XII | B | App Development |
(a) Insert two records with different methods.
(b) Insert a column to store grades given to their projects.
(c) Write a command to display the name and section for all.
(d) Write a command to display the records with index value 101 and 102.
(e) Insert a column after name to store the school name.
(f) Display the second and third record.
(g) Replace the class and section of Jeet with 'XI', 'A'.
(h) Remove the column Project Name and Section.
The DataFrame project is created as follows:
import pandas as pd
data = {'Name': ['Rekha', 'Divya', 'Geet', 'Jeet'],
'Class': ['XII', 'XII', 'XII', 'XII'],
'Section': ['B', 'C', 'H', 'B'],
'Project Name': ['Data Analysis', 'Graphical Analysis', 'Machine Learning', 'App Development']
}
Project = pd.DataFrame(data, index = [101, 102, 103, 104])
print(Project)
Name Class Section Project Name
101 Rekha XII B Data Analysis
102 Divya XII C Graphical Analysis
103 Geet XII H Machine Learning
104 Jeet XII B App Development
(a)
Project.loc[105] = ['Arya', 'XI', 'D', 'Web Development']
Project.at[106, 'Name'] = 'Vikram'
Project.at[106, 'Class'] = 'XI'
Project.at[106, 'Section'] = 'A'
Project.at[106, 'Project Name'] = 'AI Research'
Name Class Section Project Name
101 Rekha XII B Data Analysis
102 Divya XII C Graphical Analysis
103 Geet XII H Machine Learning
104 Jeet XII B App Development
105 Arya XI D Web Development
106 Vikram XI A AI Research
(b)
Project['Grade'] = ['A', 'B+', 'C+', 'B', 'A+', 'C']
Name Class Section Project Name Grade
101 Rekha XII B Data Analysis A
102 Divya XII C Graphical Analysis B+
103 Geet XII H Machine Learning C+
104 Jeet XII B App Development B
105 Arya XI D Web Development A+
106 Vikram XI A AI Research C
(c)
print(Project[['Name', 'Section']])
Name Section
101 Rekha B
102 Divya C
103 Geet H
104 Jeet B
105 Arya D
106 Vikram A
(d)
print(Project.loc[[101, 102]])
Name Class Section Project Name Grade
101 Rekha XII B Data Analysis A
102 Divya XII C Graphical Analysis B+
(e)
Project.insert(1, 'School', ['ABC', 'PQR', 'ABC', 'PQR', 'XYZ', 'XYZ'])
Name School Class Section Project Name Grade
101 Rekha ABC XII B Data Analysis A
102 Divya PQR XII C Graphical Analysis B+
103 Geet ABC XII H Machine Learning C+
104 Jeet PQR XII B App Development B
105 Arya XYZ XI D Web Development A+
106 Vikram XYZ XI A AI Research C
(f)
print(Project.iloc[1:3])
Name School Class Section Project Name Grade
102 Divya PQR XII C Graphical Analysis B+
103 Geet ABC XII H Machine Learning C+
(g)
Project.loc[104, 'Class'] = 'XI'
Project.loc[104, 'Section'] = 'A'
Name School Class Section Project Name Grade
101 Rekha ABC XII B Data Analysis A
102 Divya PQR XII C Graphical Analysis B+
103 Geet ABC XII H Machine Learning C+
104 Jeet PQR XI A App Development B
105 Arya XYZ XI D Web Development A+
106 Vikram XYZ XI A AI Research C
(h)
Project = Project.drop(['Project Name', 'Section'], axis = 1)
Name School Class Grade
101 Rekha ABC XII A
102 Divya PQR XII B+
103 Geet ABC XII C+
104 Jeet PQR XI B
105 Arya XYZ XI A+
106 Vikram XYZ XI C
Consider the following dataframe: CORONA and answer the questions given below:
ID | State | Cases |
---|---|---|
100 | Delhi | 3000 |
110 | Mumbai | 4000 |
120 | Chennai | 5000 |
130 | Surat | 4500 |
Create the above-given dictionary with the given indexes.
(a) Write code to add a new column “Recovery” using the series method to store the number of patients recovered in every state.
(b) To add a new column “Deaths” using the assign() method to store the number of deaths in every state.
(c) To add a new row to store details of another state using loc (assume values).
(d) To add a new column "Percentage" using the insert() method to store the percentage of recovery in every state (assume values). The column should be added as the fourth column in the dataframe.
(e) To delete the column “Percentage” using del command.
(f) To delete the column “Deaths” using pop() method.
(g) To insert a new row of values using iloc[] at the 1st position.
(h) To delete Cases and State temporarily from the dataframe.
The DataFrame CORONA is created as :
import pandas as pd
data = {'State': ['Delhi', 'Mumbai', 'Chennai', 'Surat'],
'Cases': [3000, 4000, 5000, 4500]}
CORONA = pd.DataFrame(data, index=[100, 110, 120, 130])
print(CORONA)
State Cases
100 Delhi 3000
110 Mumbai 4000
120 Chennai 5000
130 Surat 4500
(a)
CORONA['Recovery'] = pd.Series([2500, 3000, 3500, 3200], index=[100, 110, 120, 130])
State Cases Recovery
100 Delhi 3000 2500
110 Mumbai 4000 3000
120 Chennai 5000 3500
130 Surat 4500 3200
(b)
CORONA = CORONA.assign(Deaths=[200, 250, 300, 220])
State Cases Recovery Deaths
100 Delhi 3000 2500 200
110 Mumbai 4000 3000 250
120 Chennai 5000 3500 300
130 Surat 4500 3200 220
(c)
CORONA.loc[140] = ['Karnataka', 4200, 2800, 180]
State Cases Recovery Deaths
100 Delhi 3000 2500 200
110 Mumbai 4000 3000 250
120 Chennai 5000 3500 300
130 Surat 4500 3200 220
140 Karnataka 4200 2800 180
(d)
CORONA.insert(3, 'Percentage', [80, 75, 70, 71, 67])
State Cases Recovery Percentage Deaths
100 Delhi 3000 2500 80 200
110 Mumbai 4000 3000 75 250
120 Chennai 5000 3500 70 300
130 Surat 4500 3200 71 220
140 Karnataka 4200 2800 67 180
(e)
del CORONA['Percentage']
State Cases Recovery Deaths
100 Delhi 3000 2500 200
110 Mumbai 4000 3000 250
120 Chennai 5000 3500 300
130 Surat 4500 3200 220
140 Karnataka 4200 2800 180
(f)
CORONA.pop('Deaths')
State Cases Recovery
100 Delhi 3000 2500
110 Mumbai 4000 3000
120 Chennai 5000 3500
130 Surat 4500 3200
140 Karnataka 4200 2800
(g) The iloc method is not used to add rows to a DataFrame. It is used for integer-location-based access to rows and columns, not for adding rows.
One way of adding a row at a specific position in a DataFrame is by creating a new DataFrame including the row and then concatenating the two DataFrames as shown below :
new_row = {'State': 'Hyderabad', 'Cases': 5200, 'Recovery': 3800}
new_df = pd.DataFrame([new_row], index= [150])
CORONA = pd.concat([new_df, CORONA])
State Cases Recovery
150 Hyderabad 5200 3800
100 Delhi 3000 2500
110 Mumbai 4000 3000
120 Chennai 5000 3500
130 Surat 4500 3200
140 Karnataka 4200 2800
(h)
print(CORONA.drop(['Cases', 'State'], axis=1))
Recovery
150 3800
100 2500
110 3000
120 3500
130 3200
140 2800
Without inplace=True (or reassignment), the drop is temporary and the original DataFrame CORONA remains unchanged.
The Student DataFrame is created as :
import pandas as pd
Name = ['John', 'Anna', 'Peter', 'Linda', 'Bob']
Grade = ['A', 'B', 'A+', 'C', 'B+']
Marks = [90, 80, 95, 70, 85]
S1 = pd.Series(Grade, index = Name)
S2 = pd.Series(Marks, index = Name)
Student = pd.DataFrame({'Grade' : S1, 'Marks' : S2 })
print(Student)
Grade Marks
John A 90
Anna B 80
Peter A+ 95
Linda C 70
Bob B+ 85
(a)
print(Student.head(3))
Grade Marks
John A 90
Anna B 80
Peter A+ 95
(b)
print(Student.tail(2))
Grade Marks
Linda C 70
Bob B+ 85
(a)
import pandas as pd
data = {
'Name': ['Joseph', 'Ananya', 'Praneet', 'Lakshmi', 'Bhagya'],
'Sub1': [90, 80, 95, 70, 85],
'Sub2': [85, 90, 88, 92, 89],
'Sub3': [88, 85, 90, 95, 92],
'Sub4': [92, 88, 85, 90, 95],
'Sub5': [95, 92, 92, 88, 90]
}
Student = pd.DataFrame(data)
print(Student)
Name Sub1 Sub2 Sub3 Sub4 Sub5
0 Joseph 90 85 88 92 95
1 Ananya 80 90 85 88 92
2 Praneet 95 88 90 85 92
3 Lakshmi 70 92 95 90 88
4 Bhagya 85 89 92 95 90
(b)
print("First 5 rows:")
print(Student.head(5))
print("\nBottom 3 rows:")
print(Student.tail(3))
First 5 rows:
Name Sub1 Sub2 Sub3 Sub4 Sub5
0 Joseph 90 85 88 92 95
1 Ananya 80 90 85 88 92
2 Praneet 95 88 90 85 92
3 Lakshmi 70 92 95 90 88
4 Bhagya 85 89 92 95 90
Bottom 3 rows:
Name Sub1 Sub2 Sub3 Sub4 Sub5
2 Praneet 95 88 90 85 92
3 Lakshmi 70 92 95 90 88
4 Bhagya 85 89 92 95 90
(a)
import pandas as pd
data1 = {'Employee': ['Jatin', 'Avinash', 'Kavya', 'Apoorva', 'Nitin'],
'Salary': [50000, 60000, 70000, 80000, 90000]}
df1 = pd.DataFrame(data1)
data2 = {'Employee': ['Saanvi', 'Aditi', 'Shashank', 'Swapnil', 'Shravani'],
'Salary': [55000, 45000, 30000, 85000, 66000]}
df2 = pd.DataFrame(data2)
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
DataFrame 1:
Employee Salary
0 Jatin 50000
1 Avinash 60000
2 Kavya 70000
3 Apoorva 80000
4 Nitin 90000
DataFrame 2:
Employee Salary
0 Saanvi 55000
1 Aditi 45000
2 Shashank 30000
3 Swapnil 85000
4 Shravani 66000
(b)
df1['Bonus'] = 5000
df2['Bonus'] = 5000
print("DataFrame 1 with bonus:")
print(df1)
print("\nDataFrame 2 with bonus:")
print(df2)
DataFrame 1 with bonus:
Employee Salary Bonus
0 Jatin 50000 5000
1 Avinash 60000 5000
2 Kavya 70000 5000
3 Apoorva 80000 5000
4 Nitin 90000 5000
DataFrame 2 with bonus:
Employee Salary Bonus
0 Saanvi 55000 5000
1 Aditi 45000 5000
2 Shashank 30000 5000
3 Swapnil 85000 5000
4 Shravani 66000 5000
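Once the Bonus column exists, further columns can be derived arithmetically: column operations in pandas are vectorised, so no loop is needed. A sketch on a shortened copy of df1 (the Total column name is illustrative):

```python
import pandas as pd

# Shortened version of df1 from the question above
df1 = pd.DataFrame({'Employee': ['Jatin', 'Avinash'],
                    'Salary': [50000, 60000]})
df1['Bonus'] = 5000

# Column arithmetic is vectorised: it applies element-wise to every row
df1['Total'] = df1['Salary'] + df1['Bonus']
print(df1)
```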
(a)
import pandas as pd
data = [[10, 11, 12, 13, 14],
[23, 34, 45, 32, 65],
[55, 60, 65, 70, 75]]
df = pd.DataFrame(data)
print(df)
0 1 2 3 4
0 10 11 12 13 14
1 23 34 45 32 65
2 55 60 65 70 75
(b)
df.loc[3] = [12, 62, 53, 34, 75]
print(df)
0 1 2 3 4
0 10 11 12 13 14
1 23 34 45 32 65
2 55 60 65 70 75
3 12 62 53 34 75
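Note that df.loc[3] adds a new row only because the label 3 does not already exist; assigning to an existing label overwrites that row instead. A small sketch illustrating both cases:

```python
import pandas as pd

df = pd.DataFrame([[10, 11], [23, 34]])

df.loc[2] = [55, 60]   # label 2 does not exist, so a new row is appended
df.loc[0] = [99, 98]   # label 0 exists, so that row is overwritten in place
print(df)
```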
Create a dataframe of [23, 25], [34], [43, 44, 45, 46] and do the following:
(a) Display the dataframe. Notice that the missing value is represented by NaN.
(b) Replace the missing value with 0.
(c) Replace the missing value with -1, -2, -3, -4 for columns 0, 1, 2, 3.
(d) Replace the missing value by copying the value from the above cell.
(a)
import pandas as pd
data = [[23, 25],
[34],
[43, 44, 45, 46]]
df = pd.DataFrame(data)
print(df)
0 1 2 3
0 23 25.0 NaN NaN
1 34 NaN NaN NaN
2 43 44.0 45.0 46.0
(b)
print(df.fillna(0))
0 1 2 3
0 23 25.0 0.0 0.0
1 34 0.0 0.0 0.0
2 43 44.0 45.0 46.0
(c)
print(df.fillna({0: -1, 1: -2, 2: -3, 3: -4}))
0 1 2 3
0 23 25.0 -3.0 -4.0
1 34 -2.0 -3.0 -4.0
2 43 44.0 45.0 46.0
(d)
print(df.fillna(method='ffill'))
0 1 2 3
0 23 25.0 NaN NaN
1 34 25.0 NaN NaN
2 43 44.0 45.0 46.0
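Before filling missing values it is often useful to count them first. A short sketch on the same data, using isna(), which marks each missing cell as True:

```python
import pandas as pd

df = pd.DataFrame([[23, 25],
                   [34],
                   [43, 44, 45, 46]])

# isna() marks missing cells True; summing counts them per column
print(df.isna().sum())
print("Total missing:", df.isna().sum().sum())
```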
Create a dataframe of D1 and D2;
D1 = {'Rollno': [1001, 1004, 1005, 1008, 1009], 'Name': ['Sarika', 'Abhay', 'Mohit', 'Ruby', 'Govind']}
D2 = {'Rollno': [1002, 1003, 1004, 1005, 1006], 'Name': ['Seema', 'Jia', 'Shweta', 'Sonia', 'Nishant']}
(a) Concatenate row-wise.
(b) Concatenate column-wise.
The DataFrame D1 and D2 are created as :
import pandas as pd
Data1 = {'Rollno': [1001, 1004, 1005, 1008, 1009], 'Name': ['Sarika', 'Abhay', 'Mohit', 'Ruby', 'Govind']}
Data2 = {'Rollno': [1002, 1003, 1004, 1005, 1006], 'Name': ['Seema', 'Jia', 'Shweta', 'Sonia', 'Nishant']}
D1 = pd.DataFrame(Data1)
D2 = pd.DataFrame(Data2)
print(D1)
print(D2)
Rollno Name
0 1001 Sarika
1 1004 Abhay
2 1005 Mohit
3 1008 Ruby
4 1009 Govind
Rollno Name
0 1002 Seema
1 1003 Jia
2 1004 Shweta
3 1005 Sonia
4 1006 Nishant
(a)
df = pd.concat([D1, D2])
Rollno Name
0 1001 Sarika
1 1004 Abhay
2 1005 Mohit
3 1008 Ruby
4 1009 Govind
0 1002 Seema
1 1003 Jia
2 1004 Shweta
3 1005 Sonia
4 1006 Nishant
(b)
df = pd.concat([D1, D2], axis=1)
Rollno Name Rollno Name
0 1001 Sarika 1002 Seema
1 1004 Abhay 1003 Jia
2 1005 Mohit 1004 Shweta
3 1008 Ruby 1005 Sonia
4 1009 Govind 1006 Nishant
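Notice that the row-wise concatenation above repeats the index labels 0 to 4 twice. Passing ignore_index=True makes pd.concat() renumber the result from 0. A sketch on shortened versions of D1 and D2:

```python
import pandas as pd

# Shortened versions of D1 and D2 from the question above
D1 = pd.DataFrame({'Rollno': [1001, 1004], 'Name': ['Sarika', 'Abhay']})
D2 = pd.DataFrame({'Rollno': [1002, 1003], 'Name': ['Seema', 'Jia']})

# ignore_index=True renumbers rows 0..n-1 instead of repeating labels
df = pd.concat([D1, D2], ignore_index=True)
print(df)
```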
Create a dataframe of {'A': [ ]} and display whether it is empty or not.
import pandas as pd
df = pd.DataFrame({'A': []})
print(df)
if df.empty:
    print("The dataframe is empty.")
else:
    print("The dataframe is not empty.")
Empty DataFrame
Columns: [A]
Index: []
The dataframe is empty.
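One caveat worth knowing: empty is True only when the DataFrame has no rows at all. A DataFrame whose rows exist but hold only NaN is not considered empty. A small sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [np.nan, np.nan]})

# empty is False here: the frame has rows, even though every value is NaN
print(df.empty)
print(df.dropna().empty)   # True once the all-NaN rows are removed
```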
Create a dataframe of {'A': [5, 6], 'B': [3, 0], 'C': [0, 0]} and display the result of all() and any() functions.
import pandas as pd
df = pd.DataFrame({'A': [5, 6], 'B': [3, 0], 'C': [0, 0]})
print(df)
print("\nResult of all() function:")
print(df.all())
print("\nResult of any() function:")
print(df.any())
A B C
0 5 3 0
1 6 0 0
Result of all() function:
A True
B False
C False
dtype: bool
Result of any() function:
A True
B True
C False
dtype: bool
Create a dataframe of {'A': [True, True], 'B': [True, False], 'C': [False, False]} and display the result of all() and any().
import pandas as pd
df = pd.DataFrame({'A': [True, True], 'B': [True, False], 'C': [False, False]})
print(df)
print("\nResult of all() function:")
print(df.all())
print("\nResult of any() function:")
print(df.any())
A B C
0 True True False
1 True False False
Result of all() function:
A True
B False
C False
dtype: bool
Result of any() function:
A True
B True
C False
dtype: bool
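By default all() and any() work column-wise. Passing axis=1 tests each row across its columns instead. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'A': [True, True], 'B': [True, False], 'C': [False, False]})

# axis=1 tests each row across its columns instead of each column
print(df.all(axis=1))   # row 0: True and True and False -> False
print(df.any(axis=1))   # row 0: at least one True -> True
```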
del statement | pop() function | drop() function
---|---|---
The del statement deletes a column from a DataFrame. | The pop() function deletes a column from a DataFrame (or an item from a Series). | The drop() function is used to drop rows or columns from a DataFrame.
It does not return a value. | It returns the removed column as a Series. | It returns a new DataFrame with the specified labels dropped.
It modifies the original DataFrame in place. | It modifies the original DataFrame in place. | It does not modify the original DataFrame by default (unless inplace=True is specified).
The syntax is del df['column_name']. | The syntax is df.pop('column_name'). | The syntax is df.drop(labels, axis=0 or 1).
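The three deletion techniques in the table can be sketched side by side on a small throwaway DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

del df['A']              # removes column A in place, returns nothing
b = df.pop('B')          # removes column B in place and returns it as a Series
df2 = df.drop(0, axis=0) # returns a new DataFrame without row 0; df unchanged

print(df.columns.tolist(), list(b), df2.index.tolist())
```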
False
Reason — A DataFrame object is a 2D labelled array-like Pandas data structure that stores an ordered collection of columns that can hold data of different types.
False
Reason — A Series is a Pandas data structure that represents a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.
True
Reason — The iteritems() method in pandas iterates over a DataFrame column-wise, where each vertical subset is in the form of (column-index, Series) where Series contains all row values for that column-index.
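As a caution, iteritems() has been removed in recent pandas versions; items() is its direct successor and yields the same (column-label, Series) pairs. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# items() (the successor of iteritems()) yields (column-label, Series) pairs
cols = []
for label, column in df.items():
    cols.append((label, list(column)))
print(cols)
```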
True
Reason — In Pandas, NaN (Not a Number) and NaT (Not a Time) represent missing values in numeric and datetime columns respectively, while None represents missing values in object columns. Although they apply to different data types, Pandas treats all of them as missing (NA) values.
False
Reason — The iteritems() method in pandas iterates over a DataFrame column-wise, where each vertical subset is in the form of (column-index, Series) where Series contains all row values for that column-index.
True
Reason — CSV stands for Comma Separated Values. It is a file format used to store tabular data, such as tables or spreadsheets, as plain text. In a CSV file, each line represents a single row of data, and each value in that row is separated from the next value by a comma (,).
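The CSV format described above can be demonstrated with a small round trip: write a DataFrame out as CSV text, then read it back. Here io.StringIO stands in for a file on disk so the sketch is self-contained:

```python
import pandas as pd
import io

df = pd.DataFrame({'Rollno': [1001, 1002], 'Name': ['Sarika', 'Seema']})

# Write the DataFrame as CSV text (io.StringIO stands in for a file on disk)
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())

# Read the same CSV text back into a new DataFrame
buf.seek(0)
df2 = pd.read_csv(buf)
print(df2.equals(df))
```

In the CSV text, the first line holds the column names and every later line holds one row, with values separated by commas.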