Python DataFrame

In Python, there are several ways to represent a 2-dimensional array. You can use a nested list such as [[1,2],[3,4],[5,6]], a NumPy ndarray such as numpy.array([[1,2],[3,4],[5,6]]), or a DataFrame from the pandas package: DataFrame([[1,2],[3,4],[5,6]]). All three represent a 2-dimensional array with 3 rows and 2 columns.
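As a quick check, the three representations can be built side by side. This is a minimal sketch; the variable names (rows, as_list, as_array, as_frame) are only illustrative.

```python
import numpy as np
import pandas as pd

rows = [[1, 2], [3, 4], [5, 6]]

as_list = rows                  # plain nested list
as_array = np.array(rows)       # NumPy ndarray
as_frame = pd.DataFrame(rows)   # pandas DataFrame with default labels

# A list has no shape attribute, so count rows and columns by hand.
list_shape = (len(as_list), len(as_list[0]))

print(list_shape)       # (3, 2)
print(as_array.shape)   # (3, 2)
print(as_frame.shape)   # (3, 2)
```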

But a DataFrame object has attributes that a list or ndarray does not have. Its index attribute holds the names (labels) of the rows, and its columns attribute holds the names (labels) of the columns. The actual data is available through the values attribute. A call to the DataFrame constructor that sets all three looks like this:

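from pandas import DataFrame

# data holds the cells, index the row labels, columns the column labels
frame = DataFrame(data=[[1,2],[3,4],[5,6]],
                  index=['row1','row2','row3'],
                  columns=['col1','col2'])
print(frame)
print(frame.index)    # row labels
print(frame.columns)  # column labels
print(frame.values)   # underlying data as an ndarray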

      col1  col2
row1     1     2
row2     3     4
row3     5     6
Index(['row1', 'row2', 'row3'], dtype='object')
Index(['col1', 'col2'], dtype='object')
[[1 2]
 [3 4]
 [5 6]]

If you do not pass index and columns when constructing a DataFrame object, both default to the integer labels 0, 1, 2, …, which pandas stores as a RangeIndex.

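from pandas import DataFrame

# no index or columns given, so both fall back to 0, 1, 2, ...
frame = DataFrame([[1,2],[3,4],[5,6]])
print(frame)
print(frame.index)    # default row labels
print(frame.columns)  # default column labels
print(frame.values)   # underlying data as an ndarray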

   0  1
0  1  2
1  3  4
2  5  6
RangeIndex(start=0, stop=3, step=1)
RangeIndex(start=0, stop=2, step=1)
[[1 2]
 [3 4]
 [5 6]]

How do you access elements in a 2-dimensional array? With a list or ndarray you can use obj[i][j] (for an ndarray, obj[i,j] also works). With a DataFrame object you can use obj[columnname][rowname]. Note the difference between list/ndarray and DataFrame: the column name comes first and the row name second, so the order is reversed. To keep the convention that the row comes before the column, use obj.loc[rowname][columnname], or in a single call obj.loc[rowname, columnname]. You can also use obj.iloc[rownumber][columnnumber] or obj.iloc[rownumber, columnnumber] to address an element by position, even if the DataFrame object has customized index/columns attributes (row and column numbers start from 0). The single-call forms with a comma are generally preferable, because the chained forms first build an intermediate row or column and then index into it.
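The access patterns above can be compared on the same data. This is a small sketch using the 3 by 2 frame from earlier; the variable names on the left are only illustrative.

```python
import numpy as np
import pandas as pd

data = [[1, 2], [3, 4], [5, 6]]
arr = np.array(data)
frame = pd.DataFrame(data,
                     index=['row1', 'row2', 'row3'],
                     columns=['col1', 'col2'])

# Every line below picks the element in the second row, first column.
list_elem = data[1][0]                    # list: row, then column
arr_elem = arr[1, 0]                      # ndarray: row, column
by_col_then_row = frame['col1']['row2']   # DataFrame []: column first, then row
by_label = frame.loc['row2', 'col1']      # .loc: row label, column label
by_position = frame.iloc[1, 0]            # .iloc: row number, column number

print(list_elem, arr_elem, by_col_then_row, by_label, by_position)  # 3 3 3 3 3
```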

Besides individual elements, you can also get a whole row or column of a DataFrame object. To get a column, use frame[columnname] or frame.columnname (the attribute form only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method). The result is a pandas Series object: an array plus an index. To get multiple columns, use frame[[columnname1,columnname2,…]]. Notice the double square brackets: the column names must be put into a list, and that list is passed to the outer square brackets. Writing frame[columnname1,columnname2] does not work, because the [] operator then receives a single tuple as its key and looks for one column with that tuple as its label.

Getting a row is different. You cannot use frame[rowname], because [] with a single label looks up a column; since rowname is not a column name, this raises a KeyError. Instead, pass a slice: frame[rownumber:rownumber+1] returns row rownumber as a one-row DataFrame (frame[:rownumber+1] would return every row from 0 up to and including rownumber). Integer slices are positional, so they work even if the DataFrame object has a customized index. A slice of row labels such as frame[rowname:rowname] also works, and it includes its end label. On the other hand, you cannot use frame[columnnumber] to get a column if the DataFrame object has customized column labels, because [] matches column labels, not positions.
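Here is a short sketch of what the [] operator does with a column name, a list of column names, an integer slice and a row name. The frame is the same one used earlier; the other variable names are only illustrative.

```python
import pandas as pd

frame = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                     index=['row1', 'row2', 'row3'],
                     columns=['col1', 'col2'])

one_column = frame['col1']               # a Series indexed by row label
same_column = frame.col1                 # attribute access to the same column
two_columns = frame[['col1', 'col2']]    # list of names -> a DataFrame
row_slice = frame[1:2]                   # integer slice -> one-row DataFrame (row2)

# A bare row label is treated as a column name, so the lookup fails.
try:
    frame['row1']
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True

print(list(one_column))         # [1, 3, 5]
print(list(row_slice.index))    # ['row2']
print(row_lookup_failed)        # True
```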

From the above discussion, accessing a row with [] is more awkward than accessing a column: you need a slice even for a single row. That is because the [] operator of DataFrame is mainly designed for accessing columns rather than rows. DataFrame.loc, in contrast, puts rows first. frame.loc[rowname] gets a whole row as a Series. To get multiple rows, put the row names into a list, like frame.loc[[rowname1,rowname2]]. Both frame.loc[] and frame.iloc[] accept a row selector and a column selector separated by a comma, so pay attention to what is inside the brackets: frame.iloc[1,2] gets the element at row 1 and column 2, while frame.iloc[[1,2]] gets the whole of row 1 and the whole of row 2.
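The difference between these forms is easier to see on a frame with three columns, so that column 2 exists. The 3 by 3 frame below is made up for this illustration and is not the frame used earlier.

```python
import pandas as pd

# Hypothetical 3x3 frame, so that position (1, 2) is valid.
frame = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                     index=['row1', 'row2', 'row3'],
                     columns=['col1', 'col2', 'col3'])

one_row = frame.loc['row2']               # whole row as a Series
two_rows = frame.loc[['row1', 'row3']]    # list of labels -> DataFrame
element = frame.iloc[1, 2]                # row 1, column 2 -> a single value
rows_by_pos = frame.iloc[[1, 2]]          # list of positions -> rows 1 and 2

print(list(one_row))             # [4, 5, 6]
print(list(two_rows.index))      # ['row1', 'row3']
print(element)                   # 6
print(list(rows_by_pos.index))   # ['row2', 'row3']
```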

DataFrame[] can also take a Boolean Series as its argument. Rows whose corresponding value in the Series is False are dropped from the result, so only the rows marked True remain.
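A Boolean Series is usually produced by comparing a column with a value. The sketch below assumes the same 3 by 2 frame as before; the condition col1 > 2 is just an example.

```python
import pandas as pd

frame = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                     index=['row1', 'row2', 'row3'],
                     columns=['col1', 'col2'])

mask = frame['col1'] > 2    # Boolean Series, one value per row
filtered = frame[mask]      # keeps only the rows where mask is True

print(list(mask))             # [False, True, True]
print(list(filtered.index))   # ['row2', 'row3']
```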

 

 
