Playing Around with Numpy and Pandas
- Author: Yair Mark (@yairmark)
I recently started working on a basic data science project. Below are some of the things I learned while getting a feel for Numpy and Pandas.
Linspace - Specify Interval and Range for Graph Axis
The official docs can be found here
np.linspace(start, stop, num) returns an array of num evenly spaced samples from start to stop, with both endpoints included by default. If num is not provided, 50 samples are returned. For example:
>>> import numpy as np
>>> X = np.linspace(1.5, 5, 10)
>>> X
array([1.5       , 1.88888889, 2.27777778, 2.66666667, 3.05555556,
       3.44444444, 3.83333333, 4.22222222, 4.61111111, 5.        ])
>>> len(X)
10
This seems to be used with plotting libraries to control each axis's range and interval.
For example, if the above output is used for the X-axis in a plotting library, the axis will start at 1.5 and end at 5, with 10 evenly spaced points (that is, 9 intervals).
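As a small sketch of the spacing described above (the variable names are illustrative and not from the original project), retstep=True returns the step size alongside the samples, and endpoint=False excludes the stop value:

```python
import numpy as np

# 10 samples from 1.5 to 5; both endpoints are included by default
x, step = np.linspace(1.5, 5, 10, retstep=True)

# 10 points leave 9 gaps, so the step is (5 - 1.5) / 9
print(len(x), step)

# With endpoint=False the stop value is excluded and the step becomes (5 - 1.5) / 10
x_open = np.linspace(1.5, 5, 10, endpoint=False)
print(x_open[-1])
```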
Unique - Identify Unique Values in Dataframe
dupes = some_df[some_df.duplicated()]
In the above:
- some_df.duplicated() returns a boolean Series with one True or False value per row: True if the row is a duplicate of an earlier row and False otherwise.
- some_df[...]: Pandas accepts a sequence of True/False values, one per row, and uses it as a filter.
  - If True, the row is included.
  - If False, it is not.
The duplicated method's documentation can be found here
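A minimal sketch of this boolean filtering, using made-up data (the column names and values here are assumptions for illustration only). It also shows the keep parameter, which controls which members of a duplicate group are flagged:

```python
import pandas as pd

# Hypothetical data: the row ('a', 1) appears twice
some_df = pd.DataFrame({'SOME_COL': ['a', 'b', 'a', 'c'], 'VALUE': [1, 2, 1, 3]})

# Boolean Series: True only for rows that repeat an earlier row
mask = some_df.duplicated()

# Indexing with the mask keeps only the rows where it is True
dupes = some_df[mask]

# keep=False flags every member of a duplicate group, including the first occurrence
all_copies = some_df[some_df.duplicated(keep=False)]

print(mask.tolist())
print(dupes)
print(all_copies)
```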
To easily drop the duplicates run:
# drop duplicates where the entire row is the same
de_duplicated_df = some_df.drop_duplicates()
# drop duplicates where rows are compared based on one column only
de_duplicated_df = some_df.drop_duplicates(subset='SOME_COL')
To confirm that the correct number of dupes was dropped, we can compare the:
- DataFrame length before
- DataFrame length after de-duplicating
- The number of duplicates:
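# count dupes on the same column that drop_duplicates uses below
num_dupes = len(some_df[some_df.duplicated(subset='SOME_COL')])
len_before = len(some_df)
de_duped = some_df.drop_duplicates(subset='SOME_COL')
len_after = len(de_duped)
# the last value is True when the counts agree
len_before, len_after, num_dupes, (len_before - len_after) == num_dupes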
It is a good idea to sort the rows by the column you de-duplicated on, so you can verify at a glance that the dupes are gone:
some_df.sort_values('SOME_COL', ascending=True)
- This nice brief article goes over this in more detail.
- The official docs for drop_duplicates can be found here
- The official docs for sort_values
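Putting the de-duplication check and the sort together, here is a self-contained sketch on made-up data (the OTHER column and all values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data with repeated values in SOME_COL
some_df = pd.DataFrame({
    'SOME_COL': ['b', 'a', 'b', 'c', 'a'],
    'OTHER': [1, 2, 3, 4, 5],
})

# Count rows whose SOME_COL value already appeared in an earlier row
num_dupes = int(some_df.duplicated(subset='SOME_COL').sum())

# Drop those rows, keeping the first occurrence of each value
de_duped = some_df.drop_duplicates(subset='SOME_COL')

# The number of rows removed should equal the number of dupes counted
counts_match = (len(some_df) - len(de_duped)) == num_dupes

# Sorting makes any leftover repeats easy to spot by eye
sorted_df = de_duped.sort_values('SOME_COL', ascending=True)

print(num_dupes, counts_match)
print(sorted_df)
```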
Append a Column with Defaults
This answer describes this in more detail. To add a new column, let's say COUNT, with a default of 0, we do the following:
some_df['COUNT'] = 0
You can also default all values in the column to not a number (NaN) as follows:
some_df['COUNT'] = np.nan
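One thing worth noticing is that the default value determines the column's dtype. The sketch below uses a made-up DataFrame and a hypothetical SCORE column to show the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical starting data
some_df = pd.DataFrame({'NAME': ['x', 'y', 'z']})

# A scalar is broadcast to every row; an int default gives an integer column
some_df['COUNT'] = 0

# NaN is a float, so this column is created as float64
some_df['SCORE'] = np.nan

print(some_df.dtypes)
```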
at - set value for a particular cell
This answer describes how to do this.
You have to use something of the form:
index_row_num = 10
previous = some_df.at[index_row_num, 'COUNT']
some_df.at[index_row_num, 'COUNT'] = previous + 1
- index_row_num: this illustrates that the first argument to at is the index label of the row you want (with the default index this is the same as the row number).
- The second argument to at is the column you want to set the cell value for.
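To illustrate that at works on index labels rather than positions, here is a small sketch with a made-up string index. It also uses iat, the positional counterpart of at:

```python
import pandas as pd

# Hypothetical data with a non-numeric index
some_df = pd.DataFrame({'COUNT': [0, 0, 0]}, index=['x', 'y', 'z'])

# at addresses a single cell by row label and column label
some_df.at['y', 'COUNT'] = some_df.at['y', 'COUNT'] + 1

# iat addresses a single cell by integer row and column position
some_df.iat[2, 0] = 5

print(some_df)
```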
concat - insert a new row
An example of adding a row with columns is below:
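# build a one-row DataFrame, then use its ID column as the index
row_df = pd.DataFrame({'ID': 0, 'Name': 'John', 'Surname': 'Smith'}, index=['0'])
row_df.set_index('ID', inplace=True)
# data is the existing DataFrame; the new row is placed first
data = pd.concat([row_df, data])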
If the entry you are adding is a simple row whose values all have the same type, you can use the approach described here. An extract from this article is below:
a_row = pd.Series([1, 2])
df = pd.DataFrame([[3, 4], [5, 6]])
row_df = pd.DataFrame([a_row])
df = pd.concat([row_df, df], ignore_index=True)
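The order of the list passed to concat decides where the new row ends up. A minimal sketch with made-up names (all data here is hypothetical):

```python
import pandas as pd

# Hypothetical existing data and a new one-row DataFrame with the same columns
data = pd.DataFrame({'Name': ['Jane', 'Bob'], 'Surname': ['Doe', 'Jones']})
row_df = pd.DataFrame([{'Name': 'John', 'Surname': 'Smith'}])

# New row first: it becomes the top row
prepended = pd.concat([row_df, data], ignore_index=True)

# New row last: it becomes the bottom row
appended = pd.concat([data, row_df], ignore_index=True)

# ignore_index=True renumbers the result 0..n-1 instead of keeping the old labels
print(prepended)
print(appended)
```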
set_index - change the index to a specific column
This is as simple as:
data.set_index('SOME_OTHER_COL', inplace=True)
- inplace=True will change the DataFrame directly.
  - If this is set to False (the default), a new DataFrame with the new index is returned instead.
- Here are the official set_index docs
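The difference between the two modes can be sketched as follows, on made-up data (the ID column is an assumption). reset_index is included as the way to turn the index back into a regular column:

```python
import pandas as pd

# Hypothetical data with the default 0..n-1 index
data = pd.DataFrame({'ID': [1, 2], 'SOME_OTHER_COL': ['a', 'b']})

# Without inplace, a new DataFrame is returned and data is left untouched
indexed = data.set_index('SOME_OTHER_COL')
original_index = data.index.tolist()

# With inplace=True, data itself is modified and the call returns None
data.set_index('SOME_OTHER_COL', inplace=True)

# reset_index moves the index back into a regular column
restored = data.reset_index()

print(original_index, indexed.index.tolist(), data.index.tolist())
print(restored)
```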
Filtering a DataFrame
This is of the format:
some_df = some_df[some_df.COUNT > 0]
In this case some_df:
- Has a column called COUNT which is an int column.
  - If it has a different type use the appropriate operator.
- We are replacing some_df, but if you just want to filter and see the results change this to:
some_df[some_df.COUNT > 0]
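As a sketch of what "the appropriate operator" might look like in practice, the example below filters made-up data (the NAME column and all values are assumptions) by a numeric comparison, a combined condition, and a membership test:

```python
import pandas as pd

# Hypothetical data
some_df = pd.DataFrame({'NAME': ['a', 'b', 'c', 'd'], 'COUNT': [0, 2, 5, 0]})

# Numeric comparison, same idea as some_df.COUNT > 0
positive = some_df[some_df['COUNT'] > 0]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
mid_range = some_df[(some_df['COUNT'] > 0) & (some_df['COUNT'] < 5)]

# For string columns, isin tests membership in a list of values
named = some_df[some_df['NAME'].isin(['a', 'c'])]

print(positive)
print(mid_range)
print(named)
```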