Sorting and ranking#
Sorting a dataset by a criterion is another important built-in operation. Sorting lexicographically by row or column index is already described in the section Reordering and sorting from levels. In the following we look at sorting the values with DataFrame.sort_values and Series.sort_values. For comparison, the first example sorts a Series by its index in descending order with Series.sort_index:
[1]:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
s = pd.Series(rng.normal(size=7))
s.sort_index(ascending=False)
[1]:
6 -0.521271
5 -0.228255
4 -1.131139
3 -0.531495
2 0.783785
1 -0.311396
0 0.088381
dtype: float64
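The difference between sorting by index and sorting by values is easier to see on fixed data. The following is a minimal sketch; the values and string labels are made up for illustration and are not part of the example above:

```python
import pandas as pd

# Hypothetical data: three values with deliberately unordered labels
s_demo = pd.Series([3, 1, 2], index=["b", "c", "a"])

by_index = s_demo.sort_index()    # orders the labels: a, b, c
by_value = s_demo.sort_values()   # orders the values: 1, 2, 3

print(by_index)
print(by_value)
```

Both calls return a new Series and leave s_demo unchanged.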
By default, all missing values are sorted to the end of the Series:
[2]:
s = pd.Series(rng.normal(size=7))
s[s < 0] = np.nan
s.sort_values()
[2]:
6 0.303859
4 0.435222
5 0.936456
3 1.312848
2 1.840338
0 NaN
1 NaN
dtype: float64
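If the missing values should come first instead, sort_values accepts the na_position parameter. A short sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values
s_na = pd.Series([3.0, np.nan, 1.0, np.nan, 2.0])

# na_position="first" places NaN before the sorted values
nan_first = s_na.sort_values(na_position="first")
print(nan_first)
```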
With a DataFrame you can sort on both axes. With the by parameter you specify which column or row the values are sorted by:
[3]:
df = pd.DataFrame(rng.normal(size=(7, 3)))
df.sort_values(by=2, ascending=False)
[3]:
| | 0 | 1 | 2 |
---|---|---|---|
3 | 1.489694 | 0.104105 | 0.870251 |
6 | -0.649611 | -1.035134 | 0.515880 |
5 | -0.176371 | 1.261471 | 0.242477 |
0 | 0.252096 | -0.315417 | -1.000917 |
2 | -1.659567 | -0.139293 | -1.138415 |
4 | 1.533278 | 0.241760 | -1.252604 |
1 | 1.929005 | 1.032325 | -2.153640 |
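The by parameter also accepts a list of columns, and ascending can then be a list with one entry per column. The following sketch assumes a small frame with hypothetical columns group and score:

```python
import pandas as pd

# Hypothetical data: column names and values are for illustration only
scores = pd.DataFrame(
    {"group": ["b", "a", "b", "a"], "score": [1, 4, 3, 2]}
)

# Sort by group ascending, then by score descending within each group
ordered = scores.sort_values(by=["group", "score"], ascending=[True, False])
print(ordered)
```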
With axis=1, by refers to row labels, and the columns are reordered according to the values in those rows:
[4]:
df.sort_values(axis=1, by=[0, 1], ascending=False)
[4]:
| | 0 | 1 | 2 |
---|---|---|---|
0 | 0.252096 | -0.315417 | -1.000917 |
1 | 1.929005 | 1.032325 | -2.153640 |
2 | -1.659567 | -0.139293 | -1.138415 |
3 | 1.489694 | 0.104105 | 0.870251 |
4 | 1.533278 | 0.241760 | -1.252604 |
5 | -0.176371 | 1.261471 | 0.242477 |
6 | -0.649611 | -1.035134 | 0.515880 |
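With random data it can be hard to see what axis=1 changes. A minimal sketch on a fixed frame, with column names a, b and c chosen only for illustration:

```python
import pandas as pd

# Hypothetical data: two rows (labels 0 and 1), three named columns
wide = pd.DataFrame([[3, 1, 2], [9, 8, 7]], columns=["a", "b", "c"])

# Reorder the columns so that the values in row 0 are ascending
by_row0 = wide.sort_values(axis=1, by=0)
print(by_row0)
```

The rows keep their order; only the column order changes, here to b, c, a.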
Ranking#
DataFrame.rank and Series.rank assign ranks from one to the number of valid data points in an array:
[5]:
df.rank()
[5]:
| | 0 | 1 | 2 |
---|---|---|---|
0 | 4.0 | 2.0 | 4.0 |
1 | 7.0 | 6.0 | 1.0 |
2 | 1.0 | 3.0 | 3.0 |
3 | 5.0 | 4.0 | 7.0 |
4 | 6.0 | 5.0 | 2.0 |
5 | 3.0 | 7.0 | 5.0 |
6 | 2.0 | 1.0 | 6.0 |
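The phrase "valid data points" matters when values are missing: with the default settings, a missing value receives no rank. A short sketch with made-up values, which also shows ranking in descending order:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
values = pd.Series([30.0, np.nan, 10.0, 20.0])

ranks = values.rank()                      # smallest value gets rank 1
ranks_desc = values.rank(ascending=False)  # largest value gets rank 1

print(ranks)
print(ranks_desc)
```

The ranks run from 1 to 3 because only three of the four entries are valid.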
If ties occur in the ranking, each group of tied values is assigned its average rank by default:
[6]:
df2 = pd.concat([df, df[5:]])
df2.rank()
[6]:
| | 0 | 1 | 2 |
---|---|---|---|
0 | 6.0 | 3.0 | 4.0 |
1 | 9.0 | 7.0 | 1.0 |
2 | 1.0 | 4.0 | 3.0 |
3 | 7.0 | 5.0 | 9.0 |
4 | 8.0 | 6.0 | 2.0 |
5 | 4.5 | 8.5 | 5.5 |
6 | 2.5 | 1.5 | 7.5 |
5 | 4.5 | 8.5 | 5.5 |
6 | 2.5 | 1.5 | 7.5 |
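The fractional ranks come from averaging the positions that the tied values occupy in sorted order. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical data: the value 20 occurs twice
tied = pd.Series([10, 20, 20, 30])

# In sorted order the two 20s occupy positions 2 and 3
expected_tie_rank = (2 + 3) / 2

tie_ranks = tied.rank()
print(tie_ranks)
```

Both occurrences of 20 receive the rank 2.5, and 30 keeps rank 4.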
The method min, on the other hand, assigns the smallest rank in the group:
[7]:
df2.rank(method="min")
[7]:
| | 0 | 1 | 2 |
---|---|---|---|
0 | 6.0 | 3.0 | 4.0 |
1 | 9.0 | 7.0 | 1.0 |
2 | 1.0 | 4.0 | 3.0 |
3 | 7.0 | 5.0 | 9.0 |
4 | 8.0 | 6.0 | 2.0 |
5 | 4.0 | 8.0 | 5.0 |
6 | 2.0 | 1.0 | 7.0 |
5 | 4.0 | 8.0 | 5.0 |
6 | 2.0 | 1.0 | 7.0 |
Other methods with rank#
Method | Description |
---|---|
average | default: assign the average rank to each entry in the same group |
min | uses the minimum rank for the whole group |
max | uses the maximum rank for the whole group |
first | assigns the ranks in the order in which the values appear in the data |
dense | like min, but the rank always increases by 1 between groups |
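To compare the tie-breaking methods side by side, the same Series can be ranked once per method. The following sketch uses made-up values, with one column per value accepted by the method parameter of rank:

```python
import pandas as pd

# Hypothetical data: the value 20 occurs twice
tied = pd.Series([10, 20, 20, 30])

methods = ["average", "min", "max", "first", "dense"]

# One column of ranks per tie-breaking method
comparison = pd.DataFrame({m: tied.rank(method=m) for m in methods})
print(comparison)
```

The tied pair receives 2.5 with average, 2 with min, 3 with max, and 2 and 3 with first. With dense, the pair receives 2 and the following value receives 3 rather than 4.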