Pandas 2.0: How can data scientists benefit from it?

Pandas is at the core of multiple pipelines we have built. All these improvements bring a lot of benefits such as improved data processing times leading to faster updation of data, and improved resource handling means lowered infrastructure cost and fewer chances of services going down due to overload on CPU and memory.

Pareekshith Katti, Senior Data Scientist.

What is Pandas, and why is it important for data scientists and analysts?

If you’re a data scientist or analyst, you very likely rely on Pandas, the popular open-source library for data manipulation and analysis. It is widely used in data science, finance, and other industries for its easy-to-use data structures and powerful data analysis tools.

Pandas is a valuable resource for data scientists for a multitude of reasons. With Pandas, data scientists are able to clean, preprocess, and transform data easily, allowing them to focus on the analysis rather than the data wrangling process.

The Pandas project recently released a new major version, Pandas 2.0, which boasts some exciting new features. At Ambee, we deal with millions of terabytes of data every day, and tools like Pandas are instrumental in our research and operations. So, if you’re a developer, here’s what the new update means for you.

What is new in Pandas 2.0?

Nullable boolean data type

One notable feature is the nullable Boolean data type, BooleanDtype. It first appeared in pandas 1.0, and Pandas 2.0 extends support for nullable data types to more of the library, for example through the dtype_backend option of its readers. Unlike the default NumPy-backed bool data type, it allows missing values in boolean data.

This feature significantly simplifies working with datasets that contain missing boolean values, making data analysis more robust and intuitive. At an enterprise level, and especially for us, stability is paramount. This feature prevents breakages in the pipeline, thereby saving a lot of time and infrastructure.
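As a minimal sketch of how missing values behave in a nullable boolean column, the example below uses made-up sensor flags. The variable names are illustrative and not taken from any Ambee pipeline.

```python
import pandas as pd

# Hypothetical sensor flags with one missing reading.
flags = pd.Series([True, False, None], dtype="boolean")

print(flags.dtype)
print(flags.isna().sum())  # number of missing values

# Logical operations follow three-valued logic:
# True | NA is True, while False | NA and NA | NA stay missing.
other = pd.Series([True, pd.NA, pd.NA], dtype="boolean")
either = flags | other

# Aggregations skip missing values by default.
n_true = int(flags.sum())
print(either.tolist(), n_true)
```

With a plain NumPy bool column, the missing entry would force a fallback to the object dtype. Here the column stays boolean and the gap is tracked explicitly.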

Improved categorical data type

Categorical data is common in many datasets. Pandas 2.0 includes several improvements to the categorical data type, including faster performance and reduced memory usage. Pandas has not typically played well with large datasets, and these improvements help considerably with performance. A CategoricalDtype can also hold string and boolean categories, making it easier to work with mixed data types.
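To make the memory argument concrete, here is a small sketch that compares a column of repeated labels stored as plain strings against the same column stored as a categorical. The labels and sizes are invented for illustration, and the exact byte counts will vary by pandas version and platform.

```python
import pandas as pd

# Hypothetical column with many repeats of a few labels.
labels = pd.Series(["good", "moderate", "poor"] * 10_000)
as_category = labels.astype("category")

plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"plain strings: {plain_bytes:,} bytes")
print(f"category:      {category_bytes:,} bytes")
print(as_category.cat.categories.tolist())
```

The categorical version stores each distinct label once and keeps a small integer code per row, which is why it tends to be much smaller when labels repeat often.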

Improved string handling

String manipulation is a common task in data analysis, and one that is error-prone and labor-intensive. Pandas 2.0 introduces several improvements to string handling, including new string methods and improved performance for string operations. These improvements make it easier to work with text and sentence data in Pandas.
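The sketch below shows typical text cleanup using the dedicated string data type and the .str accessor. Both exist in earlier pandas releases too, so nothing here is specific to 2.0, and the station names are made up.

```python
import pandas as pd

# Hypothetical free-text station names with inconsistent formatting.
names = pd.Series(["  Bengaluru ", "new delhi", None, "MUMBAI"], dtype="string")

# Chain vectorized string methods; missing values pass through untouched.
cleaned = names.str.strip().str.title()

# Boolean results keep the missing entry as missing instead of guessing.
has_space = cleaned.str.contains(" ")

print(cleaned.tolist())
print(has_space.tolist())
```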

Enhanced extension array interface

The extension array interface in Pandas 2.0 has been enhanced to make it easier to create custom data types. The new extension array interface provides more flexibility and allows for better integration with other libraries.

Improved MultiIndexing

MultiIndexing is a powerful feature in Pandas that allows for the indexing and slicing of datasets with multiple levels of hierarchical indexing. Pandas 2.0 introduces several improvements to MultiIndexing, including faster performance and improved memory usage.

For us at Ambee, this helps when indexing by latitude, longitude, and time. It is especially useful for aggregating and resampling particular groups without affecting others.
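A rough sketch of that pattern follows, assuming readings indexed by latitude, longitude, and time. The coordinates, values, and the pm25 name are invented for illustration and do not reflect Ambee's actual schema.

```python
import pandas as pd

# Hypothetical readings for two locations, every 12 hours over two days.
times = pd.date_range("2023-01-01", periods=4, freq="12h")
locations = [(12.97, 77.59), (28.61, 77.21)]
index = pd.MultiIndex.from_tuples(
    [(lat, lon, t) for lat, lon in locations for t in times],
    names=["lat", "lon", "time"],
)
readings = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 1.0, 2.0, 3.0, 4.0], index=index, name="pm25"
)

# Select one location without touching the other.
one_site = readings.xs((12.97, 77.59), level=["lat", "lon"])

# Daily mean per location: group on the two spatial levels and
# bin the time level by day inside each group.
daily = readings.groupby(
    [
        pd.Grouper(level="lat"),
        pd.Grouper(level="lon"),
        pd.Grouper(level="time", freq="D"),
    ]
).mean()

print(one_site)
print(daily)
```

Grouping on the spatial levels keeps each location's time series separate, so the daily averages of one site never mix with another's.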

Improved IO performance

Pandas 2.0 introduces several improvements to IO performance, including faster CSV and Excel file reading and writing. These improvements make it easier to work with large datasets and improve the overall performance of data analysis tasks.

The direct result of this is seen in update times. The time taken to read and write files is directly proportional to how large the files are. With this new update, we’re able to cut a lot of time and serve clients faster.
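One reader option that did arrive in Pandas 2.0 is dtype_backend. The sketch below reads a small in-memory CSV (made-up data) with nullable data types, so a column with gaps keeps a proper numeric type. If PyArrow is installed, passing engine="pyarrow" to read_csv is another option worth benchmarking on your own files. No speedup is claimed here.

```python
import io

import pandas as pd

# Hypothetical CSV content with missing cells.
csv_text = "station,pm25,valid\nA,12.5,true\nB,,false\nC,40.1,\n"

# dtype_backend was added to the readers in pandas 2.0.
df = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")

print(df.dtypes)
print(df)
```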

Improved plotting capabilities

Pandas 2.0 introduces several improvements to plotting capabilities, including support for new plot types and improved customization options. These improvements make it easier to create compelling visualizations of data in Pandas.

We use it to generate all kinds of graphs, reports, and inferences, both for our research and for clients. The added customization helps us in creating high-quality reports.

PyArrow functionality

Pandas can utilize PyArrow to extend functionality and improve the performance of various APIs. For us, the improved support for the Parquet and Feather formats has helped greatly. These columnar formats compress well, which saves a lot of time and cost. Some other benefits of the PyArrow functionality are:

More extensive data types compared to NumPy
Missing data support (NA) for all data types
Performant IO reader integration
Facilitate interoperability with other dataframe libraries based on the Apache Arrow specification (e.g. polars, cuDF)

To use this functionality, please ensure you have installed the minimum supported PyArrow version.
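A quick way to check whether PyArrow is present before relying on Arrow-backed features is sketched below. The helper name pyarrow_available is hypothetical, and the minimum supported version depends on your pandas release, so compare the printed version against the pandas installation guide.

```python
import importlib.util


def pyarrow_available() -> bool:
    """Return True if the pyarrow package can be found in this environment."""
    return importlib.util.find_spec("pyarrow") is not None


if pyarrow_available():
    import pyarrow as pa

    print("pyarrow version:", pa.__version__)
else:
    print("pyarrow is not installed; Arrow-backed dtypes are unavailable")
```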

How does data structure integration work in Pandas 2.0?

A Series, Index, or the columns of a DataFrame can be directly backed by a pyarrow.ChunkedArray, which is similar to a NumPy array. To construct these from the main pandas data structures, you can pass in a string of the type followed by [pyarrow], e.g. "int64[pyarrow]", into the dtype parameter:

# Assumes `import numpy as np` and `import pandas as pd` have been run,
# and that a supported PyArrow version is installed.
In [1]: ser = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]")

In [2]: ser
Out[2]:
0    -1.5
1     0.2
2    <NA>
dtype: float[pyarrow]

In [3]: idx = pd.Index([True, None], dtype="bool[pyarrow]")

In [4]: idx
Out[4]: Index([True, <NA>], dtype='bool[pyarrow]')

In [5]: df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]")

In [6]: df
Out[6]:
   0  1
0  1  2
1  3  4

Group by: split-apply-combine

This function is especially useful for us at Ambee as it helps us carry out a lot of essential analytics. It can be used similarly for a myriad of other purposes. Keep in mind that by “group by” we are referring to a process involving one or more of the following steps:

Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.

Out of these, the split step is the most straightforward. In fact, in many situations, we may wish to split the data set into groups and do something with those groups. In the ‘apply’ step, we might wish to do one of the following:

Aggregation: compute a summary statistic (or statistics) for each group. Some examples:

Compute group sums or means.
Compute group sizes/counts

Transformation: perform some group-specific computations and return a like-indexed object. Some examples:

Standardize data (zscore) within a group.
Filling NAs within groups with a value derived from each group.

Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:

Discard data that belongs to groups with only a few members.
Filter out data based on the group sum or mean.
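The three kinds of apply step described above can be sketched on a tiny, made-up DataFrame. The city and value column names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["A", "A", "B", "B", "B", "C"],
        "value": [1.0, 3.0, 10.0, None, 30.0, 5.0],
    }
)
grouped = df.groupby("city")["value"]

# Aggregation: one summary row per group.
summary = grouped.agg(["mean", "count"])

# Transformation: result is indexed like the input, so it can be used
# to fill each missing value with the mean of its own group.
filled = df["value"].fillna(grouped.transform("mean"))

# Filtration: keep only rows that belong to groups with at least two members.
kept = df.groupby("city").filter(lambda g: len(g) >= 2)

print(summary)
print(filled.tolist())
print(kept)
```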

Many of these operations are defined on GroupBy objects. These operations are similar to the aggregating API, window API, and resample API.

It is possible that a given operation does not fall into one of these categories or is some combination of them. In such a case, it may be possible to compute the operation using GroupBy’s apply method. This method will examine the results of the apply step and try to return a sensibly combined result if it doesn’t fit into any of the above three categories.

Note: An operation that is split into multiple steps using built-in GroupBy operations will be more efficient than using the apply method with a user-defined Python function.
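To illustrate the note, the sketch below computes a per-group value range in two ways on randomly generated data: once with built-in GroupBy operations and once with apply and a Python lambda. Both give the same numbers. Timing is left to the reader, for example with timeit, since it depends on data size and pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"key": rng.integers(0, 100, size=10_000), "x": rng.normal(size=10_000)}
)
grouped = df.groupby("key")["x"]

# Built-in steps: two optimized aggregations, then plain arithmetic.
range_builtin = grouped.max() - grouped.min()

# Same computation through apply: the lambda is called once per group.
range_apply = grouped.apply(lambda s: s.max() - s.min())

print(np.allclose(range_builtin.to_numpy(), range_apply.to_numpy()))
```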

Since the set of object instance methods on pandas data structures are generally rich and expressive, we often simply want to invoke, say, a DataFrame function on each group. The name GroupBy should be quite familiar to those who have used a SQL-based tool (or itertools) in which you can write code like:

SELECT Column1, Column2, AVG(Column3), SUM(Column4)
FROM SomeTable
GROUP BY Column1, Column2

Pandas aims to make operations like this natural and easy to express. The rest of this section covers how to split an object into groups.

See the cookbook in the pandas documentation for some advanced strategies.
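For comparison with the SQL query shown earlier, here is one way to write the same grouping in pandas. The some_table DataFrame is a hypothetical stand-in for SomeTable, and the output column names are chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for SomeTable.
some_table = pd.DataFrame(
    {
        "Column1": ["a", "a", "b", "b"],
        "Column2": ["x", "x", "y", "z"],
        "Column3": [1.0, 3.0, 5.0, 7.0],
        "Column4": [10, 20, 30, 40],
    }
)

# Group on two columns, then take the mean of one column and the sum of another.
result = some_table.groupby(["Column1", "Column2"], as_index=False).agg(
    mean_Column3=("Column3", "mean"),
    sum_Column4=("Column4", "sum"),
)

print(result)
```

Passing as_index=False keeps the grouping columns as regular columns, which mirrors the shape of a SQL result set.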

Splitting an object into groups

In Pandas, objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object (more on what the GroupBy object is later), you may do the following:

In [1]: speeds = pd.DataFrame(
   ...:     [
   ...:         ("bird", "Falconiformes", 389.0),
   ...:         ("bird", "Psittaciformes", 24.0),
   ...:         ("mammal", "Carnivora", 80.2),
   ...:         ("mammal", "Primates", np.nan),
   ...:         ("mammal", "Carnivora", 58),
   ...:     ],
   ...:     index=["falcon", "parrot", "lion", "monkey", "leopard"],
   ...:     columns=("class", "order", "max_speed"),
   ...: )
   ...:

In [2]: speeds
Out[2]:
          class           order  max_speed
falcon     bird   Falconiformes      389.0
parrot     bird  Psittaciformes       24.0
lion     mammal       Carnivora       80.2
monkey   mammal        Primates        NaN
leopard  mammal       Carnivora       58.0

# default is axis=0
In [3]: grouped = speeds.groupby("class")

In [4]: grouped = speeds.groupby("order", axis="columns")

In [5]: grouped = speeds.groupby(["class", "order"])

The mapping can be specified in many different ways:

A Python function, to be called on each of the axis labels.
A list or NumPy array of the same length as the selected axis.
A dict or Series, providing a label -> group name mapping.
For DataFrame objects, a string indicating either a column name or an index level name to be used to group. Note that df.groupby('A') is just syntactic sugar for df.groupby(df['A']).
A list of any of the above things.
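The different ways of specifying the mapping listed above can be sketched on a small, made-up DataFrame. The group names such as "left" and "first" are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1, 2, 3, 4]},
    index=["w", "x", "y", "z"],
)

# A function, called on each index label.
by_function = df.groupby(lambda label: label in ("w", "x"))["C"].sum()

# A NumPy array with the same length as the axis.
by_array = df.groupby(np.array(["left", "left", "right", "right"]))["C"].sum()

# A dict mapping label -> group name.
by_dict = df.groupby({"w": "first", "x": "first", "y": "second", "z": "second"})[
    "C"
].sum()

# A column name, which is shorthand for passing the column itself.
by_name = df.groupby("A")["C"].sum()
by_column = df.groupby(df["A"])["C"].sum()

print(by_function, by_array, by_dict, by_name, sep="\n\n")
```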

Collectively we refer to the grouping objects as the keys.

Note: A string passed to groupby may refer to either a column or an index level. If a string matches both a column name and an index-level name, a ValueError will be raised.

For example, consider the following DataFrame:

In [6]: df = pd.DataFrame(
   ...:     {
   ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ...:         "C": np.random.randn(8),
   ...:         "D": np.random.randn(8),
   ...:     }
   ...: )
   ...:

In [7]: df
Out[7]:
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860

On a DataFrame, we obtain a GroupBy object by calling groupby(). We could naturally group by either the A or B columns or both:

In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])

If we also have a MultiIndex on columns A and B, we can group by all but the specified columns:

In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]:
            C         D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938

These will split the DataFrame on its index (rows). We could also split by the columns, although the axis argument of groupby is deprecated from pandas 2.1 onward:

In [13]: def get_letter_type(letter):
   ....:     if letter.lower() in 'aeiou':
   ....:         return 'vowel'
   ....:     else:
   ....:         return 'consonant'
   ....:

In [14]: grouped = df.groupby(get_letter_type, axis=1)

pandas Index objects support duplicate values. If a non-unique index is used as the group key in a groupby operation, all values for the same index value will be considered to be in one group, and thus the output of aggregation functions will only contain unique index values:

In [15]: lst = [1, 2, 3, 1, 2, 3]

In [16]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)

In [17]: grouped = s.groupby(level=0)

In [18]: grouped.first()
Out[18]:
1    1
2    2
3    3
dtype: int64

In [19]: grouped.last()
Out[19]:
1    10
2    20
3    30
dtype: int64

In [20]: grouped.sum()
Out[20]:
1    11
2    22
3    33
dtype: int64

Note that no splitting occurs until it’s needed. Creating the GroupBy object only verifies that you’ve passed a valid mapping.
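A small sketch of this behavior, on made-up data: creating the GroupBy object validates the key right away, while the actual grouping work is deferred until a result is requested.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1, 2, 3]})

# Creating the GroupBy object checks the key but computes no result yet.
grouped = df.groupby("A")

# An invalid key is rejected immediately, at creation time.
try:
    df.groupby("not_a_column")
    invalid_key_rejected = False
except KeyError:
    invalid_key_rejected = True

# The groups are only computed when a result is asked for.
totals = grouped["C"].sum()

print(invalid_key_rejected)
print(totals)
```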

How do we utilize Pandas 2.0 at Ambee?

At Ambee, data is paramount, and Pandas sits at the core of all our data processes, from generating simple reports to creating complex pipelines. We use it to calibrate, refine, and remove anomalies from millions of terabytes of data across all our datasets, drawn from dozens of sources. Clearly, a lot depends on it, and that’s why these updates mean a lot to us. As our senior data scientist Pareekshith Katti puts it in the quote at the top of this post, faster processing and better resource handling translate into fresher data and lower infrastructure costs.

Ambee’s data is the centerpiece of our drive to aid climate action across the world. If you’re interested in learning more about how our processes work and would like to try it out for yourself, please contact us here or leave a comment below.

Additional read

Plotting something on a map sounds simple, but it can be a tedious job. To ease the burden of doing such tasks repeatedly, we have developed a new library that plots and customizes maps in a single line of code without compromising on how they look. Read our blog on Gspatial-plot by Ambee to learn how this open-source library can benefit you.