Python Native Lists vs. NumPy Arrays

In Python, you can choose from several native data types to store collections, including list, array, tuple, and dictionary. Among these, the list is the most flexible: it can hold objects of any type and it is mutable, which makes it widely applicable. For scientific computing and purely numerical data, however, NumPy arrays have become the standard choice. So what are the differences between the two, how significant are they, and how should each be used in practice?
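As a quick illustration of that flexibility, the following minimal sketch contrasts the two containers. The variable names are illustrative only and do not come from the benchmarks in this article.

```python
import numpy as np

# A list can hold objects of different types side by side.
mixed = [1, "two", 3.0]

# A NumPy array has a single dtype; mixed int/float input is upcast to float.
arr = np.array([1, 2, 3.5])
print(arr.dtype)          # float64

# The + operator concatenates lists but adds arrays element by element.
list_result = [1, 2, 3] + [4, 5, 6]
array_result = np.array([1, 2, 3]) + np.array([4, 5, 6])
print(list_result)        # [1, 2, 3, 4, 5, 6]
print(array_result)       # [5 7 9]
```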

Of course, using practical examples is the best way to illustrate the differences.


Comparison of Operation Speed

Let’s compare two simple aggregate operations, summation and product, on the integers from 1 to 10,000.

First, Summation

# Build the test data: the integers 1 through 10,000
mylist = list(range(1, 10001))

# Native list with the built-in sum()
from time import time
start = time()
total = sum(mylist)
print(total)
end = time()
print(f"total:{end-start}s")
## 50005000
## total:0.0003197193145751953s

# NumPy array with np.sum()
import numpy as np
myarray = np.array(mylist)
start = time()
total = np.sum(myarray)
print(total)
end = time()
print(f"total:{end-start}s")

## 50005000
## total:0.00041031837463378906s

# NumPy array with Python's built-in sum()
start = time()
total = sum(myarray)
print(total)
end = time()
print(f"total:{end-start}s")

## 50005000
## total:0.0012726783752441406s

As you can see, summing the native list takes about 0.0003 seconds and NumPy’s np.sum takes about 0.0004 seconds. Python’s built-in sum() on a NumPy array is the slowest at about 0.0013 seconds, roughly four times the list time and three times the np.sum time, because it iterates over the array one element at a time. None of these figures includes the time needed to convert the list to an array. Two caveats apply: each timed region also contains a print() call, and a single time() measurement at sub-millisecond scale is noisy. Even so, for 10,000 elements the built-in list is at least competitive for summation. Many other articles compare against an explicit Python loop, which is indeed slower, but that does not reflect the true speed of the built-in functions.
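For a less noisy measurement, the same comparison can be repeated with the standard library's timeit module. This is a minimal sketch, not the method used to produce the numbers above: the repeat counts are arbitrary choices, the `candidates` and `timings` names are illustrative, and the list-to-array conversion is timed as a separate entry.

```python
import timeit

import numpy as np

mylist = list(range(1, 10001))
myarray = np.array(mylist)

# Each candidate is called 200 times per repeat; the best of 3 repeats is
# kept to reduce the influence of background noise. Nothing is printed
# inside the timed region.
candidates = {
    "sum(list)": lambda: sum(mylist),
    "np.sum(array)": lambda: np.sum(myarray),
    "sum(array)": lambda: sum(myarray),
    "np.array(list)": lambda: np.array(mylist),  # conversion cost only
}

timings = {}
for name, func in candidates.items():
    best = min(timeit.repeat(func, number=200, repeat=3))
    timings[name] = best / 200  # seconds per call
    print(f"{name:>15}: {timings[name] * 1e6:.1f} us per call")
```

Absolute numbers depend on the machine and on the Python and NumPy versions, so only the relative ordering is worth comparing.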

Next, Product

Using the same mylist data as a base, let’s compare the speeds again.

# Native list with an explicit loop
from time import time
start = time()
total = 1
for i in mylist:
    total *= i
end = time()
print(f"total:{end-start}s")
## The product is 10000!, an integer with more than 35,000 digits, so it is not printed here.

# NumPy array with np.prod()
import numpy as np
myarray = np.array(mylist)
start = time()
total = np.prod(myarray)
end = time()
print(f"total:{end-start}s")
## total:0.01838994026184082s (originally reported)
## total:0.000213623046875s (a faster run)
## Caution: myarray has a fixed-width integer dtype, so this product overflows and total is not 10000!.

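For the running product, the list version above uses an explicit loop, which is slow. (The standard library has offered math.prod since Python 3.8, so the loop is not strictly necessary.) NumPy provides np.prod, which finishes far sooner, but the comparison is not like for like. Python integers have arbitrary precision, so the loop computes the exact value of 10000!, and much of its time goes into big-integer arithmetic. np.prod on an integer array works in fixed-width integers (64-bit on most platforms) and silently overflows, so its result for this input is not the true product.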
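The difference in results can be checked directly. The sketch below uses math.prod for the exact product and forces the array dtype to int64 so the behavior is the same on every platform; the explicit dtype is an assumption of this example, since np.array(mylist) in the article uses the platform's default integer type.

```python
import math

import numpy as np

values = list(range(1, 10001))

# math.prod (standard library, Python 3.8+) multiplies Python integers,
# which have arbitrary precision, so the result is exactly 10000!.
exact = math.prod(values)
print(exact == math.factorial(10000))    # True

# np.prod on an int64 array wraps around on overflow without raising.
# 10000! contains far more than 64 factors of 2, so the wrapped value is 0.
wrapped = np.prod(np.array(values, dtype=np.int64))
print(wrapped)                           # 0

# Small inputs that fit in int64 agree with the exact result.
small = np.prod(np.arange(1, 21, dtype=np.int64))
print(int(small) == math.factorial(20))  # True
```

When an exact integer product is required, Python integers are the safer choice; np.prod is appropriate when the values are known to stay within the range of the array's dtype.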


Conclusion

This article compared the computational speed of Python’s built-in lists and NumPy arrays from a practical perspective. For summing 10,000 integers, NumPy shows no advantage, and converting a list to an array adds overhead. For the running product, np.prod is much faster than a Python loop, but on an integer array it works in fixed-width arithmetic and overflows for an input this large, whereas the loop returns the exact result. NumPy is therefore the natural choice for such reductions when fixed-width numeric types are acceptable.

In summary, lists are broadly applicable and their built-in summation is fast at this scale. For scientific computing, machine learning, and related fields, NumPy is the dominant choice: its array operations run in compiled code rather than in a Python loop, and it is the foundation for libraries such as Pandas (DataFrame, Series, and so on).
