'자동으로 페이지를 찾아주세요!' - 페이지네이션 쿼리와 응용, 실행 쿼리 측정까지

Introduction🙋‍♀️

When developing services that handle large amounts of data, implementing pagination is an inevitable part of modern development. In this post, I’ll discuss examples ranging from basic pagination to more advanced pagination features worth exploring.

What Is Pagination?📃

Definition

Generally, Pagination(페이지네이션) refers to sorting a large amount of data according to certain criteria and then displaying it in batches of a specific size. Since each group of data—divided into specific quantities—is called a “page,” this process is referred to as pagination. Common examples of pagination include bulletin board-style layouts and auto-scroll features.

It is very common for a backend engineer to receive a request like, “Please create an API that retrieves a list of posts with pagination.” To solve this, it is standard practice to use a query via an ORM.

Queries for Paginated Results

Until now, I have used the following queries to implement pagination.

sql
select * from boards
limit 5  -- page의 사이즈
offset 10  -- (page 번호 - 1) * page 사이즈

Traditionally, pagination can be achieved using limit and offset. If we implement this using Flask and the SQLAlchemy API, we can create an API with pagination functionality like the following.

python
@app.route("/api/boards")
def get_boards(page, page_size):
	offset_num = (page - 1) * page_size
	boards = (
    	Board.query
        .limit(page_size)
        .offset(offset_num)
        .all()
    )
    return [
       board._asdict()
       for board in boards
    ]

While researching for this blog post, I came across the opinion that “we shouldn’t use offset!”—which I found to be quite reasonable. However, since this is slightly off-topic for this post, I plan to study it further and summarize my findings later. Reference blog 👉 https://binux.tistory.com/148

If we add sorting and search functionality to the example above, we can implement a general API for retrieving lists of data.

Automatic pagination?🤔

Functional Definition

Then one day, I received this request from a product manager:

Please add a feature that automatically finds the ‘page number’ for a given post!

From a developer’s perspective, this request could be rephrased as follows:

Please calculate the page number in reverse using the post’s ID.

When I first heard this, I thought, “Huh? That sounds like an unfamiliar and difficult feature. Is it even possible…?” However, since they said it was absolutely! necessary from a functional standpoint, I checked to see if I could solve it technically.

Let’s write the query first

I started by writing a query in PostgreSQL. I wrote a query that could find the page for each ID.

sql
select
	b.id,
	(row_number() over() - 1) / 10 + 1 as "page_num"  -- 여기서 나눈 10이 page_size
from
	board b
order by
    b.id

The logic for calculating page_num can be broken down as follows:

ROW_NUMBER() OVER (): This is a function that assigns an order to the current row. Here, it’s used to number all rows.
1: Subtract 1 to set the order of the first row to 0.
/ 10: Divides each row’s index by page_size (10 in this case) to find the quotient. This ensures that every 10 rows share the same value.

Let’s verify the following results using a simple example.

There are approximately 100 records in a table named boards. Let’s create a query that returns the page number for each row—for pagination purposes only—when the table is sorted by id.

sql
select
	b.id,
    b.name,
	row_number() over() as "row_number_over",
	(row_number() over() - 1) / 10 as "minus_1_and_devide_10",
	(row_number() over() - 1) / 10 + 1 as "page"
from
	board b
order by
	b.id

How should we handle cases where the name in the table above does not contain '1'?

sql
select
	b.id,
    b.name,
	row_number() over() as "row_number_over",
	(row_number() over() - 1) / 10 as "minus_1_and_devide_10",
	(row_number() over() - 1) / 10 + 1 as "page"
from
	board b
where
	b.name not like '%1%'
order by
	b.id

Even if we exclude cases where the title contains '1', we can still determine the page number for each piece of content while performing pagination.

Now, if we retrieve the content ID from here and apply a WHERE clause, we can obtain the page number for that specific content.

💡 Important Note
At this point, you must use the query above as a subquery and apply a WHERE id = {id} condition. If you include it directly in the WHERE clause of the example above, it will be used in the page calculation, so you won’t get the desired page number.

sql
select
	*
from
	(
	select
		b.id,
		b.name,
		(row_number() over() - 1) / 10 + 1 as "page"
	from
		board b
	order by
		b.id
	) A -- subquery를 이용해야 한다
where
	A.id = 22

Will there be any performance issues?

My biggest concern when developing this feature was 속도. The reason was that I wondered, “If the number of data records increases due to the row_number() over() syntax, won’t the calculation speed naturally slow down?” I reasoned that with 1 million data points, the time required to calculate the row order for each data point would increase to at least O(n), even under the most optimistic estimate. So, I tested it by gradually adding data.

Let’s Measure Performance!

Using the EXPLATIN ANALYZE syntax, you can check the query execution plan and how much time each phase takes.

I’ve reused the query from above, but this time I’ve included the EXPLATIN ANALYZE syntax as well.

sql
EXPLAIN ANALYZE
select
	*
from
	(
	select
		B.id,
		B.name,
		(row_number() over() - 1) / 10 + 1 as "page"
	from
		board B
	order by
		B.id
	) A
where
	A.id = 100

The query execution plans and execution times for cases with only the data in 100개 and cases with the data in 100만개 are as follows.

When the board table contains 100 records
When the board table contains 1 million rows

We observed that when the data increased by a factor of 10,000, the execution time increased by approximately 5,560 times. In fact, with 1 million rows, the query alone took 0.5 seconds, resulting in a noticeable delay. We confirmed that if additional logic were added at the server API handler level, the API call itself could become quite slow.

How should this be resolved?

Option 1) Add a `LIMIT` clause to the subquery.

In this case, the LIMIT clause is used to reduce the time spent sorting rows in the subquery before calculating the page.

sql
EXPLAIN ANALYZE
select
	*
from
	(
	select
		B.id,
		B.name,
		(row_number() over() - 1) / 10 + 1 as "page"
	from
		board B
	order by
		B.id
	limit :target_id  -- 이곳이 추가가 된다. 여기서 target_id는 변수로 사용
	) A
where
	A.id = :target_id -- target_id를 변수로

Doing this reduces the time to retrieve a page of 100 records back down to around 0.002 seconds. However, this is not a fundamental solution. Retrieving a page of 1 million records still takes around 0.4 seconds.

Option 2) Retrieve all the data and then search for the data using binary search.

The reasoning behind this approach is that retrieving all the data would be faster than row_number() over(), and by using binary search, the time complexity could be reduced to O(log n).

sql
EXPLAIN ANALYZE
select * from board

The query above retrieves the entire table containing 1 million rows of data. It is much faster than using row_number() over().

python
def binary_search_row_number(target_id, boards):
	# 해당 id가 전체 데이터에서 몇 번째에 있는지 계산하기 위해 binary search로 찾는다.
	board_idx_dict = {
    	board.id: row_number
        for row_number, board in enumerate(boards)
    }
    low = 0
    high = TOTAL_NUMBER
    target_index = high // 2

    while low < high:
        target_index = board_idx_dict.get(target_id)
        if target_index is None:
            break
        if target_index > now_index:
            low = now_index
        elif target_index < now_index:
            high = now_index
        else:  # target_index == now_index
            break
        now_index = (high + low) // 2

    return target_index


@app.route("/api/boards")
def get_boards(page, page_size, target_id=None):
	# 자동으로 페이지를 계산하기 위해 여기를 추가 ===========
    if target_id is not None:
    	# 예제는 전체 데이터를 가져왔지만, 정렬 / 필터 등이 들어가면 id가 곧 행 순서 번호가 되지 않을 수 있다.
        total_boards = Board.query.all()
        auto_page = (binary_search_row_number(target_id) - 1) / page_size + 1
        if auto_page:
        	page = auto_page
    # ================================================
	offset_num = (page - 1) * page_size
	boards = (
    	Board.query
        .limit(page_size)
        .offset(offset_num)
        .all()
    )
    return [
       board._asdict()
       for board in boards
    ]

Since the execution time for a binary search is significantly reduced, the query can be even faster.
In this case, more memory may be required depending on the size of the data (due to loading the entire dataset). This should be carefully considered for servers with limited memory.

In Conclusion🤗

In real-world work, situations often arise where you need to apply additional functionality to basic features. In such cases, while you can solve the problem with a query, it’s also worth assessing whether a solution can be implemented at the query or backend code level to an appropriate extent. Of course, there’s no single “absolutely correct” method, but there are certainly “better” approaches.
This topic was something I hadn’t fully resolved yet (though it hasn’t been a major issue since the data volume is still small), but I’m glad it gave me the opportunity to reflect on it and study ways to improve—it was a valuable learning experience. I’ll make sure to implement these improvements properly in the future.