계층 구조의 데이터를 다루는 Recursive Query 도입기

Enter 🙋‍♀️

In this article, I'll introduce the problems we faced when fetching hierarchical data from our API server and how we improved the speed by introducing Recursive Query.

How I got into hierarchical data structures

I was taking over existing code to improve performance and further develop a category feature that allows users to categorize content.

계층형 데이터 구조 literally means that there is a hierarchy between data, which is the same as saying that there is a tree structure with parent-child relationships.
Data in this tree structure is typically,

plain
1) 글의 카테고리를 나눈다거나,
2) 폴더 구조를 지원한다거나,
3) 조직도 표현할 때

for the implementation.

The category feature I took over also deals with tree-like data, so the table design consisted of the following specs

column name	type	description
id	integer	primary key
parent_id	integer	parent category id (self-referential FK, nullable)
name	varchar	category name

(This will be the root category when parent_id is null).

API server implementation specification

Of the backend development specifications, the following are relevant to this article.

plain
Python Flask - 2.2.2
Flask-SQLAlchemy - 3.0.2

plain
PostgreSQL - 13.10

Understanding the existing code 🔎

The tables in the DB we defined earlier can be represented as ORM classes in flask using flask-sqlalchemy (ORM allows us to manipulate the data in the DB with python syntax).
We see that the Category class is defined by the following code.

python
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

# config 설정은 생략

class Category(db.Model):
	__tablename__ = "categories"

    id = db.Column(db.Integer, primary_key=True, autoincrement=True, nullable=False)
    parent_id = db.Column(db.Integer, foreign_key=[id])
    name = db.Column(db.String)

And in a recursive function (method), I was able to get all the descendants of a particular category.

python
class Category(db.Model):
    # 생략
    @classmethod
    def get(cls, id_):
    	"""특정 id 값의 데이터를 가져옵니다."""
    	return cls.query.filter(cls.id == id_).one_or_none()

    def get_descendant(self):
    	"""category 인스턴스의 모든 자식(자손)을 가져와서 dictionary 형태로 보내줍니다."""
        children = Categories.query.filter(Categories.parent == self).all()
        children_list = []
        for child in children:
            ele = child.get_descendants()
            children_list.append(ele)
        return {
            "id": self.id,
            "name": self.name
            "children": res_children
        }

python
# 다음으로 전체 카테고리 정보를 불러옵니다
category = Category.get(1)
print(category.get_descendant())

Using the above query (assuming we have random data in the DB), we can get a recursive-looking dictionary that holds a list of child categories with the key children, like this

python
{
    "id": 1,
    "name": "first",
    "children": [
        {
            "id": 2,
            "name": "second",
            "children": []
        },
        {
            "id": 3,
            "name": "third",
            "children": [
                {
                    "id": 4,
                    "name": "fourth",
                    "children": [
                      	# ... 생략
                    ]
                },
                # ... 생략
             ]
        },
        {
            # .. 생략
        }
	]
}

But there's a problem 😱

Existing problem

There is a big problem with running the above code. As the amount of data grows, it becomes too slow. In fact, when the number of categories grew to over 100, the API call took over 10 seconds just to call that query.

Why?

get_descendant Every time I call a method called all() inside a method, the API server is firing off a query to the DB. And we're also looping through the for statement, calling the get_descendant method of the child category as many times as we have data. This means that when we run the above code, the number of queries to the DB increases,

n+1 query.

When we use an ORM in our app, we need to be careful to reduce the number of queries to the DB and reduce the number of commits, because more query hits or more commits means more opening and closing of the session, which leads to performance degradation within the API handler.

Therefore, we set out to find a solution with the following objectives

plain
1. 유지해야 할 것
- 위의 dictionary 형태를 유지할 것(재귀 형태의 데이터로)
- 단, 검색 기능이 존재하고 해당 검색어에 맞는 카테고리와 그 카테고리의 조상 노드를 모두 찾아줘야 함

2. 개선해야 할 것
- 속도
- n+1 쿼리

Let's make it better 🔧

We decided that the solution to 유지해야 할 것 and 개선해야 할 것 above is a recursive query.
recursive query is one of the postgresql cte's, you can union a start query and a query to iterate over to get data recursively.

With a recursive query, you can also pull the parent_id path of each piece of data with a single query. Run the following query

sql
WITH RECURSIVE category_cte AS (
	SELECT  -- 시작 쿼리를 작성합니다, 일반적으로 시작 쿼리는 parent_id가 null 경우가 됩니다.
		id,
		ARRAY[0] AS "c_path",
		"name",
		1 AS "depth"
	FROM categories
	WHERE parent_id IS NULL
	UNION ALL
	SELECT  -- 재귀로 돌릴 쿼리를 작성하여 UNION ALL 합니다
		c.id,
		category_cte."c_path" || c.parent_id ,  -- path를 계속 찾아 줌
		c."name",
		category_cte."depth" + 1  -- 현재 데이터의 depth를 찾아 줌
	FROM categories c, category_cte
    WHERE c.parent_id = category_cte.id
)
SELECT * FROM category_cte

This should result in the following

id	c_path	name	depth
1	{0}	first	1
2	{0, 1}	second	2
3	{0, 1}	third	2
4	{0, 1, 3}	fourth	3

This way, when you write a query looking for a name, you'll be able to find all the ancestor nodes.

sql
-- "th" 라는 검색어로 데이터를 찾아올 때
WITH RECURSIVE category_cte AS (
  -- 생략
)
SELECT * FROM category_cte
WHERE "name" LIKE '%th%'

id	c_path	name	depth
3	{0, 1}	third	2
4	{0, 1, 3}	fourth	3

If we rewrite this in SQLAlchemy, we can write it like this

python
from sqlalchemy import literal
from sqlalchemy.dialects import postgresql

class Category(db.Model):
    # 생략
    @classmethod
    def get_result_recurvise(cls, term):
    	"""재귀 쿼리 ORM을 이용해 데이터를 가져올 수 있습니다."""
        start_query = db.session.query(
            cls.id,
            cls.name,
            postgresql.array([cls.parent_id]).label("c_path"),
            literal(1).label("depth")
        ).filter(
            cls.parent_id.is_(None)
        ).cte("category_cte", recursive=True)  # 재귀 쿼리임을 여기서 표현합니다

        repeat_query = db.session.query(
            cls.id,
            cls.name,
            start_query.c.c_path + postgresql.array([cls.parent_id]),
            leteral(start_query.c.depth + 1)
        ).filter(
            cls.parent_id == start_query.c.id
        )
        recursive_query = start_query.union_all(repeat_query)
        query = db.session.query(recursive_query)

        return query.filter(
            cls.name.like(f"%{term}%")
        ).all()

Result?!

Since we had a 단, 검색 기능이 존재하고 해당 검색어에 맞는 카테고리와 그 카테고리의 조상 노드를 모두 찾아줘야 함 condition on what we needed to keep, we needed to find the IDs of the specific categories and their ancestor nodes, and then do one more Category.query.filter(Category.id.in({찾아와야 할 모든 카테고리 id의 리스트})).all() I had to add some extra queries to the code and create a separate python dictionary,

plain
n+1 쿼리 문제를 해결하였고, API 호출 속도도 1초 내로 줄일 수 있었습니다.

(Other than that, API calls were in the 1-second range, although it got a little slower as more features were added and more data was consumed)

This is not the end?

This recursive pattern using the parent_id column is the easiest way to deal with hierarchical data, but there are only a limited number of databases that can use Recursive Query, so it should be used in context.

Also, I was introduced to 클로저 패턴(closure pattern) and was told that it is easier to handle data and performs better than the pattern I was using before. However, since we were already well into development, we decided that it would be difficult to change our table structure right away.
However, there were some clear advantages of the closure pattern (the ability to structure nodes with more than one parent, the ease of parent lookup/child lookup, etc.

Closing thoughts 🤗

I was able to find a more efficient way to deal with hierarchical data structures and use ORMs, and to think more about it.
However, I realized that I need to think more deeply about the data structure that fits the actual function definition, and it was a motivation for me to study more about the definitions, pros and cons of various patterns.