Enhancing CSV Processing in C++: Leveraging the Decorator Design Pattern for Transformative Column Operations

GitHub repository

During a recent project, I had to integrate several datasets into my data pipeline, all delivered as CSV files. The structure was straightforward: a timestamp, latitude, longitude, a universally unique identifier (UUID) identifying the user, and the device origin of that user ID. The vendors, however, encoded the device origin differently: some used numeric codes (0 for iOS, 1 for Android), while others spelled out names such as ‘iPhone’ or ‘Google’.

To standardize the dataset, I wanted to unify these representations into one format. Specifically, I aimed to map ‘0’ to ‘iPhone’ and ‘1’ to ‘Google’. I also had to perform several other transformations during preprocessing. Some of them formed a chain and had to be executed in a specific order for the result to be correct.

Opting for simplicity, I developed a straightforward C++ application to manage data transformations. The program reads YAML configuration files tailored to individual datasets, outlining the transformation sequences required. This approach ensures flexibility and ease of configuration while maintaining a clear and concise setup for data processing.

Below is an example of how a YAML configuration file might look for describing a transformation chain for a dataset:

# Additional configuration here

transforms:
  - type: unuuid
    column: maid
  - type: map
    column: id_type
    lookup:
      "0": "iPhone:"
      "1": "Google:"
  - type: append
    column: id_type
    from_column: maid

I won’t delve into the specifics of integrating third-party YAML libraries in C++. Instead, I’ll emphasize the use of design patterns, particularly the Decorator pattern, which has been instrumental in my implementation.
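
Before turning to the pattern, it may help to see what the three configured steps are expected to do to a single row. The sketch below writes them as plain functions. The `Row` alias and the function names are hypothetical stand-ins rather than part of the project, and leaving unmatched values untouched in the `map` step is an assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Hypothetical stand-in for a CSV row: column name -> cell value.
using Row = std::map<std::string, std::string>;

// "unuuid": strip hyphens and lowercase the value.
inline void unuuid(Row &row, const std::string &column)
{
    std::string &value = row[column];
    value.erase(std::remove(value.begin(), value.end(), '-'), value.end());
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
}

// "map": replace the value if it appears in the lookup table
// (assumption: values without a match are left unchanged).
inline void map_value(Row &row, const std::string &column,
                      const std::map<std::string, std::string> &lookup)
{
    auto it = lookup.find(row[column]);
    if (it != lookup.end())
    {
        row[column] = it->second;
    }
}

// "append": append the value of from_column to column.
inline void append(Row &row, const std::string &column, const std::string &from_column)
{
    row[column] += row[from_column];
}

// Applies the three steps in the order given in the YAML file.
inline Row apply_example_chain(Row row)
{
    unuuid(row, "maid");
    map_value(row, "id_type", {{"0", "iPhone:"}, {"1", "Google:"}});
    append(row, "id_type", "maid");
    return row;
}
```

For a row with `maid` set to `AB12-CD34` and `id_type` set to `0`, the chain leaves `maid` as `ab12cd34` and `id_type` as `iPhone:ab12cd34`. The order matters: running `append` before `unuuid` would copy the unnormalized identifier.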

Let’s begin by defining an abstract transformation interface named AbstractTransformer.

/* AbstractTransformer.h */

#ifndef ABSTRACT_TRANSFORMER_H
#define ABSTRACT_TRANSFORMER_H

#include <string>

#include "csv/CSVRow.h"

class AbstractTransformer
{
protected:
    const std::string m_column;

public:
    AbstractTransformer();
    AbstractTransformer(std::string column);
    virtual ~AbstractTransformer();
    virtual void Operation(CSVRow &row) const = 0;
};

#endif

The implementation is kept simple:

/* AbstractTransformer.cpp */

#include "AbstractTransformer.h"

AbstractTransformer::AbstractTransformer() : m_column("") {}

AbstractTransformer::AbstractTransformer(std::string column) : m_column(column) {}

AbstractTransformer::~AbstractTransformer(){};

The idea behind AbstractTransformer is to initialize it with the name of the column to transform and then execute the transformation on each row of a CSV dataset. Concrete transformers must provide an implementation of the Operation() method. Let’s proceed by defining a base transformer that simply prepends the word “BASE” to the column’s current value.

/* BaseTransformer.h */

#ifndef BASE_TRANSFORMER_H
#define BASE_TRANSFORMER_H

#include "AbstractTransformer.h"
#include "Factory.h"

class BaseTransformer : public AbstractTransformer
{
public:
    BaseTransformer(std::string column);
    virtual ~BaseTransformer() override;
    virtual void Operation(CSVRow &row) const override;

private:
    REGISTER_DEC_TYPE(BaseTransformer);
};

#endif

The implementation will look something like this:

/* BaseTransformer.cpp */

#include "BaseTransformer.h"
#include <iostream>

REGISTER_DEF_TYPE(BaseTransformer, void);

BaseTransformer::BaseTransformer(std::string column) : AbstractTransformer(column)
{
}

BaseTransformer::~BaseTransformer(){};

void BaseTransformer::Operation(CSVRow &row) const
{
    if (!row[m_column].empty())
    {
        row[m_column] = "BASE " + row[m_column];
    }
}

As mentioned, transformations often need to run as a series, each depending on the outcome of its predecessor. To address this, I implemented the Decorator pattern. It allows behavior to be added to objects dynamically by placing them inside wrapper objects. A wrapper exposes the same methods as its target and delegates incoming requests to it, but it can also modify the data before or after passing the request on. The pattern relies on composition, a favored alternative to inheritance, and is structurally similar to the Composite pattern.

The wrapper follows the same interface as the object it wraps. To ensure consistency for clients, we need to define a standard interface that all wrapper objects must adhere to. This way, regardless of which wrapper is used, clients can interact with them in the same way. Fortunately, we already have an interface for this purpose: our AbstractTransformer.

Let’s proceed by implementing our initial decorator. This specific decorator will simply pass the input row unchanged to our wrapped object, effectively performing no transformation.

/* Decorator.h */

#ifndef DECORATOR_H
#define DECORATOR_H

#include <memory>
#include "AbstractTransformer.h"

class Decorator : public AbstractTransformer
{
protected:
    std::unique_ptr<AbstractTransformer> m_transformer;

public:
    Decorator(std::string column, std::unique_ptr<AbstractTransformer> transformer);
    virtual ~Decorator() override;
    virtual void Operation(CSVRow &row) const override;
};

#endif

The implementation delegates to the wrapped transformer:

/* Decorator.cpp */

#include "Decorator.h"

// The base class is listed first because it is always initialized before members.
Decorator::Decorator(std::string column, std::unique_ptr<AbstractTransformer> transformer)
    : AbstractTransformer(column), m_transformer(std::move(transformer))
{
}

void Decorator::Operation(CSVRow &row) const
{
    if (this->m_transformer)
    {
        // Delegate work to wrapped transformer
        this->m_transformer->Operation(row);
    }
}

Decorator::~Decorator(){};
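
To make the delegation order visible, here is a minimal, self-contained sketch that does not use the project’s classes. `Row`, `Transformer`, `Tagger`, and `run_chain` are hypothetical names chosen for illustration. Each decorator first calls the object it wraps and then appends its own tag to a `trace` column.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

using Row = std::map<std::string, std::string>; // stand-in for CSVRow

// Simplified counterpart of AbstractTransformer.
struct Transformer
{
    virtual ~Transformer() = default;
    virtual void Operation(Row &row) const = 0;
};

// Hypothetical decorator: runs the wrapped transformer first, then appends
// its own tag to the "trace" column so the execution order becomes visible.
class Tagger : public Transformer
{
public:
    Tagger(std::string tag, std::unique_ptr<Transformer> inner)
        : m_tag(std::move(tag)), m_inner(std::move(inner)) {}

    void Operation(Row &row) const override
    {
        if (m_inner)
        {
            m_inner->Operation(row); // delegate to the wrapped object first
        }
        row["trace"] += m_tag; // then do this decorator's own work
    }

private:
    std::string m_tag;
    std::unique_ptr<Transformer> m_inner;
};

// Builds C(B(A)) and applies it to an empty row.
inline std::string run_chain()
{
    std::unique_ptr<Transformer> chain = std::make_unique<Tagger>("A", nullptr);
    chain = std::make_unique<Tagger>("B", std::move(chain));
    chain = std::make_unique<Tagger>("C", std::move(chain));

    Row row;
    chain->Operation(row);
    return row["trace"];
}
```

`run_chain()` returns `ABC`: the innermost object, created first, also runs first, even though the call enters through the outermost wrapper.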

Let’s create a decorator that modifies specific column values in a CSV row by removing dashes/hyphens and converting them to lowercase.

/* DecoratorUnuuid.h */

#ifndef DECORATOR_UNUUID_H
#define DECORATOR_UNUUID_H

#include "Decorator.h"
#include "Factory.h"

class DecoratorUnuuid : public Decorator
{
public:
    DecoratorUnuuid(std::string column, std::unique_ptr<AbstractTransformer> transformer);
    virtual ~DecoratorUnuuid() override;
    virtual void Operation(CSVRow &row) const override;

private:
    REGISTER_DEC_TYPE(DecoratorUnuuid);
};
#endif

The corresponding implementation is depicted below:

/* DecoratorUnuuid.cpp */

#include "DecoratorUnuuid.h"
#include <algorithm>
#include <cctype> // std::tolower
#include <iostream>

REGISTER_DEF_TYPE(DecoratorUnuuid, unuuid);

DecoratorUnuuid::DecoratorUnuuid(std::string column, std::unique_ptr<AbstractTransformer> transformer) : Decorator(column, std::move(transformer))
{
}

void DecoratorUnuuid::Operation(CSVRow &row) const
{
    Decorator::Operation(row);

    if (!row[m_column].empty())
    {
        row[m_column].erase(std::remove(row[m_column].begin(), row[m_column].end(), '-'), row[m_column].end());
        std::transform(row[m_column].begin(), row[m_column].end(), row[m_column].begin(), [](unsigned char c)
                       { return std::tolower(c); });
    }
}

DecoratorUnuuid::~DecoratorUnuuid(){};

Let’s proceed to create another decorator that appends the value row[from_column] to row[column].

/* DecoratorAppend.h */

#ifndef DECORATOR_APPEND_H
#define DECORATOR_APPEND_H

#include "Decorator.h"
#include "Factory.h"
#include "config/ConfigParser.h"
#include <string>

class DecoratorAppend : public Decorator
{
private:
    REGISTER_DEC_TYPE(DecoratorAppend);
    std::string from_column;

public:
    DecoratorAppend(std::string column, std::unique_ptr<AbstractTransformer> transformer);
    virtual ~DecoratorAppend() override;
    virtual void Operation(CSVRow &row) const override;
    friend class ConfigParser;
};

#endif

With the corresponding implementation:

/* DecoratorAppend.cpp */

#include "DecoratorAppend.h"

REGISTER_DEF_TYPE(DecoratorAppend, append);

DecoratorAppend::DecoratorAppend(std::string column_, std::unique_ptr<AbstractTransformer> transformer_) : Decorator(column_, std::move(transformer_))
{
}

void DecoratorAppend::Operation(CSVRow &row) const
{
    Decorator::Operation(row);

    if (!row[m_column].empty() && !row[from_column].empty())
    {
        row[m_column] += row[from_column];
    }
}

DecoratorAppend::~DecoratorAppend(){};
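
The `map` transformation from the YAML example is not listed in this post. As a rough idea of what such a decorator could look like, here is a self-contained sketch. It is not the project’s DecoratorMap: `Row`, `Transformer`, `MapSketch`, and `map_id_type` are hypothetical names, and leaving unmatched values unchanged is an assumption.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

using Row = std::map<std::string, std::string>; // stand-in for CSVRow

struct Transformer // stand-in for AbstractTransformer
{
    virtual ~Transformer() = default;
    virtual void Operation(Row &row) const = 0;
};

// Hypothetical sketch of a lookup-based decorator.
class MapSketch : public Transformer
{
public:
    MapSketch(std::string column, std::unique_ptr<Transformer> inner)
        : m_column(std::move(column)), m_inner(std::move(inner)) {}

    // Would be filled from the YAML "lookup" node in a real program.
    std::map<std::string, std::string> lookup;

    void Operation(Row &row) const override
    {
        if (m_inner)
        {
            m_inner->Operation(row); // run the wrapped chain first
        }
        auto cell = row.find(m_column);
        if (cell == row.end())
        {
            return; // column missing: nothing to map
        }
        auto hit = lookup.find(cell->second);
        if (hit != lookup.end())
        {
            cell->second = hit->second; // assumption: unmatched values stay as-is
        }
    }

private:
    std::string m_column;
    std::unique_ptr<Transformer> m_inner;
};

// Small helper that maps a single id_type value with the example table.
inline std::string map_id_type(const std::string &raw)
{
    MapSketch mapper("id_type", nullptr);
    mapper.lookup = {{"0", "iPhone:"}, {"1", "Google:"}};

    Row row;
    row["id_type"] = raw;
    mapper.Operation(row);
    return row["id_type"];
}
```

With this table, `map_id_type("0")` yields `iPhone:` and `map_id_type("1")` yields `Google:`, while any other value is returned unchanged.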

I think by now you get the point but you might be curious about the functionalities of REGISTER_DEC_TYPE(...) and REGISTER_DEF_TYPE(...). Here I am leveraging another established pattern, the Factory Method pattern, which advocates for substituting direct object construction calls (utilizing the new operator) with invocations of a dedicated factory method.

The idea is that when a developer creates a new decorator, it automatically registers itself with our factory. The factory shouldn’t require any modifications whenever a new decorator is added to the project.
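
Stripped of the Decorator specifics, the self-registration idea can be sketched in a few lines. Everything below is illustrative: `Step`, `Registrar`, `registry`, `create`, and the two concrete types are hypothetical names, and the sketch uses a function-local static for the map instead of the lazily allocated pointer used in the project.

```cpp
#include <map>
#include <memory>
#include <string>

struct Step // hypothetical product type
{
    virtual ~Step() = default;
    virtual std::string Name() const = 0;
};

using Creator = std::unique_ptr<Step> (*)();

// Function-local static: constructed on first use, so it is ready
// no matter which registrar runs first.
inline std::map<std::string, Creator> &registry()
{
    static std::map<std::string, Creator> instance;
    return instance;
}

// Constructing a Registrar<T> adds T to the registry under `name`.
template <typename T>
struct Registrar
{
    explicit Registrar(const std::string &name)
    {
        registry()[name] = []() -> std::unique_ptr<Step> { return std::make_unique<T>(); };
    }
};

struct TrimStep : Step
{
    std::string Name() const override { return "trim"; }
    static Registrar<TrimStep> reg; // declaration, like REGISTER_DEC_TYPE
};
Registrar<TrimStep> TrimStep::reg("trim"); // definition, like REGISTER_DEF_TYPE

struct UpperStep : Step
{
    std::string Name() const override { return "upper"; }
    static Registrar<UpperStep> reg;
};
Registrar<UpperStep> UpperStep::reg("upper");

// Looks up a creator by name; returns nullptr for unknown names.
inline std::unique_ptr<Step> create(const std::string &name)
{
    auto it = registry().find(name);
    if (it == registry().end())
    {
        return nullptr;
    }
    return it->second();
}
```

Adding a third step type only requires a new class with its own `reg` member. Neither `registry()` nor `create()` has to change, which is the property the factory in this post is after.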

Let’s declare a few macros, function templates, and structs that will help us create transformers:

/* Factory.h */

#ifndef FACTORY_H
#define FACTORY_H

#include <map>
#include <memory>
#include <string>
#include <type_traits> // std::enable_if_t, std::is_base_of_v

#include "AbstractTransformer.h"
#include "Decorator.h"

#define REGISTER_DEC_TYPE(TYPE_NAME) \
    static DecoratorRegister<TYPE_NAME> reg

#define REGISTER_DEF_TYPE(TYPE_NAME, CONFIG_NAME) \
    DecoratorRegister<TYPE_NAME> TYPE_NAME::reg(#CONFIG_NAME)

// 1.
template <typename T, std::enable_if_t<std::is_base_of_v<Decorator, T>, bool> = true>
std::unique_ptr<AbstractTransformer> create_T(std::string column, std::unique_ptr<AbstractTransformer> wrapped)
{
    return std::make_unique<T>(column, std::move(wrapped));
}

// 2.
template <typename T, std::enable_if_t<!std::is_base_of_v<Decorator, T>, bool> = true>
std::unique_ptr<AbstractTransformer> create_T(std::string column, std::unique_ptr<AbstractTransformer> wrapped = nullptr)
{
    return std::make_unique<T>(column);
}

struct Factory
{
    template <typename T>
    using MFP = std::unique_ptr<T> (*)(std::string, std::unique_ptr<T>);
    using map_type = std::map<std::string, MFP<AbstractTransformer>>;

private:
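    // Declaration only; needs one definition in a source file, e.g.
    // Factory::map_type *Factory::map = nullptr;
    static map_type *map;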

protected:
    static map_type *get_map()
    {
        // Never deleted. Exists until program termination.
        if (!map)
        {
            map = new map_type;
        }
        return map;
    }

public:
    static std::unique_ptr<AbstractTransformer> get_instance(const std::string &s, const std::string column, std::unique_ptr<AbstractTransformer> wrapped = nullptr)
    {
        map_type::iterator it = get_map()->find(s);
        if (it == get_map()->end())
        {
            return nullptr;
        }
        else
        {
            return it->second(column, std::move(wrapped));
        }
    }
};

template <typename T>
struct DecoratorRegister : Factory
{
    DecoratorRegister(const std::string &config_name)
    {
        get_map()->insert(std::make_pair<std::string, MFP<AbstractTransformer>>(static_cast<std::string>(config_name), &create_T<T>));
    }
};

#endif

Factory maintains a single static member variable, a map. The map associates the name of a transformer with a function pointer. The function it points to takes two arguments: a column name (string) and a pointer to an AbstractTransformer, which serves as the wrapped object in the Decorator pattern. C++ does not allow taking the address of a constructor, so the pointer refers to an instantiation of create_T<T>, which in turn calls the constructor of the transformer. The Factory class also has two static methods. get_map() gives access to the map, while get_instance() accepts a type name, a column name, and an optional target transformer. It looks up the type name in the map and calls the associated function, which constructs a new decorator instance that wraps the target object. If the name is unknown, it returns nullptr.

To achieve this, I’ve implemented two create_T<T>(...) function templates. One passes a transformer instance to the constructor, while the other does not. To determine which one to use, I check whether Decorator is a base class of T. If it is, the compiler selects the create_T<T>(...) overload that passes a wrapped object to the constructor. Otherwise, it selects the overload that calls a constructor without a target object. This relies on C++ type traits and the Substitution Failure Is Not An Error (SFINAE) principle.

When Decorator is a base class of T, std::is_base_of_v<Decorator, T> evaluates to true. In that case std::enable_if<true, bool> has a public member typedef named type, which is bool, so the std::enable_if_t expression is simply bool. The template parameter bool = true is therefore an unnamed non-type template parameter that defaults to true, so no explicit value has to be passed. Because Decorator is a base class of T, the compiler chooses method 1. A simplified view of how enable_if is defined:

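// Primary template: no member typedef when B is false
template <bool B, typename T = void>
struct enable_if {};

// Partial specialization: provides `type` only when B is true
template <typename T>
struct enable_if<true, T> { typedef T type; };

template <bool B, class T = void>
using enable_if_t = typename enable_if<B, T>::type;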

If Decorator is not a base of T, the condition is false and std::enable_if has no member typedef called type. Accessing enable_if<B,T>::type with B = false selects the primary template of struct enable_if, which has no type member, so forming the type of the template parameter results in a substitution failure. According to the C++11 standard, such a substitution failure makes template argument deduction fail for this candidate only. No error is reported. The compiler simply discards the candidate and considers the others, so in this case it chooses method 2.
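
The selection mechanism can be tried in isolation. The following sketch uses the same `enable_if_t` and `is_base_of_v` construction on two toy overloads. `Wrapper`, `LowerCase`, `Standalone`, and `describe` are hypothetical names, and the code assumes a C++17 compiler.

```cpp
#include <string>
#include <type_traits>

struct Wrapper {};             // plays the role of Decorator
struct LowerCase : Wrapper {}; // derives from Wrapper
struct Standalone {};          // unrelated type

// Overload 1: participates only when T derives from Wrapper.
template <typename T, std::enable_if_t<std::is_base_of_v<Wrapper, T>, bool> = true>
std::string describe()
{
    return "wraps another object";
}

// Overload 2: participates only when T does not derive from Wrapper.
template <typename T, std::enable_if_t<!std::is_base_of_v<Wrapper, T>, bool> = true>
std::string describe()
{
    return "stands alone";
}
```

`describe<LowerCase>()` resolves to overload 1 and `describe<Standalone>()` to overload 2. For any given T exactly one condition holds, so the call is never ambiguous.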

Finally, we define two macros: REGISTER_DEC_TYPE(T) and REGISTER_DEF_TYPE(T, name).

  • REGISTER_DEC_TYPE(T) serves as a shorthand for declaring a static variable DecoratorRegister<T> reg;.

  • REGISTER_DEF_TYPE(T, name) initializes the static variable declared by REGISTER_DEC_TYPE(T) by providing a name as a constructor argument.

In essence, these macros simplify the process of declaring and initializing static variables within the context of decorator registration.

One caveat: the registration relies on dynamic initialization of objects with static storage duration. If an implementation defers that initialization past the start of main, it is only guaranteed to happen before the first use of a function or variable defined in the same translation unit. If there are no such functions, or your program never calls them, there is no guarantee the object will ever be initialized.
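
A small experiment makes the initialization behavior observable. In this sketch, `Announcer` and `init_log` are hypothetical helpers. Objects defined at namespace scope in one translation unit are initialized in order of definition, and each constructor records its name.

```cpp
#include <string>
#include <vector>

// Function-local static, so the log exists before any Announcer runs.
inline std::vector<std::string> &init_log()
{
    static std::vector<std::string> log;
    return log;
}

// Hypothetical helper whose constructor has a visible side effect.
struct Announcer
{
    explicit Announcer(const std::string &name) { init_log().push_back(name); }
};

// Namespace-scope objects with dynamic initialization. Within one
// translation unit they are initialized in order of definition.
static Announcer first("first");
static Announcer second("second");
```

When this is compiled into the same translation unit as the code that reads `init_log()`, the log holds `first` followed by `second`. Across translation units no such order is defined, which is why the log, like the factory’s map, is created on first access rather than as a plain global.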

To conclude, we’ll put all components together and define a function that uses our factory to create a new transformer chain.

std::vector<std::unique_ptr<AbstractTransformer>> Transformers()
{
    std::vector<std::unique_ptr<AbstractTransformer>> transformers;
    if (m_config["transforms"])
    {
        std::unique_ptr<AbstractTransformer> last_ptr;
        for (const auto &d : m_config["transforms"])
        {
            std::string column = d["column"].as<std::string>();
            std::string type = d["type"].as<std::string>();
            std::unique_ptr<AbstractTransformer> ptr = Factory::get_instance(type, column, std::move(last_ptr));
            if (ptr)
            {
                if (!type.compare("map"))
                {
                    auto lookup = d["lookup"];
                    for (YAML::const_iterator it = lookup.begin(); it != lookup.end(); ++it)
                    {
                        std::string key = it->first.as<std::string>();
                        std::string value = it->second.as<std::string>();
                        (dynamic_cast<DecoratorMap *>(ptr.get())->lookup).insert(std::make_pair(key, value));
                    }
                }
                else if (!type.compare("append"))
                {
                    std::string from_column = d["from_column"].as<std::string>();
                    dynamic_cast<DecoratorAppend *>(ptr.get())->from_column = from_column;
                }
                else if (!type.compare("prepend"))
                {
                    std::string from_column = d["from_column"].as<std::string>();
                    dynamic_cast<DecoratorPrepend *>(ptr.get())->m_from_column = from_column;
                }
                else if (!type.compare("set"))
                {
                    std::string value = d["value"].as<std::string>();
                    dynamic_cast<DecoratorSet *>(ptr.get())->m_value = value;
                }

                std::cout << "Created transformer type '" << type << "' for column '" << column + "'" << std::endl;
                last_ptr = std::move(ptr);
            }
            else
            {
                std::cerr << "Unknown type '" << type << "'" << std::endl;
                kill(getpid(), SIGINT);
            }
        }

        if (last_ptr)
        {
            transformers.push_back(std::move(last_ptr));
        }
    }

    return transformers;
}

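The code iterates over the list of transformer entries in the YAML configuration, each with its type and parameters. In each iteration, it constructs a new transformer that wraps the previously created one. Because the decorators shown here delegate to the wrapped object before doing their own work, the transformations run in the order in which they appear in the configuration. On completion, the function returns a vector whose single element is the outermost transformer, which owns the rest of the chain.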
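
The ownership handoff in that loop can be reproduced without YAML or the factory. The sketch below is a simplified, self-contained version. `Row`, `Transformer`, `ColumnEdit`, `Step`, `build_chain`, and the two edit functions are hypothetical names, and the small lookup table stands in for the factory.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using Row = std::map<std::string, std::string>; // stand-in for CSVRow

struct Transformer // stand-in for AbstractTransformer
{
    virtual ~Transformer() = default;
    virtual void Operation(Row &row) const = 0;
};

// Hypothetical decorator: applies `edit` to one column after the wrapped chain ran.
class ColumnEdit : public Transformer
{
public:
    using Edit = void (*)(std::string &);

    ColumnEdit(std::string column, Edit edit, std::unique_ptr<Transformer> inner)
        : m_column(std::move(column)), m_edit(edit), m_inner(std::move(inner)) {}

    void Operation(Row &row) const override
    {
        if (m_inner)
        {
            m_inner->Operation(row);
        }
        m_edit(row[m_column]);
    }

private:
    std::string m_column;
    Edit m_edit;
    std::unique_ptr<Transformer> m_inner;
};

inline void to_lower(std::string &s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
}

inline void add_suffix(std::string &s) { s += "X"; }

struct Step
{
    std::string type;
    std::string column;
};

// Each new object takes ownership of the chain built so far.
// Returns nullptr if a step type is unknown.
inline std::unique_ptr<Transformer> build_chain(const std::vector<Step> &steps)
{
    const std::map<std::string, ColumnEdit::Edit> edits = {{"lower", &to_lower},
                                                           {"suffix", &add_suffix}};

    std::unique_ptr<Transformer> last; // chain built so far (empty at first)
    for (const Step &step : steps)
    {
        auto it = edits.find(step.type);
        if (it == edits.end())
        {
            return nullptr; // unknown type
        }
        last = std::make_unique<ColumnEdit>(step.column, it->second, std::move(last));
    }
    return last;
}
```

Given the steps `lower` and then `suffix` on the same column, the value `ADA` becomes `adaX`. Swapping the two steps would lowercase the suffix as well, which shows that the configuration order is preserved.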

I trust you found this post informative. While utilizing a list of transformers could suffice, I opted to introduce the Decorator pattern for educational purposes. Undoubtedly, enhancements such as avoiding hardcoded type names when invoking the factory method could be implemented. Nonetheless, for conveying my message, the current approach suffices admirably.

This post is licensed under CC BY 4.0 by the author.