Post

A Quick and Simple C++ Tool for Comparing CSV File Columns

A Quick and Simple C++ Tool for Comparing CSV File Columns

GitHub repository

This is another simple project inspired by some work I did in the past. I’m posting it here both as a future reference for myself and as a resource for other engineers who might encounter similar challenges. The script is written in C++ and is designed to read a variadic number of CSV files (at least two) and column indices for each file. It compares the CSV files based on the specified columns, and if any corresponding columns have matching values, the script notifies you.

As always, the code is mostly self-explanatory, so I won’t dive into every detail. I’ve included a GitHub repository link in this post, so be sure to check it out if you’re interested in the full implementation. The repository also includes helper scripts for generating input files for benchmarking.

The tool focuses on performance and handling large files, which is why I use disk-based file comparison. While disk-based comparison may seem counterintuitive to performance, loading large files into memory is not always practical.

Let’s jump into the main() method. To keep this post concise and focused on the core functionality, I’ve omitted some instantiation methods, include statements, and helper functions. For those interested, the complete code is available in the GitHub repository.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
int main(int argc, char **argv)
{
    /* omitted code */
    /*     ...      */

    auto worker = [&shutdown_requested, &cv_mutex, &cv, &fnames, &column_indices]()
    {
        std::unique_ptr<AbstractComparator> comparator = std::make_unique<DiskBasedComparator>(fnames, column_indices);
        comparator->compare();
        auto const &result = comparator->result();
        print_result(result);

        while (shutdown_requested.load() == false)
        {
            std::unique_lock lock(cv_mutex);
            // wait for up to an hour
            cv.wait_for(lock, std::chrono::seconds(1),
                        // when the condition variable is woken up and this predicate returns true, the wait is stopped
                        [&shutdown_requested]()
                        {
                            bool stop_waiting = shutdown_requested.load();
                            return stop_waiting; });
        }

        return shutdown_requested.load();
    };

    // spawn some workers
    std::vector<std::future<bool>> workers;
    for (int i = 0; i < 1; ++i)
    {
        workers.push_back(std::async(std::launch::async, worker));
    }

    std::cout << "Waiting for SIGTERM or SIGINT ([CTRL]+[c])...\n";

    int signal = ft_signal_handler.get();
    std::cout << "Received signal " << signal << "\n";

    // wait for workers
    for (auto &future : workers)
    {
        std::cout << "Worker observed shutdown request: " << std::boolalpha << future.get() << "\n";
    }

    std::cout << "Clean shutdown\n";

    return 0;
}

Essentially, I instantiate a lambda function (worker) that is invoked asynchronously in a separate thread to perform the comparison work. The program then waits for each the to finish.

The DiskBasedComparator reads each CSV file along with the list of columns to be considered for comparison. For each specified column, it creates a temporary file containing only that column using the extract_columns() method. For example, if we want to compare columns 0 and 2 from the file “ABC.csv” with colum 1 from the file “XYZ.csv”, it will generate three temporary files: ABC.csv0_tmp, ABC.csv2_tmp and XYZ.csv1_tmp, each containing the respective column from the original file.

In what can be considered a nested for-loop, the code compares each column file with every other column file (files belonging to the same original file are not compared with each other). For example, the outer loop iterates over the lines of the file ABC.csv0_tmp, while the inner loop iterates over the lines of the file XYZ.csv1_tmp. Each line is compared for equality, and if they match, the inner loop’s file name (XYZ.csv) along with the matching column index (1) is stored for the corresponding line in the outer loop.

If we have n files to compare, each seek value (line from the outer loop) should appear in exactly n-1 other files. If this condition is met, the matched string and its corresponding columns are stored in a global result map. This map essentially tracks which values were found across all files and the specific columns where the matches occurred.

The code also incorporates some performance optimizations, including heuristics and short-circuiting. For example, it starts the outer loop with the file that has the fewest rows, among other strategies, to improve efficiency.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
DiskBasedComparator::Filename_To_Pairs_Linecount_TmpFilename DiskBasedComparator::extract_columns(const std::map<std::string, std::set<int>> &fnames_cols)
{
    DiskBasedComparator::Filename_To_Pairs_Linecount_TmpFilename col_fnames;
    for (const auto &[fname, columns] : fnames_cols)
    {
        if (col_fnames.find(fname) == col_fnames.end())
        {
            col_fnames.emplace(std::make_pair(fname, std::vector<std::pair<int, std::string>>()));
        }

        for (auto &col : columns)
        {
            std::string col_fname = extract_column(fname.c_str(), col, &DiskBasedComparator::column_val_callback);
            if (col_fname.empty())
            {
                err(1, "Column %d for %s appears to be empty.", col, fname.c_str());
            }

            // For each file to compare store the number of lines of its shortest column
            col_fnames[fname].emplace_back(std::make_pair(col, col_fname));
        }
    }
    return col_fnames;
}

bool DiskBasedComparator::compare()
{
    for (int i = 0; i < fnames.size(); ++i)
    {
        fnames_cols.emplace(std::make_pair(fnames[i], Utilities::split(column_indices[i], ',')));
    }

    std::cout << "Comparing...\n";
    col_fnames = extract_columns(fnames_cols);
    DiskBasedComparator::Filename_To_Linecount line_count = compute_line_count(col_fnames);
    std::string fname_seek = file_with_min_count(line_count);
    unsigned long long total_comparissons = compute_total_comparissons(line_count);

    if (total_comparissons == 0)
    {
        std::cout << "Nothing to compare. Bye!\n";
    }
    else
    {
        std::cout << "Status:\n";
        unsigned long total_match_count = 0;
        std::size_t bar_len = std::min<unsigned long>(30, total_comparissons);
        unsigned long progress_chunk = total_comparissons / bar_len;
        unsigned long current_count = 0;
        std::string bar;

        // We start with the file that has the least rows
        for (auto &col_fname : col_fnames[fname_seek])
        {
            std::ifstream file_seek(col_fname.second);
            for (auto val_seek : ColumnRange(file_seek))
            {
                // If we have n files to compare, each seek value needs to appear in exactly n-1 other files
                int required_match_count = fnames.size() - 1;

                match_result cur_result;
                cur_result.val = val_seek;
                cur_result.fname_col.emplace(std::make_pair(fname_seek, std::set<int>()));
                cur_result.fname_col[fname_seek].emplace(col_fname.first);

                // Iterate over the rest of the CSV files
                for (const auto &[fname_cur, col_fnames_cur] : col_fnames)
                {
                    // Make sure we are not comparing with self
                    if (fname_seek.compare(fname_cur) != 0)
                    {
                        // Starting a new CSV file. Set match to false.
                        bool match = false;

                        // Iterate over the columns
                        for (const auto &col_fname_cur : col_fnames_cur)
                        {
                            std::ifstream file_cur(col_fname_cur.second);
                            for (auto val_cur : ColumnRange(file_cur))
                            {
                                if (val_seek.compare(val_cur) == 0)
                                {
                                    match = true;
                                    if (cur_result.fname_col.find(fname_cur) == cur_result.fname_col.end())
                                    {
                                        cur_result.fname_col.emplace(std::make_pair(fname_cur, std::set<int>()));
                                    }
                                    cur_result.fname_col[fname_cur].emplace(col_fname_cur.first);
                                }

                                ++current_count;
                                if (current_count % progress_chunk == 0 || current_count == total_comparissons)
                                {
                                    long perc = current_count / (total_comparissons * 1.0) * 100.0;
                                    show_progress(perc, bar, bar_len, total_match_count);
                                }
                            }
                        }

                        if (match)
                        {
                            // We had a match  in one of the columns of the current file.
                            --required_match_count;
                        }
                        else
                        {
                            // If we cannot find a match in any one the the CSV files, then there is no need to continue with the rest
                            break;
                        }
                    }
                }

                if (required_match_count == 0)
                {
                    process_result(cur_result);
                    ++total_match_count;
                }
            }
        }
    }

    return true;
}

void DiskBasedComparator::process_result(match_result &cur_result)
{
    if (result_set.find(cur_result.val) == result_set.end())
    {
        result_set.emplace(std::make_pair(cur_result.val, cur_result.fname_col));
    }
    else
    {
        for (const auto &[tmp_fname, columns] : cur_result.fname_col)
        {
            if (result_set[cur_result.val].find(tmp_fname) == result_set[cur_result.val].end())
            {
                result_set[cur_result.val].emplace(std::make_pair(tmp_fname, std::set<int>()));
            }
            result_set[cur_result.val][tmp_fname].insert(cur_result.fname_col[tmp_fname].begin(), cur_result.fname_col[tmp_fname].end());
        }
    }
}

I believe this covers the main idea behind the script. For the full code, please refer to the provided GitHub repository. I’ll conclude the post here, as going over each line would be beyond the scope of this discussion. If you have any questions, feel free to reach out.

This post is licensed under CC BY 4.0 by the author.