A flawed method for merging two datasets

Here is the original method for merging two datasets. It has some pretty significant flaws in both performance and design.

export function mergeData(data1, data2) {
    const extraFields = data1.reduce((fields, obj) => {
        Object.keys(obj).forEach(key => {
            if (!fields.includes(key)) {
                fields.push(key);
            }
        });
        return fields;
    }, []);
    data2.forEach(obj2 => {
        extraFields.forEach(field => {
            const matchingObj1 = data1.find(obj1 => obj1.DocumentNumber_id === obj2.DocumentNumber);
            obj2[field] = matchingObj1 ? matchingObj1[field] : null;
        });
    });
}

Here is a more efficient drop-in replacement:

export function fastMerge(data1, data2) {
    const reshaped = Object.fromEntries(data1.map(o => [o.DocumentNumber_id, o]))
    data2.forEach(obj => Object.assign(obj, reshaped?.[obj.DocumentNumber]))
}

Efficiency

The first, and largest, flaw with this code is its efficiency: it loops over the same data multiple times in multiple ways, repeating work that only needs to be done once.

We can do some Big-O Notation analysis to determine just how bad it can get.

export function mergeData(data1, data2) {
    const extraFields = data1.reduce((fields, obj) => { // N(1)
        Object.keys(obj).forEach(key => { // M(2)
            if (!fields.includes(key)) { // M(2)
                fields.push(key);
            }
        });
        return fields;
    }, []);
    data2.forEach(obj2 => { // N(1)
        extraFields.forEach(field => { // M(2)
            const matchingObj1 = data1.find(obj1 => obj1.DocumentNumber_id === obj2.DocumentNumber); // N(1)
            obj2[field] = matchingObj1 ? matchingObj1[field] : null;
        });
    });
}
  1. O(N) speed, looping over the whole dataset
  2. O(M) speed, looping over all of the fields in the item

Ultimately, this has a time complexity of O(M²N + MN²). Hypothetically either M or N could be large, but in practice M caps out at a few dozen values (because there are only so many fields in the model), whereas N can approach 100k, so the MN² term dominates and the runtime grows roughly quadratically with the number of records. With M around 30 and N around 100,000, that works out to something on the order of 3 × 10¹¹ comparisons in the worst case. This code runs fine with small datasets, but it can take minutes to execute with tens or hundreds of thousands of objects to merge together.

It's also worth noting that nesting the data1.find() inside the extraFields.forEach() is particularly egregious from a performance perspective. It doesn't change the asymptotic class, but it multiplies the quadratic work by M: by finding the data1 object with the matching document number inside the extraFields.forEach, rather than outside of it, the code must scan the whole data1 array once for every single field, for every object in data2. Instead, the find should have been done outside the extraFields.forEach, only once for each of the data2 values.
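
To make that concrete, here is a rough sketch of that intermediate fix (the function name is just for illustration, and the field collection is rewritten with a Set for brevity). It is still quadratic in N, but it stops repeating the scan for every field:

export function mergeDataHoistedFind(data1, data2) {
    // Collect every field name seen anywhere in data1 (same result as the original reduce)
    const extraFields = [...new Set(data1.flatMap(obj => Object.keys(obj)))];
    data2.forEach(obj2 => {
        // Hoisted: scan data1 once per data2 object instead of once per field
        const matchingObj1 = data1.find(obj1 => obj1.DocumentNumber_id === obj2.DocumentNumber);
        extraFields.forEach(field => {
            obj2[field] = matchingObj1 ? matchingObj1[field] : null;
        });
    });
}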

Optimized Version

export function fastMerge(data1, data2) {
    const reshaped = Object.fromEntries(data1.map(o => [o.DocumentNumber_id, o])) // 2N
    data2.forEach(obj => Object.assign(obj, reshaped?.[obj.DocumentNumber])) // N
}

This version performs roughly 3N operations instead, which simplifies to O(N) and is dramatically faster than the previous version. It achieves the improvement by first indexing all of the incoming data by document number, building a document-number-keyed object to hold it all. That reduces each lookup from O(N) to O(1), so the merge pass runs in O(N) time instead of O(N²) (which is significant when N is in the ballpark of 100,000 for a year's data). After indexing the incoming data, it simply loops over the target data, using Object.assign() to merge the matching incoming object into each target object.

Warning

One thing this version does not do is guarantee that every key present anywhere in the data1 dataset ends up on every data2 object; the original explicitly filled unmatched fields with null. However, that shouldn't be a significant problem. Optional chaining (?.) and nullish coalescing (??) operators in the right spots will handle missing values cleanly.
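
For instance, downstream code reading the merged rows can guard against fields that never got filled in; the field names below are placeholders, not part of the original data model:

// Hypothetical consumer of the merged rows; Vendor and Amount are placeholder field names
const rows = data2.map(row => ({
    vendor: row.Vendor ?? 'Unknown',        // ?? supplies a fallback when the field was never merged in
    amount: row.Amount?.toFixed(2) ?? ''    // ?. short-circuits to undefined when Amount is missing
}));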

In this particular situation, the dataset was being used for a DataTables table (another large performance hit, but one that's less trivial to fix), which needed a defaultContent in the columnDefs to keep it from complaining about missing values.
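
A minimal sketch of that DataTables configuration might look like the following, where the table selector and column names are placeholders:

$('#report-table').DataTable({
    data: data2,
    columns: [
        { data: 'DocumentNumber' },
        { data: 'Vendor' },   // placeholder column
        { data: 'Amount' }    // placeholder column
    ],
    columnDefs: [
        // Render an empty string for any cell whose field is missing from the row,
        // instead of DataTables warning about unknown data
        { targets: '_all', defaultContent: '' }
    ]
});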

Design

In addition to the performance issues, the function has two significant design flaws.

First, it's a highly specialized method masquerading as a generic one. It completely depends on the DocumentNumber_id and DocumentNumber keys to do the merging; without those keys, the method can't do anything at all.

Second, it mutates the data in place. That isn't always a bad thing, but a function that silently mutates its arguments is generally less desirable than one that returns a new value the caller can use as needed.

Potential solution

Here is a potential solution to both issues. The invocation changes slightly, and you need to capture and store the return value, but it allows much greater flexibility in how the method can be used.

/**
 * Merge two arrays of objects
 * @param {object} target           An object holding the target data to be merged into
 * @param {Array} target.data       The array of target data objects to merge new data into
 * @param {string} [target.key]     The key to merge on, defaults to `id` if not provided
 * @param {Object} source           An object holding the source data to merge in
 * @param {Array} source.data       The array of source data to merge in
 * @param {string} [source.key]     The key to merge on, defaults to `id` if not provided
 * @returns {Object[]}              An array of objects merged on the provided keys
 */
export function mergeData(target, source) {
    const reshaped = Object.fromEntries(source.data.map(o => [o[source.key ?? 'id'], o]))
    return target.data.map(obj => ({ ...obj, ...reshaped[obj[target.key ?? 'id']] }))
}

This version does two things. First, it takes in a pair of objects in the shape {data: Array, key: string}, which lets callers specify the unique key to merge each dataset on when the default of id isn't appropriate. Second, it builds the final output by mapping over the target data and returning a new array, instead of mutating in place.
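
Against the original datasets, the invocation would look something like this (the variable names mirror the earlier examples):

// data2 is the dataset being displayed; data1 supplies the extra fields to merge in
const merged = mergeData(
    { data: data2, key: 'DocumentNumber' },
    { data: data1, key: 'DocumentNumber_id' }
);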

However, there are two caveats with this method. First, it can only merge two datasets; it isn't useful when you have more than two to combine. Second, it shares a flaw with the original in that it isn't a true merge: anything in the source that doesn't match a key in the target is simply skipped and ignored. Neither is a fatal flaw in the original context, but addressing them is a natural extension of the functionality.

/**
 * @typedef DatasetData
 * @property {Array} data    The array of data to merge in
 * @property {string} [key]  The key to merge on, defaults to `id` if not provided
 */

/**
 * Merge multiple arrays of objects
 * @param {DatasetData[]} datasets      An array of objects containing the data to merge and an optional key to merge on 
 * @returns {Object[]}                  An array of objects produced by the merge
 */
export function mergeData(datasets) {
    const output = {}
    for (const { data, key } of datasets) {
        // Index each object by its merge key; later datasets overwrite earlier ones on conflicting fields
        data.forEach(obj => output[obj[key ?? 'id']] = { ...output[obj[key ?? 'id']], ...obj })
    }
    return Object.values(output)
}

This alternate version takes an array of objects, each with its own data and key, and merges them all together into one array to return. The design could be extended further with options for things like keeping only one final key field in the output, excluding objects with no match in the first dataset (as the original version does), or early-return handling when zero or one datasets are provided.
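
As a rough usage sketch, with a hypothetical third dataset added to show the multi-dataset case:

const merged = mergeData([
    { data: data2, key: 'DocumentNumber' },
    { data: data1, key: 'DocumentNumber_id' },
    { data: approvals, key: 'document_number' }   // hypothetical extra dataset and key
]);

Because each dataset is spread into the accumulator in order, later datasets win when the same field appears more than once, so the ordering of the array matters.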