Task Details

The task is to perform blocking for Entity Resolution: within a limited time, quickly filter out non-matches (tuple pairs that are unlikely to represent the same real-world entity) to produce a small candidate set containing a limited number of tuple pairs for matching.
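To make the idea of blocking concrete, here is a minimal sketch of token-based blocking in Python. It is only an illustration, not the contest baseline: the function name token_blocking, the input shape (a dict from instance id to a text field), and the ranking by number of shared tokens are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records, max_pairs):
    """Return up to max_pairs candidate pairs (id1, id2) with id1 < id2.

    records is a hypothetical dict mapping instance id -> text.
    Pairs are ranked by how many lowercase whitespace tokens they share.
    """
    # Inverted index: token -> ids of the records containing it.
    index = defaultdict(set)
    for rid, text in records.items():
        for token in set(text.lower().split()):
            index[token].add(rid)

    # Count shared tokens for every pair that co-occurs in some block.
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1

    # Keep the pairs with the most shared tokens; ties are broken by id.
    ranked = sorted(shared, key=lambda pair: (-shared[pair], pair))
    return ranked[:max_pairs]
```

Pairs that share no token are never generated, which is what keeps the candidate set small. Note that very frequent tokens produce large blocks, so a practical solution would likely need to filter or weight them.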

Participants are asked to solve the task on two product datasets. Each dataset is made of a list of instances (rows) and a list of properties describing them (columns). We will refer to each of these datasets as Di.

For each dataset Di, participants will be provided with the following resources:

  • Xi : a subset of the instances in Di
  • Yi : matching pairs in Xi x Xi. (The pairs not in Yi are non-matching pairs.)
  • Blocking Requirements: the required size of the generated candidate set (i.e., the number of tuple pairs in the candidate set)

Note that matching pairs in Yi are transitively closed (i.e., if A matches with B and B matches with C, then A matches with C). For a matching pair id1 and id2 with id1 < id2, Yi only includes (id1, id2) and doesn't include (id2, id1).
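The two properties above (transitive closure and the id1 < id2 ordering) can be sketched in Python as follows. This is a sketch under assumptions: the function name transitive_closure is hypothetical, and the union-find approach is just one way to compute the closure, not necessarily how Yi was produced.

```python
from itertools import combinations


def transitive_closure(pairs):
    """Return the transitively closed set of pairs as sorted (id1, id2) with id1 < id2."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every matching pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group ids by cluster representative.
    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), []).append(x)

    # Within a cluster, every pair matches; emit each pair once, smaller id first.
    closed = set()
    for members in clusters.values():
        closed.update(combinations(sorted(members), 2))
    return sorted(closed)
```

For example, given the pairs (1, 2) and (3, 2), the closure also contains (1, 3), and (3, 2) is emitted as (2, 3).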

Your goal is to write a program that generates, for each Xi dataset, a candidate set of tuple pairs for matching Xi with Xi. The output must be stored in a CSV file containing the ids of the tuple pairs in the candidate set. The CSV file must have two columns, "left_instance_id" and "right_instance_id", and must be named "output.csv". The separator must be a comma. Note that we do not consider trivial equi-joins (tuple pairs with left_instance_id = right_instance_id) as true matches. For a pair id1 and id2 (assume id1 < id2), please include only (id1, id2) and not (id2, id1) in your "output.csv".
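The output rules above can be sketched in Python with the standard csv module. The helper names canonical_pairs and write_output are hypothetical, and dropping duplicate pairs is an assumption that goes beyond what the task text states.

```python
import csv


def canonical_pairs(pairs):
    """Drop self-pairs, order each pair as (smaller id, larger id), and remove duplicates."""
    seen = set()
    result = []
    for a, b in pairs:
        if a == b:
            continue  # trivial equi-joins are not considered true matches
        pair = (a, b) if a < b else (b, a)
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


def write_output(pairs, path="output.csv"):
    """Write pairs to a comma-separated file with the two required column names."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # the default delimiter is a comma
        writer.writerow(["left_instance_id", "right_instance_id"])
        writer.writerows(pairs)
```

A typical call would be write_output(canonical_pairs(candidates)), where candidates is whatever list of id pairs your blocking step produced.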

Solutions will be evaluated over the complete dataset Di. Note that the instances in Di (except the sample Xi) will not be provided to participants. More details are available in the Evaluation Process section.

Both Xi and Yi are in CSV format.

Example of dataset Xi


Example of dataset Yi


More details about the datasets can be found in the dedicated Datasets section.

Example of output.csv


Output.csv format: The evaluation process expects "output.csv" to have 3000000 tuple pairs. The first 1000000 tuple pairs are for dataset X1 and the remaining 2000000 pairs are for dataset X2. Please format "output.csv" accordingly. You can check out our provided baseline solution to see how to produce a valid "output.csv".
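One way to assemble the combined pair list is sketched below in Python. The helper names fit_to_size and assemble are hypothetical. Truncating surplus pairs and padding a short list with a filler pair such as (0, 0) are assumptions made for this sketch; the provided baseline solution is the authoritative reference for how a short candidate set should be handled.

```python
def fit_to_size(pairs, size, filler=(0, 0)):
    """Truncate or pad pairs to exactly size entries.

    The filler pair (0, 0) is an assumption for illustration only.
    """
    fitted = list(pairs)[:size]
    fitted.extend([filler] * (size - len(fitted)))
    return fitted


def assemble(x1_pairs, x2_pairs, x1_size=1_000_000, x2_size=2_000_000):
    """Concatenate the X1 block followed by the X2 block, each at its expected size."""
    return fit_to_size(x1_pairs, x1_size) + fit_to_size(x2_pairs, x2_size)
```

With the default sizes, the result has 3000000 entries, with the X1 pairs in the first 1000000 positions and the X2 pairs in the rest.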