The OpenCommunity Discovery report says:
“A standard could solve the problem of artificial boundaries. But it will solve it if, and only if, directories are adapted to pull in service data from other councils’ directories.”
The same is true for directories from other sectors relating to other geographical boundaries. Hence, the same service may be described by different publishers in different areas (nationally, regionally and locally) and in different sectors (dementia, health, recreation, …). We need to avoid that service appearing more than once in a trusted aggregated directory and to know which entry is the one to which people should be directed, as opposed to other duplicates.
The flow chart below illustrates a simplified scenario in which a single Age UK service is described centrally by Age UK, which may be seen as a trusted service provider. These service details are imported by two directories, but two other directories duplicate them with their own local entries.
All records are read by an aggregator for the place. The aggregator needs to avoid importing the same record twice, to show preferred service records, and to allow access to duplicates where needed for reference (eg to copy volunteered service information to an assured record).
We need a federated structure rather than relying on central updates. Within one federated group we need to know that records can be trusted to a level set by that group.
A fuller scenario is described in Appendix A. This involves multiple user roles contributing service, location and organization details. Data custodians and aggregators need to identify and resolve duplicates.
A data custodian will use service directory software that assigns a UUID (universally unique identifier) to every service, organization and location which originates in its directory.
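As a minimal sketch of how directory software might assign identifiers, the example below uses Python's standard uuid module to give each new record a random (version 4) UUID. The class and field names are hypothetical, and the handling of imported records (keeping the identifier given by the originating directory) is an assumption drawn from the wording "originates in its directory".

```python
import uuid
from dataclasses import dataclass, field


def new_id() -> str:
    """Return a random (version 4) UUID as a string."""
    return str(uuid.uuid4())


@dataclass
class Service:
    name: str
    # Assigned once, when the record originates in this directory.
    id: str = field(default_factory=new_id)


@dataclass
class Organization:
    name: str
    id: str = field(default_factory=new_id)


@dataclass
class Location:
    name: str
    id: str = field(default_factory=new_id)


# A record created locally receives a fresh UUID.
local = Service(name="Lunch club")

# Assumption: an imported record keeps the id its originating directory gave it.
imported = Service(name="Lunch club", id="11111111-2222-4333-8444-555555555555")
```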
The custodian will be expected to allow different provenance to be assigned to service, location and organization. For example a local weight-watchers service may be run in a church hall on behalf of the national organization.
A data custodian and an aggregator, via service directory software, will identify similar service records and apply rules to determine which is the preferred record.
Rules for record matching will evolve. They will include:
Matching identifiers using the OpenCommunity recommended extension of identifiers. Identifiers may include:
Organization company number
Organization charity commission number
Organization (or service?) Ofsted number
CQC location id
Lat/long within a certain distance of one another
Web domain and email domains
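The matching rules above could be sketched as three small helpers. This is an illustration only: the function names, the identifier keys and the use of the haversine formula for distance are assumptions, and the distance threshold is left as a parameter because no value is specified here.

```python
import math
from urllib.parse import urlparse


def shared_identifiers(a: dict, b: dict) -> set:
    """Return the (scheme, value) pairs that appear in both identifier maps."""
    return {(k, v) for k, v in a.items() if v and b.get(k) == v}


def distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres, using the haversine formula."""
    r = 6_371_000  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))


def within_distance(a: tuple, b: tuple, max_metres: float) -> bool:
    """True if two (lat, long) points lie within max_metres of one another."""
    return distance_m(a[0], a[1], b[0], b[1]) <= max_metres


def domain_of(url_or_email: str) -> str:
    """Extract a lower-cased domain from a web address or an email address."""
    if "@" in url_or_email and "//" not in url_or_email:
        host = url_or_email.rsplit("@", 1)[1]
    else:
        # urlparse only recognises the host when the string contains "//".
        text = url_or_email if "//" in url_or_email else "//" + url_or_email
        host = urlparse(text).hostname or ""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host
```

Two records whose web address and contact email resolve to the same domain, or whose locations fall within the chosen distance, would then be candidates for comparison rather than automatic matches.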
Identical identifiers will indicate 100% matches and rules on the provenance of records will determine which is the preferred record. Other matches will be given an approximate matching score and a manual process will be used to compare records and assign the preferred records. Other matched records will be marked as “replaced by” the preferred record.
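The resolution step described above might look like the following sketch. It treats any shared identifier as a 100% match and resolves it automatically by provenance, while lower-scoring pairs go to a queue for manual review. The ranking of provenance roles, the name-based approximate score and the review threshold are all placeholders chosen for illustration, not rules taken from this document.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Record:
    id: str
    name: str
    identifiers: dict = field(default_factory=dict)
    provenance_role: str = "none"  # owner | regulator | assurer | none
    replaced_by: Optional[str] = None


# Hypothetical ranking: a higher number wins when records are exact matches.
ROLE_RANK = {"owner": 3, "regulator": 2, "assurer": 1, "none": 0}


def match_score(a: Record, b: Record) -> float:
    """Return 1.0 for a shared identifier, otherwise an approximate score."""
    if any(v and b.identifiers.get(k) == v for k, v in a.identifiers.items()):
        return 1.0
    # Placeholder approximate score: Jaccard similarity of name tokens,
    # capped below 1.0 so it is never mistaken for an identifier match.
    ta, tb = set(a.name.lower().split()), set(b.name.lower().split())
    if not ta or not tb:
        return 0.0
    return min(0.99, len(ta & tb) / len(ta | tb))


def resolve(a: Record, b: Record, review_queue: List[Tuple[str, str, float]],
            review_threshold: float = 0.5) -> None:
    """Mark the duplicate as replaced, or queue the pair for manual review."""
    score = match_score(a, b)
    if score == 1.0:
        # sorted() is stable, so `a` is preferred when the roles tie.
        preferred, other = sorted(
            (a, b), key=lambda r: ROLE_RANK[r.provenance_role], reverse=True
        )
        other.replaced_by = preferred.id
    elif score >= review_threshold:
        review_queue.append((a.id, b.id, score))
```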
A “Trust Register” kept by each directory that aggregates records and does such matching will log the preferred record and the “replaced by” relationships.
Hence the Trust Register will hold for every service, organization and location in a directory:
Its publisher (ie the data custodian responsible for creating it)
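A Trust Register along these lines could be sketched as a small in-memory structure. The sketch stores only what is described above (the publisher of each record and any "replaced by" link); the class, method and field names are hypothetical, and a real register would need persistent storage.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RegisterEntry:
    record_id: str
    record_type: str  # "service", "organization" or "location"
    publisher: str    # the data custodian responsible for creating the record
    replaced_by: Optional[str] = None


class TrustRegister:
    def __init__(self) -> None:
        self._entries: Dict[str, RegisterEntry] = {}

    def add(self, entry: RegisterEntry) -> None:
        self._entries[entry.record_id] = entry

    def mark_replaced(self, duplicate_id: str, preferred_id: str) -> None:
        """Log that duplicate_id is replaced by preferred_id."""
        if preferred_id not in self._entries:
            raise KeyError(preferred_id)
        self._entries[duplicate_id].replaced_by = preferred_id

    def preferred_for(self, record_id: str) -> str:
        """Follow "replaced by" links until a preferred record is reached."""
        seen = set()
        current = record_id
        while self._entries[current].replaced_by is not None:
            if current in seen:
                raise ValueError("cycle in replaced-by chain")
            seen.add(current)
            current = self._entries[current].replaced_by
        return current

    def duplicates_published_by(self, publisher: str) -> List[RegisterEntry]:
        """Entries from one publisher that have been replaced by another record."""
        return [
            e for e in self._entries.values()
            if e.publisher == publisher and e.replaced_by is not None
        ]
```

A method such as duplicates_published_by is the kind of view a publisher could use to see which of its local records now have a separate trusted source.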
Each holder of a Trust Register will make its contents available to all publishers (via a user interface, API and/or other mechanism) so that they can see where duplicates have been identified and stop locally maintaining records for services that have a separate trusted source.
Queries on service directories (including aggregated ones) will, by default, only return preferred records but will indicate where a record replaces others so that a consumer might review the preferred record alongside records it replaces.
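The default query behaviour could be sketched as a filter over records that carry a "replaced_by" link. The dictionary keys and the include_replaced flag are assumptions made for this example and are not part of any published API.

```python
from collections import defaultdict
from typing import Dict, List


def query(records: List[dict], include_replaced: bool = False) -> List[dict]:
    """Return preferred records, each annotated with the ids it replaces."""
    # Build a reverse index: preferred record id -> ids of the records it replaces.
    replaces: Dict[str, List[str]] = defaultdict(list)
    for r in records:
        if r.get("replaced_by"):
            replaces[r["replaced_by"]].append(r["id"])

    results = []
    for r in records:
        if r.get("replaced_by") and not include_replaced:
            continue  # hidden by default; still reachable through "replaces"
        results.append({**r, "replaces": replaces.get(r["id"], [])})
    return results


records = [
    {"id": "s1", "name": "Lunch club"},
    {"id": "s2", "name": "Lunch Club (local entry)", "replaced_by": "s1"},
    {"id": "s3", "name": "Walking group"},
]
preferred_only = query(records)
everything = query(records, include_replaced=True)
```

Here preferred_only contains s1 and s3, with s1 listing s2 under "replaces", so a consumer can fetch the replaced record for comparison if needed.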
A custodian aggregating data for a “place” must determine whether the directories being aggregated are of suitable quality for inclusion. A place might choose to form a group of publishers and help improve data quality.
Data custodians will take responsibility for the assurance of their records, so their software will need to record the person or organization responsible for maintaining each record and to track the quality of its assurers.
Assurers can be assessed on the number of records they assure and on the validity and richness scores of those records.
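One way to summarise these three measures per assurer is sketched below. It assumes each record already carries an assurer name, a validity flag and a richness score produced elsewhere; the keys "assurer", "is_valid" and "richness" are hypothetical, as is the choice of a simple share and mean as the summary statistics.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def assess_assurers(records: List[dict]) -> Dict[str, dict]:
    """Summarise record count, share of valid records and mean richness per assurer."""
    by_assurer: Dict[str, List[dict]] = defaultdict(list)
    for r in records:
        by_assurer[r["assurer"]].append(r)

    return {
        assurer: {
            "records": len(rs),
            "valid_share": sum(1 for r in rs if r["is_valid"]) / len(rs),
            "mean_richness": mean(r["richness"] for r in rs),
        }
        for assurer, rs in by_assurer.items()
    }


sample = [
    {"assurer": "alice", "is_valid": True, "richness": 80},
    {"assurer": "alice", "is_valid": False, "richness": 60},
    {"assurer": "bob", "is_valid": True, "richness": 90},
]
summary = assess_assurers(sample)
```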
Web methods for validating a service and assigning it a richness score are described in the document Prototyping an OpenCommunity Data Service model and in the sample API documentation.
The diagram below shows which data items fall under the control of an assurer for a service, an organization and a location.
Some custodians assign responsibility for data assurance according to service type.
As recommended by the OpenCommunity Discovery work, a new object for identifiers is needed to allow duplicate organizations to be found.
Within a directory, a data custodian needs to record:
The person or organization responsible for creating and maintaining the record
The record’s assurer, which may or may not be the person responsible for its maintenance
The provenance of a record: “owner”, “regulator”, “assurer” or none of these
This can be achieved with a new “Provenance” object with these properties:
service | organization | location identifier
role (owner | regulator | assurer | none)
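A possible shape for the Provenance object is sketched below, using the weight-watchers example to show a service, organization and location carrying different provenance. The property names entity_type and entity_id are assumptions (only an identifier and a role are named above), as are the identifier values and the roles assigned in the example.

```python
import json
from dataclasses import asdict, dataclass

ENTITY_TYPES = {"service", "organization", "location"}
ROLES = {"owner", "regulator", "assurer", "none"}


@dataclass(frozen=True)
class Provenance:
    entity_type: str  # which kind of record the identifier refers to
    entity_id: str    # identifier of the service, organization or location
    role: str = "none"

    def __post_init__(self) -> None:
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role}")


# Illustrative only: a local group describes its own service, while the
# organization and location details come with different provenance.
provenance = [
    Provenance("service", "svc-001", "owner"),
    Provenance("organization", "org-001", "assurer"),
    Provenance("location", "loc-001", "none"),
]

# Serialised form that could be shared between custodians and aggregators.
payload = json.dumps([asdict(p) for p in provenance], indent=2)
```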
We recommend adding the provenance to the schema such that this information is shared between custodians (including aggregators) and so can be used in determining the preferred record amongst duplicates.
Age UK is a national organisation that has local franchises. Some things are done nationally and other things are local decisions. The following stories have not been researched with Age UK and are just to illustrate the sort of complexity of duplication that we will need to deal with. The stories are only from the roles that would be trusted to get the information correct. These stories do not deal with errors in the data. The main problem is that every trusted role could add the same service.