Wednesday, February 12, 2020

Record Linkage and String Matching

Techniques and tools for record matching, data cleaning, and data normalization, many of them based on Levenshtein distance.

SQL Server:

Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server
A library written in C# for defining text-comparison functions in SQL Server.

Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL
JaroWinklerStringSimilarity.sql (GitHub)
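
For readers who want to see what the Jaro-Winkler measure computes, here is a minimal Python sketch of the textbook definition. It is not a port of the linked T-SQL script; the function names, the prefix scale of 0.1, and the prefix cap of 4 characters are assumptions taken from the commonly cited formulation.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two characters match if they are equal and no farther apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half-transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)
```

With these definitions, `jaro_winkler("MARTHA", "MARHTA")` is roughly 0.961: six matching characters, one transposition, and a shared prefix of three characters.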

Cleaning Messy Data in SQL, Part 1: Fuzzy Matching Names


Master Data Services
What is it?
Master Data Services Overview (MDS)

Data Quality Services
Data Quality Services Overview (DQS)


.NET:

SimMetrics (Java)
Similarity metric library, from edit distances (Levenshtein, Gotoh, Jaro, etc.) to other metrics (Soundex, Chapman). Work provided by UK Sheffield University, funded by AKT, an IRC sponsored by EPSRC, grant number GR/N15764/01.


StringSimilarity.NET
A library implementing different string similarity and distance measures. A dozen algorithms (including Levenshtein edit distance and siblings, Jaro-Winkler, Longest Common Subsequence, cosine similarity, etc.) are currently implemented. Based upon F23.StringSimilarity.
See StringSimilarity.NET on GitHub
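
Since Levenshtein edit distance underlies many of the tools listed here, a short Python sketch of the classic dynamic-programming formulation may help. It does not reproduce the API of any library above; `levenshtein` and `similarity` are hypothetical names, and the normalization in `similarity` (dividing by the longer length) is one common convention among several.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # Keep the shorter string as b so each row is as short as possible.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

For example, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.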

fuzzystring: Approximate String Comparison in C#

fuzzystring-standard: Approximate String Comparison in C#

TFIDF: term frequency–inverse document frequency is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. See TFIDF on Wikipedia.
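
As an illustration of the statistic described above, here is a small Python sketch. It assumes one simple variant (relative term frequency multiplied by an unsmoothed natural-log inverse document frequency) and whitespace tokenization; real libraries often use smoothed or normalized variants, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / tokens in doc)
             * ln(number of docs / number of docs containing term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

In a record-matching setting this tends to down-weight tokens such as "ltd" or "corp" that appear in many company names, while rarer tokens carry more weight. Under this variant, a term that occurs in every document gets a weight of exactly zero.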

Data Matching software: a list of (Fuzzy) Data Matching software. The software in this list is open source and/or freely available.

OpenRefine (previously Google Refine): a powerful tool for working with messy data by cleaning it, transforming it from one format into another, and extending it with web services and external data.



Python Record Linkage Toolkit
Library to link records in or between data sources
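
To make the idea of linking records between two data sources concrete, here is a standard-library-only Python sketch of the usual two steps: blocking (only compare records that share a key) and fuzzy comparison. It is not the Python Record Linkage Toolkit API; `link_records`, its parameters, and the 0.85 threshold are assumptions chosen for illustration, and `difflib.SequenceMatcher` stands in for the string measures mentioned above.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def link_records(left, right, block_key, compare_key, threshold=0.85):
    """Return (left_index, right_index, score) for candidate pairs that
    share block_key and whose compare_key values are similar enough."""
    # Blocking: index the right-hand records by the blocking key.
    blocks = defaultdict(list)
    for j, record in enumerate(right):
        blocks[record[block_key]].append(j)
    links = []
    for i, record in enumerate(left):
        # Only records in the same block are compared.
        for j in blocks.get(record[block_key], []):
            score = SequenceMatcher(
                None,
                record[compare_key].lower(),
                right[j][compare_key].lower(),
            ).ratio()
            if score >= threshold:
                links.append((i, j, score))
    return links
```

Called with `block_key="city"` and `compare_key="name"`, a left record "Jonathan Smith" in Leeds would be linked to a right record "Jonathon Smith" in Leeds, but never compared with a "Jonathan Smith" in York, because blocking keeps them apart. That trade-off (fewer comparisons, some missed matches) is the main design choice in picking a blocking key.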

Overview of Record Linkage Methods

