Friday, February 21, 2020
Thursday, February 13, 2020
Geographic Data Normalization: Provinces and Localities
Geo PostCodes
https://es.geopostcodes.com/Argentina
MySQL Database of Argentine Postal Codes (CPA) by Province and Locality
https://www.phpcentral.com/pregunta/660/base-de-datos-mysql-de-codigo-postal-argentino-cpa-por-provincia-y-localidad
Provinces and postal codes only.
Geographic Data Normalization Service
https://datos.gob.ar/dataset/jgm-servicio-normalizacion-datos-geograficos
Provides an API for normalizing data, plus a database of localities and provinces.
BAHRA
http://www.bahra.gob.ar/#descargas
The Base de Asentamientos Humanos de la República Argentina (BAHRA) is the first official, normalized database of localities, rural settlements, entities, and Antarctic bases in the national territory. It uniquely identifies every human settlement and records its name, coordinates, and unique code, among other attributes.
Google Maps Platform - Geocoding Service
https://developers.google.com/maps
Paid Google API for obtaining locality data.
Site | Geo position | Postal code | Localities | Price
MapaNet | ✔ | ✕ | 21,677 | 5,100 - 37 U$D
Geo PostCodes | ✔ | ✕ | 21,247 | 39,530 - 295 U$D
MySQL Database | ✕ | ✔ | 22,963 | Free
Datos Gob Ar | ✔ | ✕ | 17,699 | Free
BAHRA | ✔ | ✕ | 14,770 | Free
We currently have 9,157 localities loaded in our database.
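Before comparing our localities against any of these sources, the names usually need to be brought to a common form. The following is a minimal sketch of that step, assuming that accents, letter case, and extra whitespace are the main differences between sources. The function names and the sample names are hypothetical and are not taken from any of the datasets above.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase a place name, strip its accents, and collapse whitespace."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # casefold() gives a case-insensitive form; split/join collapses spaces.
    return " ".join(without_accents.casefold().split())


def build_index(reference_names):
    """Map each normalized name to its original spelling in the reference."""
    return {normalize_name(n): n for n in reference_names}


# Hypothetical sample data (assumption, for illustration only).
reference = ["Córdoba", "San Miguel de Tucumán", "Río Cuarto"]
index = build_index(reference)

print(index.get(normalize_name("  CORDOBA ")))
print(index.get(normalize_name("rio   cuarto")))
```

Note that this approach also folds "ñ" into "n", which may or may not be desirable. Two different localities that normalize to the same key would overwrite each other in the index, so a real pipeline would probably key on province plus locality.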
Wednesday, February 12, 2020
Record Linkage and String Matching
Techniques and tools for record matching, data cleaning, and data normalization. Many of them rely on Levenshtein distance.
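Since Levenshtein distance underlies many of these tools, a small reference sketch may help. It is a plain dynamic-programming version written for illustration, not the implementation used by any library listed here. The `similarity` helper and its length-based scaling are an assumption, one common way to turn the distance into a score between 0 and 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Distance scaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("rosario", "rozario"))  # 1
```

Keeping only two rows of the table holds memory proportional to the shorter string, which matters when comparing every record against a large reference list.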
SQL Server:
Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server
Library written in C# for defining text comparison functions in SQL Server.
Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL
JaroWinklerStringSimilarity.sql (GitHub)
Cleaning Messy Data in SQL, Part 1: Fuzzy Matching Names
Master Data Services
What is it?
Master Data Services Overview (MDS)
Data Quality Services
Data Quality Services Overview (DQS)
.Net:
SimMetrics (Java)
Similarity metric library, from edit distances (Levenshtein, Gotoh, Jaro, etc.) to other metrics (Soundex, Chapman). Work provided by the University of Sheffield (UK), funded by AKT, an IRC sponsored by EPSRC, grant number GR/N15764/01.
StringSimilarity.NET
A library implementing different string similarity and distance measures. A dozen algorithms (including Levenshtein edit distance and siblings, Jaro-Winkler, Longest Common Subsequence, cosine similarity, etc.) are currently implemented. Based upon F23.StringSimilarity.
See StringSimilarity.NET on GitHub
fuzzystring: Approximate String Comparision in C#
fuzzystring-standard: Approximate String Comparision in C#
TFIDF: term frequency–inverse document frequency, a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. See TFIDF on Wikipedia.
Data Matching software: a list of (Fuzzy) Data Matching software. The software in this list is open source and/or freely available.
OpenRefine (previously Google Refine): a powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data.
Python Record Linkage Toolkit
Library to link records in or between data sources
Overview of Record Linkage Methods
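Several of the resources above combine TF-IDF weighting with cosine similarity to compare names. The sketch below shows one way this can work for short strings, using character trigrams in place of words. It is written from scratch with the standard library. The smoothed IDF formula, the padding with spaces, and the sample names are assumptions chosen for illustration, and none of it reproduces the API of any listed library.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3):
    """Split a string into overlapping character n-grams, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(docs, n: int = 3):
    """Build one sparse TF-IDF vector (a dict) per input string."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: in how many strings each n-gram appears.
    df = Counter(gram for c in counts for gram in c)
    total = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF (assumption): log((1 + N) / (1 + df)) + 1.
        vectors.append(
            {g: tf * (math.log((1 + total) / (1 + df[g])) + 1)
             for g, tf in c.items()}
        )
    return vectors


def cosine(u, v) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample names (assumption, for illustration only).
names = ["Villa Carlos Paz", "Va. Carlos Paz", "Rio Gallegos"]
v0, v1, v2 = tfidf_vectors(names)
print(cosine(v0, v1) > cosine(v0, v2))  # True
```

Unlike edit distance, this score tolerates abbreviations and reordered words reasonably well, because rare trigrams shared by two names weigh more than common ones.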