I have a database with dirty data like the example below, where 26 distinct combinations all actually refer to the same real building address/location.
I need to produce reports that are aggregated by real building address/location. By "real" I mean the actual concrete object in the real world that these rows of data refer to :-)
I could just run this through the Google Maps API to get a clean address and lat/lon, but I want to learn to solve this problem with Weka using standard algorithms.
What should I use for this? I've read about Bayesian networks, which seem promising, but I'm not sure where to start.
All of the addresses/locations below should de-duplicate to a single address/location, chosen based on some probability-driven theory of correctness, and without dragging in false positives corresponding to different addresses.
I'm also attaching a more complete dataset: 50kExampleAddresses.zip
All ideas appreciated!
Example data: 26 variations of the address "3950 N Lake Shore Dr"
Code:
@relation ExampleDirtyAddress
@attribute CTCX_HSN_NUMBER numeric
@attribute _Short_CP {N}
@attribute STR {'LAKE SHORE','LAKE SHORE DR','LAKE SHORE DRIVE',LAKESHORE,'Lake Shore','Lake Shore Drive',Lakeshore,'lake shore','lake shore dr.'}
@attribute _Short_STREETSUFFIX {DR}
@attribute CIT {CHICAGO,Chicago}
@attribute STATE {Illinois}
@attribute ZP {60607,60613,60618,60640,60657}
@attribute LAT_LATITUDE numeric
@attribute LNG_LONGITUDE numeric
@data
3950,N,'Lake Shore',?,CHICAGO,Illinois,60613,41.8386,-87.6077
3950,N,'LAKE SHORE',?,CHICAGO,Illinois,60613,41.885352,-87.725055
3950,N,'Lake Shore',?,Chicago,Illinois,60613,41.954163,-87.646351
3950,N,'LAKE SHORE',DR,CHICAGO,?,60613,41.953103,-87.6451
3950,N,'Lake Shore',DR,CHICAGO,Illinois,60607,41.953885,-87.645233
3950,N,'Lake Shore',DR,CHICAGO,Illinois,60607,41.953887,-87.645199
3950,N,'Lake Shore',DR,CHICAGO,Illinois,60607,41.954163,-87.646351
3950,N,'LAKE SHORE',DR,CHICAGO,Illinois,60613,41.845781,-87.806836
3950,N,'LAKE SHORE',DR,Chicago,Illinois,60613,41.88096,-87.62558
3950,N,'Lake Shore',DR,Chicago,Illinois,60613,41.935699,-87.633041
3950,N,'Lake Shore',DR,CHICAGO,Illinois,60613,41.953053,-87.645078
3950,N,'LAKE SHORE',DR,CHICAGO,Illinois,60613,41.953102,-87.6451
3950,N,'LAKE SHORE',DR,CHICAGO,Illinois,60613,41.953103,-87.6451
3950,N,'LAKE SHORE',DR,CHICAGO,Illinois,60613,41.953602,-87.645099
3950,N,'Lake Shore',DR,CHICAGO,Illinois,60613,41.953612,-87.645138
3950,N,'LAKE SHORE',DR,Chicago,Illinois,60613,41.953842,-87.645164
3950,N,'Lake Shore',DR,CHICAGO,Illinois,60613,41.953844,-87.645166
3950,N,'LAKE SHORE',DR,CHICAGO,Illinois,60613,41.953857,-87.645065
3950,N,'LAKE SHORE',DR,CHICAGO,Illinois,60613,41.953873,-87.645199
3950,N,'LAKE SHORE',DR,Chicago,Illinois,60613,41.953885,-87.645233
3950,N,'lake shore',DR,CHICAGO,Illinois,60613,41.953887,-87.645199
3950,N,'Lake Shore',DR,Chicago,Illinois,60613,41.953888,-87.6452
3950,N,'Lake Shore',DR,CHICAGO,Illinois,60613,41.953909,-87.6452
3950,N,'LAKE SHORE',DR,CHICAGO,Illinois,60613,41.95391,-87.6452
3950,N,'LAKE SHORE',DR,CHICAGO,Illinois,60613,41.95392,-87.645237
3950,N,'LAKE SHORE',DR,CHICAGO,Illinois,60613,41.954136,-87.645943
3950,N,'LAKE SHORE',DR,Chicago,Illinois,60613,41.954163,-87.646355
3950,N,'Lake Shore',DR,CHICAGO,Illinois,60613,41.954163,-87.646351
3950,N,'Lake Shore',DR,CHICAGO,Illinois,60613,41.954292,-87.647163
3950,N,'LAKE SHORE',DR,Chicago,Illinois,60618,41.9536,-87.645096
3950,N,'LAKE SHORE',DR,Chicago,Illinois,60618,41.954163,-87.646355
3950,N,'LAKE SHORE',DR,CHICAGO,Illinois,60640,41.954163,-87.646351
3950,N,'Lake Shore',DR,CHICAGO,Illinois,60657,41.935699,-87.633041
3950,N,'Lake Shore',DR,CHICAGO,Illinois,60657,41.941914,-87.639769
3950,N,'LAKE SHORE',DR,CHICAGO,Illinois,60657,41.953403,-87.645
3950,N,'LAKE SHORE',DR,CHICAGO,Illinois,60657,41.953602,-87.645099
3950,N,'LAKE SHORE',DR,CHICAGO,Illinois,60657,41.954163,-87.646351
3950,N,'LAKE SHORE DR',?,CHICAGO,Illinois,60613,41.953909,-87.6452
3950,N,'LAKE SHORE DR',?,CHICAGO,Illinois,60613,41.95391,-87.6452
3950,N,'lake shore dr.',?,CHICAGO,Illinois,60613,41.953909,-87.6452
3950,N,'Lake Shore Drive',?,CHICAGO,Illinois,60613,41.953909,-87.6452
3950,N,'LAKE SHORE DRIVE',?,CHICAGO,Illinois,60613,41.95391,-87.6452
3950,N,'LAKE SHORE DRIVE',?,CHICAGO,Illinois,60657,41.953403,-87.645
3950,N,'LAKE SHORE DRIVE',DR,CHICAGO,Illinois,60613,41.953403,-87.645
3950,N,'LAKE SHORE DRIVE',DR,CHICAGO,Illinois,60613,41.953859,-87.645065
3950,N,Lakeshore,DR,Chicago,Illinois,60613,41.953403,-87.645
3950,N,Lakeshore,DR,CHICAGO,Illinois,60613,41.953602,-87.645099
3950,N,LAKESHORE,DR,CHICAGO,Illinois,60613,41.953859,-87.645065
3950,N,Lakeshore,DR,CHICAGO,Illinois,60613,41.954163,-87.646351
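To make the target behaviour concrete, here is a rough baseline sketch in plain Python. It is not a Weka solution and not a Bayesian network. It assumes that normalising case, punctuation and spacing in the street name is enough to group the variants, and it uses a majority vote on the ZIP and the median lat/lon as a crude stand-in for a "probability-driven" choice. The `SUFFIXES` table, the tuple layout and the function names are hypothetical, made up for illustration. The sample rows are copied from the data above.

```python
import re
from collections import Counter
from statistics import median

# Hypothetical suffix table (assumption): extend it for real data.
SUFFIXES = {"DR": "DR", "DRIVE": "DR"}

def normalize_street(street, suffix=None):
    """Return (name, suffix) with case, punctuation and spacing normalised."""
    tokens = re.sub(r"[^A-Z0-9 ]", "", street.upper()).split()
    if suffix in (None, "?", ""):          # '?' is the ARFF missing-value marker
        suffix = None
    else:
        suffix = SUFFIXES.get(suffix.upper(), suffix.upper())
    if tokens and tokens[-1] in SUFFIXES:  # suffix embedded in the street name
        suffix = SUFFIXES[tokens.pop()]
    # Joining without spaces makes 'LAKE SHORE' and 'LAKESHORE' the same key.
    return "".join(tokens), suffix

def consensus(rows):
    """rows: tuples of (number, dir, street, suffix, city, zip, lat, lon)."""
    groups = {}
    for number, direction, street, suffix, city, zp, lat, lon in rows:
        name, _ = normalize_street(street, suffix)
        key = (number, direction.upper(), name, city.upper())
        groups.setdefault(key, []).append((zp, lat, lon))
    merged = {}
    for key, members in groups.items():
        best_zip, votes = Counter(m[0] for m in members).most_common(1)[0]
        merged[key] = {
            "zip": best_zip,
            "zip_support": votes / len(members),   # share of rows agreeing
            "lat": median(m[1] for m in members),  # median resists outliers
            "lon": median(m[2] for m in members),
            "rows": len(members),
        }
    return merged

# Five rows taken from the example data above.
SAMPLE = [
    (3950, "N", "Lake Shore", "?", "CHICAGO", "60613", 41.8386, -87.6077),
    (3950, "N", "Lake Shore", "DR", "CHICAGO", "60607", 41.953885, -87.645233),
    (3950, "N", "lake shore dr.", "?", "CHICAGO", "60613", 41.953909, -87.6452),
    (3950, "N", "LAKE SHORE DRIVE", "?", "CHICAGO", "60613", 41.95391, -87.6452),
    (3950, "N", "Lakeshore", "DR", "Chicago", "60613", 41.953403, -87.645),
]

if __name__ == "__main__":
    for key, value in consensus(SAMPLE).items():
        print(key, value)
```

On these five rows the sketch yields a single group, with the stray ZIP 60607 outvoted and the off-target coordinates ignored by the median. It would not help with typos or with genuinely different buildings that share a normalised name, which is where a learned model would have to earn its keep.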