A list of the most common mappings obtained using the above-mentioned heuristic is presented in Listing 4.
_______________________________________________
['tug', 'pusher tug']: 98
['tanker', 'lng tanker']: 101
['tanker', 'bunkering tanker']: 119
['dredger', 'trailing suction hopper dredger']: 187
['tanker', 'chemical tanker']: 222
['cargo', 'ro-ro cargo']: 342
['tanker', 'inland tanker']: 397
['tanker', 'lpg tanker']: 507
['passenger', 'passengers ship']: 644
['fishing', 'fishing vessel']: 1020
['passenger', 'ro-ro/passenger ship']: 1246
['tanker', 'crude oil tanker']: 1347
['tanker', 'oil products tanker']: 1661
['tanker', 'oil/chemical tanker']: 2086
['cargo', 'general cargo']: 5536
_______________________________________________
Listing 4. Vessel type mappings, filtered using a simple string similarity measure. The numbers to the right give the number of cases in which the two vessel type names in brackets were used in two different data sources for a vessel with the same shipId assigned.
Finally, manual analysis may be performed on other potential mappings by an expert and, based on that, additional mappings may be added to the system knowledge base.
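The counting behind Listing 4 can be illustrated with a minimal sketch. It assumes the input is a list of (shipId, source, type name) records, and it uses substring containment as a stand-in for the "simple string similarity measure", which is not specified here. The function names `count_type_mappings` and `looks_related` and the sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_type_mappings(records):
    """Count pairs of different type names reported for the same shipId.

    `records` is an iterable of (ship_id, source, type_name) tuples
    (an assumed input layout, not the system's actual schema).
    """
    by_ship = {}
    for ship_id, source, type_name in records:
        # One name per source and ship; names are lower-cased for comparison.
        by_ship.setdefault(ship_id, {})[source] = type_name.strip().lower()

    counts = Counter()
    for per_source in by_ship.values():
        distinct_names = sorted(set(per_source.values()))
        # Distinct names for one ship necessarily come from different sources.
        for a, b in combinations(distinct_names, 2):
            counts[(a, b)] += 1
    return counts


def looks_related(a, b):
    """Hypothetical filter: keep a pair when one name contains the other."""
    return a in b or b in a


def filtered_mappings(records):
    counts = count_type_mappings(records)
    return {pair: n for pair, n in counts.items() if looks_related(*pair)}


# Made-up sample records for illustration only.
records = [
    (1, "source_a", "Tanker"),
    (1, "source_b", "Crude Oil Tanker"),
    (2, "source_a", "tanker"),
    (2, "source_b", "crude oil tanker"),
    (3, "source_a", "cargo"),
    (3, "source_b", "tug"),
]
mappings = filtered_mappings(records)
```

In this sample the pair ('cargo', 'tug') is counted once but dropped by the filter, while the tanker pair is kept with a count of 2.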
3.5 Classification societies
Each vessel belongs to a classification society. The goal of the classification societies is “to provide classification and statutory services and assistance to the maritime industry and regulatory bodies as regards maritime safety and pollution prevention, based on the accumulation of maritime knowledge and technology”¹⁰. Names of classification societies, similarly to other data types, are expressed as strings, and in each data source the same classification society may be referred to using a different string. Therefore, for each acquired classification society name, a proper identifier classId should be assigned in the disambiguation process. In the SIMMO system, there was an initial list of known classification societies with assigned classIds. This list was later extended during the disambiguation process.
_______________________________________________
['bureau veritas', 'nippon kaiji kyokai']: 22
['american bureau of shipping', 'bureau veritas']: 29
['det norske veritas', 'lloyds register']: 32
['american bureau of shipping', 'lloyds register']: 41
['dnv gl', 'germanischer lloyd']: 56
['registro italiano navale', 'american bureau of shipping']: 61
['korean shipping register', 'korean register']: 121
['dnv gl', 'det norske veritas']: 176
['lloyd\'s shipping register', 'lloyds register']: 267
['lloyds shipping register', 'lloyds register']: 605
_______________________________________________
Listing 5. The most frequent mappings between classification societies, based on the fact that the same vessel was assigned different classification society strings in different sources. The results are much worse than for vessel types.
The analysis of classification society names started with the generation of mappings in the same manner as for flags and vessel types, i.e. by checking whether a single vessel has different classification society names assigned in different data sources. However, in the case of classification societies, this approach did not yield many correct results, as shown in Listing 5; only a few of the most common mappings were correct and were used in further analysis. This is probably because vessels may change their classification society relatively often compared to changing the vessel type (e.g. changing the vessel type may require expensive modifications of the vessel itself). Therefore, different classification societies assigned to the same ship in different sources may result from the information in one source being outdated in comparison to the information provided in the other one.

¹⁰ See http://www.iacs.org.uk/document/public/explained/Class_What‐Why&How.PDF for details.
Taking into account the obtained results, it turned out that the number of distinct classification society names for which the system was not able to assign a classId based on the string comparison method was only 192. Since this number was relatively small, a manual analysis of the strings and assignment of the correct classIds could be performed. Based on this analysis, the system’s knowledge base of classification society name variants was updated. This made it possible to disambiguate all classification society name strings.
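The workflow described above (look up known name variants, collect the unresolved ones, then extend the knowledge base manually) can be sketched as follows. This is a simplified illustration, not the SIMMO implementation: it uses exact matching on normalized strings instead of the string comparison method, and the function names, the dictionary layout and the classId values are all hypothetical.

```python
def normalize(name):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def assign_class_ids(names, knowledge_base):
    """Return (resolved, unresolved) for the given society name strings.

    `knowledge_base` maps a normalized name variant to a classId.
    """
    resolved, unresolved = {}, set()
    for name in names:
        key = normalize(name)
        if key in knowledge_base:
            resolved[name] = knowledge_base[key]
        else:
            unresolved.add(key)  # left for manual analysis by an expert
    return resolved, unresolved


# Hypothetical initial knowledge base; the classId values are made up.
knowledge_base = {
    "lloyds register": 1,
    "lloyds shipping register": 1,
    "bureau veritas": 2,
}

names = ["Lloyds  Register", "Bureau Veritas", "Lloyd's Shipping Register"]
resolved, unresolved = assign_class_ids(names, knowledge_base)

# Manual step: an expert reviews `unresolved` and adds the missing variant.
knowledge_base["lloyd's shipping register"] = 1
resolved_after, unresolved_after = assign_class_ids(names, knowledge_base)
```

After the manual update, the second pass leaves no unresolved names, which mirrors the outcome reported for the 192 strings.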
3.6 Company names
In different data sources, different strings may be used to refer to the same company. In many cases, such strings are similar, for example “Star Shipping Ltd” and “Star Shipping Limited”. The aim of disambiguation in this case is to determine whether two strings in fact refer to the same company and, if so, to assign the same identifier companyId to both of them.
In the first step, a string similarity measure, namely the Jaro distance [9], was used to identify whether different strings refer to the same company. Given two strings, this measure returns a numeric value between 0 and 1. The more similar the strings are, the higher the returned value.
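For illustration, the measure can be sketched in a few lines. The function below follows the commonly used definition of Jaro similarity (matching characters within a sliding window, then counting transpositions); it is a minimal sketch and not necessarily the exact variant or implementation used in the system.

```python
def jaro_similarity(s1, s2):
    """Return the Jaro similarity of two strings, a value in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


score = jaro_similarity("Star Shipping Ltd", "Star Shipping Limited")
```

Whether a given score is high enough to treat two names as the same company depends on a threshold, which is a design choice not fixed by the measure itself.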
The basic difficulty in the disambiguation of company names results from the fact that even for humans this task can be performed only with a limited level of certainty (i.e. certainty as to what extent the output of the disambiguation is correct). It may be even more difficult to define how the term “single company” is understood and how to relate that to the analysis being performed.
Let’s analyze the following pair of company names: “Palmali Rostov, Russia” and “Palmali Shipping Services Instabul, Turkey”. It is clear (at least for a human) that these strings refer to entities located in different countries. Still, after performing a search on the Internet, one may learn that both entities belong to the same group, Palmali Group of Companies¹¹. In such a case, classifying these two strings as the names of either the same company or two different companies depends on the definition of a single company.

¹¹ http://palmali.com.tr/en/default.asp

In some cases, the names of companies are not similar as far as the Jaro measure is concerned, but they may still refer to the same company. For example, let’s assume that we have the following strings: “U.S., Dept. of Transportation” and “USA Government-Washington DC, U.S.A”. Jaro similarity between them