New web resource: what data format do you want?



museumoflondon
01-05-2008, 4:50 PM
Hi. We've recently launched a database of around 10,000 individuals (well, a few organisations too) involved in the photographic industry in Victorian London (http://www.photolondon.org.uk/default.asp). The site is the result of a partnership between several public bodies (see here (http://www.photolondon.org.uk//pages/contactUs.asp)), my own being the Museum of London.

Unsurprisingly, there are plenty of visits coming from genealogical/family history sites, including this one. I would like to make sure that we give people what they want, and I am keen to have your thoughts on what is good or bad on the site, from a general or genealogist's point of view. In particular I'd like to offer downloads of data in a format that would be useful to you, and the only one I've come across seems to be GEDCOM. As a programmer I'd have to say it looks like a bit of an old-fashioned and inflexible pig, but is it what would be most useful to people here? I could also offer microformats like hcard, but it all depends on what is most widely used I guess.

I'd appreciate your input,

Jeremy

Mary Anne
01-05-2008, 6:19 PM
Hi Jeremy |wave|

That's an interesting site you have there. I have a couple of observations to make about how I use photographs and information on photographers in my genealogical work, which may give you some insight into how we operate and what we are looking for.

I would likely be doing one of three things:

1) I HAVE a photo with a photographer's name on it, and probably little or no other indication of who the person is, or where the photo was taken. In this case, I would like to look up the photographer's name or the studio name and find out the location, so that might help me find the elusive relative/person in the photo (search by photographer's name);

2) I have a relative who lived in a certain locale, and I would like some contemporary photographs of buildings or streets, or even of people in certain occupations, that could give me a better appreciation for what the place was like at the time my relative was there (search by location, including street address or enumeration district; search by occupation);

3) I have a relative who was a photographer, and want more information on where they lived and worked, their career, and if possible, examples of their work (search by photographer's name).

It is really only in the 3rd case that I would want or need to have the information you are providing in a GEDCOM. Yes, it is old-fashioned, but it has become the standard format for exchanging genealogical information among different software programs and on the web. It captures critical data fields that the genealogist is interested in - birth, baptism, marriage, death, occupation, etcetera. What it does is make it easy for anyone, using any software program, to import the data into their file - which might contain many thousands of people - so that the information is not so scrambled that it needs extensive massaging. Often the GEDCOM someone might wish to import also has hundreds or thousands of names in it, so having a standard format like the GEDCOM that puts major events where they are supposed to be (births=births, baptisms=baptisms, etc) can be extremely important to help reduce the amount of work needed to bring the information into one's own file.
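As a rough illustration of the kind of fields described above, here is a minimal sketch in Python of turning one database record into GEDCOM-style lines. The helper names (`person_to_gedcom`, `build_gedcom`), the sample person and the `EXAMPLE` source identifier are all invented for illustration, and the output has not been validated against the GEDCOM 5.5 specification; a real export would need more header detail (a submitter record, for example).

```python
def person_to_gedcom(xref, name, surname, birth_year=None,
                     birth_place=None, occupation=None):
    """Return GEDCOM-style lines for one individual (hypothetical helper)."""
    lines = [f"0 @{xref}@ INDI", f"1 NAME {name} /{surname}/"]
    if birth_year or birth_place:
        lines.append("1 BIRT")  # birth event, details on level 2
        if birth_year:
            lines.append(f"2 DATE {birth_year}")
        if birth_place:
            lines.append(f"2 PLAC {birth_place}")
    if occupation:
        lines.append(f"1 OCCU {occupation}")
    return lines


def build_gedcom(people):
    """Wrap individual records in a minimal header and trailer."""
    header = [
        "0 HEAD",
        "1 SOUR EXAMPLE",  # assumed identifier for the exporting system
        "1 GEDC",
        "2 VERS 5.5",
        "2 FORM LINEAGE-LINKED",
        "1 CHAR ASCII",
    ]
    body = [line for person in people for line in person_to_gedcom(**person)]
    return "\n".join(header + body + ["0 TRLR"]) + "\n"


# Invented sample record, not taken from the database.
SAMPLE_PEOPLE = [
    dict(xref="I1", name="John", surname="Example", birth_year=1857,
         birth_place="St Marylebone", occupation="Photographer"),
]

print(build_gedcom(SAMPLE_PEOPLE))
```

Each event sits under its own tag (BIRT, OCCU and so on), which is what lets an importing program put births with births and occupations with occupations.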

It seems to me if I were looking for a photographer relative on your database, the number of people I would be interested in, and therefore the amount of information required to transfer, would be relatively small, so I could just cut-and-paste or even re-type from any format you like - in fact, I could even just cut-and-paste from what you have posted on the web.

In the other 2 cases, I would be using the information as a research aid, and so for me, having the information as a GEDCOM format would not be terribly useful. So, the format almost becomes irrelevant, although it probably should adhere to some standards, like rtf, jpeg, tif, gif, etcetera.

I think there are also some other things you should consider, although you may have already addressed these: What would be the terms of use for your information? Would I have free access to reproduce text and/or photos? Would you want a copyright of some sort? Would your service be free? Or by subscription?

Another important feature you should consider (and again, you may have already done this) is including primary source data with your information. For example, where did you get the studio address for the photographer? What organization or individual holds the photographs you are reproducing? It would not be good enough for me, as a researcher, to cite your website database as a source -- I would also want to know where I could look at the primary information myself.

Just a few thoughts of mine. Hope they are useful to you. I do have some rellies from London area (e.g. 40 Grafton Square in Clapham circa 1875; Norfolk Street Islington circa 1844) and look forward to seeing how I could use your database to find some photos that may relate to them and where they lived.


Mary Anne

v.wells
01-05-2008, 6:32 PM
A very interesting site and one I will more thoroughly explore. GEDCOM format seems fine as it is a universal file format for tree files. What types of data downloads are you thinking about offering?

Welcome to B-G forums|wave|

museumoflondon
03-05-2008, 4:40 PM
Hi both, thank you for your welcome, your kind opinions and your advice, and apologies for a somewhat slow reply. I'm glad to hear that the site is potentially of interest to you both.

It's very useful to have those views on GEDCOM, and Mary Anne's thoughts on how family historians are likely to use information on photographers; that's definitely clarified things for me. Yes, I can definitely see that it's only really those photographers that might fit into the family tree themselves that you would wish to have GEDCOM data for, and it makes sense too that this is of greatest importance for large masses of data, rather than one or two individuals. If it turns out that churning out (a subset of?) the data in GEDCOM is the work of a couple of hours, perhaps I'll do it anyway - it's bank holiday weekend after all, and unusually I'm alone for a couple of days! But otherwise, it sounds like it may not be worth the effort.

I have realised a couple of things about the site that I had forgotten, on account of it being probably three years since I worked on it (though it only launched last month....) Firstly, there is meant to be an essay explaining sources. Where no source is cited, I think census data is used but I need to get hold of and publish that essay since this is undoubtedly very important. Secondly, I actually have much better, more structured address data than is shown on the site or used in the search. I cannot recall why it's not used - I suspect it may be that we stopped doing this after a while on account of the workload of splitting up the prose text into nice addresses - but this does mean that at least for those people who have good data we could do things like plot them onto a map and link them with other geographical data (where road names haven't changed, anyway - this precedes postcodes, of course). Given that we do have more precise data that links people, addresses, occupations and dates in a structured format, perhaps that makes a stronger case for GEDCOM. But first I need to look into why we aren't showing all that data on the page already, or using details of locations in the search! It's all so long ago I forget. Hmm.

On the question of permission to use the data, you're quite right, there's nothing clear on the site. I think we need that, although the work is largely a compilation of publicly available data. For publication purposes David Webb would need to be asked and acknowledged, I would imagine, but for private use, well, that's why it's up there in the first place! I will suggest to the partners that we put something clear up there. As for the photos (e.g. http://www.photolondon.org.uk/pages/photoDetails.asp?phid=8) , there is no statement of copyright or licencing and there should be. Of course I'd suggest contacting the appropriate museum/library/archive, but we should have made it clear! Thanks for pointing this out, it's a silly oversight.

On the subject of photos themselves, it is my hope (and it was really the point of the whole PhotoLondon partnership) that there will in due course be many more pictures put online and associated with some of the 9,000 people in the database. Certainly my museum has large collections of some of these people but right now they aren't linked to that biographical database.

One final thing. I just heard about another couple of sites that may be of interest to you: Exhibitions of the Royal Photographic Society 1870-1915 (http://erps.dmu.ac.uk) and Photographs Exhibited in Britain 1839-1865 (http://peib.dmu.ac.uk). Who knows, perhaps one day all these sites will be linked somehow. That would be cool.

So thanks again for the feedback, the ideas and the advice, it's all very useful and gratifying. I'd be only too pleased to hear more, and if anyone finds something they were looking for on the site, do let us know!

All the best, Jeremy

Geoffers
03-05-2008, 8:05 PM
Nice simple site - quick to open - I dislike sites which try and load too much on the home page.

Many will no doubt like Gedcom - I don't use it very much and certainly would not import a gedcom. My preference is to store data in spreadsheets.

One point on the layout of biographies and I hope you don't mind me pasting part of an entry here to give an example:

Where you have

Born in St Marylebone 1857.
2 brothers & 2 sisters.
STUDIOS: 1. 16 Dalston Lane, Hackney 1875 - 1893.
2. 24 Dalston Lane, Hackney 1875 - 1876.
3. 9 Cornhill, City of London 3 floor 1878 - 1879. Succeeded by Royal Exchange Portrait Co.

I would tend to standardise the layout so that the year/period is shown first:

1857: Born in St. Marylebone, son of John, 2 brothers & 2 sisters. One brother being Henry Robert Eason.
Studios
1875-1893: 16 Dalston Lane, Hackney
1875-1876: 24 Dalston Lane, Hackney
1878-1879: 9 Cornhill, City of London. Succeeded by Royal Exchange Portrait Co.

etc, but with the dates and information tabulated in two columns.
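The year-first layout suggested above could be produced automatically where the studio entries follow a regular pattern. The following is only a sketch: the regular expression and the `parse_studio` helper are assumptions based on the three sample lines quoted in this post, and entries in the real data that deviate from that pattern would fall through unparsed.

```python
import re

# The three studio entries quoted in the post above.
STUDIO_LINES = [
    "1. 16 Dalston Lane, Hackney 1875 - 1893.",
    "2. 24 Dalston Lane, Hackney 1875 - 1876.",
    "3. 9 Cornhill, City of London 3 floor 1878 - 1879. "
    "Succeeded by Royal Exchange Portrait Co.",
]

# Optional "N. " prefix, then address, then "YYYY - YYYY", then an optional note.
STUDIO_RE = re.compile(
    r"^(?:\d+\.\s+)?"
    r"(?P<address>.+?)\s+"
    r"(?P<start>\d{4})\s*-\s*(?P<end>\d{4})"
    r"\.?\s*(?P<note>.*)$"
)


def parse_studio(line):
    """Return (period, detail); period is empty if the line does not match."""
    text = line.strip()
    m = STUDIO_RE.match(text)
    if m is None:
        return ("", text)  # leave unrecognised entries untouched
    detail = m["address"]
    if m["note"]:
        detail += ". " + m["note"]
    return (f"{m['start']}-{m['end']}", detail)


# Print as two columns: period first, then the address and any note.
for period, detail in map(parse_studio, STUDIO_LINES):
    print(f"{period:<10} {detail}")
```

Unlike the hand-edited example, this keeps every word of the original entry (including "3 floor"), so nothing is lost if the pattern is slightly off.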

MarkJ
03-05-2008, 8:49 PM
Whatever method you use, please can you make it operating system and browser independent?
The number of sites which fail to work correctly in anything other than IE, or use junk like Flash or javascript with no other choice, drives me insane!
Likewise word documents, excel spreadsheets etc etc. I see the website is using Microsoft's .asp language - hopefully you will avoid things which need the activeX component? That is not OS independent, as you will be aware, as well as being another massive security problem ;)

Gedcom is not the answer either - that requires that the user has to have a suitable program. Even here, on a genealogy specific website, there are a significant number of people who cannot view gedcoms.

Much as I hate the things, perhaps pdf is a reasonable choice for data to be downloaded in?

Just my opinion as a "non standard" operating system user ;) Yes, I *can* view flash or javascripted stuff, but I choose not to - flash because allowing it means yet more ghastly moving adverts and javascript because it is a huge security risk for anyone's PC - despite the fact that most "big" sites rely on it (yes, BBC, I am looking at you!)

Of course, building a website means making decisions - and, for convenience, you may choose the option which is most popular (flash stuff, javascript etc) - and that is your own choice and I can understand it :)

Mark

museumoflondon
04-05-2008, 11:31 AM
Hi Geoffers, thank you, glad you like it.

I'll go back and have a look at the data. I think it will support some of this. I think the reason why it's displayed in a less structured form is that only part of that text is contained in structured fields in the database. It was all held in a Word document and extracted by the hard labour of another partner, but discursive text and sources (as well as many addresses, it appears) weren't split up into database fields. Unfortunately it's not likely that we'll be able to do much more in that regard, nor re-editing the long-form text, because as you'll appreciate with such a large dataset and no resources it's just not going to happen. Personally I think this is particularly regrettable with regard to addresses, because it's nice to be able to make those links automatically between different people/organisations based on location. Ah well.
What this basically means is that I'd need to show what I could in tabular form in addition to showing the whole text as it stands, or we'd lose a lot of the info. I can see how this would be useful, though, so I will have a think about how best to accomplish it - perhaps a separate page showing whatever we can in a more structured form, or perhaps CSV or a spreadsheet doing the same.

Interesting to hear, too, that GEDCOM is of no use to you. Doubtless there are many other people out there that don't use an off-the-shelf package. I grabbed Legacy the other day just to play with GEDCOM, so I know that actually getting hold of the software isn't the problem. So there must be plenty of other good reasons why experts like yourself go another way. I wonder, though, whether there is a common set of fields that I could try to offer and that would satisfy most people? So if you use a spreadsheet, what are the column names you would most like to see in a spreadsheet I could make available for you?
Thanks again, Jeremy

museumoflondon
04-05-2008, 11:53 AM
Hi Mark, thanks for your thoughts. Hopefully you haven't experienced any O/S or browser issues with the site, please let me know if you have. It's very simple. There is some javascript but if you don't have it switched on then simple clicks work instead - basically, the help on the search form works if you move your mouse over, but without JS you can just click the "?" instead.

Though it doesn't apply to this site, really, I'd suggest that you could relax a bit about JavaScript and let it into your life. It's not nearly as risky as people think, if you're sensible, and there is a lot of fine-grained control over scripting in modern browsers (and also over what you allow Flash to do in the modern plugin) so you can enable it and still explicitly prevent it from doing certain things. Yes, like having a letter box, it still sometimes means you'll experience a little junk, but if you set your browser correctly it will at least let you avoid the equivalent of a letter bomb!

There's no ActiveX either. ASP is a server-side technology. It can work with client-side technology, of course, but there's no causative link that says ASP means there will be ActiveX. Both are kind of outdated, to be honest, but ASP is up to the job for this and I built it before my ASP.Net days!

It does sound, as you say, like GEDCOM is not going to do for everyone. For those who can't or don't wish to use it, I guess we could say either (a) cut and paste will be fine, if we assume they're only going to be interested in one or two records (c.f. Mary Anne's points) or (b) there may be another more structured, downloadable/saveable format that would be more useful. What do you think? If (b), would you like CSV (which could be used in Excel, OpenOffice, or countless other generic software packages)? I know you mentioned PDF, which may be possible, but how do you envisage using it? If you want simply to save a record as a document then PDF will do the job, as you say (as would TXT or RTF), but if you want to pull the data automatically into a genealogy programme a structured form would be needed and PDF wouldn't do for that. But as I say, that does depend on what you'd aim to do with the download. We might, of course, offer more than one format.

Thanks again for the input, I hope you find the site usable but please let me know if you do have problems. I doubt I've tested it as thoroughly as I might and probably haven't used the O/S and browser configuration you do, so I may have missed something important - please let me know if so!

All the best, Jeremy

Geoffers
04-05-2008, 11:57 AM
So if you use a spreadsheet, what are the column names you would most like to see in a spreadsheet I could make available for you?

I've indexed just coming up to 400,000 entries for my own area of research (NE Norfolk); these are currently in spreadsheets which vary in format according to the type of information. The basics that I think are necessary for any tabular format are:

When (Year or full date), Who (Primary and Secondary persons), Where (location), What (event), Source (where does the information come from).
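A minimal sketch of a CSV download built around those headings, using Python's standard `csv` module. The column names (`when`, `who_primary` and so on) are one possible spelling of the headings above, and the sample rows, including the placeholder name and the empty source values, are invented for illustration rather than taken from the database.

```python
import csv
import io

# One possible set of column names for the When / Who / Where / What / Source headings.
FIELDS = ["when", "who_primary", "who_secondary", "where", "what", "source"]

# Invented sample rows; a real export would fill "source" from the database.
SAMPLE_ROWS = [
    {"when": "1857", "who_primary": "Example, John",
     "where": "St Marylebone", "what": "Birth"},
    {"when": "1875-1893", "who_primary": "Example, John",
     "where": "16 Dalston Lane, Hackney", "what": "Studio"},
]


def rows_to_csv(rows):
    """Serialise rows to CSV text; missing columns are left blank."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


print(rows_to_csv(SAMPLE_ROWS))
```

The `csv` module quotes any value containing a comma, so addresses such as "16 Dalston Lane, Hackney" stay in a single column when opened in a spreadsheet.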

I do use a commercial programme, Custodian3 (it will turn up using a search engine) into which I am slowly transferring my data. The trial version can be downloaded free if you wanted to experiment with various forms of information.

I can see why Gedcom is popular and the great majority of family historians do use it - I have one programme myself into which I have entered very basic detail just to produce basic diagrams of ancestry/descent.

But I don't like the prescriptive format of Gedcom, the means of recording source information; and do not under any circumstances transfer or accept information transferred by Gedcom because so much rubbish has been transferred between folk on this format.

Because it is so simple, someone sends another some information which is incorrect, it is imported and sent onto another. After this has happened half a dozen times and each has been added to some web-site it soon becomes difficult to pick out fact from fiction. I think Gedcom has been a dis-service to family history research; though I accept that I am probably in a minority of one in this opinion.

Whatever you do to your site, keep the home page simple

MarkJ
04-05-2008, 1:34 PM
Hi Mark, thanks for your thoughts. Hopefully you haven't experienced any O/S or browser issues with the site, please let me know if you have. It's very simple. There is some javascript but if you don't have it switched on then simple clicks work instead - basically, the help on the search form works if you move your mouse over, but without JS you can just click the "?" instead.

The site works well enough without scripts and I haven't spotted any issues with it as yet! Nice and simple - which is ideal for everyone.



Though it doesn't apply to this site, really, I'd suggest that you could relax a bit about JavaScript and let it into your life. It's not nearly as risky as people think, if you're sensible, and there is a lot of fine-grained control over scripting in modern browsers (and also over what you allow Flash to do in the modern plugin) so you can enable it and still explicitly prevent it from doing certain things. Yes, like having a letter box, it still sometimes means you'll experience a little junk, but if you set your browser correctly it will at least let you avoid the equivalent of a letter bomb!


Indeed you *can* restrict what js is permitted to do or not do, but most computer users are not "tech-savvy" and probably don't understand the options with regard to what can be permitted or not - assuming they even realise such things exist.
That is why I made the comment - if the site can avoid scripting, it helps :)




There's no ActiveX either. ASP is a server-side technology. It can work with client-side technology, of course, but there's no causative link that says ASP means there will be ActiveX. Both are kind of outdated, to be honest, but ASP is up to the job for this and I built it before my ASP.Net days!


Indeed - it is similar to php in that sense. I have looked at .asp once - I have some space on a server which allows me to run asp, but as I already had php scripts ready to roll, I use my Linux server for stuff instead. Annoying really - because the Winbox is free (I did some work for the owner and have a free account there).



It does sound, as you say, like GEDCOM is not going to do for everyone. For those who can't or don't wish to use it, I guess we could say either (a) cut and paste will be fine, if we assume they're only going to be interested in one or two records (c.f. Mary Anne's points) or (b) there may be another more structured, downloadable/saveable format that would be more useful. What do you think? If (b), would you like CSV (which could be used in Excel, OpenOffice, or countless other generic software packages)? I know you mentioned PDF, which may be possible, but how do you envisage using it? If you want simply to save a record as a document then PDF will do the job, as you say (as would TXT or RTF), but if you want to pull the data automatically into a genealogy programme a structured form would be needed and PDF wouldn't do for that. But as I say, that does depend on what you'd aim to do with the download. We might, of course, offer more than one format.


csv would work I expect. I have imported a csv file into OpenOffice and easily created a spreadsheet. With a bit (actually, a fair sized bit!) of effort, I converted it into a searchable database. More down to the limitations of OpenOffice or my understanding of it I think - in the end I imported the spreadsheet into another tool (called Kexi, part of KOffice) and created the database.

Would people wish to pull the data directly into their genealogy program? Not sure how others work, but I prefer to add data by hand. Going back to gedcom for example, if I import a gedcom with details regarding my grandfather and his family, but the gedcom I am importing contains different dates/names or maybe more or less detail - then I would be concerned that my original data would be overwritten. Even in the cases I have seen, it generally ends up with you having to manually manipulate the two gedcoms together to get a sensible result!
Ideally, offer two or three formats - then folks can pick what they want. You will never please everyone though ;)



Thanks again for the input, I hope you find the site usable but please let me know if you do have problems. I doubt I've tested it as thoroughly as I might and probably haven't used the O/S and browser configuration you do, so I may have missed something important - please let me know if so!

All the best, Jeremy

So far, so good! The site seems to be clean and easy to follow - even for a Luddite like myself ;)

museumoflondon
04-05-2008, 3:54 PM
Brilliant, thank you gents. Some thought-provoking stuff there. I think you're right, Mark, that more than one format is the best idea. In fact many potential users of the data may not be from the genealogical community at all, and GEDCOM would be a big "?!?" to them, although other data formats might help. I'm not in all honesty sure when I'll have a chance to do any of this, but when I do I'm now inclined to think it should be for CSV in the first instance. It does seem that most of our (well-structured) data fits well into schemes other than GEDCOM anyway, because it typically concerns a single individual and doesn't record relationships amongst them (that's only in the free text part, which I can't really do much with).

Geoffers, those headline data classes are helpful, I'll use them. It's a really interesting point you make about the very simplicity of passing information around causing a problem with losing track of where the data came from and how trustworthy it is. I note that there is a header element for recording who made the file, but this is an audit trail that stretches back just one step. As you suggest, that's not good enough really! It sounds like you're managing fine without GEDCOM if you've reached 400,000 records. On the plus side, at least there's steadily less of NE Norfolk to worry about so you should soon be done ;)

And Mark, I don't think it's luddite to want less junk popping up all over your browser so I'm with you there, too! Any sensible web developer will think carefully before doing anything that will exclude people from using their sites. This is pretty much a statutory obligation on publicly funded sites.

Thank you both again, Jeremy

Geoffers
04-05-2008, 9:01 PM
those headline data classes are helpful, I'll use them.

The extent of additional columns that might be used in any table depends on what you or a user of the site wish to analyse.

Thinking of photographers - I suppose you might wish to tabulate the type of photographic process used, or camera, where known; if an individual style of mount or trademark was used by which the work of individual photographers might be identified.

Of course, inclusion of such information may be rather specialised and have limited use for many - you may want to keep the data much simpler, in which case the headings previously suggested ought to cover most things.

Guy Etchells
04-05-2008, 10:42 PM
I may have the wrong end of the stick here but if images are being offered Gedcom is no use.
Gedcom is a text format it does not allow for images.

Having said that some genealogy programs offer an adapted form of gedcom which allows images to be transferred, whether the said images could be imported to another different program is a matter for debate.

Gedcom is supposed to be a standard format but with such adaptations to the standard the concept becomes meaningless.
Cheers
Guy

museumoflondon
05-05-2008, 9:20 AM
Thanks Guy. Don't worry, the site is about photographers (and other people in that trade), not photographs; or rather, that is at its core (there are a few attached images) and the data we're talking about here is the data about the people, not the piccies.

That said, the original spec of GEDCOM 5.5 does actually allow for references to images, and they can be delivered zipped up together, much as HTML could be (but not, say, PDF, where the image is actually embedded in the document - as you say, the GEDCOM file is just a text file).
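To make the idea of a text file that refers to images delivered alongside it concrete, here is a hedged sketch in Python. The `multimedia_link` and `bundle` helpers, the file names and the placeholder image bytes are all hypothetical; the OBJE/FORM/TITL/FILE layout follows my reading of the GEDCOM 5.5 multimedia link structure and should be checked against the specification before being relied on.

```python
import io
import zipfile


def multimedia_link(level, file_ref, form="jpeg", title=None):
    """Return GEDCOM-style lines referring to an external image file."""
    lines = [f"{level} OBJE", f"{level + 1} FORM {form}"]
    if title:
        lines.append(f"{level + 1} TITL {title}")
    lines.append(f"{level + 1} FILE {file_ref}")
    return lines


def bundle(gedcom_text, images):
    """Zip the GEDCOM text together with the image files it refers to."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("export.ged", gedcom_text)  # assumed file name
        for name, data in images.items():
            zf.writestr(name, data)
    return buf.getvalue()


# Invented record: the GEDCOM only names the image, it does not contain it.
record = ["0 @I1@ INDI", "1 NAME John /Example/"]
record += multimedia_link(1, "images/portrait.jpg", title="Studio portrait")
gedcom_text = "\n".join(["0 HEAD"] + record + ["0 TRLR"]) + "\n"

# Placeholder bytes stand in for a real JPEG.
archive = bundle(gedcom_text, {"images/portrait.jpg": b"placeholder"})
print(len(archive), "bytes in archive")
```

Whether another program follows the relative file reference on import is a separate question, which is the point raised earlier in the thread.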

GEDCOM is amazing in one way: the way no-one seems to be driving it forward any more. Or rather, from what I read, its creators in Utah have gone on with it in its XML form that itself dates back to 2002, and the software vendors have done nothing to follow it, preferring to stick with the horrible old 1995 version. Seems to me that that is the real shame: if people have gone ahead, as you say, with putting forks into the implementation then it helps no-one, whereas by using XML they could readily have put in extensions (namespaces) and perhaps hooked up other established formats. Like you say, it stops being a standard at all if everyone gradually goes in a different direction, but I guess that's inevitable given how old (and yucky) the 5.5 spec is.

Geoffers, photographic process is an idea, although I wonder whether it might be most useful at the searching stage i.e. a field on the search form to help people find photographers using a particular process. I'll have a look at the data, though, it may be possible to extract that information into a field for downloads. Otherwise it's in the descriptive text.

Cheers, Jeremy

Geoffers
05-05-2008, 11:04 AM
Geoffers, photographic process is an idea, although I wonder whether it might be most useful at the searching stage i.e. a field on the search form to help people find photographers using a particular process. I'll have a look at the data, though, it may be possible to extract that information into a field for downloads. Otherwise it's in the descriptive text.

It was just a thought for those who may have a very specific interest in the history of photography and its development, so that they might be able to trace advances in equipment and how quickly they were adopted. The inclusion of marks or possibly frame types would be to help identify who may have taken photos and so trace where a photo may have been taken. But it depends on the concept of the web-site.

One possibility for inclusion (if you do not already do so) might be to extract names of photographers from directories?

Good luck with what seems to be a very interesting project. Do let us know how it progresses.