#Spark on #Hadoop

October 22, 2014

I just watched a Hortonworks webinar on how the company is tackling the integration of Spark with Hadoop 2.0 and YARN. One issue with Spark is locking and stalls in large multi-tenant Hadoop clusters, caused by the way Spark uses resources compared with the rest of the Hadoop stack. Hortonworks aims to make Spark run fully within YARN and behave more like the rest of the stack.

There are some compelling reasons to like Spark. First, it provides a single consistent programming model across single instances, Spark clusters and Hadoop. Second, it draws on the vast library of Python analytics like NumPy and SciPy. Of course, on the downside, it means being able to program in NumPy and SciPy, which is pretty techy.

Still, if Hortonworks makes it work, next year could be the year of Spark.


Did Oracle miss the Big Data Boat?

October 6, 2014

Not just Oracle but did the whole RDBMS industry let the Hadoop thing get away from them? They didn’t take it seriously and left the market wide open for open source to march in. Now they can’t get rid of it.

Over the next five years or so, all the spending will be on “Big Data”. But very little will actually be spent on “Big Data” database software because Hadoop is free. Now Hortonworks, Cloudera and MapR will make scads of money at their level but it will be a pittance by Oracle’s standards. Every customer will put their RDBMS systems in maintenance mode while they concentrate on “Big Data” which means flat spending and no growth for Oracle et al. Eventually the “Big Data” thing may saturate and we’ll get back to spending on core RDBMS systems but that’s years away.

The RDBMS vendors can try things like Oracle with its Exadata big data appliance but really this is either silly or desperate.

If you are an Oracle DBA, look to a future of doing upgrades, applying patches, and maybe upgrading capacity to handle the load coming from Hadoop. Nobody is going to launch anything new and exciting on your platform.


Is #Hadoop the Next #bigthing?

August 10, 2014

Yes, Hadoop is the #bigdata thing, but is it the next big thing in general? Is it time to stop talking about #bigdata and just say #data? Is it time to take these so-called bigdata technologies mainstream for the whole data centre?

There are some attractive ideas about having Hadoop in the centre of your data centre:

· The cost of saving data drops so low that you can just save everything and worry about it later, without all that expensive ETL, aggregation, and dimensional modelling running up costs for storing data you don’t even know if you really need yet.

· The cost moves to the extract side with "schema on read". It may even be higher, but presumably you know the value of extracting the data at this point, so it makes sense to have the cost here.

· It is truly enterprise scale. How often do we talk about 5 year plans to "boil the ocean" for some enterprise wide initiative like MDM? This is an ocean that can be boiled in days or even hours.

· It is open source, so you can store your data without vendor lock-in.
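To make the "schema on read" point concrete, here is a minimal sketch in plain Python (not Hadoop or Spark code). The sample records, field names and function names are all made up for illustration. The raw lines are saved exactly as they arrived, and the structure is only imposed when someone reads them for a specific question.

```python
import json

# Raw events saved as-is, with no upfront schema (hypothetical sample data).
RAW_LINES = [
    '{"user": "alice", "amount": "12.50", "ts": "2014-08-01"}',
    '{"user": "bob", "amount": 7, "extra": {"coupon": "X1"}}',
    'not even json',
    '{"user": "alice", "amount": "3.25"}',
]


def read_with_schema(lines):
    """Apply a schema at read time: keep user and amount, skip lines that don't fit."""
    for line in lines:
        try:
            record = json.loads(line)
            yield {"user": str(record["user"]), "amount": float(record["amount"])}
        except (ValueError, KeyError, TypeError):
            # Malformed or incomplete records cost nothing to store;
            # the cost of dealing with them shows up here, on the read side.
            continue


def total_by_user(lines):
    """Sum the amounts per user over whatever records match the schema."""
    totals = {}
    for row in read_with_schema(lines):
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


if __name__ == "__main__":
    print(total_by_user(RAW_LINES))
```

Nothing was cleaned or modelled when the data was written. A different question later could read the same lines with a different schema, for example one that pulls out the coupon field instead.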

The big question, though, is what you really do with it.

· I can’t imagine many business users writing Python scripts to run on Spark for example. The users that can will have a big early advantage though.

· So you still need those legacy data marts and report repositories hung off of it. Users already complain that their data warehouses and data marts aren’t real time enough now. Doesn’t adding Hadoop in the middle make this worse?

· For now, that is, until there are new tools that let users build and query datasets off the Hadoop cluster easily. Spark sort of does this now, though not easily. Once these tools are available, those data marts and report repositories will fade away.

The modern data architecture will be something like the Hortonworks DataLake in the middle with new Hadoop enabled tools and all those legacy RDBMSs and DataMarts strung around the periphery. All of the interesting work in the next 5 years will centre around this DataLake, the new tools, and the connections to it. Everything else will be just maintenance.


NoSQL in Odd Places

July 5, 2013

What started me thinking NoSQL will be where all the application action goes was finding NoSQL databases in unexpected places. Meaning not the usual “Big Data” but something like embedded systems, for example.

Why would an embedded systems developer choose NoSQL over SQL for a small embedded database? On the surface you have to wonder but the answer is likely the programming model. It is a lot easier for a Java developer to work with Java objects and JSON documents than to map those objects to SQL. Most of them don’t understand SQL very well and they make a mess of it, so just pick something they understand better.
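A rough sketch of the difference in programming model, in Python rather than Java. The device object, the table layout and the function names are invented for illustration, and a plain dict stands in for a document store. Saving the object as a document is one call each way. Saving it relationally means designing tables, splitting the object apart, and reassembling it on the way back.

```python
import json
import sqlite3

# A hypothetical nested application object, as an embedded app might hold it.
device = {
    "id": "sensor-1",
    "location": "boiler room",
    "readings": [{"t": 1, "temp": 20.5}, {"t": 2, "temp": 21.0}],
}


def save_as_document(store, obj):
    """Document style: serialise the whole object under its key."""
    store[obj["id"]] = json.dumps(obj)


def load_document(store, key):
    """Document style: one lookup returns the whole object."""
    return json.loads(store[key])


def save_as_rows(conn, obj):
    """Relational style: split the object across two tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS device (id TEXT PRIMARY KEY, location TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reading "
        "(device_id TEXT REFERENCES device(id), t INTEGER, temp REAL)"
    )
    conn.execute("INSERT INTO device VALUES (?, ?)", (obj["id"], obj["location"]))
    conn.executemany(
        "INSERT INTO reading VALUES (?, ?, ?)",
        [(obj["id"], r["t"], r["temp"]) for r in obj["readings"]],
    )


def load_rows(conn, key):
    """Relational style: two queries, then reassemble the object by hand."""
    dev_id, location = conn.execute(
        "SELECT id, location FROM device WHERE id = ?", (key,)
    ).fetchone()
    readings = [
        {"t": t, "temp": temp}
        for t, temp in conn.execute(
            "SELECT t, temp FROM reading WHERE device_id = ? ORDER BY t", (key,)
        )
    ]
    return {"id": dev_id, "location": location, "readings": readings}


if __name__ == "__main__":
    store = {}  # stand-in for a document database
    save_as_document(store, device)
    conn = sqlite3.connect(":memory:")
    save_as_rows(conn, device)
    print(load_document(store, "sensor-1") == load_rows(conn, "sensor-1"))
```

Both routes give back the same object. The relational one just needs more code that the developer has to get right, which is the part they tend to make a mess of.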

I was shocked to find an RDM system that runs on MongoDB. Surely RDM is one place that would need the robustness of a relational database, but no, it doesn’t. Look a bit deeper and you see why. It is a closed system: nobody will be doing ad hoc queries and updates to an RDM, it is closed to just the app. While the database doesn’t provide much in the way of data consistency, the app has control and can enforce it. At the same time, it has to be able to deal with data in all sorts of formats and variations, and the schema-less database makes this easier.

I expect we’ll see more and more of this. The transaction control and object integrity move out of the datastore and into the app framework. It is easier or faster to program there. Time to market is everything today, and if NoSQL is good enough and faster to market, it will win most of the time.


SQL vs. NoSQL, OK, I Give Up

July 4, 2013

In the long running relational (SQL) vs NoSQL database debate, I’ve always been a big relational fan. I still believe it is the best technology to store data without error, but I give in. It’s great technology but it’s too inaccessible. It’s too expensive, too fragile, too hard to manage, too hard to program, etc. All the action from now on will be with NoSQL.

I feel like the guy with a Sun Workstation in 1995 telling the world that Windows 95 PCs are vastly inferior tech. They were, but it didn’t matter. Windows 95 was so accessible, it just took over everything. All the application action moved to that platform. And eventually Microsoft and Intel created a platform as good as any workstation, and nobody has talked about workstations for years now.

Relational DBs will survive, of course. Just as Sun servers now flourish hidden in every datacenter in the world, behind all that NoSQL will still be Oracle and DB2 holding the really important stuff. But only a few DBA geeks will ever see it. No interesting new apps will be written to it directly, just legacy maintenance. Business will no longer put up with the high cost to run relational DBs; they will just run what they must, outsourced to the lowest bidder. Not a very interesting place to be.

I’ll grant an exception to MySQL. Since it was born into the online web world that NoSQL also serves, it has some of that same easy accessibility. The SQL database for the NoSQL world, perhaps.


MDM Inter-operability

July 20, 2012

Can MDM hubs inter-operate? Without a whole lot of custom code or arcane proprietary interfaces?

I attended Aaron Zornes’s presentation on the state of MDM today at the MDM Summit Toronto and frankly I found it to be a bit depressing. First, no one product does everything well. So to do everything well, I might want to use more than one vendor, if that was possible. Second, I will get several vendors’ products together whether I want to or not, due to MDM and RDM being bundled into the application stacks by the big vendors like Microsoft and Oracle.

If I am going to have several MDM products together, can these things talk to each other? Now I’m sure there are lots of SIs out there that would be happy to charge handsomely to build some custom interfaces. There may also be some proprietary interfaces out there, but these will always behave differently depending on the vendor and each will have its custom setup.

What I’d really like to see is an industry standard protocol for MDM hubs so that there would be relatively predictable results when hubs work together and without a whole lot of custom code or setup to support.

I’ve never seen such a thing though, or even a discussion about it.


Data Modeling Jobs

December 19, 2011

Last week at IRMAC, I heard Karen Lopez talking about the severe shortage of experienced data modelers. Sadly, I drifted out of the field about 10 years ago when the jobs dried up. When I used ERwin, it was still owned by Logicworks. I still have a Logicworks shirt from one of their user conferences.

So, if I had stuck with it and been gainfully unemployed as a data modeler the last 10 years, would I be in big demand now?

