Saturday, 11 July 2015

ECJ clarifies Database Directive scope in screen scraping case

The European Court of Justice (ECJ) has ruled on the interpretation of Directive 96/9/EC on the legal protection of databases (Database Directive) in a case concerning the extraction of data from a third party’s website by means of automated systems or software for commercial purposes (so-called 'screen scraping').

Flight data extracted

The case, Ryanair Ltd vs. PR Aviation BV, C-30/14, is of interest to a range of companies such as price comparison websites. It stemmed from the Dutch company PR Aviation’s operation of a website where consumers can search through the flight data of low-cost airlines (including Ryanair), compare prices and, on payment of a commission, book a flight. The relevant flight data is extracted from third parties’ websites by means of ‘screen scraping’.

Ryanair claimed that PR Aviation’s activity:

• amounted to infringement of copyright (relating to the structure and architecture of the database) and of the so-called sui generis database right (i.e. the right granted to the ‘maker’ of a database where a substantial investment has been made to obtain, verify or present its contents) under the Dutch law implementing the Database Directive;

• constituted breach of contract. In this respect, Ryanair claimed that a contract existed with PR Aviation for the use of its website. Access to the website requires acceptance, by clicking a box, of the airline’s general terms and conditions which, among other things, prohibit unauthorized ‘screen scraping’ for commercial purposes.

Ryanair asked Dutch courts to prohibit the infringement and order damages. In recent years the company has been engaged in several legal cases against web scrapers across Europe.

The Local Court of Utrecht and the Court of Appeals of Amsterdam dismissed Ryanair’s claims on different grounds. The Court of Appeals, in particular, held that PR Aviation’s screen scraping of Ryanair’s website amounted to “normal use” of that website within the meaning of the lawful-user exceptions under Articles 6 and 8 of the Database Directive, which cannot be derogated from by contract (Article 15).

Ryanair appealed

Ryanair appealed the decision before the Netherlands Supreme Court (Hoge Raad der Nederlanden), which decided to refer the following question to the ECJ for a preliminary ruling: “Does the application of [Directive 96/9] also extend to online databases which are not protected by copyright on the basis of Chapter II of said directive or by a sui generis right on the basis of Chapter III, in the sense that the freedom to use such databases through the (whether or not analogous) application of Article[s] 6(1) and 8, in conjunction with Article 15 [of Directive 96/9] may not be limited contractually?”

The ECJ’s ruling

The ECJ, ruling without an Advocate General’s opinion, held that the Database Directive is not applicable to databases which are protected neither by copyright nor by the sui generis database right. Consequently, the exceptions to restricted acts set out in Articles 6 and 8 of the Directive do not prevent the owner of such a database from placing contractual limitations on its use by third parties. In other words, the restrictions on freedom of contract laid down by the Database Directive do not apply to unprotected databases. Whether Ryanair’s website is entitled to copyright or sui generis database right protection must be determined by the competent national court.

The ECJ’s decision is not particularly striking from a legal standpoint. Yet it could have a significant impact on the business model of price comparison websites, aggregators and similar businesses. Owners of databases that cannot rely on intellectual property protection may contractually prevent the extraction and use (“scraping”) of content from their online databases. Thus, unprotected databases could end up enjoying greater protection than that granted by IP law.

Antitrust implications

However, the lawfulness of contractual restrictions prohibiting access to and reuse of data through screen scraping should also be assessed from an antitrust perspective. In this respect, in 2013 the Court of Milan ruled that Ryanair’s refusal to grant the online travel agency Viaggiare S.r.l. access to its database amounted to an abuse of dominant position in the downstream market for flight information and intermediation (decision of June 4, 2013, Viaggiare S.r.l. vs Ryanair Ltd). Indeed, a balance should be struck between the need to reward the efforts and investments made by the creator of the database and the interest of third parties in being granted access to that information (especially where the information is not protected by copyright).

Additionally, web scraping raises other issues which were not considered in the ECJ’s ruling. These include, but are not limited to, trademark law (i.e., whether the use of a company’s names or logos by the web scraper without consent may amount to trademark infringement), data protection (e.g., where the scraping involves personal data) and unfair competition.

Source: http://www.globallegalpost.com/blogs/global-view/ecj-clarifies-database-directive-scope-in-screen-scraping-case-128701/

Saturday, 27 June 2015

Data Scraping - Increasing Accessibility by Scraping Information From PDF

You may have heard of data scraping, a technique by which a computer program extracts data from the output of another program. Put simply, it is the automated sorting of information found in different sources, including the internet, whether that information sits in an HTML file, a PDF or some other document, followed by the collection of the pertinent pieces. The collected information is then stored in databases or spreadsheets so that users can retrieve it later.

Most websites today expose their text directly in the source code, where it is easy to access. However, many businesses now choose to publish information as Adobe PDF (Portable Document Format) files. PDF files can be viewed using the free Adobe Acrobat Reader, which runs on almost any operating system. The format has clear advantages: a document looks exactly the same on any computer on which it is viewed, which makes it ideal for business documents and specification sheets. It also has disadvantages: the text is locked into a fixed layout (and in scanned documents it may exist only as an image), so copying and pasting the content is often a problem.

This is why some people scrape information from PDFs. The process, often called PDF scraping, is just like data scraping except that the information being extracted is contained in PDF files. To begin scraping information from PDF, you must choose a tool designed specifically for this task. Finding the right tool is not always easy, however, because many of the tools available struggle to extract exactly the data you want without some customisation.
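
To make this concrete, here is a minimal sketch of PDF text extraction in Python using the open-source pdfminer.six library (my own choice of library for illustration, not one named by the article; the file name is just an example):

# Minimal PDF text extraction sketch using pdfminer.six (pip install pdfminer.six).
from pdfminer.high_level import extract_text

def pdf_to_text(path):
    """Return all text found in the PDF at 'path' as a single string."""
    return extract_text(path)

if __name__ == "__main__":
    text = pdf_to_text("specification_sheet.pdf")  # example file name
    print(text[:500])                              # preview the first 500 characters

Tools like this work well on PDFs that contain real text; scanned, image-only PDFs would additionally need OCR.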

Nevertheless, if you search carefully enough you will find the program you are looking for. Many of these tools require no programming knowledge: you simply specify your preferences and the software does the rest of the work for you. There are also companies you can contact to perform the task for you, since they already have the right tools and experience; doing everything manually is tedious and complicated by comparison, whereas professionals can finish the job in far less time. Bear in mind, though, that scraping information from a PDF found on the internet does not automatically keep you clear of copyright law; whether the extraction and reuse is lawful depends on the content and on how you use it.

Source: http://ezinearticles.com/?Increasing-Accessibility-by-Scraping-Information-From-PDF&id=4593863

Saturday, 20 June 2015

Web Scraping: working with APIs

APIs present researchers with a diverse set of data sources through a standardised access mechanism: send a pasted together HTTP request, receive JSON or XML in return. Today we tap into a range of APIs to get comfortable sending queries and processing responses.
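
For a flavour of what that looks like in code (shown here in Python rather than R, purely as an illustration; the endpoint and parameters are the Google Geocoding API as I recall them, so check the current documentation and supply your own API key):

# Paste together an HTTP request, receive JSON, pull out the fields of interest.
import requests

params = {
    "address": "10 Downing Street, London",  # example query
    "key": "YOUR_API_KEY",                   # placeholder
}
resp = requests.get("https://maps.googleapis.com/maps/api/geocode/json", params=params)
resp.raise_for_status()

for result in resp.json().get("results", []):
    loc = result["geometry"]["location"]
    print(result["formatted_address"], loc["lat"], loc["lng"])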

These are the slides from the final class in Web Scraping through R: Web scraping for the humanities and social sciences

This week we explore how to use APIs in R, focusing on the Google Maps API. We then attempt to transfer this approach to query the Yandex Maps API. Finally, the practice section includes examples of working with the YouTube V2 API, a few ‘social’ APIs such as LinkedIn and Twitter, as well as APIs less off the beaten track (Cricket scores, anyone?).

I enjoyed teaching this course and hope to repeat and improve on it next year. When designing the course I tried to cram in everything I wish I had been taught early on in my PhD (resulting in information overload, I fear). Still, hopefully it has been useful to students getting started with digital data collection, showing on the one hand what is possible, and on the other giving some idea of key steps in achieving research objectives.

Download the .Rpres file to use in Rstudio here

A regular R script with code-snippets only can be accessed here

Slides from the first session here

Slides from the second session here

Slides from the third session here

Source: http://www.r-bloggers.com/web-scraping-working-with-apis/

Tuesday, 9 June 2015

Three Common Methods For Web Data Extraction

Probably the most common technique traditionally used to extract data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated, and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions and your scraping project is relatively small, they can be a great solution.
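
As a quick, hedged sketch of the idea (in Python rather than Perl; the pattern is deliberately naive and would need adjusting for real-world markup):

# Pull link URLs and link titles out of raw HTML with a regular expression.
import re

html = '<p><a href="/news/1">First headline</a> <a href="/news/2">Second headline</a></p>'

# Naive pattern: an anchor tag's href value followed by its link text.
link_pattern = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

for url, title in link_pattern.findall(html):
    print(url, "->", title.strip())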

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.

- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.

- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).

- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:

- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.

- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.

- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.

- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

- You create it once and it can more or less extract the data from any page within the content domain you're targeting.

- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).

- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.

- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.

- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.

- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.

- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.

- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.

- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it into a database.

Source: http://ezinearticles.com/?Three-Common-Methods-For-Web-Data-Extraction&id=165416

Wednesday, 3 June 2015

Scraping the Royal Society membership list

To a data scientist any data is fair game. Through my interest in the history of science I came across the membership records of the Royal Society from 1660 to 2007, which are available as a single PDF file. I’ve scraped the membership list before: the first time around I wrote a C# application which parsed a plain text file I had made from the original PDF using an online conversion service. Looking back at the code, it is fiendishly complicated and cluttered by the boilerplate required to build a GUI. ScraperWiki includes a pdftoxml function, so I thought I’d see if this would make the parsing easier, and compare the ScraperWiki experience more widely with my earlier scraper.

The membership list is laid out quite simply, as shown in the image below: each member (or Fellow) record spans two lines, with the member’s name in the left-most column on the first line, and information on their dates of birth and death, the class of their Fellowship and their election date on the second line.

Later in the document we find that information on the Presidents of the Royal Society is found on the same line as the Fellow name and that Royal Patrons are formatted a little differently. There are also alias records where the second line points to the primary record for the name on the first line.

pdftoxml converts a PDF into an xml file, wherein each piece of text is located on the page using spatial coordinates, an individual line looks like this:

<text top="243" left="135" width="221" height="14" font="2">Abbot, Charles, 1st Baron Colchester </text>

This makes parsing columnar data straightforward: you simply need to select elements with particular values of the “left” attribute. It turns out that the columns are not in exactly the same positions throughout the whole document, which appears to have been constructed by tacking the A–J membership list onto the K–Z list, but this is easily resolved by accepting a small range of positions for each column.
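
A minimal sketch of that idea in Python (illustrative rather than the scraper's actual code; the file name, column position and tolerance are invented for the example):

# Select <text> elements from pdftoxml output by their "left" attribute,
# allowing a small tolerance because column positions drift through the document.
import xml.etree.ElementTree as ET

NAME_COLUMN_LEFT = 135   # example position, taken from the line shown above
TOLERANCE = 10           # accept a small range of positions

tree = ET.parse("royal_society.xml")       # hypothetical pdftoxml output file
for text_el in tree.iter("text"):
    left = int(text_el.get("left", "0"))
    if abs(left - NAME_COLUMN_LEFT) <= TOLERANCE:
        # itertext() flattens any inline formatting tags inside <text>
        print("".join(text_el.itertext()).strip())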

Attempting to automatically parse all 395 pages of the document reveals some transcription errors: one Fellow was apparently elected on 16th March 197 – a bit of Googling reveals that the real date is 16th March 1978. Another Fellow is classed as a “Felllow”, and whilst most of the dates of birth and death are separated by a dash, some are separated by an en dash, which as far as the code is concerned is something completely different, and so on. In my earlier iteration I missed some of these quirks or fixed them by editing the converted text file. These variations suggest that the source document was typed manually rather than generated from a pre-existing database. Since I couldn’t edit the source document, I had to code around these quirks.

ScraperWiki helpfully makes putting data into a SQLite database the simplest option for a scraper. My handling of dates in this version of the scraper is a little unsatisfactory: presidential terms are described in terms of a start and end year but are rendered as 1st January of those years in the database. Furthermore, in historical documents dates may not be known accurately, so someone may have a birth date described as “circa 1782” or “c 1782”; even more vaguely, they may be described as having “flourished 1663–1778” or “fl. 1663–1778”. Python’s default datetime module does not capture this subtlety, and if it did, the database used to store the dates would need to support it too to be useful. I’ve addressed this by storing the original life-span data as text so that it can be analysed should the need arise. Storing dates as proper dates in the database, rather than as text strings, means we can query the database using date-based queries.
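
A rough illustration of that compromise (not the scraper's actual code; the table layout and the parsing rule are simplified assumptions):

# Store a parsed date where the text allows it, but always keep the original text.
import re
import sqlite3
from datetime import date

def parse_birth_date(lifespan_text):
    """Return 1st January of the birth year for a plain 'YYYY-YYYY' span; None otherwise."""
    m = re.fullmatch(r"(\d{4})\s*-\s*\d{4}", lifespan_text.strip())
    return date(int(m.group(1)), 1, 1) if m else None

conn = sqlite3.connect("royal_society.sqlite")
conn.execute("""CREATE TABLE IF NOT EXISTS fellows
                (name TEXT, birth_date DATE, lifespan_raw TEXT)""")
examples = [("Abbot, Charles", "1757-1829"), ("Uncertain, Fellow", "fl. 1663-1778")]
for name, lifespan in examples:
    conn.execute("INSERT INTO fellows VALUES (?, ?, ?)",
                 (name, parse_birth_date(lifespan), lifespan))
conn.commit()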

ScraperWiki provides an API to my dataset so that I can query it using SQL, and since it is public anyone else can do this too. So, for example, it’s easy to write queries that tell you that the database contains 8019 Fellows, 56 Presidents, 387 Fellows born before 1700, 3657 with no birth date, 2360 with no death date, 204 who “flourished” and 450 whose birth dates are “circa” some year.

I can count the number of classes of fellows:

select distinct class,count(*) from `RoyalSocietyFellows` group by class

Make a table of all of the Presidents of the Royal Society

select * from `RoyalSocietyFellows` where StartPresident not null order by StartPresident desc

…and so on. These illustrations just use the ScraperWiki htmltable export option to display the data as a table but equally I could use similar queries to pull data into a visualisation.

Comparing this to my earlier experience, the benefits of using ScraperWiki are:

•    Nice traceable code to provide a provenance for the dataset;

•    Access to the pdftoxml library;

•    Strong encouragement to “do the right thing” and put the data into a database;

•    Publication of the data;

•    A simple API giving access to the data for reuse by all.

My next target for ScraperWiki may well be the membership lists for the French Academie des Sciences, a task which proved too complex for a simple plain text scraper…

Source: https://scraperwiki.wordpress.com/2012/12/28/scraping-the-royal-society-membership-list/

Friday, 29 May 2015

Web Scraping Services - A trending technique in data science!!!

Web scraping is trending as an emerging technique in data science and is becoming an integral part of many businesses – sometimes whole companies are formed around web scraping. Web scraping and the extraction of relevant data give businesses insight into market trends, competition, potential customers, business performance and more. So what exactly is web scraping, and where is it used? Let us explore web scraping – also called web data extraction, web mining/data mining or screen scraping – in more detail.

What is Web Scraping?

Web data scraping is a technique for extracting unstructured data from websites and transforming it into structured data that can be stored and analysed in a database. Web scraping is also known as web data extraction, web harvesting or screen scraping.

Broadly speaking, whatever you can see on the web can be extracted, and extracting targeted information from websites helps you take effective decisions in your business.

Web scraping is a form of data mining. The overall goal of the web scraping process is to extract information from websites and transform it into an understandable structure such as a spreadsheet, database or CSV file. Data such as item pricing, stock pricing, reports, market pricing, product details and business leads can be gathered through web scraping.
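
As a simple, hedged illustration of that pipeline in Python (the URL and CSS selectors below are placeholders rather than a real site):

# Unstructured page in, structured CSV out.
import csv
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/products")          # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

rows = []
for item in soup.select(".product"):                          # placeholder selector
    name = item.select_one(".name").get_text(strip=True)
    price = item.select_one(".price").get_text(strip=True)
    rows.append({"name": name, "price": price})

with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)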

There are countless uses and potential scenarios, either business oriented or non-profit. Public institutions, companies and organizations, entrepreneurs, professionals etc. generate an enormous amount of information/data every day.

Uses of Web Scraping:

The following are some of the uses of web scraping:

•    Collecting data from real estate listings

•    Collecting data from retailer sites on a daily basis

•    Extracting offers and discounts from websites

•    Scraping job postings

•    Monitoring competitors’ prices

•    Gathering leads from online business directories – directory scraping

•    Keyword research

•    Gathering targeted emails for email marketing – email scraping

•    And many more.

There are various techniques used for data gathering as listed below:

•    Human copy-and-paste – takes a lot of time when the data set is large

•    Programming a custom web scraper to suit your needs

•    Using web scraping software available on the market

Are you in search of a web data scraping expert or specialist? Then you are in the right place. We are a team of web scraping experts who can extract data from websites and structure the unstructured but useful data to uncover patterns, helping businesses make decisions that increase sales, widen the customer base and ultimately lead the business towards growth and success.

We have expertise in all web scraping techniques – scraping data from complex AJAX-enabled websites, bypassing CAPTCHAs, forming anonymous HTTP requests and more – when providing web scraping services.

Web scraping of publicly and freely available data is generally regarded as legal, although the position depends on the jurisdiction and on how the data is used. Smart WebTech can probably help you to achieve your scraping-based project goals. We would be more than happy to hear from you.

Source: http://webdata-scraping.com/web-scraping-trending-technique-in-data-science/

Tuesday, 26 May 2015

Endorsing web scraping

With more than 200 projects delivered, we stand ready for new challenges every day. We have served over 60 clients and won 86% repeat business, as our core focus is customer delight. Successive Softwares was approached by a client with a very specific set of requirements. For their project they required customised data mining, in real time, to offer profitable information to their customers. The requirement was to scrape stock exchange data in real time so that end users could be helped in their marketing decisions. This was an ambitious task for us because it required processing a huge amount of data on a routine basis. We welcomed it as an opportunity to evolve and do something beyond classic web application development.

We started with mock-ups, following the very first step of our IMPART Framework (Innovative Mock-up based Prototypes Analyzed to develop Reengineered Technology). Our team of experts thought through all the potential requirements and the flow, and materialised them faithfully in our mock-up. It was a strenuous task, but our excitement at doing something others had not yet attempted filled the team with confidence and energy, and things began to roll out perfectly. We presented our mock-up and statistics to the client and, as we had hoped, the client chose us, impressed with the effort.

We then gathered requirements from the client side and began to design the flow. The project required real-time monitoring of the stock exchange, together with prices and market turnover, and plotting them as graphs. The front-end part was an easy deal; we were already adept at presenting data the way it was required. The intractable task was getting the data. We researched and found that it could be achieved either with an API or with web scraping, and we went with the latter because of the limitations of the API.

Web scraping is a compelling technique for getting the required information straight out of a web page. A lack of documentation and little room for error forced us to make a slow start, but we kept all the requirements clear and knew we were heading in the right direction. We divided the scraping process into distinct but related tasks. First we needed to find the data to be captured; some of the problems we faced were pagination and the use of AJAX, but by examining the endpoints in the URL and the requests made when the data is loaded, we surmounted these problems easily.
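
In spirit, the workaround looks something like the Python sketch below (our production code was written in PHP, and the endpoint and parameter names here are hypothetical stand-ins for what you would find by watching the browser's network requests):

# Call the JSON endpoint behind the page's AJAX requests, one page at a time.
import requests

BASE_URL = "https://example-exchange.com/api/quotes"      # hypothetical endpoint
page = 1
all_rows = []
while True:
    resp = requests.get(BASE_URL, params={"page": page})  # hypothetical paging parameter
    resp.raise_for_status()
    rows = resp.json().get("data", [])
    if not rows:              # an empty page means we have walked past the last one
        break
    all_rows.extend(rows)
    page += 1

print("fetched", len(all_rows), "quotes")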

After identifying our target data we focused on an HTML parser that could extract data from all the targets. Using PHP we developed a parser that extracted all the information and saved it in the database in a structured way. With the required data at our end, we easily arranged it into tables and charts, using Highstock for the charts. The entire client side was developed in PHP with the Zend Framework, and we used MySQL 5.7 on the server side.

Throughout the development cycle our QA team ensured we were delivering a quality product that followed all standards. We kept our client in the loop during the whole process, informing them about every step. The client was also reassured as they watched their project grow from scratch into a fully-fledged website. The process followed a strict timeline, releasing regular builds and implementing new improvements. We lived up to our client’s expectations and delivered a product just as they had visualised it.

Source: http://www.successivesoftwares.com/endorsing-web-scraping/

Monday, 25 May 2015

What you need to know about web scraping: How to understand, identify, and sometimes stop

NB: This is a guest article by Rami Essaid, co-founder and CEO of Distil Networks.

Here’s the thing about web scraping in the travel industry: everyone knows it exists but few know the details.

Details like how does web scraping happen and how will I know? Is web scraping just part of doing business online, or can it be stopped? And lastly, if web scraping can be stopped, should it always be stopped?

These questions and the challenge of web scraping are relevant to every player in the travel industry. Travel suppliers, OTAs and meta search sites are all being scraped. We have the data to prove it; over 30% of travel industry website visitors are web scrapers.

Google Analytics and most other analytics tools do not automatically remove web scraper traffic, also called “bot” traffic, from your reports – so how would you know this non-human and potentially harmful traffic exists? You have to look for it.

This is a good time to note that I am CEO of a bot-blocking company called Distil Networks, and we serve the travel industry as well as digital publishers and eCommerce sites to protect against web scraping and data theft – we’re on a mission to make the web more secure.

So I am admittedly biased, but will do my best to provide an educational account of what we’ve learned to be true about web scraping in travel – and why this is an issue every travel company should at the very least be knowledgeable about.

Overall, I see an alarming lack of awareness around the prevalence of web scraping and bots in travel, and I see confusion around what to do about it. As we talk this through I’ll explain what these “bots” are, how to find them and how to manage them to better protect and leverage your travel business.

What are bots, web scrapers and site indexers? Which are good and which are bad?

The jargon around web scraping is confusing – bots, web scrapers, data extractors, price scrapers, site indexers and more – what’s the difference? Allow me to quickly clarify.

–> Bots: This is a general term that refers to non-human traffic, or robot traffic that is computer generated. Bots are essentially a line of code or a program that is created to perform specific tasks on a large scale.  Bots can include web scrapers, site indexers and fraud bots. Bots can be good or bad.

–> Web Scraper: Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites (source: Wikipedia). Web scrapers are usually bad.

If your travel website is being scraped, it is most likely that your competitors are collecting competitive intelligence on your prices. Some companies are even built to scrape and report on competitive prices as a service. This is difficult to prove, but based on a recent Distil Networks study, prices seem to be the main target. You can see more details of the study and infographic here.

One case study is Ryanair. They have been particularly unhappy about web scraping: they won a lawsuit against a German company in 2008, incorporated CAPTCHA in 2011 to stop new scrapers, and, when CAPTCHA wasn’t totally effective and Cheaptickets was still scraping, they took to the courts once again.

So Ryanair is doing what seems to be a consistent job of fending off web scrapers – at least after the scraping is performed. Unfortunately, the amount of time and energy that goes into identifying and stopping web scraping after the fact is very high, and usually this means the damage has been done.

This type of web scraping is bad because:

    Your competition is likely collecting your price data for competitive intelligence.

    Other travel companies are collecting your flights for resale without your consent.

    Identifying this type of web scraping requires a lot of time and energy, and stopping them generally requires a lot more.

Web scrapers are sometimes good

Sometimes a web scraper is a potential partner in disguise.

Meta search sites like Hipmunk sometimes get their start by scraping travel site data. Once they have enough data and enough traffic to be valuable they go to suppliers and OTAs with a partnership agreement. I’m naming Hipmunk because the company is one of the few to fess up to site scraping, and one of the few who claim to have quickly stopped scraping when asked.

I’d wager that Hipmunk and others use(d) web scraping because it’s easy, and getting a decision maker at a major travel supplier on the phone is not easy, and finding legitimate channels to acquire supplier data is most definitely not easy.

I’m not saying you should allow this type of site scraping – you shouldn’t. But you should acknowledge the opportunity and create a proper channel for data sharing. And when you send your cease and desist notices to tell scrapers to stop their dirty work, also consider including a note for potential partners and indicate proper channels to request data access.

–> Site Indexer: Good.

Google, Bing and other search sites send site indexer bots all over the web to scour and prioritize content. You want to ensure your strategy includes site indexer access. Bing has long indexed travel suppliers and provided inventory links directly in search results, and recently Google has followed suit.

–> Fraud Bot: Always bad.

Fraud bots look for vulnerabilities and take advantage of your systems; these are the pesky and expensive hackers that game websites by falsely filling in forms, clicking ads, and looking for other vulnerabilities on your site. Reviews sections are a common attack vector for these types of bots.

How to identify and block bad bots and web scrapers

Now that you know the difference between good and bad web scrapers and bots, how do you identify them and how do you stop the bad ones? The first thing to do is incorporate bot-identification into your website security program. There are a number of ways to do this.

In-house

When building an in house solution, it is important to understand that fighting off bots is an arms race. Every day web scraping technology evolves and new bots are written. To have an effective solution, you need a dynamic strategy that is always adapting.

When considering in-house solutions, here are a few common tactics:

    CAPTCHAs – Completely Automated Public Turing tests to tell Computers and Humans Apart (CAPTCHAs) exist to ensure that user input has not been generated by a computer. This has been the most common method deployed because it is simple to integrate and can be effective, at least at first. The problem is that CAPTCHAs can be beaten with a little work and, more importantly, they are a nuisance to end users that can lead to a loss of business.

    Rate Limiting – Advanced scraping utilities are very adept at mimicking normal browsing behavior, but most hastily written scripts are not. Bots will follow links and make web requests at a much more frequent, and consistent, rate than normal human users. Limiting IPs that make several requests per second will catch basic bot behavior (a minimal sketch of this idea appears after the provider examples below).

    IP Blacklists - Subscribing to lists of known botnets & anonymous proxies and uploading them to your firewall access control list will give you a baseline of protection. A good number of scrapers employ botnets and Tor nodes to hide their true location and identity. Always maintain an active blacklist that contains the IP addresses of known scrapers and botnets as well as Tor nodes.

    Add-on Modules – Many companies already own hardware that offers some layer of security. Now, many of those hardware providers are also offering additional modules to try and combat bot attacks. As many companies move more of their services off premise, leveraging cloud hosting and CDN providers, the market share for this type of solution is shrinking.

    It is also important to note that these types of solutions are a good baseline but should not be expected to stop all bots. After all, this is not the core competency of the hardware you are buying, but a mere plugin.

Some example providers are:

    Imperva SecureSphere – Imperva offers Web Application Firewalls, or WAFs. This is an appliance that applies a set of rules to an HTTP connection. Generally, these rules cover common attacks such as cross-site scripting (XSS) and SQL injection. By customizing the rules to your application, many attacks can be identified and blocked. The effort to perform this customization can be significant and needs to be maintained as the application is modified.

    F5 ASM – F5 offers many modules on its BIG-IP load balancers, one of which is the ASM (Application Security Manager). This module adds WAF functionality directly into the load balancer. Additionally, F5 has added policy-based web application security protection.
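
As promised above, here is a minimal sketch of the rate-limiting idea in Python (illustrative only; a production system would keep this state in something like Redis rather than in process memory):

# Sliding-window request counter: flag IPs that exceed a per-second budget.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 1.0
MAX_REQUESTS_PER_WINDOW = 5

recent_requests = defaultdict(deque)   # ip -> timestamps of recent requests

def is_rate_limited(ip):
    now = time.time()
    window = recent_requests[ip]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:   # drop old timestamps
        window.popleft()
    return len(window) > MAX_REQUESTS_PER_WINDOW

# Example: a burst of requests from one address trips the limit.
for _ in range(10):
    blocked = is_rate_limited("203.0.113.7")
print("blocked:", blocked)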

Software-as-a-service

There are website security software options that include, and sometimes specialize in web scraping protection. This type of solution, from my perspective, is the most effective path.

The SaaS model allows someone else to manage the problem for you and respond with more efficiency even as new threats evolve.  Again, I’m admittedly biased as I co-founded Distil Networks.

When shopping for a SaaS solution to protect against web scraping, you should consider some of the following factors:

•    Does the provider update new threats and rules in real time?

•    How does the solution block suspected non-human visitors?

•    Which types of proactive blocking techniques, such as code injections, does the provider deploy?

•    Which of the reactive techniques, such as rate limiting, are used?

•    Does the solution look at all of your traffic or a snapshot?

•    Can the solution block bots before they reach your infrastructure – and your data?

•    What kind of latency does this solution introduce?

I hope you now have a clearer understanding of web scraping and why it has become so prevalent in travel, and even more important, what you should do to protect and leverage these occurrences.

Source: http://www.tnooz.com/article/what-you-need-to-know-about-web-scraping-how-to-understand-identify-and-sometimes-stop/

Saturday, 23 May 2015

Roles of Data Mining in Predicting, Tracking, and Containing the Ebola Outbreak

One of the most diverse continents on earth, Africa astounds the world with its vast savannas and great deserts and with its ancient architecture and modern cities, but Africa also has its share of tragedies and woes.

First identified near the Ebola River in the Democratic Republic of the Congo in 1976, Ebola Hemorrhagic Fever, a deadly zoonotic disease caused by the Ebola virus, has been spreading through West Africa like a wildfire, engulfing everything in its way and creating widespread panic.

What has added insult to injury is the fact that the region has long endured the severe consequences of civil wars and social conflicts, and diseases like malaria, HIV/AIDS, yellow fever, cholera etc. have remained endemic to the region for a long time, causing tens of thousands of deaths every year.

Reportedly, Ebola has already killed at least 2,296 people, and there are about 3,685 confirmed cases of infection. The mortality rate has been swinging between 50% and 90%, depending on the quality of care and nutrition. According to the WHO, the disease is likely to infect as many as 20,000 people before it is finally brought under control.

Crisis of Data

When it comes to healthcare management, clinical data is one of the key components. The value of data becomes more urgent in the emergency situation like that of West Africa. The more relevant data you have, the bigger picture you can create for taking aggressive measures. To use Peter Drucker’s words, “What gets measured gets managed.”

Factual data is a precondition for the doctors and health science experts working in the field to measure and manage the situation. Data helps them to assess their successes and failures and to reorient their actions. One of the important reasons why the fight against the Ebola outbreak is turning into a losing battle is the insufficiency of data. Recently, Scientific American magazine wrote:

Right now, there are not even enough beds for sick patients nor enough data coming in to help track cases. Surveillance and tracking of those who were possibly exposed to Ebola remain inadequate.

In Science magazine, Gretchen Vogel suggests that the death toll of Ebola patients could be much higher than it is currently estimated. She says, “Exactly how many unrecorded Ebola deaths have occurred will never be known. Health officials are keeping track of suspected and probable cases, many of which are people who died before they could be tested.” Greg Slabodkin voices similar concerns in Health Data Management and points at the need of an integrated global biosurveillance system.

The absence of reliable and actionable data has badly hampered the efforts of combatting Ebola and providing proper medical care to the victims. CDC Director Dr. Tom Frieden describes it as a “fog-of-war situation”.

Data Mining: Bots Were the First to Warn

When you flip the coin, however, the situation is not completely bleak and desperate. Even if Big Data technologies have fallen short in predicting, tracking, and containing the epidemic, mainly due to the lack of data from the ground, they have not entirely failed. Data scientists and healthcare experts the world over are making concerted efforts to know, track, and defeat the Ebola virus—some on the ground and some in their labs.

The increasing level of collaboration among biomedical specialists, geneticists, virologists, and IT experts has definitely contributed to slowing down the transmission of the virulent disease dubbed “the plague of the modern day”. Médecins Sans Frontières and HealthMap.org are excellent examples in this regard.

    “By deploying bots and crawlers and by using advanced machine learning algorithms, the Boston-based global infectious disease surveillance system, HealthMap was able to predict and raise concerns about the spread of a mysterious hemorrhagic fever in West Africa nine days earlier than WHO did.”

Run by a team of 45 researchers, epidemiologists, and software developers at Boston Children’s Hospital, HealthMap mines data from search engine queries, social media platforms, health information sites, news reports and crowd-sourced information to track the transmission of the disease and provides an up-to-date timeline report with an interactive map, making it easier for the international health agencies to devise more effective action plans.

HealthMap serves as a good example of how crucial Big Data and data mining technologies could be for handling a healthcare emergency with fact-based and data-driven decisions.

Ebola Data

In their letter to The Lancet, research scientist Rashid Ansumana and his colleagues, working on Ebola in Sierra Leone, highlighted the need to develop epidemic surveillance systems “by adopting new data-sharing technologies.” They wrote, “Emerging technologies can help early warning systems, outbreak response, and communication between health-care providers, wildlife and veterinary professionals, local and national health authorities, and international health agencies.”

Data-Driven Initiatives to Control the Outbreak

The era of systematic use of data for making better epidemiological predictions and for finding effective healthcare solutions began with Google Flu Trends in 2007, and the rapidly developing tools, technologies, and practices in Big Data have increased the role of data in healthcare management.

There are a number of data-driven undertakings in progress which have contributed to counter the raging spread of Ebola. Brockmann Lab, run by Professor Dirk Brockmann and his colleagues, for example, has created a computer model for studying correlations and probabilities in the explosion of new cases of infection.

Figure: World Air Traffic Transportation and Relative Import Risk (source: Brockmann Lab)

By applying computational and statistical models, they predict which areas, cities or regions in the world are at the risk of becoming the next Ebola epidemic hotspots. Similarly, Alessandro Vespignani–a network scientist, statistical physicist, and Northeastern professor–has been using human mobility network data to track the cases of Ebola infection and dissemination.

The Swedish NGO Flowminder Foundation has been aggregating, mining, and analyzing anonymized mobile phone location data and is developing national mobility estimates for West Africa to help the local and international agencies to combat the disease.

Meanwhile, innovations such as Epi Info VHF, a software tool offering case management, contact tracing, analysis and reporting for Ebola and other hemorrhagic fever outbreaks, and the OpenStreetMap project, used to obtain location information and spatial data for the affected areas, have further helped to guide intervention initiatives.

However, with all optimism about the growing roles of Big Data and data mining, we also need to be mindful about their limitations. Newsweek aptly puts: “While no media-trawling bot could ever replace national and international health agencies, such tools may be starting to help fill in some of the most gaping holes in real-time knowledge.”

Source: http://www.grepsr.com/blog/data-mining-tracking-ebola-outbreak/

Wednesday, 20 May 2015

Hard-Scraped Hardwood Flooring: Restoration of History

Throughout history, hardwood flooring has undergone dramatic changes, from the meticulously hand-scraped, polished floors of the majestic plantations of the Deep South to modern-day technology providing maintenance-free wood flooring designed for comfort and appearance. The hand-scraped hardwood floors of the South conveyed charm, with the old rustic nature and character often associated with that era. Today, hand-scraped hardwood flooring is being revitalized and used in upscale homes and places of business to restore the old country charm that once faded into oblivion.

As the name implies, hand-scraped flooring involves retexturing the top layer of the flooring material by various methods in an attempt to mimic the rustic appearance of flooring in yesteryears. Depending on the degree of texture required, hand scraping hardwood material is often accomplished by highly skilled craftsmen with specialized tools and years of experience perfecting this procedure. When properly done, hand-scraped hardwood floors add texture, richness and a uniqueness not offered by any similar hardwood flooring product.

Rooted in history, these types of floors are available with finished or unfinished surfaces. The majority of individuals selecting hand-scraped hardwood flooring choose a prefinished floor to reduce the per-square-foot cost of installation and finishing labor, allowing budget guidelines to bend, not break. As expected, hand-scraped flooring is expensive and, depending on the grade and finish selected, can range from $15 to $40 per square foot and beyond for the material alone. Preparation of the material is labor intensive, adding dramatically to the overall cost per square foot. The recommended professional installation can, and often does, increase the cost per square foot as well, placing this method of hardwood flooring well out of reach of the average hardwood floor purchaser.

With numerous selections of hand-scraped finishes available, each finish is designed to bring out a different appearance making it a one-of-a-kind work of art. These numerous finish selections include:

• Time worn aged, dark coloring stain application bringing out grain characteristics

• Wire brushed, providing a highlighted "grainy" effect with obvious rough texture

• Hand sculpted, smoother distressed uniform appearance

• French Bleed, staining of edges and side joints with a much darker stain to give a bleeding effect to the wood

• Hand Hewn or Rough Sawn, with visible and noticeable saw marks

Regardless of the selection made, scraped flooring cannot be compared to any other available flooring material in terms of durability, strength and visual appearance. Limited only by imagination and creativity, several wood species can be used to create unusual floor patterns, highlighting the main focal points of personal libraries and art collections.

The precise process used in the creation of scraped floors projects a custom look with deep color and subtle warm highlights. With radiant natural light reflecting off this type of floor, beauty and depth radiate in a fashion that fills the room with solitude and serenity, encompassing all who enter. Hand-scraped hardwood floors speak of the past: a time of dissent, a time of war and ambiguity towards other races, and the bloodshed endured so that all men could be treated as equals. More than exquisite flooring, hand-scraped hardwood flooring is the restoration of history.

Source: http://ezinearticles.com/?Hard-Scraped-Hardwood-Flooring:-Restoration-of-History&id=6333218

Sunday, 17 May 2015

Introducing ScrapeShield: Discover, Defend & Deter Content Scraping

If you're a publisher, whether an individual blogger or a major media outlet, you've undoubtedly experienced content scraping. Search the web for an article you've published or other original content you've created and you may find it copied and republished on some other random website. Often the site will be full of ads. And, sometimes, it will even rank higher in search results than your original work.

While you may envision an army of individuals copying and pasting your content on their sites, the truth is content scraping is typically an automated process with bots that grab original content and then republish it without human intervention onto link farm sites. CloudFlare has blocked many of these bots automatically in the past, but we decided it was time to do something to more actively stop them.

Introducing ScrapeShield

ScrapeShield is an app created by the CloudFlare team. It incorporates several existing CloudFlare features like email obfuscation and hotlink protection that serve to protect from content scraping and adds a number of new features as well. Because we believe every publisher of original content should be able to understand and control how their work is used, we're providing ScrapeShield free for every CloudFlare user.

Detect, Defend & Deter

ScrapeShield has different elements to help you detect when your content is scraped, defend your site against content scrapers, and even deter content scrapers from targeting you in the first place. If you enable ScrapeShield, CloudFlare will automatically insert invisible tracking beacons in your content. When automated bots scrape your content, they pull the beacons along with them. CloudFlare detects these beacons when they ping from sites that aren't your own. You can access your ScrapeShield control panel to see where your content is being republished. Not only is this useful in showing scraping, but you can also see users who are reading your content through proxy services like Flipboard or Pulse.

The data from the content beacons is fed back into CloudFlare's protection system. As CloudFlare identifies content scraping bots, we automatically prevent them from accessing your site. Just like Project Honey Pot, the original inspiration for CloudFlare, used traps to detect when spammers were harvesting email addresses, CloudFlare now uses data from ScrapeShield to identify content scrapers and keep them off publishers' sites.

Maze

We didn't want to just stop scrapers from attacking sites on CloudFlare, we also wanted to tie up their resources so they couldn't harm the rest of the web. To do this, we created Maze. Maze routes known content scrapers who are visiting ScrapeShield-protected sites into a virtual labyrinth of gibberish and gobbledygook. We dynamically throttle the bandwidth and speed so that instead of the pages loading as fast as possible, the connection is held open to the scrapers and their resources are tied up.

We use excess resources on the CloudFlare network to generate Maze, and it doesn't consume any of our publishers' resources or add any additional load to their sites. What's beautiful about the system is that the only way that content scrapers can be sure they're avoiding Maze is to avoid CloudFlare's IP addresses entirely. For any content scrapers who may be reading this, here's a helpful list of all of our IPs so you can make sure to stay away.

No Pinning

Finally, with the rise of sites like Pinterest, innocent content scraping may become even more prolific. While many sites welcome their images being pinned, we wanted to make it easy to opt out. ScrapeShield includes an option to add the no-pinning meta tag to your site to prevent your images from being pinned to the site. As other similar services include a mechanism to opt out, expect that we'll include an easy way for you to do so right from the ScrapeShield interface.

The health of the web depends on publishers who create original content getting credit for their creations. CloudFlare is committed to building a better web, and we're extremely excited about ScrapeShield as a new tool to help publishers do exactly that.

Source: https://blog.cloudflare.com/introducing-scrapeshield-discover-defend-dete/

Wednesday, 13 May 2015

Kimono Is A Smarter Web Scraper That Lets You “API-ify” The Web, No Code Required

A new Y Combinator-backed startup called Kimono wants to make it easier to access data from the unstructured web with a point-and-click tool that can extract information from webpages that don’t have an API available. And for non-developers, Kimono plans to eventually allow anyone to track data without needing to understand APIs at all.

This sort of smarter “web scraper” idea has been tried before, and has always struggled to find more than a niche audience. Previous attempts with similar services like Dapper or Needlebase, for example, folded. Yahoo Pipes still chugs along, but it’s fair to say that the service has long since ceased to be a priority for its parent company.

But Kimono’s founders believe that the issue at hand is largely timing.

“Companies more and more are realizing there’s a lot of value in opening up some of their data sets via APIs to allow developers to build these ecosystems of interesting apps and visualizations that people will share and drive up awareness of the company,” says Kimono co-founder Pratap Ranade. (He also delves into this subject deeper in a Forbes piece here). But often, companies don’t know how to begin in terms of what data to open up, or how. Kimono could inform them.

Plus, adds Ranade, Kimono is materially different from earlier efforts like Dapper or Needlebase, because it’s outputting to APIs and is starting off by focusing on the developer user base, with an expansion to non-technical users planned for the future. (Meanwhile, older competitors were often the other way around).

The company itself is only a month old, and was built by former Columbia grad school companions Ranade and Ryan Rowe. Both left grad school to work elsewhere, with Rowe off to Frog Design and Ranade at McKinsey. But over the half-dozen or so years they continued their career paths separately, the two stayed in touch and worked on various small projects together.

One of those was Airpapa.com, a website that told you which movies were showing on your flights. This, as it turned out, gave them the idea for Kimono. To get the data they needed for the site, they had to scrape several publicly available websites.

“The whole process of cleaning that [data] up, extracting it on a schedule…it was kind of a painful process,” explains Rowe. “We spent most of our time doing that, and very little time building the website itself,” he says. At the same time, while Rowe was at Frog, he realized that the company had a lot of non-technical designers who needed access to data to make interesting design decisions, but who weren’t equipped to go out and get the data for themselves.

With Kimono, the end goal is to simplify data extraction so that anyone can manage it. After signing up, you install a bookmarklet in your browser, which, when clicked, puts the website into a special state that allows you to point to the items you want to track. For example, if you were trying to track movie times, you might click on the movie titles and showtimes. Then Kimono’s learning algorithm will build a data model involving the items you’ve selected.

That data can be tracked in real time and extracted in a variety of ways, including to Excel as a .CSV file, to RSS in the form of email alerts, or for developers as a RESTful API that returns JSON. Kimono also offers “Kimonoblocks,” which lets you drop the data as an embed on a webpage, and it offers a simple mobile app builder, which lets you turn the data into a mobile web application.
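
For the developer-facing output, the article only says that Kimono exposes a RESTful endpoint returning JSON. The snippet below shows how such an endpoint would typically be consumed in Python; the URL, API id, key and response shape are placeholders, not Kimono's documented API.

# Hypothetical example of consuming a Kimono-style JSON endpoint.
import requests

API_URL = "https://www.example.com/api/YOUR_API_ID"   # placeholder endpoint
params = {"apikey": "YOUR_API_KEY"}                   # placeholder credentials

resp = requests.get(API_URL, params=params, timeout=30)
resp.raise_for_status()
data = resp.json()

# Assume the scraped items come back as a list of dicts, e.g. movie showtimes.
for item in data.get("results", []):
    print(item.get("title"), "-", item.get("showtime"))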

For developer users, the company is currently working on an API editor, which would allow you to combine multiple APIs into one.

So far, the team says, they’ve been “very pleasantly surprised” by the number of sign-ups, which have reached ten thousand. And even though the product is only a month old, active users already number in the thousands.

Initially, they’ve found traction with hardware hackers who have done fun things like making an airhorn blow every time someone funds their Kickstarter campaign, for instance, as well as with those who have used Kimono for visualization purposes, or monitoring the exchange rates of various cryptocurrencies like Bitcoin and dogecoin. Others still are monitoring data that’s later spit back out as a Twitter bot.

Kimono APIs are now making over 100,000 calls every week, and usage is growing by over 50 percent per week. The company also put out an unofficial “Sochi Olympics API” to showcase what the platform can do.

The current business model is freemium based, with pricing that kicks in for higher-frequency usage at scale.

The Mountain View-based company is a team of just the two founders for now, and has initial investment from YC, YC VC and SV Angel.

Source: http://techcrunch.com/2014/02/18/kimono-is-a-smarter-web-scraper-that-lets-you-api-ify-the-web-no-code-required/

Sunday, 3 May 2015

Web Data Scraping - Scrape Business Data in no time

The Internet has evolved as one of the largest repositories of information for your business. You can design intelligent business processes to access a whole host of relevant information sources that will help you strategize, implement and deliver effective business objectives. Leveraging the benefits and usefulness of Web Scraping Tools is one such methodology that most businesses have adopted. Let us take a look at some of the ways it helps you easily scrape data relevant for your business.

Scraping for Business Information

Web Data Scraping is a technique employed by most organizations. It involves the implementation of tools that help businesses extract unstructured data and convert it into usable business information. The focus of most scraping initiatives revolves around the organization’s need to glean the following information:

•    Competitor analysis, to structure and strategize effectively

•    Price comparisons, to price their products competitively (a minimal sketch of such a price check follows this list)

•    Customer feedback, to enhance their product portfolio and provide customers with a better brand experience

•    Market dynamics, to help them identify areas of opportunity and threat
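
As a concrete illustration of the price-comparison item above, the short Python sketch below fetches a couple of competitor product pages and pulls out a price element; the URLs and the CSS selector are placeholders you would adapt to the actual sites.

# A minimal price-check sketch; URLs and the ".price" selector are placeholders.
import requests
from bs4 import BeautifulSoup

PRODUCT_PAGES = {
    "competitor-a": "https://www.example.com/product/123",
    "competitor-b": "https://www.example.org/item/abc",
}

for name, url in PRODUCT_PAGES.items():
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    price_tag = soup.select_one(".price")        # placeholder selector
    price = price_tag.get_text(strip=True) if price_tag else "not found"
    print(f"{name}: {price}")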

Using Scraping Tools

The abundance of information available on the Internet can be easily extracted and leveraged to help you build a productive business strategy. Tools with intuitive interfaces and intelligent algorithms have been designed to further this end.

Website Data Scraping tools are built for compatibility with a wide variety of applications, so they can explore a huge range of information sources. These tools are fully automated and offer drag-and-drop functionality, ensuring users get to leverage the benefits of speed and convenience.

Data extraction tools are not only adept at extracting data, but are also equally well-equipped to combine relevant statistics from platforms and services like YouTube, Twitter, Google Analytics and so on. This helps businesses to analyse trends and plan strategies accordingly.

Challenges of the Data Scraping Process

Just as there is no dearth of data to be collected from the Web, there is also an abundance of web scraping tools to execute the data collection process. However, the capability of the tool to help you collect the appropriate data needs to be assured before you can proceed with its implementation. Some of the challenges faced by most businesses owing to their wrong choice of tools include the following:

•    Run-of-the-mill extraction tools are unable to scale up sufficiently in order to capture large volumes of data

•    Some tools are also unable to establish compatibility with most data sources and therefore do not provide a holistic data collection approach

•    Some tools are also not equipped to conduct an automatic detection of updates made to a data source and therefore end up providing inaccurate data.

In light of all this, it is essential that you identify the right tool for your need and select one that is built on up-to-date technology to help you achieve the following:

•    Ensure that you are able to access the appropriate data that you want

•    Help you structure it in the format you want

•    Provide quick and easy access to all available data sources no matter how complex

•    Run accurately and reliably to help you churn out usable information.

Source: http://scraping-solutions.blogspot.in/2014_07_01_archive.html

Tuesday, 28 April 2015

Web Scraping – Effective Way of Improving Market Presence

Web scraping is a technique that is fast making its presence felt on the internet by the sheer weight of its effectiveness. It is a technique that uses software to crawl through the internet and gather up all the relevant and important information that one would need for their products.

The information gathered by web scraping can be used for various things such as data integration, web mashups, online price comparison and much more. Web scraping uses sophisticated software that crawls through the internet and gathers up all related information for the entity that you are looking for. The information is gathered in an automated, systematic, and very structured way, which allows for easy understanding of the collected information. Though this is one of the best ways of data extraction, there are quite a few things that one must be aware of before getting into web scraping.

Being aware of the following things keeps you in a better position not only to secure the best deal, but also to negotiate properly.

•    For data mining, the first thing one should be very sure of is the kind of data they want. One has to define properly what kind of data they want and what the purpose of it will be. For instance, if you wish to get a closer look at your competitors, it would be wise to let the data scraping service providers know who your competitors are; this allows them to gather better information. Similarly, if you are looking to acquire new customers, contact data from existing players in the respective industry would be helpful.

•    One should also be aware of the structure in which they want the data. A simple data structure has the entity name in the row and the properties of the entity in the cells of that row (a short illustration of this row-and-column structure follows this list). However, one can also opt for a chart-based structure. Apart from the above, there is just one more thing that one needs to keep in mind while using data mining services: the frequency of data extraction. At times a one-time data extraction is sufficient, whereas at other times periodic extractions or regular reports are required.
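
To make the row-and-column structure mentioned above concrete, here is a tiny Python illustration that writes one row per entity and one column per property using the standard csv module; the entities and fields are invented for the example.

# One row per entity, one column per property (illustrative data only).
import csv

entities = [
    {"name": "Competitor A", "price": "19.99", "rating": "4.2"},
    {"name": "Competitor B", "price": "17.49", "rating": "3.9"},
]

with open("entities.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "rating"])
    writer.writeheader()
    writer.writerows(entities)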

If you are aware of all the above points, then you are well placed to go ahead and take the help of website data scraping services. Knowing the above points allows you to know exactly what to ask of your vendor and how to evaluate their quote. One can make the most of data extraction services with the help of either web scraping or web crawling services.

Source: https://3idatascraping.wordpress.com/2014/01/07/web-scraping-effective-way-of-improving-market-presence/

Sunday, 26 April 2015

Scraping the Bottom of the Barrel - The Perils of Online Article Marketing

Many online article marketers so desperately wish to succeed, they want to dump corporate life and work for themselves out of their home. They decide they are going to create an online money making website. Therefore, they look around to see what everyone else is doing, and watch the methods others use to attract online buyers, and then they mimic their marketing, their strategies, and their business models.

Still, if you are copying what other people (less ethical people) are doing in online article marketing, those which are scraping the bottom of the barrel and using false advertising and misrepresentations, then all you are really doing is perpetuating distrust on the Internet. Therefore, you are hurting everyone, including people like me. You must realize that people like me don't appreciate that.

Let me give you a few examples of some of the things going on out there, things that are being done by people who are ethically challenged. Far too many people write articles and then, in their byline, send the Internet surfer or reader of the article to a website that has a squeeze page. The squeeze page has no real information on it; rather, it asks for their name and e-mail address.

If the would-be Internet surfer is unwise enough to type in their name and email address they will be spammed by e-mail, receiving various hard-sell marketing pieces. Then, if the Internet Surfer does decide to put in their e-mail address, the website grants them access and then takes them to the page with information about what they are selling, or their online marketing "make you a millionaire" scheme.

Generally, these are five page sales letters, with tons of testimonials of people you've never heard of, and may not actually exist, and all sorts of unsubstantiated earnings claims of how much money you will make if you give them $39.35 by way of PayPal, for this limited offer "Now!" And they will send you an E-book with a strategic plan of how you can duplicate what they are doing. The reality is whatever they are doing is questionable to begin with.

If you are going to do online article marketing please don't scrape the bottom of the barrel, there's just too much competition down there from what I can see. Please consider all this.

Source: http://ezinearticles.com/?Scraping-the-Bottom-of-the-Barrel---The-Perils-of-Online-Article-Marketing&id=2710103

Wednesday, 22 April 2015

Hand Scraped Versus Machine Scraped Floors - The Distinction

In society today, hardwood flooring has become the new must-have. The days of carpet are gone, and if you have looked into bringing your home up to date with today's styling you will have noticed by now that there are many different options. At times this may become very overwhelming, especially if you are not a hardwood specialist, which most people are not. That is why this article is here to help you understand the many different options available to you.

The flooring type covered in this article is hand scraped flooring. This is a custom-look flooring that is in very high demand in the flooring marketplace, which is understandable because it is probably the most unique flooring there is. You can choose from many different types of wood species such as oak, maple, hickory, and most exotic species. There is also computerized hand scraping, where the manufacturer scrapes one piece of wood and places it into a computer that will cut thousands of pieces of different wood types with that one design. This type of process is also known as machine scraping. Hardwood floors employing this type of technology usually cost less, but most of the pieces look the same because the hand scraping is done by a machine.

Then you have actual hand scraped flooring that is done entirely by hand and takes more time and effort than machine scraping. This flooring is made to order: each individual piece is scraped and notched in different ways, so every piece is unique. If you decide to purchase actual hand scraped flooring it will cost you more than the mass-produced computerized version, but it will definitely be the more unique option. If you are the type of person who wants to have a one-of-a-kind floor, then an actual hand scraped floor is the way to go.

So, in conclusion, hand scraped flooring is a great option for a lot of people. It comes in several different wood types and several different colors. You can find flooring options for every budget and to meet every style. Whether the scraping is done by computer or by hand may or may not be important to you when choosing a custom floor for your home. Most consumers cannot tell the difference between actual hand scraped flooring and machine scraped flooring when just looking at a small sample. So when shopping at your local retailer, ask the tough questions and find out whether the manufacturer uses machine or authentic hand scraping on their products.

To view your many options on hand scraped flooring please check out our website that covers all hardwood flooring options.

Source: http://ezinearticles.com/?Hand-Scraped-Versus-Machine-Scraped-Floors---The-Distinction&id=4151157

Saturday, 18 April 2015

What is HTML Scraping and how it works

There are many reasons why there may be a requirement to pull data or information from other sites, and usually the process begins after checking whether the site has an official API. Very few people are aware of the structured data that every website automatically supports. We are basically talking about pulling data right from the HTML, also referred to as HTML scraping. This is an awesome way of gleaning data and information from third party websites.

Any webpage content that can be viewed can be scraped without any trouble. If a website provides any way for the visitor's browser to download content and present it in a structured manner, then that content can also be accessed programmatically. HTML scraping works in an amazing manner.

Before indulging in HTML scraping, one can inspect the browser's network traffic. Site owners have a couple of tricks up their sleeve to thwart this access, but the majority of them can be worked around.

Before moving on to how HTML scraping works, we must understand the reasons behind it. Why is scraping needed? Once you get a satisfactory answer to this question, you can start looking for RSS or API feeds or various other traditional structured data forms. It is important to understand that, compared with their APIs, websites matter more to their owners.

The reason is simple: site owners prioritize maintaining the website that their many visitors actually use over safeguarding structured data feeds. This has been seen publicly with Twitter, for example, when it clamped down on its developer ecosystem. Many times, API feeds change or move without any prior warning. Sometimes this is a deliberate decision, but mostly such problems arise because there is no authority or organization that maintains or takes care of the structured data. It is rarely noticed if the feed gets severely mangled or goes offline. If, on the other hand, the website itself has issues or no longer works, it is a far more pressing problem that has to be dealt with without losing any time.

Rate limiting is another factor that needs a lot of thinking, and in the case of public websites it virtually doesn’t exist. Besides some occasional sign-up pages or captchas, many business websites fail to build defenses against unwarranted automated access. Many times, a single website can be scraped for four hours straight without anyone noticing. Unless you are making many concurrent requests, you are unlikely to be viewed as a DDoS attack; you will appear in the logs just as an avid visitor or an enthusiast, and that is only if anyone is looking.

Another factor in HTML scraping is that one can easily access any website anonymously. The administrator of a website has only a few ways to track behavior, which turns out to be beneficial if you want to gather the data privately. With APIs, registration is usually required in order to get a key, and that key has to be sent with every request. But with simple, straightforward HTTP requests, the visitor can stay anonymous apart from cookies and an IP address, which can again be spoofed.

The availability of HTML scraping is universal and there is no need to wait for the opening of the site for an API or for contacting anyone in the organization. One simply needs to spend some time and browse websites at a leisurely pace until the data you want is available and then find out the basic patterns to access the same.

Now you need to don the hat of a professional scraper and simply dive in. Initially, it may take some time to figure out the way the data has been structured and the way it can be accessed, just as we read APIs. Since there is no documentation, unlike with APIs, you need to be a little smarter about it and use clever tricks.

Some of the most used tricks are:

Data Fetching


The first thing that is required is data fetching. To begin with, find endpoints: the URLs that return the data you require. If you are pretty sure about the data you need and the way it should be structured to match your requirements, you will only require a particular subset of it, and you can later browse the site using its navigation tools to find it.
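
In Python, that first fetch is usually just a single HTTP GET; the endpoint below is a placeholder for whichever URL you identified while browsing the site.

# Fetch one candidate endpoint and keep the raw HTML for parsing.
import requests

url = "https://www.example.com/movies/today"   # placeholder endpoint
resp = requests.get(url, timeout=30)
resp.raise_for_status()                        # fail loudly on 4xx/5xx responses
html = resp.text                               # raw HTML, ready for parsing
print(len(html), "bytes fetched")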

GET Parameter

Pay attention to the URLs and to the way they change as you click between sections and as those sections divide into various subsections. Another option, before starting, is to go straight to the site's search functionality: type in a few terms and watch how the URL changes based on what is being searched. You will probably see a GET parameter, such as q, that changes according to the search term you use. Unused GET parameters can be removed from the URL until only the ones needed for loading the data are left. A query string always begins with a “?”.
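
The same experiment is easy to reproduce programmatically: the requests library builds the "?q=..." query string from a params dict, so you can see exactly which parameters the site actually needs. The endpoint and parameter name below are placeholders.

# Rebuild the search URL from GET parameters.
import requests

url = "https://www.example.com/search"          # placeholder search endpoint
resp = requests.get(url, params={"q": "matinee"}, timeout=30)
print(resp.url)   # e.g. https://www.example.com/search?q=matinee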

By now you will have started to come across the data that you would like to see and want to access, but sometimes there are pagination issues to deal with; because of them, you may not be able to see the data in its entirety. Many APIs likewise avoid returning everything in a single request, to keep their databases from being slammed. Often, clicking the next page adds an offset parameter to the URL that controls which slice of the data is visible on the page. All these steps will help you succeed in HTML scraping.
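
A typical way to walk that kind of pagination is to keep increasing the offset parameter until a page comes back empty, as in the sketch below; the endpoint, the parameter name and the CSS selector are all placeholders.

# Follow an offset-based pagination scheme until no more items are returned.
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/listings"        # placeholder endpoint
offset, page_size = 0, 20
items = []

while True:
    resp = requests.get(url, params={"offset": offset}, timeout=30)
    resp.raise_for_status()
    rows = BeautifulSoup(resp.text, "html.parser").select(".listing")  # placeholder selector
    if not rows:
        break                                   # empty page: every slice has been collected
    items.extend(row.get_text(strip=True) for row in rows)
    offset += page_size

print(len(items), "items collected")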

Source: https://www.promptcloud.com/blog/what-is-html-scraping-and-how-it-works/

Wednesday, 8 April 2015

The Coal Mining Industry And Investing In It

The History Of Coal Usage

Coal was initially used as a domestic fuel, until the industrial revolution, when coal became an integral part of manufacturing for creating electricity, transportation, heating and molding purposes. The large scale mining aspect of coal was introduced around the 18th century, and Britain was the first nation to successfully use advanced coal mining techniques, which involved underground excavation and mining.

Initially coal was scraped off the surface by different processes like drift and shaft mining. This has been done for centuries, and since the demand was quite low, these mining processes were more than enough to accommodate the demand in the market.

However, when the practical uses of coal as fuel sparked the industrial revolution, the demand for coal rose abruptly, leading to a severe shortage of coal output and gradually paving the way for new ways to extract coal from under the ground.

Coal became a popular fuel for all purposes, even to this day, due to its abundance and its ability to produce more energy per unit mass than other conventional solid fuels like wood. This was important as far as transportation, electricity generation and manufacturing processes are concerned, as it allowed industries to use up less space and increase productivity. The usage of coal started to dwindle once alternative energy sources such as oil and gas began to be used in almost all processes; however, coal is still a primary fuel source for manufacturing processes to this day.

The Process Of Coal Mining

Extracting coal is a difficult and complex process. Coal is a natural resource, a fossil fuel that is a result of millions of years of decay of plants and living organisms under the ground. Some can be found on the surface, while other coal deposits are found deep underground.

Coal mining or extraction comes broadly in two different processes, surface mining, and deep excavation. The method of excavation depends on a number of different factors, such as the depth of the coal deposit below the ground, geological factors such as soil composition, topography, climate, available local resources, etc.

Surface mining is used to scrape off coal that is available on the surface, or just a few feet underground. This can even include mountains of coal deposits, which are broken up using explosives; the fragmented coal is later collected and processed.

Deep underground mining makes use of underground tunnels, which is built, or dug through, to reach the center of the coal deposit, from where the coal is dug out and brought to the surface by coal workers. This is perhaps the most dangerous excavation procedure, where the lives of all the miners are constantly at a risk.

Investing In Coal

Investing in coal is a safe bet. There are still large reserves of coal deposits around the world, and due to its popularity, coal will continue to be used as fuel for manufacturing processes. Every piece of investment you make in any sort of industry or manufacturing process ultimately depends on the amount of output the industry can deliver, which in turn depends on the usage of some form of fuel, and in most cases that fuel is coal.

One might argue that coal usage leads to pollution and lower standards of hygiene for coal workers. This was arguably true in former years; however, newer coal mining companies are taking steps to ensure that the environmental impact of coal mining and usage is minimized, all the while providing a better working environment and benefits packages for their workers. If you can find a mining company that promises all these, and one that also works within the law, you can be assured of safety for your investments in coal.

Source: http://ezinearticles.com/?The-Coal-Mining-Industry-And-Investing-In-It&id=5871879

Sunday, 5 April 2015

How Extracting Job Postings Can Increase Revenue for Job Board Websites

For many people, job board websites present a great opportunity to bolster their search for their ideal job opening. They are the ideal avenue for prospective employers to meet right-fit employees. Employers post information about different openings in their companies, and job seekers respond to those posts with their profiles and resumes. This process gets carried out until the employer in question hits upon the right candidate for the job and the deal is closed.

For the job board website owner, revenue can come from many sources. Most job board websites charge a membership fee to employers to put up their posts about vacant positions for the consideration of job seekers registered on the website. Revenue can also come from targeted advertising, premium services and profile upgrades for both hiring and seeking parties.

The most important wealth that any job board website can have is the sheer volume of job openings on display. With a large volume of openings, job board website owners not only get insight about market trends and opportunities, they can also lure both job seekers and employers to try out their services, thereby benefiting from their massive participation. This is one aspect where job board website owners can choose to crawl job websites to get targeted, relevant and organic information. Such informative and intelligent crawling can go on to bring about an overall increase in their revenue.

How Job Boards Work

Job boards work on a simple principle – providing a common platform where demand meets supply. Companies are always looking to hire the right people for their vacancies. Similarly bright professionals are always looking for a chance to get hired and go on to work with a reputed company. A job board website aims to be an interface between these two parties. They encourage job seekers to create accounts and profiles, while inviting prospective employers to make their own profiles and post their requirements in the form of job listings, much like posting classified advertisement in a newspaper. Interested candidates can then reply to these listings with their own information, resumes and cover letters to take the hiring process ahead.

To reach the optimal volume of job postings, seeker profiles and accurate information about all parties involved, one thing that can provide a giant boost to any job board website is scraping job websites. As the owner of a job board website, you can choose to scrape job listings for a wide number of practical and effective applications which can bring better revenue for your business.

How Does Scraping Work?

If you are looking to multiply your revenue, increase your reach and penetrate the already overcrowded job board market, you can choose to employ the services of a company that provides web scraping for job listings. Using the information you provide and the requirements that you specify, these professionals use a web scraper to crawl job listings across a large number of job board websites.

All the data that is received is stored for further examination and use. By using this technique, you can gather a wealth of important data on critical pointers, including detailed job seeker profiles, detailed employer profiles and accurate descriptions of actual, live job postings. This treasure trove of information can help you in a number of different ways.

For starters, you can be ensured that you always have a good volume of job postings in your job board website – a sure way to attract both job seekers and employers. You can also ensure that your job postings remain current and updated, and significantly cut down on your maintenance burden by employing this innovative and effective technique.

How Extracting Job Postings Can Help?


Extracting job listings from websites can be an immensely beneficial process for your job board website. It is a no-nonsense, hassle-free process that enables you to keep all your listings and all your job seeker profiles detailed, comprehensive and full of current and accurate information.

For a job board website, using data from publicly available websites for targeted use can give rise to some significant challenges –

•    Scale – You have to actively monitor thousands of relevant websites on a daily basis to ensure the integrity and reliability of the generated data.

•    Speed – You have to keep monitoring all your old resources over and over again to ensure that your listings remain current and any instance of an expired job listing gets flagged down so that you can remove it from your own job board website.

•    Breadth – You face the challenge of being able to regularly retrieve good quality content from websites which are difficult to access.


Using a highly customized, highly efficient and fully automated tool like a web scraper can help you overcome these challenges and draw in significant revenue for your job board website. Handling such a vast and challenging task manually is not only counterproductive, but can also prove to be an extremely expensive undertaking in the long term. Hiring professionals who can use automated tools for job scraping eliminates the drudgery and dreariness of this massive task. Additionally, it increases speed and efficiency, and makes it really easy for you to receive highly targeted and relevant information which is tailor-made for your requirements. The data that you get from job site scraping is organic, actionable data which you can integrate into your business workflow to increase your revenue.

An efficient web scraper constantly scrapes job listings from a large number of high authority sources including job boards, online classified advertisements, Fortune 1000 websites, trade association websites and other sources. The fully automated process provides you with real-time information and feedback about changes and new additions to thousands of sources that feature job listings.

A highly customized web scraper can give you a number of important advantages when it comes to scraping job boards. These advantages help you automate the process efficiently, cut down on processing time and properly streamline your web scraping endeavor –

•    You can monitor scores of important and relevant websites in real-time. You can also set up alerts the moment old job postings are taken down and new ones are posted.

•    You can customize web scraping tools to track particular data fields only for changes or updates. This way you can avoid repetitive reloading of old, already collected data.

•    You can also configure your web scraping tool to be agnostic towards changes in websites, thereby taking out a large chunk of possible maintenance time.

•    You can configure your web scraping tool to normalize and standardize data fields in resumes, making them ready for automated comparison and matching (a small normalization sketch follows this list).
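
As a small sketch of that last normalization point, the Python snippet below standardizes a few fields of a scraped job posting so that records pulled from different boards can be compared automatically; the field names and date formats are assumptions, not any specific vendor's schema.

# Normalize a scraped job posting for automated comparison (illustrative fields).
from datetime import datetime

DATE_FORMATS = ("%d %b %Y", "%Y-%m-%d", "%m/%d/%Y")   # formats assumed to appear across boards

def normalize_date(value):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None   # unrecognized format: leave it for manual review

def normalize_posting(raw):
    return {
        "title": " ".join(raw.get("title", "").split()).title(),
        "location": raw.get("location", "").strip().lower(),
        "posted": normalize_date(raw.get("posted", "").strip()),
    }

print(normalize_posting({"title": "  senior   data engineer ",
                         "location": " Berlin ",
                         "posted": "18 Feb 2014"}))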

The Advantages

To compete in the highly competitive and saturated job board market, you need to go that extra mile with your own job board website. With the help of expert automated web scraping services, you can give your job board website the edge that it needs to survive and rise above the stiff competition.

You can continue to increase the scale of your operations and multiply your revenue. Using various means of web scraping you are able to provide authoritative, relevant and current information about job listings. When you successfully integrate automated web scraping into your workflow, it adds meaningful value on many levels -

Competitive Edge

You can surge ahead of the competition by offering your customers heightened speed and accuracy for all your job listings. Since you get real-time alerts when new job postings are added and old ones are removed, you can make information available to your customers within a short span of time. Thus, you can ensure that all content on your job board website stays current and updated.


Quality Listings


You can bring in new business by offering a detailed and comprehensive list of job listings both in terms of quality and volume. Using a highly customized and sophisticated automated web scraping tool to gather your information works in your favor. It enables you to further simplify your access to specialized niches like classifieds sections of online newspapers, trade association websites and message boards.

Tremendous Value-add


You can provide more value to potential employers by maintaining a large database containing job seeker profiles. These are accurate right down to the very last detail. This can be achieved efficiently by scraping and storing full profiles at high speeds. You can then configure your scraper to record any changes or updates to individual profiles.

Targeted profiling

You can provide your customers with information that goes the extra mile and establishes a differentiating factor from your competitors. This can be achieved by fine-tuning your web scraping efforts: enhance individual profiles by adding and supplementing them with further external information. This information can be pulled from public sources like social networks and onboarded with your existing profiles.

High degree of accuracy

You can achieve unmatched accuracy while matching applicant profiles to specific job listings. This can be achieved by normalizing important fields of applicant profiles and resumes to support fast, efficient automated comparisons, checks and matches.

Source: https://www.promptcloud.com/blog/how-scraping-job-postings-from-job-portals-helps-job-board-websites/

Friday, 27 March 2015

Make Your Business More Intelligent with Web Data Extraction services

Data extraction is one of the most widely practiced techniques to help you find the knowledge that pertains to your existing business or to any personal use. Many times, we find that experts copy and paste data manually from web pages, or download the complete website, which is a waste of time and energy.

Now, with the technique of data extraction, you can crawl through hundreds and even thousands of web pages to extract specific knowledge and, at the same time, save this information or data in the following formats:
  •     CSV file
  •     XML file, or any other custom format for future use

Given below are some instances of the data extraction process:

  •     Crawl a government portal to extract names of voters for a survey
  •     Search competitor websites for product pricing and feature information
  •     Use web scraping to download images from a stock photography site for website design (a small download sketch follows this list)
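
For the image-download example above, a minimal Python sketch might look like the following; the gallery URL is a placeholder, the images are assumed to sit in plain <img> tags, and a short pause keeps the crawl polite.

# Download every <img> on a page (placeholder URL and naive file naming).
import os
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

page_url = "https://www.example.com/gallery"    # placeholder gallery page
os.makedirs("images", exist_ok=True)

soup = BeautifulSoup(requests.get(page_url, timeout=30).text, "html.parser")
for i, img in enumerate(soup.find_all("img", src=True)):
    src = urljoin(page_url, img["src"])
    data = requests.get(src, timeout=30).content
    with open(os.path.join("images", f"image_{i}.jpg"), "wb") as f:
        f.write(data)
    time.sleep(1)                               # be polite between downloads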

How can Data Extraction serve you?

Data extraction can serve you in many ways:

  •     Extract data from any kind of website: directories, classified websites, news websites, blogs, articles, job portals, search engines, ecommerce websites, social media websites, and any other website whose content is accessible.
  •     Extract emails, contacts, prices/rates, features, contact names, contact details, full text, live updates, ASINs, meta tags, addresses, phone, fax, latitude & longitude, images, links, reviews, ratings, etc.
  •     Help in data collection, competitor analysis, research, business intelligence, social media trend analysis, brand monitoring, lead data collection, and website & competitor web monitoring.
  •     Deliver data in any database or format: Excel, CSV, Access, Text, MySQL, SQL, Oracle, etc.
  •     Provide custom web data extraction services as per client need, with one-time or continued/scheduled data delivery.

The next one is Website Data Scraping:

Website data scraping is the method of extracting information or data from a website by using specific software programs, available from proven providers only.

This extracted data may be used by anyone and for any purpose as per their requirements, and the extracted data can be employed in totally different industries. There are several companies providing the best website data scraping services.

It is a field that sees active development and shares a common objective that requires breakthroughs in the following:
  •     Text Processing
  •     Semantic Understanding
  •     Artificial Intelligence
  •     Human Computer Interactions

There are many end users, corporations and specialists that require information which is accessible in one format or another. In such cases, web data extraction can be tailored to the need of extracting information from any trusted source and preserving the data at a chosen destination.

The supported platforms include:
  •     Excel
  •     CSV
  •     MySQL and
  •     Others

Websitedatascraping.com is fully capable of web data scraping, website data scraping, web scraping services, website scraping services, data scraping services, product information scraping and yellowpages data scraping.