Bringing Up a WordPress Site in the Age of Hackers

Imagine if people walking by on your street walked up to your house and tried the door handle – several times a day! That’s the situation on the Internet in the year 2017. It’s very likely that hackers will start scanning and probing your website within hours of first creating it – even before you’ve published the presence of the site with Google, Bing and other search engines.

These days we hear stories in the news all the time about hackers making major security breaches in large corporations, but your own website is equally open to attack. WordPress, used by millions of websites, is a favorite target of hackers by virtue of its popularity.

If you’re using a reasonably well-known hosting platform for your WordPress site (GoDaddy, HostGator, DreamHost, etc.), the hackers know about those hosts. Internet hosting services get assigned blocks of sequential IP addresses, and the hackers scan these address sequences, one by one, looking for well-known features that can be exploited to break in and do some kind of damage.

This post details the experience with just one web host, but since ranges of assigned IP addresses are easy to find, it’s likely that many hosting sites experience similar issues.

Looking for Signs of Hackers

Most web servers (serving WordPress or some other web site design) keep an access log: a line-by-line record of files that are requested by visitors to the web site (under cPanel, these are the “Raw Access Logs”). Tools like Webalizer or AWStats, commonly available on hosting servers, analyze these logs to summarize useful information about the web site: what pages are most visited, how many hits in a day/week/month, where web site visitors are accessing the site from. The access log records information for all visitors, both legitimate and hacker, human and robot. But the visits from hackers often have some distinct characteristics that distinguish them from others.

A raw line in an access log can be difficult to read; here’s a sample:

212.83.152.183 - - [04/Aug/2017:06:16:38 -0400] "GET /wp-login.php HTTP/1.1" 200 4255 "-" "Mozilla/5.0 (Windows NT 5.1; rv:29.0) Gecko/20100101 Firefox/29.0"

In this post, examples of access log entries are distilled and simplified to look like this:

. . User-Agent:  Mozilla/5.0 (Windows NT 5.1; rv:29.0) Gecko/20100101 Firefox/29.0
. . Referrer:    -
04/Aug/2017:06:16:38 -0400  200  GET  /wp-login.php

The fields shown in this listing are date and time of the request (“04/Aug/2017:06:16:38 -0400”), response status (“200”), request method (“GET”), and request path (“/wp-login.php”). The request path is the resource (page or part of the page) that was requested; it is listed without the site’s hostname, so the full URL of the example request shown above would be “http://curiousprog.com/wp-login.php”.

The listing also shows the user agent and the referer for one (or more) requests from the same visitor. The user agent is the identifying string for the kind of browser that the visitor was using (Firefox in the above example). The referer is the page from which the request was made. If the user clicked a link to get to this page, the page where they clicked will be listed as the referer (example: “www.google.com” for a click on a Google search result).
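As a rough illustration of how a raw line breaks down into these fields, here is a minimal Python sketch. It assumes the log uses the common “combined” format shown in the raw sample above; the pattern and the function name parse_line are my own for illustration, and lines with a malformed request are simply skipped.

```python
import re

# Assumed layout: the "combined" log format of the raw sample line above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '                      # client address, two unused fields
    r'\[(?P<time>[^\]]+)\] '                     # date and time of the request
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '    # request method, path, protocol
    r'(?P<status>\d{3}) (?P<size>\S+) '          # response status, bytes sent
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'   # referer and user agent
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields


sample = (
    '212.83.152.183 - - [04/Aug/2017:06:16:38 -0400] '
    '"GET /wp-login.php HTTP/1.1" 200 4255 "-" '
    '"Mozilla/5.0 (Windows NT 5.1; rv:29.0) Gecko/20100101 Firefox/29.0"'
)
entry = parse_line(sample)
print(entry["time"], entry["status"], entry["method"], entry["path"])
print("User-Agent:", entry["agent"])
print("Referrer:  ", entry["referer"])
```

The printed fields correspond to the simplified listing format used in the rest of this post.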

A legitimate request for a page on a WordPress site will result in a dozen or more requests: first for the HTML of the page, but then more requests for the many other parts of the page – Javascript files (for dynamic interaction), CSS files (that control appearance and layout of the page), etc. This is how a browser gets the information in order to display the page.

In contrast, a request from a hacker will often be just a single request, directly to a page or some other valuable resource. It will most often skip making the additional requests for the other parts of the page, because it already has what it needs with the first request – the content of the page, or even just the knowledge that a page with a certain name is present and accessible.

A request from a hacker will often have no user agent (blank or “-“) or a user agent that is in a weird format (something that a user’s browser wouldn’t report); a hacker’s request will also often have a blank referer. Search engines (also frequent visitors to websites) can also have an unusual looking user agent, but they will often identify themselves as a “bot” (robot), “spider”, or “crawler”, like “Googlebot” or “bingbot” or “Baiduspider”.
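These rules of thumb can be sketched as a small Python function. This is only a heuristic under my own assumptions: the hint substrings and the category names are illustrative, not a complete list, and since the user agent is supplied by the visitor it can be faked.

```python
# Illustrative substrings only (assumptions, not an exhaustive list).
CRAWLER_HINTS = ("bot", "spider", "crawler")
SCRIPT_HINTS = ("python-requests", "curl")


def classify_user_agent(agent):
    """Sort a User-Agent string into a rough category."""
    lowered = (agent or "").strip().lower()
    if lowered in ("", "-"):
        return "blank"
    if any(hint in lowered for hint in CRAWLER_HINTS):
        return "self-identified crawler"
    if any(hint in lowered for hint in SCRIPT_HINTS):
        return "scripting tool"
    return "browser-like"


for agent in (
    "-",
    "python-requests/2.18.1",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Windows NT 5.1; rv:29.0) Gecko/20100101 Firefox/29.0",
):
    print(classify_user_agent(agent), "<-", agent)
```

A “browser-like” result is not proof of a human visitor; it only means the string contains none of the hints above.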

Response status is significant; this tells something about whether or not the request succeeded (and we can sometimes infer from the status whether a request was a legitimate one from a user or an attempt by a hacker).

Some common status codes that will be seen in the listings we examine are:

200 OK: the requested file or resource was successfully returned
302 Found (Moved Temporarily): a redirect to another URL; a common part of a login process, usually OK
304 Not Modified: the visitor’s cached copy of a file is still current (it hasn’t changed since it was cached), so the file was not sent again
403 Forbidden: the server understood the request but refused to serve it (for example, because of an access rule)
404 Not Found: an attempt to request something that isn’t there
405 Method Not Allowed: the HTTP request method used (“GET”, “POST”, etc.) isn’t allowed for the requested resource
406 Not Acceptable: the server couldn’t return the data in a format the requester would accept; on some hosts a security filter may also return this code for requests it rejects

For more information on HTTP response status codes, see this page: HTTP response status codes – HTTP | MDN
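If Python is at hand, its standard library can also look up the official name for a status code. This small sketch prints the standard phrase for each of the codes listed above.

```python
from http import HTTPStatus

# Print the standard reason phrase for each status code discussed in this post.
for code in (200, 302, 304, 403, 404, 405, 406):
    print(code, HTTPStatus(code).phrase)
```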

Where Are They Coming From?

Throughout this post, when I cite a particular attack I will say that it is “apparently from” a certain location (determined from the IP addresses of the requests). The location that a suspicious request is coming from is often not the real source of the attack: hackers generally do not mount attacks directly from their own servers. Instead they break into another site and install “command and control” software to run the attack, and they then control the attack via this intermediate server. This is one great reason among many to secure your own site: without precautions, your server could become part of a hacker’s “botnet”, a collection of compromised systems all mounting an attack on other systems!

With this, here is a chronicle of hacks and attacks that were attempted within a few hours, a few days, and a few weeks of the site’s initial startup.

Wednesday, August 2

I registered my domain name today and brought up my WordPress site for the first time! And it didn’t take long for the questionable web requests to show up.

First at the door and several times during the day I saw requests like the ones below. Apparently from Las Vegas, Nevada, these are simple requests to get the “home page” of the web site (shortened to “/”), but then to also get a path that’s a “URL encoded” version of the site name:

. . User-Agent:  Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)
. . Referrer:    -
02/Aug/2017:06:49:00 -0400  403  GET  /
02/Aug/2017:06:49:01 -0400  404  GET  /http%3A%2F%2Fwww%2ECURIOUSPROG%2Ecom%2F
02/Aug/2017:07:10:02 -0400  403  GET  /
02/Aug/2017:07:10:02 -0400  404  GET  /http%3A%2F%2Fwww%2ECURIOUSPROG%2Ecom%2F

“URL Encoding” of elements of the URL replaces certain punctuation characters with a percent sign followed by two hexadecimal digits: “%3A” is “:”, “%2F” is “/”, and “%2E” is “.”. It’s puzzling why this encoded request was made, but it’s clear that it wasn’t successful: these requests resulted in a 404 (“Not Found”) response code. Perhaps some site scanning tool was misconfigured and the operator didn’t realize it. In any case, this was not a normal request for a page of the site!
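To see what an encoded path like this says, it can be decoded with Python’s standard library. A minimal sketch, using the request path from the listing above:

```python
from urllib.parse import unquote

# Request path copied from the access log listing above.
encoded_path = "/http%3A%2F%2Fwww%2ECURIOUSPROG%2Ecom%2F"

# unquote() turns each %XX sequence back into the character it stands for.
decoded_path = unquote(encoded_path)
print(decoded_path)  # /http://www.CURIOUSPROG.com/
```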

The 403 (“Forbidden”) responses in this case are a result of a security measure that I set up temporarily in the .htaccess file of the site for just the first day it was up. This configuration (inspired by this posting: Hackers Find Fresh WordPress Sites Within 30 Minutes) forbade any request from any server other than the one that I was using to log into the site to finish configuring it.

At various times later on August 2, these requests arrived:

. . User-Agent:  python-requests/2.18.1
. . Referrer:    -
02/Aug/2017:06:11:18 -0400  403  GET  /
. . User-Agent:  -
. . Referrer:    -
02/Aug/2017:13:13:07 -0400  403  HEAD  /
. . User-Agent:  Curl/PHP 5.6.30 (http://github.com/shuber/curl)
. . Referrer:    -
02/Aug/2017:17:50:36 -0400  403  GET  /

The above are apparently from Montreal, Washington DC, and Tampa respectively.

Look at the User-Agent in these requests: it should indicate MSIE (Internet Explorer) or Firefox or Chrome or Safari. Instead, the requests list odd looking user agents: “python-requests” and “Curl”. These are both tools that are used to download the contents of web pages programmatically – no human is reading a page that is returned this way (or at least they’re not reading it in a web browser).

It could be that these are “web crawlers” or “spiders” that are used by Google, Bing, Baidu, or other search sites to collect searchable information from web pages – but by convention a legitimate web crawler will always include the name of the service (like “Googlebot”) in the user agent. This is very likely a “discovery” query from a would-be hacker, just to see what kind of site is at the address. For most of the tools like ‘curl’ and ‘python-requests’, it’s possible to add some configuration to “fake” a real User Agent (and referer), but the user of these tools either didn’t know how to or wasn’t careful enough to set it up.

The second “HEAD” (instead of GET) request is particularly suspicious – it lists no user agent nor referer in the request, and the HEAD request doesn’t even return any content for the page – just some header information on the request and its response (similar to the header on an email). It’s beyond the scope of this posting to discuss this in detail, but suffice it to say that a hacker can learn some things about a website by making a HEAD request, such as the name and version of web server being used (Apache, Nginx, etc), which might reveal an exploit that can be used to break into the site. A web browser used by a legitimate visitor would very rarely make a HEAD request like this.

Friday, August 4

On the second day that the website was up, a “brute force” login discovery attack was run against my site, apparently from Paris, France:

. . User-Agent:  Mozilla/4.0 (compatible; Synapse)
. . Referrer:    -
03/Aug/2017:19:43:14 -0400  200  GET  /wp-login.php
03/Aug/2017:19:43:16 -0400  404  GET  /?author=1
04/Aug/2017:01:44:19 -0400  200  GET  /wp-login.php
04/Aug/2017:01:44:19 -0400  406  POST  /wp-login.php
04/Aug/2017:01:45:21 -0400  200  GET  /wp-login.php
04/Aug/2017:01:45:21 -0400  406  POST  /wp-login.php
...many lines skipped...
04/Aug/2017:07:54:19 -0400  200  GET  /wp-login.php
04/Aug/2017:07:54:20 -0400  406  POST  /wp-login.php
04/Aug/2017:07:56:21 -0400  200  GET  /wp-login.php
04/Aug/2017:07:56:21 -0400  406  POST  /wp-login.php
04/Aug/2017:07:58:18 -0400  200  GET  /wp-login.php
04/Aug/2017:07:58:18 -0400  406  POST  /wp-login.php

This attack continued for 12 hours, making a pair of GET/POST requests to “/wp-login.php” about every minute, for a total of 640 requests. The POST requests (which could carry a username and password for an actual login request) get a 406 (“Not Acceptable”) response, indicating that these requests did not succeed.

Note the user agent: another case where it’s not indicating a kind of web browser (MSIE, Chrome, etc). A quick internet search revealed that Synapse is a web probing tool used in SQL injection attacks – a strong indicator that this is a hack attempt!

Something was making repeated requests to log in to the site, most likely trying different usernames and passwords from a list in each request. The “GET /wp-login.php” simply requests the content of the WordPress login page, but the “POST /wp-login.php” can actually do a login. If the login attempt succeeds (200 status), a hacker can use this to log in and attack the site: insert code to capture passwords, links to other malicious sites, advertisements for various things, etc. Once they’re in, they have free rein (more so if they discover an account with administrator control of the site).
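A pattern like this can be spotted by counting login POSTs per client address. The Python sketch below assumes the log has already been reduced to (address, method, path) tuples; the function names, the threshold of 20, and the sample addresses (taken from ranges reserved for documentation) are all my own assumptions for illustration.

```python
from collections import Counter


def login_post_counts(requests):
    """Count POST requests to /wp-login.php per client address."""
    counts = Counter()
    for ip, method, path in requests:
        # Ignore any query string when comparing the path.
        if method == "POST" and path.split("?", 1)[0] == "/wp-login.php":
            counts[ip] += 1
    return counts


def suspected_brute_force(requests, threshold=20):
    """Return addresses with at least `threshold` login POSTs, busiest first."""
    counts = login_post_counts(requests)
    return [ip for ip, n in counts.most_common() if n >= threshold]


# Hypothetical sample data: one noisy client and one ordinary visitor.
sample = (
    [("203.0.113.5", "GET", "/wp-login.php"),
     ("203.0.113.5", "POST", "/wp-login.php")] * 30
    + [("198.51.100.7", "GET", "/"),
       ("198.51.100.7", "POST", "/wp-login.php")]
)
print(suspected_brute_force(sample))
```

The right threshold depends on the site: a handful of login POSTs from one address in a day is normal for a site owner, hundreds is not.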

Also on August 4: apparently from Poland, Brazil, and a couple of servers in Arizona:

04/Aug/2017:06:15:22 -0400  404  GET  /dump.sql
04/Aug/2017:06:31:00 -0400  404  GET  /db.sql
04/Aug/2017:06:52:29 -0400  404  GET  /backup.sql
04/Aug/2017:07:10:20 -0400  404  GET  /sql.sql
04/Aug/2017:07:42:56 -0400  404  GET  /dump.sql.zip
04/Aug/2017:07:58:43 -0400  404  GET  /db.sql.bz2
04/Aug/2017:08:24:32 -0400  404  GET  /dump.tar
...

This appears to be an enumeration of possible files and folders related to backups (a.k.a. “dumps”) of the MySQL database for the WordPress site. These are probably the names of files that are created by WordPress plugins that offer an automated backup feature. A full database dump will include the contents of the WordPress users table with names and passwords (the names will be in clear text, the passwords will be stored as hashes). A hacker could extract the usernames and (with some work) the passwords for logins to the site, again, giving them access to modify it. The 404 (“Not Found”) responses show that these attempts were unsuccessful.

While these requests appear to be coming from four distinct servers, the unique and unusual nature of these requests suggests that this was a simple “botnet”, four servers each running the same scanning software, controlled from one location.

The above are just a sample of the 37 different filenames that were requested in this attack. Here is the complete list of filenames:

/1.sql /dump.sql.gz /site.sql
/backup.sql /dump.sql.tgz /sql.gz
/backup.sql.bz2 /dump.sql.zip /sqlmanager
/backup.sql.gz /dump.tar /sql.sql
/dbdump.sql.gz /dump.tar.gz /sql.tar
/db.sql /dump.tgz /sql.tar.gz
/db.sql.bz2 /mysql /sql.tgz
/db.sql.gz /mysqladmin /sqlweb
/db.sql.zip /mysql-admin /sql.zip
/db.tar /mysqlmanager /temp.sql
/db.tar.gz /mysql.sql /users.sql
/dump.gz /mysql.zip /websql
/dump.sql
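It is worth checking that none of your own backups are sitting where a request like these could reach them. The following Python sketch walks a directory tree and reports files whose names end like the ones in the list above; the suffix list and the function name find_stray_dumps are my own assumptions, so adjust them to your site.

```python
from pathlib import Path

# Filename endings modeled on the probe list above (an assumption, not exhaustive).
SUSPECT_SUFFIXES = (
    ".sql", ".sql.gz", ".sql.bz2", ".sql.zip", ".sql.tgz",
    ".tar", ".tar.gz", ".tgz",
)


def find_stray_dumps(web_root):
    """Return sorted paths (relative to web_root) of files that look like dumps."""
    root = Path(web_root)
    hits = []
    for path in root.rglob("*"):
        if path.is_file() and path.name.lower().endswith(SUSPECT_SUFFIXES):
            hits.append(path.relative_to(root).as_posix())
    return sorted(hits)
```

Called with the document root of the site (for example, a hypothetical “public_html” directory), it returns any matching files so they can be moved outside the web-accessible area.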

Wednesday, August 9

Apparently from Sofia, Bulgaria, the following is an attempt at accessing some internals of the site using the “xmlrpc” interface.

. . User-Agent:  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36
. . Referrer:    http://curiousprog.com
09/Aug/2017:10:12:02 -0400  200  GET  /
09/Aug/2017:10:12:04 -0400  405  GET  /xmlrpc.php
09/Aug/2017:10:12:05 -0400  200  POST  /xmlrpc.php
09/Aug/2017:10:12:06 -0400  302  GET  /wp-login.php
09/Aug/2017:10:12:08 -0400  404  GET  /?author=1

The POST with the 200 response shows that one request succeeded! The access log doesn’t record the body of a POST request (where its parameters are carried), so it’s not clear whether this was just a check that the interface was live, or a real attempt at a hack. Whatever this request was trying to do, it didn’t seem to do any damage to the site.

The “/?author=1” request is interesting: this is a commonly known technique on WordPress for finding valid user names on the site. If this request succeeds, WordPress typically redirects to that author’s archive page, whose URL (of the form “/author/<name>/”) reveals the user’s name. The number in the request is the internal id of the user; it can be any integer value (1, 2, 3). Most often, user number 1 is the first user defined when the WordPress site is first created, which is usually an admin user.

Discovering usernames for a WordPress site cuts the work in half for brute forcing login discovery: knowing the username, the attacker only has to guess passwords.

As can be seen above, the “/?author=1” request failed with a “404 Not Found” error.

In addition to the “/?author=N” request above, the request “/wp-json/wp/v2/users/” is another well known WordPress request; it can return JSON data listing users of the site (there will be some examples of this below).
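When reading through a log, it helps to label request paths that match these well-known probes. Here is a minimal Python sketch; the category names and the function classify_probe are my own, and the rules cover only the requests discussed in this post.

```python
from urllib.parse import parse_qs, urlsplit


def classify_probe(request_path):
    """Label a request path that matches a known probe, or return None."""
    parts = urlsplit(request_path)
    query = parse_qs(parts.query)
    # "/?author=N" with a numeric id is the author enumeration request.
    if "author" in query and all(value.isdigit() for value in query["author"]):
        return "author enumeration"
    if parts.path.rstrip("/") == "/wp-json/wp/v2/users":
        return "REST user listing"
    if parts.path == "/xmlrpc.php":
        return "xmlrpc"
    if parts.path == "/wp-login.php":
        return "login page"
    return None


for path in ("/", "/?author=1", "/wp-json/wp/v2/users/", "/xmlrpc.php"):
    print(path, "->", classify_probe(path))
```

A match is not proof of an attack (the site owner requests “/wp-login.php” too), so the label is best read together with the user agent, referer, and frequency of the requests.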

Sunday, August 20

Another example of attempts at WordPress user name enumeration and attempts at breaking in using /xmlrpc.php, apparently from Poland:

. . User-Agent:  Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0
. . Referrer:    -
20/Aug/2017:17:29:22 -0400  200  GET  /
20/Aug/2017:17:29:24 -0400  200  GET  /wp-includes/wlwmanifest.xml
20/Aug/2017:17:29:25 -0400  404  GET  /?author=1
20/Aug/2017:17:29:25 -0400  401  GET  /wp-json/wp/v2/users/
20/Aug/2017:17:29:26 -0400  403  POST /xmlrpc.php
20/Aug/2017:17:49:34 -0400  200  GET  /
20/Aug/2017:17:49:36 -0400  200  GET  /wp-includes/wlwmanifest.xml
20/Aug/2017:17:49:36 -0400  404  GET  /?author=1
20/Aug/2017:17:49:37 -0400  401  GET  /wp-json/wp/v2/users/
20/Aug/2017:17:49:37 -0400  403  POST /xmlrpc.php

This visitor is trying both of the well known requests for getting user information, as well as making an xmlrpc request.

There are also requests here for the file “/wp-includes/wlwmanifest.xml”. This is a config file that is associated with “Windows Live Writer”, an application for authoring blog posts. The Microsoft Essentials site indicates that Windows Live Writer reached “end of life” this January and is no longer supported. A blog post on chickgeek.org suggests a way to edit a file in WordPress to prevent this file from being accessed.

Friday, August 25

Today brought another poke at xmlrpc.php, wp-login.php, and attempts to get the names of users with ids 1, 2, and 3 using the “/?author=N” request as discussed above, apparently from a site in Ukraine:

. . User-Agent:  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36
. . Referrer:    http://curiousprog.com
25/Aug/2017:08:18:44 -0400  403  GET  /xmlrpc.php
25/Aug/2017:08:18:45 -0400  302  GET  /wp-login.php
25/Aug/2017:08:18:49 -0400  404  GET  /?author=1
25/Aug/2017:08:18:51 -0400  404  GET  /?author=2
25/Aug/2017:08:18:55 -0400  404  GET  /?author=3

Sunday, September 3

It’s been a month now that the site has been up, and intermittent suspicious visits continue, mostly requests for “/wp-login.php” and “/xmlrpc.php”.

Somewhere along the line in the first month, I did make a request to Google to begin indexing my site. I started seeing periodic requests from Google “Bots” (web spiders, a.k.a. indexers). Looking back, most of the requests were for pages of my blog (the home page, pages of individual articles, pages that list all posts for tags or categories):

Client: crawl-66-249-66-82.googlebot.com, 66.249.66.82
. . User-Agent:  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
. . Referrer:    -
05/Aug/2017:12:31:33 -0400  200  GET  /
05/Aug/2017:13:29:36 -0400  200  GET  /2017/08/04/curiosity-and-the-programmer/
05/Aug/2017:13:31:27 -0400  200  GET  /tag/curiousity/
05/Aug/2017:22:39:17 -0400  200  GET  /feed/

But in addition to browsing paths that I expected, the bot was browsing something unfamiliar, a “.well-known” directory:

Client: crawl-66-249-64-157.googlebot.com, 66.249.64.157
. . User-Agent:  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
. . Referrer:    -
14/Aug/2017:11:04:47 -0400  404  GET  /.well-known/apple-app-site-association
14/Aug/2017:11:05:05 -0400  404  GET  /apple-app-site-association
 
Client: crawl-66-249-64-159.googlebot.com, 66.249.64.159
. . User-Agent:  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
. . Referrer:    -
15/Aug/2017:10:42:59 -0400  404  GET  /.well-known/assetlinks.json

I got curious; some internet searches quickly found the explanation. This is a relatively new feature of the Internet, RFC 5785, published April 2010 (RFC 5785 – Defining Well-Known Uniform Resource Identifiers (URIs)). From the RFC:

“It is increasingly common for Web-based protocols to require the discovery of policy or other information about a host (“site-wide metadata”) before making a request…this memo defines a path prefix in HTTP(S) URIs for these “well-known locations”, “/.well-known/”.”

In this case, the two “well known” files being requested are involved in associating a web site with a native smart phone app. This is a valid feature for a search engine to be interested in: it can list the associated native app alongside the web page in search results. See this site for a more detailed discussion: Google bot hits on files ‘apple-app-site-association’ and ‘assetlinks.json’ showing up in Google Webmasters as pages with errors.

Summary

In just the first month that the Curious Prog web site was live, I saw a wide variety of attempts at breaking into the site or learning information to provide a way to break in, looking for features of WordPress with known vulnerabilities. The above are only examples; there were a lot more attempts. But the examples shown in this post capture the general character of hack attempts on the site:

  • Learning the names of authors on the site by making elementary requests to the site
  • Making requests to xmlrpc.php to run code in the server
  • A brute force attack on “wp-login.php” attempting to discover valid usernames and passwords
  • Looking for stray database backup files which would carry internal information for the site (user names and hashed passwords)

Curiosity also led to learning about the “/.well-known” directory (RFC 5785).

What can be done?

What can be done to protect from these attacks?

  • Select long passwords, greater than 8 characters: with more possibilities, it becomes more time consuming to guess passwords via brute force attacks.
  • Make sure that your password isn’t an obvious one (for example, “password”, “123456”, and “letmein”). Brute force attacks will try these on your site! Check a site like Worst, most common passwords for the last 5 years and see if your password is on the list, and change it if so! Follow the advice given for creating passwords on most web sites: combinations of letters and numbers, uppercase and lowercase letters, maybe a sequence of words that are memorable.
  • Uninstall any plugins that aren’t really being used on your site, and any theme other than the theme you’re using. Themes and plugins sometimes have their own bugs that can create open doors for hackers – they’re patched once they’re found, but sometimes some time passes before this initial discovery. So if you have a database plugin installed on your site and aren’t using it any more, uninstall it – remember that database file enumeration attack!
  • Disable vulnerable features that aren’t being used (“xmlrpc.php” and the “/?author=1” and “/wp-json/wp/v2/users” requests). But understand that disabling these features may prevent some features from working (see below for details)
  • Enable daily full backups of your WordPress files and the database. If a hacker does break into your site and does some damage to it (adding spam or links to spam sites) you can have your site restored to a good state from a backup. Many hosting companies offer backup features, and there are database backup plugins available for WordPress, too (but do some research on any backup plugin you’re considering using to make sure that it doesn’t have a security vulnerability that exposes the backed up database to external access as noted above)
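To put some numbers on the advice about password length, here is a small Python sketch of the arithmetic. It assumes an attacker who tries every possible combination in turn, at a rate of one guess per minute (roughly the pace of the login attack logged above); the function names and the example alphabets are my own.

```python
def guesses_needed(alphabet_size, length):
    """Number of possible passwords of exactly `length` characters."""
    return alphabet_size ** length


def years_to_exhaust(alphabet_size, length, guesses_per_minute):
    """Years needed to try every possibility at a steady guessing rate."""
    minutes = guesses_needed(alphabet_size, length) / guesses_per_minute
    return minutes / (60 * 24 * 365)


for label, size, length in (
    ("8 lowercase letters", 26, 8),
    ("12 lowercase letters", 26, 12),
    ("12 mixed-case letters and digits", 62, 12),
):
    total = guesses_needed(size, length)
    years = years_to_exhaust(size, length, guesses_per_minute=1)
    print(f"{label}: {total:.2e} possibilities, {years:.2e} years at 1 guess/minute")
```

This only describes exhaustive guessing. An attacker working from a list of common passwords will try “password” and “123456” in the first few requests, no matter how large the space of possible passwords is, and guessing against a stolen database dump can run far faster than guessing through the login page.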

I disabled the “xmlrpc.php” call using the following setting in the “.htaccess” file at the root of the site:

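# Deny all web requests for xmlrpc.php.
# This is Apache 2.2 syntax; Apache 2.4 uses "Require all denied" instead
# (the older directives still work there if mod_access_compat is loaded).
<Files xmlrpc.php>
Order Deny,Allow
Deny from all
</Files>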

Note that xmlrpc.php is used for pingbacks (automatic trackbacks), so disabling it will keep pingback comments from working. The Jetpack plugin also relies on using xmlrpc.php for some of its features, so disabling xmlrpc.php would interfere with its use.

The Wordfence Plugin “community edition” (free) successfully disabled the “/?author=1” and “/wp-json/wp/v2/users” interfaces that were targets of quite a few attacks. It also monitors accesses to the site and automatically instates temporary blocks on IP addresses that were doing brute force attacks (i.e., very frequent, unusual requests on the site). The blocks seem to last for about a week and then get released. The Wordfence plugin will also do frequent scans of your WordPress site and will notify you of anything unusual that it finds, including the presence of unexpected files – this might help with the database backup file issue discussed in this post.

Doing Forensics on HTTP Access Logs

For the curious of mind, hosting accounts often provide access to compiled statistics from the HTTP access logs for the hosted site, frequently using AWStats or Webalizer.

AWStats has a section that shows search site spiders/bots that have visited the site; it also provides a list of the number of hits by status code (other than 200). This includes “404” (Not Found) errors that might list attempts by hackers to access files with well known security vulnerabilities. Or they might just be truly missing files. That happens sometimes: a theme or plugin leaves off a file, but its absence doesn’t have a major effect on the appearance or performance of the site.
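A similar tally of “Not Found” requests can be made directly from the raw access log. The Python sketch below assumes the usual log layout in which the quoted request is followed by the status code; the function name top_not_found is my own, and the sample lines are made up for illustration (the addresses, sizes, and user agents are hypothetical).

```python
import re
from collections import Counter

# Matches the quoted request and the status code that follows it.
REQUEST_AND_STATUS = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def top_not_found(lines, limit=10):
    """Return the most frequently requested paths that got a 404 response."""
    counts = Counter()
    for line in lines:
        match = REQUEST_AND_STATUS.search(line)
        if match and match.group("status") == "404":
            counts[match.group("path")] += 1
    return counts.most_common(limit)


# Hypothetical sample lines in the same layout as a raw access log.
sample_lines = [
    '203.0.113.9 - - [04/Aug/2017:06:15:22 -0400] "GET /dump.sql HTTP/1.1" 404 345 "-" "-"',
    '203.0.113.9 - - [04/Aug/2017:06:31:00 -0400] "GET /db.sql HTTP/1.1" 404 345 "-" "-"',
    '198.51.100.4 - - [04/Aug/2017:07:02:11 -0400] "GET /dump.sql HTTP/1.1" 404 345 "-" "-"',
    '198.51.100.4 - - [04/Aug/2017:07:02:15 -0400] "GET / HTTP/1.1" 200 5120 "-" "-"',
]
for path, hits in top_not_found(sample_lines):
    print(hits, path)
```

Because the function accepts any iterable of lines, an open log file can be passed to it directly.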

The “Connect to Site” section of AWStats lists referers – watch for unusual, unexpected sources (something other than a legitimate web browser or a search engine bot).

In Webalizer, the “Total Sites” section lists the hostnames of frequent accessors; you might see something unusual there (possibly a brute force attack creating a large number of hits on a single resource).

“Total Countries” lists countries hits came from – look for outliers, particularly from small countries or isolated islands around the world. These could be legitimate visits that you should welcome to your site, but they might also be systems hijacked by hackers and used as bots for an attack!

The access logs in their original form are often available under “Raw Access Logs” in the cPanel management interface that’s available on many hosting servers.

Using the IP address listed in an access log entry, you can get information about where a request came from using the ‘whois’ command. It’s available at the command line on Mac OS X and Linux systems. An online tool is available at GeekTools Whois to do the same search. A typical response from a ‘whois’ request will look like this (remember, this is only where the request is apparently from):

OrgName:        RIPE Network Coordination Centre
Address:        P.O. Box 10096
City:           Amsterdam
PostalCode:     1001EB
Country:        NL
netname:        OTE-SA
descr:          Multiprotocol Service Provider to other ISP's and End Users
country:        GR
address:        Ote SA (Hellenic Telecommunications Organisation)
address:        Kifissias 99
address:        GR-15124 Athens
address:        Greece
descr:          OTEnet

The first entries represent the registry that provides the information for the whois response (in this case, the RIPE Network Coordination Centre, located in Amsterdam in the Netherlands). The second set of information shows the network that the IP address is assigned to, and so where the server is apparently located (in this case, Greece).

Versions

WordPress 4.8.2
WordFence Plugin 6.3.15
AWStats 7.4
Webalizer Version 2.23

References

Hackers Find Fresh WordPress Sites Within 30 Minutes
Worst, most common passwords for the last 5 years | Computerworld
Why passwords have never been weaker – and crackers have never been stronger | Ars Technica
GeekTools Whois
Administration Over SSL – WordPress Codex
HTTP response status codes – HTTP | MDN
List of HTTP status codes – Wikipedia
RFC 5785 – Defining Well-Known Uniform Resource Identifiers (URIs)
The “.well-known” directory on webservers (aka: RFC 5785)
