DOC: added string processing comparison with R #16502

Closed
97 changes: 97 additions & 0 deletions doc/source/comparison_with_r.rst
@@ -530,6 +530,103 @@ For more details and examples see :ref:`categorical introduction <categorical>`
:ref:`differences to R's factor <categorical.rfactor>`.


Contributor

Add a section tag here, like ``_compare_with_r.string`` (if you can also add them to some of the sub-sections, that would be great). You put it right after the sub-section label.

Contributor

If you aren't familiar with Sphinx, you need to start the line with ``..``; see http://www.sphinx-doc.org/en/stable/markup/inline.html#cross-referencing-arbitrary-locations

String Processing
-----------------

Length
~~~~~~

R determines the length of a character string with the ``nchar`` function.
``nchar`` counts leading and trailing blanks; apply ``trimws`` before
``nchar`` to exclude them.

.. code-block:: none

Contributor

Is there an R highlighter?

Contributor

@TomAugspurger, May 25, 2017

http://pygments.org/docs/lexers/#lexers-for-the-r-s-languages

r or rconsole should work. Probably rconsole if you're showing output.

Contributor

yeah I realize that these produce the same output, so we don't actually show the output (I think) elsewhere, so maybe that is ok (though obviously the code formatting would be nice)


   df <- data.frame(color = c('red', ' blue', 'green ', ' yellow '))
   nchar(as.character(df$color))
   nchar(trimws(as.character(df$color)))

Contributor

would be nice to show output here (IOW run the R code and show the output too if you can)


Python determines the length of a character string with the ``len`` function,
which counts leading and trailing blanks. For a pandas column, use
``Series.str.len``, and call ``Series.str.strip`` first to exclude leading
and trailing blanks.

.. code-block:: none

Contributor

-> ipython:: python (and for all running pandas code)


   df = pd.DataFrame({'color': ['red', ' blue', 'green ', ' yellow ']})
   df['color'].str.len()
   df['color'].str.strip().str.len()


Find Position
~~~~~~~~~~~~~

R determines the position of a substring in a string with the
``regexpr`` function. ``regexpr`` takes the pattern to search for as
its first argument and the string to search as its second, and returns
the position of the first match (or ``-1`` if there is no match).

.. code-block:: none

   df <- data.frame(sex = c('MALE', 'FEMALE'))
   pos <- regexpr("ALE", df$sex)
   pos[1:2]

Python determines the position of a substring in a string with the
``find`` method. ``find`` searches for the first occurrence of the
substring. If the substring is found, the method returns its
position; otherwise it returns ``-1``. Keep in mind that Python indexes
are zero-based, whereas R indexes are one-based.

Member

Simpler: "Keep in mind that Python 0-indexes, whereas R 1-indexes"


.. code-block:: none

   df = pd.DataFrame({'sex': ['MALE', 'FEMALE']})
   df['sex'].str.find("ALE")
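
To make the indexing difference concrete, here is a small sketch using Python's built-in ``str.find`` on plain strings. The variable names are illustrative assumptions; the ``+ 1`` shift is just one way to obtain the one-based positions that R's ``regexpr`` reports.

```python
# Plain-Python sketch of zero-based vs. one-based positions (illustrative values).
names = ['MALE', 'FEMALE']

# str.find returns the zero-based index of the first match, or -1 if absent.
zero_based = [s.find('ALE') for s in names]

# Shift found positions by one to match R's one-based positions; keep -1 as-is.
one_based = [p + 1 if p >= 0 else -1 for p in zero_based]

# A substring that does not occur yields -1.
missing = 'MALE'.find('X')

print(zero_based)  # [1, 3]
print(one_based)   # [2, 4]
print(missing)     # -1
```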

Substring
~~~~~~~~~

R extracts a substring from a string based on its position
with the ``substr`` function.

.. code-block:: none

   df <- data.frame(sex = c('MALE', 'FEMALE'))
   substr(df$sex, 1, 1)

In Python, you can use ``[]`` slice notation to extract a substring
from a string by position. Keep in mind that Python indexes are
zero-based and that the end position of a slice is excluded.

Member

Slightly simpler: "Keep in mind that Python 0-indexes"


.. code-block:: none

   df = pd.DataFrame({'sex': ['MALE', 'FEMALE']})
   df['sex'].str[0:1]
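
The mapping between R's ``substr(x, start, stop)`` (one-based, inclusive at both ends) and a Python slice can be sketched as follows. The helper ``r_substr`` is hypothetical, written only for this illustration, and assumes ``1 <= start <= stop``.

```python
# Hypothetical helper: emulate R's substr(x, start, stop) with a Python slice.
# Assumes 1 <= start <= stop; R positions are one-based and inclusive.
def r_substr(s, start, stop):
    # Shift the start down by one; the exclusive slice end equals the
    # inclusive one-based stop.
    return s[start - 1:stop]

names = ['MALE', 'FEMALE']

# Same result as the slice [0:1]: the first character of each string.
first_chars = [r_substr(s, 1, 1) for s in names]
sliced = [s[0:1] for s in names]

print(first_chars)               # ['M', 'F']
print(sliced)                    # ['M', 'F']
print(r_substr('FEMALE', 3, 6))  # MALE
```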


Upcase and Lowcase
~~~~~~~~~~~~~~~~~~

The R ``toupper`` and ``tolower`` functions change the case of the
character string.

.. code-block:: none

df <- data.frame(name = c('Johnny Bravo', 'Alex Mack'))
toupper(df$name)
tolower(df$name)

The equivalent Python methods are ``upper`` and ``lower``.
In addition, Python's ``title`` method converts the string to
title case, capitalizing the first letter of each word.

.. code-block:: none

Member

Same comment above


   df = pd.DataFrame({'name': ['Johnny Bravo', 'Alex Mack']})
   df['name'].str.upper()
   df['name'].str.lower()
   df['name'].str.title()
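
For reference, the same case conversions on plain Python strings look like this. It is a minimal sketch with illustrative variable names; it also shows that ``title`` treats an apostrophe as a word boundary, which may matter for real names.

```python
# Plain-Python sketch of the case conversions (illustrative values).
names = ['Johnny Bravo', 'Alex Mack']

upper_names = [n.upper() for n in names]
lower_names = [n.lower() for n in names]

# title capitalizes the first letter of each word and lowercases the rest.
title_names = [n.lower().title() for n in names]

# Caveat: an apostrophe counts as a word boundary for title.
apostrophe_case = "they're".title()

print(upper_names)      # ['JOHNNY BRAVO', 'ALEX MACK']
print(lower_names)      # ['johnny bravo', 'alex mack']
print(title_names)      # ['Johnny Bravo', 'Alex Mack']
print(apostrophe_case)  # They'Re
```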


.. |c| replace:: ``c``
.. _c: http://stat.ethz.ch/R-manual/R-patched/library/base/html/c.html
